Parquet tools utility
If you would like to understand all the internal details of a Parquet file, please refer to my previous article here. In that article I have tried to explain all the inner storage details of Parquet and answered most of the common questions around its usage, efficiency and performance.
Parquet tools - Since Parquet is a binary format, we need an interpreter to translate its contents into something we can understand, and that is exactly what the parquet-tools utility project does for us. It lets us read parquet files.
Note - At the time of writing this blog, I used the parquet-tools utility that was already installed on my Hadoop cluster, but you can always install parquet-tools on your own system. I'll keep the installation details out of the scope of this article and focus mainly on its usage.
Data used - For this exercise, let us choose data from the OMS (Order Management System) of an automotive valuation website (like KBB.com). This data consists of the details of car advertisements posted on kbb.com by different car companies for the cars they sell. The parquet data consists of the following fields:
1. advertiser_nbr - a unique identifier of the advertiser (the car company that posted the ad)
2. order_nbr - order number of the ad (every ad posted by a car company is an order for KBB)
3. creative_nbr - unique number of the creative (the ad itself)
4. creative_sz - size of the ad slot on kbb.com
5. make - make of the car for which the ad is posted (e.g. Honda)
6. model - model of the car for which the ad is posted (e.g. Accord)
This is the same data I used in my previous article. This ad data is stored as parquet in my Hadoop cluster's HDFS location hdfs://user/akshay/ad/000001_0. Here 000001_0 is the underlying parquet file of this table.
On a Cloudera Hadoop cluster, to read the data or metadata of a parquet file directly from HDFS, we can use the parquet-tools utility as follows:
hadoop parquet.tools.Main meta /user/akshay/ad/000001_0
This will show us all the metadata details of the row groups, column chunks and pages of the parquet file.
To view the data of a parquet file we can use the cat argument as follows:
hadoop parquet.tools.Main cat /user/akshay/ad/000001_0
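Since cat writes the records to standard output, for large files you can pipe it through ordinary shell tools instead of printing everything; the pipeline below is just an illustration:
hadoop parquet.tools.Main cat /user/akshay/ad/000001_0 | head -20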
We can also view the details of a parquet file by executing parquet-tools locally, with the required arguments, as follows:
1. To read metadata - First you need to get the parquet file to the local path from which you want to read it. My parquet files are in HDFS, and I am placing them in my home directory (/home/akshay/) using the traditional hadoop get command.
$ cd ~
Get the parquet file locally using the filesystem get command:
$ hdfs dfs -get /user/akshay/ad/000001_0
Navigate to the directory where your parquet-tools utility script is installed (I have it at /opt/cloudera/parcels/CDH-5.4.4-1.cdh5.4.4/bin):
$ cd /opt/cloudera/parcels/CDH-5.4.4-1.cdh5.4.4/bin
$ ./parquet-tools meta /home/akshay/000001_0
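A small note on the get step above: hdfs dfs -get also accepts an explicit local destination if you don't want to rely on the current working directory; the target path below is simply my home directory from earlier:
$ hdfs dfs -get /user/akshay/ad/000001_0 /home/akshay/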
2. To read data - Since we are already in the parquet-tools directory, use the cat argument to read the actual data of the parquet file.
$ ./parquet-tools cat /home/akshay/000001_0
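Because cat prints plain text to standard output, you can combine it with regular shell utilities; for example, to look only at rows for a particular make (Honda here is just the illustrative value from the field list above):
$ ./parquet-tools cat /home/akshay/000001_0 | grep -i honda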
3. To read schema -
$ ./parquet-tools schema /home/akshay/000001_0
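The schema is printed in Parquet's message format. Purely as an illustration of what that looks like for the six fields above (the actual physical types in your file may well differ), the declaration would have roughly this shape:
message ad_schema {
  optional binary advertiser_nbr (UTF8);
  optional binary order_nbr (UTF8);
  optional binary creative_nbr (UTF8);
  optional binary creative_sz (UTF8);
  optional binary make (UTF8);
  optional binary model (UTF8);
}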
4. To read the top n rows (e.g. 10 rows) -
$ ./parquet-tools head -n 10 /home/akshay/000001_0
This is good info Akshay. Thanks! Do you know how to use parquet-tools to merge multiple parquet files into 1 and remove the original files?
Hi, I would suggest using Spark or Hive itself for such merging. If you are using Spark, please try coalesce, as it is really good at achieving this without the unnecessary shuffling of data that repartition does; pass 1 as the parameter to coalesce. If you are using Hive, you can refer to my other article to achieve this merging: https://letsexplorehadoop.blogspot.com/2017/03/hive-merging-small-files-into-bigger.html
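Purely as a rough sketch of the Spark route (assuming a Spark 2.x spark-shell where the SparkSession is available as spark; the input and output paths are illustrative, and note that Spark will not remove the original files for you - delete them with hdfs dfs -rm -r only after verifying the merged output):
$ spark-shell
scala> val df = spark.read.parquet("/user/akshay/ad/")
scala> df.coalesce(1).write.parquet("/user/akshay/ad_merged/")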