Parquet File Format - Storage Details




Columnar storage has revolutionized Big Data processing since its inception. Its power is evident from the fact that Google BigQuery, HBase, Amazon Redshift, Azure SQL Data Warehouse and many more systems all utilize columnar storage.

Since you are reading this article, you are clearly curious to learn how columnar storage works internally. Stay with me for the next few minutes, and I'll explain the details of the most popular columnar file format, Parquet.

We'll learn the details by actually opening a Parquet file and going deep into how things work under the hood, down at the disk level. We'll also see why Parquet saves cost when used as the underlying file format with AWS services such as Athena.

When to use the Parquet format

(Image: a quick, under-30-second conversation between a Data Engineer and a BI Analyst about when to use Parquet.)


When to use Parquet: whenever our dataset has a large number of attributes (columns), our processing of that data (say, ETL code or BI reporting queries) touches only a small number of those attributes, and we want that processing to execute quickly.

For instance, say we have a table with 500 columns, but most of our SQL queries read just 10 of those 500 columns. Here Parquet brings back results very swiftly compared to any row-oriented storage format (such as text or Avro).

Well, what is so different about Parquet that it brings such efficiency?


What is inside a Parquet file

Like any file format, a Parquet file consists of data. But along with the data, it also stores metadata about how that data has been laid out on disk. The metadata is written after the data has been successfully written.

The main units of the Parquet file format that contribute to its efficiency are row groups, column chunks, and pages. You might be thinking, "What the hell are these row groups, column chunks, blah blah?" Don't worry if these 3 terms do not make sense yet; they will make complete sense once you have read this article to the end.

Without going into the documentation details of Parquet, which you can always find in the official Parquet documentation, let's pick a business requirement, open a Parquet file, and understand the reason for its efficiency in practice.

Our Parquet Data

We'll use the data of a hypothetical vehicle valuation company modeled on Kelley Blue Book (aka KBB). Basically, kbb.com provides a platform to look up the value of any used or new car. To earn money, KBB also lets car companies such as Toyota and Honda post ads for their cars on kbb.com, and these car companies pay KBB to post the ads.

So our Parquet data is the car advertisement data posted by different car companies (such as Honda and Toyota) on kbb.com.

Our data is a single Parquet file with approximately 40 million rows and the following 6 columns:

1. make - make of the car for which the ad is posted (e.g., Honda).
2. model - model of the car for which the ad is posted (e.g., Accord).
3. advertiser_nbr - unique id of the car company that posted the ad.
4. order_nbr - order number of the ad (every ad posted by a car company is an order for KBB).
5. creative_nbr - unique number of the ad.
6. creative_sz - size of the ad slot on kbb.com.

The business is interested only in the details of the advertising companies, since they are the customers who generate the revenue. Hence we need to fetch just one column, advertiser_nbr, out of the 6. Using this data, we will learn why the Parquet format returns results quickly when we want to fetch just that one column.

Important Note - Parquet is a binary format, and a Parquet file cannot simply be opened and read like a text file. Hence, in the next section, I will explain how we can open and read a Parquet file. If you are interested only in the Parquet storage details, feel free to skip the next section and jump directly to the "Metadata of our parquet file" section, where I have opened the file and explained the storage details.

Parquet data table 

I am using the Hive data warehouse to store KBB's Parquet data file. I have created a Hive table add_details, stored as Parquet, with my home directory /user/akshay/add_details/ on HDFS as the data file location.


DDL
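(The DDL screenshot is not reproduced here; below is a minimal sketch of what such a table definition could look like when run through the Hive CLI. The column types are my assumption and are not taken from the original post.)

    hive -e "
      CREATE TABLE add_details (
        make            STRING,
        model           STRING,
        advertiser_nbr  STRING,
        order_nbr       STRING,
        creative_nbr    STRING,
        creative_sz     STRING
      )
      STORED AS PARQUET
      LOCATION '/user/akshay/add_details/';
    "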





The data of this table is in one single Parquet file, 000000_0, in HDFS. To read the file, we will pull the data file 000000_0 to the local filesystem using the traditional hadoop get command, as follows:
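(The original command screenshot is not reproduced here; the step amounts to something like the sketch below. The HDFS path is the table location mentioned above; the local target directory is my assumption.)

    hadoop fs -get /user/akshay/add_details/000000_0 /home/akshay/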




Count of rows in this table - Our parquet dataset has close to 39.5 million records as shown below:
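(The screenshot of the count is not reproduced here; the check itself is a simple count query along these lines, with the exact query text being my assumption.)

    hive -e "SELECT COUNT(*) FROM add_details;"
    # returns close to 39.5 million rows for this dataset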

Let's see what is inside the Parquet file:

Parquet is a binary format, and a Parquet file cannot simply be opened and read like a text file. Hence we'll use parquet-tools (a utility built to read Parquet files) to read the metadata of our file 000000_0 and see how Parquet stores the columns of these 39.5 million records.

Steps to read a Parquet file:

The parquet-tools project was created specifically to make Parquet files easy to read. I have the parquet-tools utility script at /opt/cloudera/parcels/CDH-5.4.4-1.cdh5.4.4.p1406.1253/bin/ on my Cloudera Hadoop cluster.

Note - I am keeping the details of parquet-tools out of the scope of this article and will publish a separate article on it.

To read a Parquet file's metadata on a Cloudera Hadoop cluster, you can use the following command:
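(The exact command screenshot is not reproduced here; a typical invocation, using the parquet-tools path mentioned above and the HDFS location of our file, would look roughly like this.)

    /opt/cloudera/parcels/CDH-5.4.4-1.cdh5.4.4.p1406.1253/bin/parquet-tools meta \
        /user/akshay/add_details/000000_0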




To view a Parquet file present in a local directory (not in HDFS), follow these steps:
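(The steps from the screenshot are not reproduced here; roughly, they boil down to pointing parquet-tools at the local copy of the file. The file:/// prefix tells the tool to read from the local filesystem rather than HDFS; the paths are my assumption.)

    cd /home/akshay/
    # 'head' prints the first few records, confirming the local copy is readable
    parquet-tools head file:///home/akshay/000000_0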




Since we want to view just the metadata of the Parquet file and not the actual data, we simply execute the parquet-tools script with the meta argument, as follows:
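(Again, the exact command is not shown in text form; redirecting the meta output to a text file, roughly as below, produces the metadata file we open in the next section. The file names are my assumption.)

    # assuming the parquet-tools script from the parcel bin directory above is on the PATH
    parquet-tools meta file:///home/akshay/000000_0 > 000000_0_metadata.txt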





Metadata of our parquet file:


On opening the metadata file, this is what we see:

(Screenshot: parquet-tools meta output for 000000_0, with the row groups, their row counts, and the storage details of the advertiser_nbr column chunks highlighted.)

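(The screenshot may be hard to read, so here is an abridged sketch of the parquet-tools meta lines that matter for the rest of this discussion. The FPO, SZ and compression figures are the ones quoted below in this article; the row and value counts are rounded approximations, and everything else is elided.)

    row group 1:       RC:~18600000  ...
      advertiser_nbr:  BINARY SNAPPY ... FPO:4 SZ:19022560/24391353/1.28 VC:~18600000 ENC:...
      (the other 5 column chunks are elided)
    row group 2:       RC:~18800000  ...
    row group 3:       RC:~2000000   ...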
Our Parquet file stores the 39.5 million rows by dividing them into 3 row groups (highlighted in the picture in the blue box).
A row group is a logical partitioning of the data into groups of rows.

The 3 row groups have the following row counts (highlighted in the picture in the green box):
row group 1 - 18.6 million rows
row group 2 - 18.8 million rows
row group 3 - 2 million rows

Each row group consists of column chunks. A column chunk is a chunk of the data values for one specific column.
Each row group contains exactly one column chunk per column of the data schema. Since we have 6 columns in our data, each row group contains 6 column chunks. What is important to remember is that a column chunk is guaranteed to be stored contiguously in the physical Parquet file.

A column chunk is further subdivided into pages, and a page is an indivisible unit.
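Putting these three terms together, the layout of our file can be pictured like this (a simplified sketch of the logical structure, not the byte-exact format):

    000000_0 (one Parquet file)
    |-- Row group 1  (~18.6 million rows)
    |     |-- column chunk: make            -> pages
    |     |-- column chunk: model           -> pages
    |     |-- column chunk: advertiser_nbr  -> pages
    |     |-- column chunk: order_nbr       -> pages
    |     |-- column chunk: creative_nbr    -> pages
    |     `-- column chunk: creative_sz     -> pages
    |-- Row group 2  (~18.8 million rows, same 6 column chunks)
    |-- Row group 3  (~2 million rows, same 6 column chunks)
    `-- Footer: file metadata (written after the data)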

Now, the business is interested only in getting the advertising company details, so we need to fetch just the one column advertiser_nbr. It is time to understand why fetching just this column out of the 6 returns results quickly.


Row group 1: 

Focusing on advertiser_nbr in row group 1, we see the following storage details (highlighted in the picture in orange boxes):

1. BINARY - simply means the data is binary and is stored as a byte array.

2. SNAPPY (in orange) - the compression algorithm. The data is compressed using the Snappy algorithm.

3. FPO (in orange) - First Page Offset. This is the exact byte offset on disk from which the values of column advertiser_nbr in row group 1 are stored.
FPO:4 means that the values for advertiser_nbr are stored starting at offset 4 in the file.
 
4. SZ (in orange) - SZ:19022560. This is the size in bytes of the 18.6 million values of the column advertiser_nbr in this row group. In other words, 19022560 bytes of advertiser_nbr data are stored on disk starting at offset 4. This is exactly where the efficiency comes from: 18.6 million values of the single column advertiser_nbr are colocated on disk, all stored contiguously in one place. Hence, when we run any SQL query or transformation code on just the column advertiser_nbr, the disk head seeking this column gets all 18.6 million values from one region of the disk in a single seek, with no need for many separate disk seeks.

In SZ:19022560/24391353/1.28:

19022560 is the compressed size in bytes,
24391353 is the uncompressed size in bytes, and
1.28 is the compression ratio (24391353 / 19022560 ≈ 1.28).

5. VC (in orange) - Value Count, the number of values of this column in the row group. It is about 18.6 million for column advertiser_nbr in row group 1.

6. ENC (in orange) - the encodings used to store the values of this column in the Parquet file.


Time to answer the key questions we asked in the beginning.

There are 6 columns in our table. But if we write a select query fetching only the one column advertiser_nbr out of the 6, then:

Question 1 - Why does the Parquet format return the result faster than a row format (such as text)?

Answer - To read just the column advertiser_nbr from this Parquet file, the disk head seeking this column only needs to seek to page offset 4 and read through offset 19022563 (the 19022560 compressed bytes that start at offset 4 end at byte 19022563), and then similarly seek the advertiser_nbr column chunks in row groups 2 and 3.
And guess what: the 18.6 million values (row group 1) for this column are stored colocated on disk in one place, and in encoded form.
This reduces disk seeks, as the disk head (image below) seeking this one column gets all 18.6 million values from one region of the disk. There is no need for many separate disk seeks, and fewer seeks mean less processing time.
In a row format, by contrast, these 18.6 million values would be stored row by row, interleaved with the other columns at many scattered places on disk (in the worst case, around 18.6 million places). To fetch them from a row format file, the disk head would have to scan all the rows, which requires many more disk seeks and hence more processing time.

What a disk seek looks like


Question 2 - How can we save cost by using Parquet with AWS Athena?

Answer - Since millions of values of each individual column are stored colocated on disk, fetching the data for one specific column does not require scanning the data of the other columns unnecessarily, so the amount of data scanned is reduced. In a row format, we would have to scan all the rows to fetch just that one column, leading to much more data scanned. And since Athena pricing works out to $5 per TB scanned (as of the date of publishing this article), Parquet reduces cost by reducing scan. It thus proves to be really cost efficient.
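A rough, purely illustrative calculation (only the $5/TB price comes from the article; the data sizes are invented for the example). Suppose a query needs just one column of a wide table:

    Row format (text): Athena scans the whole table, say 1 TB            -> 1.00 TB x $5/TB = $5.00 per query
    Parquet:           Athena scans only that column's chunks, say 50 GB -> 0.05 TB x $5/TB = $0.25 per query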

Thank you for taking the time to read this article. I hope it has given you a deeper understanding of the storage details of the Parquet format and of columnar storage in general.
