Posts

Showing posts with the label Parquet

Parquet tools utility

It is well known that Parquet is a binary format, and for humans, reading binary data is not as easy as opening a text file and understanding its contents. But no worries: today we'll learn how to read a Parquet file. If you would like to understand the internal details of a Parquet file, please refer to my previous article, in which I explain the inner storage details of Parquet and answer most of the questions around its usage, efficiency and performance. Parquet tools - Because Parquet is a binary format, we need an interpreter to translate its contents so that we can understand them. This is what the parquet-tools utility project does...

Parquet File format - Storage details

Columnar storage has revolutionized big data processing since its inception. Its power is evident from the fact that Google BigQuery, HBase, Amazon Redshift, Azure SQL Data Warehouse and many more all utilize columnar storage. If you are reading this article, you are probably curious to learn how columnar storage works internally. Stay tuned for the next few minutes, as I'll explain the details of the most popular columnar file format, Parquet. We'll learn the details by actually opening a Parquet file and going deep, down to how things work under the hood at the disk level. We'll also learn why Parquet saves cost when it is used as the underlying file format with AWS services such as Athena. When to use Parquet format - Just check this conversation, less than 30 seconds long, between a Data Engineer (male) and a BI Analyst (female). When to us...