Posts

Showing posts with the label hive

Data load status check for a table without need of a trigger file

When designing data pipelines, it is a common requirement to know whether the data required for processing within a particular timeframe is available. Most often this timeframe is one day for daily incremental processes, where a daily ETL job needs to know whether the data for the current date is available for processing at all, i.e. whether the data for the current date has landed in the source directory it reads from. If you are using Hadoop's HDFS for your data storage, in this article I'll share a piece of code to easily verify whether the data for a desired date is present in a Hive table (which always has an underlying directory in HDFS). Let us take a real-world scenario where data is loaded, in an automated fashion, into a certain Hive table. This data load can come from a sourcing ingestion process, from the load performed by an ETL process, or from the outcome of an...
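As a minimal HiveQL sketch of the idea, assuming a hypothetical table sales_daily partitioned on load_date (the full post may take a different route, for example inspecting the table's underlying HDFS directory; the names below are illustrative, not the post's actual code):

    -- Hypothetical table and partition column: an empty result from
    -- SHOW PARTITIONS means the data for that date has not landed yet.
    SHOW PARTITIONS sales_daily PARTITION (load_date='2024-01-15');

    -- Alternatively, count the rows for the date; 0 means nothing was loaded.
    SELECT COUNT(*) FROM sales_daily WHERE load_date = '2024-01-15';

Either check can be wired into a scheduler so the downstream job proceeds only when the partition or row count shows the day's data is in place.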

Hive - Merging small files into bigger files.

Since Hadoop's inception, we have learnt that it is good at processing a small number of large files rather than a large number of small files. Part of the reason for this behaviour is its MapReduce paradigm, which works on splits of the input data. In our day-to-day data processing there are various situations in which our processes can produce a large number of small files. Assume a process generated 300 MB of output data, but this data is distributed across approximately 3000 files of 99 KB each. Today we'll learn how to merge such a large number of small files into bigger files. Before that, let us quickly understand the problems that such a large number of small files can cause.
Problems posed by a large number of small files -
1. If we store the data of this table in GZIP format (3000 files, each of 99 KB), then a downstream process running a Hive query (a MapReduce job) on this data would spin up 3000 mappers to process an individual split for each file, as the GZIP format ...
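A minimal sketch of how the merge can be driven from Hive itself, using Hive's small-file merge settings; the size values and the table names (clicks_small, clicks_merged) are illustrative assumptions, not necessarily the post's exact steps:

    -- Ask Hive to merge small output files at the end of the job.
    SET hive.merge.mapfiles = true;                -- merge outputs of map-only jobs
    SET hive.merge.mapredfiles = true;             -- merge outputs of map-reduce jobs
    SET hive.merge.size.per.task = 256000000;      -- target size of merged files (bytes)
    SET hive.merge.smallfiles.avgsize = 128000000; -- trigger a merge below this average size

    -- Rewriting the data through INSERT OVERWRITE then produces fewer, larger files.
    INSERT OVERWRITE TABLE clicks_merged
    SELECT * FROM clicks_small;

For ORC (and RCFile) tables, ALTER TABLE ... CONCATENATE offers another way to stitch small files together without rewriting the data through a query.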

Update in Hive

Can I do row-level updates in Hive? Based on our past years of experience with Hive, we are all aware of UPDATE not being a supported operation. But guess what, that is no longer a constraint: UPDATE is available starting with Hive 0.14. In this post we'll learn the details of UPDATE in Hive (a long-awaited operation on the Hadoop platform, required by most Big Data engineers). We will learn about the following details:
1. The prerequisites for Hive to perform an update.
2. The syntax of UPDATE.
3. The Hive configuration settings needed to do an update.
4. A close look at what happens at the Hadoop file-system level when an update operation is performed.
5. The limitations of the UPDATE operation in Hive.
For a Hive table to be updatable (as of the date of publishing this article):
1. It should be stored in the ORC file format.
2. It should be bucketed.
3. It should support transactions (ACID semantics enabled).
Standard Syntax: UPDATE tablename SET column = value [, column = v...
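As a hedged sketch of what those prerequisites look like in practice: the session-level settings below are the ones commonly required for Hive ACID around the 0.14 era (cluster-side compactor settings are also needed), and the employee table and its columns are made up for illustration:

    -- Session settings commonly needed for ACID operations.
    SET hive.support.concurrency = true;
    SET hive.txn.manager = org.apache.hadoop.hive.ql.lockmgr.DbTxnManager;
    SET hive.enforce.bucketing = true;
    SET hive.exec.dynamic.partition.mode = nonstrict;

    -- The table must be ORC, bucketed, and transactional.
    CREATE TABLE employee (id INT, name STRING, salary DECIMAL(10,2))
    CLUSTERED BY (id) INTO 4 BUCKETS
    STORED AS ORC
    TBLPROPERTIES ('transactional' = 'true');

    -- Row-level update, available from Hive 0.14 onwards.
    UPDATE employee SET salary = 75000 WHERE id = 101;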

Skew Join Optimization in Hive

Skew is a very common issue that most data engineers come across.
What is skew - When our data has a very large number of records associated with one (or more) particular key, the data is said to be skewed on that key. We'll understand this with a very simple example. Think of a table of the world's population, where each record is one person and the number of records equals the total world population. As we all know, China and India each account for around 17-18% of the world's population, hence such a population table is said to be skewed on the keys country=China and country=India. A pie chart of the world's population distribution by country makes this immediately clear.
Problem statement - Assume we have a table which tracks visits to different websites. In this table amazon.com, being very popular, has 1 billion rows whereas other, less popular sites (www.abc.com, w...
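A minimal sketch of Hive's built-in skew-join handling; the threshold value and the SKEWED BY example (with a hypothetical site_visits table) are illustrative, not necessarily the post's exact recommendation:

    -- Handle join skew at runtime: keys with more rows than the threshold are
    -- spilled to HDFS and processed in a follow-up map join.
    SET hive.optimize.skewjoin = true;
    SET hive.skewjoin.key = 100000;

    -- If the skewed values are known up front, the table can declare them so the
    -- compile-time skew-join optimization can kick in as well.
    SET hive.optimize.skewjoin.compiletime = true;
    CREATE TABLE site_visits (site STRING, visitor_id STRING, hit_time TIMESTAMP)
    SKEWED BY (site) ON ('amazon.com')
    STORED AS ORC;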

UPSERT in Hive (3 Step Process)

In this post we'll learn an efficient 3-step process for performing an UPSERT in Hive on a large table containing the entire history. Just for the audience not aware of UPSERT - it is a combination of UPDATE and INSERT. If, on a table containing history data, we receive new data that needs to be inserted as well as some data that is an update to the existing data, then we have to perform an UPSERT operation to achieve this.
Prerequisite – The history table, being very large, should be partitioned; this is also a best practice for efficient data storage when keeping large volumes of data in any Big Data warehouse.
Business scenario – Let's take the example of clickstream data of a website, as gathered from the different browsers of the visitors who visited the website. The site_view_hist table contains the click and page-impression counts from different browsers, and the table is partitioned on hit_date (the date on which the visitor hit or visited the websi...
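One way to express such a 3-step flow in HiveQL, purely as a sketch: it assumes a staging table site_view_delta with the same column layout as site_view_hist and (hit_date, browser) as the natural key, neither of which is spelled out in this excerpt, and the post's actual three steps may differ:

    SET hive.exec.dynamic.partition = true;
    SET hive.exec.dynamic.partition.mode = nonstrict;

    -- Step 1: land the incoming data in the staging table site_view_delta.
    -- Step 2: build the merged view of only the affected partitions: all delta rows,
    --         plus the history rows for those dates whose keys were not updated.
    -- Step 3: overwrite just those partitions of the history table.
    INSERT OVERWRITE TABLE site_view_hist PARTITION (hit_date)
    SELECT * FROM (
      SELECT * FROM site_view_delta
      UNION ALL
      SELECT h.*
      FROM site_view_hist h
      LEFT JOIN site_view_delta d
        ON h.hit_date = d.hit_date AND h.browser = d.browser
      WHERE d.browser IS NULL
        AND h.hit_date IN (SELECT DISTINCT hit_date FROM site_view_delta)
    ) merged;

Because the overwrite is restricted to the partitions present in the delta, the bulk of the history is never rewritten, which is why the partitioned layout is the prerequisite for doing this efficiently.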