Skew Join Optimization in Hive

By Akshay Agarwal June 02, 2016

Skew is a very common issue which most of the data engineers come across.

What is Skew - When in our data we have very large number of records associated with one(or more) particular key, then this data is said to be skewed on that key.

We'll understand this with a very simple example. Think about world's population data table, where each record is of one person and the number of records equals to total world population. As we all know China and India, each accounts for around 17-18% of world's population, hence such population table is referred to as skewed on keys country=China and country=India. If you focus on the following pie chart of world's population distribution by country, then this would make exact sense.

Problem statement -

Assume we have a table which tracks the online visits to different online websites. In this table amazon.com being very popular, has 1 Billion rows whereas other not very popular sites (www.abc.com, www.xyz.com..etc) in totality have 100K rows. Lets assume we have a column site_id which is a unique identifier of a website.

Now if we join this table with some other table on site_id using Hadoop's MapReduce, then one reducer working on amazon's rows, will have to process 1B records and other reducers will process the 100K rows. In this situation, it is common to see the reducers processing 100K records finishing quickly(say avg few minutes) as compared to the reducer for amazon.com taking a really long time(say few hours).

Solution -

In hive we can address this problem by setting the following configuration settings, in the job running the join query.

Configuration Settings:

hive.optimize.skewjoin

Default value = false

To enable skew join optimization and let hive server optimize the join where there is skew.

We need to set it to true.

hive.skewjoin.key

Default value = 100000

The value of this property determines which key is a skew key. During join for any particular key if the number of rows is more than this specified number, then that key is considered as a skew join key.

hive.skewjoin.mapjoin.map.tasks

Default value = 10000

To set the number of map tasks to use in the map join job to perform the skew join. This property needs to be used in association with the hive.skewjoin.mapjoin.min.split.

hive.skewjoin.mapjoin.min.split

Default value = 33554432

To set the minimum split size and hence calculate the maximum number of mappers to be used for the map join job for a skew join. This property should be used with hive.skewjoin.mapjoin.map.tasks for an efficient control.