Skew Join Optimization in Hive
Skew is a very common issue which most of the data
engineers come across.
What is Skew - When in our data we have very large number of records
associated with one(or more) particular key, then this data is said to be skewed on that key.
We'll understand this with a very simple example. Think about world's population data table, where each record is of one person and the number of records equals to total world population. As we all know China and India, each accounts for around 17-18% of world's population, hence such population table is referred to as skewed on keys country=China and country=India. If you focus on the following pie chart of world's population distribution by country, then this would make exact sense.
Problem statement -
Assume we have a table which tracks the online visits to
different online websites. In this table amazon.com being very popular, has
1 Billion rows whereas other not very popular sites (www.abc.com,
www.xyz.com..etc) in totality have 100K rows. Lets assume we have a column
site_id which is a unique identifier of a website.
Now if we join this table with some other table on
site_id using Hadoop's MapReduce, then one reducer working on amazon's rows, will have to process 1B records and other reducers
will process the 100K rows. In this situation, it is common to see the reducers
processing 100K records finishing quickly(say avg few minutes) as compared to the reducer
for amazon.com taking a really long time(say few hours).
Solution -
In hive we can address this problem by setting the following
configuration settings, in the job running the join query.
Configuration Settings:
hive.optimize.skewjoin
Default value = false
To enable skew join optimization and let hive server
optimize the join where there is skew.
We need to set it to true.
hive.skewjoin.key
Default value = 100000
The value of this property determines which key is a
skew key. During join for any particular key if the number of rows is more than this specified number, then that key is considered as a skew join key.
hive.skewjoin.mapjoin.map.tasks
Default value = 10000
To set the number of map tasks to use in the map join
job to perform the skew join. This property needs to be used in association
with the hive.skewjoin.mapjoin.min.split.
hive.skewjoin.mapjoin.min.split
Default value = 33554432
To set the minimum split size and hence calculate the
maximum number of mappers to be used for the map join job for a skew join. This
property should be used with hive.skewjoin.mapjoin.map.tasks for an efficient
control.
Thank you Tejuteju. Glad to know, that you found it helpful.
ReplyDelete