Hive - Merging small files into bigger files.



Since Hadoop's inception, we have learnt that it is good at processing a small number of large files rather than a large number of small files. Part of the reason for this behavior is its MapReduce paradigm, which works on splits of the input data.
In our day-to-day data processing, there are various situations where our processes can produce a large number of small files. Assume a process generated 300 MB of output data, but distributed across approximately 3000 files of 99 KB each.
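For example, a job like the hypothetical one below could produce exactly this layout: each reducer writes its own output file, so forcing 3000 reducers spreads roughly 300 MB of output across about 3000 files of ~99 KB each (the table and column names here are purely illustrative):

set mapreduce.job.reduces=3000;  -- force 3000 reducers, one output file per reducer

INSERT OVERWRITE TABLE daily_events
SELECT user_id, event_type, count(*) AS cnt
FROM raw_events
GROUP BY user_id, event_type;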
Today we'll learn how to merge such a large number of small files into bigger files. Before that, let us quickly understand the problems that so many small files can cause.

Problems posed by a large number of small files -
1. If we store this table's data in GZIP format (3000 files of 99 KB each), then a downstream process running a Hive query (a MapReduce job) on this data would spin up 3000 mappers, one per file, because GZIP is a non-splittable format.
2. If we want to move these 3000 files, each of just 99 KB, to S3 in AWS, then a direct upload of 3000 individual files would be very slow compared to uploading the same data organized as one single 300 MB file.

Solution - We can solve this using the following Hive configuration settings:

Property 1: hive.merge.smallfiles.avgsize

Default value: 16000000 (bytes)
When the average output file size of a job is less than this number, Hive will kick off an additional MapReduce job to merge the output files into bigger files. The size of these bigger files can be controlled by property 2.

For a MapReduce job - This property takes effect only if hive.merge.mapredfiles is set to true.

For a map-only job - This property takes effect only if hive.merge.mapfiles is set to true.

Property 2: hive.merge.size.per.task
Default value: 256000000 (bytes)

This property sets the target size of the merged files produced by the merge job.
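To see how the two properties work together, here is a minimal sketch using the default values: when the average output file size falls below hive.merge.smallfiles.avgsize, Hive launches an extra merge job that aims to produce files of about hive.merge.size.per.task bytes.

set hive.merge.mapfiles=true;                -- merge outputs of map-only jobs
set hive.merge.mapredfiles=true;             -- merge outputs of MapReduce jobs
set hive.merge.smallfiles.avgsize=16000000;  -- merge trigger threshold (default, in bytes)
set hive.merge.size.per.task=256000000;      -- target merged file size (default, in bytes)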

For our scenario, if we want the merge job to combine the output of our MapReduce job into files of roughly 256 MB (the default for hive.merge.size.per.task), we can achieve this by setting the following properties:

set hive.merge.mapredfiles=true;
set hive.merge.smallfiles.avgsize=102400;

Since our average file size is 99 KB, which is less than 100 KB, we can choose 100 KB (102400 bytes) for the hive.merge.smallfiles.avgsize property to make the files eligible for the merge.
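Putting it all together, here is an illustrative end-to-end session that compacts an existing table (small_files_table is a hypothetical name); rewriting a table onto itself with INSERT OVERWRITE is a common way to trigger the merge:

set hive.merge.mapfiles=true;
set hive.merge.mapredfiles=true;
set hive.merge.smallfiles.avgsize=102400;  -- 100 KB, just above our 99 KB average
set hive.merge.size.per.task=256000000;    -- aim for ~256 MB merged files

-- hypothetical table; the self-INSERT rewrites the data and triggers the merge
INSERT OVERWRITE TABLE small_files_table
SELECT * FROM small_files_table;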

I hope you found this helpful in understanding the problems that a large number of small files can cause, and how we can merge them into bigger files using Hive.
