Spark Optimization - Bucketing

Complex data computation is a part of every Big Data project. Do you have a business scenario that requires a complex join on data with billions or trillions of values? If you are using Apache Spark, stay tuned for the next few minutes: this article addresses that expensive join problem using a Spark optimization technique called bucketing.

Before reading on, I recommend going through my previous articles on Spark's Sort Merge Join and Broadcast Hash Join. They lay the foundation for understanding bucketing. This is recommended but not mandatory, as you will still be able to grasp the core concept without them.

When to use Bucketing?

Use bucketing for any business problem where you need to join tables whose join column has very high cardinality (I repeat, very high), say millions, billions or even trillions of distinct values, and where that join needs to happen multiple times.

Let's understand bucketing using the data of a hypothetical service ...
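Before the walkthrough, a rough illustration of the idea may help. The sketch below is plain Python, not Spark code: it only mimics the principle that rows are assigned to a fixed number of buckets by hashing the join column once, so that a later join can match bucket against bucket instead of redistributing all rows again. The table contents, the `user_id` column, the bucket count and the helper names `bucket_rows` and `bucketed_join` are all assumptions made up for this example. In Spark itself, bucketed tables are written with `DataFrameWriter.bucketBy(...)` followed by `saveAsTable(...)`.

```python
from collections import defaultdict

NUM_BUCKETS = 4  # hypothetical bucket count, chosen only for illustration


def bucket_rows(rows, key, num_buckets=NUM_BUCKETS):
    """Assign each row to a bucket by hashing the join key (done once, at write time)."""
    buckets = defaultdict(list)
    for row in rows:
        buckets[hash(row[key]) % num_buckets].append(row)
    return buckets


def bucketed_join(left_buckets, right_buckets, key):
    """Inner-join two pre-bucketed tables one bucket pair at a time."""
    joined = []
    for bucket_id, left_rows in left_buckets.items():
        # Equal keys always hash to the same bucket id, so only the
        # matching bucket of the right table has to be examined.
        index = defaultdict(list)
        for right_row in right_buckets.get(bucket_id, []):
            index[right_row[key]].append(right_row)
        for left_row in left_rows:
            for right_row in index.get(left_row[key], []):
                joined.append({**left_row, **right_row})
    return joined


# Made-up sample data; integer keys keep Python's hash() deterministic.
users = [
    {"user_id": 1, "name": "Ann"},
    {"user_id": 2, "name": "Bob"},
    {"user_id": 7, "name": "Cy"},
]
orders = [
    {"user_id": 1, "order_id": 100},
    {"user_id": 7, "order_id": 101},
    {"user_id": 1, "order_id": 102},
    {"user_id": 9, "order_id": 103},  # no matching user, dropped by the inner join
]

# Bucketing happens once per table; the join can then be repeated cheaply.
user_buckets = bucket_rows(users, "user_id")
order_buckets = bucket_rows(orders, "user_id")
result = sorted(
    bucketed_join(user_buckets, order_buckets, "user_id"),
    key=lambda row: row["order_id"],
)
```

The one-time cost of bucketing pays off only when the same join is repeated, which is why the criteria above stress both high cardinality and multiple joins.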