
Showing posts with the label join optimization

Apache Spark - Explicit Broadcasting Optimization

Join operations are an indispensable part of the majority of Big Data projects. If you are using Apache Spark for your Big Data processing and your workload includes joins, in this article I'll explain a strategy to optimize them. The technique is to explicitly broadcast a table in any Spark join operation and thereby convert the join into a Broadcast Hash Join. Why broadcasting? When we perform a join operation in Spark SQL, broadcasting proves very efficient in reducing data shuffling, and hence the serialization and deserialization of data over the network, which happens during a Sort Merge Join. If you haven't read my previous article on Spark's Broadcast Hash Join, I highly recommend you go through it first; it covers most of the deeper details of Spark's Broadcast Hash Join. You can find it here: Broadcast Hash Join - By Akshay Agarwal. Business use case - To explain this optimization, let's take the business...
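As a quick taste of the technique, here is a minimal sketch in Scala of forcing a Broadcast Hash Join with an explicit broadcast hint (the file paths, table names, and join key below are hypothetical, used only for illustration):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.broadcast

val spark = SparkSession.builder.appName("ExplicitBroadcastJoin").getOrCreate()

// Hypothetical datasets: a large fact table and a small dimension table
val transactions = spark.read.parquet("/data/transactions")
val customers    = spark.read.parquet("/data/dim_customers")

// Explicitly broadcast the smaller table so Spark plans a Broadcast Hash Join
// instead of shuffling both sides for a Sort Merge Join
val joined = transactions.join(broadcast(customers), Seq("customer_id"))

joined.explain()  // the physical plan should now show a BroadcastHashJoin node
```

The hint only marks the smaller side; how much data can safely be broadcast still depends on executor memory, so it is worth checking the physical plan with explain() before relying on it.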

Apache Spark - Broadcast Hash Join

In this technical article, we'll discuss one of Apache Spark's efficient join techniques, the Broadcast Hash Join. I will explain in detail how this operation is implemented internally in Spark at the DAG (Directed Acyclic Graph) level, along with the different phases of the DAG. We'll also uncover how a Spark application is divided into jobs and stages. Data - To explain this join concept, I'll take customer data from a hypothetical online service provider organization with an international presence. Customers across the world who use its services hold accounts with this company. Moreover, these customers use different devices to access the company's website. Hence, to fetch the details of the different customers (business_accounts) and the corresponding device used by each, we'll use two datasets: Business Accounts data - Data consisting of details of the customer's business account, created with this service providing organization...
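To make the setup concrete, a minimal sketch of loading the two datasets described above and joining them could look like the following (the file locations, header option, and join key are assumptions for illustration, not the article's actual data):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("BroadcastHashJoinDemo").getOrCreate()

// Hypothetical locations for the two datasets described above
val businessAccounts = spark.read.option("header", "true").csv("/data/business_accounts.csv")
val devices          = spark.read.option("header", "true").csv("/data/devices.csv")

// If the smaller side fits under spark.sql.autoBroadcastJoinThreshold,
// Spark plans this join as a Broadcast Hash Join automatically
val accountDevices = businessAccounts.join(devices, Seq("account_id"))

accountDevices.explain()  // inspect the physical plan; jobs and stages appear in the Spark UI
```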

Apache Spark - Sort Merge Join

Apache Spark is a well-known open-source cluster computing system for large-scale data processing. If you are into Big Data and Machine Learning, Apache Spark may not need any introduction. Its capabilities include in-memory processing, distributed SQL analytics, real-time stream processing, graph processing, machine learning, support for several programming languages, and many more. "What happens under the hood when we execute a Spark application?" This is a very common question that crops up in every Apache Spark user's mind. If you have the same question, I'll answer it using Spark SQL by executing one of the most common data processing operations, the "Join". Spark being a vast technology, it is nearly impossible to cover all of its details in a single article. However, the details I'll cover here should be enough to give you an idea of what goes on behind the scenes. Prerequisite - This article ...
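As a hint of what the article walks through, here is a minimal sketch that forces Spark to pick a Sort Merge Join by disabling automatic broadcasting (the dataset paths and join key are hypothetical):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("SortMergeJoinDemo").getOrCreate()

// Disable automatic broadcasting so the join falls back to a Sort Merge Join
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "-1")

// Hypothetical datasets sharing a join key
val orders    = spark.read.parquet("/data/orders")
val customers = spark.read.parquet("/data/customers")

// Both sides are shuffled on the join key, sorted within partitions, then merged
val result = orders.join(customers, Seq("customer_id"))

result.explain()  // the plan shows Exchange -> Sort -> SortMergeJoin
```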