Technical insight into Big Data, Cloud, Graph & Artificial Intelligence technologies

Showing posts with the label Broadcast Hash Join

Apache Spark - Explicit Broadcasting Optimization

By Akshay Agarwal September 25, 2018

Joins are an indispensable part of most Big Data projects. If you use Apache Spark for Big Data processing and rely on joins, this article explains a strategy to optimize them: explicitly broadcasting a table in a Spark join so that the join is executed as a Broadcast Hash Join.

Why Broadcasting - When we perform a join in Spark SQL, broadcasting is very efficient at reducing data shuffling, and hence the serialization and deserialization of data over the network, that happens during a Sort Merge Join. If you haven't read my previous article on Spark's Broadcast Hash Join, I highly recommend going through it first, as it covers most of the deeper details of Spark's Broadcast Hash Join. You can find it here: Broadcast Hash Join - By Akshay Agarwal

Business use case - To explain this optimization, let's take the business...
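To make the "less shuffling" argument concrete, here is a rough back-of-the-envelope sketch in plain Python. It is not Spark's planner or cost model; the function names, row counts, and executor count are all hypothetical, and it assumes the worst case where every shuffled row moves to another executor.

```python
# Simplified cost model (an assumption for illustration, not Spark's planner):
# count how many rows must cross the network under each join strategy.

def shuffle_join_rows_moved(large_rows: int, small_rows: int) -> int:
    # A shuffle-based join (such as Sort Merge Join) repartitions BOTH sides
    # by the join key; in the worst case every row moves to another executor.
    return large_rows + small_rows


def broadcast_join_rows_moved(small_rows: int, num_executors: int) -> int:
    # A broadcast join leaves the large side where it is and ships one full
    # copy of the small side to each executor.
    return small_rows * num_executors


# Hypothetical sizes: a 10M-row table joined with a 1K-row lookup table.
large, small, executors = 10_000_000, 1_000, 20
shuffle_cost = shuffle_join_rows_moved(large, small)
broadcast_cost = broadcast_join_rows_moved(small, executors)
print(shuffle_cost, broadcast_cost)
```

Under these assumed numbers, broadcasting moves far fewer rows because the large table never leaves its executors. In PySpark, the explicit hint itself is expressed by wrapping the smaller DataFrame in `pyspark.sql.functions.broadcast` before joining.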

Apache Spark - Broadcast Hash Join

By Akshay Agarwal September 16, 2018

In this technical article, we'll discuss one of Apache Spark's efficient join techniques, the Broadcast Hash Join. I will explain in detail how this operation is implemented internally in Spark at the DAG (Directed Acyclic Graph) level, along with the different phases of the DAG. We'll also uncover how a Spark application is divided into jobs and stages.

Data - To explain this join concept, I'll take the customer data of a hypothetical online service provider with an international presence. Customers across the world who use its services hold accounts with this company. Moreover, the customers use different devices to access the company's website. Hence, to fetch the details of different customers (business_accounts) and the corresponding devices used by them, we'll use two datasets: Business Accounts data - Data consisting of details of the customer's business account, created with this service providing organizatio...
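The general idea of a broadcast hash join can be sketched in plain Python, independent of Spark: build a hash table from the small dataset once, hand a copy to every partition of the large dataset, and probe it locally. This is only a conceptual sketch, not Spark's internal implementation; the column names (`device_id`, `device`, `account`) and the sample rows are hypothetical stand-ins for the business accounts and device data described above.

```python
from collections import defaultdict


def build_hash_table(small_rows, key):
    """Build phase: index the small (broadcast) side by the join key."""
    table = defaultdict(list)
    for row in small_rows:
        table[row[key]].append(row)
    return table


def probe_partition(partition, hash_table, key):
    """Probe phase: runs independently on each partition of the large side."""
    joined = []
    for row in partition:
        # Rows without a match are dropped, i.e. inner-join semantics.
        for match in hash_table.get(row[key], []):
            joined.append({**row, **match})
    return joined


# Hypothetical small side: device lookup data.
devices = [
    {"device_id": 1, "device": "mobile"},
    {"device_id": 2, "device": "laptop"},
]

# Hypothetical large side: business accounts, already split into partitions.
account_partitions = [
    [{"account": "A100", "device_id": 1}, {"account": "A200", "device_id": 2}],
    [{"account": "A300", "device_id": 1}, {"account": "A400", "device_id": 9}],
]

# Built once; in a cluster, a copy of this table would be sent to each executor.
hash_table = build_hash_table(devices, "device_id")

result = [
    joined_row
    for partition in account_partitions
    for joined_row in probe_partition(partition, hash_table, "device_id")
]
print(result)
```

Because each partition only needs its own rows plus the shared hash table, no rows of the large side have to be exchanged between partitions, which is what makes this strategy attractive when one side is small.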