The Apache Spark SQL component comes with the Catalyst optimizer, which intelligently optimizes jobs by re-arranging the order of transformations and by choosing specialized join strategies based on the datasets involved. Spark selects these join strategies internally, or you can force it to use a particular one. This topic is worth knowing, because it comes to the rescue when you are optimizing jobs for your own use case.

Shuffle Hash Join

A shuffle hash join shuffles the data on the join keys and then performs the join. It ensures that each partition contains the same keys from both sides by partitioning the second dataset with the same default partitioner as the first, so that keys with the same hash value from both datasets land in the same partition. It follows the classic map-reduce pattern: First ...
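To make the idea of forcing a join strategy concrete, here is a minimal sketch in Scala. It assumes Spark 3.0+, where the SHUFFLE_HASH join hint is available; the datasets, column names, and object name are made up for illustration, and the config tweaks simply steer the planner away from broadcast and sort-merge joins so a shuffle hash join is actually chosen.

import org.apache.spark.sql.SparkSession

object ShuffleHashJoinSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("shuffle-hash-join-sketch")
      .master("local[*]")
      // Discourage the default sort-merge join so the planner may
      // pick a shuffle hash join when one side is small enough to hash.
      .config("spark.sql.join.preferSortMergeJoin", "false")
      // Disable broadcast joins so a shuffle-based join is chosen.
      .config("spark.sql.autoBroadcastJoinThreshold", "-1")
      .getOrCreate()

    import spark.implicits._

    // Two hypothetical datasets sharing a join key.
    val orders    = Seq((1, "laptop"), (2, "phone"), (1, "mouse"))
      .toDF("customer_id", "item")
    val customers = Seq((1, "Alice"), (2, "Bob"))
      .toDF("customer_id", "name")

    // Spark 3.0+: explicitly request a shuffle hash join via a hint.
    // Both sides are shuffled on customer_id; the hinted side is
    // built into an in-memory hash table within each partition.
    val joined = orders.join(customers.hint("SHUFFLE_HASH"), "customer_id")

    joined.explain() // the physical plan should show a ShuffledHashJoin node
    joined.show()

    spark.stop()
  }
}

Calling explain() is a quick way to verify which strategy the optimizer actually picked, since hints are requests rather than guarantees.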