partitioning Archives - 24 Tutorials

Deep dive into Partitioning in Spark – Hash Partitioning and Range Partitioning

Sai Kumar April 8, 2018 No Comments

Partitions- The data within an RDD is split into several partitions. Properties of partitions: – Partitions never span multiple machines, i.e., tuples in the same partition are guaranteed to be on the same machine. – Each machine in the cluster contains one or more partitions. – The number of partitions to use is configurable. By default, it equals the total number of cores on all executor nodes. Two kinds of partitioning available in Spark: – Hash partitioning – Range partitioning Customizing a partitioning is only possible on Pair RDDs. Hash partitioning- Given a Pair RDD that should be grouped: val purchasesPerCust = purchasesRdd.map(p -> (p.customerId, p.price)) // Pair RDD .groupByKey() groupByKey first computes per tuple (k, v) its partition p: p = k....

Deep dive into Partitioning in Spark – Hash Partitioning and Range Partitioning

Login

Lost Password

Register