• Bucketing decomposes data sets into more manageable parts
• Users can specify the number of buckets for their data set
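• Declaring buckets in the table definition does not guarantee that the table is populated correctly; bucketing must also be enforced when the data is loaded
• The number of buckets is fixed when the table is created and does not vary with the data
• Bucketing is well suited to sampling
• Bucketing makes map-side joins efficient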
In the sample code below, a hash function is applied to ‘emplid’, and rows with the same emplid are placed in the same bucket.
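-- Either let Hive enforce bucketing on insert:
SET hive.enforce.bucketing = true;
-- or set the number of reducers to match the number of buckets (64 for the table below):
SET mapred.reduce.tasks = 64;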
CREATE TABLE empdata(emplid INT, fname STRING, lname STRING)
PARTITIONED BY (join_dt STRING)
CLUSTERED BY (emplid) INTO 64 BUCKETS;
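As a rough sketch of how the table above might be populated and then sampled: the staging table name empdata_staging and the partition value '2015-01-01' are hypothetical and used only for illustration; the statements assume a Hive version that still accepts hive.enforce.bucketing.

```sql
-- Assumption: empdata_staging is a hypothetical unbucketed source table
-- with columns emplid, fname, lname, join_dt.
SET hive.enforce.bucketing = true;

-- Load one partition; Hive hashes emplid to assign each row to one of the 64 buckets.
INSERT OVERWRITE TABLE empdata PARTITION (join_dt = '2015-01-01')
SELECT emplid, fname, lname
FROM empdata_staging
WHERE join_dt = '2015-01-01';

-- Sample roughly 1/64 of the partition by reading only the first bucket.
SELECT emplid, fname, lname
FROM empdata TABLESAMPLE (BUCKET 1 OUT OF 64 ON emplid) e
WHERE join_dt = '2015-01-01';
```

Because the sample is taken on the same column the table is clustered by, Hive can read a single bucket file instead of scanning the whole partition. For the map-side join case, the relevant switch is hive.optimize.bucketmapjoin, which generally requires both tables to be bucketed on the join key with bucket counts that are multiples of each other.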