How are Partitioning and Bucketing different from each other in Apache Hive?

I know both are applied on columns in the table but how are each of these operations different from the other?
Apr 15 in Big Data Hadoop by nitinrawat895

2 answers to this question.


To understand how partitioning and bucketing work, we should look at how data is stored in Hive. Let's assume we have a table of students with columns such as STD-ID, STD-NAME, COUNTRY, REG-NUM, TIME-ZONE and DEPARTMENT.

For instance, if you have a 'country' field, there are roughly 200 countries in the world, so its cardinality would be at most a few hundred. In general, a field chosen for partitioning should not have high cardinality, because each distinct value becomes a separate directory in your file system, and too many values means too many directories.
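As a sketch, a table partitioned on the low-cardinality country field could be declared like this (table and column names are illustrative):

```sql
-- Each distinct country value becomes its own directory under the
-- table's HDFS location, e.g. .../students/country=India/
CREATE TABLE students (
  std_id   INT,
  std_name STRING,
  reg_num  STRING
)
PARTITIONED BY (country STRING);
```

Note that the partition column is declared only in the PARTITIONED BY clause, not in the regular column list.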

Clustering or bucketing, on the other hand, results in a fixed number of files, since we specify the number of buckets up front. Hive takes the bucketing column's value, computes a hash and assigns each record to a particular bucket.
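A bucketed version of the same illustrative table, hashed on a high-cardinality column, might look like this:

```sql
-- Rows are hashed on std_id and spread across a fixed set of 32 files,
-- no matter how many distinct std_id values exist.
CREATE TABLE students_bucketed (
  std_id   INT,
  std_name STRING,
  country  STRING
)
CLUSTERED BY (std_id) INTO 32 BUCKETS;
```

Unlike partitioning, the bucketing column stays in the regular column list, and the file count (32 here) never grows with the data's cardinality.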

So, bucketing works well when the field has high cardinality and data is evenly distributed among buckets. Partitioning works best when the cardinality of the partitioning field is not too high.


answered Apr 15 by nitinrawat895

Let us consider a student database table to understand this question better.

Assume you have successfully loaded the Student table onto HDFS and now you are about to partition it.

Care must be taken in how you apply the partition. For example, you cannot simply partition on the basis of student ID, as that would end up creating a huge number of directories. It would be better to partition on the basis of department.

It all depends on cardinality: the number of distinct values a field can take.

If cardinality is ignored, you may end up creating many useless directories that unnecessarily consume storage.

Clustering or bucketing, on the other hand, results in a fixed number of files, since you specify the number of buckets. Hive takes the field's value, computes a hash and assigns each record to a bucket.

So, bucketing works well when the field has high cardinality and data is evenly distributed among buckets. Partitioning works best when the cardinality of the partitioning field is not too high.
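The two techniques can also be combined. A sketch for the student example above (table, column names and bucket count are hypothetical):

```sql
-- Partition on the low-cardinality department, then bucket each
-- partition's rows on the high-cardinality student ID.
CREATE TABLE students (
  std_id   INT,
  std_name STRING
)
PARTITIONED BY (department STRING)
CLUSTERED BY (std_id) INTO 16 BUCKETS;

-- In older Hive versions, bucketing must be enforced before inserting:
SET hive.enforce.bucketing = true;
```

This keeps the directory count small (one per department) while still splitting each partition into a fixed, predictable number of files.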


answered Apr 15 by nitinrawat895
