How are Partitioning and Bucketing different from each other in Apache Hive?

I know both are applied on columns in the table but how are each of these operations different from the other?
Apr 15 in Big Data Hadoop by nitinrawat895

2 answers to this question.


To understand how partitioning and bucketing work, we should look at how data is stored in Hive. Let's assume we have a table of students with columns such as STD-ID, STD-NAME, COUNTRY, REG-NUM, TIME-ZONE and DEPARTMENT.

For instance, if you have a 'country' field, there are roughly 200 countries in the world, so its cardinality would be at most a few hundred. In general, a field chosen for partitioning should not have high cardinality, because each distinct value becomes a separate directory in your file system, and too many values means too many directories.
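As a sketch, a table partitioned on the low-cardinality country field could be declared like this (table and column names are illustrative):

```sql
-- Each distinct country value becomes its own directory under the
-- table's HDFS location, e.g. .../students/country=India/
CREATE TABLE students (
  std_id   INT,
  std_name STRING,
  reg_num  STRING
)
PARTITIONED BY (country STRING);
```

Note that the partition column is declared only in the PARTITIONED BY clause, not in the regular column list.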

Clustering or bucketing, on the other hand, results in a fixed number of files, since we specify the number of buckets up front. Hive takes the bucketing column's value, computes a hash and assigns each record to a particular bucket.
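A bucketed version of the same illustrative table, hashed on a high-cardinality column, might look like this:

```sql
-- Rows are hashed on std_id and spread across a fixed set of 32 files,
-- no matter how many distinct std_id values exist.
CREATE TABLE students_bucketed (
  std_id   INT,
  std_name STRING,
  country  STRING
)
CLUSTERED BY (std_id) INTO 32 BUCKETS;
```

Unlike partitioning, the bucketing column stays in the regular column list, and the file count (32 here) never grows with the data's cardinality.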

So, bucketing works well when the field has high cardinality and data is evenly distributed among buckets. Partitioning works best when the cardinality of the partitioning field is not too high.


answered Apr 15 by nitinrawat895

Let us consider a student database table to understand this question better.

Assume you have successfully loaded the Student table onto HDFS and now you are about to partition it.

Care must be taken in how you apply the partition. For example, you cannot simply partition on the basis of student ID, as that would end up creating a huge number of directories. It would be better to partition on the basis of department.

It all depends on cardinality: the number of distinct values a field can take.

If cardinality is ignored, you may end up creating many useless directories that unnecessarily consume storage.

Clustering or bucketing, on the other hand, results in a fixed number of files, since you specify the number of buckets. Hive takes the field's value, computes a hash and assigns each record to a bucket.

So, bucketing works well when the field has high cardinality and data is evenly distributed among buckets. Partitioning works best when the cardinality of the partitioning field is not too high.
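The two techniques can also be combined. A sketch for the student example above (table, column names and bucket count are hypothetical):

```sql
-- Partition on the low-cardinality department, then bucket each
-- partition's rows on the high-cardinality student ID.
CREATE TABLE students (
  std_id   INT,
  std_name STRING
)
PARTITIONED BY (department STRING)
CLUSTERED BY (std_id) INTO 16 BUCKETS;

-- In older Hive versions, bucketing must be enforced before inserting:
SET hive.enforce.bucketing = true;
```

This keeps the directory count small (one per department) while still splitting each partition into a fixed, predictable number of files.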


answered Apr 15 by nitinrawat895
