How are Partitioning and Bucketing different from each other in Apache Hive?

I know both are applied to columns in a table, but how do these two operations differ from each other?
Apr 15 in Big Data Hadoop by nitinrawat895

2 answers to this question.


To understand how partitioning and bucketing work, we should look at how Hive stores data. Let's assume we have a table of employees with the columns STD-ID, STD-NAME, COUNTRY, REG-NUM, TIME-ZONE, DEPARTMENT, and so on.

Partitioning stores the rows for each distinct value of the partition column in its own directory. For instance, if you partition on a 'country' field, there are roughly 200 countries in the world, so the cardinality would be about 200. In general, the field you choose for partitioning should not have high cardinality, because you would end up with too many directories in your file system.
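To make the directory idea concrete, here is a minimal Python sketch (a simulation, not Hive code) of how partitioning groups rows into one directory per distinct value. The `column=value` directory naming follows Hive's convention; the table path, the sample rows, and the helper `partition_dirs` are made up for this example.

```python
from collections import defaultdict

def partition_dirs(rows, column, table_path="/warehouse/employees"):
    """Group rows by the partition column: one directory per distinct value."""
    dirs = defaultdict(list)
    for row in rows:
        # Hive names partition directories <column>=<value>
        dirs[f"{table_path}/{column}={row[column]}"].append(row)
    return dict(dirs)

# Hypothetical sample rows (not from the original question)
rows = [
    {"id": 1, "name": "Asha", "country": "IN"},
    {"id": 2, "name": "Bob", "country": "US"},
    {"id": 3, "name": "Chen", "country": "IN"},
]

layout = partition_dirs(rows, "country")
for path, members in sorted(layout.items()):
    print(path, len(members))
```

Every new distinct value adds another directory, which is why the number of directories tracks the cardinality of the partition column.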

Clustering, or bucketing, on the other hand, results in a fixed number of files, since we specify the number of buckets we need. Hive computes a hash of the field value and uses it (modulo the number of buckets) to assign each record to a particular bucket.
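The bucket assignment can be sketched in a few lines of Python. This is only an illustration of the "hash, then modulo the bucket count" idea: `zlib.crc32` is a stand-in for Hive's internal hash function, and `bucket_for` and the sample IDs are hypothetical.

```python
import zlib

def bucket_for(value, num_buckets):
    """Return the bucket index for a value: hash(value) mod num_buckets."""
    # crc32 is a stand-in here; Hive uses its own hash function internally.
    h = zlib.crc32(str(value).encode("utf-8"))
    return h % num_buckets

NUM_BUCKETS = 4
for emp_id in [101, 102, 103, 104, 105]:
    print(emp_id, "-> bucket", bucket_for(emp_id, NUM_BUCKETS))
```

However many distinct IDs you feed in, the result is always one of the 4 bucket indexes, so the number of files stays fixed.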

So, bucketing works well when the field has high cardinality and data is evenly distributed among buckets. Partitioning works best when the cardinality of the partitioning field is not too high.


answered Apr 15 by nitinrawat895

Let us consider a student database table to understand this question better.

Assume you have successfully loaded the Student table into HDFS and are now about to partition it.

You need to be careful about how you apply the partition. For example, you cannot simply partition on student ID, because that would create a huge number of directories. It would be better to partition on department.

It all depends on cardinality, which is the number of distinct values a field can take.

If you ignore cardinality, you may end up creating many small, useless directories that consume storage unnecessarily.
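A quick way to see the difference is to count distinct values per column. The Python sketch below uses a made-up student table (1,000 students spread over 5 departments); the column names and the helper `cardinality` are assumptions for illustration only.

```python
def cardinality(rows, column):
    """Number of distinct values in a column, i.e. the partitions it would create."""
    return len({row[column] for row in rows})

# Hypothetical student table: 1,000 students across 5 departments
departments = ["CS", "EE", "ME", "CE", "BT"]
students = [
    {"student_id": i, "department": departments[i % len(departments)]}
    for i in range(1000)
]

print("partition by student_id ->", cardinality(students, "student_id"), "directories")
print("partition by department ->", cardinality(students, "department"), "directories")
```

Partitioning on student_id would give one directory per student, while partitioning on department gives only five.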

Clustering, or bucketing, on the other hand, results in a fixed number of files, since you specify the number of buckets. Hive takes the field, calculates a hash of its value and assigns the record to the corresponding bucket.

So, bucketing works well when the field has high cardinality and data is evenly distributed among buckets. Partitioning works best when the cardinality of the partitioning field is not too high.
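To see why cardinality matters for bucketing too, here is a rough Python simulation that counts how many rows land in each bucket. As before, `zlib.crc32` is only a stand-in for Hive's real hash, and the data, bucket count, and helper `bucket_counts` are invented for the example.

```python
import zlib
from collections import Counter

def bucket_counts(values, num_buckets):
    """Count how many values land in each bucket (crc32 is a stand-in hash)."""
    counts = Counter(
        zlib.crc32(str(v).encode("utf-8")) % num_buckets for v in values
    )
    return [counts.get(b, 0) for b in range(num_buckets)]

NUM_BUCKETS = 8
# High cardinality: 10,000 distinct student IDs
by_id = bucket_counts(range(10_000), NUM_BUCKETS)
# Low cardinality: 10,000 rows but only 3 distinct departments
by_dept = bucket_counts(["CS", "EE", "ME"] * 3_333 + ["CS"], NUM_BUCKETS)

print("by student_id:", by_id)
print("by department:", by_dept)
```

With only three distinct department values, at most three of the eight buckets can ever receive rows, so the rest stay empty. The student IDs have many distinct values, so they have a chance to spread across all the buckets.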


answered Apr 15 by nitinrawat895


© 2018 Brain4ce Education Solutions Pvt. Ltd. All rights Reserved.