How are Partitioning and Bucketing different from each other in Apache Hive?

0 votes
I know both are applied to columns of a table, but how do the two operations differ from each other?
Apr 15, 2019 in Big Data Hadoop by nitinrawat895
• 11,380 points
3,831 views

2 answers to this question.

0 votes

To understand how partitioning and bucketing work, we should look at how data is stored in Hive. Let's assume we have a table of employees with the columns STD-ID, STD-NAME, COUNTRY, REG-NUM, TIME-ZONE, DEPARTMENT, and so on.

Hive stores each partition as its own directory, one per distinct value of the partitioning field. For instance, if you partition on a 'country' field, there are roughly 200 countries in the world, so the cardinality would be around 200. In general, the field chosen for partitioning should not have high cardinality, because you will end up with too many directories in your file system.
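To make the directory concern concrete, here is a small Python sketch (an illustration, not Hive code) that mimics how a partitioned table is laid out: one `column=value` subdirectory per distinct value of the partitioning field. The warehouse path, the helper name `partition_paths`, and the sample rows are all made up for this example.

```python
from collections import defaultdict

def partition_paths(rows, column, table_dir="/user/hive/warehouse/employees"):
    """Group rows by the partition column, one directory per distinct value.

    table_dir is a hypothetical location. Partition directories are
    named <column>=<value> underneath the table directory.
    """
    layout = defaultdict(list)
    for row in rows:
        layout[f"{table_dir}/{column}={row[column]}"].append(row)
    return dict(layout)

# Made-up sample rows for illustration
rows = [
    {"std_id": 1, "country": "IN"},
    {"std_id": 2, "country": "US"},
    {"std_id": 3, "country": "IN"},
    {"std_id": 4, "country": "DE"},
]

layout = partition_paths(rows, "country")
for path, members in sorted(layout.items()):
    print(path, len(members))
```

Four rows with three distinct countries give three directories. The directory count follows the number of distinct values, not the number of rows, which is why the cardinality of the field matters.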

Clustering or bucketing, on the other hand, results in a fixed number of files, because we specify the number of buckets up front. Hive takes the field, calculates a hash of its value, and assigns the record to the corresponding bucket.
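The hash-then-assign idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `zlib.crc32` is a deterministic stand-in and not the hash function Hive applies, and the bucket count of 4 and the ID range are made up.

```python
import zlib
from collections import Counter

NUM_BUCKETS = 4  # assumed value; in Hive this is fixed when the table is defined

def bucket_for(value, num_buckets=NUM_BUCKETS):
    """Return a bucket index in [0, num_buckets) for a column value.

    zlib.crc32 is only a stand-in hash for this sketch; Hive uses its own.
    """
    return zlib.crc32(str(value).encode("utf-8")) % num_buckets

# Made-up high-cardinality field: 1000 distinct IDs
ids = range(1, 1001)
counts = Counter(bucket_for(i) for i in ids)

# However many distinct IDs there are, at most NUM_BUCKETS buckets exist.
print(sorted(counts.items()))
```

The same value always lands in the same bucket, and the number of buckets stays fixed no matter how many distinct values the field has.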

So, bucketing works well when the field has high cardinality and data is evenly distributed among buckets. Partitioning works best when the cardinality of the partitioning field is not too high.

answered Apr 15, 2019 by nitinrawat895
• 11,380 points
0 votes

Let us consider a student database table to understand this question better.

Assume you have successfully loaded the Student table into HDFS and are now about to partition it.

Care must be taken in how you apply the partition. For example, you cannot simply partition on student-ID, because that would create a huge number of directories. Partitioning on Department would be a better choice.

It all depends on cardinality, which is the number of distinct values a field can take.

If you ignore cardinality, you may end up creating many unnecessary directories that needlessly consume storage.
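A quick way to see the difference is to count distinct values before choosing a partitioning field. The following Python sketch uses made-up data (10,000 students spread over four hypothetical departments); the helper name `partition_count` is also just for illustration.

```python
def partition_count(rows, column):
    """Number of partition directories equals the number of distinct values."""
    return len({row[column] for row in rows})

# Made-up data: 10,000 students across four hypothetical departments
departments = ["CS", "EE", "ME", "CE"]
students = [
    {"student_id": i, "department": departments[i % len(departments)]}
    for i in range(10_000)
]

by_id = partition_count(students, "student_id")
by_dept = partition_count(students, "department")
print(by_id, by_dept)  # prints: 10000 4
```

Partitioning on student-ID would mean one directory per student, while partitioning on Department needs only four.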

Clustering or bucketing, on the other hand, results in a fixed number of files, because you specify the number of buckets. Hive takes the field, calculates a hash of its value, and assigns the record to the corresponding bucket.

So, bucketing works well when the field has high cardinality and data is evenly distributed among buckets. Partitioning works best when the cardinality of the partitioning field is not too high.

answered Apr 15, 2019 by nitinrawat895
• 11,380 points
