What is a custom partitioner in Hadoop? How do I write a partition function?

I am trying to write a new Hadoop job for input data that is somewhat skewed. An analogy would be the word count example in the Hadoop tutorial, except that one particular word appears a very large number of times.

I want a partition function that maps this one key to multiple reducers and sends the remaining keys to reducers by the usual hash partitioning. Is this possible?
Sep 18, 2018 in Big Data Hadoop by Neha

1 answer to this question.

I don't think Hadoop allows the same key to be mapped to multiple reducers. However, the keys can be partitioned so that the reducers are more or less evenly loaded. To do this, sample the input data and partition the keys according to the distribution you observe.

Check the Yahoo Paper for more details on the custom partitioner. 

The Yahoo Sort code is in the org.apache.hadoop.examples.terasort package.

Let's say key A has 10 rows, B has 20 rows, C has 30 rows, and D has 60 rows in the input. Keys A, B, and C can then be sent to reducer 1 and key D to reducer 2, so that each reducer handles 60 rows and the load is evenly distributed. To partition the keys this way, the input has to be sampled to learn how the keys are distributed.
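The A/B/C/D example above can be sketched in plain Java. This is only an illustration under stated assumptions: the class name SkewAwarePartitionSketch and the methods assignKeys and getPartition are hypothetical, the greedy "heaviest key first" assignment is one simple balancing strategy rather than what TeraSort does, and String stands in for Hadoop's Text. In a real job the lookup would live in a subclass of org.apache.hadoop.mapreduce.Partitioner, whose getPartition(key, value, numPartitions) method has the same shape.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class SkewAwarePartitionSketch {

    /**
     * Assigns each sampled key to a reducer: heaviest key first,
     * always onto the reducer with the smallest load so far.
     */
    static Map<String, Integer> assignKeys(Map<String, Integer> sampledCounts, int numReducers) {
        List<Map.Entry<String, Integer>> entries = new ArrayList<>(sampledCounts.entrySet());
        // Sort by descending row count; break ties by key so the result is deterministic.
        entries.sort((a, b) -> {
            int byCount = Integer.compare(b.getValue(), a.getValue());
            return byCount != 0 ? byCount : a.getKey().compareTo(b.getKey());
        });

        long[] load = new long[numReducers];
        Map<String, Integer> assignment = new HashMap<>();
        for (Map.Entry<String, Integer> e : entries) {
            int target = 0;
            for (int r = 1; r < numReducers; r++) {
                if (load[r] < load[target]) {
                    target = r;
                }
            }
            assignment.put(e.getKey(), target);
            load[target] += e.getValue();
        }
        return assignment;
    }

    /**
     * Looks up the precomputed reducer for a key. Keys that were not
     * seen during sampling fall back to ordinary hash partitioning.
     */
    static int getPartition(String key, Map<String, Integer> assignment, int numPartitions) {
        Integer assigned = assignment.get(key);
        if (assigned != null) {
            return assigned % numPartitions;
        }
        return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }

    public static void main(String[] args) {
        // Row counts from the example above (assumed to come from input sampling).
        Map<String, Integer> counts = new HashMap<>();
        counts.put("A", 10);
        counts.put("B", 20);
        counts.put("C", 30);
        counts.put("D", 60);

        Map<String, Integer> assignment = assignKeys(counts, 2);
        for (String key : new String[] {"A", "B", "C", "D"}) {
            System.out.println(key + " -> reducer " + getPartition(key, assignment, 2));
        }
    }
}
```

With these counts and two reducers, D ends up alone on one reducer and A, B, and C share the other, so each reducer receives 60 rows. Every row of a given key still goes to a single reducer.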

Here are some more suggestions to make the job complete faster.

Specify a Combiner on the JobConf to reduce the number of key-value pairs sent to the reducers.

This also reduces the network traffic between the mapper and reducer tasks. Note, however, that the Hadoop framework does not guarantee that the combiner will be invoked.
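To make the effect of a combiner concrete, here is a small simulation in plain Java of what a word-count combiner does to one mapper's output. It is a sketch, not Hadoop code: the class CombinerSketch and its combine method are hypothetical names. In an actual job the combiner is a Reducer class registered with setCombinerClass on the JobConf (or on Job in the newer API), and because it may run zero or more times, its logic should give the same final result either way, as summing does.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class CombinerSketch {

    /**
     * Sums the counts of identical keys within a single mapper's output,
     * which is the local aggregation a word-count combiner performs.
     */
    static Map<String, Integer> combine(List<Map.Entry<String, Integer>> mapOutput) {
        Map<String, Integer> combined = new LinkedHashMap<>();
        for (Map.Entry<String, Integer> record : mapOutput) {
            combined.merge(record.getKey(), record.getValue(), Integer::sum);
        }
        return combined;
    }

    public static void main(String[] args) {
        // One mapper's raw output: a (word, 1) pair per occurrence.
        List<Map.Entry<String, Integer>> mapOutput = new ArrayList<>();
        for (String word : "tools hadoop tools tools job tools".split(" ")) {
            mapOutput.add(new AbstractMap.SimpleEntry<>(word, 1));
        }

        Map<String, Integer> combined = combine(mapOutput);
        System.out.println(mapOutput.size() + " records before combining, "
                + combined.size() + " after: " + combined);
    }
}
```

Here the six (word, 1) records from the mapper collapse to three records, with the four occurrences of 'tools' shipped as a single ('tools', 4) pair. The skewed key benefits most, since its many occurrences are merged before they cross the network.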

Also, since the data is skewed (some keys are repeated again and again, say 'tools'), you might want to increase the number of reduce tasks to complete the job faster. That way, while one reducer is processing 'tools', the rest of the data is processed by other reducers in parallel.
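The effect of adding reduce tasks can be explored with a rough simulation. The sketch below uses the formula of Hadoop's default HashPartitioner, (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks, but with String.hashCode() standing in for Text.hashCode(), so the exact reducer numbers will differ from a real cluster. The class ReducerLoadSketch, its methods, and the row counts are made up for illustration. In a real job the reducer count is set with setNumReduceTasks.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

class ReducerLoadSketch {

    /** Same formula as the default hash partitioner, applied to a String key. */
    static int hashPartition(String key, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    /** Returns how many rows each reducer would receive for the given key counts. */
    static long[] loadPerReducer(Map<String, Integer> rowsPerKey, int numReduceTasks) {
        long[] load = new long[numReduceTasks];
        for (Map.Entry<String, Integer> e : rowsPerKey.entrySet()) {
            load[hashPartition(e.getKey(), numReduceTasks)] += e.getValue();
        }
        return load;
    }

    public static void main(String[] args) {
        // Hypothetical skewed input: 'tools' dominates every other key.
        Map<String, Integer> rowsPerKey = new HashMap<>();
        rowsPerKey.put("tools", 1000);
        rowsPerKey.put("hadoop", 40);
        rowsPerKey.put("job", 35);
        rowsPerKey.put("data", 50);
        rowsPerKey.put("map", 45);

        for (int n : new int[] {2, 4, 8}) {
            System.out.println(n + " reducers: " + Arrays.toString(loadPerReducer(rowsPerKey, n)));
        }
    }
}
```

Whatever the reducer count, all 1000 'tools' rows land on a single reducer, so that reducer remains the bottleneck. More reduce tasks only help the remaining keys spread out and finish in parallel while 'tools' is being processed.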

answered Sep 18, 2018 by Frankie
