Output Splitting problem in Hadoop


I ran the following script with two files as input, and the output was split into two files, part-m-00000 and part-m-00001. I don't understand why; please assist me. Note: each file is only 8.2 MB.

REGISTER PIG/PigUDF.jar;

A = LOAD 'PIG/HealthCare/Input/healthcare_Sample_dataset1.csv' USING PigStorage(',') AS (patientID:int, name:chararray, date:chararray, phoneNumber:chararray, eMail:chararray, SSN:chararray, gender:chararray, disease:chararray, age:chararray);

B = LOAD 'PIG/HealthCare/Input/healthcare_Sample_dataset2.csv' USING PigStorage(',') AS (patientID:int, name:chararray, date:chararray, phoneNumber:chararray, eMail:chararray, SSN:chararray, gender:chararray, disease:chararray, age:chararray);

C = UNION A, B;

D = FOREACH C GENERATE patientID, com.kamran.pig.udf.encryptField(name,"12345678abcdefgh"), com.kamran.pig.udf.encryptField(date,"12345678abcdefgh"), com.kamran.pig.udf.encryptField(phoneNumber,"12345678abcdefgh"), com.kamran.pig.udf.encryptField(eMail,"12345678abcdefgh"), com.kamran.pig.udf.encryptField(SSN,"12345678abcdefgh"), com.kamran.pig.udf.encryptField(gender,"12345678abcdefgh"), com.kamran.pig.udf.encryptField(disease,"12345678abcdefgh"), age;

STORE D INTO 'PIG/HealthCare/Output/HealthCareOutput.csv';
Jul 16, 2019 in Big Data Hadoop by Rasheed

When you load two different files, they are not necessarily stored in the same HDFS block. Each file may occupy different blocks, and a separate mapper runs over each input split; input splits are never combined across files. Since this is a map-only job, every mapper writes its own part file (part-m-00000, part-m-00001, and so on), which is why your two input files produce two output part files.

You can check this by loading a single small file into Pig and processing it; that should create just one part file in the output.
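The split arithmetic behind this can be sketched in a few lines. This is a simplified model of Hadoop's FileInputFormat behavior (the 128 MB default block size is an assumption; older clusters default to 64 MB), not code from the answer:

```python
import math

def num_splits(file_sizes_bytes, block_size=128 * 1024 * 1024):
    """Simplified model of FileInputFormat: each file contributes at
    least one split, plus one per additional full or partial block.
    Splits are never combined across files, so two small files always
    mean at least two mappers (and two part-m files in a map-only job)."""
    return sum(max(1, math.ceil(size / block_size))
               for size in file_sizes_bytes)

MB = 1024 * 1024

# Two 8.2 MB inputs: each fits easily inside one block,
# but each file still gets its own split.
print(num_splits([int(8.2 * MB), int(8.2 * MB)]))  # 2 -> part-m-00000, part-m-00001

# A single 200 MB file spans two 128 MB blocks -> two splits.
print(num_splits([200 * MB]))  # 2
```

So the number of part-m files tracks the number of input splits, not the total amount of data.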

Refer below:

A = LOAD 'weatherPIG.txt' USING TextLoader AS (date:chararray);

AF = FOREACH A GENERATE TRIM(SUBSTRING(date, 6, 14)), TRIM(SUBSTRING(date, 46, 53)), TRIM(SUBSTRING(date, 38, 45));

STORE AF INTO 'pigudf32' USING PigStorage(',');

You can check the pigudf32 folder; it should contain a single part file.
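If you actually need a single output file from the union, one option (a sketch, not part of the original answer; the output path 'SingleFile' is made up here) is to add a blocking operator with PARALLEL 1. Any of GROUP, ORDER, or DISTINCT introduces a reduce phase, and PARALLEL 1 forces a single reducer, hence a single part-r file:

A = LOAD 'PIG/HealthCare/Input/healthcare_Sample_dataset1.csv' USING PigStorage(',');
B = LOAD 'PIG/HealthCare/Input/healthcare_Sample_dataset2.csv' USING PigStorage(',');
C = UNION A, B;
-- single reducer => single output part file
D = ORDER C BY $0 PARALLEL 1;
STORE D INTO 'PIG/HealthCare/Output/SingleFile' USING PigStorage(',');

Alternatively, hadoop fs -getmerge <output-dir> merged.csv concatenates the part files into one local file after the job finishes. Keep in mind that forcing one reducer removes parallelism, so it only makes sense for small outputs like these.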

answered Jul 16, 2019 by Sayni
