Output Splitting problem in Hadoop

I ran the following script with two files as input, and the output was split into two files, part-m-00000 and part-m-00001. I don't understand why; could you please help? Note: each input file is only 8.2 MB.

REGISTER PIG/PigUDF.jar;

A = LOAD 'PIG/HealthCare/Input/healthcare_Sample_dataset1.csv' USING PigStorage(',') AS (patientID:int, name:chararray, date:chararray, phoneNumber:chararray, eMail:chararray, SSN:chararray, gender:chararray, disease:chararray, age:chararray);

B = LOAD 'PIG/HealthCare/Input/healthcare_Sample_dataset2.csv' USING PigStorage(',') AS (patientID:int, name:chararray, date:chararray, phoneNumber:chararray, eMail:chararray, SSN:chararray, gender:chararray, disease:chararray, age:chararray);

C = UNION A, B;

D = FOREACH C GENERATE patientID, com.kamran.pig.udf.encryptField(name,'12345678abcdefgh'), com.kamran.pig.udf.encryptField(date,'12345678abcdefgh'), com.kamran.pig.udf.encryptField(phoneNumber,'12345678abcdefgh'), com.kamran.pig.udf.encryptField(eMail,'12345678abcdefgh'), com.kamran.pig.udf.encryptField(SSN,'12345678abcdefgh'), com.kamran.pig.udf.encryptField(gender,'12345678abcdefgh'), com.kamran.pig.udf.encryptField(disease,'12345678abcdefgh'), age;

STORE D INTO 'PIG/HealthCare/Output/HealthCareOutput.csv';
Jul 16 in Big Data Hadoop by Rasheed

1 answer to this question.

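When you load two different files, they are not stored in the same HDFS data block. Each file sits in its own block (possibly on a different node), and each LOAD is read as at least one input split, so a separate mapper runs for each file. Your script has only LOAD, UNION, FOREACH and STORE, so it runs as a map-only job, and every mapper writes its own output file. That is why you get part-m-00000 and part-m-00001 (the "m" stands for map output). The small file size does not change this: two input files still mean at least two mappers.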

You can verify this by loading a single small file into Pig and processing it; this produces a single part file in the output.

For example:

A = load 'weatherPIG.txt' using TextLoader() as (data:chararray);

AF = foreach A generate TRIM(SUBSTRING(data, 6, 14)), TRIM(SUBSTRING(data, 46, 53)), TRIM(SUBSTRING(data, 38, 45));

store AF into 'pigudf32' using PigStorage(',');

Afterwards, check the pigudf32 folder; it should contain a single part file.
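
If you need a single output file instead of one part file per mapper, one common approach is to force a reduce phase with a single reducer. The sketch below is only illustrative: it reuses the input paths and schema from the question, leaves out the encryption step for brevity, and writes to a hypothetical output path (HealthCareOutputSingle is my assumption, not a path from the question). ORDER ... PARALLEL 1 sends all records through one reducer, so the output folder should contain a single part-r-00000 file.

```
-- Sketch only: same inputs and schema as in the question, encryption step omitted.
A = LOAD 'PIG/HealthCare/Input/healthcare_Sample_dataset1.csv' USING PigStorage(',')
    AS (patientID:int, name:chararray, date:chararray, phoneNumber:chararray,
        eMail:chararray, SSN:chararray, gender:chararray, disease:chararray, age:chararray);

B = LOAD 'PIG/HealthCare/Input/healthcare_Sample_dataset2.csv' USING PigStorage(',')
    AS (patientID:int, name:chararray, date:chararray, phoneNumber:chararray,
        eMail:chararray, SSN:chararray, gender:chararray, disease:chararray, age:chararray);

C = UNION A, B;

-- ORDER needs a reduce phase; PARALLEL 1 limits that phase to one reducer,
-- so only one part file (part-r-00000) is written.
E = ORDER C BY patientID PARALLEL 1;

-- Hypothetical output path, chosen here only for illustration.
STORE E INTO 'PIG/HealthCare/Output/HealthCareOutputSingle' USING PigStorage(',');
```

The trade-off is that all records pass through one reducer, which is fine for inputs of a few MB but slow for large data. Alternatively, keep the map-only job and merge the part files afterwards with hadoop fs -getmerge.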

answered Jul 16 by Sayni
