How do I parse PDF files using MapReduce?

0 votes

I have PDF documents that I want to parse with a MapReduce program. I have already written a Java program that parses PDF files using Apache PDFBox. I load the PDF file like this:

import java.io.File;
import org.apache.pdfbox.pdmodel.PDDocument;

String inputPath = "/home/edureka/pdf/abc.pdf";
File inputFile = new File(inputPath);
PDDocument pdf = PDDocument.load(inputFile);

Now I need to write a MapReduce program to parse the PDF document. I am planning to use WholeFileInputFormat to pass the entire document as a single split.

How can I use PDFBox with SequenceFileInputFormat or WholeFileInputFormat?

Apr 11, 2018 in Big Data Hadoop by nitinrawat895
• 11,380 points
1,669 views

1 answer to this question.

0 votes
If you have your own custom InputFormat (WholeFileInputFormat), I would suggest using a PDDocument object as the value passed to your map function, and loading the whole content of the PDF into that PDDocument in nextKeyValue() of WholeFileRecordReader (your custom reader).

Also make sure that your isSplitable() returns false so that the whole PDF is loaded as a single record.
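
To make the two points above concrete, here is a minimal sketch of what such an InputFormat and reader could look like. It assumes the new (org.apache.hadoop.mapreduce) API and PDFBox 2.x, where PDDocument.load(InputStream) is available (PDFBox 3.x moved loading to the Loader class). Using the file path as a Text key, and nesting the reader inside the InputFormat class, are illustrative choices and not something the question specifies.

```java
import java.io.IOException;

import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;
import org.apache.pdfbox.pdmodel.PDDocument;

/** Sketch: hands each PDF to a mapper as one record (key = path, value = document). */
public class WholeFileInputFormat extends FileInputFormat<Text, PDDocument> {

    @Override
    protected boolean isSplitable(JobContext context, Path file) {
        // A PDF cannot be parsed from an arbitrary byte offset, so never split it.
        return false;
    }

    @Override
    public RecordReader<Text, PDDocument> createRecordReader(
            InputSplit split, TaskAttemptContext context) {
        return new WholeFileRecordReader();
    }

    /** Emits exactly one key/value pair per input file. */
    static class WholeFileRecordReader extends RecordReader<Text, PDDocument> {
        private FileSplit split;
        private TaskAttemptContext context;
        private final Text key = new Text();
        private PDDocument value;
        private boolean processed = false;

        @Override
        public void initialize(InputSplit split, TaskAttemptContext context) {
            this.split = (FileSplit) split;
            this.context = context;
        }

        @Override
        public boolean nextKeyValue() throws IOException {
            if (processed) {
                return false; // the single record was already emitted
            }
            Path path = split.getPath();
            FileSystem fs = path.getFileSystem(context.getConfiguration());
            try (FSDataInputStream in = fs.open(path)) {
                // Assumption: PDFBox 2.x API; the whole file is read here.
                value = PDDocument.load(in);
            }
            key.set(path.toString());
            processed = true;
            return true;
        }

        @Override
        public Text getCurrentKey() {
            return key;
        }

        @Override
        public PDDocument getCurrentValue() {
            return value;
        }

        @Override
        public float getProgress() {
            return processed ? 1.0f : 0.0f;
        }

        @Override
        public void close() throws IOException {
            if (value != null) {
                value.close(); // release PDFBox resources once the mapper is done
            }
        }
    }
}
```

In the driver you would then call job.setInputFormatClass(WholeFileInputFormat.class) and declare the mapper as Mapper<Text, PDDocument, ...>. Inside map() you could, for example, extract the text with PDFTextStripper. Note that the PDDocument is only valid inside the map task, so whatever the mapper emits still has to be a Writable type such as Text.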
answered Apr 11, 2018 by Shubham
• 13,490 points
