How do I parse a PDF file using MapReduce?

I have PDF documents that I want to parse with a MapReduce program. I have already written a Java program that parses PDF files using Apache PDFBox. I load the PDF file like this:

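import java.io.File;
import org.apache.pdfbox.pdmodel.PDDocument;

String inputPath = "/home/edureka/pdf/abc.pdf";
File inputFile = new File(inputPath);
PDDocument pdf = PDDocument.load(inputFile); // PDFBox 1.x/2.x API; throws IOException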

Now I have to write a MapReduce program to parse the PDF document. I am planning to use WholeFileInputFormat to pass the entire document as a single split.

How can I use PDFBox with SequenceFileInputFormat or WholeFileInputFormat?

Apr 11, 2018 in Big Data Hadoop by nitinrawat895

1 answer to this question.

If you have your own custom InputFormat (WholeFileInputFormat), I would suggest using a PDDocument object as the value passed to the map function, and loading the whole content of the PDF into that PDDocument in the nextKeyValue() method of your custom reader (WholeFileRecordReader).

Also make sure that your isSplitable() returns false, so that the whole PDF is loaded as a single split.
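
To make the reader behaviour described above concrete, here is a minimal plain-Java sketch of the "one record per file" contract. It deliberately has no Hadoop or PDFBox dependency so it can be run on its own. The class name WholeFileReader is hypothetical; only the method names nextKeyValue(), getCurrentValue() and getProgress() mirror Hadoop's RecordReader. Treat it as a model of the idea under those assumptions, not as the actual WholeFileRecordReader.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

/**
 * Hypothetical, dependency-free model of a whole-file record reader.
 * It emits exactly one record: the complete contents of the file.
 */
class WholeFileReader {
    private final Path file;           // the single, unsplit input file
    private byte[] value;              // whole file contents once loaded
    private boolean processed = false; // true after the one record was emitted

    WholeFileReader(Path file) {
        this.file = file;
    }

    /** Loads the entire file on the first call; returns false on every later call. */
    boolean nextKeyValue() throws IOException {
        if (processed) {
            return false; // no second record: the file is never split
        }
        value = Files.readAllBytes(file);
        processed = true;
        return true;
    }

    /** Returns the bytes of the whole file, or null before the first record is read. */
    byte[] getCurrentValue() {
        return value;
    }

    /** Progress is all-or-nothing because there is only one record. */
    float getProgress() {
        return processed ? 1.0f : 0.0f;
    }
}
```

In a real job, I would expect the same logic to live in a class that extends Hadoop's RecordReader and reads from the split's file on HDFS rather than from a local Path. That is the point where the loaded content could be handed to PDFBox to build the PDDocument, for example by calling PDDocument.load on an input stream in PDFBox 2.x.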
answered Apr 11, 2018 by Shubham