Write a Split PDF Back to HDFS Using the Python InsecureClient

0 votes

I used PdfFileReader to read a PDF file from the data lake. My requirement is to split that PDF into individual pages and write each page as a separate file to a different folder in HDFS.

For reading the file I used the code below, and it works.

from PyPDF2 import PdfFileWriter, PdfFileReader
from io import BytesIO
from hdfs import InsecureClient
client = InsecureClient('http://datalake:50070')
import requests
from json import dumps

client.status("/")
fnames = client.list('/shared/Team5162')
with client.read('/shared/Team5162/DemoCompany/Green Energy Limited.pdf') as reader:
    input_pdf = PdfFileReader(BytesIO(reader.read()))
print(input_pdf.getNumPages())


Now I want to split the PDF and write the pages back. With the code below I am able to create 136 individual page files. However, the files have no content, and I get no error either.


for i in range(input_pdf.getNumPages()):
    out_pdf  = PdfFileWriter()
    output   = out_pdf.addPage(input_pdf.getPage(i))
    #output   = out_pdf.appendPagesFromReader(input_pdf)
    filename = "/shared/Team5162/demopdf/"+"document-page%s.pdf" % i
    with client.write(filename) as writeStream:
        writeStream.write(output)


Could you please comment?
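
A possible explanation, offered as a sketch rather than a confirmed diagnosis: in the legacy PyPDF2 API used above, addPage only adds the page to the writer and does not return PDF bytes (it returns None), so writeStream.write(output) never receives any PDF data. One way around this is to serialize each single-page writer into an in-memory BytesIO buffer with the writer's write method and then send the buffer's bytes to HDFS. The helper name split_pdf_to_hdfs and its writer_factory parameter are hypothetical, introduced only for illustration, and overwrite=True is an assumption so that reruns replace the earlier empty files.

```python
from io import BytesIO


def split_pdf_to_hdfs(client, input_pdf, writer_factory, target_dir):
    """Write each page of input_pdf as its own PDF file under target_dir.

    client         -- an hdfs client such as InsecureClient
    input_pdf      -- a PdfFileReader (legacy PyPDF2 API, as in the question)
    writer_factory -- hypothetical parameter: a callable that returns a new
                      writer, e.g. PyPDF2's PdfFileWriter class
    target_dir     -- HDFS folder that receives the single-page files
    """
    written = []
    for i in range(input_pdf.getNumPages()):
        out_pdf = writer_factory()
        # addPage registers the page on the writer; its return value is not
        # the PDF content, so it is deliberately ignored here.
        out_pdf.addPage(input_pdf.getPage(i))

        # Serialize the one-page document into memory first.
        buffer = BytesIO()
        out_pdf.write(buffer)

        filename = "%s/document-page%s.pdf" % (target_dir, i)
        # Assumption: overwrite=True, so earlier empty files get replaced.
        with client.write(filename, overwrite=True) as write_stream:
            write_stream.write(buffer.getvalue())
        written.append(filename)
    return written
```

Assuming the client and input_pdf objects from the reading code, it could be called as split_pdf_to_hdfs(client, input_pdf, PdfFileWriter, "/shared/Team5162/demopdf"). Newer PyPDF2 releases renamed these classes and methods (for example PdfWriter and add_page), so the names may need adjusting for the installed version.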

Nov 25, 2021 in Python by Kannan
• 120 points
410 views

No answer to this question. Be the first to respond.

