Issue with Python Read file as stream from HDFS

I have a file in HDFS that is too large to fit in memory. Is there a way to read it line by line, like any normal file, without holding the whole thing in memory?

I tried this, which is how I would read a local file:

with open("myfile", "r") as f:
    for line in f:
        pass  # do some processing

I am looking for an easy way to do this properly without external libraries. I could probably make it work with libpyhdfs or python-hdfs, but I would like to avoid introducing new dependencies and untested libraries into the system if possible. Neither of these seems to be heavily maintained, and both state that they should not be used in production.

Would calling the standard Hadoop command-line tools through Python's subprocess module work instead?

Is there a way to use Python functions as the right-hand side of a pipe with the subprocess module? Or, even better, can I open the command's output like a file and iterate over it as a generator, so that I can process each line easily?

import subprocess
cat = subprocess.Popen(["hadoop", "fs", "-cat", "/path/to/myfile"], stdout=subprocess.PIPE)

If there is another way to achieve this without using an external library, I am open to that as well.

Any help would be appreciated.

Jun 26, 2019 in Big Data Hadoop by nitinrawat895
• 11,380 points
1,561 views

1 answer to this question.

The easiest way is to use Pydoop's HDFS API:

import pydoop.hdfs as hdfs
with hdfs.open('/user/myuser/filename') as f:
    for line in f:
        do_something(line)

Note that Pydoop is itself an external dependency, so this only applies if you are willing to add one. It is actively developed and is used at CRS4 for computational biology applications.
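For the dependency-free route the question asks about, here is a minimal sketch that wraps the `hadoop fs -cat` call from the question in a generator, so the output can be iterated line by line without loading the whole file. The function names `stream_command_lines` and `hdfs_lines` are made up for illustration, and the sketch assumes that the `hadoop` executable is on the PATH and that the file is UTF-8 text.

```python
import subprocess


def stream_command_lines(cmd):
    """Yield the standard output of cmd one decoded line at a time."""
    proc = subprocess.Popen(cmd, stdout=subprocess.PIPE)
    try:
        # Iterating over the pipe reads one line at a time, so only a
        # small buffer is held in memory, not the whole output.
        for raw in proc.stdout:
            # Assumption: the data is UTF-8 text.
            yield raw.decode("utf-8").rstrip("\r\n")
    finally:
        # Runs on normal completion and when the caller stops early.
        proc.stdout.close()
        returncode = proc.wait()
    if returncode != 0:
        # Only reached when the output was read to the end.
        raise subprocess.CalledProcessError(returncode, cmd)


def hdfs_lines(path):
    """Hypothetical helper: stream an HDFS file via `hadoop fs -cat`."""
    return stream_command_lines(["hadoop", "fs", "-cat", path])
```

With this in place, `for line in hdfs_lines("/path/to/myfile"): do_something(line)` reads like iterating over an ordinary file. The trade-off is that every read starts a new Hadoop client process, which is slow to launch, so this suits a few large files better than many small ones.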

Hope this was helpful,

Happy Learning.

answered Jun 26, 2019 by ravikiran
• 4,620 points
