Reading a file as a stream from HDFS in Python

0 votes

I have a file in HDFS that is too large to fit in memory. Is there a way to read it line by line, like any normal file, without holding the whole thing in memory?

I tried this, which is what I would do for a local file:

with open("myfile", "r") as f:
    for line in f:
        pass  # do some processing

I am looking for an easy way to do this properly without external libraries. I can probably make it work with libpyhdfs or python-hdfs, but I would like to avoid introducing new dependencies and untested libraries into the system if possible, especially since neither seems heavily maintained and both state that they should not be used in production.

Would calling the standard Hadoop command-line tools through Python's subprocess module work instead?

Is there a way to use Python functions as the right-hand side of a pipe with the subprocess module? Or, even better, to open the output like a file and iterate over it as a generator, so that I can process each line easily?

import subprocess
cat = subprocess.Popen(["hadoop", "fs", "-cat", "/path/to/myfile"], stdout=subprocess.PIPE)

I am also open to any other way of achieving this without an external library.

Help me out on this one.

Jun 26 in Big Data Hadoop by nitinrawat895
• 10,730 points
106 views

1 answer to this question.

0 votes

The easiest way is to use Pydoop's HDFS module:

import pydoop.hdfs as hdfs
with hdfs.open('/user/myuser/filename') as f:
    for line in f:
        do_something(line)

Note that Pydoop is itself an external dependency, but it is actively developed and is used at CRS4 for computational biology applications.
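If adding Pydoop is not an option, the subprocess approach from the question can be wrapped in a generator using only the standard library. The following is a minimal sketch, not a tested production solution: the helper name stream_lines is hypothetical, and it assumes that the hadoop executable is on the PATH and that the file is UTF-8 text.

```python
import subprocess


def stream_lines(cmd):
    """Yield the stdout of `cmd` one decoded line at a time.

    Only one line is held in memory at once, so the output may be
    arbitrarily large. `stream_lines` is a hypothetical helper name.
    """
    proc = subprocess.Popen(cmd, stdout=subprocess.PIPE)
    try:
        # Iterating over the pipe reads lazily, line by line.
        for raw in proc.stdout:
            # Assumption: the data is UTF-8 text.
            yield raw.decode("utf-8", errors="replace").rstrip("\r\n")
    finally:
        # Runs on normal exhaustion and when the caller stops early.
        proc.stdout.close()
        returncode = proc.wait()
    # Reached only when the output was read to the end.
    if returncode != 0:
        raise subprocess.CalledProcessError(returncode, cmd)
```

With this in place, `for line in stream_lines(["hadoop", "fs", "-cat", "/path/to/myfile"]): do_something(line)` would process the file much like a local one, using the placeholder path from the question. Raising on a non-zero exit code is a design choice here, so that a failed `hadoop fs -cat` is not mistaken for an empty file.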

Hope this was helpful,

Happy Learning.

answered Jun 26 by ravikiran
• 4,560 points
