Using PHP to get all file names in a folder stored in HDFS

0 votes

How can I get the names of all the files in a specific folder in my HDFS? Technically, I should be looping through all the files in the folder and catching the file names. Is there a way to do this?
I know I should be using PHP cURL to access WebHDFS, but I can't find appropriate code.

For example, I wish to get the filenames below from the folder folder11 and store them in variables in PHP:

Mar 9, 2019 in Big Data Hadoop by Bhavish
• 370 points
3,608 views

2 answers to this question.

+1 vote
Best answer

So I found a workaround for the above problem; it's basically another scenario.

What I did is, instead of uploading files to Hadoop using copyFromLocal, I used PHP cURL. I will try to explain this step by step.
First, create a PHP script and insert the code below:

function call_curl($headers, $method, $url, $data, $file, $size) {
    $handle = curl_init();
    curl_setopt($handle, CURLOPT_URL, $url);
    curl_setopt($handle, CURLOPT_HTTPHEADER, $headers);
    curl_setopt($handle, CURLOPT_RETURNTRANSFER, true);
    // Disable SSL verification (fine for a local test setup, not for production)
    curl_setopt($handle, CURLOPT_SSL_VERIFYHOST, false);
    curl_setopt($handle, CURLOPT_SSL_VERIFYPEER, false);
    curl_setopt($handle, CURLOPT_SAFE_UPLOAD, true);

    switch ($method) {
        case 'GET':
            break;
        case 'POST':
            curl_setopt($handle, CURLOPT_POST, true);
            curl_setopt($handle, CURLOPT_POSTFIELDS, $data);
            break;
        case 'PUT':
            curl_setopt($handle, CURLOPT_CUSTOMREQUEST, 'PUT');
            curl_setopt($handle, CURLOPT_POSTFIELDS, $data);
            curl_setopt($handle, CURLOPT_INFILE, $file);
            curl_setopt($handle, CURLOPT_INFILESIZE, $size);
            break;
        case 'DELETE':
            curl_setopt($handle, CURLOPT_CUSTOMREQUEST, 'DELETE');
            break;
    }

    $response = curl_exec($handle);
    $code = curl_getinfo($handle, CURLINFO_HTTP_CODE);
    curl_close($handle);
    return $response;
}

The above function will make a request to Hadoop and perform the operation that we define when calling it. In our case that will be PUT, since we are uploading to Hadoop.

Now I will define the path to the folder containing the files that I wish to upload.

$dir = '/var/www/html/myData';

Now I will create a loop that goes through all the files to get each filename, and afterwards other details such as file size and file path (depending on what you need; in my case, I am storing each file's details in my database).

foreach (new DirectoryIterator($dir) as $fileInfo) {
    if ($fileInfo->isDot()) continue;
    $filename = $fileInfo->getFilename();

    // Start a cURL session to connect and upload the file to HDFS
    $header = array('Content-Type: application/octet-stream');
    $method = "PUT";

    // Full path and size of the file to upload
    $filepath = $dir . "/" . $filename;
    $size = filesize($filepath);

    // Choose the storage method: 3-way replication for files up to 1 GB,
    // erasure coding for anything larger
    if ($size <= (1024 * 1024 * 1024)) {
        $rep = "3WayReplication";
    } else {
        $rep = "erasure";
    }

    // $username must be defined earlier in the script (HDFS root folder name)
    $url = "http://chbpc-VirtualBox:9864/webhdfs/v1/" . $username . "/" . $rep . "/" . $filename . "?op=CREATE&namenoderpcaddress=localhost:9000&createflag=&createparent=true&overwrite=false";

    // Read the file contents
    $file = fopen($filepath, 'r');
    $filedata = fread($file, $size);
    $data = array($filedata, $target_file); // note: $target_file is not defined in this snippet

    call_curl($header, $method, $url, $data, $file, $size);

    // Store the file details in the database ($date and $time are set earlier)
    $m = new MongoClient();
    $collection = $m->ecoss->fileInfo;
    $document = array(
        "rootFolder"      => $username,
        "fileName"        => $filename,
        "filePath"        => $filepath,
        "fileSize"        => ($size / 1024) . "kb",
        "replicationType" => $rep,
        "uploadDate"      => $date,
        "uploadTime"      => $time
    );
    $collection->insert($document);

    // Allow permission chmod 777 on the root folder for deletion of files
    if (!unlink($filepath)) {
        // echo ("Error deleting " . $filename);
    } else {
        // echo ("Deleted " . $filename);
    }
}

As you can see above, I am using MongoDB to store my file details. This is why I posted the above question: when I was using copyFromLocal, there was no way to get the information of the files being uploaded to my HDFS. Using PHP cURL, I have each file name stored in the variable $filename and I am able to store it in my database. You can just skip the MongoDB code if you don't need it.
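Incidentally, the original question (listing the file names in an HDFS folder) can also be done over WebHDFS directly, using the LISTSTATUS operation. Below is a minimal sketch; the host, port, and folder path are assumptions for a single-node setup, and the JSON layout (FileStatuses.FileStatus[].pathSuffix) is from the WebHDFS documentation:

```php
<?php
// Extract just the file names from a LISTSTATUS JSON response.
function hdfs_list_names($json) {
    $decoded = json_decode($json, true);
    $names = array();
    foreach ($decoded['FileStatuses']['FileStatus'] as $status) {
        $names[] = $status['pathSuffix'];
    }
    return $names;
}

// Ask the namenode for the directory listing (a plain GET, no redirect dance).
function hdfs_list_dir($host, $port, $path) {
    $url = "http://" . $host . ":" . $port . "/webhdfs/v1" . $path . "?op=LISTSTATUS";
    $handle = curl_init($url);
    curl_setopt($handle, CURLOPT_RETURNTRANSFER, true);
    $response = curl_exec($handle);
    curl_close($handle);
    return hdfs_list_names($response);
}

// e.g. $filenames = hdfs_list_dir('localhost', 9870, '/test4');
```

Each returned pathSuffix is exactly the bare filename you would store in $filename.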

Now there is something very important: the $url. You can copy-paste the above code, that is absolutely fine, but the url in your case will be different. To get your own $url, please refer to this: https://hadoop.apache.org/docs/r1.0.4/webhdfs.html#MKDIRS (Create and Write to a File).

You need to have cURL installed on your machine. If not, please follow this link: https://stackoverflow.com/questions/38800606/how-to-install-php-curl-in-ubuntu-16-04

Now, open your terminal (as per the previous Hadoop Apache link) and type your curl command:

curl -i -X PUT "http://<HOST>:<PORT>/webhdfs/v1/<PATH>?op=CREATE"

where <HOST> is your hostname, which you can find by typing on the terminal:

hostname

<PORT> is the port you use to connect to Hadoop, and <PATH> is the path to the folder where you wish to upload your files. After issuing this first command, you will get a response containing a Location header pointing to a datanode address.
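If you want to automate this first step in PHP rather than on the terminal, the Location header can be pulled out of the raw response headers (obtained, for example, with CURLOPT_HEADER set to true). A small sketch; the header format shown is the standard HTTP one, and the rest of the request is omitted:

```php
<?php
// Pull the Location header (the datanode URL) out of a raw HTTP response head.
function extract_location($rawHeaders) {
    foreach (explode("\r\n", $rawHeaders) as $line) {
        if (stripos($line, 'Location:') === 0) {
            return trim(substr($line, strlen('Location:')));
        }
    }
    return null; // no Location header found
}
```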

After this, issue another curl command:

curl -i -X PUT -T <LOCAL_FILE> "<DATANODE_URL>"

Here <LOCAL_FILE> is the path to the file that you wish to upload to HDFS, and <DATANODE_URL> is the location that you received after issuing the first command. In my case, the location is:

http://chbpc-VirtualBox:9864/webhdfs/v1/test4/datafile?op=CREATE&namenoderpcaddress=localhost:9000&createflag=&createparent=true&overwrite=false

So basically, the above location is what you need to put in your $url variable.

$url = "http://chbpc-VirtualBox:9864/webhdfs/v1/test4/datafile?op=CREATE&namenoderpcaddress=localhost:9000&createflag=&createparent=true&overwrite=false";
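Rather than hardcoding the whole string, you can also assemble the $url from its parts, which makes it easier to adapt; a sketch using the host, port, and path from my example (yours will differ, and note that http_build_query percent-encodes the colon in the rpc address, which the server decodes):

```php
<?php
// Build the WebHDFS CREATE url from its parts.
$host = 'chbpc-VirtualBox';
$port = 9864;
$path = '/test4/datafile';
$query = http_build_query(array(
    'op'                 => 'CREATE',
    'namenoderpcaddress' => 'localhost:9000',
    'createparent'       => 'true',
    'overwrite'          => 'false',
));
$url = "http://" . $host . ":" . $port . "/webhdfs/v1" . $path . "?" . $query;
```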

Naturally, before testing your PHP script, you can try to upload a file to HDFS through the terminal to check that everything is fine. Just run the second curl command, which should look something like this after adding the location:

curl -i -X PUT -T /var/www/html/myData/21\ February\ 2019\ 11_31_55\ PM "http://chbpc-VirtualBox:9864/webhdfs/v1/test4/datafile?op=CREATE&namenoderpcaddress=localhost:9000&createflag=&createparent=true&overwrite=false"

Check in your HDFS whether the file has been uploaded. If it has, then your $url should be good.

The code below just deletes all the files in my folder that have already been uploaded to my HDFS.

// Allow permission chmod 777 on the root folder for deletion of files
if (!unlink($filepath)) {
    // echo ("Error deleting " . $filename);
} else {
    // echo ("Deleted " . $filename);
}

Before you run the PHP script, delete the files that were already uploaded using the command-line curl (in case you are uploading the same files with PHP cURL). Now run your script and check whether the files have been uploaded to your HDFS.

Please excuse the long reply; I have tried to give maximum detail for this solution since it was quite a struggle for me to make it work.
I hope you can use this as a good reference.

answered Mar 13, 2019 by Bhavish
• 370 points

edited Mar 13, 2019 by Omkar
+1 vote

Hi @Bhavish. 

Did some research and found this API for PHP on github: https://github.com/adprofy/Php-Hadoop-Hdfs

I think this part is what you are interested in: 

See the link above for documentation on setting up the API for PHP.

Then you can try something like this:

$hdfsDir = '<path to directory whose files you want to list>';
$files = $hdfs->readDir($hdfsDir);

Please see if this works for you. 

answered Mar 11, 2019 by Omkar
• 69,210 points

Hello @Omkar,

I am using PHP version 5.6. As per the link, I require PHP 5.3+, so I should be good. But it is not working out well.

Below is my php script (hdfsapi.php):

<?php
$host = "localhost";
$port = "9870";
$hdfs = new \Hdfs\Web();
$hdfs->configure($host, $port);
$hdfsDir = "/test4";
$dir = $hdfs->readDir($hdfsDir);
echo $dir;
?>

And this is the response:

PHP Fatal error:  Class 'Hdfs\Web' not found in /var/www/html/hdfsapi.php on line 6

I have found other sources with other classes for WebHDFS. I have tried them and I get the exact same response as above:
https://github.com/xaviered/php-WebHDFS
https://github.com/Yujiro3/WebHDFS
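For what it's worth, a fatal "Class ... not found" error usually means the library file was never loaded into the script at all, rather than anything HDFS-related. A quick way to check, with the usual Composer fix shown as a comment (the path is an assumption about where the library was installed):

```php
<?php
// If the library was installed with Composer, it normally has to be loaded first:
// require __DIR__ . '/vendor/autoload.php';

// Without that require, the class simply does not exist in the script:
var_dump(class_exists('Hdfs\\Web')); // bool(false) until the library is loaded
```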

Any reason why you did not initialize $user?

Seems like a problem with the PHP server. I don't know if this works, but as far as I know, all the PHP files should be placed in /var/www/html/. Try doing that and see if it works.

Hi @Karan. Not sure what user I should put for $user. Is it the user of my local machine? When starting Hadoop or doing any specific tasks, I never had to create a user.

Hi @Suman, it is indeed. All my PHP files are located in that directory for my web server. Since my web server does not always show errors, I usually execute my PHP scripts on the command line.

Ya @Bhavish, there's some confusion here. I checked the docs and they have not mentioned what $user should be. I thought the error could be due to this, but I am not sure. Try the user name of the local machine.

Errr, this is frustrating. Can you post the contents of hdfsapi.php?
@Karan, not sure if that would change anything, but I found a workaround for this. I will post the solution soon. It is something quite different but it has solved my problem.

@Suman, sorry, I did not mention it, but the above PHP script is hdfsapi.php. I have edited the question accordingly. Thanks.

@Bhavish, please post the solution. I am stuck with a similar problem.
@Karan, sorry for the late reply. I just posted the solution; I hope this helps.
Thanks for the solution @Bhavish. I will try this.
