Using PHP to get all file names in a folder stored in HDFS

0 votes

How can I get the names of all the files in a specific folder in my HDFS? I assume I need to loop over the files in the folder and collect each file name. Is there a way to do this?
I know I should use PHP cURL to access WebHDFS, but I can't find suitable code.

For example, I wish to get the file names below from the folder folder11 and store them in PHP variables:

Mar 9 in Big Data Hadoop by Bhavish
• 370 points
307 views

2 answers to this question.

+1 vote
Best answer

I found a workaround for the above problem; it approaches the task from a different angle.

Instead of uploading files to Hadoop with copyFromLocal, I upload them with PHP cURL. I will explain this step by step.
Create a PHP script and insert the code below:

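// Send an HTTP request to WebHDFS with cURL and return the response body.
// $file (an open file handle) and $size are only used for PUT uploads.
function call_curl($headers, $method, $url, $data, $file, $size) {
    $handle = curl_init();
    curl_setopt($handle, CURLOPT_URL, $url);
    curl_setopt($handle, CURLOPT_HTTPHEADER, $headers);
    curl_setopt($handle, CURLOPT_RETURNTRANSFER, true);
    // Certificate checks are switched off here; only do this on a trusted test cluster.
    curl_setopt($handle, CURLOPT_SSL_VERIFYHOST, 0);
    curl_setopt($handle, CURLOPT_SSL_VERIFYPEER, false);
    curl_setopt($handle, CURLOPT_SAFE_UPLOAD, true);
    switch ($method) {
        case 'GET':
            break;
        case 'POST':
            curl_setopt($handle, CURLOPT_POST, true);
            curl_setopt($handle, CURLOPT_POSTFIELDS, $data);
            break;
        case 'PUT':
            if (is_resource($file)) {
                // Stream the open file as the request body, starting from its first byte.
                rewind($file);
                curl_setopt($handle, CURLOPT_UPLOAD, true);
                curl_setopt($handle, CURLOPT_INFILE, $file);
                curl_setopt($handle, CURLOPT_INFILESIZE, $size);
            } else {
                // No file handle given: send $data as the request body instead.
                curl_setopt($handle, CURLOPT_CUSTOMREQUEST, 'PUT');
                curl_setopt($handle, CURLOPT_POSTFIELDS, $data);
            }
            break;
        case 'DELETE':
            curl_setopt($handle, CURLOPT_CUSTOMREQUEST, 'DELETE');
            break;
    }
    $response = curl_exec($handle);
    // HTTP status of the request; read here but not returned to the caller.
    $code = curl_getinfo($handle, CURLINFO_HTTP_CODE);
    curl_close($handle);
    return $response;
}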

The above function sends a request to Hadoop and performs the operation selected by $method. In our case that is PUT, since we are uploading to Hadoop.

Now I define the path to the local folder containing the files that I wish to upload.
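call_curl reads the HTTP status into $code but returns only the response body, so a failed upload is easy to miss. A minimal sketch of a status check follows. The helper name webhdfs_succeeded is hypothetical, and it assumes you pass in the status code obtained from curl_getinfo; per the WebHDFS documentation, a successful CREATE answers 201 Created, while operations such as LISTSTATUS and DELETE answer 200 OK.

```php
<?php
// Decide whether a WebHDFS call succeeded from its HTTP status code.
// CREATE answers 201 Created on success; the other operations
// considered here (e.g. LISTSTATUS, DELETE) answer 200 OK.
function webhdfs_succeeded(string $op, int $httpCode): bool {
    $expected = ($op === 'CREATE') ? 201 : 200;
    return $httpCode === $expected;
}

// Example: a 403 on CREATE usually points at a permission problem.
var_dump(webhdfs_succeeded('CREATE', 201));
var_dump(webhdfs_succeeded('CREATE', 403));
```

To use it with the function above, you would return $code alongside $response (for example as an array) and skip the database insert and the local delete when the check fails.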

$dir = '/var/www/html/myData';

Next, a foreach loop goes through all the files to get each file name and then other details such as file size and file path (what you collect depends on your needs; in my case, I store each file's details in my database).

// $username, $date and $time are assumed to be set earlier in the script.
foreach (new DirectoryIterator($dir) as $fileInfo) {
    // Skip "." and ".." as well as sub-directories.
    if ($fileInfo->isDot() || !$fileInfo->isFile()) continue;
    $filename = $fileInfo->getFilename();

    // Start a cURL session to connect and upload the file to HDFS.
    $header = array('Content-Type: application/octet-stream');
    $method = "PUT";

    // Local path and size of the zip file.
    $filepath = $dir."/".$filename;
    $size = filesize($filepath);

    // Choose the target folder by size: files up to 1 GB go to 3WayReplication, larger ones to erasure.
    if ($size <= (1024*1024*1024)) {
        $rep = "3WayReplication";
    } else {
        $rep = "erasure";
    }

    // rawurlencode() keeps file names that contain spaces valid inside the URL.
    $url = "http://chbpc-VirtualBox:9864/webhdfs/v1/".$username."/".$rep."/".rawurlencode($filename)."?op=CREATE&namenoderpcaddress=localhost:9000&createflag=&createparent=true&overwrite=false";

    $file = fopen($filepath, 'r');
    // The raw file contents are the request body (fread() rejects a length of 0).
    $data = $size > 0 ? fread($file, $size) : '';
    call_curl($header, $method, $url, $data, $file, $size);
    fclose($file);

    // Store the file details in the database.
    $m = new MongoClient();
    $collection = $m->ecoss->fileInfo;
    $document = array(
        "rootFolder" => $username,
        "fileName" => $filename,
        "filePath" => $filepath,
        "fileSize" => ($size/1024)."kb",
        "replicationType" => $rep,
        "uploadDate" => $date,
        "uploadTime" => $time
    );
    $collection->insert($document);

    // Deleting needs write permission on the folder (I used chmod 777 on the root folder).
    if (!unlink($filepath)) {
        //echo ("Error deleting ".$filename);
    } else {
        //echo ("Deleted ".$filename);
    }
}

As you can see above, I am using MongoDB to store my file details. This is why I posted the question: when I used copyFromLocal, I had no way to get the information about the files being uploaded to my HDFS. With PHP cURL, each file name is available in the variable $filename and I can store it in my database. You can skip the MongoDB code if you don't need it.

One thing is very important: the $url. You can copy and paste the code above, but the URL will be different in your case. To work out your own $url, refer to: https://hadoop.apache.org/docs/r1.0.4/webhdfs.html#MKDIRS (Create and Write to a File).

You need to have curl installed on your machine. If it is not, please follow this link: https://stackoverflow.com/questions/38800606/how-to-install-php-curl-in-ubuntu-16-04

Now open your terminal and, as described in the Hadoop documentation linked above, type the first curl command:
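If you skip MongoDB, the same details can simply be gathered into a PHP array. The sketch below is one way to do that, not the script I run: the function name collect_file_details is hypothetical, the 1 GB threshold and the two folder names are copied from the loop above, and the demo at the end works on a throwaway temp folder so it needs no Hadoop cluster.

```php
<?php
// Collect the details of every regular file in a local folder.
// The 1 GB threshold and the two folder names mirror the upload loop.
function collect_file_details(string $dir): array {
    $details = array();
    foreach (new DirectoryIterator($dir) as $fileInfo) {
        // Skip "." and ".." as well as sub-directories.
        if ($fileInfo->isDot() || !$fileInfo->isFile()) {
            continue;
        }
        $size = $fileInfo->getSize();
        $details[] = array(
            'fileName' => $fileInfo->getFilename(),
            'filePath' => $fileInfo->getPathname(),
            'fileSize' => $size,
            'replicationType' => $size <= 1024 * 1024 * 1024 ? '3WayReplication' : 'erasure',
        );
    }
    return $details;
}

// Demo on a temporary folder holding one small file.
$demoDir = sys_get_temp_dir() . '/hdfs_demo_' . uniqid();
mkdir($demoDir);
file_put_contents($demoDir . '/report.zip', 'hello');
$details = collect_file_details($demoDir);
print_r($details);

// Clean up the demo folder again.
unlink($demoDir . '/report.zip');
rmdir($demoDir);
```

Each entry of the returned array then carries the file name you would otherwise lose with copyFromLocal.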

curl -i -X PUT "http://<HOST>:<PORT>/webhdfs/v1/<PATH>?op=CREATE"

where <HOST> is your hostname, which you can find by typing in the terminal:

 hostname

 <PORT> is the port you use to connect to Hadoop, and <PATH> is the HDFS path where you wish to upload your files. After issuing this first command, you will get a response that contains a Location header.

After this, issue a second curl command:

curl -i -X PUT -T <LOCAL_FILE> "<LOCATION>"

Here <LOCAL_FILE> is the path to the file that you wish to upload to HDFS, and <LOCATION> is the location that you received from the first command. In my case, the location is:

http://chbpc-VirtualBox:9864/webhdfs/v1/test4/datafile?op=CREATE&namenoderpcaddress=localhost:9000&createflag=&createparent=true&overwrite=false

This location is what you need to put in your $url variable.

$url = "http://chbpc-VirtualBox:9864/webhdfs/v1/test4/datafile?op=CREATE&namenoderpcaddress=localhost:9000&createflag=&createparent=true&overwrite=false";
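Since the script uploads many files, the file name part of $url changes on every iteration, and names with spaces must be percent-encoded. Here is a rough sketch of a URL builder under those assumptions. The function name webhdfs_create_url is hypothetical, the host, port and namenode address are the values from my setup, and the empty createflag parameter from my location is left out for brevity.

```php
<?php
// Build a WebHDFS CREATE URL, percent-encoding each path segment
// so that file names containing spaces stay valid.
function webhdfs_create_url(string $host, int $port, array $segments, string $namenode): string {
    $path = implode('/', array_map('rawurlencode', $segments));
    $query = http_build_query(array(
        'op' => 'CREATE',
        'namenoderpcaddress' => $namenode,
        'createparent' => 'true',
        'overwrite' => 'false',
    ));
    return "http://$host:$port/webhdfs/v1/$path?$query";
}

// Host, port and namenode address are taken from my setup; yours will differ.
$url = webhdfs_create_url(
    'chbpc-VirtualBox',
    9864,
    array('test4', '21 February 2019 11_31_55 PM'),
    'localhost:9000'
);
echo $url, "\n";
```

Note that http_build_query also encodes the colon in the namenode address, which is still a valid query string.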

Before testing your PHP script, you can upload a file to HDFS from the terminal to check that everything works. Run the second curl command, which should look something like this after adding the location:

curl -i -X PUT -T /var/www/html/myData/21\ February\ 2019\ 11_31_55\ PM "http://chbpc-VirtualBox:9864/webhdfs/v1/test4/datafile?op=CREATE&namenoderpcaddress=localhost:9000&createflag=&createparent=true&overwrite=false"

Check in your HDFS whether the file has been uploaded. If it has, your $url is good.

The code below deletes each file from my local folder after it has been uploaded to my HDFS.

//Allow permission chmod 777 in root folder for deletion of files
if (!unlink($filepath)) {
  //echo ("Error deleting ".$filename);
} else {
  //echo ("Deleted ".$filename);
}

Before you run the PHP script, delete from HDFS the files that you uploaded with the command-line curl, in case you upload the same file again with PHP cURL (the URL sets overwrite=false, so an existing file is not replaced). Now run your script and check whether the files have been uploaded to your HDFS.

Please excuse the long reply; I have tried to give as much detail as possible, since it was quite a struggle for me to make this work.
I hope you can use this as a good reference.

answered Mar 12 by Bhavish
• 370 points

edited Mar 13 by Omkar
+1 vote

Hi @Bhavish. 

I did some research and found this PHP API on GitHub: https://github.com/adprofy/Php-Hadoop-Hdfs

I think this part is what you are interested in: 

See the link above for documentation on setting up the API for PHP.

Then you can try something like this:

$hdfsDir = '<path to directory whose files you want to list>';
$hdfs->readDir($hdfsDir);

Please see if this works for you. 
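If you would rather not depend on a library, WebHDFS itself can list a folder with op=LISTSTATUS, which returns JSON with one FileStatus entry per item. Below is a minimal sketch of turning that JSON into an array of file names. The function name hdfs_file_names and the sample file names are made up for illustration; the JSON layout (FileStatuses, FileStatus, pathSuffix, type) follows the WebHDFS documentation, and the sample string stands in for a real response so the snippet runs without a cluster.

```php
<?php
// Extract the names of plain files from a WebHDFS LISTSTATUS JSON response.
function hdfs_file_names(string $json): array {
    $decoded = json_decode($json, true);
    $entries = $decoded['FileStatuses']['FileStatus'] ?? array();
    $names = array();
    foreach ($entries as $entry) {
        // Keep files only; sub-folders have type DIRECTORY.
        if (($entry['type'] ?? '') === 'FILE') {
            $names[] = $entry['pathSuffix'];
        }
    }
    return $names;
}

// Stand-in for the body of a GET request to a URL such as
// http://<HOST>:<PORT>/webhdfs/v1/folder11?op=LISTSTATUS
$sample = '{"FileStatuses":{"FileStatus":['
    . '{"pathSuffix":"a.txt","type":"FILE","length":24930},'
    . '{"pathSuffix":"archive","type":"DIRECTORY","length":0},'
    . '{"pathSuffix":"b.txt","type":"FILE","length":512}'
    . ']}}';

$fileNames = hdfs_file_names($sample);
print_r($fileNames);
```

In a real script the JSON would come from a plain GET request to the namenode, and $fileNames then holds the names you wanted to store in PHP variables.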

answered Mar 11 by Omkar
• 68,480 points
@Karan not sure if that would change anything, but I found a workaround for this. I will post the solution soon. It is quite different but has solved my problem.

@Suman, sorry I did not mention it, but the above PHP script is hdfsapi.php. I have edited the question accordingly. Thanks.

@Bhavish. Please post the solution. I am stuck with a similar problem.
@Karan sorry for the late reply. I just posted the solution; I hope this helps.
Thanks for the solution @Bhavish. I will try this.
