Spark on YARN

Question

I am trying to understand how Spark runs on YARN in cluster and client mode. I have the following questions.

Is it necessary for Spark to be installed on all the nodes in the YARN cluster? I think it should be, because the worker nodes execute the tasks and need to be able to interpret the Spark API calls in the application that the driver sends to the cluster.
The documentation says: "Ensure that HADOOP_CONF_DIR or YARN_CONF_DIR points to the directory which contains the (client-side) configuration files for the Hadoop cluster". Why does the client node need Hadoop installed when it is only submitting the job to the cluster?

ravikiran · Answer 1 · Jul 18, 2019

If you just want to get HDFS back to a healthy state and are not concerned about losing the affected data, you can do the following.

This will list the corrupt HDFS blocks:

hdfs fsck / -list-corruptfileblocks

This will delete the files that contain corrupted blocks (the whole file is removed, not just the bad blocks):

hdfs fsck / -delete

Note that you might have to prefix these commands with

 sudo -u hdfs

if your account is not the HDFS superuser (assuming "hdfs" is the name of the superuser account on your cluster).
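
If it helps, here is a minimal sketch that chains the two fsck commands above so you can review the corrupt blocks before anything is deleted. It assumes "hdfs" is the superuser account. HDFS_USER, DRY_RUN and run_as_hdfs are names made up for this example, not part of Hadoop. By default it only prints the commands it would run.

```shell
#!/bin/sh
# Hypothetical settings for this sketch (not Hadoop options):
HDFS_USER="${HDFS_USER:-hdfs}"   # assumed name of the HDFS superuser account
DRY_RUN="${DRY_RUN:-1}"          # 1 = only print the commands, 0 = actually run them

# Run a command as the HDFS superuser, or just print it in dry-run mode.
run_as_hdfs() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "sudo -u $HDFS_USER $*"
    else
        sudo -u "$HDFS_USER" "$@"
    fi
}

# 1. List the corrupt blocks and the files they belong to.
run_as_hdfs hdfs fsck / -list-corruptfileblocks

# 2. Delete the corrupted files. This cannot be undone, so check the
#    output of step 1 before running with DRY_RUN=0.
run_as_hdfs hdfs fsck / -delete
```

Run it once as is to see the commands, then again with DRY_RUN=0 once you are sure you can live without the listed files.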

