Spark - load CSV file as DataFrame

0 votes

I would like to read a CSV in spark and convert it as DataFrame and store it in HDFS with df.registerTempTable("table_name")

I have tried:

scala> val df = sqlContext.load("hdfs:///csv/file/dir/file.csv")

Error which I got:

java.lang.RuntimeException: hdfs:///csv/file/dir/file.csv is not a Parquet file. expected magic number at tail [80, 65, 82, 49] but found [49, 59, 54, 10]
    at parquet.hadoop.ParquetFileReader.readFooter(
    at org.apache.spark.sql.parquet.ParquetRelation2$MetadataCache$$anonfun$refresh$6.apply(newParquet.scala:277)
    at org.apache.spark.sql.parquet.ParquetRelation2$MetadataCache$$anonfun$refresh$6.apply(newParquet.scala:276)
    at scala.collection.parallel.mutable.ParArray$Map.leaf(ParArray.scala:658)
    at scala.collection.parallel.Task$$anonfun$tryLeaf$1.apply$mcV$sp(Tasks.scala:54)
    at scala.collection.parallel.Task$$anonfun$tryLeaf$1.apply(Tasks.scala:53)
    at scala.collection.parallel.Task$$anonfun$tryLeaf$1.apply(Tasks.scala:53)
    at scala.collection.parallel.Task$class.tryLeaf(Tasks.scala:56)
    at scala.collection.parallel.mutable.ParArray$Map.tryLeaf(ParArray.scala:650)
    at scala.collection.parallel.AdaptiveWorkStealingTasks$WrappedTask$class.compute(Tasks.scala:165)
    at scala.collection.parallel.AdaptiveWorkStealingForkJoinTasks$WrappedTask.compute(Tasks.scala:514)
    at scala.concurrent.forkjoin.RecursiveAction.exec(
    at scala.concurrent.forkjoin.ForkJoinTask.doExec(
    at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(
    at scala.concurrent.forkjoin.ForkJoinPool.runWorker(

What is the right command to load CSV file as DataFrame in Apache Spark?

Sep 25, 2018 in Big Data Hadoop by digger
• 26,720 points

1 answer to this question.

0 votes

spark-csv is part of core Spark functionality and doesn't require a separate library. So you could just do for example

df ="csv").option("header", "true").load("csvfile.csv")
answered Sep 25, 2018 by slayer
• 29,310 points

Related Questions In Big Data Hadoop

0 votes
1 answer

How to save Spark dataframe as dynamic partitioned table in Hive?

Hey, you can try something like this: df.write.partitionBy('year', ...READ MORE

answered Nov 6, 2018 in Big Data Hadoop by Omkar
• 69,150 points
0 votes
1 answer to create RDD into Dataframe

You can use a case class and ...READ MORE

answered Jan 22, 2019 in Big Data Hadoop by Omkar
• 69,150 points
0 votes
1 answer

Writing File into HDFS using spark scala

The reason you are not able to ...READ MORE

answered Apr 6, 2018 in Big Data Hadoop by kurt_cobain
• 9,390 points
0 votes
1 answer
+1 vote
1 answer

Hadoop Mapreduce word count Program

Firstly you need to understand the concept ...READ MORE

answered Mar 16, 2018 in Data Analytics by nitinrawat895
• 11,380 points
+2 votes
11 answers

hadoop fs -put command?

Hi, You can create one directory in HDFS ...READ MORE

answered Mar 16, 2018 in Big Data Hadoop by nitinrawat895
• 11,380 points
–1 vote
1 answer

Hadoop dfs -ls command?

In your case there is no difference ...READ MORE

answered Mar 16, 2018 in Big Data Hadoop by kurt_cobain
• 9,390 points
+1 vote
1 answer
0 votes
2 answers

Which of these will vanish: Flink vs Spark?

At first glance, Flink and Spark would ...READ MORE

answered Aug 13, 2018 in Big Data Hadoop by kurt_cobain
• 9,390 points
0 votes
2 answers

How to convert .txt file to Hadoop's sequence file format

import; import; import org.apache.hadoop.conf.Configuration; import org.apache.hadoop.fs.FileSystem; import org.apache.hadoop.fs.Path; import; import ...READ MORE

answered Oct 12, 2018 in Big Data Hadoop by Sanjay