How do I get the number of columns in each line from a delimited file?

+1 vote

When I execute the line below in the Spark shell, I expect the file content to be split on "\n" and stored in lines.

val lines = sc.textFile("/user/test.txt").map(l => l.split("\n"));

When I do a collect on lines like this 

lines.collect()

The output is as follows:

scala> lines.collect()
res76: Array[Array[String]] = Array(Array(~@00~@51~@DCS~@000009746~@1~@20190116~@170106), Array(~@51~@00~@1~@4397537~@3~@1~@1~@11~@16607475037~@272~@1521~@0~@0~@9~@AB2111756~@37~@20190112~@162954~@00000000~@1~@2000176746~@1~@88918773002073~@1~@3~@0~@0~@1~@008~@1~@889~@1~@000~@0~@0~@04), Array(~@51~@00~@1~@4397611~@3~@1~@1~@11~@16607475037~@272~...
scala>

Each line in the file is displayed as an array of arrays. Why is that?

Now I need to know the number of columns in each line, delimited by '~@'.

How do I do this?

Mar 8, 2019 in Apache Spark by Vijay Dixon
• 190 points
4,958 views

2 answers to this question.

+2 votes
Instead of splitting on '\n', define a case class with a field for each column in a line. Note that sc.textFile already splits the file on '\n', so mapping split("\n") over it just wraps each line in a one-element array, which is why you see an array of arrays.

Follow these steps sequentially:

Use sc.textFile to create an RDD from the file.

Call a map transformation on the RDD; within it, split each line on '~@' and bind the fields to the case class. To get the number of columns in each line, map each line to split("~@").length, as in the sketch below.
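
A minimal sketch of both steps for the spark-shell (where sc already exists), assuming the path from the question; the Record case class and its field names are hypothetical, so adjust them to your actual columns:

// Hypothetical case class; define one field per actual column.
case class Record(recordType: String, source: String, dest: String)

val lines = sc.textFile("/user/test.txt")

// Count the columns in each line by splitting on the '~@' delimiter.
// Note: lines that start with '~@' produce a leading empty field.
val columnCounts = lines.map(line => line.split("~@").length)
columnCounts.collect().foreach(println)

// Bind the first three fields to the hypothetical case class
// (assumes every line has at least three fields).
val records = lines.map { line =>
  val fields = line.split("~@")
  Record(fields(0), fields(1), fields(2))
}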
answered Aug 7, 2019 by ashish
0 votes
If the data is in a DataFrame, you can print each row with df.foreach(row => println(row)); see the sketch below for counting columns this way.
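
A minimal sketch of this approach, assuming a SparkSession named spark (as in the spark-shell) and the file path from the question:

// Read the file as a DataFrame with a single string column named "value".
val df = spark.read.text("/user/test.txt")

// Print each line's column count. Note that foreach runs on the executors,
// so on a cluster the output appears in the executor logs, not on the driver.
df.foreach(row => println(row.getString(0).split("~@").length))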
answered Apr 4, 2020 by SaiSowhit
