Spark: comparing two big data files using Scala

0 votes
Hi,
I have the following problem. I have two files, and I need to check whether each record in file 1 also exists in file 2. The key I need to look up in file 2 is made up by concatenating two columns of file 1. Is it possible to do this check without using Spark SQL?
Mar 29, 2019 in Apache Spark by Swaroop

recategorized Mar 29, 2019 by Omkar 6,646 views
Hey, can you please mention what type of files you want to implement this on?
They are CSV files.

1 answer to this question.

0 votes

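Try this and see if it does what you want (assuming file1 and file2 are DataFrames that share a column named key, and file1 has a column named value):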

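scala> // Inner, left outer, right outer and full outer joins on the shared "key" column
scala> file1.join(file2, Seq("key")).show
scala> file1.join(file2, Seq("key"), "left_outer").show
scala> file1.join(file2, Seq("key"), "right_outer").show
scala> file1.join(file2, Seq("key"), "outer").show
scala> // Alias both sides so that $"file1.value" resolves; rows where it is null exist only in file2
scala> val diff = file1.as("file1").join(file2.as("file2"), Seq("key"), "right_outer").filter($"file1.value".isNull).drop($"file1.value")
scala> diff.show
scala> // Note: this writes a directory named diff.csv containing part files, not a single file
scala> diff.write.csv("diff.csv")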
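
The joins above assume both DataFrames already have a key column, whereas in the question the key has to be built from two columns of file 1. Here is a minimal sketch of that step, under some assumptions: the column names colA, colB, key, value and other are hypothetical placeholders, and small in-memory DataFrames stand in for the CSV files so the example runs on its own. It uses the DataFrame API (no SQL strings), with a left_semi join for records found in file 2 and a left_anti join for records that are missing.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, concat}

object CompareFiles {
  // Add a "key" column to file1 by concatenating two of its columns.
  // colA and colB are hypothetical names; replace them with the real ones.
  // concat returns null if either input is null, so such rows never match.
  def withKey(file1: DataFrame): DataFrame =
    file1.withColumn("key", concat(col("colA"), col("colB")))

  // Records of file1 whose concatenated key exists in file2.
  def matched(file1: DataFrame, file2: DataFrame): DataFrame =
    withKey(file1).join(file2, Seq("key"), "left_semi")

  // Records of file1 whose concatenated key does not exist in file2.
  def unmatched(file1: DataFrame, file2: DataFrame): DataFrame =
    withKey(file1).join(file2, Seq("key"), "left_anti")

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("CompareFiles")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Stand-ins for the CSV files. With real data, something like
    // spark.read.option("header", "true").csv("file1.csv") would be used instead.
    val file1 = Seq(("a", "1", "x"), ("b", "2", "y"), ("c", "3", "z"))
      .toDF("colA", "colB", "value")
    val file2 = Seq(("a1", "p"), ("c3", "q"), ("d4", "r"))
      .toDF("key", "other")

    matched(file1, file2).show()    // file1 records found in file2
    unmatched(file1, file2).show()  // file1 records missing from file2

    spark.stop()
  }
}
```

Both join types return only the columns of file1 (plus the key), and a semi join does not duplicate rows when file2 contains the same key more than once, which is usually what an "does this record exist" check needs.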


Hope this helps!

Thanks!!

answered Apr 2, 2019 by Omkar
• 69,210 points
