Spark: comparing two big data files using Scala

Hi,
I have a problem statement like this: I have two files, and I need to check whether each record in file 1 also exists in file 2. The key I need to look up in file 2 is made up by concatenating two columns of file 1. Is it possible to check this without using Spark SQL?
Mar 29 in Apache Spark by Swaroop

Hey, can you please mention what type of files you want to implement this on?
They are CSV files.

1 answer to this question.


Try this and see if this does what you want:

scala> file1.join(file2, Seq("key")).show
scala> file1.join(file2, Seq("key"), "left_outer").show
scala> file1.join(file2, Seq("key"), "right_outer").show
scala> file1.join(file2, Seq("key"), "outer").show
scala> val diff = file1.as("f1").join(file2.as("f2"), Seq("key"), "right_outer").filter($"f1.value".isNull).drop($"f1.value")
scala> diff.show
scala> diff.write.csv("diff.csv")
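
Since your key is built by concatenating two columns of file 1, here is a minimal sketch of that step plus a `left_anti` join, which directly returns the records of file 1 whose key is absent from file 2, all through the DataFrame API rather than Spark SQL. The column names `col1`, `col2`, and `key` are assumptions; adjust them to your actual CSV headers:

```scala
// Assumes a spark-shell / SparkSession named `spark` is available.
import org.apache.spark.sql.functions.concat
import spark.implicits._

val file1 = spark.read.option("header", "true").csv("file1.csv")
val file2 = spark.read.option("header", "true").csv("file2.csv")

// Build the lookup key by concatenating two columns of file1
// ("col1" and "col2" are placeholder names).
val keyed = file1.withColumn("key", concat($"col1", $"col2"))

// left_anti keeps only file1 rows whose key does NOT appear in file2.
val missing = keyed.join(file2.select("key"), Seq("key"), "left_anti")
missing.show()
```

If you instead want the rows that *do* match, swap `"left_anti"` for `"left_semi"`.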
answered Apr 2 by Omkar
