Comparing two big data files in Spark using Scala

0 votes
Hi,
I have a problem statement like this:
I have two files, and I need to check whether each record in file 1 also exists in file 2. The key I need to look up in file 2 is formed by concatenating two columns of file 1. Is it possible to check this without using Spark SQL?
Mar 29 in Apache Spark by Swaroop

recategorized Mar 29 by Omkar
Hey, can you please mention what type of files you want to run this on?
They're CSV files.

1 answer to this question.

0 votes

Try this and see if it does what you want. The different join types show which keys match between the two DataFrames:

scala> file1.join(file2, Seq("key")).show                  // inner join: keys present in both files
scala> file1.join(file2, Seq("key"), "left_outer").show    // all keys from file1
scala> file1.join(file2, Seq("key"), "right_outer").show   // all keys from file2
scala> file1.join(file2, Seq("key"), "outer").show         // all keys from either file

// Records in file2 whose key has no match in file1 (aliases let us reference each side's columns):
scala> val diff = file1.as("f1").join(file2.as("f2"), Seq("key"), "right_outer").filter($"f1.value".isNull).drop($"f1.value")
scala> diff.show
scala> diff.write.csv("diff.csv")
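Since your key is built by concatenating two columns of file 1, here is a minimal end-to-end sketch using only the DataFrame API (no Spark SQL strings). The column names `colA` and `colB` for file 1, the column name `key` for file 2, and the file paths are assumptions; replace them with your actual schema. A `left_anti` join keeps exactly the file 1 rows whose key does not appear in file 2:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.concat

object DiffFiles {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("DiffFiles").master("local[*]").getOrCreate()
    import spark.implicits._

    // Hypothetical schemas: file1 has columns colA and colB; file2 has a column named key.
    val file1 = spark.read.option("header", "true").csv("file1.csv")
    val file2 = spark.read.option("header", "true").csv("file2.csv")

    // Build the lookup key by concatenating the two columns of file 1.
    val file1Keyed = file1.withColumn("key", concat($"colA", $"colB"))

    // left_anti keeps only file1 rows whose key does NOT appear in file2.
    val missing = file1Keyed.join(file2, Seq("key"), "left_anti")
    missing.show()

    spark.stop()
  }
}
```

If you instead want the rows that *do* exist in both files, use `"left_semi"` in place of `"left_anti"`.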
answered Apr 2 by Omkar
• 65,850 points
