Spark: comparing two big data files using Scala

0 votes
Hi,
I have a problem statement like this: I have two files, and I need to check whether each record in file 1 also exists in file 2. The key I need to look up in file 2 is made up by concatenating two columns of file 1. Is it possible to do this check without using Spark SQL?
Mar 29 in Apache Spark by Swaroop

recategorized Mar 29 by Omkar
Hey can you please mention what type of files you want to implement this on?
They're CSV files.

1 answer to this question.

0 votes

Try this and see if this does what you want:

scala> file1.join(file2, Seq("key")).show                 // inner join: keys present in both files
scala> file1.join(file2, Seq("key"), "left_outer").show   // all records from file1
scala> file1.join(file2, Seq("key"), "right_outer").show  // all records from file2
scala> file1.join(file2, Seq("key"), "outer").show        // all records from either file
scala> // records of file2 whose key has no match in file1
scala> val diff = file1.join(file2, Seq("key"), "right_outer").filter(file1("value").isNull).drop(file1("value"))
scala> diff.show
scala> diff.write.csv("diff.csv")
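
To cover the CSV and concatenated-key parts of the question, here is a minimal sketch of the full flow using only the DataFrame API (no Spark SQL queries). The file paths and the column names col1, col2, and key are assumptions for illustration; a left_semi/left_anti join pair gives the matched and missing records directly:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.concat

val spark = SparkSession.builder().appName("CompareFiles").getOrCreate()
import spark.implicits._

// Read both CSV files (assumes each file has a header row with column names)
val file1 = spark.read.option("header", "true").csv("file1.csv")
val file2 = spark.read.option("header", "true").csv("file2.csv")

// Build the lookup key in file1 by concatenating two of its columns
// ("col1" and "col2" are hypothetical names)
val file1Keyed = file1.withColumn("key", concat($"col1", $"col2"))

// Records of file1 whose key exists in file2
val matched = file1Keyed.join(file2, Seq("key"), "left_semi")

// Records of file1 whose key is missing from file2
val missing = file1Keyed.join(file2, Seq("key"), "left_anti")

matched.show()
missing.show()

A left_semi join keeps only file1's columns for keys that have a match in file2, and left_anti keeps the rows with no match, so neither pulls file2's columns into the result.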
answered Apr 2 by Omkar
• 67,160 points
