Spark: comparing two big data files using Scala

0 votes
Hi,
I have a problem statement like this:
I have two files.
I need to check whether each record in file 1 also exists in file 2. The key to look up in file 2 is made up by concatenating two columns of file 1. Is it possible to check this without using Spark SQL?
Mar 29 in Apache Spark by Swaroop

recategorized Mar 29 by Omkar
Hey, can you please mention what type of files you want to implement this on?
They're CSV files.

1 answer to this question.

0 votes

Try this and see if this does what you want:

scala> file1.join(file2, Seq("key")).show
scala> file1.join(file2, Seq("key"), "left_outer").show
scala> file1.join(file2, Seq("key"), "right_outer").show
scala> file1.join(file2, Seq("key"), "outer").show
scala> val diff = file1.join(file2, Seq("key"), "right_outer").filter(file1("value").isNull).drop(file1("value"))
scala> diff.show
scala> diff.write.csv("diff.csv")
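
Since the key in file2 is built by concatenating two columns of file1, you need to derive that key column before joining. Here is a minimal sketch for spark-shell (where spark and the $ column syntax are already available), assuming headered CSV files and hypothetical column names: col1 and col2 in file 1, and a key column in file 2. A left_semi join keeps the records of file1 that exist in file2, and a left_anti join keeps the ones that don't, with no Spark SQL queries involved:

scala> import org.apache.spark.sql.functions.concat
scala> val file1 = spark.read.option("header", "true").csv("file1.csv").withColumn("key", concat($"col1", $"col2"))
scala> val file2 = spark.read.option("header", "true").csv("file2.csv")
scala> val matched = file1.join(file2, Seq("key"), "left_semi")  // records of file1 whose key exists in file2
scala> val missing = file1.join(file2, Seq("key"), "left_anti")  // records of file1 whose key is missing from file2
scala> matched.show
scala> missing.show

The left_anti join returns exactly the unmatched records, so it replaces the right_outer join plus isNull filter above if all you need is the difference.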
answered Apr 2 by Omkar
• 67,620 points
