Filter, Option or flatMap in Spark

0 votes

This is the code in Spark:

dsPhs.filter(filter(_))
.map(convert)
.coalesce(partitions)
.filter(additionalFilter.IsValid(_))

The convert function produces a more complex object, MyObject, so I need to pre-filter the basic object. I have 3 options:

  1. Make map return Option[MyObject] and filter it in additionalFilter
  2. Replace map with flatMap and return an empty collection when the element is filtered out
  3. Use filter before the map function to filter out RawObject before converting it to MyObject
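A minimal sketch of the three variants, using a plain Scala List as a stand-in for the Dataset (filter, map, and flatMap compose the same way on both); RawObject, MyObject, and the validation rule are illustrative assumptions:

```scala
// Hypothetical raw record and its converted form.
case class RawObject(id: Int)
case class MyObject(id: Int, label: String)

val raw = List(RawObject(1), RawObject(-1), RawObject(2))

def isValidRaw(r: RawObject): Boolean = r.id > 0            // option 3: pre-filter rule
def convertOpt(r: RawObject): Option[MyObject] =            // options 1/2: convert may reject
  if (r.id > 0) Some(MyObject(r.id, s"obj-${r.id}")) else None

// Option 1: map to Option, filter the Nones later, then unwrap
val v1 = raw.map(convertOpt).filter(_.isDefined).map(_.get)

// Option 2: flatMap over a collection that is empty when the element is rejected
val v2 = raw.flatMap(r => convertOpt(r).toList)

// Option 3: filter the raw objects first, then convert unconditionally
val v3 = raw.filter(isValidRaw).map(r => MyObject(r.id, s"obj-${r.id}"))
```

All three pipelines produce the same two surviving elements; they differ in where the validation logic lives.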
Nov 9, 2018 in Apache Spark by Neha
• 6,280 points
542 views

1 answer to this question.

0 votes

If, for option 2, you mean have convert return an empty array, there's another option: have convert return an Option[MyObject] and use flatMap instead of map. This has the best of options 1 and 2. Without knowing more about your use case, I can't say for sure whether this is better than option 3, but here are some considerations:

  1. Should convert contain input validation logic? If so, consider modifying it to return an Option.
    • If convert is used, or will be used, in other places, could they benefit from this validation?
    • As a side note, this might be a good time to consider what convert currently does when passed an invalid argument.
  2. Can you easily change convert and its signature? If not, consider using a filter.
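A minimal sketch of the suggested approach, again using a plain Scala List in place of the Dataset (Option converts implicitly to an iterable, so flatMap silently drops the Nones); RawObject, MyObject, and the validation condition are illustrative assumptions:

```scala
case class RawObject(value: Int)
case class MyObject(value: Int)

// convert now owns its input validation and returns Option[MyObject].
def convert(r: RawObject): Option[MyObject] =
  if (r.value >= 0) Some(MyObject(r.value)) else None

val dsPhs = List(RawObject(3), RawObject(-1), RawObject(7))

// flatMap unwraps Some(...) and discards None, so invalid records simply disappear.
val result = dsPhs.flatMap(convert)
```

The same shape works on a Spark Dataset or RDD, since their flatMap also accepts a function returning an iterable collection per element.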
answered Nov 9, 2018 by Frankie
• 9,810 points
