map vs mapValues in Spark

0 votes
I'm newbie to Spark and working on developing custom machine learning algorithms. My question is what is the difference between .map() and .mapValues() in RDD and what are cases where which one I have to use?

Thanks in advance!
Jun 29, 2018 in Apache Spark by Shubham
• 12,030 points
1,377 views

1 answer to this question.

Your answer

Your name to display (optional):
Privacy: Your email address will only be used for sending these notifications.
0 votes

There is a difference between the two:

mapValues is only applicable for PairRDDs, meaning RDDs of the form RDD[(A, B)]. In that case, mapValues operates on the value only (the second part of the tuple), while map operates on the entire record (tuple of key and value).

In other words, given f: B => C and rdd: RDD[(A, B)], these two are identical

val result: RDD[(A, C)] = rdd.map { case (k, v) => (k, f(v)) }

val result: RDD[(A, C)] = rdd.mapValues(f)
The latter is simply shorter and clearer, so when you just want to transform the values and keep the keys as-is, it's recommended to use mapValues.

On the other hand, if you want to transform the keys too (e.g. you want to apply f: (A, B) => C), you simply can't use mapValues because it would only pass the values to your function.

The last difference concerns partitioning: if you applied any custom partitioning to your RDD (e.g. using partitionBy), using map would "forget" that paritioner (the result will revert to default partitioning) as the keys might have changed; mapValues, however, preserves any partitioner set on the RDD.

Hope this will answer your query to some extent.

answered Jun 29, 2018 by nitinrawat895
• 9,030 points
Can you provide working examples?

Related Questions In Apache Spark

0 votes
1 answer

map() vs flatMap() in Spark

Both map() and flatMap() are used for ...READ MORE

answered Mar 8 in Apache Spark by Raj
156 views
0 votes
1 answer

How to get ID of a map task in Spark?

you can access task information using TaskContext: import org.apache.spark.TaskContext sc.parallelize(Seq[Int](), ...READ MORE

answered Nov 20, 2018 in Apache Spark by Frankie
• 9,570 points
116 views
0 votes
1 answer

Cache() vs persist() in Spark

The cache() is used only the default storage level ...READ MORE

answered Mar 8 in Apache Spark by Raj
51 views
0 votes
1 answer

Changing Column position in spark dataframe

Yes, you can reorder the dataframe elements. You need ...READ MORE

answered Apr 19, 2018 in Apache Spark by Ashish
• 2,630 points
2,576 views
+1 vote
1 answer
0 votes
1 answer

Writing File into HDFS using spark scala

The reason you are not able to ...READ MORE

answered Apr 5, 2018 in Big Data Hadoop by kurt_cobain
• 9,260 points
3,304 views
0 votes
1 answer

Is there any way to check the Spark version?

There are 2 ways to check the ...READ MORE

answered Apr 19, 2018 in Apache Spark by nitinrawat895
• 9,030 points
465 views
0 votes
1 answer

Is it better to have one large parquet file or lots of smaller parquet files?

Ideally, you would use snappy compression (default) ...READ MORE

answered May 23, 2018 in Apache Spark by nitinrawat895
• 9,030 points
888 views
0 votes
5 answers

groupByKey vs reduceByKey in Apache Spark.

Below Images are self explainatry for reducebykey ...READ MORE

answered Apr 22 in Apache Spark by Gunjan Kumar
3,443 views
0 votes
1 answer

What's the difference between 'filter' and 'where' in Spark SQL?

Both 'filter' and 'where' in Spark SQL ...READ MORE

answered May 23, 2018 in Apache Spark by nitinrawat895
• 9,030 points
2,828 views

© 2018 Brain4ce Education Solutions Pvt. Ltd. All rights Reserved.
"PMP®","PMI®", "PMI-ACP®" and "PMBOK®" are registered marks of the Project Management Institute, Inc. MongoDB®, Mongo and the leaf logo are the registered trademarks of MongoDB, Inc.