map vs mapValues in Spark

Question

I'm a newbie to Spark and am developing custom machine learning algorithms. What is the difference between .map() and .mapValues() on an RDD, and in which cases should I use each one?

Thanks in advance!

nitinrawat895 · Answer 1 · Jun 29, 2018

There is a difference between the two:

mapValues is only applicable for PairRDDs, meaning RDDs of the form RDD[(A, B)]. In that case, mapValues operates on the value only (the second part of the tuple), while map operates on the entire record (tuple of key and value).

In other words, given f: B => C and rdd: RDD[(A, B)], these two produce the same records:

val result: RDD[(A, C)] = rdd.map { case (k, v) => (k, f(v)) }

val result: RDD[(A, C)] = rdd.mapValues(f)

The latter is shorter and clearer, so when you only want to transform the values and keep the keys as-is, prefer mapValues.

On the other hand, if your function needs to read or change the key (e.g. you want to apply f: (A, B) => C), you can't use mapValues, because it passes only the value to your function.
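To make the comparison concrete without needing a cluster, here is a minimal plain-Python sketch that models a pair RDD as a list of (key, value) tuples. The helpers map_records and map_values are hypothetical stand-ins for RDD.map and RDD.mapValues; they are not Spark APIs and only imitate the behaviour described above.

```python
# Toy model: a "pair RDD" is just a list of (key, value) tuples.
# map_records and map_values are illustrative helpers, not Spark functions.

def map_records(pairs, fn):
    """Like RDD.map: fn receives the whole (key, value) record."""
    return [fn(record) for record in pairs]

def map_values(pairs, fn):
    """Like RDD.mapValues: fn receives only the value; the key is kept."""
    return [(key, fn(value)) for key, value in pairs]

fruit = [("a", ["apple", "banana", "lemon"]), ("b", ["grapes"])]

# Transforming values only: both approaches give the same records.
via_map = map_records(fruit, lambda kv: (kv[0], len(kv[1])))
via_map_values = map_values(fruit, len)

# A function that needs the key can only be expressed with map_records.
labelled = map_records(fruit, lambda kv: f"{kv[0]}:{len(kv[1])}")
```

Here via_map and via_map_values hold the same pairs, while labelled could not be produced through map_values because the function never sees the key.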

The last difference concerns partitioning: if you applied a custom partitioner to your RDD (e.g. using partitionBy), map "forgets" that partitioner, because the keys might have changed. The resulting RDD has no partitioner set, even if your function left the keys untouched. mapValues, however, preserves any partitioner set on the RDD.
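The partitioner bookkeeping can be sketched the same way. The ToyPairRDD class below is a hypothetical plain-Python model, not part of Spark, and the string "hash-by-key" is just a placeholder for a real partitioner object. It only illustrates why one operation can keep the partitioner and the other cannot.

```python
# Toy model of partitioner bookkeeping; ToyPairRDD is hypothetical, not Spark.

class ToyPairRDD:
    def __init__(self, pairs, partitioner=None):
        self.pairs = list(pairs)
        self.partitioner = partitioner  # None means "no known partitioner"

    def partition_by(self, partitioner):
        # Record which partitioner the keys are laid out by.
        return ToyPairRDD(self.pairs, partitioner)

    def map(self, fn):
        # fn may return anything, including new keys, so the old
        # partitioner can no longer be trusted and is dropped.
        return ToyPairRDD([fn(kv) for kv in self.pairs], partitioner=None)

    def map_values(self, fn):
        # Keys are untouched, so the partitioner is still valid.
        return ToyPairRDD([(k, fn(v)) for k, v in self.pairs], self.partitioner)

base = ToyPairRDD([("a", 1), ("b", 2)]).partition_by("hash-by-key")
after_map = base.map(lambda kv: (kv[0], kv[1] * 10))
after_map_values = base.map_values(lambda v: v * 10)
```

Both results contain the same pairs, but only after_map_values still carries the partitioner. In Spark this matters because a later key-based operation such as reduceByKey or join can avoid a shuffle when the partitioner is known.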

Hope this will answer your query to some extent.

answered Jun 29, 2018 by nitinrawat895

Can you provide working examples?

commented Jun 29, 2018 by Data_Nerd

# PySpark; assumes an existing SparkContext named sc
x = sc.parallelize([("a", ["apple", "banana", "lemon"]), ("b", ["grapes"])])
x.mapValues(len).collect()
# [('a', 3), ('b', 1)]

commented Mar 15, 2020 by Sourav
