mapValues is only applicable for PairRDDs, meaning RDDs of the form RDD[(A, B)]. In that case, mapValues operates on the value only (the second part of the tuple), while map operates on the entire record (tuple of key and value).
In other words, given f: B => C and rdd: RDD[(A, B)], these two are almost identical (the one difference, partitioning, is covered at the end):
val result: RDD[(A, C)] = rdd.map { case (k, v) => (k, f(v)) }
val result: RDD[(A, C)] = rdd.mapValues(f)
The latter is simply shorter and clearer, so when you just want to transform the values and keep the keys as-is, it’s recommended to use mapValues.
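To make the equivalence concrete, here is a minimal sketch that runs both forms on the same data and checks that they yield the same records. The local master setting, the sample data, and the function f are assumptions made up for illustration.

```scala
import org.apache.spark.sql.SparkSession

object MapValuesVsMap {
  def main(args: Array[String]): Unit = {
    // Local session for illustration only (assumed settings).
    val spark = SparkSession.builder()
      .master("local[2]")
      .appName("mapValues-vs-map")
      .getOrCreate()
    val sc = spark.sparkContext

    // Sample pair RDD: RDD[(String, Int)]
    val rdd = sc.parallelize(Seq(("a", 1), ("b", 2), ("c", 3)))

    // Hypothetical f: B => C, here Int => String
    val f: Int => String = v => s"value-$v"

    // Transform the values by rebuilding each tuple by hand.
    val viaMap = rdd.map { case (k, v) => (k, f(v)) }
    // Transform the values only; keys are passed through untouched.
    val viaMapValues = rdd.mapValues(f)

    // Both RDDs hold the same (key, transformed value) records.
    assert(viaMap.collect().toSet == viaMapValues.collect().toSet)

    spark.stop()
  }
}
```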
On the other hand, if you want to transform the keys too (e.g. you want to apply f: (A, B) => C), you simply can’t use mapValues, because it would only pass the values to your function.
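As a rough illustration of that case, the sketch below applies a function that needs both the key and the value, which only map can do. The sample data and the function f are hypothetical.

```scala
import org.apache.spark.sql.SparkSession

object MapWithKeys {
  def main(args: Array[String]): Unit = {
    // Local session for illustration only (assumed settings).
    val spark = SparkSession.builder()
      .master("local[2]")
      .appName("map-with-keys")
      .getOrCreate()
    val sc = spark.sparkContext

    val rdd = sc.parallelize(Seq(("a", 1), ("b", 2)))

    // Hypothetical f: (A, B) => C that reads the key as well as the value.
    val f: (String, Int) => String = (k, v) => s"$k=$v"

    // map sees the whole tuple, so f can use the key; the result is RDD[String].
    val combined = rdd.map { case (k, v) => f(k, v) }

    assert(combined.collect().toSet == Set("a=1", "b=2"))

    spark.stop()
  }
}
```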
The last difference concerns partitioning: if you applied any custom partitioning to your RDD (e.g. using partitionBy), using map would “forget” that partitioner (the resulting RDD no longer has a partitioner set), as the keys might have changed; mapValues, however, preserves any partitioner set on the RDD.
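You can check this yourself by inspecting the partitioner field of the resulting RDDs. The following is a minimal sketch; the HashPartitioner with 4 partitions, the sample data, and the local master setting are arbitrary choices for illustration.

```scala
import org.apache.spark.HashPartitioner
import org.apache.spark.sql.SparkSession

object PartitionerPreservation {
  def main(args: Array[String]): Unit = {
    // Local session for illustration only (assumed settings).
    val spark = SparkSession.builder()
      .master("local[2]")
      .appName("partitioner-preservation")
      .getOrCreate()
    val sc = spark.sparkContext

    // Pair RDD with an explicit partitioner (4 partitions chosen arbitrarily).
    val partitioned = sc.parallelize(Seq(("a", 1), ("b", 2), ("c", 3)))
      .partitionBy(new HashPartitioner(4))

    // Same value transformation, written both ways.
    val viaMap = partitioned.map { case (k, v) => (k, v + 1) }
    val viaMapValues = partitioned.mapValues(_ + 1)

    // map drops the partitioner, because Spark cannot know the keys are unchanged.
    assert(viaMap.partitioner.isEmpty)
    // mapValues keeps the partitioner of its parent RDD.
    assert(viaMapValues.partitioner == partitioned.partitioner)

    spark.stop()
  }
}
```

This matters mostly for what comes next: when the partitioner is still known, a following key-based operation such as reduceByKey or a join against an RDD with the same partitioner can typically avoid an extra shuffle.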