map vs mapValues in Spark

mapValues is only applicable to PairRDDs, meaning RDDs of the form RDD[(A, B)]. In that case, mapValues operates on the value only (the second part of the tuple), while map operates on the entire record (the tuple of key and value). In other words, given f: B => C and rdd: RDD[(A, B)], these two are equivalent (almost; see the comment at the bottom), and the latter is simply shorter … Read more
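A minimal sketch of the two forms the excerpt compares (the names rdd and f below are illustrative placeholders, not taken from the original post):

import org.apache.spark.{SparkConf, SparkContext}

object MapVsMapValues {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("map-vs-mapValues").setMaster("local[*]"))

    val rdd = sc.parallelize(Seq(("a", 1), ("b", 2)))  // rdd: RDD[(String, Int)]
    val f: Int => Int = _ * 10                          // f: B => C with B = C = Int

    // mapValues transforms only the value and leaves the key untouched.
    val viaMapValues = rdd.mapValues(f)

    // map sees the whole (key, value) tuple, so the key has to be rebuilt by hand.
    val viaMap = rdd.map { case (k, v) => (k, f(v)) }

    println(viaMapValues.collect().toList)  // List((a,10), (b,20))
    println(viaMap.collect().toList)        // List((a,10), (b,20))

    sc.stop()
  }
}

The "almost" most likely refers to partitioning: mapValues preserves the parent RDD's partitioner, while the equivalent map cannot, because Spark has no way of knowing that the keys were left unchanged.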

Task not serializable: java.io.NotSerializableException when calling function outside closure only on classes not objects

RDDs extend the Serializable interface, so this is not what’s causing your task to fail. However, this doesn’t mean that you can serialise an RDD with Spark and avoid NotSerializableException. Spark is a distributed computing engine and its main abstraction is a resilient distributed dataset (RDD), which can be viewed as a distributed collection. Basically, an RDD’s elements are … Read more
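A small sketch of the class-versus-object contrast the title refers to (Doubler and DoublerObj are made-up names, and the exact exception message can vary by Spark version):

import org.apache.spark.SparkContext

// Calling a method defined on a plain class from inside an RDD closure drags the whole
// instance into the task; since Doubler is not Serializable, Spark fails with
// "Task not serializable", caused by java.io.NotSerializableException: Doubler.
class Doubler {
  def double(x: Int): Int = x * 2

  def run(sc: SparkContext): Array[Int] =
    sc.parallelize(1 to 5).map(double).collect()  // the closure captures `this`
}

// The same code inside a top-level object behaves like a static method call:
// there is no enclosing instance to capture, so the closure stays serializable and the job runs.
object DoublerObj {
  def double(x: Int): Int = x * 2

  def run(sc: SparkContext): Array[Int] =
    sc.parallelize(1 to 5).map(double).collect()
}

The usual fixes for the class case are to make the class extend Serializable, or to copy the method into a local function value inside the closure so that only that value is captured.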

The value of “spark.yarn.executor.memoryOverhead” setting?

It is just the max value. The goal is to calculate the overhead as a percentage of the real executor memory, as used by RDDs and DataFrames. spark.executor.memory controls the executor heap size, but JVMs can also use some memory off heap, for example for interned Strings and direct byte buffers. The value of the spark.yarn.executor.memoryOverhead property is added to the … Read more
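A rough back-of-the-envelope sketch of how the overhead relates to the heap size, assuming the classic defaults documented for Spark on YARN (overhead = max(0.10 * executor memory, 384 MB); MemoryOverheadSketch and its helpers are illustrative names, not part of any Spark API):

// Estimate of the memory YARN is asked for per executor container.
object MemoryOverheadSketch {
  def overheadMiB(executorMemoryMiB: Long): Long =
    math.max((executorMemoryMiB * 0.10).toLong, 384L)

  def containerRequestMiB(executorMemoryMiB: Long): Long =
    executorMemoryMiB + overheadMiB(executorMemoryMiB)

  def main(args: Array[String]): Unit = {
    // --executor-memory 8g  =>  8192 MiB heap + 819 MiB overhead = 9011 MiB requested from YARN
    println(containerRequestMiB(8192))  // 9011
    // --executor-memory 1g  =>  the 384 MiB floor applies, so 1024 + 384 = 1408 MiB
    println(containerRequestMiB(1024))  // 1408
  }
}

In other words, the overhead is not part of the heap: it is extra, off-heap room added on top of spark.executor.memory when sizing the YARN container.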