fs.hdfs.impl.disable.cache caused Spark SQL to become very slow

This is a follow-up to this question: Hive/Hadoop intermittent failure: Unable to move source to destination. We found that we could avoid the “Unable to move source … Filesystem closed” problem by setting fs.hdfs.impl.disable.cache to true. However, we also observed that Spark SQL queries became very slow; queries that used to finish … Read more
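For reference, the flag in question is a Hadoop client setting, not a Spark one; a hedged sketch of how it might be set in core-site.xml (the placement and surrounding configuration are assumptions, not taken from the original post):

```xml
<!-- core-site.xml: disabling the shared FileSystem cache means each
     FileSystem.get() call builds a fresh client instance, which avoids
     the "Filesystem closed" race but adds per-call setup overhead -
     a plausible cause of the slowdown described above. -->
<property>
  <name>fs.hdfs.impl.disable.cache</name>
  <value>true</value>
</property>
```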

What does the “spark.yarn.executor.memoryOverhead” setting mean?

It is just the max value. The goal is to calculate overhead as a percentage of the real executor memory, as used by RDDs and DataFrames. spark.executor.memory controls the executor heap size, but JVMs can also use some memory off heap, for example for interned Strings and direct byte buffers. The value of the spark.yarn.executor.memoryOverhead property is added to the … Read more
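The arithmetic behind this can be sketched in plain Python. The constants below are assumptions based on Spark's documented defaults for this property (a 384 MB floor and a 10% overhead factor), not values stated in the snippet above:

```python
# Sketch of how the total YARN container size is derived from executor
# memory, assuming Spark's documented defaults for memoryOverhead:
# overhead = max(384 MB, 10% of the executor heap).
MEMORY_OVERHEAD_FACTOR = 0.10  # default overhead fraction (assumption)
MEMORY_OVERHEAD_MIN_MB = 384   # default floor in MB (assumption)

def yarn_container_mb(executor_memory_mb):
    """Total memory YARN must allocate for one executor container."""
    overhead = max(MEMORY_OVERHEAD_MIN_MB,
                   int(executor_memory_mb * MEMORY_OVERHEAD_FACTOR))
    return executor_memory_mb + overhead

print(yarn_container_mb(2048))  # -> 2432 (10% = 204 MB, below the 384 MB floor)
print(yarn_container_mb(8192))  # -> 9011 (10% = 819 MB, above the floor)
```

This is why requesting exactly the YARN container limit as executor heap fails: the overhead is added on top of spark.executor.memory, so the sum must fit within the container.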

How to change dataframe column names in pyspark?

There are many ways to do that. Option 1: using selectExpr.

    data = sqlContext.createDataFrame([("Alberto", 2), ("Dakota", 2)],
                                      ["Name", "askdaosdka"])
    data.show()
    data.printSchema()
    # +-------+----------+
    # |   Name|askdaosdka|
    # +-------+----------+
    # |Alberto|         2|
    # | Dakota|         2|
    # +-------+----------+
    #
    # root
    #  |-- Name: string (nullable = true)
    #  |-- askdaosdka: long (nullable = true)

    df = data.selectExpr("Name as name", "askdaosdka as age")

… Read more

What is the difference between spark.sql.shuffle.partitions and spark.default.parallelism?

From the answer here: spark.sql.shuffle.partitions configures the number of partitions used when shuffling data for joins or aggregations. spark.default.parallelism is the default number of partitions in RDDs returned by transformations like join, reduceByKey, and parallelize when not set explicitly by the user. Note that spark.default.parallelism seems to apply only to raw RDDs and is ignored when working with DataFrames. If the task you are performing … Read more
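To make the distinction concrete, the two settings could be pinned in spark-defaults.conf; the values below are arbitrary illustrations, not recommendations:

```
# spark-defaults.conf (example values, not recommendations)

# Number of partitions for DataFrame/SQL shuffles (joins, aggregations):
spark.sql.shuffle.partitions   200

# Default partition count for raw RDD transformations (join, reduceByKey,
# parallelize) when no count is given explicitly; ignored by DataFrames:
spark.default.parallelism      100
```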

How does createOrReplaceTempView work in Spark?

createOrReplaceTempView creates (or replaces, if a view with that name already exists) a lazily evaluated “view” that you can then use like a Hive table in Spark SQL. It does not persist anything to memory unless you cache the Dataset that underpins the view. The data is fully cached only after the .count call. Here’s proof it’s been cached: Related SO: spark createOrReplaceTempView vs … Read more
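The “create or replace” naming semantics can be modeled without a Spark cluster. The sketch below is a pure-Python analogy, not Spark's internals: the session's temp-view catalog behaves like a dict keyed by view name, so registering the same name again silently overwrites the earlier view instead of raising an error.

```python
# Illustrative model of createOrReplaceTempView's naming semantics:
# the session keeps a name -> plan mapping, and re-registering a name
# replaces the previous entry rather than failing.
temp_views = {}  # stand-in for the Spark session's temp-view catalog

def create_or_replace_temp_view(name, logical_plan):
    temp_views[name] = logical_plan  # dict upsert == "create or replace"

create_or_replace_temp_view("people", "SELECT * FROM source_v1")
create_or_replace_temp_view("people", "SELECT * FROM source_v2")
print(temp_views["people"])  # -> SELECT * FROM source_v2
```

The view stores only the (lazily evaluated) plan, which is why no data is materialized until the underlying Dataset is cached and an action forces evaluation.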