fs.hdfs.impl.disable.cache caused Spark SQL to become very slow

This is a follow-up to this question: Hive/Hadoop intermittent failure: Unable to move source to destination. We found that we could avoid the “Unable to move source … Filesystem closed” problem by setting fs.hdfs.impl.disable.cache to true. However, we also observed that Spark SQL queries became very slow; queries that used to finish … Read more
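For reference, the flag in question is a Hadoop client setting, not a Spark one; a hedged sketch of how it might be set in core-site.xml (the placement and surrounding configuration are assumptions, not taken from the original post):

```xml
<!-- core-site.xml: disabling the shared FileSystem cache means each
     FileSystem.get() call builds a fresh client instance, which avoids
     the "Filesystem closed" race but adds per-call setup overhead -
     a plausible cause of the slowdown described above. -->
<property>
  <name>fs.hdfs.impl.disable.cache</name>
  <value>true</value>
</property>
```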

What does the “spark.yarn.executor.memoryOverhead” setting mean?

It is just the max value. The goal is to calculate overhead as a percentage of the real executor memory, as used by RDDs and DataFrames. spark.executor.memory controls the executor heap size, but JVMs can also use some memory off heap, for example for interned Strings and direct byte buffers. The value of the spark.yarn.executor.memoryOverhead property is added to the … Read more
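The arithmetic behind this can be sketched in plain Python. The constants below are assumptions based on Spark's documented defaults for this property (a 384 MB floor and a 10% overhead factor), not values stated in the snippet above:

```python
# Sketch of how the total YARN container size is derived from executor
# memory, assuming Spark's documented defaults for memoryOverhead:
# overhead = max(384 MB, 10% of the executor heap).
MEMORY_OVERHEAD_FACTOR = 0.10  # default overhead fraction (assumption)
MEMORY_OVERHEAD_MIN_MB = 384   # default floor in MB (assumption)

def yarn_container_mb(executor_memory_mb):
    """Total memory YARN must allocate for one executor container."""
    overhead = max(MEMORY_OVERHEAD_MIN_MB,
                   int(executor_memory_mb * MEMORY_OVERHEAD_FACTOR))
    return executor_memory_mb + overhead

print(yarn_container_mb(2048))  # -> 2432 (10% = 204 MB, below the 384 MB floor)
print(yarn_container_mb(8192))  # -> 9011 (10% = 819 MB, above the floor)
```

This is why requesting exactly the YARN container limit as executor heap fails: the overhead is added on top of spark.executor.memory, so the sum must fit within the container.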

How to change dataframe column names in pyspark?

There are many ways to do that. Option 1: using selectExpr.

    data = sqlContext.createDataFrame([("Alberto", 2), ("Dakota", 2)],
                                      ["Name", "askdaosdka"])
    data.show()
    data.printSchema()
    # +-------+----------+
    # |   Name|askdaosdka|
    # +-------+----------+
    # |Alberto|         2|
    # | Dakota|         2|
    # +-------+----------+
    #
    # root
    #  |-- Name: string (nullable = true)
    #  |-- askdaosdka: long (nullable = true)

    df = data.selectExpr("Name as name", "askdaosdka as age")

… Read more

What is the difference between spark.sql.shuffle.partitions and spark.default.parallelism?

From the answer here: spark.sql.shuffle.partitions configures the number of partitions used when shuffling data for joins or aggregations. spark.default.parallelism is the default number of partitions in RDDs returned by transformations like join, reduceByKey, and parallelize when not set explicitly by the user. Note that spark.default.parallelism seems to apply only to raw RDDs and is ignored when working with DataFrames. If the task you are performing … Read more
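To make the distinction concrete, the two settings could be pinned in spark-defaults.conf; the values below are arbitrary illustrations, not recommendations:

```
# spark-defaults.conf (example values, not recommendations)

# Number of partitions for DataFrame/SQL shuffles (joins, aggregations):
spark.sql.shuffle.partitions   200

# Default partition count for raw RDD transformations (join, reduceByKey,
# parallelize) when no count is given explicitly; ignored by DataFrames:
spark.default.parallelism      100
```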

How does createOrReplaceTempView work in Spark?

createOrReplaceTempView creates (or replaces, if a view with that name already exists) a lazily evaluated “view” that you can then use like a Hive table in Spark SQL. It does not persist anything to memory unless you cache the Dataset that underpins the view. The data is fully cached only after the .count call. Here’s proof it’s been cached: Related SO: spark createOrReplaceTempView vs … Read more
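The “create or replace” naming semantics can be modeled without a Spark cluster. The sketch below is a pure-Python analogy, not Spark's internals: the session's temp-view catalog behaves like a dict keyed by view name, so registering the same name again silently overwrites the earlier view instead of raising an error.

```python
# Illustrative model of createOrReplaceTempView's naming semantics:
# the session keeps a name -> plan mapping, and re-registering a name
# replaces the previous entry rather than failing.
temp_views = {}  # stand-in for the Spark session's temp-view catalog

def create_or_replace_temp_view(name, logical_plan):
    temp_views[name] = logical_plan  # dict upsert == "create or replace"

create_or_replace_temp_view("people", "SELECT * FROM source_v1")
create_or_replace_temp_view("people", "SELECT * FROM source_v2")
print(temp_views["people"])  # -> SELECT * FROM source_v2
```

The view stores only the (lazily evaluated) plan, which is why no data is materialized until the underlying Dataset is cached and an action forces evaluation.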