How to join on multiple columns in Pyspark?
You should use the & / | operators and be careful about operator precedence: in Python, == binds less tightly than bitwise & and |, so each comparison in a compound join condition must be parenthesized.
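A minimal sketch, assuming two hypothetical DataFrames df1 and df2 that share the columns id and key:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical DataFrames, used only for illustration.
df1 = spark.createDataFrame([(1, "a", 10), (2, "b", 20)], ["id", "key", "v1"])
df2 = spark.createDataFrame([(1, "a", 100), (2, "c", 200)], ["id", "key", "v2"])

# Without the parentheses, df1.id == df2.id & df1.key == df2.key would parse
# as df1.id == (df2.id & df1.key) == df2.key and fail at runtime.
joined = df1.join(df2, (df1.id == df2.id) & (df1.key == df2.key))
joined.show()

# When the join columns have the same names on both sides, passing a list of
# column names sidesteps the precedence issue (and the duplicated columns).
joined2 = df1.join(df2, ["id", "key"])
joined2.show()
```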
One possible reason is that JAVA_HOME is not set because Java is not installed. I encountered the same issue; it failed at sc = pyspark.SparkConf(). I solved it by installing Java following https://www.digitalocean.com/community/tutorials/how-to-install-java-with-apt-get-on-ubuntu-16-04
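If Java is installed but the environment variable is simply missing, a quick sanity check from Python might look like the sketch below; the JVM path is an assumption for a typical Ubuntu OpenJDK install, so adjust it to your machine:

```python
import os

# Make sure the JVM is discoverable before PySpark tries to launch it.
# /usr/lib/jvm/... is a hypothetical example path for an apt-get OpenJDK
# install; point it at your actual Java home.
os.environ.setdefault("JAVA_HOME", "/usr/lib/jvm/java-8-openjdk-amd64")

import pyspark

conf = pyspark.SparkConf().setAppName("sanity-check")
sc = pyspark.SparkContext(conf=conf)
print(sc.version)
sc.stop()
```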
Reading the Spark documentation I found an easier solution. Since Spark 1.4 there is a drop(col) function that can be used in PySpark on a DataFrame. You can use it in two ways: df.drop("age") or df.drop(df.age). Pyspark Documentation – Drop
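A minimal sketch of both call styles, using a throwaway DataFrame:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([("Alberto", 2), ("Dakota", 2)], ["Name", "age"])

# drop() returns a new DataFrame; the original df is left untouched.
df.drop("age").show()    # drop by column name
df.drop(df.age).show()   # drop by Column object
```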
There are many ways to do that. Option 1: using selectExpr.

```python
data = sqlContext.createDataFrame([("Alberto", 2), ("Dakota", 2)],
                                  ["Name", "askdaosdka"])
data.show()
data.printSchema()
# Output
# +-------+----------+
# |   Name|askdaosdka|
# +-------+----------+
# |Alberto|         2|
# | Dakota|         2|
# +-------+----------+
#
# root
#  |-- Name: string (nullable = true)
#  |-- askdaosdka: long (nullable = true)

df = data.selectExpr("Name as name", "askdaosdka as age")
df.show()
df.printSchema()
# Output
# +-------+---+
# |   name|age|
```
… Read more
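The answer is truncated before its remaining options, so as a supplementary sketch (not necessarily what the original goes on to show), the same rename can be done with withColumnRenamed:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
data = spark.createDataFrame([("Alberto", 2), ("Dakota", 2)],
                             ["Name", "askdaosdka"])

# Rename columns one at a time; each call returns a new DataFrame.
df = (data
      .withColumnRenamed("Name", "name")
      .withColumnRenamed("askdaosdka", "age"))
df.printSchema()
```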
There are two ways to convert an RDD to a DataFrame in Spark: toDF() and createDataFrame(rdd, schema). I will show you how you can do that dynamically. toDF(): the toDF() method converts an RDD[Row] to a DataFrame. The point is that the Row() object can receive a **kwargs argument, so there is an easy way to do it dynamically. This way you … Read more
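A minimal sketch of both conversions, with illustrative column names:

```python
from pyspark.sql import Row, SparkSession
from pyspark.sql.types import LongType, StringType, StructField, StructType

spark = SparkSession.builder.getOrCreate()
rdd = spark.sparkContext.parallelize([("Alberto", 2), ("Dakota", 2)])

# toDF(): Row() accepts **kwargs, so the column names can be supplied
# dynamically while mapping each record to a Row.
df1 = rdd.map(lambda t: Row(name=t[0], age=t[1])).toDF()
df1.printSchema()

# createDataFrame(rdd, schema): same data, schema declared up front.
schema = StructType([
    StructField("name", StringType(), True),
    StructField("age", LongType(), True),
])
df2 = spark.createDataFrame(rdd, schema)
df2.printSchema()
```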
From the answer here, spark.sql.shuffle.partitions configures the number of partitions used when shuffling data for joins or aggregations, while spark.default.parallelism is the default number of partitions in RDDs returned by transformations like join, reduceByKey, and parallelize when not set explicitly by the user. Note that spark.default.parallelism seems to apply only to raw RDDs and is ignored when working with DataFrames. If the task you are performing … Read more
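A minimal sketch of where each setting lives (the values are arbitrary examples; on recent Spark versions adaptive query execution may coalesce shuffle partitions further):

```python
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         # Default partition count for RDD ops (join, reduceByKey,
         # parallelize) when none is given; ignored by the DataFrame API.
         .config("spark.default.parallelism", "8")
         # Partition count for DataFrame shuffles (joins, aggregations).
         .config("spark.sql.shuffle.partitions", "200")
         .getOrCreate())

rdd = spark.sparkContext.parallelize(range(100))
df = spark.createDataFrame([(i, i % 3) for i in range(100)], ["v", "k"])

print(rdd.getNumPartitions())                          # from default.parallelism
print(df.groupBy("k").count().rdd.getNumPartitions())  # from shuffle.partitions
```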
Assuming that a worker wants to send 4G of data to the driver, will spark.driver.maxResultSize=1G cause the worker to send 4 messages (instead of 1 with unlimited spark.driver.maxResultSize)? No. If the estimated size of the data is larger than maxResultSize, the given job will be aborted. The goal here is to protect your application from driver loss, … Read more
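A minimal sketch of the setting in context (the value and RDD are illustrative; the collect() is left commented out because its whole point is to fail once the serialized results exceed the cap):

```python
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         # Upper bound on the total serialized size of results an action may
         # send back to the driver; 0 would mean unlimited.
         .config("spark.driver.maxResultSize", "1g")
         .getOrCreate())

big = spark.sparkContext.parallelize(range(10**8))

# If the estimated serialized size of the collected partitions exceeds 1g,
# Spark aborts the job instead of splitting the transfer into messages.
# result = big.collect()
```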
createOrReplaceTempView creates (or replaces, if a view with that name already exists) a lazily evaluated "view" that you can then use like a Hive table in Spark SQL. It does not persist to memory unless you cache the dataset that underpins the view; the data is cached fully only after the .count() call. (The original answer includes a screenshot as proof that it has been cached.) Related SO: spark createOrReplaceTempView vs … Read more
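A minimal sketch of the lazy view plus the explicit cache-then-count pattern the answer describes (names are illustrative):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([("Alberto", 2), ("Dakota", 2)], ["name", "age"])

# Registers a lazily evaluated view; nothing is computed or stored yet.
df.createOrReplaceTempView("people")

# Each query re-evaluates the underlying dataset...
spark.sql("SELECT name FROM people WHERE age = 2").show()

# ...unless you cache it. cache() is itself lazy, so the data is only fully
# materialized once an action such as count() runs.
spark.table("people").cache()
spark.table("people").count()
```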
Copy the application ID from the Spark scheduler, for instance application_1428487296152_25597, connect to the server that launched the job, and run:

```
yarn application -kill application_1428487296152_25597
```