Simple dataframe creation:
```python
df = spark.createDataFrame(
    [
        (1, "foo"),  # create your data here, be consistent in the types
        (2, "bar"),
    ],
    ["id", "label"],  # add your column names here
)

df.printSchema()
# root
#  |-- id: long (nullable = true)
#  |-- label: string (nullable = true)

df.show()
# +---+-----+
# | id|label|
# +---+-----+
# |  1|  foo|
# |  2|  bar|
# +---+-----+
```
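These snippets assume an active SparkSession bound to the name `spark`, as you get in the `pyspark` shell and most notebooks. If you run them as a standalone script, a minimal sketch to create one (the app name is arbitrary):

```python
from pyspark.sql import SparkSession

# Create (or reuse) a SparkSession; "example" is just a placeholder app name.
spark = SparkSession.builder.appName("example").getOrCreate()
```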
According to the official docs:

- When schema is a list of column names, the type of each column is inferred from the data (example above ↑).
- When schema is a `pyspark.sql.types.DataType` or a datatype string, it must match the real data (examples below ↓).
```python
# Example with a datatype string
df = spark.createDataFrame(
    [
        (1, "foo"),  # add your data here
        (2, "bar"),
    ],
    "id int, label string",  # add column names and types here
)

# Example with pyspark.sql.types
from pyspark.sql import types as T

df = spark.createDataFrame(
    [
        (1, "foo"),  # add your data here
        (2, "bar"),
    ],
    T.StructType(  # define the whole schema within a StructType
        [
            T.StructField("id", T.IntegerType(), True),
            T.StructField("label", T.StringType(), True),
        ]
    ),
)

df.printSchema()
# root
#  |-- id: integer (nullable = true)    (type is forced to integer)
#  |-- label: string (nullable = true)
```
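To illustrate "it must match the real data": with the default `verifySchema=True`, values that don't fit the declared types cause `createDataFrame` to fail. A hedged sketch (the exact exception type and message vary by Spark version, typically a `TypeError` from schema verification):

```python
# Mismatch on purpose: "one" is a string, but the schema declares `id` as int.
try:
    spark.createDataFrame(
        [("one", "foo")],
        "id int, label string",
    ).collect()
except Exception as err:  # usually a TypeError raised during verification
    print(type(err).__name__, err)
```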
Additionally, you can create your DataFrame from a pandas DataFrame; the schema will be inferred from the pandas DataFrame's types:
```python
import pandas as pd
import numpy as np

pdf = pd.DataFrame(
    {
        "col1": [np.random.randint(10) for x in range(10)],
        "col2": [np.random.randint(100) for x in range(10)],
    }
)

df = spark.createDataFrame(pdf)
df.show()
# +----+----+
# |col1|col2|
# +----+----+
# |   6|   4|
# |   1|  39|
# |   7|   4|
# |   7|  95|
# |   6|   3|
# |   7|  28|
# |   2|  26|
# |   0|   4|
# |   4|  32|
# +----+----+
```
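If you don't want to rely on inference from the pandas dtypes, `createDataFrame` also accepts an explicit schema together with a pandas DataFrame. A minimal sketch reusing `pdf` from the snippet above (the column names and types here are just this example's assumptions):

```python
from pyspark.sql import types as T

schema = T.StructType(
    [
        T.StructField("col1", T.IntegerType(), True),
        T.StructField("col2", T.IntegerType(), True),
    ]
)

# Reuses the `pdf` pandas DataFrame defined in the previous snippet.
df = spark.createDataFrame(pdf, schema)
df.printSchema()
# Expected: both columns reported as integer instead of the inferred long.
```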