Manually create a PySpark DataFrame

Simple dataframe creation:

df = spark.createDataFrame(
    [
        (1, "foo"),  # create your data here, be consistent in the types.
        (2, "bar"),
    ],
    ["id", "label"],  # add your column names here
)

df.printSchema()
root
 |-- id: long (nullable = true)
 |-- label: string (nullable = true)

df.show()
+---+-----+
| id|label|
+---+-----+
|  1|  foo|
|  2|  bar|
+---+-----+
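The "be consistent in the types" comment matters because, when only column names are given, the column types are inferred from the data. As a minimal sketch, the rows could be checked in plain Python before they are handed to Spark. `find_type_mismatches` is a hypothetical helper, not part of PySpark; it uses the first row as the reference and assumes that row contains no None values.

```python
def find_type_mismatches(rows):
    """Return (row_index, column_index) pairs whose Python type differs from the first row.

    Hypothetical helper for illustration; it is not part of PySpark.
    """
    if not rows:
        return []
    # The first row is taken as the reference for every column's type
    expected = [type(value) for value in rows[0]]
    mismatches = []
    for row_index, row in enumerate(rows[1:], start=1):
        for column_index, (value, expected_type) in enumerate(zip(row, expected)):
            # None is skipped, on the assumption that the column is nullable
            if value is not None and type(value) is not expected_type:
                mismatches.append((row_index, column_index))
    return mismatches


rows = [(1, "foo"), (2, "bar"), ("3", "baz")]
print(find_type_mismatches(rows))  # [(2, 0)]
```

An empty result suggests the rows are at least uniform at the Python level; it does not guarantee that Spark will infer the types you expect.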

According to the official documentation:

  • When schema is a list of column names, the type of each column will be inferred from the data. (example above ↑)
  • When schema is a pyspark.sql.types.DataType or a datatype string, it must match the real data. (examples below ↓)

# Example with a datatype string
df = spark.createDataFrame(
    [
        (1, "foo"),  # Add your data here
        (2, "bar"),
    ],
    "id int, label string",  # add column names and types here
)

# Example with pyspark.sql.types
from pyspark.sql import types as T

df = spark.createDataFrame(
    [
        (1, "foo"),  # Add your data here
        (2, "bar"),
    ],
    T.StructType(  # Define the whole schema within a StructType
        [
            T.StructField("id", T.IntegerType(), True),
            T.StructField("label", T.StringType(), True),
        ]
    ),
)

df.printSchema()
root
 |-- id: integer (nullable = true)  # type is forced to Int
 |-- label: string (nullable = true)
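When the column list lives in a config or a dict, the datatype string can be assembled rather than typed by hand. The sketch below is one way to do that in plain Python; `schema_to_ddl` is a hypothetical helper, not a PySpark function, and it passes the type names through without validating them.

```python
def schema_to_ddl(columns):
    """Build a datatype string such as "id int, label string" from a name -> type mapping.

    Hypothetical helper for illustration; type names are not validated.
    """
    return ", ".join(f"{name} {type_name}" for name, type_name in columns.items())


ddl = schema_to_ddl({"id": "int", "label": "string"})
print(ddl)  # id int, label string

# With an active SparkSession named `spark` (assumed, as in the examples above):
# df = spark.createDataFrame([(1, "foo"), (2, "bar")], ddl)
```

The column order follows the dict's insertion order, so the tuples in the data need to be written in the same order.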

Additionally, you can create your DataFrame from a pandas DataFrame; the schema will be inferred from the pandas DataFrame's types:

import pandas as pd
import numpy as np

pdf = pd.DataFrame(
    {
        "col1": [np.random.randint(10) for x in range(10)],
        "col2": [np.random.randint(100) for x in range(10)],
    }
)

df = spark.createDataFrame(pdf)

df.show()
+----+----+
|col1|col2|
+----+----+
|   6|   4|
|   1|  39|
|   7|   4|
|   7|  95|
|   6|   3|
|   7|  28|
|   2|  26|
|   0|   4|
|   4|  32|
+----+----+
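Since the schema comes from the pandas types, it can help to look at them before converting. The sketch below builds a similar frame and prints its dtypes; it uses NumPy's seeded `default_rng` in place of `np.random.randint` so that reruns give the same values, and it leaves the Spark call commented out because it assumes an active SparkSession named `spark`.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=0)  # seeded so reruns produce the same values
pdf = pd.DataFrame(
    {
        "col1": rng.integers(0, 10, size=10),   # integers in [0, 10)
        "col2": rng.integers(0, 100, size=10),  # integers in [0, 100)
    }
)

# Inspect the pandas dtypes that the Spark schema will be inferred from
print(pdf.dtypes)

# With an active SparkSession named `spark` (assumed):
# df = spark.createDataFrame(pdf)
# df.printSchema()
```

If a column shows up as `object` here, it probably holds mixed or non-numeric values and is worth cleaning up in pandas first.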
