pyspark.sql.SparkSession.createDataFrame

SparkSession.createDataFrame(data, schema=None, samplingRatio=None, verifySchema=True)

Creates a DataFrame from an RDD, a list, a pandas.DataFrame or a numpy.ndarray.

New in version 2.0.0.

Changed in version 3.4.0: Supports Spark Connect.
Parameters
- data : RDD or iterable
  an RDD of any kind of SQL data representation (Row, tuple, int, boolean, dict, etc.), or a list, pandas.DataFrame or numpy.ndarray.
- schema : pyspark.sql.types.DataType, str or list, optional
  a pyspark.sql.types.DataType, a datatype string, or a list of column names; default is None. The data type string format equals pyspark.sql.types.DataType.simpleString, except that the top-level struct type can omit the struct<>.
  When schema is a list of column names, the type of each column will be inferred from data.
  When schema is None, the schema (column names and types) will be inferred from data, which should be an RDD of either Row, namedtuple, or dict.
  When schema is a pyspark.sql.types.DataType or a datatype string, it must match the real data, or an exception will be thrown at runtime. If the given schema is not a pyspark.sql.types.StructType, it will be wrapped into a pyspark.sql.types.StructType as its only field, and the field name will be "value". Each record will also be wrapped into a tuple, which can be converted to a row later (see the non-struct schema example below).
- samplingRatio : float, optional
  the ratio of rows sampled for schema inference. The first few rows are used if samplingRatio is None. This option is effective only when the input is an RDD.
- verifySchema : bool, optional
  verify data types of every row against the schema. Enabled by default. When the input is a pandas.DataFrame and spark.sql.execution.arrow.pyspark.enabled is enabled, this option has no effect; Arrow type coercion applies instead. This option is not supported with Spark Connect.
  New in version 2.1.0.
Returns
DataFrame
Notes
Usage with spark.sql.execution.arrow.pyspark.enabled=True is experimental.
Examples
Create a DataFrame from a list of tuples.
>>> spark.createDataFrame([('Alice', 1)]).show()
+-----+---+
|   _1| _2|
+-----+---+
|Alice|  1|
+-----+---+
Create a DataFrame from a list of dictionaries.
>>> d = [{'name': 'Alice', 'age': 1}]
>>> spark.createDataFrame(d).show()
+---+-----+
|age| name|
+---+-----+
|  1|Alice|
+---+-----+
Create a DataFrame with column names specified.
>>> spark.createDataFrame([('Alice', 1)], ['name', 'age']).show()
+-----+---+
| name|age|
+-----+---+
|Alice|  1|
+-----+---+
Create a DataFrame with the explicit schema specified.
>>> from pyspark.sql.types import *
>>> schema = StructType([
...     StructField("name", StringType(), True),
...     StructField("age", IntegerType(), True)])
>>> spark.createDataFrame([('Alice', 1)], schema).show()
+-----+---+
| name|age|
+-----+---+
|Alice|  1|
+-----+---+
Create a DataFrame with the schema in DDL formatted string.
>>> spark.createDataFrame([('Alice', 1)], "name: string, age: int").show() +-----+---+ | name|age| +-----+---+ |Alice| 1| +-----+---+
Create an empty DataFrame. When initializing an empty DataFrame in PySpark, it’s mandatory to specify its schema, as the DataFrame lacks data from which the schema can be inferred.
>>> spark.createDataFrame([], "name: string, age: int").show() +----+---+ |name|age| +----+---+ +----+---+
Create a DataFrame from Row objects.
>>> from pyspark.sql import Row
>>> Person = Row('name', 'age')
>>> df = spark.createDataFrame([Person("Alice", 1)])
>>> df.show()
+-----+---+
| name|age|
+-----+---+
|Alice|  1|
+-----+---+
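Create a DataFrame from an RDD of tuples. This is a sketch assuming a classic (non-Connect) session where spark.sparkContext is available; with RDD input and no explicit schema, samplingRatio controls how many rows are sampled for schema inference.

>>> rdd = spark.sparkContext.parallelize([('Alice', 1), ('Bob', 2)])
>>> spark.createDataFrame(rdd, ['name', 'age']).show()
+-----+---+
| name|age|
+-----+---+
|Alice|  1|
|  Bob|  2|
+-----+---+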
Create a DataFrame from a pandas DataFrame.
>>> import pandas
>>> spark.createDataFrame(df.toPandas()).show()
+-----+---+
| name|age|
+-----+---+
|Alice|  1|
+-----+---+
>>> spark.createDataFrame(pandas.DataFrame([[1, 2]])).show()
+---+---+
|  0|  1|
+---+---+
|  1|  2|
+---+---+
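A minimal sketch of the verifySchema behavior described above, assuming plain local (non-Arrow) input: with the default verifySchema=True, a row that does not match the declared schema is rejected with an exception, while verifySchema=False skips the per-row check.

>>> try:
...     spark.createDataFrame([('Alice', 'one')], "name: string, age: int")
... except Exception:
...     print("schema mismatch rejected")
schema mismatch rejected
>>> spark.createDataFrame([('Alice', 1)], "name: string, age: int",
...                       verifySchema=False).show()  # per-row verification skipped
+-----+---+
| name|age|
+-----+---+
|Alice|  1|
+-----+---+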