pyspark.pandas.read_spark_io
- pyspark.pandas.read_spark_io(path=None, format=None, schema=None, index_col=None, **options)
Load a DataFrame from a Spark data source.
- Parameters
- path : string, optional
Path to the data source.
- format : string, optional
Specifies the input data source format. Some common ones are:
‘delta’
‘parquet’
‘orc’
‘json’
‘csv’
- schema : string or StructType, optional
Input schema. If None, Spark tries to infer the schema automatically. The schema can be either a Spark StructType or a DDL-formatted string such as col0 INT, col1 DOUBLE.
- index_col : str or list of str, optional, default: None
Name(s) of the column(s) in the Spark data source to use as the index.
- options : dict
All other options are passed directly to Spark’s data source.
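As a minimal sketch of how the schema and options parameters fit together, the helpers below build a DDL-formatted schema string from (name, type) pairs and forward any extra keyword arguments to the data source. The names ddl_schema and read_with_schema are hypothetical and not part of pyspark; the sketch assumes pyspark is installed and that a Spark session can be started when read_with_schema is called.

```python
def ddl_schema(columns):
    """Join (name, type) pairs into a DDL-formatted schema string.

    Hypothetical helper for illustration; not part of pyspark.
    """
    return ", ".join(f"{name} {dtype}" for name, dtype in columns)


def read_with_schema(path, columns, fmt="parquet", index_col=None, **options):
    """Read `path` with an explicit schema instead of relying on inference.

    Hypothetical wrapper; extra keyword arguments (for example lineSep="__"
    for JSON) are passed through to Spark's data source unchanged.
    """
    # Imported lazily so that ddl_schema() can be used without Spark.
    import pyspark.pandas as ps

    return ps.read_spark_io(
        path,
        format=fmt,
        schema=ddl_schema(columns),
        index_col=index_col,
        **options,
    )


print(ddl_schema([("col0", "INT"), ("col1", "DOUBLE")]))  # col0 INT, col1 DOUBLE
```

Under these assumptions, a call such as read_with_schema(p, [("id", "long")], fmt="json", lineSep="__"), with p standing for a file path, would correspond to the JSON example in the Examples section.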
See also
pyspark.pandas.read_table
pyspark.pandas.read_delta
pyspark.pandas.read_parquet
Examples
>>> ps.range(1).spark.to_spark_io('%s/read_spark_io/data.parquet' % path)
>>> ps.read_spark_io(
...     '%s/read_spark_io/data.parquet' % path, format='parquet', schema='id long')
   id
0   0
>>> ps.range(10, 15, num_partitions=1).spark.to_spark_io('%s/read_spark_io/data.json' % path,
...                                                      format='json', lineSep='__')
>>> ps.read_spark_io(
...     '%s/read_spark_io/data.json' % path, format='json', schema='id long', lineSep='__')
   id
0  10
1  11
2  12
3  13
4  14
You can preserve the index across the round trip by passing index_col to both the writer and the reader, as below.
>>> ps.range(10, 15, num_partitions=1).spark.to_spark_io('%s/read_spark_io/data.orc' % path,
...                                                      format='orc', index_col="index")
>>> ps.read_spark_io(
...     path=r'%s/read_spark_io/data.orc' % path, format="orc", index_col="index")
       id
index
0      10
1      11
2      12
3      13
4      14