pyspark.sql.DataFrame.dropDuplicates
DataFrame.dropDuplicates(subset=None)

Return a new DataFrame with duplicate rows removed, optionally considering only certain columns.

For a static batch DataFrame, it just drops duplicate rows. For a streaming DataFrame, it keeps all data across triggers as intermediate state in order to drop duplicate rows. You can use withWatermark() to limit how late duplicate data can arrive, and the system will limit the state accordingly. In addition, data older than the watermark is dropped to avoid any possibility of duplicates.

drop_duplicates() is an alias for dropDuplicates().

New in version 1.4.0.
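The streaming behaviour described above (unbounded state without a watermark, bounded state and dropped late rows with one) can be illustrated with a small pure-Python model. This is only a conceptual sketch, not Spark's implementation: the class `WatermarkedDeduplicator`, its `delay_seconds` parameter, and the `(key, event_time)` row shape are all hypothetical, and it assumes the watermark advances once per trigger to the newest event time seen minus the allowed delay.

```python
class WatermarkedDeduplicator:
    """Toy model of streaming deduplication with a watermark.

    Hypothetical helper for illustration only; it is not part of PySpark.
    """

    def __init__(self, delay_seconds):
        self.delay_seconds = delay_seconds
        self.watermark = None   # no watermark until the first trigger completes
        self.state = {}         # key -> event time of the first row kept

    def process_trigger(self, rows):
        """Process one micro-batch of (key, event_time) rows and return the rows emitted."""
        rows = list(rows)
        emitted = []
        for key, event_time in rows:
            # Rows older than the current watermark are dropped outright.
            if self.watermark is not None and event_time < self.watermark:
                continue
            # Rows whose key is still held in state are duplicates.
            if key in self.state:
                continue
            self.state[key] = event_time
            emitted.append((key, event_time))

        # Assumption: the watermark only moves forward, after the batch is processed.
        if rows:
            candidate = max(t for _, t in rows) - self.delay_seconds
            if self.watermark is None or candidate > self.watermark:
                self.watermark = candidate

        # Evict state that the watermark has passed; this is what bounds memory.
        if self.watermark is not None:
            self.state = {k: t for k, t in self.state.items() if t >= self.watermark}
        return emitted


dedup = WatermarkedDeduplicator(delay_seconds=10)
first = dedup.process_trigger([("a", 100), ("a", 101), ("b", 105)])
second = dedup.process_trigger([("a", 103), ("c", 90), ("d", 120)])
print(first)        # rows emitted by the first trigger
print(second)       # rows emitted by the second trigger
print(dedup.state)  # keys still tracked after eviction
```

In PySpark itself the corresponding call chain would look something like `stream_df.withWatermark("event_time", "10 seconds").dropDuplicates(["id", "event_time"])`, where `stream_df`, `id`, and `event_time` are placeholder names.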
Examples
>>> from pyspark.sql import Row
>>> df = sc.parallelize([
...     Row(name='Alice', age=5, height=80),
...     Row(name='Alice', age=5, height=80),
...     Row(name='Alice', age=10, height=80)]).toDF()
>>> df.dropDuplicates().show()
+-----+---+------+
| name|age|height|
+-----+---+------+
|Alice|  5|    80|
|Alice| 10|    80|
+-----+---+------+

>>> df.dropDuplicates(['name', 'height']).show()
+-----+---+------+
| name|age|height|
+-----+---+------+
|Alice|  5|    80|
+-----+---+------+
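To make the effect of `subset` concrete, the following pure-Python sketch mimics the two calls above on a list of dictionaries. The function `drop_duplicates` here is a hypothetical stand-in written for illustration, not the PySpark method, and it assumes the first row seen for each key is the one kept; Spark itself does not promise which of several duplicate rows survives.

```python
def drop_duplicates(rows, subset=None):
    """Keep the first row for each distinct combination of the chosen columns.

    Illustrative stand-in only; `rows` is a list of dicts sharing the same keys.
    """
    seen = set()
    result = []
    for row in rows:
        # With no subset, every column takes part in the comparison.
        columns = subset if subset is not None else sorted(row)
        key = tuple(row[c] for c in columns)
        if key not in seen:
            seen.add(key)
            result.append(row)
    return result


rows = [
    {"name": "Alice", "age": 5, "height": 80},
    {"name": "Alice", "age": 5, "height": 80},
    {"name": "Alice", "age": 10, "height": 80},
]

all_columns = drop_duplicates(rows)
by_name_height = drop_duplicates(rows, subset=["name", "height"])
print(len(all_columns))     # number of rows left when all columns are compared
print(len(by_name_height))  # number of rows left when only name and height are compared
```

Comparing on all columns removes only the exact repeat, while restricting the comparison to `name` and `height` collapses all three rows into one because `age` is ignored.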