【Posted】: 2019-05-14 18:17:15
【Question】:
How can I count the number of unique elements in each column of a pyspark DataFrame:
import pandas as pd
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
df = pd.DataFrame([[1, 100], [1, 200], [2, 300], [3, 100], [4, 100], [4, 300]], columns=['col1', 'col2'])
df_spark = spark.createDataFrame(df)
df_spark.show()  # show() prints the table itself and returns None
# +----+----+
# |col1|col2|
# +----+----+
# |   1| 100|
# |   1| 200|
# |   2| 300|
# |   3| 100|
# |   4| 100|
# |   4| 300|
# +----+----+
# Some transformations on df_spark here
# How to get the number of unique elements (just a number) in each column?
The only solution I know is the following, which is very slow; both lines take about the same time to compute:
col1_num_unique = df_spark.select('col1').distinct().count()
col2_num_unique = df_spark.select('col2').distinct().count()
df_spark has about 10 million rows.
【Comments】:
-
Possible duplicate of Spark DataFrame: count distinct values of every column. Basically you can do
df_spark.select(*[countDistinct(c).alias(c) for c in df_spark.columns]), but keep in mind that this is an expensive operation, and consider whether pyspark.sql.functions.approxCountDistinct() would suit you.
Tags: python apache-spark dataframe pyspark apache-spark-sql