[Posted]: 2021-09-07 15:50:48
[Question]:
Suppose I have the following df:
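from pyspark.sql import SparkSession

# Reuse the active session if there is one, otherwise create it
spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame([
    ("a", "apple"),
    ("a", "pear"),
    ("b", "pear"),
    ("c", "carrot"),
    ("c", "apple"),
], ["id", "fruit"])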
+---+-------+
| id|  fruit|
+---+-------+
|  a|  apple|
|  a|   pear|
|  b|   pear|
|  c| carrot|
|  c|  apple|
+---+-------+
I now want to create a boolean flag that is TRUE for every id that has at least one row with "pear" in the fruit column.
The desired output looks like this:
+---+-------+------+
| id|  fruit|  flag|
+---+-------+------+
|  a|  apple|  True|
|  a|   pear|  True|
|  b|   pear|  True|
|  c| carrot| False|
|  c|  apple| False|
+---+-------+------+
For pandas, I found a solution that uses groupby().transform(), but I don't understand how to translate it to PySpark.
[Discussion]:
Tags: python dataframe pyspark group-by