【Question title】: Check whether a value is found within a group in a PySpark dataframe
【Posted】: 2021-09-07 15:50:48
【Question description】:

Suppose I have the following df:

df = spark.createDataFrame([
  ("a", "apple"),
  ("a", "pear"),
  ("b", "pear"),
  ("c", "carrot"),
  ("c", "apple"),
], ["id", "fruit"])

+---+-------+
| id|  fruit|
+---+-------+
|  a|  apple|
|  a|   pear|
|  b|   pear|
|  c| carrot|
|  c|  apple| 
+---+-------+

I now want to create a boolean flag that is TRUE for every id that has at least one row with "pear" in the fruit column.

The desired output looks like this:

+---+-------+------+
| id|  fruit|  flag|
+---+-------+------+
|  a|  apple|  True|
|  a|   pear|  True|
|  b|   pear|  True|
|  c| carrot| False|
|  c|  apple| False|
+---+-------+------+

For pandas I found a solution that uses groupby().transform(), but I don't understand how to translate it to PySpark.
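For reference, the pandas idiom mentioned above probably looks something like the following. This is only a sketch under assumptions: the name `pdf` and the exact chain of calls are illustrative and not taken from the linked pandas answer.

```python
import pandas as pd

# Same sample data as the Spark dataframe in the question (pdf is a hypothetical name)
pdf = pd.DataFrame({
    "id": ["a", "a", "b", "c", "c"],
    "fruit": ["apple", "pear", "pear", "carrot", "apple"],
})

# Mark rows where fruit is "pear", then broadcast "any match in this id group"
# back onto every row of the group
pdf["flag"] = pdf["fruit"].eq("pear").groupby(pdf["id"]).transform("any")
```

The key point is that `transform` keeps the original row count, which is the behavior a window function provides in Spark.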

【Question comments】:

    Tags: python dataframe pyspark group-by


    【Solution 1】:

    Use max as a window function:

    df.selectExpr("*", "max(fruit = 'pear') over (partition by id) as flag").show()
    
    +---+------+-----+
    | id| fruit| flag|
    +---+------+-----+
    |  c|carrot|false|
    |  c| apple|false|
    |  b|  pear| true|
    |  a| apple| true|
    |  a|  pear| true|
    +---+------+-----+
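Why this works: within each id partition the comparison yields a boolean per row, and since false sorts before true, the maximum is true as soon as one row matches. The plain-Python sketch below mimics that logic without Spark; `flag_groups` and `rows` are hypothetical names used only for illustration.

```python
from collections import defaultdict

rows = [("a", "apple"), ("a", "pear"), ("b", "pear"), ("c", "carrot"), ("c", "apple")]

def flag_groups(rows, wanted):
    """Mimic max(fruit IN wanted) OVER (PARTITION BY id) in plain Python."""
    group_max = defaultdict(bool)  # ids not seen yet start at False
    for id_, fruit in rows:
        # max(False, True) is True, so a single match flips the whole group
        group_max[id_] = max(group_max[id_], fruit in wanted)
    # Attach each group's result to every row, as a window function does
    return [(id_, fruit, group_max[id_]) for id_, fruit in rows]

flagged = flag_groups(rows, {"pear"})
```

Unlike Spark's output, this sketch keeps the rows in their input order.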
    

    If you need to check for several fruits, you can use the in operator. For example, to check for carrot and apple:

    df.selectExpr("*", "max(fruit in ('carrot', 'apple')) over (partition by id) as flag").show()
    +---+------+-----+
    | id| fruit| flag|
    +---+------+-----+
    |  c|carrot| true|
    |  c| apple| true|
    |  b|  pear|false|
    |  a| apple| true|
    |  a|  pear| true|
    +---+------+-----+
    

    If you prefer the Python syntax:

    from pyspark.sql.window import Window
    import pyspark.sql.functions as f
    
    # max() over booleans is true when at least one row in the id partition matches
    df.select("*",
      f.max(
        f.col('fruit').isin(['carrot', 'apple'])
      ).over(Window.partitionBy('id')).alias('flag')
    ).show()
    +---+------+-----+
    | id| fruit| flag|
    +---+------+-----+
    |  c|carrot| true|
    |  c| apple| true|
    |  b|  pear|false|
    |  a| apple| true|
    |  a|  pear| true|
    +---+------+-----+
    

    【Comments】:

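    • What if I have a list of fruits rather than just "pear"?
    • You can use the in operator to check against multiple values.
    • Can you also use a Python list?
    • Yes, I have added a Python-syntax alternative. Just replace the example ['carrot', 'apple'] with your actual list.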