【Question Title】: Use Group By and Aggregate Function in pyspark?
【Posted】: 2022-11-19 22:59:32
【Question Description】:

I am looking for a way to use Group By together with aggregate functions in PySpark. My DataFrame looks like this:

df = sc.parallelize([
    ('23-09-2020', 'CRICKET'),
    ('25-11-2020', 'CRICKET'),
    ('13-09-2021', 'FOOTBALL'),
    ('20-11-2021', 'BASKETBALL'),
    ('12-12-2021', 'FOOTBALL')]).toDF(['DATE', 'SPORTS_INTERESTED'])

I want to group by the SPORTS_INTERESTED column and select the minimum date from the DATE column. This is the query I am using:

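from pyspark.sql import functions as F
from pyspark.sql.functions import col, count
# Count rows per sport, take the minimum DATE, and keep sports that appear more than once
df = df.groupby('SPORTS_INTERESTED').agg(count('SPORTS_INTERESTED').alias('FIRST_COUNT'), F.min('DATE').alias('MIN_OF_DATE_COLUMN')).filter(col('FIRST_COUNT') > 1)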

But when I run the query above, it returns the MAX date instead of the MIN date in the output, and I don't know why.

Expected output:

+-----------------+------------------+
|SPORTS_INTERESTED|MIN_OF_DATE_COLUMN|
+-----------------+------------------+
|CRICKET          |23-09-2020        |
|FOOTBALL         |13-09-2021        |
+-----------------+------------------+

The output I am getting:

+-----------------+------------------+
|SPORTS_INTERESTED|MIN_OF_DATE_COLUMN|
+-----------------+------------------+
|CRICKET          |25-11-2020        |
|FOOTBALL         |12-12-2021        |
+-----------------+------------------+

Both columns are of string data type.

【Question Discussion】:

    Tags: python apache-spark pyspark databricks


    【Solution 1】:

    First, convert the strings to a date type, then apply min:

    import pyspark.sql.functions as F
    
    df = spark.createDataFrame(data=[
        ('23-09-2020', 'CRICKET'),
        ('25-11-2020', 'CRICKET'),
        ('13-09-2021', 'FOOTBALL'),
        ('20-11-2021', 'BASKETBALL'),
        ('12-12-2021', 'FOOTBALL')    
    ], schema=['DATE', 'SPORTS_INTERESTED'])
    
    df = df.withColumn("DATE", F.to_date("DATE", format="dd-MM-yyyy"))
    df = df.groupBy("SPORTS_INTERESTED").agg(F.min("DATE").alias("MIN_OF_DATE"))
    df.show(truncate=False)
    
    [Out]:
    +-----------------+-----------+
    |SPORTS_INTERESTED|MIN_OF_DATE|
    +-----------------+-----------+
    |BASKETBALL       |2021-11-20 |
    |FOOTBALL         |2021-09-13 |
    |CRICKET          |2020-09-23 |
    +-----------------+-----------+
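
To see why the conversion matters, here is a minimal plain-Python sketch (no Spark required) that mimics the grouping, the count filter, and the two ways of taking a minimum. It is only an illustration: the helper names `min_as_string` and `min_as_date` are made up for this example, and it assumes every date follows the dd-MM-yyyy layout from the question. A string minimum compares characters from the left, so the day digits decide the result before the year or month is ever looked at.

```python
from collections import defaultdict
from datetime import datetime

rows = [
    ('23-09-2020', 'CRICKET'),
    ('25-11-2020', 'CRICKET'),
    ('13-09-2021', 'FOOTBALL'),
    ('20-11-2021', 'BASKETBALL'),
    ('12-12-2021', 'FOOTBALL'),
]

FMT = '%d-%m-%Y'  # Python equivalent of the "dd-MM-yyyy" pattern used above

# Collect the raw date strings per sport, similar to groupBy("SPORTS_INTERESTED").
groups = defaultdict(list)
for date_str, sport in rows:
    groups[sport].append(date_str)


def min_as_string(dates):
    # Hypothetical helper: lexicographic minimum, day digits are compared first.
    return min(dates)


def min_as_date(dates):
    # Hypothetical helper: parse, take the chronological minimum, format back.
    return min(datetime.strptime(d, FMT) for d in dates).strftime(FMT)


# Keep only sports that occur more than once, similar to filter(FIRST_COUNT > 1).
string_result = {s: min_as_string(d) for s, d in groups.items() if len(d) > 1}
date_result = {s: min_as_date(d) for s, d in groups.items() if len(d) > 1}

print(string_result)
print(date_result)
```

For FOOTBALL the string minimum is 12-12-2021 because "12" sorts before "13", while the chronological minimum is 13-09-2021. Parsing to a real date before aggregating removes that dependence on the text layout.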
    

    【Discussion】:
