【Question title】: Problem using 'window' function to group by day in PySpark
【Posted】: 2019-02-07 22:33:13
【Question】:

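I have a dataset that I need to resample. To do that, I need to group it by day while computing the median for each sensor. I am using the window function, but it returns only a single sample.

This is the dataset: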

+--------+-------------+-------------------+------+------------------+
|Variable|  Sensor Name|          Timestamp| Units|             Value|
+--------+-------------+-------------------+------+------------------+
|     NO2|aq_monitor914|2018-10-07 23:15:00|ugm -3|0.9945200000000001|
|     NO2|aq_monitor914|2018-10-07 23:30:00|ugm -3|1.1449200000000002|
|     NO2|aq_monitor914|2018-10-07 23:45:00|ugm -3|           1.13176|
|     NO2|aq_monitor914|2018-10-08 00:00:00|ugm -3|            0.9212|
|     NO2|aq_monitor914|2018-10-08 00:15:00|ugm -3|           1.39872|
|     NO2|aq_monitor914|2018-10-08 00:30:00|ugm -3|           1.51528|
|     NO2|aq_monitor914|2018-10-08 00:45:00|ugm -3|           1.61116|
|     NO2|aq_monitor914|2018-10-08 01:00:00|ugm -3|           1.59612|
|     NO2|aq_monitor914|2018-10-08 01:15:00|ugm -3|           1.12612|
|     NO2|aq_monitor914|2018-10-08 01:30:00|ugm -3|           1.04528|
+--------+-------------+-------------------+------+------------------+

I need to resample by day, computing the median of the 'Value' column for each day. I am using the following code:

import pyspark.sql.functions as psf
from pyspark.sql.functions import window

magic_percentile = psf.expr('percentile_approx(Value, 0.5)')  # approximate median of the 'Value' column

data = data.groupby('Variable', 'Sensor Name', window('Timestamp', "1 day")).agg(magic_percentile.alias('Value'))

The problem, however, is that this returns only the following DataFrame:

+--------+-------------+--------------------+-------+
|Variable|  Sensor Name|              window|  Value|
+--------+-------------+--------------------+-------+
|     NO2|aq_monitor914|[2018-10-07 21:00...|1.13176|
+--------+-------------+--------------------+-------+

Details of the 'window' column:

window=Row(start=datetime.datetime(2018, 10, 7, 21, 0), end=datetime.datetime(2018, 10, 8, 21, 0))

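As I understand window, it should create a one-day window for each timestamp, e.g. 2018-10-07 23:15:00 should become 2018-10-07, group the readings by variable, sensor name and that day, and then compute the median. I am really confused about how to do this.

【Comments】:

    Tags: python apache-spark pyspark pyspark-sql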


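    【Solution 1】:

    I believe you do not need to use a Window to achieve what you want. You would need one if, for example, you wanted to aggregate over the days preceding each given date. In your case it is enough to convert the datetime column to a date and use that in the groupBy statement. A working example is given below, hope this helps!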
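The idea can be sketched without Spark first: truncate each timestamp to its calendar date, group on that date, and take the median per group. The snippet below is only an illustration using the standard library and the first six rows of the question's data. The helper name `median_by_day` is hypothetical, and `median_low` is used as a stand-in for `percentile_approx(Value, 0.5)` on the assumption that, for a small group, both return an actual element of the data.

```python
from collections import defaultdict
from datetime import datetime
from statistics import median_low

# First six rows of the question's data: (Variable, Sensor Name, Timestamp, Value)
rows = [
    ("NO2", "aq_monitor914", "2018-10-07 23:15:00", 0.9945200000000001),
    ("NO2", "aq_monitor914", "2018-10-07 23:30:00", 1.1449200000000002),
    ("NO2", "aq_monitor914", "2018-10-07 23:45:00", 1.13176),
    ("NO2", "aq_monitor914", "2018-10-08 00:00:00", 0.9212),
    ("NO2", "aq_monitor914", "2018-10-08 00:15:00", 1.39872),
    ("NO2", "aq_monitor914", "2018-10-08 00:30:00", 1.51528),
]


def median_by_day(rows):
    """Group rows by (variable, sensor, calendar date) and take the low median."""
    groups = defaultdict(list)
    for variable, sensor, ts, value in rows:
        # Truncating the timestamp to a date is the whole "resampling" step.
        day = datetime.strptime(ts, "%Y-%m-%d %H:%M:%S").date()
        groups[(variable, sensor, day)].append(value)
    return {key: median_low(values) for key, values in groups.items()}


daily = median_by_day(rows)
for (variable, sensor, day), value in sorted(daily.items()):
    print(variable, sensor, day, value)
```

The PySpark version of the same grouping follows.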

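    import pyspark.sql.functions as psf
    from pyspark.sql import SparkSession
    
    spark = SparkSession.builder.getOrCreate()  # reuses the active session if one exists
    df = spark.createDataFrame(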
        [
         ('NO2','aq_monitor914','2018-10-07 23:15:00',0.9945200000000001),
         ('NO2','aq_monitor914','2018-10-07 23:30:00',1.1449200000000002),
         ('NO2','aq_monitor914','2018-10-07 23:45:00',1.13176),
         ('NO2','aq_monitor914','2018-10-08 00:00:00',0.9212),
         ('NO2','aq_monitor914','2018-10-08 00:15:00',1.39872),
         ('NO2','aq_monitor914','2018-10-08 00:30:00',1.51528)
        ],
        ("Variable","Sensor Name","Timestamp","Value")
    )
    df = df.withColumn('Timestamp',psf.to_timestamp("Timestamp", "yyyy-MM-dd HH:mm:ss"))
    df.show()
    
    magic_percentile = psf.expr('percentile_approx(Value, 0.5)')
    df_agg = df.groupBy('Variable','Sensor Name',psf.to_date('Timestamp').alias('Day')).agg(magic_percentile.alias('Value'))
    df_agg.show()
    

    Input:

    +--------+-------------+-------------------+------------------+
    |Variable|  Sensor Name|          Timestamp|             Value|
    +--------+-------------+-------------------+------------------+
    |     NO2|aq_monitor914|2018-10-07 23:15:00|0.9945200000000001|
    |     NO2|aq_monitor914|2018-10-07 23:30:00|1.1449200000000002|
    |     NO2|aq_monitor914|2018-10-07 23:45:00|           1.13176|
    |     NO2|aq_monitor914|2018-10-08 00:00:00|            0.9212|
    |     NO2|aq_monitor914|2018-10-08 00:15:00|           1.39872|
    |     NO2|aq_monitor914|2018-10-08 00:30:00|           1.51528|
    +--------+-------------+-------------------+------------------+
    

    Output:

    +--------+-------------+----------+-------+
    |Variable|  Sensor Name|       Day|  Value|
    +--------+-------------+----------+-------+
    |     NO2|aq_monitor914|2018-10-07|1.13176|
    |     NO2|aq_monitor914|2018-10-08|1.39872|
    +--------+-------------+----------+-------+
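
As for why the original `window('Timestamp', "1 day")` call produced a single bucket starting at 21:00: fixed-size time windows of this kind are aligned to 1970-01-01 00:00:00 UTC, not to local midnight. The plain-Python sketch below reproduces that arithmetic. It assumes, since the question does not say, that the session time zone is UTC-3, which is what a 21:00 local boundary would imply; `tumbling_window` is a hypothetical helper, not a Spark function.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def tumbling_window(ts, duration, start_offset=timedelta(0)):
    """Return (start, end) of the fixed-size bucket that contains ts.

    Buckets are aligned to the Unix epoch in UTC, shifted by start_offset.
    """
    elapsed = ts - EPOCH - start_offset
    start = ts - (elapsed % duration)
    return start, start + duration


# Assumption: the session time zone is UTC-3 (not stated in the question).
local = timezone(timedelta(hours=-3))
one_day = timedelta(days=1)

first = datetime(2018, 10, 7, 23, 15, tzinfo=local)
last = datetime(2018, 10, 8, 1, 30, tzinfo=local)

# Without an offset, both readings land in the same 21:00-to-21:00 bucket.
print(tumbling_window(first, one_day))
print(tumbling_window(last, one_day))

# Shifting the alignment by 3 hours makes buckets start at local midnight.
print(tumbling_window(first, one_day, start_offset=timedelta(hours=3)))
print(tumbling_window(last, one_day, start_offset=timedelta(hours=3)))
```

In PySpark itself, `window` accepts a `startTime` argument for this kind of offset, so `window('Timestamp', '1 day', startTime='3 hours')` would be the corresponding call under the same UTC-3 assumption. Grouping on `to_date` as shown above avoids the offset bookkeeping altogether.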
    

    【Comments】:
