【Posted】: 2020-12-05 20:09:43
【Question】:
I have:
from pyspark.sql import functions as F
from pyspark.sql.window import Window
df = spark.createDataFrame([(17, "2017-03-10T15:27:18+00:00",'Store 1'),
(13, "2017-04-15T12:27:18+00:00",'Store 1'),
(25, "2017-05-18T11:27:18+00:00",'Store 1'),
(18, "2017-05-19T11:27:18+00:00",'Store 1'),
(13, "2017-03-15T12:27:18+00:00",'Store 2'),
(25, "2017-05-18T11:27:18+00:00",'Store 2'),
(25, "2017-08-18T11:27:18+00:00",'Store 2')],
["dollars", "timestampGMT",'Store'])
df = df.withColumn('timestampGMT', df.timestampGMT.cast('timestamp'))
dollars timestampGMT Store
17 2017-03-10 15:27:18 Store 1
13 2017-04-15 12:27:18 Store 1
25 2017-05-18 11:27:18 Store 1
18 2017-05-19 11:27:18 Store 1
13 2017-03-15 12:27:18 Store 2
25 2017-05-18 11:27:18 Store 2
25 2017-08-18 11:27:18 Store 2
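I want the average over the past 3 months, grouped by store (only if data exists for all of the last 3 months; otherwise 0). The desired result: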
dollars timestampGMT Store Last_3_months_Average
17 2017-03-10 15:27:18 Store 1 0
13 2017-04-15 12:27:18 Store 1 0
25 2017-05-18 11:27:18 Store 1 18.25
18 2017-05-19 11:27:18 Store 1 18.25
13 2017-03-15 12:27:18 Store 2 0
25 2017-05-18 11:27:18 Store 2 0
25 2017-08-18 11:27:18 Store 2 0
25 2017-08-19 11:27:18 Store 2 0
I'm not sure how to approach this. Should I group by month first?
【Comments】:
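- @Lamanus Unfortunately, that isn't enough. It would give a rolling average, but I can't use it to set the value to 0 when I don't have 3 consecutive months of data.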
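One way to pin down the rule being asked for is a plain-Python sketch, shown here only to make the semantics concrete. It assumes "last 3 months" means the row's calendar month plus the two calendar months before it, and that the average covers every row of the store in those months (this reading reproduces the 18.25 in the desired output). The helper names `month_index` and `last_3_months_average` are hypothetical and not part of any library.

```python
from collections import defaultdict
from datetime import datetime

rows = [
    (17, "2017-03-10T15:27:18+00:00", "Store 1"),
    (13, "2017-04-15T12:27:18+00:00", "Store 1"),
    (25, "2017-05-18T11:27:18+00:00", "Store 1"),
    (18, "2017-05-19T11:27:18+00:00", "Store 1"),
    (13, "2017-03-15T12:27:18+00:00", "Store 2"),
    (25, "2017-05-18T11:27:18+00:00", "Store 2"),
    (25, "2017-08-18T11:27:18+00:00", "Store 2"),
]


def month_index(ts):
    """Map an ISO timestamp to a running month number (year * 12 + month)."""
    d = datetime.fromisoformat(ts)
    return d.year * 12 + d.month


def last_3_months_average(rows):
    """Hypothetical helper: append the 3-calendar-month average, or 0, to each row."""
    # Collect the dollar amounts per (store, month index).
    by_month = defaultdict(list)
    for dollars, ts, store in rows:
        by_month[(store, month_index(ts))].append(dollars)

    result = []
    for dollars, ts, store in rows:
        m = month_index(ts)
        # Amounts for the current month and the two months before it.
        window = [by_month.get((store, k), []) for k in (m - 2, m - 1, m)]
        if all(window):
            # All three months have data: average every amount in the window.
            values = [v for month in window for v in month]
            avg = sum(values) / len(values)
        else:
            # At least one of the three months is missing for this store.
            avg = 0
        result.append((dollars, ts, store, avg))
    return result


for row in last_3_months_average(rows):
    print(row)
```

In PySpark the same idea could plausibly be expressed with a month-index column and a window such as `Window.partitionBy('Store').orderBy('month_index').rangeBetween(-2, 0)`, combining `F.avg` with a count of distinct months in the window and `F.when(...).otherwise(0)`. That mapping is an untested suggestion, not a confirmed solution from this thread.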