Rolling Median in a Pandas Pivot Table: Answers

【Question title】: Rolling Median in a Pandas Pivot Table
【Posted】: 2021-08-24 22:05:23
【Question】:

I am trying to compute a rolling median as an aggregation function on a pandas DataFrame. Here is some sample data:

import pandas as pd
import numpy as np

d = {'date': ['2020-01-01','2020-02-01','2020-03-01','2020-01-01','2020-02-01','2020-02-01','2020-03-01','2020-02-01','2020-03-01','2020-03-01','2020-03-01','2020-03-01','2020-03-01'],
     'count': [1,1,1,2,2,3,3,3,4,3,3,3,1], 
     'type': ['type1','type2','type3','type1','type3','type1','type2','type2','type2','type3','type1','type2','type1'],
     'salary':[1000,2000,3000,10000,15000,30000,100000,50000,25000,10000,25000,30000,40000]}
df: pd.DataFrame = pd.DataFrame(data=d)

df_pvt: pd.DataFrame = df.pivot_table(index='date',
                                      columns='type',
                                      aggfunc={'salary': np.median})
df_pvt.head(5)

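I would like to apply the pandas rolling(2).median() function to salary to get a rolling median.

How can I plug this kind of window function into the aggregation function of a pivot table?

My goal is to aggregate a large amount of numeric data by date, take a rolling median over a window of variable length, and report it in the resulting pivot table. I am not sure how to pass such a function to aggfunc or something similar.

The expected output is sorted by date in ascending order; for each date it takes all observations belonging to the two-month window and finds their median.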

For type1, we have:

          date  count   type  salary
0   2020-01-01      1  type1    1000
3   2020-01-01      2  type1   10000
5   2020-02-01      3  type1   30000
10  2020-03-01      3  type1   25000
12  2020-03-01      1  type1   40000

So, for type1, the expected output of rolling(2) would be:

             salary
type          type1
date
2020-01-01      NaN
2020-02-01  10000.0
2020-03-01  30000.0

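The logic is as follows: the rolling window covering the first 2 months contains the data points 1000, 10000 and 30000, which gives a median of 10000.

For 2020-03-01, the rolling window of 2 contains 30000, 25000 and 40000, so the median should be 30000.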
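
The window arithmetic described above can be checked by hand. The following is only a minimal sketch: it hard-codes the type1 rows from the sample data and assumes the months present are consecutive, so "the previous month" is simply the previous distinct date. The names `type1`, `months` and `expected` are illustrative and not part of the question.

```python
import numpy as np
import pandas as pd

# type1 rows copied from the sample data in the question
type1 = pd.DataFrame({
    'date': pd.to_datetime(['2020-01-01', '2020-01-01', '2020-02-01',
                            '2020-03-01', '2020-03-01']),
    'salary': [1000, 10000, 30000, 25000, 40000],
})

# distinct months in ascending order (assumed to be consecutive)
months = type1['date'].drop_duplicates().sort_values().tolist()

expected = {}
for prev, cur in zip(months[:-1], months[1:]):
    # pool every observation from the current month and the month before it
    window = type1.loc[type1['date'].isin([prev, cur]), 'salary']
    expected[cur] = float(np.median(window))

for month, median in expected.items():
    print(month.date(), median)
```

This prints 10000.0 for 2020-02-01 and 30000.0 for 2020-03-01. The first month has no entry because it has no preceding month, which corresponds to the NaN in the expected output.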

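【Comments】:

There may be a simpler, if less fancy, way. You could try: 1. get all the distinct date values, 2. for each date, filter the salaries matching that date, 3. compute the median, and 4. build a list of the medians to add to the pandas table.
@Ben.T Updated with a more detailed answer and updated the data to make it simpler.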
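
The four steps suggested in the first comment could look roughly like the sketch below. It is not the commenter's code: the `window` parameter, the per-type grouping, and the final NaN masking are assumptions added so that the result matches the rolling(2) behaviour the question asks for. The `count` column is left out because it is not needed.

```python
import numpy as np
import pandas as pd

d = {'date': ['2020-01-01','2020-02-01','2020-03-01','2020-01-01','2020-02-01',
              '2020-02-01','2020-03-01','2020-02-01','2020-03-01','2020-03-01',
              '2020-03-01','2020-03-01','2020-03-01'],
     'type': ['type1','type2','type3','type1','type3','type1','type2','type2',
              'type2','type3','type1','type2','type1'],
     'salary': [1000,2000,3000,10000,15000,30000,100000,50000,25000,10000,
                25000,30000,40000]}
df = pd.DataFrame(d)
df['date'] = pd.to_datetime(df['date'])

window = 2  # hypothetical parameter: number of calendar months pooled per row

# 1. distinct dates, ascending
dates = df['date'].drop_duplicates().sort_values().tolist()

medians = {}
for cur in dates:
    # 2. keep the rows whose date falls inside the window ending at `cur`
    start = cur - pd.DateOffset(months=window - 1)
    in_window = df[(df['date'] >= start) & (df['date'] <= cur)]
    # 3. median salary per type within that window
    medians[cur] = in_window.groupby('type')['salary'].median()

# 4. assemble the per-date medians into a table: dates as rows, types as columns
res = pd.DataFrame(medians).T
res.index.name = 'date'

# mimic rolling(): the first window-1 dates have no full window, so set them to NaN
# (assumes the dates present are consecutive months)
res.iloc[:window - 1] = np.nan
print(res)
```

The loop filters the frame once per date, so it trades speed for readability; for a handful of months that is unlikely to matter.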

Tags: python pandas dataframe numpy median

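【Solution 1】:

I'm not sure this can be done directly with the aggfunc parameter. One workaround is to duplicate the data with the date column shifted forward by one month, so that each month's cell in the pivot table also contains the previous month's observations. Note that this approach does not scale well to larger rolling windows: it works, but a window of n months needs n copies of the data, so you may end up with too much data.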

# first convert to datetime
df['date'] = pd.to_datetime(df['date'])

# concatenate df with a copy whose dates are shifted forward by one month,
# then perform the pivot_table on the combined data
res = (
    pd.concat([df, df.assign(date=lambda x: x['date'] + pd.DateOffset(months=1))])
    .pivot_table(index='date', columns='type',
                 aggfunc={'salary': np.median})
    .reindex(df['date'].unique())  # to avoid an extra month
)

print(res)
             salary                  
type          type1    type2    type3
date                                 
2020-01-01   5500.0      NaN      NaN
2020-02-01  10000.0  26000.0  15000.0
2020-03-01  30000.0  30000.0  10000.0

For the first date, if you want NaN as a rolling window would give, you can set it afterwards with res.loc[res.index.min()] = np.nan
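
The answer notes that the trick can be extended to larger windows at the cost of duplicating the data. A minimal sketch of that extension follows; `rolling_month_median` and its `window` parameter are hypothetical names, not part of the answer. It uses `values='salary'` with `aggfunc='median'`, so the columns are the types only, without the extra `salary` level shown in the output above, and it assumes the dates present are consecutive months.

```python
import numpy as np
import pandas as pd

def rolling_month_median(df, window=2):
    """Hypothetical helper: median salary per type over `window` calendar months."""
    df = df.assign(date=pd.to_datetime(df['date']))
    # the original data plus one copy per extra month in the window,
    # each copy shifted forward by k months
    copies = [df] + [
        df.assign(date=df['date'] + pd.DateOffset(months=k))
        for k in range(1, window)
    ]
    res = (
        pd.concat(copies, ignore_index=True)
        .pivot_table(index='date', columns='type', values='salary', aggfunc='median')
        .reindex(pd.Index(df['date'].unique()).sort_values())  # drop the extra months
    )
    # mimic rolling(): the first window-1 dates do not have a full window
    res.iloc[:window - 1] = np.nan
    return res

d = {'date': ['2020-01-01','2020-02-01','2020-03-01','2020-01-01','2020-02-01',
              '2020-02-01','2020-03-01','2020-02-01','2020-03-01','2020-03-01',
              '2020-03-01','2020-03-01','2020-03-01'],
     'type': ['type1','type2','type3','type1','type3','type1','type2','type2',
              'type2','type3','type1','type2','type1'],
     'salary': [1000,2000,3000,10000,15000,30000,100000,50000,25000,10000,
                25000,30000,40000]}
df = pd.DataFrame(d)

res2 = rolling_month_median(df, window=2)
res3 = rolling_month_median(df, window=3)
print(res2)
print(res3)
```

The intermediate frame holds `window` times as many rows as the input, which is the scaling cost the answer warns about.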

【Comments】:

Interesting approach. I'll give it a try and see how it works.