How to get the mean of the last month in pandas: answers

【Question title】: How to get mean of last month in pandas
【Posted】: 2021-06-01 12:04:55
【Question description】:

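I have a dataset where the first column is the date, the second is the collaborator, and the third is the price paid.

I want to get the average price each collaborator paid in the previous month. I want to return a table that looks like this:

I have tried a few solutions, such as rolling, but I can only get the last X days, not the last month.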

【Question discussion】:

Could you post your expected output along with reproducible code, so that others can quickly try out a solution?

Tags: python pandas dataframe time-series

【Solution 1】:

Pandas has a built-in method, .rolling:

x = 3 # This is where you define the number of previous entries
df.rolling(x).mean() # Apply the mean

So:

df['LastMonthMean'] = df['Price'].rolling(x).mean()

I'm not sure how you want to calculate your mean, but I hope this helps.
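Since the question says that rolling only gave the last X days, here is a minimal sketch of what the two kinds of rolling window do. The sample data and the names `by_rows` and `by_days` are made up for illustration. With an integer, the window counts rows; with an offset string such as "30D" on a DatetimeIndex, it covers a fixed time span. Neither one lines up with calendar months.

```python
import pandas as pd

# Hypothetical sample data; the values and dates are assumptions for illustration.
df = pd.DataFrame(
    {"Price": [10.0, 20.0, 30.0, 40.0]},
    index=pd.to_datetime(
        ["2021-01-05", "2021-01-20", "2021-02-10", "2021-03-15"]
    ),
)

# Integer window: mean of the current row and the two rows before it,
# regardless of how far apart in time those rows are.
by_rows = df["Price"].rolling(3).mean()

# Offset window: mean of every row in the 30 days ending at each timestamp.
# This needs a sorted DatetimeIndex and still ignores month boundaries.
by_days = df["Price"].rolling("30D").mean()

print(pd.DataFrame({"by_rows": by_rows, "by_days": by_days}))
```

A fixed 30-day window drifts relative to the calendar, so a "previous calendar month" mean is usually easier to get by grouping on the month.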

【Discussion】:

【Solution 2】:

I would first add a month column, then group by collaborator and month and take the mean of each group:

import pandas as pd
df = pd.DataFrame({
    'month': [1, 1, 1, 2, 2, 2],
    'collaborator': [1, 2, 3, 1, 2, 3],
    'price': [100, 200, 300, 400, 500, 600]
})

df.groupby(['collaborator', 'month']).mean()
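The groupby above gives the mean of each month, not of the month before it. One possible way to turn it into a previous-month value, sketched here on the same sample frame, is to shift the monthly means by one position within each collaborator and join them back. The names `monthly`, `last_month`, `result` and `last_month_mean` are illustrative. The sketch assumes every collaborator has a row for every consecutive month; if a month is missing, the shift picks up the previous available month instead.

```python
import pandas as pd

df = pd.DataFrame({
    'month': [1, 1, 1, 2, 2, 2],
    'collaborator': [1, 2, 3, 1, 2, 3],
    'price': [100, 200, 300, 400, 500, 600]
})

# Mean price per collaborator and month, indexed by (collaborator, month)
monthly = df.groupby(['collaborator', 'month'])['price'].mean()

# Shift by one month within each collaborator, so that every
# (collaborator, month) entry holds the mean of the month before it
last_month = (
    monthly.groupby(level='collaborator')
    .shift(1)
    .rename('last_month_mean')
)

# Attach the shifted means to the original rows
result = df.join(last_month, on=['collaborator', 'month'])
print(result)
```

Rows from the first month get NaN, because there is no earlier month to take a mean from.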

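【Discussion】:

【Solution 3】:

The rolling() method must be applied to the DataFrame grouped by collaborator to get each collaborator's mean sale price over the last month. Because the data is grouped and aggregated, the number of data points no longer matches the original dataset, so you cannot easily append the result to it.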

If your DataFrame uses a DatetimeIndex, it is treated as a time series, and you can then resample() the data more easily.
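As a small illustration of that point, the sketch below resamples a hand-made frame with a DatetimeIndex into monthly means per collaborator. The data is hypothetical. It uses the "MS" (month start) alias as an assumption of this example, because that alias is accepted across pandas versions; the month-end alias labels the same bins by their last day instead.

```python
import pandas as pd

# Hypothetical data with a DatetimeIndex; values are for illustration only.
data = pd.DataFrame(
    {"Collaborator": [1, 1, 2, 1, 2],
     "Price": [10.0, 20.0, 30.0, 40.0, 50.0]},
    index=pd.to_datetime(
        ["2021-01-03", "2021-01-20", "2021-01-25", "2021-02-02", "2021-02-14"]
    ),
)

# One mean per collaborator and calendar month, labelled by month start.
# The result has fewer rows than `data`, one per (collaborator, month).
monthly_mean = data.groupby("Collaborator")["Price"].resample("MS").mean()
print(monthly_mean)
```

The result is indexed by collaborator and month, which is why it has to be merged back onto the original rows rather than assigned directly.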

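Based on your original question, I have put together a reproducible solution below, in which I resample the data and attach the previous month's mean to it. Thanks to @akilat90 for the function to generate random dates within a range.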

import pandas as pd
import numpy as np

def random_dates(start, end, n=10):
    # Function copied from @akilat90
    # Available on https://stackoverflow.com/questions/50559078/generating-random-dates-within-a-given-range-in-pandas
    
    start_u = pd.to_datetime(start).value//10**9
    end_u = pd.to_datetime(end).value//10**9

    return pd.to_datetime(np.random.randint(start_u, end_u, n), unit='s')

size = 1000

index = random_dates(start='2021-01-01', end='2021-06-30', n=size).sort_values()

collaborators = np.random.randint(low=1, high=4, size=size)

prices = np.random.uniform(low=5., high=25., size=size)

data = pd.DataFrame({'Collaborator': collaborators,
                     'Price': prices}, index=index)

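# Mean price per collaborator for each calendar month
monthly_mean = data.groupby('Collaborator').resample('M')['Price'].mean()

# Match each record to the mean of the preceding month. The right-hand month
# number is shifted by +1, which only works within a single calendar year.
data_final = pd.merge(data, monthly_mean, how='left', left_on=['Collaborator', data.index.month],
         right_on=[monthly_mean.index.get_level_values('Collaborator'), monthly_mean.index.get_level_values(1).month + 1])

# Restore the original DatetimeIndex and tidy up the columns
data_final.index = data.index
data_final = data_final.drop('key_1', axis=1)
data_final.columns = ['Collaborator', 'Price', 'LastMonthMean']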

Here is the output:

                     Collaborator      Price  LastMonthMean
2021-01-31 04:26:16             2  21.838910            NaN
2021-01-31 05:33:04             2  19.164086            NaN
2021-01-31 12:32:44             2  24.949444            NaN
2021-01-31 12:58:02             2   8.907224            NaN
2021-01-31 14:43:07             1   7.446839            NaN
2021-01-31 18:38:11             3   6.565208            NaN
2021-02-01 00:08:25             2  24.520149      15.230642
2021-02-01 09:25:54             2  20.614261      15.230642
2021-02-01 09:59:48             2  10.879633      15.230642
2021-02-02 10:12:51             1  22.134549      14.180087
2021-02-02 17:22:18             2  24.469944      15.230642

As you can see, the records for January 2021 (the first month in this time series) have no valid previous-month mean, unlike the records for February.
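The merge in this answer matches on the month number plus one, so it depends on the data staying within one calendar year. If the data may cross a year boundary, a possible variant (a sketch, not part of the original answer) is to key on monthly periods, where subtracting 1 from January 2021 gives December 2020. The sample data and the names `period` and `keys` are assumptions for illustration.

```python
import pandas as pd

# Hypothetical data that crosses a year boundary
data = pd.DataFrame(
    {"Collaborator": [1, 1, 2, 1, 2],
     "Price": [10.0, 20.0, 30.0, 40.0, 50.0]},
    index=pd.to_datetime(
        ["2020-12-03", "2020-12-20", "2020-12-25", "2021-01-05", "2021-01-14"]
    ),
)

# Calendar month of each record as a monthly Period (e.g. 2020-12)
period = data.index.to_period("M")

# Mean price per (collaborator, calendar month)
monthly_mean = data.groupby(["Collaborator", period])["Price"].mean()

# For each record, look up the mean of the month before its own month.
# Missing (collaborator, month) pairs come back as NaN.
keys = pd.MultiIndex.from_arrays([data["Collaborator"], period - 1])
data["LastMonthMean"] = monthly_mean.reindex(keys).to_numpy()
print(data)
```

Because the lookup goes by calendar period rather than by position, a collaborator with no records in the preceding month gets NaN instead of a value from an older month.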

【Discussion】: