Divide total sum equally to higher sampled time periods when upsampling with pandas

【Question title】: Divide total sum equally to higher sampled time periods when upsampling with pandas
【Posted】: 2020-11-14 09:13:41
【Question description】:

I am trying to divide the total for a time period equally among the higher-frequency periods it is made up of.

What I did:

>>> rng = pandas.period_range(start='2014-01-01', periods=2, freq='W')
>>> ts = pandas.Series([i+1 for i in range(len(rng))], index=rng)
>>> ts
2013-12-30/2014-01-05    1
2014-01-06/2014-01-12    2
Freq: W-SUN, dtype: float64

>>> ts.resample('D').asfreq()
2013-12-30     1
2013-12-31   NaN
2014-01-01   NaN
2014-01-02   NaN
2014-01-03   NaN
2014-01-04   NaN
2014-01-05   NaN
2014-01-06     2
2014-01-07   NaN
2014-01-08   NaN
2014-01-09   NaN
2014-01-10   NaN
2014-01-11   NaN
2014-01-12   NaN
Freq: D, dtype: float64

What I actually want is:

>>> ts.resample('D', some_miracle_thing)
2013-12-30     1/7
2013-12-31     1/7
2014-01-01     1/7
2014-01-02     1/7
2014-01-03     1/7
2014-01-04     1/7
2014-01-05     1/7
2014-01-06     2/7
2014-01-07     2/7
2014-01-08     2/7
2014-01-09     2/7
2014-01-10     2/7
2014-01-11     2/7
2014-01-12     2/7
Freq: D, dtype: float64

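Is there a way to do this

specifically, for example with an x/7 lambda function?
in general, so that it works independently of the factor 7 (for example 24 for hours to days, and so on)?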

【Question comments】:

Five years on, is there a better, more canonical solution?

Tags: python pandas

【Solution 1】:

A bit convoluted, but does this work?

First, when you resample, add a .groupby(level=0) so that the original timestamps are preserved. (Based on this answer.)

rs = ts.groupby(level=0).resample('D')

Then group by the first level of the MultiIndex to apply the operation you want.

In [285]: rs.groupby(level=0).transform(lambda x: x.iloc[0] / float(len(x)))
Out[285]: 
2013-12-30/2014-01-05  2013-12-30    0.142857
                       2013-12-31    0.142857
                       2014-01-01    0.142857
                       2014-01-02    0.142857
                       2014-01-03    0.142857
                       2014-01-04    0.142857
                       2014-01-05    0.142857
2014-01-06/2014-01-12  2014-01-06    0.285714
                       2014-01-07    0.285714
                       2014-01-08    0.285714
                       2014-01-09    0.285714
                       2014-01-10    0.285714
                       2014-01-11    0.285714
                       2014-01-12    0.285714
dtype: float64
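
Since pandas 0.18, `resample` returns a deferred Resampler object, so the snippet above may not run unchanged on a current install. As a minimal sketch of the same per-period idea (take each original period's value and divide it by the number of sub-periods it covers), the following avoids `resample` altogether. The helper name `spread_evenly` is made up for this illustration and is not a pandas function; it assumes the series has a PeriodIndex.

```python
import pandas as pd


def spread_evenly(ts, freq):
    """Split each period's value equally across its sub-periods at `freq`.

    Hypothetical helper for illustration; assumes `ts` has a PeriodIndex.
    """
    pieces = []
    for period, value in ts.items():
        # All sub-periods (e.g. days) that fall inside this original period.
        sub = pd.period_range(start=period.start_time,
                              end=period.end_time, freq=freq)
        # Every sub-period gets the same share of the original value.
        pieces.append(pd.Series(value / len(sub), index=sub))
    return pd.concat(pieces)


rng = pd.period_range(start='2014-01-01', periods=2, freq='W')
ts = pd.Series([1, 2], index=rng)
daily = spread_evenly(ts, 'D')
print(daily)
```

Because the divisor is the length of each generated sub-range, nothing is hard-coded to 7.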

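【Comments】:

Looks good, but I'm a bit disappointed that there is no convenient implementation of this basic functionality. Any other ideas?

【Solution 2】:

This works, but I find it ugly: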

>>> rs = ts.resample('D').ffill()
>>> rs/7

2013-12-30    0.142857
2013-12-31    0.142857
2014-01-01    0.142857
2014-01-02    0.142857
2014-01-03    0.142857
2014-01-04    0.142857
2014-01-05    0.142857
2014-01-06    0.285714
2014-01-07    0.285714
2014-01-08    0.285714
2014-01-09    0.285714
2014-01-10    0.285714
2014-01-11    0.285714
2014-01-12    0.285714
Freq: D, dtype: float64

Isn't there a built-in function for this basic operation?
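
One way to drop the hard-coded 7, sketched here under the assumption that the series has a unique PeriodIndex: pad the values as above, then divide by the number of target periods that share the same source period. The variable names (`days`, `parent`, `padded`, `counts`) are illustrative only.

```python
import pandas as pd

rng = pd.period_range(start='2014-01-01', periods=2, freq='W')
ts = pd.Series([1, 2], index=rng)

# Every day covered by the weekly series, first week start to last week end.
days = pd.period_range(start=rng[0].start_time, end=rng[-1].end_time, freq='D')

# The week each day belongs to (one label per day, so labels repeat).
parent = days.asfreq(rng.freqstr)

# "Pad": repeat the weekly value on every day of that week.
padded = pd.Series(ts.reindex(parent).to_numpy(), index=days)

# Number of days sharing each week, broadcast back onto the daily index.
counts = padded.groupby(parent).transform('size')

daily = padded / counts
print(daily)
```

The same steps should carry over to other frequency pairs, since the divisor comes from counting rather than from a constant.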

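【Comments】:

Did you ever find an answer?

【Solution 3】:

I hate this solution, but it works for upsampling when you are not sure how many new intervals there will be. Going from weeks to days is easy: there are always 7 days per week. But I find that the number of intervals produced by upsampling is often not known in advance, and this solution handles that case.

The idea is to get the number of post-resampling intervals into the original, pre-resampling DataFrame, then resample and divide your data by the interval count. Side note: this is a DataFrame, not a Series.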

# Create unique group IDs by simply using the existing index (assumes an integer, non-duplicated index).
df['group'] = df.index

# Get the count of intervals for each post-resampled timestamp.
df['count'] = df.set_index('timestamp').resample('15min').ffill()['group'].value_counts()

# Resample all data again and fill so that the count is now included in every row.
df = df.set_index('timestamp').resample('15min').ffill()

# Apply the division on the entire dataframe and clean up.
df = df.div(df['count'], axis=0).reset_index().drop(['group', 'count'], axis=1)

I would wrap it in a function and tuck it away so I never have to look at it again, for example:

def distribute_upsample(df, index, freq):
    ...

where index is 'timestamp' and freq is '15min'.
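
A possible body for such a function, as a sketch rather than the answer author's actual code: it assumes `df` has a sorted, duplicate-free datetime column named by `index` and that every other column is numeric. It counts rows per original record with a groupby instead of `value_counts`, and the `group` helper column is temporary. The sample data is invented for the demonstration.

```python
import pandas as pd


def distribute_upsample(df, index, freq):
    """Upsample `df` to `freq`, splitting each row's values equally
    over the new rows that the upsampling creates for it."""
    out = df.copy()
    # One id per original row, so upsampled rows can be traced back.
    out['group'] = range(len(out))
    # Forward-fill so every new timestamp carries its source row's values.
    filled = out.set_index(index).resample(freq).ffill()
    # How many upsampled rows each original row turned into.
    counts = filled.groupby('group')['group'].transform('size')
    # Divide every value by that count and restore the timestamp column.
    return filled.drop(columns='group').div(counts, axis=0).reset_index()


# Invented sample data: unevenly spaced timestamps.
df = pd.DataFrame({
    'timestamp': pd.to_datetime(['2020-01-01 00:00',
                                 '2020-01-01 01:00',
                                 '2020-01-01 01:30']),
    'value': [4.0, 2.0, 1.0],
})
result = distribute_upsample(df, 'timestamp', '15min')
print(result)
```

Note that `resample` stops at the last timestamp, so the final row has nothing to be spread over and keeps its full value.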

【Comments】: