Pandas: undo accumulation (e.g. cumulative sum) - Answers

【Question title】: Pandas: undo accumulation (e.g. cumulative sum)
【Posted】: 2019-02-15 22:04:04
【Question】:

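I have received data containing cumulative figures. Is there a clever way to de-accumulate the data, so that I get a separate value for each month instead of values stacked on top of one another?

(See the example xlsx here: https://docs.google.com/spreadsheets/d/1yELrJdZmi3CFJccYSi5U6GGDW-Awp5spHDnsDyshBe0/edit?usp=sharing.)

Example input: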

Date    SalesRep    itemA   itemB
01-01-2018  Jakob   5       10
01-01-2018  Adomas  10      20
01-01-2018  Thomas  15      30
01-02-2018  Jakob   50      30
01-02-2018  Adomas  100     40
01-02-2018  Thomas  150     65

Desired output:

Date    SalesRep    itemA   itemB
01-01-2018  Jakob   5       10
01-01-2018  Adomas  10      20
01-01-2018  Thomas  15      30
01-02-2018  Jakob   45      20
01-02-2018  Adomas  90      20
01-02-2018  Thomas  135     35

Best regards,

Przemysław

P.S. Update

What if the data does not increase every month?

Example input:

Date    SalesRep    itemA   itemB
01-01-2018  Jakob   5       10
01-01-2018  Adomas  10      20
01-01-2018  Thomas  15      30
**01-02-2018    Jakob   50      30**
01-02-2018  Adomas  100     40
01-02-2018  Thomas  150     65
**01-03-2018    Jakob   50      30**
01-03-2018  Adomas  102     60
01-03-2018  Thomas  155     75

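What if Jakob's figures do not increase every month, so that your solution does not work? Can I somehow specify a parameter that checks for this and subtracts only when there has been a change?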

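【Comments on the question】:

Add your example to the text of the question rather than linking to an external document.
Your update doesn't make much sense. The solutions work fine when there is no change: by definition, the output for that month should be 0 in that case. Please add example output.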
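
To illustrate the point in the comment above, here is a minimal sketch run on the data from the update. It assumes the rows are already in chronological order within each sales rep; the names `value_cols` and `monthly` are illustrative and do not come from the thread.

```python
import pandas as pd

# Data from the question's update: Jakob's totals do not change in month 3.
df = pd.DataFrame({
    'Date': ['01-01-2018'] * 3 + ['01-02-2018'] * 3 + ['01-03-2018'] * 3,
    'SalesRep': ['Jakob', 'Adomas', 'Thomas'] * 3,
    'itemA': [5, 10, 15, 50, 100, 150, 50, 102, 155],
    'itemB': [10, 20, 30, 30, 40, 65, 30, 60, 75],
})

value_cols = ['itemA', 'itemB']  # hypothetical name for the cumulative columns

# Per-rep difference to the previous row. Each rep's first row has no
# predecessor (NaN after diff), so it keeps its original value.
monthly = df.copy()
monthly[value_cols] = (
    df.groupby('SalesRep')[value_cols].diff()
      .fillna(df[value_cols])
      .astype(int)
)

print(monthly[monthly['SalesRep'] == 'Jakob'])
```

Jakob's third month comes out as 0 for both items, so no special "subtract only when changed" parameter appears to be needed for this case.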

Tags: python excel pandas cumulative-sum

【Solution 1】:

Here is a less general but prettier version of Denziloe's answer:

import pandas as pd

def reverse_cumsum(series):
    # Prepend a 0 so that the first diff equals the first cumulative value.
    series_zeroed = pd.concat([pd.Series([0]), series])
    return series_zeroed.diff().iloc[1:]

To use it on your example, sort by date, group by the relevant column ("SalesRep" in your case), and then apply it to each cumulative column.
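
As a sketch of how that might look on the question's data: the helper is repeated so the block is self-contained, the use of `transform` is one possible way to apply it per group, and parsing the dates as day-month-year is an assumption about the format.

```python
import pandas as pd

def reverse_cumsum(series):
    # Prepend a 0 so that the first diff equals the first cumulative value.
    series_zeroed = pd.concat([pd.Series([0]), series])
    return series_zeroed.diff().iloc[1:]

df = pd.DataFrame({
    'Date': ['01-01-2018'] * 3 + ['01-02-2018'] * 3,
    'SalesRep': ['Jakob', 'Adomas', 'Thomas'] * 2,
    'itemA': [5, 10, 15, 50, 100, 150],
    'itemB': [10, 20, 30, 30, 40, 65],
})

# Assumption: the date strings are day-month-year.
df['Date'] = pd.to_datetime(df['Date'], format='%d-%m-%Y')
df = df.sort_values('Date', kind='stable')

# transform keeps the original row positions, so the result can be
# assigned straight back to the column.
for col in ['itemA', 'itemB']:
    df[col] = df.groupby('SalesRep')[col].transform(reverse_cumsum)

print(df)
```

Because `diff` produces floats, the two item columns end up as floats; cast them with `astype(int)` if integers are required.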

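【Discussion】:

【Solution 2】:

Basically, use DataFrame.groupby and diff. Unfortunately, the first row of each group has no previous row to diff against, so it comes out as NaN, which needs some messy cleanup. There may be a prettier way.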

import pandas as pd

df = pd.DataFrame(
    data=[
        ['01-01-2018', 'Jakob', 5, 10],
        ['01-01-2018', 'Adomas', 10, 20],
        ['01-01-2018', 'Thomas', 15, 30],
        ['01-02-2018', 'Jakob', 50, 30],
        ['01-02-2018', 'Adomas', 100, 40],
        ['01-02-2018', 'Thomas', 150, 65],
        ['01-03-2018', 'Jakob', 60, 30],
        ['01-03-2018', 'Adomas', 120, 45],
        ['01-03-2018', 'Thomas', 200, 75]
    ],
    columns=['Date', 'Sales rep', 'item A', 'item B']
)

cum_columns = ['item A', 'item B']

result = df.merge(
    df.groupby('Sales rep')[cum_columns].diff(),
    left_index=True, right_index=True, suffixes=['', '_uncum']
).fillna({'{}_uncum'.format(cum_column): df[cum_column] for cum_column in cum_columns})

print(result)
Out:
         Date Sales rep  item A  item B  item A_uncum  item B_uncum
0  01-01-2018     Jakob       5      10           5.0          10.0
1  01-01-2018    Adomas      10      20          10.0          20.0
2  01-01-2018    Thomas      15      30          15.0          30.0
3  01-02-2018     Jakob      50      30          45.0          20.0
4  01-02-2018    Adomas     100      40          90.0          20.0
5  01-02-2018    Thomas     150      65         135.0          35.0
6  01-03-2018     Jakob      60      30          10.0           0.0
7  01-03-2018    Adomas     120      45          20.0           5.0
8  01-03-2018    Thomas     200      75          50.0          10.0

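【Discussion】:

Hi, nice example. How do I import the data from Excel and then get the same result in pandas?
You should google it; there is plenty of information online. The relevant function is pd.read_excel, though.
What about the case where the data does not increase every month?
Your question is about increasing data, so what do you mean by data that does not increase?

【Solution 3】:

Here is another approach, using shift. It simply subtracts the previous value within each group. It assumes the DataFrame is already in the correct order (if not, use DataFrame.sort_values first). I think this is nicer because it gives an in-place one-liner.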

import pandas as pd

df = pd.DataFrame(
    data=[
        ['01-01-2018', 'Jakob', 5, 10],
        ['01-01-2018', 'Adomas', 10, 20],
        ['01-01-2018', 'Thomas', 15, 30],
        ['01-02-2018', 'Jakob', 50, 30],
        ['01-02-2018', 'Adomas', 100, 40],
        ['01-02-2018', 'Thomas', 150, 65],
        ['01-03-2018', 'Jakob', 60, 30],
        ['01-03-2018', 'Adomas', 120, 45],
        ['01-03-2018', 'Thomas', 200, 75]
    ],
    columns=['Date', 'Sales rep', 'item A', 'item B']
)

group_by_columns = ['Sales rep']
cum_columns = ['item A', 'item B']

df[cum_columns] -= df.groupby(group_by_columns)[cum_columns].shift(1).fillna(0)

print(df)
Out:
         Date Sales rep  item A  item B
0  01-01-2018     Jakob     5.0    10.0
1  01-01-2018    Adomas    10.0    20.0
2  01-01-2018    Thomas    15.0    30.0
3  01-02-2018     Jakob    45.0    20.0
4  01-02-2018    Adomas    90.0    20.0
5  01-02-2018    Thomas   135.0    35.0
6  01-03-2018     Jakob    10.0     0.0
7  01-03-2018    Adomas    20.0     5.0
8  01-03-2018    Thomas    50.0    10.0
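
One way to sanity-check any of these approaches is to re-accumulate the result and compare it with the input. The following is a minimal sketch, assuming the rows are already in chronological order within each sales rep; the names `original`, `monthly`, and `restored` are illustrative.

```python
import pandas as pd

original = pd.DataFrame(
    data=[
        ['01-01-2018', 'Jakob', 5, 10],
        ['01-01-2018', 'Adomas', 10, 20],
        ['01-01-2018', 'Thomas', 15, 30],
        ['01-02-2018', 'Jakob', 50, 30],
        ['01-02-2018', 'Adomas', 100, 40],
        ['01-02-2018', 'Thomas', 150, 65],
        ['01-03-2018', 'Jakob', 60, 30],
        ['01-03-2018', 'Adomas', 120, 45],
        ['01-03-2018', 'Thomas', 200, 75],
    ],
    columns=['Date', 'Sales rep', 'item A', 'item B']
)
cum_columns = ['item A', 'item B']

# De-accumulate: subtract each rep's previous row (0 for the first row).
previous = original.groupby('Sales rep')[cum_columns].shift(1).fillna(0)
monthly = original.copy()
monthly[cum_columns] = original[cum_columns] - previous

# Re-accumulate per rep; this should reproduce the original totals.
restored = monthly.groupby('Sales rep')[cum_columns].cumsum()
assert (restored == original[cum_columns]).all().all()
```

If the assertion fails on real data, the rows are probably not sorted by date within each group.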

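【Discussion】:

There was a typo: the extended example I gave did not increment the month, but that didn't really break the answer. I've edited to fix the typo... does that clear up your confusion?

【Solution 4】:

You can group by sales rep and take the row-wise differences, then merge the result back into the dataset.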

import pandas as pd

df = pd.DataFrame({
    'Date': ['01-01-2018', '01-01-2018', '01-01-2018', '01-02-2018', '01-02-2018', '01-02-2018'],
    'SalesRep': ['Jakob', 'Adomas', 'Thomas', 'Jakob', 'Adomas', 'Thomas',],
    'itemA': [5, 10, 15, 50, 100, 150],
    'itemB': [10, 20, 30, 30, 40, 65]})

value_cols = ['itemA', 'itemB']
# diff() is NaN on each rep's first row; keep the original value there.
df_diff = df.groupby('SalesRep')[value_cols].diff()
df[value_cols] = df_diff.fillna(df[value_cols]).astype(int)

df
# returns:
         Date SalesRep  itemA  itemB
0  01-01-2018    Jakob      5     10
1  01-01-2018   Adomas     10     20
2  01-01-2018   Thomas     15     30
3  01-02-2018    Jakob     45     20
4  01-02-2018   Adomas     90     20
5  01-02-2018   Thomas    135     35

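【Discussion】:

Hi, nice example. How do I import the data from Excel and then get the same result in pandas?
What about the case where the data does not increase every month?
You would have to provide more data for that case.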