How do I compute the cumulative sum of a column while skipping the first two rows? (Answers)

【Question title】: How do I compute the cumulative sum of a column while skipping the first two rows?
【Posted】: 2021-12-27 21:13:47
【Question description】:

I have a pandas DataFrame that looks like this:

                                 capacity_gw  marginal_cost  chained_capacity
Case                   Category
CES - No Storage       Hydro        4.277016       0.000000               NaN
                       Solar        9.774715       0.000000               NaN
                       Wind        11.881870       0.000000          4.277016
                       Nuclear      5.242805      12.689066         14.051731
                       NGCC         2.101907      25.109150         25.933600
                       NGGT         4.638107      32.703513         31.176405
                       Overflow    35.000000     169.679554         33.278312
CES - Storage          Hydro        4.277016       0.000000         37.916419
                       Solar        9.774715       0.000000         72.916419
                       Wind        11.881869       0.000000          4.277016
                       Nuclear      5.242805      12.689066         14.051731
                       NGCC         2.101907      25.109150         25.933600
                       NGGT         2.101907      32.703513         31.176405
                       Overflow    35.000000     169.679554         33.278312
Reference - No Storage Hydro        4.277016       0.000000         35.380219
                       Solar       14.289311       0.000000         70.380219
                       Wind        10.435570       0.000000          4.277016
                       Nuclear      1.143500      12.689066         18.566327
                       NGCC         4.533380      25.109150         29.001897
                       NGGT        17.224408      32.703513         30.145397
                       Overflow    35.000000     169.679554         34.678777
Reference - Storage    Hydro        4.277016       0.000000         51.903185
                       Solar       14.894274       0.000000         86.903185
                       Wind        10.435570       0.000000          4.277016
                       Nuclear      1.143500      12.689066         19.171290
                       NGCC         4.533380      25.109150         29.606860
                       NGGT        14.524706      32.703513         30.750360
                       Overflow    35.000000     169.679554         35.283740

I created the chained_capacity column with:

stack['chained_capacity'] = stack.groupby('Case')['capacity_gw'].cumsum().shift(2)

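But this is not the result I want. As you can see, it still starts summing from the first value in the column. I want the sum to start from the third value, so the expected output is: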

                                 capacity_gw  marginal_cost  chained_capacity
Case                   Category
CES - No Storage       Hydro        4.277016       0.000000               NaN
                       Solar        9.774715       0.000000               NaN
                       Wind        11.881870       0.000000         11.881870
                       Nuclear      5.242805      12.689066         17.124674
                       NGCC         2.101907      25.109150         17.12 + 2.10
                       NGGT         4.638107      32.703513         ...
                       Overflow    35.000000     169.679554         ...
...

Here is the output of df.to_dict(), which fully reproduces the data:

{'capacity_gw': {('Reference - No Storage', 'Solar'): 14.289311043873823, ('Reference - No Storage', 'Wind'): 10.43556981658827, ('Reference - No Storage', 'Hydro'): 4.277016, ('Reference - No Storage', 'Nuclear'): 1.1435, ('Reference - No Storage', 'NGCC'): 4.533380090390558, ('Reference - No Storage', 'NGGT'): 17.22440836569597, ('Reference - No Storage', 'Overflow'): 35.0, ('Reference - Storage', 'Solar'): 14.894274398144354, ('Reference - Storage', 'Wind'): 10.435569838806854, ('Reference - Storage', 'Hydro'): 4.277016, ('Reference - Storage', 'Nuclear'): 1.1435, ('Reference - Storage', 'NGCC'): 4.533380082818851, ('Reference - Storage', 'NGGT'): 14.524706430121823, ('Reference - Storage', 'Overflow'): 35.0, ('CES - No Storage', 'Solar'): 9.774714739869358, ('CES - No Storage', 'Wind'): 11.881869635856951, ('CES - No Storage', 'Hydro'): 4.277016, ('CES - No Storage', 'Nuclear'): 5.242805, ('CES - No Storage', 'NGCC'): 2.1019069999999997, ('CES - No Storage', 'NGGT'): 4.638107074198996, ('CES - No Storage', 'Overflow'): 35.0, ('CES - Storage', 'Solar'): 9.774714538236491, ('CES - Storage', 'Wind'): 11.881869305881622, ('CES - Storage', 'Hydro'): 4.277016, ('CES - Storage', 'Nuclear'): 5.242805, ('CES - Storage', 'NGCC'): 2.1019069999999997, ('CES - Storage', 'NGGT'): 2.1019069999999997, ('CES - Storage', 'Overflow'): 35.0}, 'marginal_cost': {('Reference - No Storage', 'Solar'): 0.0, ('Reference - No Storage', 'Wind'): 0.0, ('Reference - No Storage', 'Hydro'): 0.0, ('Reference - No Storage', 'Nuclear'): 12.68906562274404, ('Reference - No Storage', 'NGCC'): 25.10914978408783, ('Reference - No Storage', 'NGGT'): 32.703513055654646, ('Reference - No Storage', 'Overflow'): 169.6795540944021, ('Reference - Storage', 'Solar'): 0.0, ('Reference - Storage', 'Wind'): 0.0, ('Reference - Storage', 'Hydro'): 0.0, ('Reference - Storage', 'Nuclear'): 12.68906562274404, ('Reference - Storage', 'NGCC'): 25.10914978408783, ('Reference - Storage', 'NGGT'): 32.703513055654646, 
('Reference - Storage', 'Overflow'): 169.6795540944021, ('CES - No Storage', 'Solar'): 0.0, ('CES - No Storage', 'Wind'): 0.0, ('CES - No Storage', 'Hydro'): 0.0, ('CES - No Storage', 'Nuclear'): 12.68906562274404, ('CES - No Storage', 'NGCC'): 25.10914978408783, ('CES - No Storage', 'NGGT'): 32.703513055654646, ('CES - No Storage', 'Overflow'): 169.6795540944021, ('CES - Storage', 'Solar'): 0.0, ('CES - Storage', 'Wind'): 0.0, ('CES - Storage', 'Hydro'): 0.0, ('CES - Storage', 'Nuclear'): 12.68906562274404, ('CES - Storage', 'NGCC'): 25.10914978408783, ('CES - Storage', 'NGGT'): 32.703513055654646, ('CES - Storage', 'Overflow'): 169.6795540944021}}
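
As a side note on the attempt above: in the first table, the first two rows of every case after the first are not NaN. A minimal sketch of why, on a toy frame (the case names "A"/"B", the category labels and the numbers are made up for illustration, not taken from the question):

```python
import pandas as pd

# Toy data (hypothetical values): two cases with four rows each.
idx = pd.MultiIndex.from_product(
    [["A", "B"], ["r1", "r2", "r3", "r4"]], names=["Case", "Category"]
)
stack = pd.DataFrame(
    {"capacity_gw": [1.0, 2.0, 3.0, 4.0, 10.0, 20.0, 30.0, 40.0]}, index=idx
)

# The cumulative sum is computed per case ...
per_group_cumsum = stack.groupby("Case")["capacity_gw"].cumsum()

# ... but this shift runs over the whole column, ignoring case boundaries,
# so the last two totals of case A slide into the first two rows of case B.
leaky = per_group_cumsum.shift(2)

print(leaky.tolist())
# [nan, nan, 1.0, 3.0, 6.0, 10.0, 10.0, 30.0]
```

Here the values 6.0 and 10.0 in the first two rows of case B belong to case A, which appears to be the same effect as the 37.916419 and 72.916419 entries in the table above.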

【Question discussion】:

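What happens if you shift before the cumsum, as in .shift(2).cumsum()?
It looks like this might answer your question: stackoverflow.com/questions/54993050/…
@JoshuaVoskamp That works for Case, but it doesn't solve the original problem of starting only from the 3rd entry.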

Tags: python pandas cumulative-sum

【Solution 1】:

Using the trick from this answer, either

# this transform may be slow for large dataframes
stack['chained_capacity'] = \
    stack.groupby('Case')['capacity_gw'].transform(lambda x: x.cumsum().shift(2))

or

# creates a temporary column; should be fast/scalable for large df
stack['temp'] = stack.groupby('Case')['capacity_gw'].cumsum()
stack['chained_capacity'] = stack.groupby('Case')['temp'].shift(2)
stack = stack.drop(columns=['temp'])
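
A quick way to check what the two variants produce is to run them on a small frame. The sketch below uses made-up data (the "A"/"B" cases and the numbers are assumptions) and replaces the temporary column with an intermediate Series, which is a small deviation from the code above:

```python
import pandas as pd

# Toy data (hypothetical values): two cases with four rows each.
idx = pd.MultiIndex.from_product(
    [["A", "B"], ["r1", "r2", "r3", "r4"]], names=["Case", "Category"]
)
stack = pd.DataFrame(
    {"capacity_gw": [1.0, 2.0, 3.0, 4.0, 10.0, 20.0, 30.0, 40.0]}, index=idx
)

# Variant 1: cumsum and shift both happen inside each case via transform.
via_transform = stack.groupby("Case")["capacity_gw"].transform(
    lambda x: x.cumsum().shift(2)
)

# Variant 2: grouped cumsum first, then a grouped shift on the result
# (an intermediate Series stands in for the temporary column).
cumsum_per_case = stack.groupby("Case")["capacity_gw"].cumsum()
via_two_steps = cumsum_per_case.groupby(level="Case").shift(2)

print(via_transform.tolist())
# [nan, nan, 1.0, 3.0, nan, nan, 10.0, 30.0]
```

On this toy input both variants agree, and the shift no longer crosses case boundaries. The running total still begins with the first row of each case (1.0, then 1.0 + 2.0), so it is the shifted cumulative sum rather than a sum restarted at the third row.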

【Discussion】:

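Actually, this doesn't quite work. It seems to be summing across Case, and I need the cumulative sum within each Case.
Could you provide a sample df.to_dict() to work with?
I have edited my question to include that information.
By combining your original solution with the one you linked me, I was also able to get the correct answer, like so: stack.groupby('Case')['capacity_gw'].transform(lambda x: x.shift(-2).cumsum().shift(2))
I may have been mistaken; I think my .shift(-2) idea was completely unnecessary.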
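
For reference, the expression from the comment above can be tried on a toy frame. The data below are made up, and the second variant (masking by position with cumcount) is an alternative that does not appear in the thread; treat both as a sketch rather than a definitive solution:

```python
import pandas as pd

# Toy data (hypothetical values): two cases with four rows each.
idx = pd.MultiIndex.from_product(
    [["A", "B"], ["r1", "r2", "r3", "r4"]], names=["Case", "Category"]
)
stack = pd.DataFrame(
    {"capacity_gw": [1.0, 2.0, 3.0, 4.0, 10.0, 20.0, 30.0, 40.0]}, index=idx
)

# Expression from the comment: drop the first two values of each case by
# shifting up, accumulate, then shift back down so rows line up again.
via_shifts = stack.groupby("Case")["capacity_gw"].transform(
    lambda x: x.shift(-2).cumsum().shift(2)
)

# Alternative (not from the thread): blank out the first two rows of each
# case by position, then take a grouped cumsum, which skips NaN by default.
position_in_case = stack.groupby("Case").cumcount()
masked = stack["capacity_gw"].where(position_in_case >= 2)
via_mask = masked.groupby(level="Case").cumsum()

print(via_shifts.tolist())
# [nan, nan, 3.0, 7.0, nan, nan, 30.0, 70.0]
```

Both variants give NaN for the first two rows of each case and then a running total that starts with the third row's own value, which is the shape of the expected output shown in the question.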