【Question title】: Cumulative sum only applying on 1 column python
【Posted】: 2017-07-20 11:06:02
【Question description】:

I want to apply cumsum on only 1 specific column, because the other columns hold values that must stay unchanged.

This is my script so far:

df.groupby(by=['name','day']).sum().groupby(level=[0]).cumsum()

However, this script cumulates every column in my pandas df. The only column that should be cumulatively summed is data.

As requested, here is some sample data:

import pandas as pd

df = pd.DataFrame({'ID': ["880022443344556677787", "880022443344556677782", "880022443344556677787",
                          "880022443344556677782", "880022443344556677787", "880022443344556677782",
                          "880022443344556677781"],
                   'Month': ["201701", "201701", "201702", "201702", "201703", "201703", "201703"],
                   'Usage': [20, 40, 100, 50, 30, 30, 2000],
                   'Sec': [10, 15, 20, 1, 5, 6, 30]})

                      ID   Month  Sec  Usage
0  880022443344556677787  201701   10     20
1  880022443344556677782  201701   15     40
2  880022443344556677787  201702   20    100
3  880022443344556677782  201702    1     50
4  880022443344556677787  201703    5     30
5  880022443344556677782  201703    6     30
6  880022443344556677781  201703   30   2000

Desired output:

                      ID   Month  Sec  Usage
0  880022443344556677787  201701   10     20
1  880022443344556677782  201701   15     40
2  880022443344556677787  201702   20    120
3  880022443344556677782  201702    1     90
4  880022443344556677787  201703    5    150
5  880022443344556677782  201703    6    120
6  880022443344556677781  201703   30   2000

【Question comments】:

    Tags: python pandas cumulative-sum


    【Solution 1】:

    I think you need set_index to move the columns that should not be cumulated into the index; the list comprehension finds them dynamically:

    cumsum_col = 'Usage'
    df1 = df.groupby(by=['ID','Month'], sort=False).sum()
    cols = [col for col in df1.columns if col != cumsum_col]
    
    df1 = df1.set_index(cols, append=True).groupby(level=[0]).cumsum().reset_index()
    print (df1)
                          ID   Month  Sec  Usage
    0  880022443344556677787  201701   10     20
    1  880022443344556677782  201701   15     40
    2  880022443344556677787  201702   20    120
    3  880022443344556677782  201702    1     90
    4  880022443344556677787  201703    5    150
    5  880022443344556677782  201703    6    120
    6  880022443344556677781  201703   30   2000
    

    Edit:

    cumsum_col = 'Usage'
    df2 = df.groupby(by=['ID','Month'], sort=False).sum()
    cols = [col for col in df2.columns if col != cumsum_col]
    df1 = df2.set_index(cols, append=True).groupby(level=[0]).cumsum()
    df1 = df2.assign(Usage_cumsum = df1.reset_index(level=2, drop=True)).reset_index()
    print (df1)
                          ID   Month  Sec  Usage  Usage_cumsum
    0  880022443344556677787  201701   10     20            20
    1  880022443344556677782  201701   15     40            40
    2  880022443344556677787  201702   20    100           120
    3  880022443344556677782  201702    1     50            90
    4  880022443344556677787  201703    5     30           150
    5  880022443344556677782  201703    6     30           120
    6  880022443344556677781  201703   30   2000          2000
    

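    Edit 1:

    Your sample data has no repeated (ID, Month) pairs, so the sum aggregation has nothing to combine. I modified the data slightly to show it (the solution is similar, but takes a different route):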

    df = pd.DataFrame({'ID': ["880022443344556677787", "880022443344556677782", "880022443344556677787",
                              "880022443344556677782", "880022443344556677787", "880022443344556677782",
                              "880022443344556677781"],
                       'Month': ["201701", "201701", "201701", "201702", "201703", "201701", "201703"],
                       'Usage': [20, 40, 100, 50, 30, 30, 2000],
                       'Sec': [10, 15, 20, 1, 5, 6, 30]})
    
    print (df)
                          ID   Month  Sec  Usage
    0  880022443344556677787  201701   10     20
    1  880022443344556677782  201701   15     40
    2  880022443344556677787  201701   20    100
    3  880022443344556677782  201702    1     50
    4  880022443344556677787  201703    5     30
    5  880022443344556677782  201701    6     30
    6  880022443344556677781  201703   30   2000
    
    #aggregate sum to all columns
    df1 = df.groupby(['ID', 'Month']).sum() 
    print (df1)
                                  Sec  Usage
    ID                    Month             
    880022443344556677781 201703   30   2000
    880022443344556677782 201701   21     70
                          201702    1     50
    880022443344556677787 201701   30    120
                          201703    5     30
    
    #aggregate cumsum to Usage column only 
    s = df1.groupby(level=0)['Usage'].cumsum()
    print (s)
    ID                     Month 
    880022443344556677781  201703    2000
    880022443344556677782  201701      70
                           201702     120
    880022443344556677787  201701     120
                           201703     150
    Name: Usage, dtype: int64
    
    #join cumsum series to aggregate df1
    df3 = df1.join(s, rsuffix='_cumsum').reset_index()
    print (df3)
                          ID   Month  Sec  Usage  Usage_cumsum
    0  880022443344556677781  201703   30   2000          2000
    1  880022443344556677782  201701   21     70            70
    2  880022443344556677782  201702    1     50           120
    3  880022443344556677787  201701   30    120           120
    4  880022443344556677787  201703    5     30           150
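
The two steps above (sum per group, then cumsum on a single column) can also be applied to the question's original sample data. The following is only a minimal sketch, not part of the original answer: it assumes the rows of each ID already appear in chronological order, and the names agg and result are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "ID": ["880022443344556677787", "880022443344556677782", "880022443344556677787",
           "880022443344556677782", "880022443344556677787", "880022443344556677782",
           "880022443344556677781"],
    "Month": ["201701", "201701", "201702", "201702", "201703", "201703", "201703"],
    "Usage": [20, 40, 100, 50, 30, 30, 2000],
    "Sec": [10, 15, 20, 1, 5, 6, 30],
})

# Step 1: sum the numeric columns per (ID, Month).
# sort=False keeps the groups in first-seen order (assumed chronological here).
agg = df.groupby(["ID", "Month"], sort=False).sum()

# Step 2: running total of Usage only, restarted for each ID (index level 0).
# Selecting the column before cumsum leaves Sec and Usage untouched.
agg["Usage_cumsum"] = agg.groupby(level=0)["Usage"].cumsum()

result = agg.reset_index()
print(result)
```

Because the column is selected by name, no reset_index(level=...) bookkeeping is needed when the real data has more columns.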
    

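    【Comments】:

    • Is it possible to add an extra column with the cumulative sum data instead of replacing it?
    • Not sure what is going on, but when I apply it to my df, your first method works, whereas the new method with the additional cumsum column returns NaN values. Do you know what is happening?
    • It seems your real data has more columns, so you need to change it to df1.reset_index(level=[2,3,4], drop=True), with one level for each additional column. But I have modified the answer with another solution, give me a moment.
    【Solution 2】:

    Consider the dataframe df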

    df = pd.DataFrame(dict(
            name=list('aaaaaaaabbbbbbbb'),
            day=np.tile(np.arange(2).repeat(4), 2),
            data=np.arange(16)
        ))
    

    First, perform cumsum on a specific column by naming that column after the groupby statement.

    Second, add the result back to the dataframe df with join.

    d2 = df.groupby(['name', 'day']).data.sum().groupby(level=0).cumsum()
    
    df.join(d2, on=['name', 'day'], rsuffix='_cum')
    
        data  day name  data_cum
    0      0    0    a         6
    1      1    0    a         6
    2      2    0    a         6
    3      3    0    a         6
    4      4    1    a        28
    5      5    1    a        28
    6      6    1    a        28
    7      7    1    a        28
    8      8    0    b        38
    9      9    0    b        38
    10    10    0    b        38
    11    11    0    b        38
    12    12    1    b        92
    13    13    1    b        92
    14    14    1    b        92
    15    15    1    b        92
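
To see what is being joined, it can help to look at the intermediate series d2 on its own. This is a sketch that adds the imports the snippet above relies on (numpy as np, pandas as pd); the name out is only used here for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(dict(
    name=list("aaaaaaaabbbbbbbb"),
    day=np.tile(np.arange(2).repeat(4), 2),
    data=np.arange(16),
))

# Sum data per (name, day), then take a running total within each name.
d2 = df.groupby(["name", "day"])["data"].sum().groupby(level=0).cumsum()
print(d2)

# join(on=...) looks up each row's (name, day) pair in d2's MultiIndex,
# so every row of a group receives the same cumulative value.
out = df.join(d2, on=["name", "day"], rsuffix="_cum")
print(out)
```

d2 has one value per (name, day) pair, which is why data_cum repeats within each group of rows in the joined frame.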
    

    【Comments】:

      【Solution 3】:

      You can already pass cumulative sum ('cumsum') as an aggregation to df.groupby. Supply 'cumsum' as a string as the aggregation function for the 'data' column.

      df.groupby(['name','day']).agg({'data': 'cumsum'})
      

      【Comments】:

      • This is wrong, because you first need to aggregate with sum, and only then group by the first level to apply cumsum.
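
The difference the comment points at can be checked on the sample frame from Solution 2. This is a rough sketch; it calls cumsum directly on the grouped column rather than through agg, and the names within_group and across_days are made up for the comparison.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(dict(
    name=list("aaaaaaaabbbbbbbb"),
    day=np.tile(np.arange(2).repeat(4), 2),
    data=np.arange(16),
))

# Row-wise running total inside each (name, day) group: it restarts every day.
within_group = df.groupby(["name", "day"])["data"].cumsum()

# Daily totals first, then a running total across days within each name.
across_days = df.groupby(["name", "day"])["data"].sum().groupby(level=0).cumsum()

print(within_group.tolist())
print(across_days.tolist())
```

The first result has one value per original row and never carries a total over from one day to the next; the second has one value per (name, day) pair and accumulates across days, which is what the question asks for.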