Python/Pandas: How to combine cumsum and cumcount with agg function? (Answers)

【Question title】: Python/Pandas: How to combine cumsum and cumcount with agg function?
【Posted】: 2017-09-03 02:16:50
【Question description】:

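I have a DataFrame that I group by Internal Score and Issue Date (by quarter). I then want to create a statistics table that includes the cumulative count of the number of loans (represented by the distinct count of Loan #), the cumulative sum of the loan amounts, and the sum of Actual Loss and Outstanding Principal. The cumulative sum and cumulative count should be a snapshot from the first date up to that particular point in time (i.e. the cumulative totals from 2015 Q1 to 2015 Q2, then from 2015 Q1 to 2015 Q3, then from 2015 Q1 to 2015 Q4, and so on).

Sample dataset: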

   Loan #   Amount Issue Date TU Status List Internal Score  Last Actual Paid  \
0   57144  3337.76 2017-04-03              B              A               0.0   
1   57145  5536.46 2017-04-03              B              C               0.0   
2   57160  3443.91 2017-04-03              B              B               0.0   
3   57161  1162.79 2017-04-03              B              B               0.0   
4   57162  3845.98 2017-04-03              B              B               0.0   
5   57163  3441.50 2017-04-03              B              B               0.0   
6   57164  2039.96 2017-04-03              B              C               0.0   
7   57165  4427.53 2017-04-03              B              A               0.0   
8   57166  4427.53 2017-04-03              B              A               0.0   
9   57167  1617.77 2017-04-03              B              B               0.0   

   Outstanding-Principal  Actual Loss  
0                3337.76          0.0  
1                5536.46          0.0  
2                3443.91          0.0  
3                1162.79          0.0  
4                3845.98          0.0  
5                3441.50          0.0  
6                2039.96          0.0  
7                4427.53          0.0  
8                4427.53          0.0  
9                1617.77          0.0

I tried something like this:

container = []
for i in ['A', 'B', 'C', 'D']:

    subdf = df[df['Internal Score'].str.contains(i)]

    # Calculate Quarterly Vintages
    subdf.set_index('Issue Date', inplace=True)
    df2 = subdf.groupby(pd.TimeGrouper('Q')).agg({'Outstanding-Principal': np.sum, 'Actual Loss': np.sum,
                                                  'Amount': cumsum, 'Loan #': cumcount})
    df2['Internal Score'] = i
    container.append(df2)

ddf = pd.concat(container)

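【Question comments】:

What do you mean by "cumulative counts of the number of loans (represented by the distinct count of Loan #)"? What should the result look like?
Normally I would pass a dictionary to the agg function, e.g. .agg({'Loan #': len}). But that only gives the count per quarter. I want the cumulative count.
Got it. Are you dealing with more than one year? For example 2015 Q1 through 2017 Q3, or rolling within each year?
More than one year. My current data spans 2015 Q4 to 2017 Q1.
Do you cumulate over the whole time span or within each year?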

Tags: python pandas dataframe aggregate cumsum

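【Solution 1】:

You can use groupby first and then apply cumsum.

I modified your dummy data and changed the dates so that they span two quarters, which makes the example clearer: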

print(df)

    Loan #  Amount      Issue Date  Internal Score  Outstanding Principal   Actual Loss
0   57144   3337.76     2017-04-03  A               3337.76                 0.0
1   57145   5536.46     2017-04-03  C               5536.46                 0.0
2   57160   3443.91     2017-04-03  B               3443.91                 0.0
3   57161   1162.79     2017-04-03  B               1162.79                 0.0
4   57162   3845.98     2017-04-03  B               3845.98                 0.0
5   57163   3441.50     2017-07-03  B               3441.50                 0.0
6   57164   2039.96     2017-07-03  C               2039.96                 0.0
7   57165   4427.53     2017-07-03  A               4427.53                 0.0
8   57166   4427.53     2017-07-03  A               4427.53                 0.0
9   57167   1617.77     2017-07-03  B               1617.77                 0.0

First, create a column holding a key that identifies the year and quarter of each timestamp:

import numpy as np
import pandas as pd

# in case it is not a timestamp already
df["Issue Date"] = pd.to_datetime(df["Issue Date"])

dt = df["Issue Date"].dt
# build a "YYYY Qn" key, e.g. "2017 Q2"
df["Quarter"] = dt.strftime("%Y").str.cat(dt.quarter.astype(str), sep=" Q")

print(df["Quarter"])

0    2017 Q2
1    2017 Q2
2    2017 Q2
3    2017 Q2
4    2017 Q2
5    2017 Q3
6    2017 Q3
7    2017 Q3
8    2017 Q3
9    2017 Q3
Name: Quarter, dtype: object
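
As a possible alternative to the string key, the quarter could also be stored as a pandas Period via `dt.to_period("Q")`. The sketch below uses a small made-up frame (the three dates are assumptions for illustration, not the asker's data) to show that the resulting key sorts chronologically across years:

```python
import pandas as pd

# Hypothetical sample mirroring the "Issue Date" column above.
df = pd.DataFrame({"Issue Date": ["2017-04-03", "2017-07-03", "2016-12-30"]})
df["Issue Date"] = pd.to_datetime(df["Issue Date"])

# Period-typed quarter key instead of a formatted string.
df["Quarter"] = df["Issue Date"].dt.to_period("Q")

# Sorting by the Period column orders rows chronologically.
ordered = df.sort_values("Quarter")["Quarter"].astype(str).tolist()
print(ordered)
```

The "YYYY Qn" strings built above also sort correctly, so this is mostly a matter of taste; a Period key may be handier if the quarters are later used for date arithmetic.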

Now, aggregate:

funcs = {'Outstanding Principal': np.sum, 
         'Actual Loss': np.sum, 
         'Amount': np.sum, 
         'Loan #': len}

result = df.groupby(['Internal Score', "Quarter"]).agg(funcs)
print(result)

                            Outstanding Principal   Amount      Actual Loss     Loan #
Internal Score  Quarter                 
             A  2017 Q2     3337.76                 3337.76     0.0             1
                2017 Q3     8855.06                 8855.06     0.0             2
             B  2017 Q2     8452.68                 8452.68     0.0             3
                2017 Q3     5059.27                 5059.27     0.0             2
             C  2017 Q2     5536.46                 5536.46     0.0             1
                2017 Q3     2039.96                 2039.96     0.0             1
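
The question asks for a distinct count of Loan #, while `len` counts rows. If a loan can appear on more than one row, `nunique` may be closer to what is wanted. A minimal sketch with made-up data (loan 57144 is deliberately duplicated, and the output column names `amount`, `rows` and `distinct_loans` are hypothetical), assuming a pandas version that supports named aggregation (0.25 or later):

```python
import pandas as pd

# Hypothetical data: loan 57144 appears twice to contrast size with nunique.
df = pd.DataFrame({
    "Loan #": [57144, 57144, 57160, 57163],
    "Amount": [3337.76, 3337.76, 3443.91, 3441.50],
    "Internal Score": ["A", "A", "B", "B"],
    "Quarter": ["2017 Q2", "2017 Q2", "2017 Q2", "2017 Q3"],
})

summary = df.groupby(["Internal Score", "Quarter"]).agg(
    amount=("Amount", "sum"),             # total amount per group
    rows=("Loan #", "size"),              # number of rows, like len
    distinct_loans=("Loan #", "nunique"), # number of distinct loan ids
)
print(summary)
```

Note that a running total of per-quarter distinct counts would still count a loan twice if it shows up in two different quarters; whether that matters depends on the data.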

Finally, use transform and cumsum:

cum_cols = ["Amount", "Loan #"]
cumsums = result.groupby(level="Internal Score")[cum_cols].transform(lambda x: x.cumsum())
result.loc[:, cum_cols] = cumsums

print(result)
                            Outstanding Principal   Amount      Actual Loss     Loan #
Internal Score  Quarter                 
             A  2017 Q2     3337.76                 3337.76     0.0             1
                2017 Q3     8855.06                12192.82     0.0             3
             B  2017 Q2     8452.68                 8452.68     0.0             3
                2017 Q3     5059.27                13511.95     0.0             5
             C  2017 Q2     5536.46                 5536.46     0.0             1
                2017 Q3     2039.96                 7576.42     0.0             2
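
For reference, the steps above can be condensed into one self-contained sketch. It rebuilds the modified dummy data from this answer and, as a variation, calls `cumsum` directly on the grouped result instead of going through `transform`, uses the string aggregator `"count"` (non-null count) in place of `len`, and keeps the cumulative figures in a separate frame named `cumulative` (a name chosen here for illustration):

```python
import pandas as pd

df = pd.DataFrame({
    "Loan #": [57144, 57145, 57160, 57161, 57162,
               57163, 57164, 57165, 57166, 57167],
    "Amount": [3337.76, 5536.46, 3443.91, 1162.79, 3845.98,
               3441.50, 2039.96, 4427.53, 4427.53, 1617.77],
    "Issue Date": ["2017-04-03"] * 5 + ["2017-07-03"] * 5,
    "Internal Score": list("ACBBBBCAAB"),
})
df["Issue Date"] = pd.to_datetime(df["Issue Date"])
df["Quarter"] = df["Issue Date"].dt.to_period("Q")

# Per-quarter totals for each score, sorted so quarters are in order.
per_quarter = (
    df.groupby(["Internal Score", "Quarter"])
      .agg({"Amount": "sum", "Loan #": "count"})
      .sort_index()
)

# Running totals that restart for each Internal Score.
cumulative = per_quarter.groupby(level="Internal Score").cumsum()
print(cumulative)
```

The cumulative Amount and Loan # values should match the table printed above (for example 12192.82 and 3 for score A in 2017 Q3).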

【Comments】:

Awesome! Thank you so much, pansen. Much appreciated.
Glad to help :-)