How to make a new column based on difference of max values by index? (Answers)

【Question title】: How to make a new column based on difference of max values by index?
【Posted】: 2021-03-13 15:24:22
【Question description】:

Take the following multi-index dataframe:

index_1   index_2   cum_value
0         2020-01      100.00
0         2020-02       50.00 
0         2020-03      -50.00
0         2020-04      150.00
0         2020-05      200.00    
1         2020-01       25.00
1         2020-02       50.00
1         2020-03     -100.00
1         2020-04       50.00
1         2020-05      200.00
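
For anyone who wants to reproduce the frame above, a minimal sketch follows. It assumes index_2 holds plain strings such as "2020-01" and that cum_value is float; the post does not state either dtype.

```python
import pandas as pd

# Assumption: index_2 is stored as strings; the post does not state its dtype.
months = ["2020-01", "2020-02", "2020-03", "2020-04", "2020-05"]
index = pd.MultiIndex.from_product([[0, 1], months], names=["index_1", "index_2"])

# Values copied from the table in the question, group 0 first, then group 1.
df = pd.DataFrame(
    {"cum_value": [100.0, 50.0, -50.0, 150.0, 200.0,
                   25.0, 50.0, -100.0, 50.0, 200.0]},
    index=index,
)
print(df)
```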

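I need to create a new_col that, for each index_1, holds the difference between the current cum_value and the previous maximum of cum_value within that index_1, but only for months in which cum_value rose above that past maximum. In all other months new_col should be 0.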

The result should look like this:

index_1   index_2   cum_value   new_col
0         2020-01      100.00    100.00 --> first positive value on index_1 [0]
0         2020-02       50.00      0.00
0         2020-03      -50.00      0.00
0         2020-04      150.00     50.00 --> (150 - 100)
0         2020-05      200.00     50.00 --> (200 - 150)
1         2020-01       25.00     25.00 --> first positive value on index_1 [1]
1         2020-02       50.00     25.00 --> (50 - 25)
1         2020-03     -100.00      0.00
1         2020-04       50.00      0.00
1         2020-05      200.00    150.00 --> (200 - 50)

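The first row with a positive value in each index_1 must show that value in new_col. I don't need negative maximum values.

This is the rationale for calculating the marginal value on which some taxes are paid.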
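
One way to read the rule is as a plain loop over a single index_1 group, which may help when checking a vectorized solution by hand. This is only a sketch of that reading, not code from the post; the function name marginal_increases is made up for illustration.

```python
def marginal_increases(values):
    """For each value, return how far it lifts the running non-negative maximum."""
    running_max = 0.0  # negative maxima are ignored, so the floor starts at zero
    out = []
    for v in values:
        if v > running_max:
            # New maximum: record the increase over the previous maximum.
            out.append(v - running_max)
            running_max = v
        else:
            out.append(0.0)
    return out

group_0 = marginal_increases([100.0, 50.0, -50.0, 150.0, 200.0])
group_1 = marginal_increases([25.0, 50.0, -100.0, 50.0, 200.0])
print(group_0)  # [100.0, 0.0, 0.0, 50.0, 50.0]
print(group_1)  # [25.0, 25.0, 0.0, 0.0, 150.0]
```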

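【Discussion of the question】:

Are the positive values in the cum_value column always in ascending order?
@Shubham Sharma No. They can be positive yet lower than an earlier positive value.
So suppose the cum_value for index 0, 2020-04 were 50 instead of 150. What would the output be in that case?
@Shubham Sharma Zero, because 50 is not greater than the previous maximum, which is 100; i.e. it is not a new maximum. I only want to subtract the last two maxima to see the residual value.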

Tags: python pandas dataframe

【Solution 1】:

Code

c = df.groupby(level=0)['cum_value'].cummax()
m = df['cum_value'].ge(c) & df['cum_value'].ge(0)
df['new_col'] = df.loc[m, 'cum_value'].groupby(level=0).diff()
df['new_col'] = df['new_col'].fillna(df['cum_value']).mask(~m, 0)
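
The four lines above assume df already exists. As a self-contained sketch, they can be wrapped in a function and run on the sample data from the question. The function name add_new_col and the string dtype of index_2 are assumptions made for illustration.

```python
import pandas as pd

def add_new_col(df):
    """Apply the four steps above to a copy of df (MultiIndex, level 0 = group)."""
    out = df.copy()
    # Running maximum of cum_value within each level-0 group.
    c = out.groupby(level=0)["cum_value"].cummax()
    # True where the row reaches the running maximum and is non-negative.
    m = out["cum_value"].ge(c) & out["cum_value"].ge(0)
    # Difference between consecutive qualifying rows; other rows align to NaN.
    out["new_col"] = out.loc[m, "cum_value"].groupby(level=0).diff()
    # NaN rows take their own cum_value, then rows outside the mask become 0.
    out["new_col"] = out["new_col"].fillna(out["cum_value"]).mask(~m, 0)
    return out

# Sample data copied from the question; index_2 as strings is an assumption.
months = ["2020-01", "2020-02", "2020-03", "2020-04", "2020-05"]
idx = pd.MultiIndex.from_product([[0, 1], months], names=["index_1", "index_2"])
df = pd.DataFrame(
    {"cum_value": [100.0, 50.0, -50.0, 150.0, 200.0,
                   25.0, 50.0, -100.0, 50.0, 200.0]},
    index=idx,
)

result = add_new_col(df)
print(result["new_col"].tolist())
# [100.0, 0.0, 0.0, 50.0, 50.0, 25.0, 25.0, 0.0, 0.0, 150.0]
```

Working on a copy leaves the caller's frame untouched, which makes it easier to compare the result against other approaches.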

Explanation

Group the dataframe on level=0, i.e. index_1, and transform the cum_value column with cummax to compute the cumulative maximum within each level=0 group:

>>> c

index_1  index_2
0        2020-01    100.0
         2020-02    100.0
         2020-03    100.0
         2020-04    150.0
         2020-05    200.0
1        2020-01     25.0
         2020-02     50.0
         2020-03     50.0
         2020-04     50.0
         2020-05    200.0
Name: cum_value, dtype: float64

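Now compare the cum_value column with the cumulative maximum computed above to create a boolean mask. Note that only non-negative values of cum_value are considered (the ge(0) term). The mask is True where the current month's value is greater than or equal to the maximum of the previous months in its group, and False otherwise.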

>>> m

index_1  index_2
0        2020-01     True
         2020-02    False
         2020-03    False
         2020-04     True
         2020-05     True
1        2020-01     True
         2020-02     True
         2020-03    False
         2020-04     True
         2020-05     True
Name: cum_value, dtype: bool

Since we are only interested in the values of cum_value that satisfy the condition above, we can filter them with the boolean mask:

>>> df.loc[m, 'cum_value']

index_1  index_2
0        2020-01    100.0
         2020-04    150.0
         2020-05    200.0
1        2020-01     25.0
         2020-02     50.0
         2020-04     50.0
         2020-05    200.0
Name: cum_value, dtype: float64

Now group the filtered values above on level=0, i.e. index_1, and use diff on them to calculate the difference between each value and the previous maximum:

>>> df.loc[m, 'cum_value'].groupby(level=0).diff()

index_1  index_2
0        2020-01      NaN
         2020-04     50.0
         2020-05     50.0
1        2020-01      NaN
         2020-02     25.0
         2020-04      0.0
         2020-05    150.0
Name: cum_value, dtype: float64
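
Assigning this shorter Series back to df['new_col'] relies on pandas aligning on the index: rows that were filtered out of the right-hand side receive NaN. A small sketch with made-up values (the frame df_demo is not from the post) shows the effect:

```python
import pandas as pd

# Made-up three-row frame, only to illustrate alignment on assignment.
df_demo = pd.DataFrame({"cum_value": [10.0, 5.0, 30.0]}, index=["a", "b", "c"])
keep = df_demo["cum_value"] >= df_demo["cum_value"].cummax()  # True, False, True

# The right-hand side has only rows "a" and "c"; row "b" is filled with NaN.
df_demo["new_col"] = df_demo.loc[keep, "cum_value"].diff()
print(df_demo)
```

NaN therefore marks two different kinds of rows: the first qualifying row (no predecessor for diff) and every row that was filtered out. The boolean mask is what later tells them apart.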

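Finally, fill the NaN values in the newly created new_col with the corresponding cum_value, and set the rows that do not satisfy condition m to 0: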

>>> df
                 cum_value  new_col
index_1 index_2                    
0       2020-01      100.0    100.0
        2020-02       50.0      0.0
        2020-03      -50.0      0.0
        2020-04      150.0     50.0
        2020-05      200.0     50.0
1       2020-01       25.0     25.0
        2020-02       50.0     25.0
        2020-03     -100.0      0.0
        2020-04       50.0      0.0
        2020-05      200.0    150.0
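
As a cross-check, the same numbers can be reached by a different route that is not part of the answer: take the per-group running maximum, floor it at zero, and look at how much it rises from row to row. The sketch assumes cum_value contains no NaN and that index_2 is stored as strings.

```python
import pandas as pd

# Sample data copied from the question; index_2 as strings is an assumption.
months = ["2020-01", "2020-02", "2020-03", "2020-04", "2020-05"]
idx = pd.MultiIndex.from_product([[0, 1], months], names=["index_1", "index_2"])
df = pd.DataFrame(
    {"cum_value": [100.0, 50.0, -50.0, 150.0, 200.0,
                   25.0, 50.0, -100.0, 50.0, 200.0]},
    index=idx,
)

# Running maximum per group, floored at zero so negative maxima are ignored.
running = df.groupby(level=0)["cum_value"].cummax().clip(lower=0)
# Row-to-row increase of the floored maximum; the first row of each group
# has no predecessor, so it takes the floored maximum itself.
alt = running.groupby(level=0).diff().fillna(running)
print(alt.tolist())
# [100.0, 0.0, 0.0, 50.0, 50.0, 25.0, 25.0, 0.0, 0.0, 150.0]
```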

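【Discussion】:

Great solution!! It worked on a slice of the dataframe. My only problem: if the first maximum values are negative, I need to ignore those maxima and new_col must show zero. Solving that would make it the perfect solution.
@DanielArges Good observation... I'll update the answer. Thanks!
@ShubhamSharma, nice explanation, thanks for sharing sir, cheers.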
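
The case raised in the first comment can be tried on a made-up group whose early running maxima are negative. The sketch below runs the answer's four lines, including the ge(0) term, on such a group; the frame df_neg and its values are invented for illustration.

```python
import pandas as pd

# Made-up single group whose first running maxima are negative.
idx = pd.MultiIndex.from_product(
    [[0], ["2020-01", "2020-02", "2020-03", "2020-04"]],
    names=["index_1", "index_2"],
)
df_neg = pd.DataFrame({"cum_value": [-10.0, -5.0, 20.0, 30.0]}, index=idx)

c = df_neg.groupby(level=0)["cum_value"].cummax()
# ge(0) excludes -10 and -5 even though each is a running maximum.
m = df_neg["cum_value"].ge(c) & df_neg["cum_value"].ge(0)
df_neg["new_col"] = df_neg.loc[m, "cum_value"].groupby(level=0).diff()
df_neg["new_col"] = df_neg["new_col"].fillna(df_neg["cum_value"]).mask(~m, 0)
print(df_neg["new_col"].tolist())  # [0.0, 0.0, 20.0, 10.0]
```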