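【Question title】: Modify DataFrame based on previous row (cumulative sum with condition based on previous cumulative sum result)
【Posted】: 2021-10-29 16:49:43
【Question description】:

I have a DataFrame in which one column contains numbers (quantity). Each row represents one day, so the whole DataFrame should be treated as sequential data. I want to add a second column holding the cumulative sum of the quantity column, but whenever the cumulative sum becomes greater than 0, the next row should start accumulating from 0 again.

I solved this with iterrows(), but I have read that this function is very inefficient, and with millions of rows the calculation takes more than 20 minutes. My solution is below: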

import pandas as pd

df = pd.DataFrame([-1,-1,-1,-1,15,-1,-1,-1,-1,5,-1,+15,-1,-1,-1], columns=['quantity'])


for index, row in df.iterrows():
    if index == 0:
        df.loc[index, 'outcome'] = df.loc[index, 'quantity']
    else:
        previous_outcome = df.loc[index-1, 'outcome'] 
        if previous_outcome > 0:
            previous_outcome = 0

        df.loc[index, 'outcome'] = previous_outcome + df.loc[index, 'quantity']

print(df)

#   quantity    outcome
#   -1          -1.0
#   -1          -2.0
#   -1          -3.0
#   -1          -4.0
#   15          11.0 <- since this is greater than 0, next line will start counting from 0
#   -1          -1.0
#   -1          -2.0
#   -1          -3.0
#   -1          -4.0
#   5            1.0 <- since this is greater than 0, next line will start counting from 0
#   -1          -1.0
#   15          14.0 <- since this is greater than 0, next line will start counting from 0
#   -1          -1.0
#   -1          -2.0
#   -1          -3.0

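Is there a faster (more optimized) way to compute this?

I am also not sure whether the "if index == 0" block is the best solution. Can this be handled more elegantly? Without this block an error is raised, because the first row has no "previous row" to use in the calculation.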

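【Question comments】:

Your data looks more like an array. Have you tried looking at numpy functions? Iterating over a numpy array is far more efficient than iterating over DataFrame rows - never do that!

Tags: python pandas sequential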

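【Solution 1】:

Iterating over DataFrame rows is very slow and should be avoided. Working on whole chunks of data is the way to use pandas.

For your case, treating the quantity column of your DataFrame as a numpy array, the code below should speed up the process considerably compared to your approach: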

import pandas as pd
import numpy as np

df = pd.DataFrame([-1,-1,-1,-1,15,-1,-1,-1,-1,5,-1,+15,-1,-1,-1], columns=['quantity'])

x = np.array(df.quantity)
y = np.zeros(x.size)

total = 0
for i, xi in enumerate(x):
    total += xi
    y[i] = total
    total = total if total < 0 else 0

df['outcome'] = y

print(df)

Output:

    quantity  outcome
0         -1     -1.0
1         -1     -2.0
2         -1     -3.0
3         -1     -4.0
4         15     11.0
5         -1     -1.0
6         -1     -2.0
7         -1     -3.0
8         -1     -4.0
9          5      1.0
10        -1     -1.0
11        15     14.0
12        -1     -1.0
13        -1     -2.0
14        -1     -3.0
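
On the question's side note about the "if index == 0" special case: one way to avoid it is to let the recurrence supply its own starting value. The sketch below is not taken from the answers here. It assumes the standard library's `itertools.accumulate`, and the helper name `reset_cumsum` is made up for illustration. It is still a Python-level loop, so it addresses elegance rather than speed.

```python
from itertools import accumulate


def reset_cumsum(quantities):
    """Running sum that restarts from 0 on the row after it exceeds 0."""
    # accumulate() emits the first item unchanged, then func(previous, item),
    # so no special case is needed for the first row.
    # min(previous, 0) discards a positive previous outcome before adding.
    return list(accumulate(quantities, lambda prev, q: min(prev, 0) + q))


quantity = [-1, -1, -1, -1, 15, -1, -1, -1, -1, 5, -1, 15, -1, -1, -1]
print(reset_cumsum(quantity))
# [-1, -2, -3, -4, 11, -1, -2, -3, -4, 1, -1, 14, -1, -2, -3]
```

The result can be assigned straight to a column, for example `df['outcome'] = reset_cumsum(df['quantity'])`.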

If you need still more speed, consider numba, as in jezrael's answer.
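
As an alternative that none of the answers here propose, the explicit loop can apparently be removed altogether. The recurrence is outcome[i] = quantity[i] + min(outcome[i-1], 0), and the carried term min(outcome[i], 0) equals the plain cumulative sum minus its running maximum (with the maximum floored at 0). The sketch below relies on that identity; the function name `reset_cumsum_vectorized` is hypothetical. For integer data the result matches the loop exactly; for floats, rounding in the cumulative sum may cause small differences.

```python
import numpy as np
import pandas as pd


def reset_cumsum_vectorized(x):
    """Loop-free version of the 'restart after exceeding 0' cumulative sum."""
    x = np.asarray(x)
    s = np.cumsum(x)                       # ordinary running sum
    peak = np.maximum.accumulate(s)        # highest running sum seen so far
    peak = np.maximum(peak, 0)             # the empty prefix counts as sum 0
    carry = s - peak                       # equals min(outcome[i], 0)
    prev_carry = np.concatenate(([0], carry[:-1]))  # shift down by one row
    return x + prev_carry


df = pd.DataFrame([-1, -1, -1, -1, 15, -1, -1, -1, -1, 5, -1, 15, -1, -1, -1],
                  columns=['quantity'])
df['outcome'] = reset_cumsum_vectorized(df['quantity'].to_numpy())
print(df)
```

On the sample data this reproduces the outcome column shown above (as integers rather than floats). It has not been benchmarked against the numba version.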

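Edit - performance test

I was curious about the performance, so I put together this module with all three methods.

I did not optimize the individual functions; I just copied the code from the OP and from jezrael's answer with minor changes.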

"""
bench_dataframe.py
Performance test of iteration over DataFrame rows.

Methods tested are `DataFrame.iterrows()`, loop over `numpy.array`,
and same using `numba`.
"""
from numba import njit
import pandas as pd
import numpy as np


def pditerrows(df):
    """Iterate over DataFrame using `iterrows`"""

    for index, row in df.iterrows():
        if index == 0:
            df.loc[index, 'outcome'] = df.loc[index, 'quantity']
        else:
            previous_outcome = df.loc[index-1, 'outcome'] 
            if previous_outcome > 0:
                previous_outcome = 0

            df.loc[index, 'outcome'] = previous_outcome + df.loc[index, 'quantity']
            
    return df


def nparray(df):
    """Convert DataFrame column to `numpy` arrays."""

    x = np.array(df.quantity)
    y = np.zeros(x.size)

    total = 0
    for i, xi in enumerate(x):
        total += xi
        y[i] = total
        total = total if total < 0 else 0
    
    df['outcome'] = y
    
    return df


@njit
def f(x, lim):
    result = np.empty(len(x))
    result[0] = x[0]

    for i, j in enumerate(x[1:], 1):
        previous_outcome = result[i-1]
        if previous_outcome > lim:
            previous_outcome = 0
        result[i] = previous_outcome + x[i]
    return result

def numbaloop(df):
    """Convert DataFrame to `numpy` arrays and loop using `numba`.
    See [https://stackoverflow.com/a/69750009/5069105]
    """
    df['outcome'] = f(df.quantity.to_numpy(), 0)
    return df

def create_df(size):
    """Create a DataFrame filled with -1's and 15's, with 90% of
    the entries equal to -1 and 10% equal to 15, randomly
    placed in the array.
    """
    df = pd.DataFrame(
            np.random.choice(
                (-1, 15), 
                size=size, 
                p=[0.9, 0.1]
            ),
            columns=['quantity'])
    return df


# Make sure all tests lead to the same result
df = pd.DataFrame([-1,-1,-1,-1,15,-1,-1,-1,-1,5,-1,+15,-1,-1,-1],
                  columns=['quantity'])
assert nparray(df.copy()).equals(pditerrows(df.copy()))
assert nparray(df.copy()).equals(numbaloop(df.copy()))

Running it with a smaller array, size = 20_000, gives:

In: import bench_dataframe as bd
 .. df = bd.create_df(size=20_000)

In: %timeit bd.pditerrows(df.copy())
7.06 s ± 224 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In: %timeit bd.nparray(df.copy())
9.76 ms ± 710 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In: %timeit bd.numbaloop(df.copy())
437 µs ± 12.4 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

Here the numpy array is 700+ times faster than iterrows(), and numba is a further 22 times faster than numpy.

For a larger array, size = 200_000, we get:

In: import bench_dataframe as bd
 .. df = bd.create_df(size=200_000)

In: %timeit bd.pditerrows(df.copy())
I gave up and hit Ctrl+C after 10 minutes or so... =P

In: %timeit bd.nparray(df.copy())
86 ms ± 2.63 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

In: %timeit bd.numbaloop(df.copy())
3.15 ms ± 66.3 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

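In this example numba is again more than 25 times faster than the numpy array, which confirms that you should avoid iterrows() at all costs for anything beyond a few hundred rows.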
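
The timings above rely on IPython's %timeit. For readers without IPython, a rough equivalent can be sketched with the standard library's `timeit` module. The helper `best_time` and the function `loop_numpy` below are illustrative names, not part of bench_dataframe.py, and no timing numbers are claimed here because they depend on the machine. The warm-up call is there so that one-off costs, such as numba's compilation on the first call, are kept out of the measurement.

```python
import timeit

import numpy as np


def loop_numpy(x):
    """Same loop as nparray() above, but working on a bare array."""
    y = np.empty(x.size)
    total = 0
    for i, xi in enumerate(x):
        total += xi
        y[i] = total
        total = total if total < 0 else 0
    return y


def best_time(func, arg, repeat=3, number=2):
    """Best per-call time in seconds over `repeat` batches of `number` calls."""
    func(arg)  # warm-up call, excluded from the timing
    batches = timeit.repeat(lambda: func(arg), repeat=repeat, number=number)
    return min(batches) / number


rng = np.random.default_rng(0)
timings = {}
for size in (20_000, 200_000):
    # Same distribution as create_df(): 90% of entries -1, 10% of entries 15
    x = rng.choice((-1, 15), size=size, p=[0.9, 0.1])
    timings[size] = best_time(loop_numpy, x)
    print(f"size={size}: {timings[size] * 1e3:.2f} ms per call")
```

Any other candidate, for example a numba-compiled function, can be passed to `best_time` in the same way.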

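【Comments】:

I tested your solution, but there is a significant difference between its result and my expected result: whenever the "outcome" is greater than 0, I want it stored as that value (not as 0). Only the next row should start counting from 0. In my sample data, the first row with quantity = 15 should have outcome = 11. With your approach, outcome = 0.
@Malachiasz - numba is needed to improve the performance of the loop solution; using enumerate alone performs poorly.
@Malachiasz All you have to do is swap the last two lines of the for loop... The point is: use arrays instead of DataFrames.
np. Updated the answer with the complete code. Please vote on the answers and mark the one that solved your problem as accepted.
@jezrael Fair point. numba has some overhead due to compilation, so the real benefit should show up with larger arrays. I will update the benchmark and see =)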

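【Solution 2】:

If performance matters, I think numba is the best option when a loop is used:

from numba import njit
import numpy as np

@njit
def f(x, lim):
    # np.int was removed in NumPy 1.24; use an explicit integer dtype instead
    result = np.empty(len(x), dtype=np.int64)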
    result[0] = x[0]

    for i, j in enumerate(x[1:], 1):
        previous_outcome = result[i-1]
        if previous_outcome > lim:
            previous_outcome = 0
        result[i] = previous_outcome + x[i]
    return result

df['outcome1'] = f(df.quantity.to_numpy(), 0)
print(df)
    quantity  outcome  outcome1
0         -1     -1.0        -1
1         -1     -2.0        -2
2         -1     -3.0        -3
3         -1     -4.0        -4
4         15     11.0        11
5         -1     -1.0        -1
6         -1     -2.0        -2
7         -1     -3.0        -3
8         -1     -4.0        -4
9          5      1.0         1
10        -1     -1.0        -1
11        15     14.0        14
12        -1     -1.0        -1
13        -1     -2.0        -2
14        -1     -3.0        -3

【Comments】: