【Question title】: Modify DataFrame based on previous row (cumulative sum with condition based on previous cumulative sum result)
【Posted】: 2021-10-29 16:49:43
【Question】:

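I have a DataFrame with one column of numbers (quantity). Each row represents one day, so the whole DataFrame should be treated as sequential data. I want to add a second column that holds the cumulative sum of the quantity column, but whenever the cumulative sum becomes greater than 0, the next row should start counting the cumulative sum from 0 again.

I solved this with iterrows(), but I have read that this function is very inefficient, and with millions of rows the calculation takes more than 20 minutes. My solution is below: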

import pandas as pd

df = pd.DataFrame([-1,-1,-1,-1,15,-1,-1,-1,-1,5,-1,+15,-1,-1,-1], columns=['quantity'])


for index, row in df.iterrows():
    if index == 0:
        df.loc[index, 'outcome'] = df.loc[index, 'quantity']
    else:
        previous_outcome = df.loc[index-1, 'outcome'] 
        if previous_outcome > 0:
            previous_outcome = 0

        df.loc[index, 'outcome'] = previous_outcome + df.loc[index, 'quantity']

print(df)

#   quantity    outcome
#   -1          -1.0
#   -1          -2.0
#   -1          -3.0
#   -1          -4.0
#   15          11.0 <- since this is greater than 0, next line will start counting from 0
#   -1          -1.0
#   -1          -2.0
#   -1          -3.0
#   -1          -4.0
#   5            1.0 <- since this is greater than 0, next line will start counting from 0
#   -1          -1.0
#   15          14.0 <- since this is greater than 0, next line will start counting from 0
#   -1          -1.0
#   -1          -2.0
#   -1          -3.0

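Is there a faster (more optimized) way to calculate this?

I am also not sure whether the "if index == 0" block is the best approach. Can it be handled more elegantly? Without this block an error is raised, because the first row has no "previous row" to use in the calculation.

【Comments】:

  • Your data looks more like an array. Have you tried looking at numpy functions? Iterating over a numpy array is much more efficient than iterating over DataFrame rows - never do that!

Tags: python pandas sequential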


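【Answer 1】:

Iterating over DataFrame rows is very slow and should be avoided. Working with chunks of data is the way to go with pandas.

For your case, treating the quantity column of your DataFrame as a numpy array, the code below should speed up the process considerably compared to your approach: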

import pandas as pd
import numpy as np

df = pd.DataFrame([-1,-1,-1,-1,15,-1,-1,-1,-1,5,-1,+15,-1,-1,-1], columns=['quantity'])

x = np.array(df.quantity)
y = np.zeros(x.size)

total = 0
for i, xi in enumerate(x):
    total += xi
    y[i] = total
    total = total if total < 0 else 0

df['outcome'] = y

print(df)

Output:

    quantity  outcome
0         -1     -1.0
1         -1     -2.0
2         -1     -3.0
3         -1     -4.0
4         15     11.0
5         -1     -1.0
6         -1     -2.0
7         -1     -3.0
8         -1     -4.0
9          5      1.0
10        -1     -1.0
11        15     14.0
12        -1     -1.0
13        -1     -2.0
14        -1     -3.0
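The question also asked whether the "if index == 0" special case can be avoided. One possible sketch, which is not part of the original answer and has not been benchmarked here, uses `itertools.accumulate` from the standard library. The helper names `step` and `running_outcome` are made up for illustration. It is still a Python-level loop, so it should not be expected to compete with numba.

```python
from itertools import accumulate


def step(previous_outcome, quantity):
    """Drop a positive carried total back to 0, then add the new quantity."""
    carried = previous_outcome if previous_outcome <= 0 else 0
    return carried + quantity


def running_outcome(quantities):
    """Return the conditional cumulative sum as a plain list."""
    # accumulate() passes the first element through unchanged, so the
    # first row needs no special case.
    return list(accumulate(quantities, step))


quantities = [-1, -1, -1, -1, 15, -1, -1, -1, -1, 5, -1, 15, -1, -1, -1]
print(running_outcome(quantities))
# [-1, -2, -3, -4, 11, -1, -2, -3, -4, 1, -1, 14, -1, -2, -3]
```

With a DataFrame this could be applied as `df['outcome'] = running_outcome(df['quantity'].tolist())`, which gives an integer column rather than the float column shown above.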

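If you still need more speed, consider numba, as suggested in jezrael's answer.

Edit - performance test

I was curious about the performance, so I wrote this module with all 3 approaches.

I did not optimize the individual functions; I just copied the code from the OP and from jezrael's answer with minor changes.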

"""
bench_dataframe.py
Performance test of iteration over DataFrame rows.

Methods tested are `DataFrame.iterrows()`, loop over `numpy.array`,
and same using `numba`.
"""
from numba import njit
import pandas as pd
import numpy as np


def pditerrows(df):
    """Iterate over DataFrame using `iterrows`"""

    for index, row in df.iterrows():
        if index == 0:
            df.loc[index, 'outcome'] = df.loc[index, 'quantity']
        else:
            previous_outcome = df.loc[index-1, 'outcome'] 
            if previous_outcome > 0:
                previous_outcome = 0

            df.loc[index, 'outcome'] = previous_outcome + df.loc[index, 'quantity']
            
    return df


def nparray(df):
    """Convert DataFrame column to `numpy` arrays."""

    x = np.array(df.quantity)
    y = np.zeros(x.size)

    total = 0
    for i, xi in enumerate(x):
        total += xi
        y[i] = total
        total = total if total < 0 else 0
    
    df['outcome'] = y
    
    return df


@njit
def f(x, lim):
    result = np.empty(len(x))
    result[0] = x[0]

    for i, j in enumerate(x[1:], 1):
        previous_outcome = result[i-1]
        if previous_outcome > lim:
            previous_outcome = 0
        result[i] = previous_outcome + x[i]
    return result

def numbaloop(df):
    """Convert DataFrame to `numpy` arrays and loop using `numba`.
    See [https://stackoverflow.com/a/69750009/5069105]
    """
    df['outcome'] = f(df.quantity.to_numpy(), 0)
    return df

def create_df(size):
    """Create a DataFrame filled with -1's and 15's, with 90% of 
    the entries equal to -1 and 10% equal to 15, randomly 
    placed in the array.
    """
    df = pd.DataFrame(
            np.random.choice(
                (-1, 15), 
                size=size, 
                p=[0.9, 0.1]
            ),
            columns=['quantity'])
    return df


# Make sure all tests lead to the same result
df = pd.DataFrame([-1,-1,-1,-1,15,-1,-1,-1,-1,5,-1,+15,-1,-1,-1],
                  columns=['quantity'])
assert nparray(df.copy()).equals(pditerrows(df.copy()))
assert nparray(df.copy()).equals(numbaloop(df.copy()))

Running with a somewhat small array, size = 20_000, gives:

In: import bench_dataframe as bd
 .. df = bd.create_df(size=20_000)

In: %timeit bd.pditerrows(df.copy())
7.06 s ± 224 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In: %timeit bd.nparray(df.copy())
9.76 ms ± 710 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In: %timeit bd.numbaloop(df.copy())
437 µs ± 12.4 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

Here the numpy array is 700+ times faster than iterrows(), and numba is still 22 times faster than numpy.
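As a quick check of those ratios, the mean timings reported above can be divided directly. This is only arithmetic on the posted numbers, ignoring the ± spread that %timeit reports; the dictionary name `timings_s` is made up for illustration.

```python
# Mean timings copied from the %timeit output above, converted to seconds.
timings_s = {
    "iterrows": 7.06,
    "numpy": 9.76e-3,
    "numba": 437e-6,
}

speedup_numpy_vs_iterrows = timings_s["iterrows"] / timings_s["numpy"]
speedup_numba_vs_numpy = timings_s["numpy"] / timings_s["numba"]

print(f"numpy vs iterrows: {speedup_numpy_vs_iterrows:.0f}x")  # 723x
print(f"numba vs numpy:    {speedup_numba_vs_numpy:.1f}x")     # 22.3x
```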

For a larger array, size = 200_000, we get:

In: import bench_dataframe as bd
 .. df = bd.create_df(size=200_000)

In: %timeit bd.pditerrows(df.copy())
I gave up and hit Ctrl+C after 10 minutes or so... =P

In: %timeit bd.nparray(df.copy())
86 ms ± 2.63 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

In: %timeit bd.numbaloop(df.copy())
3.15 ms ± 66.3 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

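In this example numba is again more than 25 times faster than the numpy array, which confirms that you should avoid iterrows() at all costs for anything with more than a couple hundred rows.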
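To compare the two array sizes, the posted mean timings can also be expressed as cost per row. This is a rough sketch based only on the numbers above; the variable names are made up for illustration, and timings on other machines will differ.

```python
# (rows, seconds per run) copied from the two %timeit sessions above.
runs = {
    "numpy": [(20_000, 9.76e-3), (200_000, 86e-3)],
    "numba": [(20_000, 437e-6), (200_000, 3.15e-3)],
}

for name, measurements in runs.items():
    for rows, seconds in measurements:
        ns_per_row = seconds / rows * 1e9
        print(f"{name:>5} {rows:>7} rows: {ns_per_row:6.1f} ns per row")

# Speedup of numba over the plain numpy loop at 200_000 rows.
speedup_large = runs["numpy"][1][1] / runs["numba"][1][1]
print(f"numba vs numpy at 200_000 rows: {speedup_large:.1f}x")
```

On these figures the plain loop over a numpy array costs roughly 430 to 490 ns per row and the numba version roughly 16 to 22 ns per row, so both scale about linearly with the number of rows.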

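【Comments】:

  • I tested your solution, but there is one major difference between its result and my expected result: whenever "outcome" is greater than 0, I want that value to be saved as is (instead of 0). Only the next row should start counting from 0. In my sample data, the first row with quantity = 15 should have outcome = 11. With your method outcome = 0.
  • @Malachiasz - numba is needed to improve the performance of a loop solution; using only enumerate does not perform well.
  • @Malachiasz All you have to do is swap the last two lines of the for loop... The point is: use arrays instead of DataFrames.
  • np. Updated the answer to include the complete code. Please vote on the answers and mark the one that solves your problem as accepted.
  • @jezrael Fair point. numba has some overhead due to compilation, so the real benefit should show up with larger arrays. I will update the benchmark and see =)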
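The ordering issue raised in the comments above can be sketched with plain lists. The two function names are hypothetical; the first presumably mirrors the earlier revision of the answer that the first comment describes (reset, then store), and the second matches the loop in the answer as it stands now (store, then reset).

```python
def reset_before_store(quantities):
    """Reset a positive total to 0 BEFORE saving it (the reported problem)."""
    total = 0
    out = []
    for q in quantities:
        total += q
        total = total if total < 0 else 0
        out.append(total)
    return out


def store_before_reset(quantities):
    """Save the total first, then reset it for the NEXT row (expected result)."""
    total = 0
    out = []
    for q in quantities:
        total += q
        out.append(total)
        total = total if total < 0 else 0
    return out


quantities = [-1, -1, -1, -1, 15, -1, -1, -1, -1, 5, -1, 15, -1, -1, -1]
print(reset_before_store(quantities))
# [-1, -2, -3, -4, 0, -1, -2, -3, -4, 0, -1, 0, -1, -2, -3]
print(store_before_reset(quantities))
# [-1, -2, -3, -4, 11, -1, -2, -3, -4, 1, -1, 14, -1, -2, -3]
```

Only the position of the reset line differs, which is why swapping the last two lines of the loop is enough to get 11, 1 and 14 saved in the rows where the total turns positive.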
【Answer 2】:

If performance matters, I think numba is the best choice when a loop is needed:

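from numba import njit
import numpy as np

@njit
def f(x, lim):
    # np.int was removed in NumPy 1.24; use an explicit integer dtype
    result = np.empty(len(x), dtype=np.int64)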
    result[0] = x[0]

    for i, j in enumerate(x[1:], 1):
        previous_outcome = result[i-1]
        if previous_outcome > lim:
            previous_outcome = 0
        result[i] = previous_outcome + x[i]
    return result

df['outcome1'] = f(df.quantity.to_numpy(), 0)
print(df)
    quantity  outcome  outcome1
0         -1     -1.0        -1
1         -1     -2.0        -2
2         -1     -3.0        -3
3         -1     -4.0        -4
4         15     11.0        11
5         -1     -1.0        -1
6         -1     -2.0        -2
7         -1     -3.0        -3
8         -1     -4.0        -4
9          5      1.0         1
10        -1     -1.0        -1
11        15     14.0        14
12        -1     -1.0        -1
13        -1     -2.0        -2
14        -1     -3.0        -3
