Pandas loop optimization: answers

【Question title】: Pandas loop optimization
【Posted】: 2018-07-26 07:46:13
【Question】:

Is there a better way (performance-wise) to run the following loop in pandas, assuming df is a DataFrame?

for i in range(len(df)):
    if df['signal'].iloc[i] == 0:   # signal is 0 (negative): step the position down
        if df['position'].iloc[i - 1] - 0.02 < -1:   # previous row - 0.02 would fall below -1: set the current row to -1
            df['position'].iloc[i] = -1
        else:   # otherwise subtract 0.02 from the previous row's value
            df['position'].iloc[i] = df['position'].iloc[i - 1] - 0.02
    elif df['signal'].iloc[i] == 1:     # signal is 1 (positive): step the position up
        if df['position'].iloc[i - 1] + 0.02 > 1:     # previous row + 0.02 would exceed 1: set the current row to 1
            df['position'].iloc[i] = 1
        else:   # otherwise add 0.02 to the previous row's value
            df['position'].iloc[i] = df['position'].iloc[i - 1] + 0.02

I would appreciate any suggestions; I am just getting started with pandas and may well be missing something important.

Source CSV data:

Date,sp500,sp500 MA,UNRATE,UNRATE MA,signal,position
2000-01-01,,,4.0,4.191666666666665,1,0
2000-01-02,,,4.0,4.191666666666665,1,0
2000-01-03,102.93,95.02135,4.0,4.191666666666665,1,0
2000-01-04,98.91,95.0599,4.0,4.191666666666665,1,0
2000-01-05,99.08,95.11245000000001,4.0,4.191666666666665,1,0
2000-01-06,97.49,95.15450000000001,4.0,4.191666666666665,1,0
2000-01-07,103.15,95.21575000000001,4.0,4.191666666666665,1,0
2000-01-08,103.15,95.21575000000001,4.0,4.191666666666665,1,0
2000-01-09,103.15,95.21575000000001,4.0,4.191666666666665,1,0

Expected output:

Date,sp500,sp500 MA,UNRATE,UNRATE MA,signal,position
2000-01-01,,,4.0,4.191666666666665,1,0.02
2000-01-02,,,4.0,4.191666666666665,1,0.04
2000-01-03,102.93,95.02135,4.0,4.191666666666665,1,0.06
2000-01-04,98.91,95.0599,4.0,4.191666666666665,1,0.08
2000-01-05,99.08,95.11245000000001,4.0,4.191666666666665,1,0.1
2000-01-06,97.49,95.15450000000001,4.0,4.191666666666665,1,0.12
2000-01-07,103.15,95.21575000000001,4.0,4.191666666666665,1,0.14
2000-01-08,103.15,95.21575000000001,4.0,4.191666666666665,1,0.16
2000-01-09,103.15,95.21575000000001,4.0,4.191666666666665,1,0.18

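Update: all the answers below (at the time of writing) produce a constant position value of 0.02, unlike my naive loop. In other words, I am looking for a solution that yields 0.02, 0.04, 0.06, 0.08, and so on in the position column.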

【Question discussion】:

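If you are looping in pandas, you are almost always doing it wrong.
@SuperStew Yes, I had that feeling.
Could you add an example of the input and the expected output? Something like a minimal reproducible example.
@varnie: What most people miss is that row n of the output does not depend on row n-1 of the input but on row n-1 of the output, so it cannot simply be decomposed into shifts.
If you have a working solution that consists of a simple loop, write a version that depends only on NumPy arrays, as @Jonas Byström did, and then use a compiler such as Numba or Cython. See e.g. stackoverflow.com/a/50969037/4045774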
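
To make the point about input rows versus output rows concrete, here is a minimal sketch in plain Python lists. It assumes the sample data from the question (signal is 1 on every row, position starts at 0), that signal only ever takes the values 0 and 1, and a hypothetical helper named clamp that is not part of the original thread.

```python
STEP = 0.02
signal = [1] * 9              # the 'signal' column of the sample data
position_in = [0.0] * 9       # the input 'position' column (all zeros)

def clamp(x, lo=-1.0, hi=1.0):
    """Keep x inside [lo, hi]."""
    return max(lo, min(hi, x))

# (a) Shift-style: row i reads the *input* value of row i-1.
#     Row 0 has no predecessor, so it is left as None (NaN in pandas).
shifted = [None] + [
    clamp(position_in[i - 1] + (STEP if signal[i] == 1 else -STEP))
    for i in range(1, len(signal))
]

# (b) Recurrence: row i reads the *output* just computed for row i-1.
recurrent = []
prev = 0.0                    # assumed position before the first row
for s in signal:
    prev = clamp(prev + (STEP if s == 1 else -STEP))
    recurrent.append(prev)

print(shifted[:4])                           # [None, 0.02, 0.02, 0.02]
print([round(p, 2) for p in recurrent[:4]])  # [0.02, 0.04, 0.06, 0.08]
```

Variant (a) stays at 0.02 because every row starts again from the unchanged input column, while variant (b) carries each result forward into the next row.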

Tags: python performance pandas

【Solution 1】:

Don't use a loop. Pandas is built around vectorized operations. For example, for signal == 0:

import numpy as np

# previous row's position, lowered by one step (NaN in the first row)
pos_shift = df['position'].shift() - 0.02
m1 = df['signal'] == 0
m2 = pos_shift < -1

df.loc[m1 & m2, 'position'] = -1
df['position'] = np.where(m1 & ~m2, pos_shift, df['position'])

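You can write something similar for signal == 1.

【Discussion】:

Thanks. It looks like magic, but I just noticed that your version produces slightly different results from my original code.
@varnie That is why it would be really handy if you edited your question to include some sample input and output :)
@JonClements OK, I have tried to provide some input and output (updated my question).
@jpp From my tests, your version yields the same position of 0.02 for every row except the first (which is NaN), whereas in my version it increases by 0.02 per row.
@varnie, to be honest, I would focus on the logic first rather than on the results. Python / pandas will (usually) do exactly what you ask :). Is there a part you don't understand? pd.Series.shift will, of course, have NaN in the first row, but you can special-case that if it is a problem.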

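【Solution 2】:

Thanks for adding the data and the sample output. First of all, I am fairly sure you cannot vectorize this, because each calculation depends on the output of the previous one. So this is the best I could do.

Your approach takes about 0.116999 seconds on my machine.

This one comes in at about 0.0039999 seconds.

It is not vectorized, but it gives a good speedup, because building the values in a list and adding them back to the DataFrame at the end is faster.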

def myfunc(pos_pre, signal):
    # fallback (an assumption): keep the previous value if signal is neither 0 nor 1
    pos = pos_pre
    if signal == 0:  # signal is 0 (negative)
        # subtract 0.02 from the previous value
        pos = pos_pre - 0.02
        if pos < -1:  # previous value - 0.02 fell below -1: clamp to -1
            pos = -1

    elif signal == 1:  # signal is 1 (positive)
        # add 0.02 to the previous value
        pos = pos_pre + 0.02
        if pos > 1:  # previous value + 0.02 exceeded 1: clamp to 1
            pos = 1

    return pos


# Set the first position value explicitly: the original loop does not really
# calculate it, since there is no row before index 0. With this data it is always 0.02.
new_pos = [0.02]

# skip index zero since there is no position 0 minus 1
for i in range(1, len(df)):
    new_pos.append(myfunc(pos_pre=new_pos[i-1], signal=df['signal'].iloc[i]))

df['position'] = new_pos

Output:

df.position
0    0.02
1    0.04
2    0.06
3    0.08
4    0.10
5    0.12
6    0.14
7    0.16
8    0.18
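
As an aside, the same running calculation can be written with itertools.accumulate from the standard library (the initial argument needs Python 3.8 or later). This is only a sketch under assumptions: the names next_position and signals are hypothetical, the position before the first row is assumed to be 0, and a plain list stands in for df['signal'].tolist().

```python
from itertools import accumulate

STEP = 0.02

def next_position(prev, signal):
    """One step of the recurrence: move by STEP, then clamp to [-1, 1]."""
    if signal == 1:
        delta = STEP
    elif signal == 0:
        delta = -STEP
    else:
        delta = 0.0   # assumption: any other signal leaves the position unchanged
    return max(-1.0, min(1.0, prev + delta))

signals = [1] * 9     # stand-in for df['signal'].tolist()

# initial=0.0 is the assumed starting position; it is dropped from the result.
positions = list(accumulate(signals, next_position, initial=0.0))[1:]

print([round(p, 2) for p in positions])
# [0.02, 0.04, 0.06, 0.08, 0.1, 0.12, 0.14, 0.16, 0.18]
```

The resulting list can be assigned back with df['position'] = positions. Unlike the hard-coded first value of 0.02 above, this variant derives the first row from the first signal, so it would start at -0.02 if that signal were 0.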

【Discussion】:

【Solution 3】:

Yes. When performance matters, you should always operate on the underlying NumPy arrays:

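signal = df['signal'].values
# Work on a float copy: if 'position' was read as integers (all zeros),
# writing 0.02 into an integer array would truncate it to 0.
position = df['position'].values.astype(float)
for i in range(len(df)):
    if signal[i] == 0:
        if position[i-1]-0.02 < -1:
            position[i] = -1
        else:
            position[i] = position[i-1]-0.02
    elif signal[i] == 1:
        if position[i-1]+0.02 > 1:
            position[i] = 1
        else:
            position[i] = position[i-1]+0.02
df['position'] = position  # write the computed values back to the DataFrame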

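You will be surprised by the performance gain, often 10x or more.

【Discussion】:

This still iterates in the same way as the question. The main benefit of operating on NumPy arrays is being able to use vectorized operations.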
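
On the vectorization point: a tempting shortcut is a cumulative sum that is clipped afterwards (in NumPy terms, np.clip(np.cumsum(deltas), -1, 1)). The sketch below uses plain lists and a made-up signal sequence, not the data from the question, to suggest why that is not equivalent to the loop once a limit has been reached. The names clamp, shortcut and exact are hypothetical.

```python
from itertools import accumulate

STEP = 0.02
# 60 "up" signals (the +1 cap is reached after about 50 rows), then 5 "down" signals.
signals = [1] * 60 + [0] * 5
deltas = [STEP if s == 1 else -STEP for s in signals]

def clamp(x, lo=-1.0, hi=1.0):
    """Keep x inside [lo, hi]."""
    return max(lo, min(hi, x))

# Shortcut: unbounded running total, clamped only at the end of each row.
shortcut = [clamp(total) for total in accumulate(deltas)]

# Loop semantics: clamp at every step and carry the clamped value forward.
exact = list(accumulate(deltas, lambda prev, d: clamp(prev + d), initial=0.0))[1:]

print(round(shortcut[-1], 2), round(exact[-1], 2))  # 1.0 0.9
```

The two agree until the cap is hit. After that the unbounded total keeps growing past 1, so the later "down" steps are hidden by the clip, while the loop starts descending from exactly 1.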

【Solution 4】:

There is most likely a better way, but this should work too:

# previous row's signal (NaN in the first row)
df['previous'] = df.signal.shift()

def get_signal_value(row):
    if row.signal == 0:
        compare = row.previous - 0.02
        if compare < -1:
            return -1
        else:
            return compare
    elif row.signal == 1:
        compare = row.previous + 0.02
        if compare > 1:
            return 1
        else:
            return compare

# apply the function row by row
df['new_signal'] = df.apply(get_signal_value, axis=1)

【Discussion】: