Pandas - Iterate over rows and compare previous values - faster

【Question title】: Pandas - Iterate over rows and compare previous values - faster
【Posted】: 2020-04-05 16:32:18
【Question description】:

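I am trying to get my results faster (800 rows take 13 minutes). I asked a similar question here: pandas - iterate over rows and calculate - faster - but I could not adapt the good solutions there to my variation. The difference is that if the number of previous values in 'col2' that overlap the current row reaches 'n = 3' (x >= n in the code), the value of 'col2' in that row is set to '0', which affects the rows that follow.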

import pandas as pd
d = {'col1': [20, 23, 40, 41, 46, 47, 48, 49, 50, 50, 52, 55, 56, 69, 70],
    'col2': [39, 32, 42, 50, 63, 67, 64, 68, 68, 74, 59, 75, 58, 71, 66]}
df = pd.DataFrame(data=d)


df["overlap_count"] = ""  #create new column
n = 3 #if x >= n, then value = 0

for row in range(len(df)):
        x = (df["col2"].loc[0:row-1] > (df["col1"].loc[row])).sum()
        df["overlap_count"].loc[row] = x

        if x >= n:                 
            df["col2"].loc[row] = 0
            df["overlap_count"].loc[row] = 'x'
df

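I get the following result, where the 'col2' values are replaced with 0 wherever the overlap count reaches 'n', and the overlap_count column is marked 'x' for those rows: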

   col1 col2 overlap_count
0   20  39  0
1   23  32  1
2   40  42  0
3   41  50  1
4   46  63  1
5   47  67  2
6   48  0   x
7   49  0   x
8   50  68  2
9   50  0   x
10  52  0   x
11  55  0   x
12  56  0   x
13  69  71  0
14  70  66  1
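
The rule behind this table can be written out without pandas. The following is a minimal sketch, not the code from the post: `overlap_counts` is a hypothetical helper name, and it assumes the rule is exactly "count the earlier col2 values (after any zeroing) that are strictly greater than the current col1, and zero the row when that count is at least n".

```python
def overlap_counts(col1, col2, n):
    """Hypothetical reference helper; not part of the original post."""
    col2 = list(col2)  # work on a copy so the caller's list is untouched
    counts = []
    for row, c1 in enumerate(col1):
        # Earlier col2 values (already zeroed where the rule fired) above col1
        x = sum(v > c1 for v in col2[:row])
        if x >= n:
            col2[row] = 0      # this zero is seen by all later rows
            counts.append('x')
        else:
            counts.append(x)
    return col2, counts


col1 = [20, 23, 40, 41, 46, 47, 48, 49, 50, 50, 52, 55, 56, 69, 70]
col2 = [39, 32, 42, 50, 63, 67, 64, 68, 68, 74, 59, 75, 58, 71, 66]
new_col2, counts = overlap_counts(col1, col2, n=3)
```

Because each row depends on whether earlier rows were zeroed, the calculation is sequential, which is why a plain column-wise vectorized expression does not reproduce it directly.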

Thanks for your help and time!

【Question comments】:

Could you show the expected result?

Tags: python pandas loops

【Solution 1】:

Create a function, then apply it as follows:

df['overlap_count'] = [fn(i) for i in df['overlap_count']]

【Comments】:

【Solution 2】:

Try this; it may be faster.

df['overlap_count'] = df.groupby('col1')['col2'].transform(lambda g: len((g >= g.name).index))

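【Comments】:

Sorry, the result is not what I expected. Thanks!
Do you expect the number of unique values >= x, or the number of rows? For unique values, replace g >= g.name with g.drop_duplicates() >= g.name
Look at index 8, where col1 = 50 and the previous values of col2 are 39, 32, ... 63, 67, 0, 0. The code checks how many of those values are greater than 50. The result is 2 (63, 67). If the result were >= n = 3, then at index 8 the col2 value would change from 68 to 0.
I don't quite follow the logic. Why 68? Do you want to fill in overlap_count, or do something else?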

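【Solution 3】:

I think you can use numba to improve performance. It needs to work with numeric values only, so -1 is written instead of 'x', and the new column is filled with 0 instead of an empty string: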

df["overlap_count"] = 0  #create new column
n = 3 #if x >= n, then value = 0

a = df[['col1','col2','overlap_count']].values

from numba import njit

@njit
def custom_sum(arr, n):
    for row in range(arr.shape[0]):
        x = (arr[0:row, 1] > arr[row, 0]).sum()
        arr[row, 2] = x
        if x >= n:
            arr[row, 1] = 0
            arr[row, 2] = -1
    return arr

df1 = pd.DataFrame(custom_sum(a, n), columns=df.columns)
print (df1)
    col1  col2  overlap_count
0     20    39              0
1     23    32              1
2     40    42              0
3     41    50              1
4     46    63              1
5     47    67              2
6     48     0             -1
7     49     0             -1
8     50    68              2
9     50     0             -1
10    52     0             -1
11    55     0             -1
12    56     0             -1
13    69    71              0
14    70    66              1
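
If numba is not available, a different technique (not the one used in this answer) is to keep the earlier col2 values in a sorted list and count with binary search from the standard library `bisect` module. This is only a sketch under the same rule as `custom_sum`; `overlap_counts_sorted` is a hypothetical name, and it uses -1 as the marker like the numba version.

```python
from bisect import bisect_right, insort


def overlap_counts_sorted(col1, col2, n):
    """Hypothetical helper: same rule as custom_sum, using a sorted list."""
    seen = []                  # col2 values of earlier rows, kept sorted
    new_col2, counts = [], []
    for c1, c2 in zip(col1, col2):
        # Number of earlier col2 values strictly greater than this row's col1
        x = len(seen) - bisect_right(seen, c1)
        if x >= n:
            c2, x = 0, -1      # zero the row and mark it, as in custom_sum
        insort(seen, c2)       # later rows see the (possibly zeroed) value
        new_col2.append(c2)
        counts.append(x)
    return new_col2, counts


col1 = [20, 23, 40, 41, 46, 47, 48, 49, 50, 50, 52, 55, 56, 69, 70]
col2 = [39, 32, 42, 50, 63, 67, 64, 68, 68, 74, 59, 75, 58, 71, 66]
new_col2, counts = overlap_counts_sorted(col1, col2, n=3)
```

Each count is a binary search rather than a scan over all earlier rows. Inserting into the list still shifts elements, so this is not timed here and no speed claim is made relative to the numba version.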

Performance:

d = {'col1': [20, 23, 40, 41, 46, 47, 48, 49, 50, 50, 52, 55, 56, 69, 70],
    'col2': [39, 32, 42, 50, 63, 67, 64, 68, 68, 74, 59, 75, 58, 71, 66]}
df = pd.DataFrame(data=d)

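# 4500 rows
df = pd.concat([df] * 300, ignore_index=True)
# Assumed step, not shown in the original: rebuild the input array
# so that custom_sum is timed on the enlarged frame
df["overlap_count"] = 0
a = df[['col1','col2','overlap_count']].values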

print (df)
In [115]: %%timeit
     ...: pd.DataFrame(custom_sum(a, n), columns=df.columns)
     ...: 
8.11 ms ± 224 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [116]: %%timeit 
     ...: for row in range(len(df)):
     ...:         x = (df["col2"].loc[0:row-1] > (df["col1"].loc[row])).sum()
     ...:         df["overlap_count"].loc[row] = x
     ...: 
     ...:         if x >= n:                 
     ...:             df["col2"].loc[row] = 0
     ...:             df["overlap_count"].loc[row] = 'x'
     ...:             
     ...:             
7.84 s ± 442 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

【Comments】:

@jezarel, could you answer this follow-up question... stackoverflow.com/questions/61078906/…