Fastest way for substring replacement in pandas: answer

【Question title】: Fastest way for substring replacement in pandas
【Posted】: 2021-11-12 07:05:29
【Question description】:

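I have a list of substrings that I want to replace with ' '. What is the fastest way to do this? Would Cython be a viable option? My current approach is really slow on 1 million rows, so I am looking for the fastest possible execution.

Example: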

import pandas as pd

df = pd.DataFrame({ "text":
                    ["first text to replace"
                     , "second text to replace"
                     , "test this string"
                     , "this is not the first string"
                     , "short string test"]
                    })

removal_list = ["text to replace", "this string"]

Some attempts:

def replace_str(df, col, removal_list):
    # One full pass over the column (plus one assignment) per item in removal_list.
    for item in removal_list:
        df[col] = df[col].str.replace(item, ' ')
    return df

replace_str(df, 'text', removal_list)



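import re

def replace_text(text):
    # Map each compiled pattern to its replacement string.
    miscdict_comp = {re.compile(a): ' ' for a in removal_list}
    for pattern, replacement in miscdict_comp.items():
        text = pattern.sub(replacement, text)
    return text

# Apply the function row by row to the text column.
df['text'] = df['text'].apply(replace_text)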

【Comments】:

Tags: python pandas string replace substring

【Solution 1】:

This looks like a straightforward use of replace:

reg = '|'.join(removal_list)
df['text'].str.replace(reg, '', regex=True)
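The joined pattern above treats each entry as a regular expression, so an entry containing metacharacters such as `.` or `(` would not be matched literally. A minimal sketch, assuming the entries in removal_list are meant as literal substrings: escape each one with `re.escape` and put longer entries first so that a shorter entry cannot shadow a longer one that starts with it. The names `pattern` and `cleaned` are illustrative and not part of the original answer.

```python
import re
import pandas as pd

df = pd.DataFrame({"text": ["first text to replace",
                            "second text to replace",
                            "test this string",
                            "this is not the first string",
                            "short string test"]})
removal_list = ["text to replace", "this string"]

# Escape each entry so it is matched literally; sort longest first so an
# entry that is a prefix of another cannot win the alternation too early.
pattern = "|".join(
    re.escape(s) for s in sorted(removal_list, key=len, reverse=True)
)

# Single vectorized pass over the column, as in the answer above.
cleaned = df["text"].str.replace(pattern, "", regex=True)
```

For the sample data this should give the same result as the unescaped version, since neither entry contains regex metacharacters.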

Output:

0                          first 
1                         second 
2                           test 
3    this is not the first string
4               short string test
Name: text, dtype: object

This runs quite fast. Here is a benchmark on 1M rows (df = pd.concat([df]*200000), using the OP's DataFrame):

397 ms ± 4.53 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

For comparison:

# replace_str
586 ms ± 9.15 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
# replace_text
1.5 s ± 27.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

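Note: I removed the assignment from the tests so that only the computation is compared. In practice the assignment also takes time, so assigning repeatedly (as replace_str does once per item) will hurt performance.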
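To make the note concrete, here is a rough sketch of how the computation alone and the computation plus assignment could be timed with the standard-library `timeit` module. The row count is reduced to keep the run short, the helper `time_call` is hypothetical, and the assigning variant also copies the frame so that repeated runs start from the same data, which adds some overhead of its own. Timings depend on the machine and the pandas version, so no numbers are shown.

```python
import timeit
import pandas as pd

base = pd.DataFrame({"text": ["first text to replace",
                              "second text to replace",
                              "test this string",
                              "this is not the first string",
                              "short string test"]})
removal_list = ["text to replace", "this string"]

# 10,000 rows here; the benchmark in the answer used [df]*200000 (1M rows).
df = pd.concat([base] * 2000, ignore_index=True)
reg = "|".join(removal_list)

def compute_only():
    # Only the replacement; the result is returned, not stored in df.
    return df["text"].str.replace(reg, "", regex=True)

def compute_and_assign():
    # Replacement plus writing the result back into a copy of the frame.
    out = df.copy()
    out["text"] = out["text"].str.replace(reg, "", regex=True)
    return out

def time_call(func, repeat=3):
    # Best of `repeat` single runs, in seconds.
    return min(timeit.repeat(func, number=1, repeat=repeat))

t_compute = time_call(compute_only)
t_assign = time_call(compute_and_assign)
print(f"compute only: {t_compute:.4f} s, copy + assign: {t_assign:.4f} s")
```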

【Discussion】: