Fastest way to replace substrings with a dictionary (on a large dataset): answers

【Question title】: Fastest way to replace substrings with dictionary (On large dataset)
【Posted】: 2018-02-24 13:48:56
【Question description】:

I have 10M texts (they fit in RAM) and a Python dictionary of the form:

"old substring":"new substring"

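The dictionary contains about 15k substrings.

I am looking for the FASTEST way to replace each text using the dict (find every "old substring" in every text and replace it with the "new substring").

The source texts are in a pandas DataFrame. So far I have tried these approaches:

1) Replace in a loop with reduce and str.replace (~120 rows/sec)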

from functools import reduce  # reduce is not a builtin in Python 3

replaced = []
for row in df.itertuples():
    replaced.append(reduce(lambda x, y: x.replace(y, mapping[y]), mapping, row[1]))

2) Loop with a simple replace function ("mapping" is the 15k dict) (~160 rows/sec):

def string_replace(text):
    for key in mapping:
        text = text.replace(key, mapping[key])
    return text

from tqdm import tqdm  # progress bar

replaced = []
for row in tqdm(df.itertuples()):
    replaced.append(string_replace(row[1]))

.iterrows() is also about 20% slower than .itertuples().

3) Using apply on the Series (also ~160 rows/sec):

replaced = df['text'].apply(string_replace)

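At this speed it would take hours to process the whole dataset.

Does anyone have experience with this kind of mass substring replacement? Is it possible to speed it up? The solution can be tricky or ugly, but it has to be as fast as possible, and it does not have to use pandas.

Thanks.

Update: toy data to test the idea: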

df = pd.DataFrame({ "old":
                    ["first text to replace",
                   "second text to replace"]
                    })

mapping = {"first text": "FT", 
           "replace": "rep",
           "second": '2nd'}

Expected result:

                      old         replaced
0   first text to replace        FT to rep
1  second text to replace  2nd text to rep

【Question comments】:

Check "Replace values in pandas Series with dictionary".
Thanks Wiktor, I have now tried the regex=True idea, but it is much slower than the simple approaches in the original post.

Tags: string pandas numpy replace substring

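【Solution 1】:

I came back to this problem and found a fantastic library called flashtext.

The speedup on 10M records with a 15k vocabulary is about x100 (really one hundred times faster than regex or the other approaches from my first post)!

It is very easy to use: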

df = pd.DataFrame({ "old":
                    ["first text to replace",
                   "second text to replace"]
                    })

mapping = {"first text": "FT", 
           "replace": "rep",
           "second": '2nd'}

import flashtext
processor = flashtext.KeywordProcessor()

for k, v in mapping.items():
    processor.add_keyword(k, v)

print(list(map(processor.replace_keywords, df["old"])))

Result:

['FT to rep', '2nd text to rep']

If needed, it can also be adapted to different languages through the processor.non_word_boundaries attribute.

The Trie-based search used here gives an amazing speedup.

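【Comments】:

【Solution 2】:

One solution is to convert the dictionary into a trie and write the code so that you pass through the text being modified only once.

Basically, you advance through the text and the trie one character at a time, and as soon as a match is found, you replace it.

Of course, if you also need to apply replacements to text that has already been replaced, this is harder.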
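
The idea above can be sketched in plain Python. This is only an illustration under stated assumptions, not the answerer's code: the helper names build_trie and replace_with_trie are made up for this example, the trie is a nested dict, the longest key starting at each position wins, and replaced text is never rescanned. Unlike flashtext, it matches raw substrings with no notion of word boundaries.

```python
def build_trie(mapping):
    """Build a nested-dict trie; a terminal node stores its replacement under the key None."""
    root = {}
    for old, new in mapping.items():
        if not old:
            continue  # skip empty keys, which would otherwise match everywhere
        node = root
        for ch in old:
            node = node.setdefault(ch, {})
        node[None] = new
    return root


def replace_with_trie(text, root):
    """Scan text left to right, replacing the longest key that starts at each position."""
    out = []
    i, n = 0, len(text)
    while i < n:
        node = root
        j = i
        match_end = -1
        match_value = None
        # Walk the trie as far as the text allows, remembering the longest match so far.
        while j < n and text[j] in node:
            node = node[text[j]]
            j += 1
            if None in node:
                match_end = j
                match_value = node[None]
        if match_end != -1:
            out.append(match_value)  # emit the replacement and jump past the matched key
            i = match_end
        else:
            out.append(text[i])      # no key starts here, copy one character
            i += 1
    return "".join(out)


# Toy data from the question
mapping = {"first text": "FT", "replace": "rep", "second": "2nd"}
texts = ["first text to replace", "second text to replace"]

trie = build_trie(mapping)
replaced = [replace_with_trie(t, trie) for t in texts]
print(replaced)  # ['FT to rep', '2nd text to rep']
```

The trie is built once and reused for every text, so the per-text cost depends on the text length and the key lengths rather than on the number of keys. The inner walk restarts at each position, so keys that share long prefixes can cause some re-reading; an Aho-Corasick automaton would avoid that at the cost of more code.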

【Comments】:

【Solution 3】:

I think you are looking for replace with regex on the df, i.e.

If you have a dictionary, pass it as the argument.

d = {'old substring':'new substring','anohter':'another'}

For the whole DataFrame

df.replace(d,regex=True)

For a Series

df[columns].replace(d,regex=True)

Example

df = pd.DataFrame({ "old":
                ["first text to replace",
               "second text to replace"]
                })

mapping = {"first text": "FT", 
       "replace": "rep",
       "second": '2nd'}

df['replaced'] = df['old'].replace(mapping,regex=True)
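
For comparison, the same dictionary replacement can be written with the standard re module alone, using one compiled alternation of all keys and a callback that looks up the replacement. This is a sketch and not what DataFrame.replace does internally: make_replacer is a hypothetical helper name, keys are escaped so they match literally, and longer keys are listed first so the longest one wins at a given position. Whether it is faster than the loops in the question for a 15k-key dict would have to be measured.

```python
import re


def make_replacer(mapping):
    """Return a function that applies every replacement in mapping in a single pass."""
    # Longer keys first: regex alternation takes the first alternative that matches.
    keys = sorted((k for k in mapping if k), key=len, reverse=True)
    if not keys:
        return lambda text: text  # nothing to replace
    # re.escape makes each key match literally, even if it contains regex metacharacters.
    pattern = re.compile("|".join(re.escape(k) for k in keys))
    return lambda text: pattern.sub(lambda m: mapping[m.group(0)], text)


# Toy data from the question
mapping = {"first text": "FT", "replace": "rep", "second": "2nd"}
texts = ["first text to replace", "second text to replace"]

replace_all = make_replacer(mapping)
result = [replace_all(t) for t in texts]
print(result)  # ['FT to rep', '2nd text to rep']
```

The returned function can be passed to Series.apply or used in a list comprehension over the column.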

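【Comments】:

You can pass the dictionary directly
Thanks. Unfortunately, this is a much slower approach. About 100 rows per second.
@AlexeyTrofimov Try regex=False
@cᴏʟᴅsᴘᴇᴇᴅ He wants to replace substrings of strings.
@Bharathshetty You don't need regex for simple replacement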