【Question Title】: Performance is slow when replacing a string in a pandas dataframe using a dict
【Posted】: 2017-09-04 21:12:45
【Question】:

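The following code works, but it needs to run faster. The dict has ~25K keys and the dataframe has ~3M rows. Is there a way to produce the same result with Python code that runs faster? (Without multiprocessing, it runs 8x slower.)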

import datetime
import multiprocessing

import numpy as np
import pandas as pd

miscdict={" isn't ": ' is not '," aren't ":' are not '," wasn't ":' was not '," snevada ":' Sierra Nevada '}

df=pd.DataFrame({"q1":["beer is ok","beer isn't ok","beer wasn't available"," snevada is good"]})

def parse_text(data):
    for key, replacement in miscdict.items():
        data['q1'] = data['q1'].str.replace( key, replacement )
    return data

if __name__ == '__main__':
    t1_1 = datetime.datetime.now()
    p = multiprocessing.Pool(processes=8)
    split_dfs = np.array_split(df,8)
    pool_results = p.map(parse_text, split_dfs)
    p.close()
    p.join()
    parts = pd.concat(pool_results, axis=0)
    df = pd.concat([parts], axis=1)
    t2_1 = datetime.datetime.now()
    print("done"+ str(t2_1-t1_1)) 

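【Comments】:

  • Multiprocessing didn't help? Are you sure you were doing the multiprocessing correctly? This looks like an embarrassingly parallel problem. Maybe show how you did the multiprocessing?
  • I have updated the code to include the multiprocessing I use. It is working (8x faster, and the system monitor shows all 8 cores fully utilized).

Tags: python pandas dictionary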


【Solution 1】:

In my case, using a precompiled miscdict with Vaishali's example was about 10x faster than the other approaches shown below:

import re

import pandas as pd

data=pd.DataFrame({"q1":["beer is ok","beer isn't ok","beer wasn't available"," snevada is good"]})

miscdict = {" isn't ": ' is not '," aren't ":' are not '," wasn't ":' was not '," snevada ":' Sierra Nevada '}
miscdict_comp = {re.compile(k): v for k, v in miscdict.items()}

data['q1'].replace(miscdict_comp, regex = True, inplace = True)

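【Comments】:

    【Solution 2】:

    I tested a few of these. @A-Za-z's suggestion is a major improvement, but it can be made faster still.

    Edit: I re-ran the tests with the replacement dict and the dataframe (and the precompiled regexes) built beforehand, outside the timed code. The new timings are:

    • Original: 11.71 seconds
    • @A-Za-z: 4.72 seconds, a 60% improvement.
    • @piRSquared: 4.95 seconds, a 58% improvement.
    • Precompiled: 2.81 seconds, a 76% improvement.

    Original results, with data generation and regex compilation included in the timing:

    "Testing your code I got 15 seconds, @A-Za-z's code gave 8-9 seconds, and my own solution brought it down to 6 seconds. It uses precompiled regexes. See the end of this answer."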
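
A minimal sketch of the timing setup described in the edit above, assuming only the standard library: the dict and the compiled patterns are built once, outside the timed function, so only the replacement work is measured. The names `texts`, `replace_all` and `run` are illustrative and not part of the original benchmark, and a plain list stands in for the dataframe column.

```python
import re
import timeit

miscdict = {" isn't ": " is not ", " aren't ": " are not ",
            " wasn't ": " was not ", " snevada ": " Sierra Nevada "}
texts = ["beer is ok", "beer isn't ok", "beer wasn't available", " snevada is good"]

# Built once, before timing starts, so compilation cost is not measured.
compiled = {re.compile(k): v for k, v in miscdict.items()}

def replace_all(text):
    # Apply every precompiled pattern to one string, in dict order.
    for pattern, replacement in compiled.items():
        text = pattern.sub(replacement, text)
    return text

def run():
    # The unit of work being timed: one pass over all strings.
    return [replace_all(t) for t in texts]

# repeat() returns one total per round; the minimum is usually the figure to report.
timings = timeit.repeat(run, number=1000, repeat=3)
best = min(timings)
print(f"best of 3: {best:.4f} s per 1000 calls")
```

Because `run` returns a new list instead of modifying its input, every timed call does the same amount of work.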


    Imports:

    import pandas as pd
    import re
    import timeit
    

    Your original code:

    miscdict = {" isn't ": ' is not '," aren't ":' are not '," wasn't ":' was not '," snevada ":' Sierra Nevada '}
    data=pd.DataFrame({"q1":["beer is ok","beer isn't ok","beer wasn't available"," snevada is good"]})
    def org(printout=False):
        def parse_text(data):
            for key, replacement in miscdict.items():
                data['q1'] = data['q1'].str.replace( key, replacement )
            return data
        data2 = parse_text(data)
        if printout:
            print(data2)
    org(printout=True)
    print(timeit.timeit(org, number=10000))
    

    This took 11.7 seconds:

                           q1
    0              beer is ok
    1          beer is not ok
    2  beer was not available
    3   Sierra Nevada is good
    11.71043858179268
    

    User @A-Za-z's code:

    miscdict = {" isn't ": ' is not '," aren't ":' are not '," wasn't ":' was not '," snevada ":' Sierra Nevada '}
    data=pd.DataFrame({"q1":["beer is ok","beer isn't ok","beer wasn't available"," snevada is good"]})
    def alt1(printout=False):
        data['q1'].replace(miscdict, regex = True, inplace = True)
        if printout:
            print(data)
    alt1(printout=True)
    print(timeit.timeit(alt1, number=10000))
    

    This took 4.7 seconds:

                           q1
    0              beer is ok
    1          beer is not ok
    2  beer was not available
    3   Sierra Nevada is good
    4.721581550644499
    

    User @piRSquared's code:

    miscdict = {" isn't ": ' is not '," aren't ":' are not '," wasn't ":' was not '," snevada ":' Sierra Nevada '}
    data=pd.DataFrame({"q1":["beer is ok","beer isn't ok","beer wasn't available"," snevada is good"]})
    def alt2(printout=False):
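        # regex = True was added later because it doesn't work without it.
        # Assign to a new name: rebinding `data` here would make it a local
        # variable and raise UnboundLocalError.
        data2 = data.replace(miscdict, regex = True)
        if printout:
            print(data2)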
    alt2(printout=True)
    print(timeit.timeit(alt2, number=10000))
    

    This took 5.0 seconds:

                           q1
    0              beer is ok
    1          beer is not ok
    2  beer was not available
    3   Sierra Nevada is good
    4.951810616074919
    

    My own solution, using precompiled regexes:

    miscdict = {" isn't ": ' is not '," aren't ":' are not '," wasn't ":' was not '," snevada ":' Sierra Nevada '}
    miscdict_comp = {re.compile(k): v for k, v in miscdict.items()}
    data=pd.DataFrame({"q1":["beer is ok","beer isn't ok","beer wasn't available"," snevada is good"]})
    def alt3(printout=False):
        def parse_text(text):
            for pattern, replacement in miscdict_comp.items():
                text = pattern.sub(replacement, text)
            return text
        data["q1"] = data["q1"].apply(parse_text)
        if printout:
            print(data)
    alt3(printout=True)
    print(timeit.timeit(alt3, number=10000))
    

    This took 2.8 seconds:

                           q1
    0              beer is ok
    1          beer is not ok
    2  beer was not available
    3   Sierra Nevada is good
    2.810334940701157
    

    The idea is to precompile the patterns you want to replace.

    I got the idea from here: https://jerel.co/blog/2011/12/using-python-for-super-fast-regex-search-and-replace
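
A related variation, not benchmarked in this answer, is to merge all keys into a single alternation pattern so that each string is scanned once instead of once per key. The sketch below assumes the keys are meant as literal text (hence `re.escape`); `combined` and `replace_once` are illustrative names.

```python
import re

miscdict = {" isn't ": " is not ", " aren't ": " are not ",
            " wasn't ": " was not ", " snevada ": " Sierra Nevada "}

# Longest keys first, so a longer key wins over a shorter key that is its prefix.
keys = sorted(miscdict, key=len, reverse=True)
# re.escape makes each key match literally, even if it contains regex metacharacters.
combined = re.compile("|".join(re.escape(k) for k in keys))

def replace_once(text):
    # One scan of the text; each match is looked up in the dict.
    return combined.sub(lambda m: miscdict[m.group(0)], text)

texts = ["beer is ok", "beer isn't ok", "beer wasn't available", " snevada is good"]
results = [replace_once(t) for t in texts]
print(results)
```

On a dataframe this could be applied with `data["q1"].apply(replace_once)`. The result is not always identical to the sequential loop: matches cannot overlap, so in "it isn't aren't ok" the space shared by the two keys is consumed by the first match and " aren't " is left unchanged, whereas the loop would replace both.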

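    【Comments】:

    • Wow! Even faster. You have all been a great help. Thank you very much!
    • Excellent answer. Nicely done on the research. Plus one from me.
    • Andre' -> You helped me go from a situation where the job could not be completed to one where it now finishes in just 2 hours. Precompiling regexes seems to be a necessary step when working with large text corpora.
    • The idea of precompiling the patterns is really helpful. Thanks!
    【Solution 3】:

    Wow! We reinvented the wheel and designed some funky spokes and cleats...

    ... just do this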

    df.replace(miscdict)
    
                           q1
    0              beer is ok
    1          beer is not ok
    2  beer was not available
    3   Sierra Nevada is good
    

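    Unless I'm missing something obvious.

    【Comments】:

    • It may be a version issue, but without regex=True it doesn't give me the desired result.
    • The question is not about elegance, but about speed.
    • @AndréChristofferAndersen Thank you. Did you check the speed? Over 400,000 rows, it gave a 10x improvement over the OP's parser. I can't post a full answer as I usually would, but I think a simplified and improved answer is important to get out there. I'm sorry you feel this answer is not useful. Thanks for your input and the downvote.
    • @piRSquared Your contribution is of course welcome. However, IMO we should strive to keep our communication on SO as professional and non-quarrelsome as possible. I found your answer to be condescending. That said, I do feel my downvote was overzealous. If you make a slight edit to your post, any edit at all, I can revert it.
    • I tested the plain df.replace(miscdict) version. At first I thought it worked, but it seems I was wrong. I think @A-Za-z is right that you actually need to set regex = True for it to work. If it works for you, which version of pandas are you using?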
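
The behaviour discussed in these comments can be checked with a small sketch. It assumes a two-key subset of `miscdict` and a toy Series `s`, both chosen only for illustration: without `regex=True`, `replace` only swaps cells whose entire value equals a key, while with `regex=True` the keys are matched inside each string.

```python
import pandas as pd

miscdict = {" isn't ": " is not ", " wasn't ": " was not "}
s = pd.Series(["beer isn't ok", " isn't "])

# Without regex=True, a key must equal the whole cell value to be replaced.
exact = s.replace(miscdict)
# With regex=True, keys are treated as patterns and matched inside each string.
partial = s.replace(miscdict, regex=True)

print(exact.tolist())    # ["beer isn't ok", ' is not ']
print(partial.tolist())  # ['beer is not ok', ' is not ']
```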
    【Solution 4】:

    You don't need a loop here. df.replace with regex = True does the job, and it cuts the time by more than half.

    df['q1'].replace(miscdict, regex = True, inplace = True)
    1000 loops, best of 3: 1.08 ms per loop
    

    This gives you

            q1
    0   beer is ok
    1   beer is not ok
    2   beer was not available
    3   Sierra Nevada is good
    

    Compare that with the current solution

    for key, replacement in miscdict.items(): df['q1'] = df['q1'].str.replace( key, replacement )
    100 loops, best of 3: 2.35 ms per loop
    

    【Comments】:

    • Much faster! Thank you very much!