【Question title】: Pandas: improve running time looping over string contains substring
【Posted】: 2018-07-28 06:03:22
【Question】:

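I have a Pandas DataFrame with a column of very long strings (say, URL_paths) and a list of unique substrings (the reference list). For each row in my DataFrame, I want to identify the corresponding reference element from the list. So if the URL in a given row is, for example, abcd1234 and one of the reference values is cd123, I want to add cd123 to my DataFrame as the reference, to classify that row/URL.

My code works (see the example below), but it is very slow, I guess because of the for loop, which I can't get rid of. I feel my code could be faster, but I can't think of a way to improve it.

How can I improve the running time?

See the working example below: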

import string
import secrets
import pandas as pd
import time
from random import randint

n_ref = 100
n_target = 1000000

## Build reference Series, and target dataframe
reference = pd.Series(''.join(secrets.choice(string.ascii_uppercase + string.digits) for _ in range(randint(10, 19))) 
                      for _ in range(n_ref))

target = pd.Series(reference.sample(n = n_target, replace = True)).reset_index().iloc[:,1]

dfTarget = pd.DataFrame({
        'target' : target,
        'pre-string' : pd.Series(''.join(secrets.choice(string.ascii_uppercase + string.digits) 
                                    for _ in range(randint(1, 10))) 
                                    for _ in range(n_target)),
        'post-string' : pd.Series(''.join(secrets.choice(string.ascii_uppercase + string.digits) 
                                    for _ in range(randint(1, 10))) 
                                    for _ in range(n_target)),
        'reference' : pd.Series(dtype=object)})  # explicit dtype avoids the empty-Series dtype warning

dfTarget['target_combined'] = dfTarget[['pre-string', 'target', 'post-string']].apply(lambda x: ''.join(x), axis=1)

## Fill in reference column
## Loop over references and return reference in reference column

start_time = time.time()
for x in reference:
    dfTarget.loc[dfTarget['target_combined'].str.contains(x) == True, 'reference'] = x
print("--- %s seconds ---" % (time.time() - start_time))

Output: 42.60... seconds

【Comments】:

    Tags: python pandas dataframe substring


    【Solution 1】:

    On my machine, I see a 17x performance improvement using pd.Series.apply:

    reference_set = set(reference)
    
    def calculator(x):
        return next((i for i in reference_set if i in x), None)
    
    dfTarget['reference'] = dfTarget['target_combined'].apply(calculator)
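
A minimal sketch of the same first-match idea on toy data, for illustration only. The sample strings and the helper name `first_match` are hypothetical and not part of the answer. A list is used here instead of a set so that the order in which references are tried is predictable.

```python
import pandas as pd

# Hypothetical toy data; the question itself uses 100 references and 1,000,000 rows.
references = ["cd123", "XYZ9"]
urls = pd.Series(["abcd1234", "00XYZ9ff", "nomatch"])

def first_match(text, candidates=references):
    # Return the first candidate that occurs in text, or None if none does.
    # The generator stops at the first hit, so later candidates are never tested.
    return next((ref for ref in candidates if ref in text), None)

matched = urls.apply(first_match)
print(matched.tolist())
```

The first two rows resolve to `cd123` and `XYZ9`; the last row has no matching reference and is left as a missing value.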
    

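    But for the best performance, see @unutbu's solution

    【Comments】:

    • This is clever! +1 Try: reference_lst = set(reference.tolist()) - it should be faster...
    • Thanks a lot. On my computer, the speedup was about 13x. I will also check unutbu's solution. @MaxU, it is indeed slightly faster (4%)
    【Solution 2】:

    Here is a somewhat faster approach (4.3x faster than the original loop):

    Regex pattern: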

    In [23]: pat = '.*({}).*'.format(reference.str.cat(sep='|'))
    
    In [24]: pat
    Out[24]: '.*(J6BUVB2BRDLL3IR9S1J|ZOXS91UK513RR18YREI|92KWUFKOK4G9XJAHIBJ|PMEH6N96091AK9XCA5J|3CICA38SDIXLFVED74I|V48OJCY2DS|LX8KGGBORWP6A|7H
    V3NN71MU|JMA2K7QSHK72X|CNAOYI3C8T|NZE9SFKPYX|EU9K88XA29YATWR|SB871PEZ7TOPCG8|ZPP76BSDULM8|3QHLISVYEBWH|ST8VOI959D8YPCZ0|02BW83KYG3TEPWMOP|TG
    I3P5QZC988GNM8FI0|GJG9MC18G5TU1TIDQB6|V7V5ZZJ5W7O|51KMJ07HEBIX|27GPT3B9DLY|O8KSR85BUB6WBKRC|ZKUEEFX5JFRE0IFRN0|FH8CUWHDETQ5TXWHSS1|N77FTB9VG
    LK|JS4RUUQLD7IFP|3R45N7LOY1BZ8RR6O|JY3RXZ0OTC|YJQYOO03G0N7H7E56D|RVJ2VFNK6T7P30|GKPGAK6WAQ2QCAU6H3|7XNJ7A24CHWO1PK|1DVD5G1AE3I40|9F7CCWKHMMF
    MBYD18|FWPEUWOWNK2SXR36SG|VTE64VCRY5|YGM8TT19EZTX|GKJYM3QS9ONTERQY1O0|KWMB1TMQTWMC6QCY|JS9SY7W5HI0KK|WNSHPK9KNEP77B|7EIS883NUXSO5Q6|K3HL2UYW
    458LCBOSL|XI1FRVGHN0IL0F53CK4|F4HL7GKMOL2Q4Y13|IAXPAA4OX2J1X1|SXPLPYVB6EFSN4U5ZW|5L947F08PX8UW|IONNAOC26A|VQVHXHGYP8634|509ALPOKABO|SUJA66H2
    DS7UOXFV|3GYIZATSZAXF8283SZO|A5612XI7X3N4|IH3RB3640D23Q28O|MH0YD83OELSI|RIFFPNRIV0XCY|Y0CXWE6GZPQ3FKH|WSCWR598Z8GBW9G|7C9O59EIA23POSI|UG4D5H
    AAOYU5E|F249VSIILZ6KXDQSX|06XZSJHWSM|X01Y9AZ2W5V8HZ|1JLPWMPRGRFWIK|3ZVBSLEQ8DO|WMLKKETELHC|WDPHDS7A7XN7|6X4O4AE2IB3OS|V5J5HWO9RO19ZW2LGT|MK9
    P8D9N8V4AJZB|0VT48C38I4T1V6S|R987QUQBTPRHCT7QWA4|D4XXBMCYWQ1172OY|ZUY1O565D2W5GSAL8|V8AR792X1K5UL9DLCKV|CXYK6IQWK3MUC3CO|6X7B6240VC9YL|4QV2D
    13ZY15A9D5M1H|WJ7HOMK2FNBZZ6N2Z|QCOWSA3RLR|81I6Z0I5GM|KRD9Y1H3E2WEY9710Q|0161MNQHKEC30E8UI|HGB4XB0QDVHM4H92|RWD6L6EZJUSRK|6U9WOE3YVYKY31K8Q0
    K|KCXWHL43B16MRQ1|EO330WAPN7XMX4|VYUX5W2NN277W09NMDB|J8EXE4YIMN0FB|SHE8D14C5A3X|PMPYKSY2FVXFR4Y8X3W|G3YU894U5QGOOM3Z|58J37WJPJBOC7QNKV|NE9WE
    JSRXTYFXYZ0TBI|7UPR5XSVOJ244HHZ|N0QZCN6NADW|W2CTEUISOHUY).*'
    

    Replacement:

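    # regex=True is required on pandas >= 2.0, where str.replace treats the pattern literally by default
    dfTarget['reference'] = dfTarget['target_combined'].str.replace(pat, r'\1', regex=True)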
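
As a hedged variation on the regex idea, the sketch below uses `Series.str.extract` with an alternation pattern instead of `str.replace`. It is not the answer's method, and the toy values are hypothetical. `re.escape` is added on the assumption that real reference strings may contain regex metacharacters, which the alphanumeric test data in the question never does.

```python
import re
import pandas as pd

# Hypothetical toy data, including one reference with a regex metacharacter.
references = pd.Series(["cd123", "XYZ9", "A+B"])
urls = pd.Series(["abcd1234", "00XYZ9ff", "xxA+Byy", "nomatch"])

# Escape each reference so characters such as '+' are matched literally,
# then join them into a single capturing alternation group.
pattern = "({})".format("|".join(re.escape(r) for r in references))

# expand=False returns a Series holding the captured group for each row.
matched = urls.str.extract(pattern, expand=False)
print(matched.tolist())
```

Unlike the replace-based version, rows with no matching reference come back as missing values rather than as the unchanged input string, which makes unmatched rows easier to spot.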
    

    Timings for a 10,000-row DataFrame:

    In [25]: %%timeit
        ...: dfTarget['reference'] = dfTarget['target_combined'].str.replace(pat, r'\1', regex=True)
        ...:
    617 ms ± 2.14 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
    
    In [26]: %%timeit
        ...: [dfTarget.loc[dfTarget['target_combined'].str.contains(x) == True, 'reference'] for x in reference]
        ...:
    1.96 s ± 2.08 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
    
    In [27]: %%timeit
        ...: for x in reference:
        ...:     dfTarget.loc[dfTarget['target_combined'].str.contains(x) == True, 'reference'] = x
        ...:
    2.64 s ± 14.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
    
    In [28]: 2.64/0.617
    Out[28]: 4.278768233387359
    
    In [29]: 2.64/1.96
    Out[29]: 1.3469387755102042
    

    【Comments】: