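【Question title】: Searching many strings for many dictionary keys, quickly
【Posted】: 2017-02-05 20:16:36
【Question】:

I have an unusual problem, and I'm mainly looking for ways to speed up this code. I have a set of strings stored in a DataFrame; each string contains several names, and I already know how many names each string contains before this step, like so: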

print df

description                      num_people        people    
'Harry ran with sally'                2              []         
'Joe was swinging with sally'         2              []
'Lola Dances alone'                   1              []

I am using a dictionary whose keys are the names I hope to find in the descriptions, like so:

my_dict={'Harry':'1283','Joe':'1828','Sally':'1298', 'Cupid':'1982'}

I then use iterrows to search each string for matches, like so:

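import re
for index, row in df.iterrows():
    # rows from iterrows may be copies, so write the result back through df.at
    df.at[index, 'people'] = [key for key in my_dict if re.findall(key, row.description)]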

After it runs, the result is:

print df

 description                      num_people        people    
'Harry ran with sally'                2              ['Harry','Sally']         
'Joe was swinging with sally'         2              ['Joe','Sally']
'Lola Dances alone'                   1              ['Lola']

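The problem is that this code is still quite slow, and I have a large number of descriptions and more than 1,000 keys. Is there a faster way to do this, for example by making use of the known number of people in each description?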

【Comments】:

    Tags: string python-2.7 pandas dictionary text-extraction


    【Solution 1】:

    A faster solution:

    # strip the surrounding quotes, then split each description into words
    words = df.description.str.strip("'").str.split()
    # keep only the words that are dictionary keys (dict membership is a hash lookup)
    df['people'] = words.apply(lambda x: [i for i in x if i in my_dict])
    print (df)
                         description  num_people          people
    0         'Harry ran with Sally'           2  [Harry, Sally]
    1  'Joe was swinging with Sally'           2    [Joe, Sally]
    2            'Lola Dances alone'           1          [Lola]
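Note that splitting on whitespace only matches whole words with exactly the same capitalisation as the keys, so 'sally' or 'Sally.' would be missed. If that matters, one alternative is a single precompiled regular expression that covers all keys at once. The following is only a minimal sketch under that assumption, not the answer's method; `canonical`, `pattern`, and `find_people` are hypothetical names, and the sample descriptions are adapted from the question.

```python
import re

my_dict = {'Harry': '1283', 'Joe': '1828', 'Sally': '1298', 'Lola': '1982'}

# Map lower-cased names back to the spelling used in the dictionary.
canonical = {key.lower(): key for key in my_dict}

# One alternation of all escaped keys, bounded by \b so that 'Joe' does not
# match inside 'Joey'. Longer keys are listed first so that a longer name is
# tried before a shorter name that is its prefix.
pattern = re.compile(
    r'\b(?:' + '|'.join(re.escape(k) for k in sorted(my_dict, key=len, reverse=True)) + r')\b',
    flags=re.IGNORECASE,
)


def find_people(text):
    """Return the matched keys in order of appearance, without duplicates."""
    found = []
    for match in pattern.findall(text):
        name = canonical[match.lower()]
        if name not in found:
            found.append(name)
    return found


descriptions = ['Harry ran with sally.',
                'Joe was swinging with Sally',
                'Lola Dances alone']
people = [find_people(text) for text in descriptions]
```

With a DataFrame, the same function could be applied as `df['people'] = df['description'].apply(find_people)`. The pattern is compiled once, so each description is scanned a single time instead of once per key.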
    

    Timings

    #[30000 rows x 3 columns]
    In [198]: %timeit (orig(my_dict, df))
    1 loop, best of 3: 3.63 s per loop
    
    In [199]: %timeit (new(my_dict, df1))
    10 loops, best of 3: 78.2 ms per loop
    
    import re
    import pandas as pd

    # setup for the timings above: repeat the 3 sample rows 10000 times
    df['people'] = [[],[],[]]
    df = pd.concat([df]*10000).reset_index(drop=True)
    df1 = df.copy()
    
    my_dict={'Harry':'1283','Joe':'1828','Sally':'1298', 'Lola':'1982'}
    
    def orig(my_dict, df):
        for index, row in df.iterrows():
            df.at[index, 'people']=[key for key in my_dict if re.findall(key,row.description)]
        return (df)
    
    
    def new(my_dict, df):
        # strip the surrounding quotes, split into words, keep words that are keys
        df.description = df.description.str.strip("'")
        words = df.description.str.split()
        df.people = words.apply(lambda x: [i for i in x if i in my_dict])
        return (df)
    
    
    print (orig(my_dict, df))
    print (new(my_dict, df1))
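The question also asks whether the known number of people per description can help. One possible way to use it is to stop scanning a description as soon as that many names have been found. This is a minimal sketch, assuming `num_people` counts distinct names and that names appear as whole, exactly-cased words; `find_people_limited` is a hypothetical helper, not part of the answer above.

```python
my_dict = {'Harry': '1283', 'Joe': '1828', 'Sally': '1298', 'Lola': '1982'}


def find_people_limited(description, keys, limit):
    """Return keys found among the words of description, stopping at limit."""
    found = []
    for word in description.strip("'").split():
        if word in keys and word not in found:
            found.append(word)
            if len(found) == limit:
                break  # every expected name is found, skip the remaining words
    return found


keys = set(my_dict)  # set membership is a hash lookup
rows = [("'Harry ran with Sally'", 2),
        ("'Joe was swinging with Sally'", 2),
        ("'Lola Dances alone'", 1)]
people = [find_people_limited(text, keys, n) for text, n in rows]
```

With a DataFrame, the pairs could come from `zip(df.description, df.num_people)`. The early exit only pays off when descriptions are long and the names tend to appear near the start; for short strings like the samples the saving is likely negligible.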
    

    【Comments】:
