Vectorisation of string matching: answers

【Question title】: Vectorisation of string matching
【Posted】: 2018-07-06 15:30:29
【Question】:

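Question: Is it possible to vectorise string matching between two DataFrames/Series?

Concept: I have two DataFrames (df_address, df_world_city):

df_address: contains a column of address data (e.g. "Sherlock Str.; Paris;")
df_world_city: contains columns with city names and the corresponding country ("FRA", "Paris")

I iterate over every address and try to match every city against it, to find out which cities the address mentions and to attach the corresponding country. The matched cities are stored in a list, which is the value of a dictionary keyed by country ({'FRA': ['Paris']}).

Currently I mostly use for loops to iterate over the addresses and cities to match them. With multiprocessing (48 processes) and a large amount of data (df_address: 160,000 rows; df_world_city: 2,200,000 rows) this takes about 4-5 days.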

import re

import pandas as pd


def regex_city_matching(target, location):
    """Return True if the city name `target` occurs in `location` as a whole word."""
    if not isinstance(target, str) or not isinstance(location, str) or len(target) <= 3:
        # Skip NaN and too-short city names
        return False
    # Match the city only as a full word, not as a substring of another word
    pattern = re.compile(r'(^|\W)' + re.escape(target) + r'($|\W)', re.IGNORECASE)
    return pattern.search(location) is not None


def city_matching_no_country_multi_dict_simple(df_world_city, df_address):
    """Collect, per address, a {country_iso: [city, ...]} dict of the cities it mentions."""
    col_names = ['node_id', 'name', 'city_iso']
    df_matched_city_no_country = pd.DataFrame(columns=col_names)

    for index_city in df_world_city.index:
        # Iterate over each city
        w_city = df_world_city.at[index_city, 'city']
        if not isinstance(w_city, str) or len(w_city) <= 3:
            # Skip NaN and too-short city names
            continue

        w_country = df_world_city.at[index_city, 'iso']

        for ind_address in df_address.index:
            if regex_city_matching(w_city, df_address.at[ind_address, 'name']):
                node_id = df_address.at[ind_address, 'node_id']
                address = df_address.at[ind_address, 'name']
                if (df_matched_city_no_country['node_id'] == node_id).any():
                    # Address already recorded: append the new city / country
                    ind_append_address = df_matched_city_no_country.loc[
                        df_matched_city_no_country.node_id == node_id].index[0]
                    city_iso = df_matched_city_no_country.at[ind_append_address, 'city_iso']
                    if w_country in city_iso:
                        # Country in dictionary
                        city_iso[w_country].append(w_city)
                    else:
                        # Country not in dictionary
                        city_iso[w_country] = [w_city]
                else:
                    # Add new address with city / country
                    # (pd.concat, because DataFrame.append was removed in pandas 2.0)
                    new_row = pd.DataFrame(
                        [{'node_id': node_id, 'name': address, 'city_iso': {w_country: [w_city]}}])
                    df_matched_city_no_country = pd.concat(
                        [df_matched_city_no_country, new_row], ignore_index=True)

    return df_matched_city_no_country

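EDIT: Thanks @lenik! Matching against a set of cities is far more efficient and finishes very quickly.

However, it is not fully implemented yet, because testing showed a high false-positive rate.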

【Question comments】:

Tags: python pandas

【Solution 1】:

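You should build an inverse dictionary of the form { 'city' : 'COUNTRY', } so that you don't have to loop at all and can look a city up directly in constant (O(1)) time.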

Beyond that, I would create a set() of known cities, so there is nothing to loop over: a single quick lookup tells me whether a city is known or not.
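A minimal sketch of such an inverse lookup, under a few assumptions: the helper name `build_city_index` and the sample rows are purely illustrative, and because the same city name can exist in several countries, each lower-cased name maps to a set of ISO codes rather than to a single country.

```python
from collections import defaultdict


def build_city_index(city_rows):
    """Map lower-cased city name -> set of ISO country codes (hypothetical helper)."""
    index = defaultdict(set)
    for city, iso in city_rows:
        # Mirror the question's filter: skip NaN and names of 3 characters or fewer
        if isinstance(city, str) and len(city) > 3:
            index[city.lower()].add(iso)
    return dict(index)


# Illustrative data only, not taken from the question
city_rows = [
    ("Paris", "FRA"),
    ("Paris", "USA"),
    ("Berlin", "DEU"),
    ("Ulm", "DEU"),          # dropped: too short
    (float("nan"), "XXX"),   # dropped: not a string
]
city_index = build_city_index(city_rows)

print(city_index.get("paris"))   # one dictionary lookup, no loop over all cities
print("london" in city_index)    # membership test for an unknown city
```

The keys of `city_index` double as the set of known cities, so a separate `set()` is only needed if the country codes are not.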

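Finally, I would simplify the address parsing and avoid the very expensive regular expressions: convert everything to upper or lower case, replace non-alphabetic characters with spaces, and simply .split() to get a list of words, instead of what you are doing now.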
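The parsing step could look roughly like the sketch below. The names `tokenize` and `find_cities` and the `max_words` parameter are assumptions made for illustration. Checking runs of up to `max_words` consecutive words is an addition to the plain `.split()` idea, so that multi-word names such as "New York" can still be found.

```python
def tokenize(address):
    """Lower-case, turn every non-letter into a space, then split into words."""
    cleaned = "".join(ch if ch.isalpha() else " " for ch in address.lower())
    return cleaned.split()


def find_cities(address, known_cities, max_words=3):
    """Return known city names found in the address as whole words.

    known_cities must hold names normalised the same way as the address
    (lower case, words separated by single spaces).
    """
    words = tokenize(address)
    found = []
    for n in range(1, max_words + 1):
        # Slide a window of n consecutive words over the address
        for i in range(len(words) - n + 1):
            candidate = " ".join(words[i:i + n])
            if candidate in known_cities:  # set lookup, no regex
                found.append(candidate)
    return found


# Illustrative data only
known_cities = {"paris", "new york", "berlin"}

print(find_cities("Sherlock Str.; Paris;", known_cities))
print(find_cities("5th Ave, New-York", known_cities))
print(find_cities("Parisian cafe", known_cities))
```

Because only whole words are compared, "Parisian" does not match "paris", which is what the word-boundary regex in the question was meant to guarantee.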

Once all of these changes are in place, processing 160K addresses against 2 million known cities would probably take 10-15 seconds.
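Put together, the whole pass might be sketched as follows. This is an illustration under assumptions, not a tested replacement for the original function: `match_addresses` is a hypothetical name, only single-word city names are matched, and the inputs are plain (value, value) pairs such as `zip(df_address['node_id'], df_address['name'])` would yield.

```python
from collections import defaultdict


def tokenize(text):
    """Lower-case, turn every non-letter into a space, then split into words."""
    return "".join(ch if ch.isalpha() else " " for ch in text.lower()).split()


def match_addresses(addresses, cities):
    """Return {node_id: {iso: [city, ...]}} for addresses mentioning a known city.

    addresses: iterable of (node_id, name) pairs
    cities:    iterable of (city, iso) pairs
    """
    # Build the lookup once: lower-cased city -> list of (iso, original spelling)
    index = defaultdict(list)
    for city, iso in cities:
        if isinstance(city, str) and len(city) > 3:
            index[city.lower()].append((iso, city))

    matches = {}
    for node_id, name in addresses:
        if not isinstance(name, str):
            continue  # skip NaN addresses
        city_iso = defaultdict(list)
        for word in set(tokenize(name)):
            for iso, city in index.get(word, ()):
                city_iso[iso].append(city)
        if city_iso:
            matches[node_id] = dict(city_iso)
    return matches


# Illustrative data only
cities = [("Paris", "FRA"), ("Paris", "USA"), ("Berlin", "DEU")]
addresses = [
    (1, "Sherlock Str.; Paris;"),
    (2, "Unter den Linden 1, 10117 Berlin"),
    (3, "No city here"),
]
result = match_addresses(addresses, cities)
print(result)
# {1: {'FRA': ['Paris'], 'USA': ['Paris']}, 2: {'DEU': ['Berlin']}}
```

Each address costs one dictionary lookup per word instead of one regex search per city, which is where the speed-up comes from. The first result also shows why false positives appear: a name shared by several countries matches all of them.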

Please let me know if you need a code sample.

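【Comments】:

Thanks! -- 1. Why invert the dictionary? I don't really iterate over it?! -- 2. A set of cities sounds good, but some cities appear several times in different countries; how do I know which country a city belongs to? -- 3. Splitting sounds good, but could you give an example of how I would match my cities against the list of words? Thanks again :)
@user3388671 How do you currently handle identical city names in different countries?
I treat every city as unique, because the 'iso' in another column is the country code. So if a city matches, I use the city's index to assign the country code to w_country. Could you provide a code example of how to efficiently iterate over the split address and check it against set_world_city?
@user3388671 If you have the country code, why do you need to search for the city at all? Just get the country from the country code, whatever the city is.
The idea is to extract the country and the city from the data. Addresses with an explicit country name are very easy and fast to handle. The next step is to get the city for such an address, since I then only match cities from the matched country. The last step, and the idea behind the current function, is to get the city from the data and identify the country of the address from the matched city, even if the country is not written out. But the false-positive rate is extremely high, and the results are usable.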