Answers: an efficient way to find approximate string matches and replace them with a predefined string

【Question title】: Efficient way to find an approximate string match and replacing with predefined string
【Posted】: 2021-12-31 05:21:38
【Question description】:

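I need to build an NER (Named Entity Recognition) system. For simplicity, I am doing it with approximate string matching, because the input can contain typos and other minor modifications. I have come across some great libraries such as fuzzywuzzy and the even faster RapidFuzz. Unfortunately, I have not found a way to get the position where a match occurs. For my purpose I need not only to find the matches but also to know where they occur, because for NER I need to replace those matches with some predefined string.

For example, if any of the following lines is found in the input string, I want to replace it with the string COMPANY_NAME: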

google
microsoft
facebook
International Business Machine

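For example, the input S/he works at Google would be converted to S/he works at COMPANY_NAME. You can safely assume that all inputs and the patterns to be matched have already been preprocessed and, most importantly, are lowercase, so case sensitivity is not an issue.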

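Currently, I use a sliding window technique. A sliding window with the size of the pattern I want to match is passed over the input string from left to right. For example, when I want to match International Business Machine, I run a sliding window of size 3 from left to right and try to find the best match by looking at each run of 3 consecutive tokens at a time, with a stride of 1. I believe this is not the best approach, and it may not find the best match.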
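The sliding window described above can be sketched roughly as follows. This is only an illustration under assumptions, not the asker's actual code: the helper name best_window is hypothetical, and the similarity score comes from the standard library's difflib.SequenceMatcher rather than from fuzzywuzzy or RapidFuzz.

```python
from difflib import SequenceMatcher


def best_window(tokens, pattern):
    """Return (score, start, end) for the token window most similar to pattern.

    The window has as many tokens as the pattern has words and moves
    over the token list with a stride of 1.
    """
    size = len(pattern.split())
    best = (0.0, 0, 0)
    for start in range(len(tokens) - size + 1):
        candidate = " ".join(tokens[start:start + size])
        score = SequenceMatcher(None, candidate, pattern).ratio()
        if score > best[0]:
            best = (score, start, start + size)
    return best


# Input with a typo ("internationl"); everything is already lowercase.
tokens = "s/he works at internationl business machine".split()
score, start, end = best_window(tokens, "international business machine")
print(score, tokens[start:end])
```

The returned indices are token positions, so the matched tokens can be replaced by slicing the token list.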

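So, what is an efficient way to find the best possible matches, together with a quantification of each match (how similar it is) and its position, so that the matches can be replaced with a given fixed string (if the computed similarity is not below a threshold)? Obviously, a single input may contain several parts to be replaced, and each is replaced separately; for example, Google and Microsoft are big companies becomes COMPANY_NAME and COMPANY_NAME are big companies.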

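【Question comments】:

I don't think it was created to report positions. You can only split the text into smaller parts, check each part against the words separately, and take the best-matching element. It seems to have the function process.extractOne(list, word) to check a word against all elements of a list. For single words it can be simpler, because you can split the full text into a list of words. But for International Business Machine you would have to split the full text into a list of words and then build a list of strings with 3 words each; later you can use the list position to calculate the position in the full text.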
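One way to get from list positions back to positions in the full text, as the comment suggests, is to remember each word's character offsets while splitting. The sketch below is an assumption-laden illustration: find_spans and the 0.8 threshold are made up for this example, and difflib.SequenceMatcher stands in for a fuzzywuzzy scorer.

```python
import re
from difflib import SequenceMatcher


def find_spans(text, name, threshold=0.8):
    """Return (start_char, end_char, score) for word windows resembling name."""
    # re.finditer keeps the character offsets of every word in the text.
    tokens = list(re.finditer(r"\S+", text))
    size = len(name.split())
    spans = []
    for i in range(len(tokens) - size + 1):
        start = tokens[i].start()
        end = tokens[i + size - 1].end()
        score = SequenceMatcher(None, text[start:end].lower(), name.lower()).ratio()
        if score >= threshold:
            spans.append((start, end, score))
    return spans


text = "google and microsoft are big companies like international business machine"
spans = find_spans(text, "international business machine")
print(spans)
```

Neighbouring windows overlap, so more than one of them can pass the threshold; picking the highest score among overlapping spans is left out here to keep the sketch short.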

Tags: python nlp named-entity-recognition fuzzy-search fuzzywuzzy

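【Solution 1】:

It seems the modules fuzzywuzzy and RapidFuzz do not have this feature. You could try process.extract() or process.extractOne(), but that requires splitting the text into smaller parts (i.e. words) and checking each part separately. For longer names such as International Business Machine, the text has to be split into parts of 3 words, so it needs even more work.

I think you rather need the module fuzzysearch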

import fuzzysearch

words = ['google', 'microsoft', 'facebook', 'International Business Machine']

text = 'Google and Microsoft are big companies like facebook International Business Machine'

print(' text:', text)
print('---')
    
for word in sorted(words, key=len, reverse=True):
    print(' word:', word)
    
    results = fuzzysearch.find_near_matches(word, text, max_l_dist=1)
    print('found:', results)
    
    for item in reversed(results):
        text = text[:item.start] + 'COMPANY' + text[item.end:]
    print(' text:', text)
    
    print('---')

Result:

 text: Google and Microsoft are big companies like facebook International Business Machine
---
 word: International Business Machine
found: [Match(start=53, end=83, dist=0, matched='International Business Machine')]
 text: Google and Microsoft are big companies like facebook COMPANY
---
 word: microsoft
found: [Match(start=11, end=20, dist=1, matched='Microsoft')]
 text: Google and COMPANY are big companies like facebook COMPANY
---
 word: facebook
found: [Match(start=42, end=50, dist=0, matched='facebook')]
 text: Google and COMPANY are big companies like COMPANY COMPANY
---
 word: google
found: [Match(start=0, end=6, dist=1, matched='Google')]
 text: COMPANY and COMPANY are big companies like COMPANY COMPANY

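If it finds many results for one word, it is better to start replacing from the last position so that the other matches keep their positions. That is why I use reversed().

I would also start with the longest word/name, so that later it can still search for shorter words such as Business. That is why I use sorted(..., key=len, reverse=True).

But I am not sure it does exactly what you need. There may be problems when the words are more badly misspelled.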
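The reason for replacing from the last position can be shown with a small stand-alone sketch. It assumes the spans are already known and do not overlap; replace_spans is a hypothetical helper, not part of fuzzysearch.

```python
def replace_spans(text, spans, replacement="COMPANY_NAME"):
    """Replace every (start, end) slice of text, rightmost span first.

    Working from right to left means a replacement never shifts the
    offsets of the spans that are still waiting to be replaced.
    """
    for start, end in sorted(spans, reverse=True):
        text = text[:start] + replacement + text[end:]
    return text


text = "google and microsoft are big companies"
spans = [(0, 6), (11, 20)]  # character offsets of "google" and "microsoft"
result = replace_spans(text, spans)
print(result)
```

Going left to right instead would require adjusting every later offset by the difference in length between the matched text and the replacement.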

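EDIT:

I tried to use fuzzywuzzy for this and created this version, but it works only for names with a single word. For International Business Machine it would need some other idea.

It splits the full text into words and compares the words. Then it replaces the words that have a ratio > 80.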

words = ['google', 'microsoft', 'facebook', 'International Business Machine']

text = 'Google and Microsoft are big companies like International Business Machine'

# ---

import fuzzywuzzy.fuzz as fuzz
#import fuzzywuzzy.process

new_words = []

for part in text.split():

    matches = []

    for word in words:
        result = fuzz.token_sort_ratio(part, word)
        matches.append([result, part, word])
        print([result, part, word])  # produces the lines shown in the result

    # highest score first
    matches = sorted(matches, reverse=True)

    if matches and matches[0][0] > 80:
        new_words.append('COMPANY')
    else:
        new_words.append(part)  # keep the original word
        
print(" ".join(new_words))

Result:

[100, 'Google', 'google']
[27, 'Google', 'microsoft']
[29, 'Google', 'facebook']
[17, 'Google', 'International Business Machine']
[0, 'and', 'google']
[0, 'and', 'microsoft']
[18, 'and', 'facebook']
[12, 'and', 'International Business Machine']
[27, 'Microsoft', 'google']
[100, 'Microsoft', 'microsoft']
[35, 'Microsoft', 'facebook']
[15, 'Microsoft', 'International Business Machine']
[22, 'are', 'google']
[17, 'are', 'microsoft']
[36, 'are', 'facebook']
[12, 'are', 'International Business Machine']
[22, 'big', 'google']
[17, 'big', 'microsoft']
[18, 'big', 'facebook']
[12, 'big', 'International Business Machine']
[27, 'companies', 'google']
[33, 'companies', 'microsoft']
[24, 'companies', 'facebook']
[26, 'companies', 'International Business Machine']
[40, 'like', 'google']
[15, 'like', 'microsoft']
[17, 'like', 'facebook']
[18, 'like', 'International Business Machine']
[21, 'International', 'google']
[27, 'International', 'microsoft']
[19, 'International', 'facebook']
[60, 'International', 'International Business Machine']
[14, 'Business', 'google']
[24, 'Business', 'microsoft']
[12, 'Business', 'facebook']
[42, 'Business', 'International Business Machine']
[15, 'Machine', 'google']
[25, 'Machine', 'microsoft']
[40, 'Machine', 'facebook']
[38, 'Machine', 'International Business Machine']
COMPANY and COMPANY are big companies like International Business Machine

EDIT:

Second version, which also checks names with many words

all_names = ['google', 'microsoft', 'facebook', 'International Business Machine']

text = 'Google and Microsoft are big companies like International Business Machine'

# ---

import fuzzywuzzy.fuzz as fuzz


for name in all_names:

    length = len(name.split(' ')) # how many words has name 
    print('name length:', length, '|', name)

    words = text.split()  # split text into words

    # compare name with all words in text
    
    matches = []
    
    for index in range(0, len(words)-length+1):
        # join words if name has more than 1 word
        part = " ".join(words[index:index+length])
        #print('part:', part)
        
        result = fuzz.token_sort_ratio(part, name)
        matches.append([result, name, part, [index, index+length]])

        print([result, name, part, [index, index+length]])
        
    # reverse to start at last position
    matches = list(reversed(matches))

    max_match = max((x[0] for x in matches), default=0)  # default avoids an error when the text has fewer words than the name
    print('max match:', max_match)

    # replace
    if max_match > 80:
        for match in matches:
            if  match[0] == max_match:
                idx = match[3]  
                words = words[:idx[0]] + ['COMPANY'] + words[idx[1]:]

    text = " ".join(words)
    print('text:', text)
    print('---')

Result:

name length: 1 | google
[100, 'google', 'Google', [0, 1]]
[0, 'google', 'and', [1, 2]]
[27, 'google', 'Microsoft', [2, 3]]
[22, 'google', 'are', [3, 4]]
[22, 'google', 'big', [4, 5]]
[27, 'google', 'companies', [5, 6]]
[40, 'google', 'like', [6, 7]]
[21, 'google', 'International', [7, 8]]
[14, 'google', 'Business', [8, 9]]
[15, 'google', 'Machine', [9, 10]]
max match: 100
text: COMPANY and Microsoft are big companies like International Business Machine
---
name length: 1 | microsoft
[25, 'microsoft', 'COMPANY', [0, 1]]
[0, 'microsoft', 'and', [1, 2]]
[100, 'microsoft', 'Microsoft', [2, 3]]
[17, 'microsoft', 'are', [3, 4]]
[17, 'microsoft', 'big', [4, 5]]
[33, 'microsoft', 'companies', [5, 6]]
[15, 'microsoft', 'like', [6, 7]]
[27, 'microsoft', 'International', [7, 8]]
[24, 'microsoft', 'Business', [8, 9]]
[25, 'microsoft', 'Machine', [9, 10]]
max match: 100
text: COMPANY and COMPANY are big companies like International Business Machine
---
name length: 1 | facebook
[27, 'facebook', 'COMPANY', [0, 1]]
[18, 'facebook', 'and', [1, 2]]
[27, 'facebook', 'COMPANY', [2, 3]]
[36, 'facebook', 'are', [3, 4]]
[18, 'facebook', 'big', [4, 5]]
[24, 'facebook', 'companies', [5, 6]]
[17, 'facebook', 'like', [6, 7]]
[19, 'facebook', 'International', [7, 8]]
[12, 'facebook', 'Business', [8, 9]]
[40, 'facebook', 'Machine', [9, 10]]
max match: 40
text: COMPANY and COMPANY are big companies like International Business Machine
---
name length: 3 | International Business Machine
[33, 'International Business Machine', 'COMPANY and COMPANY', [0, 3]]
[31, 'International Business Machine', 'and COMPANY are', [1, 4]]
[31, 'International Business Machine', 'COMPANY are big', [2, 5]]
[34, 'International Business Machine', 'are big companies', [3, 6]]
[38, 'International Business Machine', 'big companies like', [4, 7]]
[69, 'International Business Machine', 'companies like International', [5, 8]]
[88, 'International Business Machine', 'like International Business', [6, 9]]
[100, 'International Business Machine', 'International Business Machine', [7, 10]]
max match: 100
text: COMPANY and COMPANY are big companies like COMPANY

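EDIT:

Version with fuzzywuzzy.process

This time I do not have positions; I just use the standard text.replace(item[0], 'COMPANY').

I think in most situations it will work correctly and does not need a better method.

This time I check it on text with mistakes: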

'Gogle and Mikro-Soft are big companies like Fasebok and Internat. Businnes Machin'


all_names = ['google', 'microsoft', 'facebook', 'International Business Machine']

text = 'Google and Microsoft are big companies like Facebook and International Business Machine'

# text with mistakes
text = 'Gogle and Mikro-Soft are big companies like Fasebok and Internat. Businnes Machin'

# ---

import fuzzywuzzy.process
#import fuzzywuzzy.fuzz

for name in sorted(all_names, key=len, reverse=True):
    length = len(name.split())  # number of words in the name

    words = text.split()
    # build every run of `length` consecutive words
    words = [" ".join(words[i:i+length]) for i in range(0, len(words)-length+1)]
    #print(words)

    #result = fuzzywuzzy.process.extractBests(name, words, scorer=fuzzywuzzy.fuzz.token_sort_ratio, score_cutoff=80)
    result = fuzzywuzzy.process.extractBests(name, words, score_cutoff=80)
    print(name, result)

    for item in result:
        text = text.replace(item[0], 'COMPANY')

print(text)

【Comments】:

I added a version with fuzzywuzzy.process.extractBests. It is short and also works for International Business Machine, but I use a plain text.replace(), which in most cases should do the job.
Thanks @furas. In the last edit we might also consider from difflib import get_close_matches, since difflib is a built-in Python module. Change to: result = get_close_matches(name, words) and for item in result: text = text.replace(item, 'COMPANY').
get_close_matches(name, words, cutoff = CUTOFF_VALUE). Reference: docs.python.org/3/library/…
@hafiz031 You can post this code (the full version) together with a description as an answer and mark it as accepted.
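Putting the difflib suggestion from these comments together with the last edit of the answer might look like the sketch below. The cutoff of 0.7 is an assumed value chosen for this example (the comment only says CUTOFF_VALUE), and the input is lowercased first, as the question says the data already is.

```python
from difflib import get_close_matches

all_names = ['google', 'microsoft', 'facebook', 'international business machine']

# text with mistakes, already lowercased
text = 'gogle and mikro-soft are big companies like fasebok and internat. businnes machin'

CUTOFF_VALUE = 0.7  # assumption: tune this for your own data

# longest names first, so multi-word names are replaced before single words
for name in sorted(all_names, key=len, reverse=True):
    length = len(name.split())
    words = text.split()
    # every run of `length` consecutive words is a candidate
    windows = [" ".join(words[i:i + length]) for i in range(len(words) - length + 1)]

    for item in get_close_matches(name, windows, cutoff=CUTOFF_VALUE):
        text = text.replace(item, 'COMPANY')

print(text)
```

Like the fuzzywuzzy.process version, this relies on text.replace(), so it gives no positions and replaces every occurrence of a matched string.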