【Question title】: Efficient way to find an approximate string match and replacing with predefined string
【Posted】: 2021-12-31 05:21:38
【Question】:

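I need to build a NER (Named Entity Recognition) system. For simplicity, I am doing it with approximate string matching, since the input can contain typos and other minor modifications. I have come across some great libraries such as fuzzywuzzy and the even faster RapidFuzz. Unfortunately, I did not find a way to get the position where a match occurs. For my purpose I need not only to find a match but also to know where it occurs, because for NER I need to replace those matches with a predefined string.

For example, if any of the following lines is found in the input string, I want to replace it with the string COMPANY_NAME: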

google
microsoft
facebook
International Business Machine

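For example, the input S/he works at Google would be transformed into S/he works at COMPANY_NAME. You can safely assume that all inputs and the patterns to be matched are already preprocessed and, most importantly, lowercased, so case sensitivity is not an issue.

Currently, I use a sliding-window technique. A window is passed over the input string from left to right, with a size equal to the number of tokens in the pattern to match. For example, to match International Business Machine, I run a window of size 3 from left to right with a stride of 1 and look for the best match among each run of 3 consecutive tokens. I believe this is not the best approach, and it does not find the best match.

So, what is an efficient way to find the best possible matches, together with a quantification of each match (how similar it is) and its location, so that it can be replaced with a given fixed string (if the computed similarity is not below a threshold)? Obviously, a single input may contain several portions to be replaced, and each should be replaced individually. For example, Google and Microsoft are big companies becomes COMPANY_NAME and COMPANY_NAME are big companies.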

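【Comments】:

  • I don't think it was created to report positions. You can only split the text into smaller parts, check each part separately, and take the best-matching one. It seems to have the function process.extractOne(word, list), which checks a word against all elements of a list. For single words this is simpler, because you can split the full text into a list of words. For International Business Machine you have to split the full text into a list of words and then build a list of strings made of 3 consecutive words. Later you can use the list positions to calculate the position in the full text.

Tags: python nlp named-entity-recognition fuzzy-search fuzzywuzzy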


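【Solution 1】:

It seems the modules fuzzywuzzy and RapidFuzz don't have a function for this. You could try process.extract() or process.extractOne(), but that requires splitting the text into smaller parts (i.e. words) and checking each part separately. A longer name such as International Business Machine has to be compared against parts made of 3 words, which takes more work.


I think you rather need the module fuzzysearch: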

import fuzzysearch

words = ['google', 'microsoft', 'facebook', 'International Business Machine']

text = 'Google and Microsoft are big companies like facebook International Business Machine'

print(' text:', text)
print('---')
    
for word in sorted(words, key=len, reverse=True):
    print(' word:', word)
    
    results = fuzzysearch.find_near_matches(word, text, max_l_dist=1)
    print('found:', results)
    
    for item in reversed(results):
        text = text[:item.start] + 'COMPANY' + text[item.end:]
    print(' text:', text)
    
    print('---')

Result:

 text: Google and Microsoft are big companies like facebook International Business Machine
---
 word: International Business Machine
found: [Match(start=53, end=83, dist=0, matched='International Business Machine')]
 text: Google and Microsoft are big companies like facebook COMPANY
---
 word: microsoft
found: [Match(start=11, end=20, dist=1, matched='Microsoft')]
 text: Google and COMPANY are big companies like facebook COMPANY
---
 word: facebook
found: [Match(start=42, end=50, dist=0, matched='facebook')]
 text: Google and COMPANY are big companies like COMPANY COMPANY
---
 word: google
found: [Match(start=0, end=6, dist=1, matched='Google')]
 text: COMPANY and COMPANY are big companies like COMPANY COMPANY

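If it finds several results for one word, it is better to start replacing at the last position so that the other matches keep their positions. That is why I use reversed().

I would also start with the longest word/name, so that it can still search for shorter words such as Business later. That is why I use sorted(..., key=len, reverse=True).

But I am not sure it does exactly what you need. It may have problems when the words are more badly misspelled.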
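
The right-to-left replacement described above can be shown in isolation. This is a minimal sketch using only plain Python: the helper name replace_spans and the hard-coded offsets are made up for illustration, and in practice the spans would come from the start and end attributes of whatever matcher you use.

```python
def replace_spans(text, spans, replacement):
    """Replace each (start, end) character span in text with replacement."""
    # Process spans from the highest start offset to the lowest. Each
    # replacement then only changes text to the right of the spans that
    # are still waiting, so their offsets stay valid.
    for start, end in sorted(spans, reverse=True):
        text = text[:start] + replacement + text[end:]
    return text


text = 'google and microsoft are big companies'
spans = [(0, 6), (11, 20)]  # hypothetical offsets of 'google' and 'microsoft'

print(replace_spans(text, spans, 'COMPANY'))
```

The sketch assumes the spans do not overlap. Overlapping matches would need to be resolved first, for example by keeping only the best-scoring one.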


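EDIT:

I tried to use fuzzywuzzy for this and created this version, but it only works for names that are a single word. International Business Machine needs some other idea.

It splits the full text into words and compares them with each name. It then replaces the words that have a ratio > 80.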

words = ['google', 'microsoft', 'facebook', 'International Business Machine']

text = 'Google and Microsoft are big companies like International Business Machine'

# ---

import fuzzywuzzy.fuzz as fuzz
#import fuzzywuzzy.process

new_words = []

for part in text.split():

    matches = []

    for word in words:
        result = fuzz.token_sort_ratio(part, word)
        matches.append([result, part, word])
        print([result, part, word])

    matches = sorted(matches, reverse=True)

    if matches and matches[0][0] > 80:
        new_words.append('COMPANY')
    else:
        new_words.append(matches[0][1])
        
print(" ".join(new_words))

Result:

[100, 'Google', 'google']
[27, 'Google', 'microsoft']
[29, 'Google', 'facebook']
[17, 'Google', 'International Business Machine']
[0, 'and', 'google']
[0, 'and', 'microsoft']
[18, 'and', 'facebook']
[12, 'and', 'International Business Machine']
[27, 'Microsoft', 'google']
[100, 'Microsoft', 'microsoft']
[35, 'Microsoft', 'facebook']
[15, 'Microsoft', 'International Business Machine']
[22, 'are', 'google']
[17, 'are', 'microsoft']
[36, 'are', 'facebook']
[12, 'are', 'International Business Machine']
[22, 'big', 'google']
[17, 'big', 'microsoft']
[18, 'big', 'facebook']
[12, 'big', 'International Business Machine']
[27, 'companies', 'google']
[33, 'companies', 'microsoft']
[24, 'companies', 'facebook']
[26, 'companies', 'International Business Machine']
[40, 'like', 'google']
[15, 'like', 'microsoft']
[17, 'like', 'facebook']
[18, 'like', 'International Business Machine']
[21, 'International', 'google']
[27, 'International', 'microsoft']
[19, 'International', 'facebook']
[60, 'International', 'International Business Machine']
[14, 'Business', 'google']
[24, 'Business', 'microsoft']
[12, 'Business', 'facebook']
[42, 'Business', 'International Business Machine']
[15, 'Machine', 'google']
[25, 'Machine', 'microsoft']
[40, 'Machine', 'facebook']
[38, 'Machine', 'International Business Machine']
COMPANY and COMPANY are big companies like International Business Machine

EDIT:

Second version, which also checks names made of several words:

all_names = ['google', 'microsoft', 'facebook', 'International Business Machine']

text = 'Google and Microsoft are big companies like International Business Machine'

# ---

import fuzzywuzzy.fuzz as fuzz


for name in all_names:

    length = len(name.split(' ')) # number of words in the name
    print('name length:', length, '|', name)

    words = text.split()  # split text into words

    # compare name with all words in text
    
    matches = []
    
    for index in range(0, len(words)-length+1):
        # join words if name has more than 1 word
        part = " ".join(words[index:index+length])
        #print('part:', part)
        
        result = fuzz.token_sort_ratio(part, name)
        matches.append([result, name, part, [index, index+length]])

        print([result, name, part, [index, index+length]])
        
    # reverse to start at last position
    matches = list(reversed(matches))

    max_match = max(x[0] for x in matches)
    print('max match:', max_match)

    # replace
    if max_match > 80:
        for match in matches:
            if  match[0] == max_match:
                idx = match[3]  
                words = words[:idx[0]] + ['COMPANY'] + words[idx[1]:]

    text = " ".join(words)
    print('text:', text)
    print('---')

Result:

name length: 1 | google
[100, 'google', 'Google', [0, 1]]
[0, 'google', 'and', [1, 2]]
[27, 'google', 'Microsoft', [2, 3]]
[22, 'google', 'are', [3, 4]]
[22, 'google', 'big', [4, 5]]
[27, 'google', 'companies', [5, 6]]
[40, 'google', 'like', [6, 7]]
[21, 'google', 'International', [7, 8]]
[14, 'google', 'Business', [8, 9]]
[15, 'google', 'Machine', [9, 10]]
max match: 100
text: COMPANY and Microsoft are big companies like International Business Machine
---
name length: 1 | microsoft
[25, 'microsoft', 'COMPANY', [0, 1]]
[0, 'microsoft', 'and', [1, 2]]
[100, 'microsoft', 'Microsoft', [2, 3]]
[17, 'microsoft', 'are', [3, 4]]
[17, 'microsoft', 'big', [4, 5]]
[33, 'microsoft', 'companies', [5, 6]]
[15, 'microsoft', 'like', [6, 7]]
[27, 'microsoft', 'International', [7, 8]]
[24, 'microsoft', 'Business', [8, 9]]
[25, 'microsoft', 'Machine', [9, 10]]
max match: 100
text: COMPANY and COMPANY are big companies like International Business Machine
---
name length: 1 | facebook
[27, 'facebook', 'COMPANY', [0, 1]]
[18, 'facebook', 'and', [1, 2]]
[27, 'facebook', 'COMPANY', [2, 3]]
[36, 'facebook', 'are', [3, 4]]
[18, 'facebook', 'big', [4, 5]]
[24, 'facebook', 'companies', [5, 6]]
[17, 'facebook', 'like', [6, 7]]
[19, 'facebook', 'International', [7, 8]]
[12, 'facebook', 'Business', [8, 9]]
[40, 'facebook', 'Machine', [9, 10]]
max match: 40
text: COMPANY and COMPANY are big companies like International Business Machine
---
name length: 3 | International Business Machine
[33, 'International Business Machine', 'COMPANY and COMPANY', [0, 3]]
[31, 'International Business Machine', 'and COMPANY are', [1, 4]]
[31, 'International Business Machine', 'COMPANY are big', [2, 5]]
[34, 'International Business Machine', 'are big companies', [3, 6]]
[38, 'International Business Machine', 'big companies like', [4, 7]]
[69, 'International Business Machine', 'companies like International', [5, 8]]
[88, 'International Business Machine', 'like International Business', [6, 9]]
[100, 'International Business Machine', 'International Business Machine', [7, 10]]
max match: 100
text: COMPANY and COMPANY are big companies like COMPANY
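
The version above reports word indices such as [7, 10]. If character offsets are needed, as the question asks, one option is to record each word's offsets while splitting. The sketch below is a rough illustration and not the answer's method: it swaps in difflib.SequenceMatcher from the standard library in place of fuzzywuzzy, so scores run from 0 to 1 and are not the same numbers as token_sort_ratio. The function name find_name and the 0.8 threshold are assumptions.

```python
import re
from difflib import SequenceMatcher


def find_name(text, name, threshold=0.8):
    """Return (score, start, end) for the best window of words, or None."""
    # Keep every word together with its character offsets in the text.
    tokens = [(m.group(), m.start(), m.end()) for m in re.finditer(r'\S+', text)]
    size = len(name.split())  # window size = number of words in the name

    best = None
    for i in range(len(tokens) - size + 1):
        window = tokens[i:i + size]
        start, end = window[0][1], window[-1][2]
        # Compare case-insensitively; ratio() returns a float in [0, 1].
        score = SequenceMatcher(None, name.lower(), text[start:end].lower()).ratio()
        if best is None or score > best[0]:
            best = (score, start, end)

    if best is not None and best[0] >= threshold:
        return best
    return None


text = 'Google and Microsoft are big companies like International Business Machine'
match = find_name(text, 'International Business Machine')
if match:
    score, start, end = match
    print(score, start, end, repr(text[start:end]))
    print(text[:start] + 'COMPANY' + text[end:])
```

Because the offsets refer to the original text, the replacement works even when the words are separated by more than one space.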

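EDIT:

Version with fuzzywuzzy.process

This time I don't have positions, and I just use the standard text.replace(item[0], 'COMPANY').

I think it will work correctly in most cases and no better method is needed.

This time I test it on text with mistakes: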

'Gogle and Mikro-Soft are big companies like Fasebok and Internat. Businnes Machin'

all_names = ['google', 'microsoft', 'facebook', 'International Business Machine']

text = 'Google and Microsoft are big companies like Facebook and International Business Machine'

# text with mistakes
text = 'Gogle and Mikro-Soft are big companies like Fasebok and Internat. Businnes Machin'

# ---

import fuzzywuzzy.process
#import fuzzywuzzy.fuzz

for name in sorted(all_names, key=len, reverse=True):
    length = len(name.split())

    words = text.split()
    words = [" ".join(words[i:i+length]) for i in range(0, len(words)-length+1)]
    #print(words)

    #result = fuzzywuzzy.process.extractBests(name, words, scorer=fuzzywuzzy.fuzz.token_sort_ratio, score_cutoff=80)
    result = fuzzywuzzy.process.extractBests(name, words, score_cutoff=80)
    print(name, result)

    for item in result:
        text = text.replace(item[0], 'COMPANY')

print(text)

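【Comments】:

  • I added a version with fuzzywuzzy.process.extractBests. It is short and also works for International Business Machine, but it uses a plain text.replace(), which should do the job in most cases.
  • Thanks @furas. For the last edit we might also consider from difflib import get_close_matches, since difflib is part of the Python standard library. The change would be: result = get_close_matches(name, words), followed by for item in result: text = text.replace(item, 'COMPANY')
  • get_close_matches(name, words, cutoff = CUTOFF_VALUE). Reference: docs.python.org/3/library/…
  • @hafiz031 You could post this code (the full version) with a description as an answer and mark it as accepted.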
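
The difflib variant suggested in the comments could look roughly like the sketch below. It is only an illustration under a few assumptions: the function name replace_names is made up, CUTOFF_VALUE = 0.75 is an arbitrary threshold (difflib scores range from 0 to 1, unlike the 0 to 100 scale of fuzzywuzzy), and the input is already lowercased as the question states, because difflib compares case-sensitively.

```python
from difflib import get_close_matches

CUTOFF_VALUE = 0.75  # assumed threshold; tune it for your own data


def replace_names(text, names, replacement='COMPANY', cutoff=CUTOFF_VALUE):
    """Replace word windows that are close to any of the names."""
    # Longest names first, so multi-word names are handled before short ones.
    for name in sorted(names, key=len, reverse=True):
        length = len(name.split())
        words = text.split()
        # Every run of `length` consecutive words is a candidate.
        windows = [" ".join(words[i:i + length])
                   for i in range(len(words) - length + 1)]
        # get_close_matches returns at most n=3 candidates by default.
        for item in get_close_matches(name, windows, cutoff=cutoff):
            text = text.replace(item, replacement)
    return text


all_names = ['google', 'microsoft', 'facebook', 'international business machine']
text = 'gogle and mikro-soft are big companies like fasebok and internat. businnes machin'

print(replace_names(text, all_names))
```

As with the fuzzywuzzy.process version, this relies on text.replace(), so it gives no positions and replaces every occurrence of a matched substring, including occurrences inside longer words.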