Terminology matching in text: answers

【Question title】: Terminology matching in text
【Posted】: 2017-09-03 00:43:44
【Question】:

I have a list of terms as follows:

a   
abc
a abc
a a abc
abc

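I want to match these terms in a text and rename them to "term1", "term2", and so on. When several terms match at the same place, I want the longest match to count as the correct one.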

Text: I have a and abc maybe abc again and also a a abc.
Output: I have term1 and term2 maybe term2 again and also a term3.

So far I have used the code below, but it does not find the longest match:

for x in terms:
    if x in text:  # plain substring test, so "a" also matches inside "again"
        pass  # do something with the match

【Comments】:

Tags: python-2.7 pattern-matching

【Answer 1】:

You can use re.sub:

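import re

words = ["a",
"abc",
"a abc",
"a a abc"
]

test_str = "I have a and abc maybe abc again and also a a abc."

# Replace the longest terms first so that shorter terms cannot split them up.
for word in sorted(words, key=len, reverse=True):
    # The label is the term's 1-based position in the original list.
    term = "term%i" % (words.index(word) + 1)
    # \b matches at word boundaries, so "a" is not replaced inside "again";
    # re.escape keeps any regex metacharacters in a term literal.
    test_str = re.sub(r"\b%s\b" % re.escape(word), term, test_str)

print(test_str)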

This gives the result you presumably expected (the example in your question contains a mistake):

Input: I have a and abc maybe abc again and also a a abc.
Output: I have term1 and term2 maybe term2 again and also term4.
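
To illustrate why the terms are sorted longest-first, here is a minimal sketch that runs the same sequential substitution in both orders. The function name `replace_sequentially` and the `longest_first` flag are hypothetical helpers introduced only for this comparison; they are not part of the answer above.

```python
import re


def replace_sequentially(text, words, longest_first=True):
    """Replace each term with 'termN', where N is its 1-based list position."""
    if longest_first:
        order = sorted(words, key=len, reverse=True)
    else:
        order = list(words)  # keep the list order, shortest term first here
    for word in order:
        label = "term%d" % (words.index(word) + 1)
        # \b restricts matches to whole words; re.escape keeps the term literal.
        text = re.sub(r"\b%s\b" % re.escape(word), label, text)
    return text


words = ["a", "abc", "a abc", "a a abc"]
text = "I have a and abc maybe abc again and also a a abc."

print(replace_sequentially(text, words, longest_first=True))
# I have term1 and term2 maybe term2 again and also term4.
print(replace_sequentially(text, words, longest_first=False))
# I have term1 and term2 maybe term2 again and also term1 term1 term2.
```

Without the sort, "a" and "abc" are replaced before "a a abc" is ever tried, so the longest term can no longer match.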

【Comments】:

【Answer 2】:

Or use a replacement function with re.sub:

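import re

text = 'I have a and abc maybe abc again and also a a abc'
words = ['a', 'abc', 'a abc', 'a a abc']
# Longest alternatives first, so the regex engine tries "a a abc" before "a".
alternatives = sorted(words, key=len, reverse=True)
regex = re.compile(r'\b(?:' + '|'.join(map(re.escape, alternatives)) + r')\b')


def replacer(m):
    # Called once per match; m.group(0) is the matched term.
    print('replacing : %s' % m.group(0))
    return 'term%d' % (words.index(m.group(0)) + 1)

print(regex.sub(replacer, text))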

Result:

replacing : a
replacing : abc
replacing : abc
replacing : a a abc
I have term1 and term2 maybe term2 again and also term4

Or with an anonymous replacer:

print(re.sub(regex, lambda m: 'term%d' % (words.index(m.group(0)) + 1), text))
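
For longer term lists, calling `words.index` on every match becomes a linear scan each time. A possible variation, sketched below under the assumption that a repeated term should keep the number of its first occurrence (which is what `list.index` returns), precomputes a dictionary from term to number. The names `build_replacer` and `ids` are made up for this sketch.

```python
import re


def build_replacer(words):
    """Return a function that replaces every listed term in a single pass."""
    ids = {}
    for number, word in enumerate(words, 1):
        ids.setdefault(word, number)  # first occurrence wins, like list.index
    # Longest alternatives first so the longest term is tried first at each position.
    alternatives = sorted(ids, key=len, reverse=True)
    pattern = re.compile(
        r"\b(?:%s)\b" % "|".join(re.escape(w) for w in alternatives)
    )
    return lambda text: pattern.sub(lambda m: "term%d" % ids[m.group(0)], text)


replace_terms = build_replacer(["a", "abc", "a abc", "a a abc", "abc"])
print(replace_terms("I have a and abc maybe abc again and also a a abc."))
# I have term1 and term2 maybe term2 again and also term4.
```

Because the whole text is scanned once, text that has already been replaced is never matched again, which the sequential approach cannot guarantee in general.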

【Comments】: