Regex - matching a substring of text to a substring of a pattern

【Question title】: Regex - matching a substring of text to a substring of a pattern
【Posted】: 2016-11-15 02:15:41
【Question description】:

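So I'm in a somewhat counterintuitive situation and would like some advice. Mostly I'm just doing string matching, using an extracted string as the pattern for a regex. While I usually do fine overall with fuzzy regex searches, I sometimes run into this situation: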

Say I have extracted the following pattern from some data (I'm using the Python regex package).

pattern = 'the quick brown fox jumps over the lazy dog'

Now I need it to match a string that may look like either of these, though mostly the first.

string = 'quick brown fox jumps over the lazy'
string2 = 'and then a quick brown fox jumps onto the cat'

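Because of the leading and trailing characters, I obviously won't get a match if I try what I've been doing so far, which currently looks like this: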

import regex  # third-party package; the stdlib re module has no {e<=2} fuzzy syntax

if regex.search("(" + pattern + "){e<=2}", string):
    print(True)

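Unfortunately, the error count isn't consistent: there may be many characters leading and/or trailing the pattern. Given that I don't know a priori whether I'll run into this, is there anything I can do to get a match whenever a sufficient substring of the pattern matches? I looked into Levenshtein distance to account for this, but it requires setting a threshold that seems to be extremely sensitive to the length of the strings being matched (even after normalizing by length), so it ends up being a toss-up whether I get a match when I want one. Are there other options, or better ways to normalize the results?

Also, one thing I can't do is always pick the best match, because sometimes the correct entry doesn't actually appear in the text I'm checking.

Is there something in the regex package I'm missing that could help with this?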

【Comments】:

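Have you checked out nltk? It sounds like you want to compare stem frequencies in the strings (possibly weighted by overall word frequency) and return the best match. I think nltk supports this. textminingonline.com/…
What counts as a sufficient substring of the pattern? That is a value you usually have to work out yourself and then use together with a Levenshtein distance function.
What about interleaved words, as in string='quick blah brown blah fox blah blog jumps blow over blech the crazy'?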
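One way to make "sufficient substring" concrete, as the second comment asks, is to measure the longest run of consecutive words that the pattern and the text share, and to compare that against a threshold you choose yourself. The following is only a minimal sketch using the standard library's difflib; the function names and the min_words threshold are hypothetical and not something from the question or the regex package.

```python
from difflib import SequenceMatcher


def longest_shared_run(pattern: str, text: str) -> int:
    """Return the length, in words, of the longest contiguous word run
    that appears in both strings."""
    a, b = pattern.split(), text.split()
    # autojunk=False keeps difflib from discarding frequent words on long inputs
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    match = matcher.find_longest_match(0, len(a), 0, len(b))
    return match.size


def is_sufficient(pattern: str, text: str, min_words: int = 4) -> bool:
    """Hypothetical acceptance rule: enough consecutive words in common."""
    return longest_shared_run(pattern, text) >= min_words


pattern = 'the quick brown fox jumps over the lazy dog'
string = 'quick brown fox jumps over the lazy'
string2 = 'and then a quick brown fox jumps onto the cat'

print(longest_shared_run(pattern, string))   # 7
print(longest_shared_run(pattern, string2))  # 4
```

Because the score is counted in whole words rather than normalized by character length, it does not change when extra text is added before or after the shared run. It does collapse to 1 on the interleaved example from the last comment, so that case would need a different measure.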

Tags: python regex string-matching fuzzy-search

【Solution 1】:

Phew, this took me a long time to work out (I'm not a Python developer), but this should do the trick:

import re

sentence = "the quick brown fox jumps over the lazy dog"
string = 'quick brown fox jumps over the lazy'
string2 = 'and then a quick brown fox jumps onto the cat'

# Alternation of every word in the sentence: "the |quick |...|dog"
words = re.sub(r'(\w+\s*)', r'\1|', sentence).rstrip("|")

# Either skip one character that does not start a known word, or consume
# the sentence's words in order, each inside its own optional capture group.
pattern = "(?:(?!" + words + ").|" + re.sub(
    r'(\w+\s*)',
    r'(\1){0,1}',
    sentence
) + ")+"


def count_matched_words(text):
    """Count the capture groups (one per word) that took part in the match."""
    results = re.match(pattern, text)
    if results is None:
        return 0
    # groups() covers every group, including the last one
    return sum(1 for group in results.groups() if group)


count1 = count_matched_words(string)
count2 = count_matched_words(string2)

message = ('The following string: "' + string + '" matched ' + str(count1) +
           ' time(s) and the following string: "' + string2 + '" matched ' +
           str(count2) + ' time(s).')
print(message)

Test it here: http://www.pythontutor.com/visualize.html#mode=edit
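For comparison, the idea of counting how many of the pattern's words show up in order can also be sketched without a regex, as a longest common subsequence over words. This is not the code above rewritten, just an alternative under the assumption that whitespace-separated words are the right unit; the function name is hypothetical. Unlike a contiguous match, it tolerates the interleaved words raised in the comments.

```python
def common_words_in_order(pattern: str, text: str) -> int:
    """Length of the longest common subsequence of words, i.e. how many
    words of the pattern appear in the text in the same relative order."""
    a, b = pattern.split(), text.split()
    # table[i][j] holds the LCS length of a[:i] and b[:j]
    table = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, word_a in enumerate(a, start=1):
        for j, word_b in enumerate(b, start=1):
            if word_a == word_b:
                table[i][j] = table[i - 1][j - 1] + 1
            else:
                table[i][j] = max(table[i - 1][j], table[i][j - 1])
    return table[len(a)][len(b)]


sentence = 'the quick brown fox jumps over the lazy dog'
string = 'quick brown fox jumps over the lazy'
string2 = 'and then a quick brown fox jumps onto the cat'
interleaved = 'quick blah brown blah fox blah blog jumps blow over blech the crazy'

print(common_words_in_order(sentence, string))       # 7
print(common_words_in_order(sentence, string2))      # 5
print(common_words_in_order(sentence, interleaved))  # 6
```

Dividing the count by the number of words in the pattern gives a coverage fraction that does not depend on how much unrelated text surrounds the match, which may be easier to threshold than a length-normalized edit distance.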

【Discussion】: