【Question title】: regex whole string match between numbers
【Posted】: 2022-09-23 02:15:04
【Question】:

I want to extract a whole word from a sentence. Thanks to this answer, I have:

import re

def findWholeWord(w):
    return re.compile(r'\b({0})\b'.format(w), flags=re.IGNORECASE).search

With this, I get the whole word in the following cases:

findWholeWord('thomas')('this is Thomas again')   # -> <match object>
findWholeWord('thomas')('this is,Thomas again')   # -> <match object>
findWholeWord('thomas')('this is,Thomas, again')  # -> <match object>
findWholeWord('thomas')('this is.Thomas, again')  # -> <match object>
findWholeWord('thomas')('this is ?Thomas again')  # -> <match object>

Punctuation next to the word does not get in the way.

However, if there is a digit next to the word, the word is not found.

How should I modify the expression so that it also matches when the word has a digit next to it? For example:

findWholeWord('thomas')('this is 9Thomas, again')
findWholeWord('thomas')('this is9Thomas again')
findWholeWord('thomas')('this is Thomas36 again')
  • What do you mean by "extract"? Why do you need a regex at all? What about pos = s.find(word); return s[pos:pos+len(word)]?
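
As a rough illustration of why the plain substring search suggested in that comment may not be enough here, the sketch below wraps it in a hypothetical helper, find_substring (the name and the case-insensitive handling are assumptions, not part of the thread). It finds the word, but it also reports a hit inside a longer word, which the \b-based pattern rejects.

```python
def find_substring(sentence, word):
    # Plain case-insensitive substring search, along the lines of the comment.
    # Assumes text for which lower() does not change the string length.
    pos = sentence.lower().find(word.lower())
    return None if pos == -1 else sentence[pos:pos + len(word)]

print(find_substring('this is Thomas again', 'thomas'))  # Thomas
print(find_substring('thomason no match', 'thomas'))     # thomas (inside a longer word)
```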

Tags: python regex string


【Solution 1】:

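You can use the regex (?:\d|\b){0}(?:\d|\b), where {0} is the placeholder that str.format replaces with the target word. It matches the target word when each side is either a word boundary or a digit.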

import re

def findWholeWord(w):
    return re.compile(r'(?:\d|\b){0}(?:\d|\b)'.format(w), flags=re.IGNORECASE).search

for s in [
    'this is thomas',
    'this is Thomas again',
    'this is,Thomas again',
    'this is,Thomas, again',
    'this is.Thomas, again',
    'this is ?Thomas again',
    'this is 9Thomas, again',
    'this is9Thomas again',
    'this is Thomas36 again',
    'this is 1Thomas2 again',
    'this is -Thomas- again',
    'athomas is no match',
    'thomason no match']:
    print("match >" if findWholeWord('thomas')(s) else "*no match* >", s)

Output:

match > this is thomas
match > this is Thomas again
match > this is,Thomas again
match > this is,Thomas, again
match > this is.Thomas, again
match > this is ?Thomas again
match > this is 9Thomas, again
match > this is9Thomas again
match > this is Thomas36 again
match > this is 1Thomas2 again
match > this is -Thomas- again
*no match* > athomas is no match
*no match* > thomason no match

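If you want to reuse the same target word for several inputs or inside a loop, you can assign the result of the findWholeWord() call to a variable and then call that variable.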

matcher = findWholeWord('thomas')
print(matcher('this is Thomas again'))
print(matcher('this is,Thomas again'))
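
One thing this snippet does not address is a target word that contains regex metacharacters such as '.' or '+'. A minimal sketch, assuming the word should always be treated as literal text; find_whole_word_literal is a hypothetical name, and the only change is passing the word through re.escape before formatting it into the same pattern:

```python
import re

def find_whole_word_literal(w):
    # re.escape makes characters such as '.' match literally
    # instead of acting as regex operators.
    pattern = r'(?:\d|\b){0}(?:\d|\b)'.format(re.escape(w))
    return re.compile(pattern, flags=re.IGNORECASE).search

matcher = find_whole_word_literal('st.thomas')
print(matcher('this is St.Thomas again'))   # a match object
print(matcher('this is StxThomas again'))   # None: '.' no longer matches 'x'
```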

【Discussion】:

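  • This works, but it would also pick up "this is Thomas36b again", so a small change fixes that: re.compile(r'(?:\b\d+|\b){0}(?:\d+\b|\b)'.format(w), flags=re.I).search
  • @omuthu Good point. The original poster needs to review the various edge cases and refine the criteria for what should and should not match.
  • Thanks @CodeMonkey! Exactly what I was looking for. In my problem, the case @omuthu points out should not (a priori) cause any trouble, but it is a good point to consider as well!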
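
To make the difference discussed in these comments concrete, here is a small side-by-side sketch. The names find_loose and find_strict are hypothetical labels for the answer's pattern and the commenter's variant; the patterns themselves are taken from the thread.

```python
import re

def find_loose(w):
    # Pattern from the answer: a digit or a word boundary on each side.
    return re.compile(r'(?:\d|\b){0}(?:\d|\b)'.format(w), flags=re.IGNORECASE).search

def find_strict(w):
    # Variant from the comment: an adjacent run of digits must itself
    # start or end at a word boundary.
    return re.compile(r'(?:\b\d+|\b){0}(?:\d+\b|\b)'.format(w), flags=re.IGNORECASE).search

s = 'this is Thomas36b again'
print(find_loose('thomas')(s))    # matches 'Thomas3'
print(find_strict('thomas')(s))   # None
print(find_strict('thomas')('this is9Thomas again'))  # None as well
```

The stricter variant is a trade-off: because there is no word boundary between "is" and "9", it also rejects 'this is9Thomas again', one of the inputs the question wanted to match.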
【Solution 2】:

You can use the following code:

import re

def findWholeWord(w):
    return re.compile(r'(?:\d+{0}|{0}\d+|\b{0}\b)'.format(w), flags=re.I).search


print ( findWholeWord('thomas')('this is 9Thomas, again') )
print ( findWholeWord('thomas')('this is9Thomas again') )
print ( findWholeWord('thomas')('this is Thomas36 again') )
print ( findWholeWord('thomas')('this is Thomas again') )
print ( findWholeWord('thomas')('this is,Thomas again') )
print ( findWholeWord('thomas')('this is,Thomas, again') )
print ( findWholeWord('thomas')('this is.Thomas, again') )
print ( findWholeWord('thomas')('this is ?Thomas again') )
print ( findWholeWord('thomas')('this is aThomas again') )

Output:

<re.Match object; span=(8, 15), match='9Thomas'>
<re.Match object; span=(7, 14), match='9Thomas'>
<re.Match object; span=(8, 16), match='Thomas36'>
<re.Match object; span=(8, 14), match='Thomas'>
<re.Match object; span=(8, 14), match='Thomas'>
<re.Match object; span=(8, 14), match='Thomas'>
<re.Match object; span=(8, 14), match='Thomas'>
<re.Match object; span=(9, 15), match='Thomas'>
None

(?:\d+{0}|{0}\d+|\b{0}\b) matches the given word either with one or more digits directly before or after it, or as a complete word.
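
Note that with this pattern the adjacent digits become part of the matched text ('9Thomas', 'Thomas36'). If only the word itself is wanted, one possible alternative, not taken from either answer, is to use lookarounds that reject neighbouring letters. This is a sketch under that assumption; find_word_only is a hypothetical name, and the character class [^\W\d_] (a word character that is neither a digit nor an underscore) means underscores are also treated as non-letters.

```python
import re

def find_word_only(w):
    # [^\W\d_] matches exactly one letter. The lookbehind and lookahead
    # require that no letter touches the word, without consuming anything.
    letter = r'[^\W\d_]'
    pattern = r'(?<!{1}){0}(?!{1})'.format(re.escape(w), letter)
    return re.compile(pattern, flags=re.IGNORECASE).search

finder = find_word_only('thomas')
print(finder('this is 9Thomas, again'))   # span=(9, 15), match='Thomas'
print(finder('this is Thomas36 again'))   # span=(8, 14), match='Thomas'
print(finder('this is aThomas again'))    # None
```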

【Discussion】:

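  • Thanks @anubhava. This solution seems to work fine too. I don't know what the main difference from the accepted answer is, but both seem to do the same thing (at least both do what I need).
  • It is actually the same approach. CodeMonkey has optimized this regex further.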