【Question title】: Python SpaCy Regex does not pick up the token that contains a word
【Posted】: 2019-07-27 11:56:21
【Question】:

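I am running the simple code below to get all tokens that contain a given word (for example, words containing "compared", such as acompared, notcompared, thiscompared).

However, the spaCy regex returns nothing. The same regex works fine with Python re.

Could you let me know whether this is a spaCy issue, or how to fix it?

It returns [], an empty list.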

import plac
from spacy.lang.en import English
from spacy.matcher import PhraseMatcher, Matcher
from spacy.tokens import Doc, Span, Token
import spacy

nlp = spacy.load("en_core_web_sm")

text = """
"Net income was $9.4 million acompared to the prior year of $2.7
million.",
"Revenue exceeded twelve billion dollars, with a loss of $1b. run",
"""

doc = nlp(text)

pattern = [{"LOWER": {"REGEX": "\b\wcompared\w\b"}}]

matcher = Matcher(nlp.vocab)
matcher.add("item", None, pattern )
matches = matcher(doc)
print(matches)
print(matcher)

This code should return the position of the "compared" token.

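【Comments】:

  • I don't see this regex working with python re either: since you have \wcompared\w, it will try to match a word character followed by compared followed by a word character (surrounded by word boundaries), which does not occur in the text.

Tags: python regex spacy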


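【Answer 1】:

I don't even see this regex working with python re, because it tries to match a single word character followed by compared followed by another single word character (surrounded by word boundaries). Nothing in your text matches the following pattern:

\b\wcompared\w\b

You can simply change your regex to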

\b(a|this|not)compared\b
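To make the difference concrete, here is a small check with plain python re (not spaCy). The sentence is taken from the question. The `non_raw` variable is a simplified stand-in I added to show one more thing worth checking: in an ordinary (non-raw) string literal, Python turns `\b` into a backspace character before the regex engine ever sees it, so writing the pattern as a raw string is probably a good idea either way.

```python
import re

sentence = "Net income was $9.4 million acompared to the prior year of $2.7 million."

# Simplified stand-in for a non-raw pattern: "\b" here is the backspace
# character (0x08), not a regex word boundary.
non_raw = "\bcompared\b"

# Raw-string versions of the question's regex and of the suggested one.
original = r"\b\wcompared\w\b"
suggested = r"\b(a|this|not)compared\b"

print(repr(non_raw[0]))                        # '\x08'
print(re.findall(original, sentence))          # []
print(re.search(suggested, sentence).group())  # acompared
```

The original regex finds nothing because "acompared" is followed by a space, and `\w` requires exactly one more word character after "compared".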


【Comments】:

    【Answer 2】:

    RegEx 1

    If we want to find any word that contains compared, maybe this expression would work:

    \b\w*(?:compared)\w*\b
    


    Test with re.finditer

    import re
    
    regex = r"\b\w*(?:compared)\w*\b"
    
    test_str = "some text you wish before then compared or anythingcompared or any_thing_01_compared_anything_after_that "
    
    matches = re.finditer(regex, test_str)
    
    for matchNum, match in enumerate(matches, start=1):
    
        print ("Match {matchNum} was found at {start}-{end}: {match}".format(matchNum = matchNum, start = match.start(), end = match.end(), match = match.group()))
    
        for groupNum in range(0, len(match.groups())):
            groupNum = groupNum + 1
    
            print ("Group {groupNum} found at {start}-{end}: {group}".format(groupNum = groupNum, start = match.start(groupNum), end = match.end(groupNum), group = match.group(groupNum)))
    

    RegEx 2

    If we instead want to find whole strings that contain compared, my guess is that this expression in s mode,

    ^(?=.*\bacompared\b|\bthiscompared\b|\bnotcompared\b).*$
    


    or this one in m mode

    ^(?=[\s\S]*\bacompared\b|\bthiscompared\b|\bnotcompared\b)[\s\S]*$
    

    might be a starting point for solving the problem.


    Test 1 with re.findall

    import re
    
    regex = r"^(?=.*\bacompared\b|\bthiscompared\b|\bnotcompared\b).*$"
    
    test_str = ("Net income was $9.4 million acompared to the prior year of $2.7        million.,\n\n"
        "some other words with new lines")
    
    print(re.findall(regex, test_str, re.DOTALL))
    

    Test 2 with re.findall

    import re
    
    regex = r"^(?=[\s\S]*\bacompared\b|\bthiscompared\b|\bnotcompared\b)[\s\S]*$"
    
    test_str = ("Net income was $9.4 million acompared to the prior year of $2.7        million.,\n\n"
        "some other words with new lines")
    
    print(re.findall(regex, test_str, re.MULTILINE))
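One caveat about the RegEx 2 expressions, as far as I can tell: inside the lookahead, the leading `.*` belongs only to the first alternative, so thiscompared and notcompared would only be found at the very start of the string. The sketch below compares the s-mode expression with a grouped variant; the grouped pattern and the test sentence are my own assumptions about the intent, not part of the original answer.

```python
import re

# Hypothetical test sentence using the second alternative.
text = "Net income was $9.4 million thiscompared to the prior year."

# Expression from the answer: ".*" is attached to the first alternative only.
ungrouped = r"^(?=.*\bacompared\b|\bthiscompared\b|\bnotcompared\b).*$"

# Grouped variant (assumed intent): ".*" precedes every alternative.
grouped = r"^(?=.*\b(?:acompared|thiscompared|notcompared)\b).*$"

ungrouped_hits = re.findall(ungrouped, text, re.DOTALL)
grouped_hits = re.findall(grouped, text, re.DOTALL)

print(ungrouped_hits)  # []
print(grouped_hits)    # one element: the whole text
```

With acompared in the text, both expressions behave the same, which is why the two tests above still print a match.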
    

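    【Comments】:

    • Thanks for your help. Unfortunately, this still does not work with spaCy.
    【Answer 3】:

    While the answers above work for python re, spaCy needs a specific format for describing patterns. The pattern should contain the key "TEXT". For example,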

    pattern = [{"TEXT": {"REGEX": "compared*"}}].
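To see what such a pattern would pick up, here is a rough simulation in plain python re. It rests on two assumptions: that spaCy applies the REGEX value to each token separately with search semantics (a hit anywhere inside the token counts), and that a whitespace split is close enough to spaCy's tokenizer for this sentence. The helper `matching_tokens` and the sample sentence are hypothetical, made up for illustration.

```python
import re


def matching_tokens(tokens, regex):
    """Return the tokens in which the regex is found anywhere (search semantics)."""
    compiled = re.compile(regex)
    return [token for token in tokens if compiled.search(token)]


# Whitespace split stands in for spaCy's tokenizer (an assumption).
tokens = "Net income was acompared to the prior year , compare that".split()

# "compared*" means "compare" followed by zero or more "d" characters.
loose = matching_tokens(tokens, r"compared*")

# Anchored variant: at least one word character, then "compared", nothing else.
strict = matching_tokens(tokens, r"^\w+compared$")

print(loose)   # ['acompared', 'compare']
print(strict)  # ['acompared']
```

If that assumption about per-token search holds, a tighter spaCy pattern would look something like `[{"TEXT": {"REGEX": r"^\w+compared$"}}]`, written as a raw string so the backslashes reach the regex engine intact.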
    

    【Comments】:
