【Question Title】: Define multiple word tokens and extract all tokens after the words with SpaCy Matcher
【Posted】: 2020-05-03 01:36:35
【Question Description】:

I am having some trouble understanding the SpaCy Matcher module.

I have a sentence: I think this is great, but I would not do it again

I want to return the text but I would not do it again.

What I have so far is:

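import spacy
from spacy.matcher import Matcher

nlp = spacy.load("en_core_web_sm")
matcher = Matcher(nlp.vocab)

pattern = [{"LOWER": "but"}]
# Register the pattern (spaCy v2 signature; spaCy v3 expects matcher.add("BUT", [pattern]))
matcher.add("BUT", None, pattern)
doc = nlp("I think this is great, but I would not do it again")
matches = matcher(doc)
for match_id, start, end in matches:
    string_id = nlp.vocab.strings[match_id]  # Get string representation
    span = doc[start:end]  # The matched span
    print(span.text)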

This code only returns but.

Also, is it possible to build the pattern from a list of strings, for example:

list_of_match_words = ['but', 'particularly']
pattern = [{'LOWER'}: list_of_match_words}] 

Or something similar? I know the above will not run.

【Question Discussion】:

    Tags: python nlp spacy


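    【Solution 1】:

    You can use the REGEX operator to match the specific tokens of your choice, and then use {"OP": "*"} to grab the remaining tokens to the right of the matched token:

    list_of_match_words = ['but', 'particularly']
    pattern = [{"TEXT" : {"REGEX": "(?i)^(?:{})$".format("|".join(list_of_match_words))}}, {"OP": "*"}]
    # spaCy v2 signature; spaCy v3 expects matcher.add("list_of_match_words", [pattern])
    matcher.add("list_of_match_words", None, pattern)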
    

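    Here, the regex looks like (?i)^(?:but|particularly)$ and matches:

    • (?i) - case-insensitive mode on
    • ^ - start of string (here, of the token)
    • (?:but|particularly) - a non-capturing group matching either but or particularly
    • $ - end of string (here, of the token).

    The {"OP": "*"} part matches any token, 0 or more times.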
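To see what the generated regex accepts without loading a spaCy model, it can be tried against plain token strings with Python's re module. This is only a sketch: the token list is made up for illustration, re.escape is an addition not present in the answer (it guards against words containing regex metacharacters), and it assumes each token's text is tested against the regex on its own.

```python
import re

list_of_match_words = ["but", "particularly"]

# Build the same alternation as in the answer; re.escape is an added safeguard.
regex = "(?i)^(?:{})$".format("|".join(re.escape(w) for w in list_of_match_words))
token_re = re.compile(regex)

# Hypothetical token texts, roughly what a tokenizer might produce.
tokens = ["I", "think", "great", "particularly", ",", "But", "butter", "I"]

# Keep only the tokens whose whole text matches one of the words, ignoring case.
hits = [t for t in tokens if token_re.search(t)]

print(regex)
print(hits)
```

Because of the ^ and $ anchors, butter is rejected even though it starts with but, while But is accepted thanks to (?i).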

    Full spaCy snippet:

    import spacy
    from spacy.matcher import Matcher
    from itertools import groupby
    
    nlp = spacy.load("en_core_web_sm")
    matcher = Matcher(nlp.vocab)
    
    list_of_match_words = ['but', 'particularly']
    pattern = [{"TEXT" : {"REGEX": "(?i)^(?:{})$".format("|".join(list_of_match_words))}}, {"OP": "*"}]
    # spaCy v2 signature; spaCy v3 expects matcher.add("list_of_match_words", [pattern])
    matcher.add("list_of_match_words", None, pattern)
    doc = nlp("I think this is great particularly, but I would not do it again")
    matches = matcher(doc)
    # Group the matches by start index and keep the longest match (largest end) in each group
    results = [max(list(group),key=lambda x: x[2]) for k, group in groupby(sorted(matches, key=lambda x: x[1]), lambda x: x[1])]
    print("Matches:", [doc[start:end].text for match_id, start, end in results])
    

    Output:

    Matches: ['particularly, but I would not do it again', 'but I would not do it again']
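The results line in the snippet is dense, so here is a minimal standalone sketch of the same idea: since {"OP": "*"} can produce several matches with the same start, the matches are grouped by start index and only the one with the largest end is kept. The tuples below are hypothetical (match_id, start, end) values, and longest_per_start is an illustrative helper name, not part of spaCy.

```python
from itertools import groupby

# Hypothetical (match_id, start, end) tuples, in the shape a matcher returns.
matches = [(1, 5, 6), (1, 5, 10), (1, 5, 14), (1, 7, 8), (1, 7, 14)]

def longest_per_start(matches):
    """Keep, for each start index, the match with the largest end index."""
    # groupby only merges adjacent items, so sort by start first.
    by_start = sorted(matches, key=lambda m: m[1])
    return [
        max(group, key=lambda m: m[2])
        for _, group in groupby(by_start, key=lambda m: m[1])
    ]

results = longest_per_start(matches)
print(results)  # [(1, 5, 14), (1, 7, 14)]
```

Sorting before grouping matters here: without it, matches sharing a start index but not adjacent in the list would end up in separate groups.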
    

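    【Discussion】:

    • Hey, thanks for your answer. What I am looking for is how to extract the full sentence after a specific word. Is that doable?
    • @JOKKINATOR What is the expected output for list_of_match_words = ['but', 'particularly'] and the text "I think this is great particularly, but I would not do it again"?
    • "but I would not do it again"
    • @JOKKINATOR Why? Why not ['particularly, but I would not do it again', 'but I would not do it again']?
    • Your output is exactly what I expected, I was wrong. Of course.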