Define multiple word tokens and extract all tokens after the words with SpaCy Matcher: answers

【Question title】: Define multiple word token and extract all tokens after the words with SpaCy Matcher
【Posted】: 2020-05-03 01:36:35
【Question description】:

I am having some trouble understanding the SpaCy Matcher module.

I have a sentence: I think this is great, but I would not do it again

I want to return the text but I would not do it again.

What I have so far is:

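import spacy
from spacy.matcher import Matcher

nlp = spacy.load("en_core_web_sm")
matcher = Matcher(nlp.vocab)

pattern = [{"LOWER": "but"}]
# Register the pattern; without this call the matcher returns nothing.
# spaCy v3 signature shown; on v2 use matcher.add("BUT", None, pattern).
matcher.add("BUT", [pattern])
doc = nlp("I think this is great, but I would not do it again")
matches = matcher(doc)
for match_id, start, end in matches:
    string_id = nlp.vocab.strings[match_id]  # Get string representation
    span = doc[start:end]  # The matched span
    print(span.text)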

This code only returns but.

Also, is it possible to build the pattern from a list of strings, for example:

list_of_match_words = ['but', 'particularly']
pattern = [{'LOWER'}: list_of_match_words}]

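Or something similar? I know the above will not run.

【Question discussion】:

Tags: python nlp spacy

【Solution 1】:

You can use the REGEX operator to match the specific tokens you choose, and then use {"OP": "*"} to capture the remaining tokens to the right of the matched token: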

list_of_match_words = ['but', 'particularly']
pattern = [{"TEXT" : {"REGEX": "(?i)^(?:{})$".format("|".join(list_of_match_words))}}, {"OP": "*"}]
matcher.add("list_of_match_words", [pattern])  # spaCy v3; on v2 use matcher.add("list_of_match_words", None, pattern)

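Here the regex looks like (?i)^(?:but|particularly)$ and matches:

(?i) - case-insensitive mode on
^ - start of string (here, of the token)
(?:but|particularly) - a non-capturing group matching either the string but or the string particularly
$ - end of string (here, of the token).

The {"OP": "*"} part matches any token, zero or more times.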
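To see what the anchors buy you, the token-level regex can be tried on its own with the standard re module. This is only a sketch: the token list is made up for illustration, and the re.escape call is an addition that is not in the answer (it guards against trigger words containing regex metacharacters).

```python
import re

list_of_match_words = ["but", "particularly"]

# Build the same alternation as in the answer; re.escape is an extra safety step.
token_regex = "(?i)^(?:{})$".format(
    "|".join(re.escape(word) for word in list_of_match_words)
)

# Hypothetical token texts, standing in for what a tokenizer would produce.
tokens = ["I", "think", "this", "is", "great", ",", "But", "butter", "PARTICULARLY"]

# Keep only the tokens whose whole text is one of the trigger words.
hits = [token for token in tokens if re.search(token_regex, token)]

print(token_regex)  # (?i)^(?:but|particularly)$
print(hits)         # ['But', 'PARTICULARLY']
```

Without ^ and $, a token such as butter would also be accepted, because the pattern would then match any token that merely contains one of the words.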

Full spaCy snippet:

import spacy
from spacy.matcher import Matcher
from itertools import groupby

nlp = spacy.load("en_core_web_sm")
matcher = Matcher(nlp.vocab)

list_of_match_words = ['but', 'particularly']
pattern = [{"TEXT" : {"REGEX": "(?i)^(?:{})$".format("|".join(list_of_match_words))}}, {"OP": "*"}]
# spaCy v3 signature; on v2 use matcher.add("list_of_match_words", None, pattern)
matcher.add("list_of_match_words", [pattern])
doc = nlp("I think this is great particularly, but I would not do it again")
matches = matcher(doc)
# The "*" operator yields one match per possible length, so group the
# matches by start index and keep only the longest one in each group.
results = [
    max(group, key=lambda x: x[2])
    for _, group in groupby(sorted(matches, key=lambda x: x[1]), lambda x: x[1])
]
print("Matches:", [doc[start:end].text for match_id, start, end in results])

Output:

Matches: ['particularly, but I would not do it again', 'but I would not do it again']
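The groupby line in the snippet is what reduces the many overlapping matches to one per trigger word. The sketch below isolates that step on plain tuples so it can be run without spaCy; the tuples and the helper name longest_per_start are hypothetical, shaped like the (match_id, start, end) triples the matcher returns.

```python
from itertools import groupby

# Hypothetical (match_id, start, end) triples, shaped like Matcher output.
matches = [(1, 5, 6), (1, 5, 7), (1, 7, 8), (1, 5, 14), (1, 7, 14), (1, 7, 9)]


def longest_per_start(matches):
    """Keep, for every start index, the match with the largest end index."""
    # groupby only merges adjacent items, so sort by start index first.
    by_start = sorted(matches, key=lambda m: m[1])
    return [
        max(group, key=lambda m: m[2])
        for _, group in groupby(by_start, key=lambda m: m[1])
    ]


results = longest_per_start(matches)
print(results)  # [(1, 5, 14), (1, 7, 14)]
```

Sorting before grouping is essential here: itertools.groupby starts a new group whenever the key changes, so unsorted input would split matches with the same start index across several groups.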

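【Discussion】:

Hey, thanks for your answer. I am looking at how to extract the complete sentence after a specific word. Is that doable?
@JOKKINATOR What is the expected output for list_of_match_words = ['but', 'particularly'] and the text "I think this is great particularly, but I would not do it again"?
"but I would not do it again"
@JOKKINATOR Why? Why not ['particularly, but I would not do it again', 'but I would not do it again']?
Your output is exactly what I expected; I was wrong. Of course.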
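As a possible alternative to the regex approach, the list of words from the question can be passed directly through the IN set operator of the Matcher, and everything after the trigger word can be taken by slicing the doc from the match start. This is a sketch under assumptions rather than the answerer's method: it assumes spaCy v3 (for the matcher.add signature), uses a blank English pipeline so no model download is needed, and the pattern key TRIGGER_WORD is an arbitrary name chosen for illustration.

```python
import spacy
from spacy.matcher import Matcher

# Assumption: a tokenizer-only pipeline is enough, since LOWER needs no model.
nlp = spacy.blank("en")
matcher = Matcher(nlp.vocab)

list_of_match_words = ["but", "particularly"]

# Match a single token whose lowercase form is in the list.
pattern = [{"LOWER": {"IN": list_of_match_words}}]
matcher.add("TRIGGER_WORD", [pattern])  # spaCy v3 signature

doc = nlp("I think this is great particularly, but I would not do it again")

# For each trigger word, take the span from that token to the end of the doc.
tails = [
    doc[start:].text
    for _, start, _ in sorted(matcher(doc), key=lambda m: m[1])
]
print(tails)
```

Because the pattern matches exactly one token, there are no overlapping matches to deduplicate. For a text with several sentences, slicing to the end of the doc would run past the sentence; one option would be to stop at doc[start].sent.end, which assumes the pipeline sets sentence boundaries.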