How do I "select" the parts of a spaCy pattern match, rather than the entire match? (Answers)

【Question title】: How do I "select" the parts of a spaCy pattern match, rather than the entire match?
【Posted】: 2021-06-02 00:43:24
【Question】:

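Rule-based pattern matching in spaCy returns a match ID along with the start and end of the matched span (as token indices), but I don't see anything in the documentation that says how to determine which tokens in that span matched which parts of the pattern.

In a regular expression, I can put parentheses around groups to "select" them and pull them out of the pattern. Can spaCy do this?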
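For readers less familiar with the regex feature being compared against, here is a minimal sketch of capture groups using Python's standard re module. The regex itself is made up for illustration and is not part of the spaCy experiment; the point is only that each parenthesized group can be read back on its own, and an optional group that did not take part in the match comes back as None.

```python
import re

# Illustrative regex only; it is not part of the spaCy experiment.
# Group 1 captures the subject, group 2 an optional adjective before "boots".
BOOTS = re.compile(r"(\w+) wore (?:(\w+) )?boots")

for text in ("They wore high boots", "They wore boots"):
    match = BOOTS.match(text)
    subject, adjective = match.groups()
    # group(0) is the whole match; the numbered groups are its selected parts.
    print(match.group(0), "->", subject, adjective)

# They wore high boots -> They high
# They wore boots -> They None
```

The question below asks for the equivalent of match.groups() for a token-based spaCy pattern.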

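For example, I have this text (from Dracula):

They wore high boots, with their trousers tucked into them, and had long black hair and heavy black moustaches.

And I've defined an experiment: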

import spacy
from spacy.matcher import Matcher

def test_match(text, patterns):
    nlp = spacy.load('en_core_web_sm')
    matcher = Matcher(nlp.vocab)
    matcher.add('Boots', [patterns])  # spaCy v3 signature; v2 used matcher.add('Boots', None, patterns)

    doc = nlp(text)
    matches = matcher(doc)

    for match in matches:
        match_id, start, end = match
        string_id = nlp.vocab.strings[match_id]
        span = doc[start:end]
        print(match, span.text)

text_a = "They wore high boots, with their trousers tucked into them, " \
         "and had long black hair and heavy black moustaches."

patterns = [
    {'POS': 'PRON'},
    {'TAG': 'VBD'},
    {'POS': 'ADJ'},
    {'TAG': 'NNS'}
]

test_match(text_a, patterns)

This outputs:

(18231591219755621867, 0, 4) They wore high boots

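For a simple pattern like this one, with four tokens in a row, I can assume that token 0 is the pronoun, token 1 is the past-tense verb, and so on. But for patterns with quantifiers it becomes ambiguous. Is it possible to have spaCy tell me which tokens actually matched which components of the pattern?

For example, take this modification to the experiment above, with two * quantifiers in the pattern and a new version of the text that is missing the adjective "high":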

text_b = "They wore boots, with their trousers tucked into them, " \
         "and had long black hair and heavy black moustaches."

patterns = [
    {'POS': 'PRON'},
    {'TAG': 'VBD'},
    {'POS': 'ADJ', 'OP': '*'},
    {'TAG': 'NNS', 'OP': '*'}
]

test_match(text_a, patterns)
print()
test_match(text_b, patterns)

Which outputs:

(18231591219755621867, 0, 2) They wore
(18231591219755621867, 0, 3) They wore high
(18231591219755621867, 0, 4) They wore high boots

(18231591219755621867, 0, 2) They wore
(18231591219755621867, 0, 3) They wore boots

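In both cases it is unclear which of the final tokens is the adjective and which is the plural noun. I suppose I could loop over the tokens in the span and manually re-check them against the parts of the pattern, but that is clearly duplicated work. Since spaCy has to identify them in order to match them, can't it just tell me which is which?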
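The manual workaround mentioned above could look roughly like the sketch below. It is only an illustration: label_span and ROLE_TESTS are hypothetical names, and the tokens are plain dictionaries with assumed POS/TAG values standing in for spaCy's token.pos_ and token.tag_, so the snippet runs without a model.

```python
# Hypothetical helper: re-check each token of a matched span against the
# same attribute tests the pattern used. With spaCy, the "pos" and "tag"
# values would come from token.pos_ and token.tag_.
ROLE_TESTS = {
    "pronoun": lambda tok: tok["pos"] == "PRON",
    "verb": lambda tok: tok["tag"] == "VBD",
    "adjective": lambda tok: tok["pos"] == "ADJ",
    "noun": lambda tok: tok["tag"] == "NNS",
}


def label_span(tokens):
    """Return (text, role) pairs; role is None if no test accepts the token."""
    labeled = []
    for tok in tokens:
        role = next((name for name, test in ROLE_TESTS.items() if test(tok)), None)
        labeled.append((tok["text"], role))
    return labeled


# Assumed tagger output for the span "They wore boots".
span_b = [
    {"text": "They", "pos": "PRON", "tag": "PRP"},
    {"text": "wore", "pos": "VERB", "tag": "VBD"},
    {"text": "boots", "pos": "NOUN", "tag": "NNS"},
]
print(label_span(span_b))
# [('They', 'pronoun'), ('wore', 'verb'), ('boots', 'noun')]
```

This repeats the pattern's logic by hand, and it picks the first test that passes, so it can mislabel a token that satisfies more than one test. That duplication is exactly what the question hopes to avoid.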

【Comments】:

Tags: python nlp spacy

【Solution 1】:

As of spaCy v3.0.6, match alignment information can be returned as part of the match tuple (see the Matcher API documentation).

matches = matcher(doc, with_alignments=True)

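In your example it produces the following output, where the fourth element lists, for each token in the span, the index of the pattern token it matched: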

(1618900948208871284, 0, 2, [0, 1])         They wore
(1618900948208871284, 0, 3, [0, 1, 2])      They wore high
(1618900948208871284, 0, 4, [0, 1, 2, 3])   They wore high boots

(1618900948208871284, 0, 2, [0, 1])         They wore
(1618900948208871284, 0, 3, [0, 1, 3])      They wore boots
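To turn an alignment list into something closer to regex groups, one option is to bucket the span's tokens by pattern index. The sketch below is a suggestion rather than part of spaCy: group_by_pattern_index and demo are hypothetical names. The helper is pure Python; demo shows how it might be wired to the Matcher and assumes spaCy 3.0.6 or later with the en_core_web_sm model installed.

```python
from collections import defaultdict


def group_by_pattern_index(token_texts, alignments):
    """Map each pattern-token index to the texts of the doc tokens it matched."""
    groups = defaultdict(list)
    for text, pattern_index in zip(token_texts, alignments):
        groups[pattern_index].append(text)
    return dict(groups)


def demo():
    # Assumes spaCy >= 3.0.6 and the en_core_web_sm model are installed.
    import spacy
    from spacy.matcher import Matcher

    nlp = spacy.load("en_core_web_sm")
    matcher = Matcher(nlp.vocab)
    pattern = [
        {"POS": "PRON"},             # pattern index 0
        {"TAG": "VBD"},              # pattern index 1
        {"POS": "ADJ", "OP": "*"},   # pattern index 2
        {"TAG": "NNS", "OP": "*"},   # pattern index 3
    ]
    matcher.add("Boots", [pattern])

    doc = nlp("They wore boots, with their trousers tucked into them.")
    for match_id, start, end, alignments in matcher(doc, with_alignments=True):
        texts = [token.text for token in doc[start:end]]
        print(texts, group_by_pattern_index(texts, alignments))


# Pattern indices taken from the output above for "They wore boots":
print(group_by_pattern_index(["They", "wore", "boots"], [0, 1, 3]))
# {0: ['They'], 1: ['wore'], 3: ['boots']}
```

Calling demo() runs the same grouping on live matches. A pattern index that is absent from the result (index 2 here) means that optional component matched no tokens, and an index with several entries means a quantified component matched more than one token.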

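【Comments】:

Thanks for the tip. It works as a first approximation, and the alignment list is more or less usable. Support for something like regex groups would be a big plus in a future spaCy version!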