Extract verb phrases using Spacy: answers

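【Question title】: Extract verb phrases using Spacy
【Posted】: 2021-03-11 04:27:46
【Question description】:

I have been using Spacy for noun chunk extraction via the Doc.noun_chunks property it provides. How can I use the Spacy library to extract verb phrases of the form 'VERB ? ADV * VERB +' from input text?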

【Question discussion】:

【Solution 1】:

This might help you.

# Note: textacy.Doc and pos_regex_matches come from an older textacy release.
import spacy
import en_core_web_sm
import textacy

nlp = en_core_web_sm.load()
sentence = 'The author is writing a new book.'
# optional verb, any number of adverbs, then one or more verbs
pattern = r'<VERB>?<ADV>*<VERB>+'
doc = textacy.Doc(sentence, lang='en_core_web_sm')
matches = textacy.extract.pos_regex_matches(doc, pattern)
for match in matches:
    print(match.text)

Output:

is writing

For how to highlight the verb phrases, see the link below.

Highlight verb phrases using spacy and html

Another approach:

I recently noticed that Textacy has made some changes to its regex matching. Based on that, I tried it this way.

import spacy
import en_core_web_sm
import textacy

nlp = en_core_web_sm.load()
sentence = 'The cat sat on the mat. The dog jumped into the water. The author is writing a book.'
# token pattern in spaCy Matcher style instead of a regex string
pattern = [{'POS': 'VERB', 'OP': '?'},
           {'POS': 'ADV', 'OP': '*'},
           {'POS': 'VERB', 'OP': '+'}]
doc = textacy.make_spacy_doc(sentence, lang='en_core_web_sm')
matches = textacy.extract.matches(doc, pattern)
for match in matches:
    print(match.text)

Output:

sat
jumped
writing

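I checked the POS matching at this link, and the result does not seem to be what I expected.

https://explosion.ai/demos/matcher

Has anyone tried using POS tags rather than a regexp pattern to find verb phrases?

Edit 2: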

import spacy
from spacy.matcher import Matcher
from spacy.util import filter_spans

nlp = spacy.load('en_core_web_sm')

sentence = 'The cat sat on the mat. He quickly ran to the market. The dog jumped into the water. The author is writing a book.'
pattern = [{'POS': 'VERB', 'OP': '?'},
           {'POS': 'ADV', 'OP': '*'},
           {'POS': 'AUX', 'OP': '*'},
           {'POS': 'VERB', 'OP': '+'}]

# instantiate a Matcher instance
matcher = Matcher(nlp.vocab)
# spaCy v2 signature; in spaCy v3 use matcher.add("Verb phrase", [pattern])
matcher.add("Verb phrase", None, pattern)

doc = nlp(sentence)
# call the matcher to find matches
matches = matcher(doc)
spans = [doc[start:end] for _, start, end in matches]

# keep only the longest non-overlapping spans
print(filter_spans(spans))

Output:

[sat, quickly ran, jumped, is writing]

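Based on help from mdmjsh's answer.

Edit 3: Strange behavior. With the following pattern, the verb phrases in the sentence below are identified correctly at https://explosion.ai/demos/matcher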

pattern = [{'POS': 'VERB', 'OP': '?'},
           {'POS': 'ADV', 'OP': '*'},
           {'POS': 'VERB', 'OP': '+'}]

The very black cat must be really meowing really loud in the yard.

But when run from code, the output is the following.

[must, really meowing]

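【Discussion】:

I changed the pattern to verb_clause_pattern = r'<VERB>*<ADV>*<PART>*<VERB>+<PART>*', and it seems to work a bit better. Verb clauses sometimes contain particles.
Yes, that is better. Here are some more patterns for you, from the textacy documentation.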
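Noun phrase: r'<DET>? (<NOUN>+ <ADP|CONJ>)* <NOUN>+' Compound nouns: r'<NOUN>+' Verb phrase: r'<VERB>?<ADV>*<VERB>+' Prepositional phrase: r'<PREP> <DET>? (<NOUN>+<ADP>)* <NOUN>+'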
Thanks for the patterns, but they are for textacy. How can they be used in spaCy?
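
One possible way to approach the question in the last comment: for flat patterns (a plain sequence of <TAG> items, each optionally followed by ?, * or +), every item maps onto one token dict in a spaCy Matcher pattern. The sketch below assumes that restricted syntax only. pos_regex_to_token_pattern is a hypothetical helper, not part of spaCy or textacy, and it deliberately rejects groups and alternation such as <ADP|CONJ> rather than guessing at them.

```python
import re

# Hypothetical helper (not a spaCy or textacy function): converts a flat
# textacy-style POS regex into a spaCy Matcher token pattern.
# One <TAG> item, optionally followed by a quantifier.
_ITEM_RE = re.compile(r"\s*<([A-Z]+)>\s*([?*+]?)")


def pos_regex_to_token_pattern(pos_regex):
    pos_regex = pos_regex.strip()
    pattern = []
    offset = 0
    while offset < len(pos_regex):
        item = _ITEM_RE.match(pos_regex, offset)
        if item is None:
            # Groups "( )" and alternation "|" are outside this sketch.
            raise ValueError(
                f"unsupported syntax at offset {offset}: {pos_regex[offset:]!r}"
            )
        tag, quantifier = item.groups()
        token = {"POS": tag}
        if quantifier:
            token["OP"] = quantifier
        pattern.append(token)
        offset = item.end()
    return pattern


token_pattern = pos_regex_to_token_pattern(r"<VERB>?<ADV>*<VERB>+")
print(token_pattern)
# [{'POS': 'VERB', 'OP': '?'}, {'POS': 'ADV', 'OP': '*'}, {'POS': 'VERB', 'OP': '+'}]
```

The resulting list has the same shape as the token patterns used elsewhere in this thread, so it can be handed to a Matcher in the same way. The matches may still differ from what the regex-based approach returned, so the output is worth checking on a few sentences.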

【Solution 2】:

The answer above relies on textacy. All of this can be done directly in Spacy with the Matcher, with no wrapper library needed.

import spacy
from spacy.matcher import Matcher

nlp = spacy.load('en_core_web_sm')  # download model first

sentence = 'The author was staring pensively as she wrote'

pattern = [{'POS': 'VERB', 'OP': '?'},
           {'POS': 'ADV', 'OP': '*'},
           {'OP': '*'},  # additional wildcard - match any text in between
           {'POS': 'VERB', 'OP': '+'}]

# instantiate a Matcher instance
matcher = Matcher(nlp.vocab)

# Add pattern to matcher
# spaCy v2 signature; in spaCy v3 use matcher.add("verb-phrases", [pattern])
matcher.add("verb-phrases", None, pattern)
doc = nlp(sentence)
# call the matcher to find matches
matches = matcher(doc)

Note that this returns a list of tuples, each containing the match ID and the start and end token indices of the match, for example:

[(15658055046270554203, 0, 4),
 (15658055046270554203, 1, 4),
 (15658055046270554203, 2, 4),
 (15658055046270554203, 3, 4),
 (15658055046270554203, 0, 8),
 (15658055046270554203, 1, 8),
 (15658055046270554203, 2, 8),
 (15658055046270554203, 3, 8),
 (15658055046270554203, 4, 8),
 (15658055046270554203, 5, 8),
 (15658055046270554203, 6, 8),
 (15658055046270554203, 7, 8)]

You can use the indices to turn these matches into spans.

spans = [doc[start:end] for _, start, end in matches] 

# output
"""
The author was staring
author was staring
was staring
staring
The author was staring pensively as she wrote
author was staring pensively as she wrote
was staring pensively as she wrote
staring pensively as she wrote
pensively as she wrote
as she wrote
she wrote
wrote
"""

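Note that I added an extra {'OP': '*'} to the pattern. A token entry with no specific POS/DEP acts as a wildcard (i.e. it will match any text). This is useful here because the question is about verb phrases: the format VERB, ADV, VERB is an unusual construction (try to think of some example sentences), whereas VERB, ADV, [other text], VERB is quite likely (as in the example sentence 'The author was staring pensively as she wrote'). Alternatively, you can refine the pattern to be more specific (displacy is your friend here).

Note further that, because the matcher is greedy, all permutations of the match are returned. You can optionally use filter_spans to reduce these to the longest form, removing duplicates and overlaps.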


from spacy.util import filter_spans

filter_spans(spans)
# output
[The author was staring pensively as she wrote]
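
To make the filtering step less of a black box, here is a rough pure-Python sketch of the idea: prefer longer spans, then drop anything that overlaps a span already kept. It works on (start, end) token offsets copied from the matcher output above. keep_longest_non_overlapping is a hypothetical name, and this is an approximation of the behaviour described in this answer, not spaCy's own code.

```python
# (start, end) token offsets copied from the matcher output shown earlier.
offsets = [
    (0, 4), (1, 4), (2, 4), (3, 4),
    (0, 8), (1, 8), (2, 8), (3, 8), (4, 8), (5, 8), (6, 8), (7, 8),
]


def keep_longest_non_overlapping(offsets):
    # Longest spans first; among equal lengths, the earlier start wins.
    ordered = sorted(offsets, key=lambda se: (se[1] - se[0], -se[0]), reverse=True)
    kept = []
    seen_tokens = set()
    for start, end in ordered:
        # Keep the span only if none of its tokens is already covered.
        if seen_tokens.isdisjoint(range(start, end)):
            kept.append((start, end))
            seen_tokens.update(range(start, end))
    # Return the survivors in document order.
    return sorted(kept)


print(keep_longest_non_overlapping(offsets))
# [(0, 8)]
```

The single surviving offset pair (0, 8) covers all eight tokens of the sentence, which agrees with the one span shown in the filter_spans output above.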

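【Discussion】:

Thanks for taking the time to post. A key line seems to be missing: `pattern = [{'POS': 'VERB', 'OP': '?'}, {'POS': 'ADV', 'OP': '*'}, {'OP': '*'}, {'POS': 'VERB', 'OP': '+'}]; matcher.add("Verb phrase", None, pattern)`
Thanks @mikey, not sure how I forgot to include that step, but good catch! I have edited the answer.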
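Got it working in Spacy v3.0 with matcher.add("Verb phrase", [pattern])
So I understand that POS means "part of speech", ADV means "adverb", VERB means "verb", and * is a wildcard, and I think OP means "object of preposition", but could someone quickly clarify what the ? and + signs do? I'd really appreciate it! Thanks!
OP actually stands for operator: it is a way to control how specifically a pattern has to match. For example, you may want to define a pattern with some optional parts, in which case you can use "OP": "*". See: spacy.io/usage/rule-based-matching#quantifiers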
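
As a small illustration of the three operators asked about above, the sketch below uses Python's re module on a string of POS tags, where ?, * and + carry the same meaning: zero or one, zero or more, and one or more. It does not call spaCy at all. The tag sequences are written by hand for illustration, and is_verb_phrase is a hypothetical helper.

```python
import re

# Same shape as the pattern in the question:
# optional VERB, any number of ADV, then at least one VERB.
verb_phrase = re.compile(r"(<VERB>)?(<ADV>)*(<VERB>)+")


def is_verb_phrase(tags):
    # Join hand-written POS tags into one string, e.g. "<VERB><ADV><VERB>",
    # and require the whole sequence to fit the pattern.
    tag_string = "".join(f"<{tag}>" for tag in tags)
    return verb_phrase.fullmatch(tag_string) is not None


print(is_verb_phrase(["VERB"]))                        # True: '?' and '*' may match nothing
print(is_verb_phrase(["VERB", "VERB"]))                # True: optional first verb is present
print(is_verb_phrase(["VERB", "ADV", "ADV", "VERB"]))  # True: '*' absorbs both adverbs
print(is_verb_phrase(["ADV"]))                         # False: '+' needs at least one VERB
print(is_verb_phrase(["VERB", "ADV"]))                 # False: the sequence must end in a VERB
```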