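【Posted】: 2021-03-11 04:27:46
【Question】:
I have been using spaCy for noun chunk extraction via the Doc.noun_chunks property it provides. How can I extract verb phrases of the form 'VERB ? ADV * VERB +' from input text using the spaCy library?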
【Discussion】:
This might help you.
from __future__ import unicode_literals  # only needed on Python 2
import spacy, en_core_web_sm
import textacy
nlp = en_core_web_sm.load()
sentence = 'The author is writing a new book.'
pattern = r'<VERB>?<ADV>*<VERB>+'
# textacy.Doc and pos_regex_matches belong to older textacy releases
doc = textacy.Doc(sentence, lang='en_core_web_sm')
matches = textacy.extract.pos_regex_matches(doc, pattern)
for match in matches:
    print(match.text)
Output:
is writing
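To make the idea behind a POS regex such as '<VERB>?<ADV>*<VERB>+' concrete, here is a minimal pure-Python sketch that runs the same pattern over a list of tags with the standard re module. It is not textacy's implementation; the function name pos_regex_spans is hypothetical, and the tags are assigned by hand for illustration rather than produced by a model.

```python
import re

def pos_regex_spans(tags, pattern):
    """Return (start, end) token index pairs whose tags match a '<TAG>'-style pattern."""
    # Wrap every <TAG> in a non-capturing group so ?, * and + apply to the whole tag.
    regex = re.compile(re.sub(r"(<[A-Z]+>)", r"(?:\1)", pattern))
    encoded = "".join(f"<{tag}>" for tag in tags)
    # Record the character offset at which each token's tag starts.
    offset_to_index, offset = {}, 0
    for index, tag in enumerate(tags):
        offset_to_index[offset] = index
        offset += len(tag) + 2  # +2 for the angle brackets
    offset_to_index[offset] = len(tags)
    return [(offset_to_index[m.start()], offset_to_index[m.end()])
            for m in regex.finditer(encoded) if m.end() > m.start()]

# Hand-assigned tags (an assumption): "is" is tagged VERB here, as the output above implies.
tokens = ["The", "author", "is", "writing", "a", "new", "book", "."]
tags = ["DET", "NOUN", "VERB", "VERB", "DET", "ADJ", "NOUN", "PUNCT"]

phrases = [" ".join(tokens[s:e]) for s, e in pos_regex_spans(tags, r"<VERB>?<ADV>*<VERB>+")]
print(phrases)  # ['is writing']
```

Because re.finditer yields non-overlapping, leftmost matches, each verb group is reported once.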
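To highlight the verb phrases, see the link below.
Highlight verb phrases using spacy and html
Another approach:
I recently noticed that textacy has made some changes to its regex matching. Based on that, I tried it this way.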
from __future__ import unicode_literals
import spacy,en_core_web_sm
import textacy
nlp = en_core_web_sm.load()
sentence = 'The cat sat on the mat. He dog jumped into the water. The author is writing a book.'
pattern = [{'POS': 'VERB', 'OP': '?'},
           {'POS': 'ADV', 'OP': '*'},
           {'POS': 'VERB', 'OP': '+'}]
doc = textacy.make_spacy_doc(sentence, lang='en_core_web_sm')
matches = textacy.extract.matches(doc, pattern)
for match in matches:
    print(match.text)
Output:
sat
jumped
writing
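I checked the POS matching at this link, and the results do not seem to be what I expected.
https://explosion.ai/demos/matcher
Has anyone tried to find verb phrases using POS-tag patterns instead of regexp patterns?
Edit 2: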
import spacy
from spacy.matcher import Matcher
from spacy.util import filter_spans
nlp = spacy.load('en_core_web_sm')
sentence = 'The cat sat on the mat. He quickly ran to the market. The dog jumped into the water. The author is writing a book.'
pattern = [{'POS': 'VERB', 'OP': '?'},
           {'POS': 'ADV', 'OP': '*'},
           {'POS': 'AUX', 'OP': '*'},
           {'POS': 'VERB', 'OP': '+'}]
# instantiate a Matcher instance
matcher = Matcher(nlp.vocab)
# spaCy v3 expects a list of patterns; older v2 code passed None as the second argument
matcher.add("Verb phrase", [pattern])
doc = nlp(sentence)
# call the matcher to find matches
matches = matcher(doc)
spans = [doc[start:end] for _, start, end in matches]
# filter_spans keeps the longest non-overlapping spans
print(filter_spans(spans))
Output:
[sat, quickly ran, jumped, is writing]
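A plausible reason the AUX element matters is that the model tags "is" as AUX rather than VERB, so a VERB-only pattern skips it. The sketch below illustrates this with hand-assigned tags (an assumption, not model output) and a small pure-Python translation of Matcher-style patterns into a regex. The helper names pattern_to_regex and find_phrases are hypothetical and this is not how spaCy's Matcher is implemented.

```python
import re

# Regex quantifier for each "OP" value used in this thread (no OP means exactly one).
OP_TO_REGEX = {None: "", "?": "?", "*": "*", "+": "+"}

def pattern_to_regex(pattern):
    """Translate a list of {'POS': ..., 'OP': ...} dicts into a regex over '<TAG>' strings."""
    parts = []
    for spec in pattern:
        tag = spec.get("POS")
        # A dict without 'POS' acts as a wildcard that matches any single tag.
        atom = f"<{tag}>" if tag else "<[A-Z]+>"
        parts.append(f"(?:{atom}){OP_TO_REGEX[spec.get('OP')]}")
    return re.compile("".join(parts))

def find_phrases(tokens, tags, pattern):
    """Return the token phrases whose tag sequence matches the pattern."""
    encoded = "".join(f"<{tag}>" for tag in tags)
    # Map the character offset of each tag in the encoded string to its token index.
    boundaries, offset = {}, 0
    for index, tag in enumerate(tags):
        boundaries[offset] = index
        offset += len(tag) + 2  # +2 for the angle brackets
    boundaries[offset] = len(tags)
    return [" ".join(tokens[boundaries[m.start()]:boundaries[m.end()]])
            for m in pattern_to_regex(pattern).finditer(encoded) if m.end() > m.start()]

tokens = ["He", "quickly", "ran", "to", "the", "market", ".",
          "The", "author", "is", "writing", "a", "book", "."]
# Hand-assigned tags for illustration; "is" is assumed to be AUX.
tags = ["PRON", "ADV", "VERB", "ADP", "DET", "NOUN", "PUNCT",
        "DET", "NOUN", "AUX", "VERB", "DET", "NOUN", "PUNCT"]

verb_only = [{"POS": "VERB", "OP": "?"}, {"POS": "ADV", "OP": "*"}, {"POS": "VERB", "OP": "+"}]
with_aux = [{"POS": "VERB", "OP": "?"}, {"POS": "ADV", "OP": "*"},
            {"POS": "AUX", "OP": "*"}, {"POS": "VERB", "OP": "+"}]

print(find_phrases(tokens, tags, verb_only))  # ['quickly ran', 'writing']
print(find_phrases(tokens, tags, with_aux))   # ['quickly ran', 'is writing']
```

Unlike the real Matcher, re.finditer reports only non-overlapping leftmost matches, so no filter_spans step is needed in this simplified version.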
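Based on help from mdmjsh's answer.
Edit 3: Strange behavior. For the following sentence and the following pattern, the verb phrase is identified correctly at https://explosion.ai/demos/matcher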
pattern = [{'POS': 'VERB', 'OP': '?'},
{'POS': 'ADV', 'OP': '*'},
{'POS': 'VERB', 'OP': '+'}]
The very black cat must be really meowing really loud in the yard.
But running it from code outputs the following.
[must, really meowing]
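【Comments】:
verb_clause_pattern = r'<VERB>*<ADV>*<PART>*<VERB>+<PART>*' seems to work a bit better. Verb clauses sometimes contain particles.
The answer above relies on textacy, but all of this can be done directly in spaCy with the Matcher; no wrapper library is needed.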
import spacy
from spacy.matcher import Matcher
nlp = spacy.load('en_core_web_sm') # download model first
sentence = 'The author was staring pensively as she wrote'
pattern = [{'POS': 'VERB', 'OP': '?'},
           {'POS': 'ADV', 'OP': '*'},
           {'OP': '*'},  # additional wildcard - match any text in between
           {'POS': 'VERB', 'OP': '+'}]
# instantiate a Matcher instance
matcher = Matcher(nlp.vocab)
# Add pattern to matcher (spaCy v3 expects a list of patterns;
# older v2 code passed None as the second argument)
matcher.add("verb-phrases", [pattern])
doc = nlp(sentence)
# call the matcher to find matches
matches = matcher(doc)
Note that this returns a list of tuples, each containing the match ID and the start and end token indices of the match, e.g.:
[(15658055046270554203, 0, 4),
(15658055046270554203, 1, 4),
(15658055046270554203, 2, 4),
(15658055046270554203, 3, 4),
(15658055046270554203, 0, 8),
(15658055046270554203, 1, 8),
(15658055046270554203, 2, 8),
(15658055046270554203, 3, 8),
(15658055046270554203, 4, 8),
(15658055046270554203, 5, 8),
(15658055046270554203, 6, 8),
(15658055046270554203, 7, 8)]
You can turn these matches into spans using the indices.
spans = [doc[start:end] for _, start, end in matches]
# output
"""
The author was staring
author was staring
was staring
staring
The author was staring pensively as she wrote
author was staring pensively as she wrote
was staring pensively as she wrote
staring pensively as she wrote
pensively as she wrote
as she wrote
she wrote
wrote
"""
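Note that I added an extra {'OP': '*'} to the pattern, which acts as a wildcard when it is not given a specific POS/DEP (i.e. it matches any text). This is useful here because the question is about verb phrases: the form VERB, ADV, VERB is an unusual construction (try to think of some example sentences), whereas VERB, ADV, [other text], VERB is quite likely (as in the example sentence 'The author was staring pensively as she wrote'). Alternatively, you can refine the pattern to be more specific (displacy is your friend here).
Note further that, owing to the greediness of the matcher, every permutation of the match is returned. You can optionally use filter_spans to reduce these to the longest form, which removes duplicates and overlaps.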
from spacy.util import filter_spans
filter_spans(spans)
# output
[The author was staring pensively as she wrote]
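For intuition, the selection rule can be sketched in plain Python over (start, end) index pairs: prefer the longest span, break ties in favor of the earlier one, and drop anything that overlaps a span already kept. This is a simplified illustration of that behavior, not spaCy's source code, and filter_longest is a hypothetical name.

```python
def filter_longest(spans):
    """Keep the longest non-overlapping (start, end) spans; end is exclusive."""
    # Longest first; among equal lengths, the span that starts earlier wins.
    ordered = sorted(spans, key=lambda s: (s[1] - s[0], -s[0]), reverse=True)
    kept, covered = [], set()
    for start, end in ordered:
        # Keep a span only if none of its tokens is already covered.
        if not any(i in covered for i in range(start, end)):
            kept.append((start, end))
            covered.update(range(start, end))
    return sorted(kept)

# The twelve index pairs from the match list above.
candidates = [(s, 4) for s in range(4)] + [(s, 8) for s in range(8)]
print(filter_longest(candidates))  # [(0, 8)]
```

With several separate verb groups in one text, the same rule keeps one span per group, for example filter_longest([(2, 3), (1, 3), (6, 7)]) gives [(1, 3), (6, 7)].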
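【Comments】:
That is standard for the operator: it is a way to control how specific the pattern to match is. For example, you may want to define a pattern with some optional parts, in which case you can use "OP": "*". See: spacy.io/usage/rule-based-matching#quantifiers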