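【Posted】: 2021-08-17 06:58:37
【Problem description】:
In a large text corpus, I want to extract every sentence that contains, anywhere in the sentence, one of a specific list of (verb, noun) or (adjective, noun) pairs. My list is long, but here is a sample. In my MWE, I am trying to extract sentences that contain "write/wrote/writing/writes" and "book/s". I have about 30 such word pairs.
Here is what I tried, but it misses most of the sentences: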
import spacy
from spacy.matcher import Matcher

nlp = spacy.load('en_core_web_sm')
doc = nlp(u'Graham Greene is his favorite author. He wrote his first book when he was a hundred and fifty years old. \
While writing this book, he had to fend off aliens and dinosaurs. Greene\'s second book might not have been written by him. \
Greene\'s cat in its deathbed testimony alleged that it was the original writer of the book. The fact that plot of the book revolves around \
rats conquering the world, lends credence to the idea that only a cat could have been the true writer of such an inane book.')

matcher = Matcher(nlp.vocab)
# "write" lemma, then exactly one token with any text, then "book" lemma
pattern1 = [{"LEMMA": "write"}, {"TEXT": {"REGEX": ".+"}}, {"LEMMA": "book"}]
# spaCy v2 signature; in spaCy v3 this call is matcher.add("testy", [pattern1])
matcher.add("testy", None, pattern1)

for sent in doc.sents:
    # Re-parse the sentence's lemma string and test it against the pattern
    if matcher(nlp(sent.lemma_)):
        print(sent.text)
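Unfortunately, I get only one match:
"While writing this book, he had to fend off aliens and dinosaurs."
However, I also expect to get the "He wrote his first book" sentence. The other sentences use "writer" as a noun, so it is fine that they do not match.
【Discussion】:
Tags: python regex spacy match-phrase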
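A likely cause, judging only from the pattern in the question: it has a single token slot between "write" and "book", so "writing this book" fits (one token in between) while "wrote his first book" does not (two tokens in between). The library-free sketch below mimics that logic on hand-lemmatized token lists. The lemma lists and the helper names `has_pair_exact_gap` and `has_pair_any_gap` are assumptions made up for illustration, standing in for spaCy's `token.lemma_` and Matcher. In spaCy itself, a gap of any length is usually written as a wildcard token with an operator, `{"OP": "*"}`, in place of the one-token regex slot.

```python
from typing import Dict, List, Sequence


def has_pair_exact_gap(lemmas: Sequence[str], first: str, second: str, gap: int) -> bool:
    """True if `second` occurs exactly `gap` tokens after `first`.

    gap=1 mimics the pattern in the question: one arbitrary token in between.
    """
    return any(
        lemma == first
        and i + gap + 1 < len(lemmas)
        and lemmas[i + gap + 1] == second
        for i, lemma in enumerate(lemmas)
    )


def has_pair_any_gap(lemmas: Sequence[str], first: str, second: str) -> bool:
    """True if `second` occurs anywhere after `first` in the same sentence."""
    return any(
        lemma == first and second in lemmas[i + 1:]
        for i, lemma in enumerate(lemmas)
    )


# Hand-written lemmas (an assumption, not real spaCy output).
SENTENCES: Dict[str, List[str]] = {
    "He wrote his first book": ["he", "write", "his", "first", "book"],
    "While writing this book": ["while", "write", "this", "book"],
    "the true writer of such an inane book": [
        "the", "true", "writer", "of", "such", "an", "inane", "book",
    ],
}

# Sentences found with a fixed one-token gap versus a gap of any length.
exact = [s for s, lem in SENTENCES.items() if has_pair_exact_gap(lem, "write", "book", gap=1)]
loose = [s for s, lem in SENTENCES.items() if has_pair_any_gap(lem, "write", "book")]

print(exact)
print(loose)
```

With about 30 pairs, the same check can be looped over a list of (first, second) lemma tuples. Note that the any-gap version only checks word order within a sentence, not that "book" is actually the object of "write"; a dependency-based match would be stricter.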