【Posted】: 2020-05-06 11:00:29
【Question】:
I am trying to extract noun phrases from sentences using Stanza (with Stanford CoreNLP). This can only be done through the CoreNLPClient module in Stanza.
# Import client module
from stanza.server import CoreNLPClient
# Construct a CoreNLPClient with some basic annotators, a memory allocation of 4GB, and port number 9001
client = CoreNLPClient(annotators=['tokenize','ssplit','pos','lemma','ner', 'parse'], memory='4G', endpoint='http://localhost:9001')
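Here is an example. I call the client's tregex function to get all the noun phrases in the text. In Python, tregex returns a dict of dicts, so I have to process its output before passing it to NLTK's Tree.fromstring function in order to extract the noun phrases as strings.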
pattern = 'NP'
text = "Albert Einstein was a German-born theoretical physicist. He developed the theory of relativity."
matches = client.tregex(text, pattern)
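To make the "dict of dicts" shape concrete, here is a minimal sketch that needs no running CoreNLP server. The `sample_matches` value is a hand-written, hypothetical stand-in for the result of the call above: only the `'sentences'` list and the `'match'` key come from the code in this question, while the index keys (`'0'`, `'1'`) and the exact tree strings are assumptions. The real response may carry additional fields.

```python
# Hypothetical stand-in for the value returned by the tregex call above.
# Only 'sentences' and 'match' are taken from the question; the index keys
# and the bracketed tree strings are illustrative assumptions.
sample_matches = {
    "sentences": [
        {
            "0": {"match": "(NP (NNP Albert) (NNP Einstein))\n"},
            "1": {"match": "(NP (DT a) (JJ German-born) (JJ theoretical) (NN physicist))\n"},
        },
        {
            "0": {"match": "(NP (PRP He))\n"},
        },
    ]
}

# Flatten the nested structure (one dict per sentence, one record per match)
# into a plain list of bracketed tree strings.
tree_strings = [
    record["match"].strip()
    for sentence in sample_matches["sentences"]
    for record in sentence.values()
]

print(tree_strings)
```

Each string in `tree_strings` is a bracketed parse fragment, which is the form that Tree.fromstring expects.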
So I came up with the stanza_phrases method, which walks the dict of dicts returned by tregex and formats each match so that Tree.fromstring in NLTK can parse it.
def stanza_phrases(matches):
    Nps = []
    for match in matches:
        for items in matches['sentences']:
            for keys,values in items.items():
                s = '(ROOT\n'+ values['match']+')'
                Nps.extend(extract_phrase(s, pattern))
    return set(Nps)
This helper builds a tree with NLTK and collects the phrases:
from nltk.tree import Tree
def extract_phrase(tree_str, label):
    phrases = []
    trees = Tree.fromstring(tree_str)
    for tree in trees:
        for subtree in tree.subtrees():
            if subtree.label() == label:
                t = subtree
                t = ' '.join(t.leaves())
                phrases.append(t)
    return phrases
Here is my output:
{'Albert Einstein', 'He', 'a German-born theoretical physicist', 'relativity', 'the theory', 'the theory of relativity'}
Is there a way to make this code more efficient and shorter (especially the stanza_phrases and extract_phrase methods)?
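One possible direction, offered as a sketch rather than a definitive answer: the two helpers can be folded into a single set comprehension. This relies on Tree.subtrees accepting a filter function, and it passes the label as a parameter instead of reading the global `pattern`. The name `noun_phrases` and the `sample_matches` data are hypothetical. The sketch assumes each `'match'` value is a complete bracketed tree, so the `(ROOT ...)` wrapper is not needed, and that the Tregex pattern is a bare node label such as `NP`.

```python
from nltk.tree import Tree

def noun_phrases(matches, label="NP"):
    """Return the text of every subtree labelled `label` in a tregex-style result."""
    return {
        " ".join(subtree.leaves())
        for sentence in matches["sentences"]      # one dict per sentence
        for record in sentence.values()           # one record per match
        for subtree in Tree.fromstring(record["match"]).subtrees(
            lambda t: t.label() == label          # keep only nodes with this label
        )
    }

# Hypothetical stand-in for the tregex result, so the sketch runs without a server.
sample_matches = {
    "sentences": [
        {"0": {"match": "(NP (NP (DT the) (NN theory)) (PP (IN of) (NP (NN relativity))))\n"}}
    ]
}

print(noun_phrases(sample_matches))
```

With a live client, the call would presumably be `noun_phrases(matches)` on the tregex result from the question. Nested NPs are still reported separately, as in the output shown above.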
【Discussion】:
Tags: python nlp stanford-nlp stanford-stanza