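【Posted】: 2019-10-02 09:58:42
【Question】:
I have an NLTK parse tree, and I want to split the tree's leaves based solely on the "S" labels. Note that the S clauses must not share any leaves.
Given the sentence "He won the Gusher Marathon, finishing in 30 minutes."
the tree produced by CoreNLP is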
tree = '''(S
  (NP (PRP He))
  (VP
    (VBD won)
    (NP (DT the) (NNP Gusher) (NNP Marathon))
    (, ,)
    (S (VP (VBG finishing) (PP (IN in) (NP (CD 30) (NNS minutes))))))
  (. .))'''
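To make the nesting explicit, here is a minimal sketch that loads the bracketed string with `Tree.fromstring` and lists the tree position of every node labelled "S". The names `parse` and `s_positions` are illustrative only and do not come from the original post.

```python
from nltk.tree import Tree

# The CoreNLP parse from the question, as a bracketed string.
parse = Tree.fromstring("""(S
  (NP (PRP He))
  (VP
    (VBD won)
    (NP (DT the) (NNP Gusher) (NNP Marathon))
    (, ,)
    (S (VP (VBG finishing) (PP (IN in) (NP (CD 30) (NNS minutes))))))
  (. .))""")

# Collect the position of every node labelled "S".
# Leaves are plain strings, so the isinstance check filters them out.
s_positions = [
    pos for pos in parse.treepositions()
    if isinstance(parse[pos], Tree) and parse[pos].label() == "S"
]
print(s_positions)  # [(), (1, 3)]
```

The empty tuple is the root S, and `(1, 3)` is the inner S sitting under the VP, so the second clause is nested inside the first rather than being its sibling.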
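The idea is to extract the two "S" subtrees and their leaves without any overlap between them. So the expected output should be "He won the Gusher Marathon ,." and "finishing in 30 minutes."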
from nltk import Tree

# Tree manipulation
# Extract phrases from a parsed (chunked) tree
# phrase = label of the sub-trees to extract
# Returns: list of deep copies; recursive
def ExtractPhrases(myTree, phrase):
    myPhrases = []
    if myTree.label() == phrase:
        myPhrases.append(myTree.copy(True))
    for child in myTree:
        # Recurse into sub-trees only; leaves are plain strings
        if isinstance(child, Tree):
            myPhrases.extend(ExtractPhrases(child, phrase))
    return myPhrases

subtexts = set()
sep_tree = ExtractPhrases(Tree.fromstring(tree), 'S')
for sep in sep_tree:
    for subtree in sep.subtrees():
        if subtree.label() == "S":
            print(subtree)
            subtexts.add(' '.join(subtree.leaves()))
            #break
subtexts = list(subtexts)
print(subtexts)
The output I get is
['He won the Gusher Marathon , finishing in 30 minutes .', 'finishing in 30 minutes']
I don't want to manipulate this at the string level but at the tree level, so the expected output is:
["He won the Gusher Marathon ,.", "finishing in 30 minutes."]
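One possible tree-level approach, offered only as a sketch: the overlap arises because `leaves()` returns every leaf below a node, including those of a nested S. The hypothetical helpers below (`clause_leaves` and `split_clauses`, neither of which is part of NLTK) walk each S node and stop descending whenever they meet another S, so each leaf is reported under its nearest enclosing S only.

```python
from nltk.tree import Tree

def clause_leaves(node, label="S"):
    """Leaves under `node`, skipping any nested subtree labelled `label`."""
    leaves = []
    for child in node:
        if isinstance(child, Tree):
            if child.label() == label:
                continue  # nested clause: it is collected on its own
            leaves.extend(clause_leaves(child, label))
        else:
            leaves.append(child)  # a leaf is a plain string
    return leaves

def split_clauses(tree, label="S"):
    """One string per `label` node, in pre-order, with no shared leaves."""
    return [
        " ".join(clause_leaves(sub, label))
        for sub in tree.subtrees(lambda t: t.label() == label)
    ]

parse = Tree.fromstring("""(S
  (NP (PRP He))
  (VP
    (VBD won)
    (NP (DT the) (NNP Gusher) (NNP Marathon))
    (, ,)
    (S (VP (VBG finishing) (PP (IN in) (NP (CD 30) (NNS minutes))))))
  (. .))""")

print(split_clauses(parse))
# ['He won the Gusher Marathon , .', 'finishing in 30 minutes']
```

Note that in this parse the final period is a child of the outer S, so it stays with the first clause; the inner clause has no period of its own unless one is attached in a separate step.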
【Discussion】:
Tags: python tree nltk stanford-nlp parse-tree