基于标签分离 NLTK 子树答案

【问题标题】：Separate NLTK subtree based on label基于标签分离 NLTK 子树
【发布时间】：2019-10-02 09:58:42
【问题描述】：

我有一个 NLTK Parse 树，我想仅基于“S”标签来分离 Tree 的叶子。请注意，S 不应与叶子重叠。

鉴于“他赢得了 Gusher Maraton，在 30 分钟内完成。”

来自corenlp的树形是

tree = '(S
  (NP (PRP He))
  (VP
    (VBD won)
    (NP (DT the) (NNP Gusher) (NNP Marathon))
    (, ,)
    (S (VP (VBG finishing) (PP (IN in) (NP (CD 30) (NNS minutes))))))
  (. .))'

想法是提取2个“S”和它们的叶子，但不相互重叠。所以预期的输出应该是“他赢得了 Gusher Marathon，”。和“在 30 分钟内完成。”

# Tree manipulation

# Extract phrases from a parsed (chunked) tree
# Phrase = tag for the string phrase (sub-tree) to extract
# Returns: List of deep copies;  Recursive
def ExtractPhrases( myTree, phrase):
    myPhrases = []
    if (myTree.label() == phrase):
        myPhrases.append( myTree.copy(True) )
    for child in myTree:
        if (type(child) is Tree):
            list_of_phrases = ExtractPhrases(child, phrase)
            if (len(list_of_phrases) > 0):
                myPhrases.extend(list_of_phrases)
    return myPhrases

subtexts = set()
sep_tree = ExtractPhrases( Tree.fromstring(tree), 'S')
for sep in sep_tree:
    for subtree in sep.subtrees():
        if subtree.label()=="S":
            print(subtree)
            subtexts.add(' '.join(subtree.leaves()))
            #break

subtexts = list(subtexts)
print(subtexts)

我得到了输出

['He won the Gusher Marathon , finishing in 30 minutes .', 'finishing in 30 minutes']

我不想在字符串级别操作它，而是在树级别操作，所以预期的输出是-

["He won the Gusher Marathon ,.",  "finishing in 30 minutes."]

【问题讨论】：

标签： python tree nltk stanford-nlp parse-tree

【解决方案1】：

这是我的示例输入：

a = 

'''

FREEDOM FROM RELIGION FOUNDATION

Darwin fish bumper stickers and assorted other atheist paraphernalia are
available from the Freedom From Religion Foundation in the US.

EVOLUTION DESIGNS

Evolution Designs sell the "Darwin fish".  It's a fish symbol, like the ones
Christians stick on their cars, but with feet and the word "Darwin" written
inside.  The deluxe moulded 3D plastic fish is $4.95 postpaid in the US.

'''


    sentences = nltk.sent_tokenize(a)
    sentences = [nltk.word_tokenize(sent) for sent in sentences]
    tagged_sentences = nltk.pos_tag_sents(sentences)
    chunked_sentences = list(nltk.ne_chunk_sents(tagged_sentences))

    for sent in chunked_sentences:
    for subtree in sent.subtrees(filter=lambda t: t.label()=='S'):
        print(subtree)

这是我的输出：

(S
  (ORGANIZATION FREEDOM/NN)
  (ORGANIZATION FROM/NNP)
  RELIGION/NNP
  FOUNDATION/NNP
  Darwin/NNP
  fish/JJ
  bumper/NN
  stickers/NNS
  and/CC
  assorted/VBD
  other/JJ
  atheist/JJ
  paraphernalia/NNS
  are/VBP
  available/JJ
  from/IN
  the/DT
  (ORGANIZATION Freedom/NN From/NNP Religion/NNP Foundation/NNP)
  in/IN
  the/DT
  (GSP US/NNP)
  ./.)

(S
  (ORGANIZATION EVOLUTION/NNP)
  (ORGANIZATION DESIGNS/NNP Evolution/NNP)
  Designs/NNP
  sell/VB
  the/DT
  ``/``
  (PERSON Darwin/NNP)
  fish/NN
  ''/''
  ./.)

(S
  It/PRP
  's/VBZ
  a/DT
  fish/JJ
  symbol/NN
  ,/,
  like/IN
  the/DT
  ones/NNS
  Christians/NNPS
  stick/VBP
  on/IN
  their/PRP$
  cars/NNS
  ,/,
  but/CC
  with/IN
  feet/NNS
  and/CC
  the/DT
  word/NN
  ``/``
  (PERSON Darwin/NNP)
  ''/''
  written/VBN
  inside/RB
  ./.)

(S
  The/DT
  deluxe/NN
  moulded/VBD
  3D/CD
  plastic/JJ
  fish/NN
  is/VBZ
  $/$
  4.95/CD
  postpaid/NN
  in/IN
  the/DT
  (GSP US/NNP)
  ./.)

【讨论】：