【Question title】: Separate NLTK subtree based on label
【Posted】: 2019-10-02 09:58:42
【Question description】:

I have an NLTK parse tree, and I want to separate the tree's leaves based only on the "S" label. Note that the "S" segments should not share any leaves.

Given the sentence "He won the Gusher Marathon, finishing in 30 minutes."

The tree from CoreNLP is

tree = '''(S
  (NP (PRP He))
  (VP
    (VBD won)
    (NP (DT the) (NNP Gusher) (NNP Marathon))
    (, ,)
    (S (VP (VBG finishing) (PP (IN in) (NP (CD 30) (NNS minutes))))))
  (. .))'''

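The idea is to extract the two "S" subtrees and their leaves, but without overlap between them. So the expected output should be "He won the Gusher Marathon ,." and "finishing in 30 minutes."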

# Tree manipulation
from nltk import Tree

# Extract phrases from a parsed (chunked) tree.
# phrase = label of the sub-trees to extract
# Returns: list of deep copies of the matching sub-trees (recursive)
def ExtractPhrases( myTree, phrase):
    myPhrases = []
    if (myTree.label() == phrase):
        myPhrases.append( myTree.copy(True) )
    for child in myTree:
        if (type(child) is Tree):
            list_of_phrases = ExtractPhrases(child, phrase)
            if (len(list_of_phrases) > 0):
                myPhrases.extend(list_of_phrases)
    return myPhrases
subtexts = set()
sep_tree = ExtractPhrases( Tree.fromstring(tree), 'S')
for sep in sep_tree:
    for subtree in sep.subtrees():
        if subtree.label()=="S":
            print(subtree)
            subtexts.add(' '.join(subtree.leaves()))
            #break

subtexts = list(subtexts)
print(subtexts)

I get the output

['He won the Gusher Marathon , finishing in 30 minutes .', 'finishing in 30 minutes']

I don't want to manipulate this at the string level but at the tree level, so the expected output is:

["He won the Gusher Marathon ,.",  "finishing in 30 minutes."]

【Question discussion】:

    Tags: python tree nltk stanford-nlp parse-tree


    【Solution 1】:

    Here is my sample input:

    a = '''
    
    FREEDOM FROM RELIGION FOUNDATION
    
    Darwin fish bumper stickers and assorted other atheist paraphernalia are
    available from the Freedom From Religion Foundation in the US.
    
    EVOLUTION DESIGNS
    
    Evolution Designs sell the "Darwin fish".  It's a fish symbol, like the ones
    Christians stick on their cars, but with feet and the word "Darwin" written
    inside.  The deluxe moulded 3D plastic fish is $4.95 postpaid in the US.
    
    '''
    
    
    import nltk

    # Tokenize, POS-tag, and named-entity chunk each sentence
    sentences = nltk.sent_tokenize(a)
    sentences = [nltk.word_tokenize(sent) for sent in sentences]
    tagged_sentences = nltk.pos_tag_sents(sentences)
    chunked_sentences = list(nltk.ne_chunk_sents(tagged_sentences))

    # Print every subtree labeled 'S' in each chunked sentence
    for sent in chunked_sentences:
        for subtree in sent.subtrees(filter=lambda t: t.label() == 'S'):
            print(subtree)
    

    Here is my output:

    (S
      (ORGANIZATION FREEDOM/NN)
      (ORGANIZATION FROM/NNP)
      RELIGION/NNP
      FOUNDATION/NNP
      Darwin/NNP
      fish/JJ
      bumper/NN
      stickers/NNS
      and/CC
      assorted/VBD
      other/JJ
      atheist/JJ
      paraphernalia/NNS
      are/VBP
      available/JJ
      from/IN
      the/DT
      (ORGANIZATION Freedom/NN From/NNP Religion/NNP Foundation/NNP)
      in/IN
      the/DT
      (GSP US/NNP)
      ./.)
    
    (S
      (ORGANIZATION EVOLUTION/NNP)
      (ORGANIZATION DESIGNS/NNP Evolution/NNP)
      Designs/NNP
      sell/VB
      the/DT
      ``/``
      (PERSON Darwin/NNP)
      fish/NN
      ''/''
      ./.)
    
    (S
      It/PRP
      's/VBZ
      a/DT
      fish/JJ
      symbol/NN
      ,/,
      like/IN
      the/DT
      ones/NNS
      Christians/NNPS
      stick/VBP
      on/IN
      their/PRP$
      cars/NNS
      ,/,
      but/CC
      with/IN
      feet/NNS
      and/CC
      the/DT
      word/NN
      ``/``
      (PERSON Darwin/NNP)
      ''/''
      written/VBN
      inside/RB
      ./.)
    
    (S
      The/DT
      deluxe/NN
      moulded/VBD
      3D/CD
      plastic/JJ
      fish/NN
      is/VBZ
      $/$
      4.95/CD
      postpaid/NN
      in/IN
      the/DT
      (GSP US/NNP)
      ./.)
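
For the CoreNLP tree in the question, one possible tree-level approach is sketched below. For each subtree labeled "S", it collects the leaves while skipping any nested "S" subtree, so no leaf is reported twice. The helper names `leaves_outside_nested` and `split_by_label` are hypothetical, introduced only for this illustration; the sketch assumes the tree is an `nltk.Tree` built with `Tree.fromstring`.

```python
from nltk import Tree


def leaves_outside_nested(tree, label):
    """Collect the leaves of `tree`, skipping nested subtrees labeled `label`."""
    leaves = []
    for child in tree:
        if isinstance(child, Tree):
            # Do not descend into a nested subtree with the target label;
            # its leaves are reported separately.
            if child.label() != label:
                leaves.extend(leaves_outside_nested(child, label))
        else:
            # A plain string child is a leaf (token).
            leaves.append(child)
    return leaves


def split_by_label(tree, label="S"):
    """Return one space-joined leaf string per subtree labeled `label`."""
    return [
        " ".join(leaves_outside_nested(subtree, label))
        for subtree in tree.subtrees(filter=lambda t: t.label() == label)
    ]


tree = Tree.fromstring("""(S
  (NP (PRP He))
  (VP
    (VBD won)
    (NP (DT the) (NNP Gusher) (NNP Marathon))
    (, ,)
    (S (VP (VBG finishing) (PP (IN in) (NP (CD 30) (NNS minutes))))))
  (. .))""")

segments = split_by_label(tree)
print(segments)
# ['He won the Gusher Marathon , .', 'finishing in 30 minutes']
```

Note that the final period stays with the outer "S", because that is where it sits in the parse tree. This differs slightly from the expected output in the question, which attaches the period to the inner clause.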
    

    【Discussion】:
