【Question title】: How to make a tree from the output of a dependency parser?
【Posted】: 2018-09-03 11:17:06
【Question description】:

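I am trying to build a tree (a nested dictionary) from the output of a dependency parser. The sentence is "I shot an elephant in my sleep". I was able to get the output described in this link: How do I do dependency parsing in NLTK?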

nsubj(shot-2, I-1)
det(elephant-4, an-3)
dobj(shot-2, elephant-4)
prep(shot-2, in-5)
poss(sleep-7, my-6)
pobj(in-5, sleep-7)
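
The relations above are printed as strings, while the code further down works on (relation, governor, dependent) tuples. Here is a minimal sketch of that conversion, assuming every line has exactly the rel(governor-index, dependent-index) shape shown. The names DEP_PATTERN and parse_dependency are made up for this illustration. The token indices are dropped, so a word that occurs twice in a sentence would collide, and a root entry such as ('ROOT', 'ROOT', 'shot') would still have to be added by hand.

```python
import re

# Each relation is printed as rel(governor-index, dependent-index).
DEP_PATTERN = re.compile(r"^([\w:]+)\((.+)-(\d+), (.+)-(\d+)\)$")


def parse_dependency(line):
    """Turn 'det(elephant-4, an-3)' into ('det', 'elephant', 'an')."""
    match = DEP_PATTERN.match(line.strip())
    if match is None:
        raise ValueError("unrecognised dependency line: %r" % line)
    rel, gov, _gov_index, dep, _dep_index = match.groups()
    return rel, gov, dep


lines = [
    "nsubj(shot-2, I-1)",
    "det(elephant-4, an-3)",
    "dobj(shot-2, elephant-4)",
    "prep(shot-2, in-5)",
    "poss(sleep-7, my-6)",
    "pobj(in-5, sleep-7)",
]

list_of_tuples = [parse_dependency(line) for line in lines]
print(list_of_tuples)
```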

To convert this list of tuples into a nested dictionary, I used the following link: How to convert python list of tuples into tree?

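def build_tree(list_of_tuples):
    # Each tuple is (relation, governor, dependent); key every node by its dependent word.
    # The input must include a ('ROOT', 'ROOT', <head word>) tuple for the root.
    all_nodes = {dep: ((rel, gov), {}) for rel, gov, dep in list_of_tuples}
    root = {}
    print(all_nodes)  # debug output
    for rel, gov, dep in list_of_tuples:
        if gov != 'ROOT':  # compare strings with !=, not "is not"
            # Attach the dependent under its governor.
            all_nodes[gov][1][dep] = all_nodes[dep]
        else:
            root[dep] = all_nodes[dep]
    return root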

The output is as follows:

{'shot': (('ROOT', 'ROOT'),
  {'I': (('nsubj', 'shot'), {}),
   'elephant': (('dobj', 'shot'), {'an': (('det', 'elephant'), {})}),
   'sleep': (('nmod', 'shot'),
    {'in': (('case', 'sleep'), {}), 'my': (('nmod:poss', 'sleep'), {})})})}

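To find the path from the root to a leaf, I used the following link: Return root to specific leaf from a nested dictionary tree

[Building the tree and finding the path are two separate things.] My second goal is to find the path from the root to a leaf node, as is done in Return root to specific leaf from a nested dictionary tree. But I want the root-to-leaf path of dependency relations. So, for example, when I call recurse_category(categories, 'an'), where categories is the nested tree structure and 'an' is a word in the tree, I should get ROOT-nsubj-dobj (the dependency relations up to the root) as output.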

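【Question discussion】:

Hint: DependencyGraph github.com/nltk/nltk/blob/develop/nltk/parse/…
@alvas I am lost; could you show how to do this for my case? If you want me to change the way I convert the tuples into a dictionary, please show that rather than giving a GitHub link.
What is your desired output?
@alvas I am looking for the path from the root to a leaf. As stated in the question (the link is given there too), if I pass "an", I should get "Root-nsubj-dobj".
I do not understand why the input is an and the expected output is `root-nsubj-dobj`. Can you elaborate?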

Tags: python dictionary nlp nltk stanford-nlp

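【Solution 1】:

First, if you are only using the pretrained models of the Stanford CoreNLP dependency parser, you should use CoreNLPDependencyParser from nltk.parse.corenlp and avoid the old nltk.parse.stanford interface.

See Stanford Parser and NLTK

After downloading the Java server and starting it in a terminal, in Python: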

>>> from nltk.parse.corenlp import CoreNLPDependencyParser
>>> dep_parser = CoreNLPDependencyParser(url='http://localhost:9000')
>>> sent = "I shot an elephant with a banana .".split()
>>> parses = list(dep_parser.parse(sent))
>>> type(parses[0])
<class 'nltk.parse.dependencygraph.DependencyGraph'>

Now we see that the parse is of type DependencyGraph, from nltk.parse.dependencygraph: https://github.com/nltk/nltk/blob/develop/nltk/parse/dependencygraph.py#L36

Calling DependencyGraph.tree() converts the DependencyGraph into an nltk.tree.Tree object:

>>> parses[0].tree()
Tree('shot', ['I', Tree('elephant', ['an']), Tree('banana', ['with', 'a']), '.'])

>>> parses[0].tree().pretty_print()
          shot                  
  _________|____________         
 |   |  elephant      banana    
 |   |     |       _____|_____   
 I   .     an    with         a

To convert it to the bracketed parse format:

>>> print(parses[0].tree())
(shot I (elephant an) (banana with a) .)

If you are looking for the dependency triples:

>>> [(governor, dep, dependent) for governor, dep, dependent in parses[0].triples()]
[(('shot', 'VBD'), 'nsubj', ('I', 'PRP')), (('shot', 'VBD'), 'dobj', ('elephant', 'NN')), (('elephant', 'NN'), 'det', ('an', 'DT')), (('shot', 'VBD'), 'nmod', ('banana', 'NN')), (('banana', 'NN'), 'case', ('with', 'IN')), (('banana', 'NN'), 'det', ('a', 'DT')), (('shot', 'VBD'), 'punct', ('.', '.'))]

>>> for governor, dep, dependent in parses[0].triples():
...     print(governor, dep, dependent)
... 
('shot', 'VBD') nsubj ('I', 'PRP')
('shot', 'VBD') dobj ('elephant', 'NN')
('elephant', 'NN') det ('an', 'DT')
('shot', 'VBD') nmod ('banana', 'NN')
('banana', 'NN') case ('with', 'IN')
('banana', 'NN') det ('a', 'DT')
('shot', 'VBD') punct ('.', '.')

CoNLL format:

>>> print(parses[0].to_conll(style=10))
1   I   I   PRP PRP _   2   nsubj   _   _
2   shot    shoot   VBD VBD _   0   ROOT    _   _
3   an  a   DT  DT  _   4   det _   _
4   elephant    elephant    NN  NN  _   2   dobj    _   _
5   with    with    IN  IN  _   7   case    _   _
6   a   a   DT  DT  _   7   det _   _
7   banana  banana  NN  NN  _   2   nmod    _   _
8   .   .   .   .   _   2   punct   _   _
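
If what you need is the chain of relation labels from the root down to a given word, it can be read off the triples without building a tree at all. The sketch below is only an illustration: relation_path is a hypothetical helper, not part of NLTK, and the triples are copied by hand from the output above so that it runs without the server. It assumes every word occurs only once in the sentence (the triples carry no token positions) and that the head word should be labelled ROOT.

```python
# Triples copied from the parses[0].triples() output above.
triples = [
    (('shot', 'VBD'), 'nsubj', ('I', 'PRP')),
    (('shot', 'VBD'), 'dobj', ('elephant', 'NN')),
    (('elephant', 'NN'), 'det', ('an', 'DT')),
    (('shot', 'VBD'), 'nmod', ('banana', 'NN')),
    (('banana', 'NN'), 'case', ('with', 'IN')),
    (('banana', 'NN'), 'det', ('a', 'DT')),
    (('shot', 'VBD'), 'punct', ('.', '.')),
]


def relation_path(triples, word, root_label='ROOT'):
    """Return the relation labels from the root down to `word`."""
    # Map each dependent word to (governor word, relation).
    parent_of = {dep[0]: (gov[0], rel) for gov, rel, dep in triples}
    known_words = set(parent_of) | {gov[0] for gov, _rel, _dep in triples}
    if word not in known_words:
        raise KeyError(word)

    # Walk upwards from the word until a node without a governor is reached.
    path = []
    current = word
    while current in parent_of:
        current, rel = parent_of[current]
        path.append(rel)
    path.append(root_label)
    return list(reversed(path))


print('-'.join(relation_path(triples, 'an')))  # ROOT-dobj-det
print('-'.join(relation_path(triples, 'a')))   # ROOT-nmod-det
```

Walking upwards from the word keeps the lookup linear in the depth of the word, and the graph itself stays untouched.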

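【Discussion】:

So, nltk.parse.corenlp does not work for some reason. It says No module named corenlp, but nltk.parse.stanford works for me. I have unzipped stanford-corenlp-full-2018-02-27 and stanford-parser-full-2018-02-27. I have the models.jar and parser.jar files mentioned in the link. I also tried from nltk.parse import CoreNLPParser, but that did not work either. Also, I cannot find the englishPCFG file, although I do have the lexparser shell script. I downloaded the PCFG file from GitHub. It says NLTK was unable to find the JAVA file Set the JAVAHOME environment variables
Upgrade your NLTK: pip3 install -U nltk. Also, do not point to the jar files from your Python code; just start the server. See stackoverflow.com/questions/13883277/stanford-parser-and-nltk/…
Thanks, it works! When we call parses[0].tree(), we lose the dependency relations between the words. I am trying to build a tree that also carries the dependency relations, because I want the DEPENDENCY RELATIONSHIP path. For example, in stackoverflow.com/questions/34395127/… the tree holds POS tags, I think; in our case it would have to hold dependency relations, and then we could find the path.
Show the desired output in the question. Dependency labels cannot be represented as a hierarchical path: sometimes dependencies form cycles and sometimes they cross sub-branches, so they cannot simply be converted into a tree without losing information. It is still unclear what you are trying to achieve; I think keeping the graph structure would be more useful, unless you only need some visualisation.
First of all, thank you very much for your attention. I have posted the desired output. All I need is the dependency relation path that I have already shown in the question. Maybe you are right that dependency relations cannot be represented as a tree, but I do not even need that. I only need the root-to-leaf path of dependency relations.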

【Solution 2】:

This converts the output into nested-dictionary form. If I also find a way to get the path, I will let you know. Maybe this helps you.

list_of_tuples = [('ROOT','ROOT', 'shot'),('nsubj','shot', 'I'),('det','elephant', 'an'),('dobj','shot', 'elephant'),('case','sleep', 'in'),('nmod:poss','sleep', 'my'),('nmod','shot', 'sleep')]

# First pass: create one node per dependent word.
nodes = {}
for rel, parent, child in list_of_tuples:
    nodes[child] = {'Name': child, 'Relationship': rel}

# Second pass: attach each node to its parent; the ROOT entry starts the forest.
forest = []
for rel, parent, child in list_of_tuples:
    node = nodes[child]
    if parent == 'ROOT':  # this should be the root node
        forest.append(node)
    else:
        # Create the children list on first use, then append to it.
        nodes[parent].setdefault('children', []).append(node)

print(forest)

The output is a list holding the nested dictionary:

[{'Name': 'shot', 'Relationship': 'ROOT', 'children': [{'Name': 'I', 'Relationship': 'nsubj'}, {'Name': 'elephant', 'Relationship': 'dobj', 'children': [{'Name': 'an', 'Relationship': 'det'}]}, {'Name': 'sleep', 'Relationship': 'nmod', 'children': [{'Name': 'in', 'Relationship': 'case'}, {'Name': 'my', 'Relationship': 'nmod:poss'}]}]}]

The following function can help you find the path from the root to a leaf:

def recurse_category(categories,to_find):
    for category in categories: 
        if category['Name'] == to_find:
            return True, [category['Relationship']]
        if 'children' in category:
            found, path = recurse_category(category['children'], to_find)
            if found:
                return True, [category['Relationship']] + path
    return False, []
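
As a usage sketch, here is the function above applied to the forest printed earlier in this answer (both are repeated so the snippet runs on its own). Note that for 'an' this gives ROOT-dobj-det rather than the ROOT-nsubj-dobj mentioned in the question, because in this tree 'an' hangs under 'elephant' (dobj) with the relation det.

```python
# The nested structure produced by the code above, copied from its output.
forest = [{'Name': 'shot', 'Relationship': 'ROOT', 'children': [
    {'Name': 'I', 'Relationship': 'nsubj'},
    {'Name': 'elephant', 'Relationship': 'dobj', 'children': [
        {'Name': 'an', 'Relationship': 'det'}]},
    {'Name': 'sleep', 'Relationship': 'nmod', 'children': [
        {'Name': 'in', 'Relationship': 'case'},
        {'Name': 'my', 'Relationship': 'nmod:poss'}]}]}]


def recurse_category(categories, to_find):
    # Depth-first search; collects the relation labels on the way back up.
    for category in categories:
        if category['Name'] == to_find:
            return True, [category['Relationship']]
        if 'children' in category:
            found, path = recurse_category(category['children'], to_find)
            if found:
                return True, [category['Relationship']] + path
    return False, []


found, path = recurse_category(forest, 'an')
print('-'.join(path))  # ROOT-dobj-det

found, path = recurse_category(forest, 'my')
print('-'.join(path))  # ROOT-nmod-nmod:poss
```

A word that is not in the tree returns (False, []), so check the first element before using the path.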

【Discussion】: