【问题标题】:How to find the shortest dependency path between two words in Python?如何在 Python 中找到两个单词之间的最短依赖路径?
【发布时间】:2015-12-26 10:06:32
【问题描述】:

我尝试在 Python 给定的依赖树中找到两个单词之间的依赖路径。

对于句子

流行文化中的机器人提醒我们 不受约束的人类机构。

我使用了 practnlptools (https://github.com/biplab-iitb/practNLPTools) 来获取依赖解析结果,如:

nsubj(are-5, Robots-1)
xsubj(remind-8, Robots-1)
amod(culture-4, popular-3)
prep_in(Robots-1, culture-4)
root(ROOT-0, are-5)
advmod(are-5, there-6)
aux(remind-8, to-7)
xcomp(are-5, remind-8)
dobj(remind-8, us-9)
det(awesomeness-12, the-11)
prep_of(remind-8, awesomeness-12)
amod(agency-16, unbound-14)
amod(agency-16, human-15)
prep_of(awesomeness-12, agency-16)

也可以形象化为(图片取自https://demos.explosion.ai/displacy/

“robots”和“are”之间的路径长度为1,“robots”和“awesomeness”之间的路径长度为4。

我的问题是上面的依赖解析结果,我怎样才能得到两个单词之间的依赖路径或依赖路径长度?

从我目前的搜索结果来看,nltk 的 ParentedTree 会有帮助吗?

谢谢!

【问题讨论】:

  • 嗨@Sean,如何使用nltk 获得一个句子的相似树结构?我目前没有任何语法,并且使用“treebank”语法,给我一些生产列表中不存在的单词的错误。谢谢。
  • @skadoosh 实际上,使用 NLTK 获得如此好的演示文稿很难。我建议使用 Spacy。

标签: python nltk text-parsing


【解决方案1】:

这个答案依赖于斯坦福 CoreNLP 来获取句子的依赖树。在使用networkx时,它从HugoMailhot的answer借用了相当多的代码。

在运行代码之前,需要:

  1. sudo pip install pycorenlp(Stanford CoreNLP 的 Python 接口)
  2. 下载Stanford CoreNLP
  3. 按如下方式启动 Stanford CoreNLP 服务器:

    java -mx4g -cp "*" edu.stanford.nlp.pipeline.StanfordCoreNLPServer -port 9000 -timeout 50000
    

然后可以运行以下代码来找到两个单词之间的最短依赖路径:

import networkx as nx
from pycorenlp import StanfordCoreNLP
from pprint import pprint

nlp = StanfordCoreNLP('http://localhost:{0}'.format(9000))
def get_stanford_annotations(text, port=9000,
                             annotators='tokenize,ssplit,pos,lemma,depparse,parse'):
    output = nlp.annotate(text, properties={
        "timeout": "10000",
        "ssplit.newlineIsSentenceBreak": "two",
        'annotators': annotators,
        'outputFormat': 'json'
    })
    return output

# The code expects the document to contains exactly one sentence.
document =  'Robots in popular culture are there to remind us of the awesomeness of'\
            'unbound human agency.'
print('document: {0}'.format(document))

# Parse the text
annotations = get_stanford_annotations(document, port=9000,
                                       annotators='tokenize,ssplit,pos,lemma,depparse')
tokens = annotations['sentences'][0]['tokens']

# Load Stanford CoreNLP's dependency tree into a networkx graph
edges = []
dependencies = {}
for edge in annotations['sentences'][0]['basic-dependencies']:
    edges.append((edge['governor'], edge['dependent']))
    dependencies[(min(edge['governor'], edge['dependent']),
                  max(edge['governor'], edge['dependent']))] = edge

graph = nx.Graph(edges)
#pprint(dependencies)
#print('edges: {0}'.format(edges))

# Find the shortest path
token1 = 'Robots'
token2 = 'awesomeness'
for token in tokens:
    if token1 == token['originalText']:
        token1_index = token['index']
    if token2 == token['originalText']:
        token2_index = token['index']

path = nx.shortest_path(graph, source=token1_index, target=token2_index)
print('path: {0}'.format(path))

for token_id in path:
    token = tokens[token_id-1]
    token_text = token['originalText']
    print('Node {0}\ttoken_text: {1}'.format(token_id,token_text))

输出是:

document: Robots in popular culture are there to remind us of the awesomeness of unbound human agency.
path: [1, 5, 8, 12]
Node 1  token_text: Robots
Node 5  token_text: are
Node 8  token_text: remind
Node 12 token_text: awesomeness

请注意,Stanford CoreNLP 可以在线测试:http://nlp.stanford.edu:8080/parser/index.jsp

此答案已在 Windows 7 SP1 x64 Ultimate 上使用 Stanford CoreNLP 3.6.0.、pycorenlp 0.3.0 和 python 3.5 x64 进行了测试。

【讨论】:

    【解决方案2】:

    HugoMailhot 的answer 很棒。我将为想要找到两个单词之间最短依赖路径的spacy 用户写类似的东西(而 HugoMailhot 的答案依赖于practNLPTools)。

    句子:

    流行文化中的机器人提醒我们 不受约束的人类机构。

    following dependency tree:

    这里是找到两个词之间最短依赖路径的代码:

    import networkx as nx
    import spacy
    nlp = spacy.load('en')
    
    # https://spacy.io/docs/usage/processing-text
    document = nlp(u'Robots in popular culture are there to remind us of the awesomeness of unbound human agency.', parse=True)
    
    print('document: {0}'.format(document))
    
    # Load spacy's dependency tree into a networkx graph
    edges = []
    for token in document:
        # FYI https://spacy.io/docs/api/token
        for child in token.children:
            edges.append(('{0}-{1}'.format(token.lower_,token.i),
                          '{0}-{1}'.format(child.lower_,child.i)))
    
    graph = nx.Graph(edges)
    
    # https://networkx.github.io/documentation/networkx-1.10/reference/algorithms.shortest_paths.html
    print(nx.shortest_path_length(graph, source='robots-0', target='awesomeness-11'))
    print(nx.shortest_path(graph, source='robots-0', target='awesomeness-11'))
    print(nx.shortest_path(graph, source='robots-0', target='agency-15'))
    

    输出:

    4
    ['robots-0', 'are-4', 'remind-7', 'of-9', 'awesomeness-11']
    ['robots-0', 'are-4', 'remind-7', 'of-9', 'awesomeness-11', 'of-12', 'agency-15']
    

    安装 spacy 和 networkx:

    sudo pip install networkx 
    sudo pip install spacy
    sudo python -m spacy.en.download parser # will take 0.5 GB
    

    关于 spacy 依赖解析的一些基准测试:https://spacy.io/docs/api/

    【讨论】:

      【解决方案3】:

      您的问题可以很容易地理解为图问题,我们必须找到两个节点之间的最短路径。

      要将您的依赖解析转换为图表,我们首先必须处理它以字符串形式出现的事实。你想得到这个:

      'nsubj(are-5, Robots-1)\nxsubj(remind-8, Robots-1)\namod(culture-4, popular-3)\nprep_in(Robots-1, culture-4)\nroot(ROOT-0, are-5)\nadvmod(are-5, there-6)\naux(remind-8, to-7)\nxcomp(are-5, remind-8)\ndobj(remind-8, us-9)\ndet(awesomeness-12, the-11)\nprep_of(remind-8, awesomeness-12)\namod(agency-16, unbound-14)\namod(agency-16, human-15)\nprep_of(awesomeness-12, agency-16)'
      

      看起来像这样:

      [('are-5', 'Robots-1'), ('remind-8', 'Robots-1'), ('culture-4', 'popular-3'), ('Robots-1', 'culture-4'), ('ROOT-0', 'are-5'), ('are-5', 'there-6'), ('remind-8', 'to-7'), ('are-5', 'remind-8'), ('remind-8', 'us-9'), ('awesomeness-12', 'the-11'), ('remind-8', 'awesomeness-12'), ('agency-16', 'unbound-14'), ('agency-16', 'human-15'), ('awesomeness-12', 'agency-16')]
      

      通过这种方式,您可以将元组列表提供给来自 networkx 模块的图形构造函数,该模块将分析列表并为您构建图形,此外还为您提供了一个简洁的方法,可以为您提供两个之间最短路径的长度给定节点。

      必要的进口

      import re
      import networkx as nx
      from practnlptools.tools import Annotator
      

      如何以所需的元组列表格式获取字符串

      annotator = Annotator()
      text = """Robots in popular culture are there to remind us of the awesomeness of unbound human agency."""
      dep_parse = annotator.getAnnotations(text, dep_parse=True)['dep_parse']
      
      dp_list = dep_parse.split('\n')
      pattern = re.compile(r'.+?\((.+?), (.+?)\)')
      edges = []
      for dep in dp_list:
          m = pattern.search(dep)
          edges.append((m.group(1), m.group(2)))
      

      如何构建图表

      graph = nx.Graph(edges)  # Well that was easy
      

      如何计算最短路径长度

      print(nx.shortest_path_length(graph, source='Robots-1', target='awesomeness-12'))
      

      此脚本将显示给定依赖项解析的最短路径实际上长度为 2,因为您可以通过 remind-8Robots-1awesomeness-12

      1. xsubj(remind-8, Robots-1) 
      2. prep_of(remind-8, awesomeness-12)
      

      如果您不喜欢这个结果,您可能需要考虑过滤一些依赖项,在这种情况下不允许将 xsubj 依赖项添加到图中。

      【讨论】:

        猜你喜欢
        • 1970-01-01
        • 1970-01-01
        • 1970-01-01
        • 1970-01-01
        • 1970-01-01
        • 2018-02-10
        • 2021-05-25
        • 1970-01-01
        • 1970-01-01
        相关资源
        最近更新 更多