Obtaining left and right side words of entities with SpaCy (answers)

【Question title】: Obtaining left and right side words of entities with SpaCy
【Posted】: 2019-11-22 21:21:37
【Question description】:

I have been using SpaCy in an NLP project to get the words to the left and right of every entity and dump them in JSON format.

This is the function I tried:

# Methods of a class (class statement not shown); `nlp` is a loaded spaCy pipeline
def __init__(self):
    self.new_side_words_json = dict()

def side_words(self, text):
    # Parse once; entity.start / entity.end are token indices into the Doc
    words = nlp(text)
    # Use the immediate neighbours unless one of them is punctuation or
    # whitespace; otherwise fall back to the tokens one step further out
    side_words_json = [{'LeftSideWord': str(words[entity.start - 1]),
                        'Entity': str(entity),
                        'RightSideWord': str(words[entity.end])}
                       if not words[entity.start - 1].is_punct
                       and not words[entity.start - 1].is_space
                       and not words[entity.end].is_punct
                       and not words[entity.end].is_space
                       else
                       {'LeftSideWord': str(words[entity.start - 2]),
                        'Entity': str(entity),
                        'RightSideWord': str(words[entity.end + 1])}
                       for entity in words.ents]
    self.new_side_words_json['SideWords'] = side_words_json

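This algorithm works in some cases. However, I think it is a very ugly solution, because it has too little control over the conditions. It depends heavily on how the text is formatted. I want to build something reliable that works for every document.

What I mean is that a text file can contain a lot of punctuation or whitespace, and I only check up to two tokens on either side of the entity.

What I want is an algorithm that finds the meaningful word before and after each entity, skipping punctuation and whitespace, and possibly stop words as well.

How can I adapt this algorithm to get the previous and next meaningful word for every entity?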

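【Question discussion】:

Tags: python-3.x spacy

【Solution 1】:

I found a solution in the end. It is still ugly, but it works the way I want.

I am posting the code here so that anyone facing the same kind of problem can get an idea for a solution.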

    words = nlp(text)
    n_tokens = len(words)  # upper bound for valid token indices
    # Try the token right after the entity, then the next two; fall back to 'null'
    right = [
        {'Right': str(words[entity.end])} if (entity.end < n_tokens) and not words[entity.end].is_punct and not words[entity.end].is_space
        else
        {'Right': str(words[entity.end + 1])} if (entity.end + 1 < n_tokens) and not words[entity.end + 1].is_punct and not words[entity.end + 1].is_space
        else
        {'Right': str(words[entity.end + 2])} if (entity.end + 2 < n_tokens) and not words[entity.end + 2].is_punct and not words[entity.end + 2].is_space
        else
        {'Right': 'null'}
        for entity in words.ents]
    # `left` and `entities` are the matching per-entity lists of dicts, built separately (not shown)
    result = [{**dict_left, **dict_entities, **dict_right} for
              dict_left, dict_entities, dict_right in
              zip(left, entities, right)]

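The problem was indexing the right-hand word: after the last entity there may be no more words, so the code raised an error when it tried to read past the end of the document. I added index bounds checks to solve this.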
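The same bounds problem can also be avoided by walking outward from the entity one token at a time and stopping at the edge of the document. The following is only a sketch of that idea, not the code from this answer: `nearest_meaningful` and `side_words` are hypothetical helper names, and `doc` is assumed to be a parsed spaCy `Doc` (the helpers rely only on `len(doc)`, `doc[i]`, `doc.ents`, and the token attributes `is_punct`, `is_space`, `is_stop`).

```python
def nearest_meaningful(doc, start, step, skip_stop=True):
    """Walk from token index `start` in direction `step` (+1 or -1) and
    return the text of the first token that is not punctuation, whitespace
    or (optionally) a stop word. Returns None if the document edge is hit."""
    i = start
    while 0 <= i < len(doc):  # bounds check: never index outside the Doc
        tok = doc[i]
        if not (tok.is_punct or tok.is_space or (skip_stop and tok.is_stop)):
            return tok.text
        i += step
    return None


def side_words(doc, skip_stop=True):
    """Build one record per entity, using the same keys as the question."""
    return [
        {
            "LeftSideWord": nearest_meaningful(doc, ent.start - 1, -1, skip_stop),
            "Entity": ent.text,
            "RightSideWord": nearest_meaningful(doc, ent.end, +1, skip_stop),
        }
        for ent in doc.ents
    ]
```

With a loaded pipeline this would be called as `side_words(nlp(text))`. A missing neighbour comes back as `None`, which `json.dumps` writes as a real JSON `null` rather than the string `'null'`.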

I also had to split the if blocks up by JSON key to get more precise results for each key, and then simply merge them with zip().
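As a rough illustration of that merge step, the sketch below zips three per-entity lists into one list of records and dumps it as JSON. The sample values and the `LeftSideWord` / `Entity` / `RightSideWord` keys are made up for the example; only the `zip()` and dict-unpacking pattern comes from the answer.

```python
import json

# Hypothetical per-entity results, one dict per entity in each list
left = [{"LeftSideWord": "Yesterday"}, {"LeftSideWord": "visited"}]
entities = [{"Entity": "Google"}, {"Entity": "Paris"}]
right = [{"RightSideWord": "announced"}, {"RightSideWord": None}]

# Merge the i-th dict of every list into a single record
result = [{**l, **e, **r} for l, e, r in zip(left, entities, right)]

# None is serialized as JSON null
payload = json.dumps({"SideWords": result}, indent=2)
print(payload)
```

Note that `zip()` stops at the shortest input, so the three lists need exactly one element per entity; otherwise trailing entities are silently dropped.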

【Discussion】: