Finding semantically related named entities in text: answers

【Question title】: Finding semantically related named entities in text
【Posted】: 2021-06-17 18:41:43
【Question description】:

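I have a set of text documents with labeled named entities such as "person", "organization", "location", "product", "amount", "price", and so on. I have fine-tuned a BERT model to recognize these named entities. But I also need to solve the problem of finding related named entities in a text. For example, suppose we have a text like this: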

Hey, Jack! There is work for you. Thomas Smith of the Big Corporation called this morning and ordered four pizzas for fifteen dollars, and Andy on 28th Street ordered sushi.

BERT will find the following named entities, and their positions, in this text:

Jack - person
Thomas Smith - person
Big Corporation - organization
four - amount
pizzas - product
fifteen dollars - price
Andy - person
28th Street - location
sushi - product

I need a model that can split these entities into groups, each containing semantically related entities, like this:

{Jack}
{Thomas Smith, Big Corporation, four, pizzas, fifteen dollars}
{Andy, 28th Street, sushi}

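Is it possible to solve this problem if I have a training dataset that contains the links between entities? Is there a neural network architecture that can be used on top of the BERT model's embeddings to solve it? Perhaps a graph model?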
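To make the target concrete: if the training data stores links as pairs of entities, the groups shown above are the connected components of the link graph. The following is only a minimal sketch of that conversion, not a model. The function name `groups_from_links` and the hand-written `links` list are hypothetical, and entities are identified by their text for brevity (real data would more likely use span positions, since the same string can occur twice).

```python
def groups_from_links(entities, links):
    """Group entities into connected components of the pairwise link graph."""
    # Union-find: every entity starts as the root of its own group.
    parent = {e: e for e in entities}

    def find(e):
        # Follow parents up to the root, shortening the path on the way.
        while parent[e] != e:
            parent[e] = parent[parent[e]]
            e = parent[e]
        return e

    # Each link merges the groups of its two endpoints.
    for a, b in links:
        parent[find(a)] = find(b)

    # Collect entities by root, keeping the original entity order.
    groups = {}
    for e in entities:
        groups.setdefault(find(e), []).append(e)
    return list(groups.values())


entities = ["Jack", "Thomas Smith", "Big Corporation", "four", "pizzas",
            "fifteen dollars", "Andy", "28th Street", "sushi"]

# Hypothetical annotated links for the example text.
links = [
    ("Thomas Smith", "Big Corporation"),
    ("Thomas Smith", "pizzas"),
    ("four", "pizzas"),
    ("pizzas", "fifteen dollars"),
    ("Andy", "28th Street"),
    ("Andy", "sushi"),
]

for group in groups_from_links(entities, links):
    print(group)
```

With links in this form, any approach that predicts whether two entities are linked can be turned into groups by the same step.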

【Comments】:

Tags: python nlp named-entity-recognition information-extraction

【Solution 1】:

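In your example, all related entities are in the same sentence (but not all entities in the same sentence are related).

If that is the case, I suggest splitting each sentence into components and marking the entities that belong to the same component as related.

To construct the components, you can build the syntactic dependency tree of the sentence and then cut the tree by removing some of its dependency edges. For example, if a sentence has several different subjects, you can split it into clauses, one per subject.

I used spaCy to find the entities and build the dependency tree (but spaCy's stock model does not tag the product names in this example as entities, so you should use your own NER model). You may also want to invent your own rules for splitting a sentence into parts.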

from collections import defaultdict

import spacy

# Small English pipeline; install with: python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")

text = "Hey, Jack! There is work for you. Thomas Smith of the Big Corporation called this morning and ordered four pizzas for fifteen dollars, and Andy on 28th Street ordered sushi."
doc = nlp(text)

def find_cluster(token):
    """Return a key (character offset of the clause head) for the token's cluster."""
    # this token is the head of a sentence
    if token.dep_ == 'ROOT' or token.head == token:
        return token.idx
    # this token is the head of an autonomous sub-sentence:
    # a conjoined verb that has its own subject
    if token.dep_ == 'conj' and any(child.dep_ == 'nsubj' for child in token.children):
        return token.idx
    # otherwise climb one step up the dependency tree
    return find_cluster(token.head)

# Group entities by the clause head reached from their first token
clusters = defaultdict(list)
for e in doc.ents:
    clusters[find_cluster(e[0])].append(e)

for c in clusters.values():
    print(c)

The expected output is:

# [Jack]
# [Thomas Smith, the Big Corporation, this morning, four, fifteen dollars]
# [Andy, 28th Street]
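The tree-cutting rule can also be looked at in isolation, without spaCy. The sketch below applies the same two stopping conditions to a hand-written parse of a shortened sentence ("Thomas ordered pizzas and Andy ordered sushi"). The `TOKENS` table, its head indices, and its labels are assumptions made for illustration; a real parser may analyse the sentence differently.

```python
from collections import defaultdict

# Hypothetical hand-written parse: (text, index of head token, dependency label).
TOKENS = [
    ("Thomas", 1, "nsubj"),   # 0
    ("ordered", 1, "ROOT"),   # 1 (the root points at itself)
    ("pizzas", 1, "dobj"),    # 2
    ("and", 1, "cc"),         # 3
    ("Andy", 5, "nsubj"),     # 4
    ("ordered", 1, "conj"),   # 5
    ("sushi", 5, "dobj"),     # 6
]


def has_own_subject(i):
    """True if some other token is attached to token i as its subject."""
    return any(head == i and dep == "nsubj"
               for j, (_, head, dep) in enumerate(TOKENS) if j != i)


def cluster_root(i):
    """Climb the heads from token i until a clause head is reached."""
    while True:
        _, head, dep = TOKENS[i]
        if dep == "ROOT" or head == i:
            return i  # head of the whole sentence
        if dep == "conj" and has_own_subject(i):
            return i  # conjoined verb with its own subject: cut the edge here
        i = head


def group_entities(entity_indices):
    """Group entity tokens by the clause head they climb to."""
    groups = defaultdict(list)
    for i in entity_indices:
        groups[cluster_root(i)].append(TOKENS[i][0])
    return list(groups.values())


print(group_entities([0, 2, 4, 6]))
# [['Thomas', 'pizzas'], ['Andy', 'sushi']]
```

Because the entity indices are passed in explicitly, entities from a custom NER model (including products) can be grouped by the same rule.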

【Comments】: