How to analyse an NER that is trained using Spacy? Answers

【Question title】: How to analyse an NER that is trained using Spacy?
【Posted】: 2019-01-28 04:05:05
【Question description】:

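This is a simple piece of code taken (more or less) from the tutorial docs. After training an NER model with the training code below, I use nlp(sentence).ents in a for loop to get the named entities. As you can see, I use a blank model, spacy.blank('en'), because I am adding new entities. However, no entities are detected in the test set.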

import spacy
import random
from spacy.util import compounding
from spacy.util import minibatch
def get_batches(train_data, model_type):
    max_batch_sizes = {'tagger': 32, 'parser': 16, 'ner': 16, 'textcat': 64}
    max_batch_size = max_batch_sizes[model_type]
    if len(train_data) < 1000:
        max_batch_size /= 2
    if len(train_data) < 500:
        max_batch_size /= 2
    batch_size = compounding(1, max_batch_size, 1.001)
    batches = minibatch(train_data, size=batch_size)
    return batches

nlp = spacy.blank('en')
nlp.vocab.vectors.name = 'blank_vector'
optimizer = nlp.begin_training()
for i in range(20):
    random.shuffle(TRAIN_DATA)
    batches = get_batches(TRAIN_DATA, 'ner')
    for batch in batches:
        texts, annotations = zip(*batch)
        nlp.update(texts, annotations, drop=0.5, sgd=optimizer)
#     for text, annotations in TRAIN_DATA:
#         nlp.update([text], [annotations], drop=0.5, sgd=optimizer)
nlp.to_disk('model')

How do I analyse a model created in spaCy? I tried to make sense of it by looking at the model that was created, but unfortunately I have no idea how to add the information I need.

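My requirement: consider percentages such as [20%, 0.5%, etc.] and dollar amounts such as [$100, etc.]. The pretrained NER labels these as MONEY or PERCENT, but I need the entities to be detected according to their use case, for example ['HOME_LOAN_INTEREST_RATE', 'CAR_LOAN_INTEREST_RATE', etc.]. My problem might also be that not every dollar amount is in the vocabulary. If that is the case, how do I solve it?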
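
To make the requirement concrete, here is a rough sketch of how such use-case-specific labels could be written as training data in the (text, {"entities": [(start, end, label)]}) shape that the code on this page iterates over. The sentences and the helper `make_example` are made up for illustration; only the two label names come from the question.

```python
def make_example(text, spans):
    """Build one (text, annotations) pair from (substring, label) pairs.

    Hypothetical helper: it looks up the first occurrence of each
    substring and records its character offsets.
    """
    entities = []
    for substring, label in spans:
        start = text.find(substring)
        if start == -1:
            raise ValueError(f"{substring!r} not found in {text!r}")
        entities.append((start, start + len(substring), label))
    return (text, {"entities": entities})


# Made-up sentences; only the label names come from the question.
TRAIN_DATA = [
    make_example("The home loan interest rate is 8.5% this year.",
                 [("8.5%", "HOME_LOAN_INTEREST_RATE")]),
    make_example("Car loans are offered at 9.25% for new customers.",
                 [("9.25%", "CAR_LOAN_INTEREST_RATE")]),
]

# Sanity check: every offset pair should slice out the intended substring.
for text, annotations in TRAIN_DATA:
    for start, end, label in annotations["entities"]:
        print(label, repr(text[start:end]))
```

Printing the sliced substrings is a cheap way to catch off-by-one offsets before any training is run.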

Any help with this would be greatly appreciated.

【Question comments】:

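Hey, I can help you, but could you be a bit clearer? "I need them to detect entities according to their use case, e.g. ['HOME_LOAN_INTEREST_RATE', 'CAR_LOAN_INTEREST_RATE', etc.]."
Hey Gideon, those are just the new entities. I have figured out what was wrong with the code: I needed to create a pipeline component for ner, which the code above does not do. Also, the update function has a losses parameter that can be used to follow the model's progress. I haven't worked out which loss function is used yet, but hey, once we add the pipe to the code, it works.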

Tags: nlp spacy named-entity-recognition

【Solution 1】:

The update function has a losses parameter that can be used to find out the model's loss in each iteration.
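
As a rough illustration of how that dictionary could be used to follow progress: when a fresh dict is created for each pass over the data and handed to every `nlp.update` call, its value at the end of the pass is the accumulated loss for that pass. The sketch below shows only the bookkeeping. The numbers are made up, and `record_epoch` and `is_improving` are hypothetical helpers, not part of spaCy.

```python
def record_epoch(history, losses, component="ner"):
    """Append the accumulated loss of one component to the history list."""
    history.append(float(losses.get(component, 0.0)))
    return history


def is_improving(history, window=3):
    """Return True if the mean of the last `window` losses is lower than
    the mean of the `window` losses before them."""
    if len(history) < 2 * window:
        return True  # not enough data to judge yet
    recent = history[-window:]
    previous = history[-2 * window:-window]
    return sum(recent) / window < sum(previous) / window


# Made-up totals standing in for the `losses` dict at the end of each pass.
history = []
for fake_losses in [{"ner": 120.0}, {"ner": 80.0}, {"ner": 60.0},
                    {"ner": 50.0}, {"ner": 45.0}, {"ner": 44.0}]:
    record_epoch(history, fake_losses)

print(history)
print(is_improving(history))
```

A falling training loss only shows that the model is fitting the training data. It says little about how well the model handles unseen text, so it is best read alongside scores on held-out examples.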

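Also, the reason my NER model found no labels in my dataset was (probably) that no NER was performed at all, since I could not find an ner folder in the created model. To fix this, we have to create what spaCy calls a pipeline component (a pipe) and add it to the model.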

if "ner" not in nlp.pipe_names:
    ner = nlp.create_pipe("ner")
    nlp.add_pipe(ner, last=True)
# otherwise, get it so we can add labels
else:
    ner = nlp.get_pipe("ner")

# add labels
for _, annotations in TRAIN_DATA:
    for ent in annotations.get("entities"):
        ner.add_label(ent[2])

# get names of other pipes to disable them during training
other_pipes = [pipe for pipe in nlp.pipe_names if pipe != "ner"]

I'm sure others can explain better how to analyse a model's performance, but this is what I did to solve my problem.

with nlp.disable_pipes(*other_pipes):
    nlp.vocab.vectors.name = 'blank_vector'
    optimizer = nlp.begin_training()
    for i in range(10):
        random.shuffle(TRAIN_DATA)
        losses = {}
        batches = get_batches(TRAIN_DATA, 'ner')
        for batch in batches:
            texts, annotations = zip(*batch)
            nlp.update(texts, annotations, losses=losses, drop=0.1, sgd=optimizer)
            print('Losses:', losses)
nlp.to_disk('model')
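
One quick way to repeat the "is there an ner folder?" check after saving is to inspect the output directory. The sketch below assumes the layout that spaCy 2.x writes with to_disk, namely a meta.json with a "pipeline" list plus one subdirectory per pipeline component; that layout is an assumption here and may differ between versions. `saved_pipeline` is a hypothetical helper, and the demo runs on a stand-in directory so that it works without spaCy or a trained model.

```python
import json
import tempfile
from pathlib import Path


def saved_pipeline(model_dir):
    """Return (pipeline names from meta.json, subdirectory names) for a saved model."""
    model_dir = Path(model_dir)
    meta_path = model_dir / "meta.json"
    pipeline = []
    if meta_path.exists():
        meta = json.loads(meta_path.read_text(encoding="utf8"))
        pipeline = list(meta.get("pipeline", []))
    subdirs = sorted(p.name for p in model_dir.iterdir() if p.is_dir())
    return pipeline, subdirs


# Demo on a stand-in directory that mimics the assumed layout.
with tempfile.TemporaryDirectory() as tmp:
    fake_model = Path(tmp) / "model"
    (fake_model / "vocab").mkdir(parents=True)
    (fake_model / "ner").mkdir()
    (fake_model / "meta.json").write_text(
        json.dumps({"lang": "en", "pipeline": ["ner"]}), encoding="utf8")
    demo_pipeline, demo_subdirs = saved_pipeline(fake_model)

print(demo_pipeline, demo_subdirs)
```

On a real run this would be called as saved_pipeline('model'). An empty pipeline list or a missing ner subdirectory would point to the same problem described in this answer.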

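I will read more of the documentation to understand the optimizer and the loss function. Feel free to add another answer or edit this one to give a better explanation.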
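
Beyond the training loss, a common way to analyse an NER model is to compare its predictions with gold annotations on held-out examples and compute exact-match precision, recall and F1. The following is a minimal sketch of that idea, not spaCy's own scorer. The test sentences are made up, and `fake_predict` is a stand-in for running the trained model, so the example runs without spaCy.

```python
def prf(gold_spans, predicted_spans):
    """Exact-match precision, recall and F1 over two collections of spans."""
    gold, pred = set(gold_spans), set(predicted_spans)
    true_pos = len(gold & pred)
    precision = true_pos / len(pred) if pred else 0.0
    recall = true_pos / len(gold) if gold else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1


def evaluate(examples, predict):
    """Pool spans over all examples and score them.

    `examples` is a list of (text, {"entities": [(start, end, label)]}) pairs;
    `predict` maps a text to a list of (start, end, label) tuples.
    """
    gold, pred = [], []
    for idx, (text, annotations) in enumerate(examples):
        gold += [(idx, start, end, label)
                 for start, end, label in annotations["entities"]]
        pred += [(idx, start, end, label)
                 for start, end, label in predict(text)]
    return prf(gold, pred)


# Made-up held-out examples.
TEST_DATA = [
    ("Home loan rate is 8.5%.", {"entities": [(18, 22, "HOME_LOAN_INTEREST_RATE")]}),
    ("Car loan rate is 9%.", {"entities": [(17, 19, "CAR_LOAN_INTEREST_RATE")]}),
]


def fake_predict(text):
    # Stand-in for something like:
    # [(ent.start_char, ent.end_char, ent.label_) for ent in nlp(text).ents]
    if text.startswith("Home"):
        return [(18, 22, "HOME_LOAN_INTEREST_RATE")]
    return [(17, 19, "HOME_LOAN_INTEREST_RATE")]  # wrong label on purpose


precision, recall, f1 = evaluate(TEST_DATA, fake_predict)
print(precision, recall, f1)
```

Scoring on sentences that were not part of TRAIN_DATA matters here, because a model can reach a low training loss and still mislabel new text.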

【Comments】: