Adding a Retokenize pipe while training NER model: answers

【Question title】: Adding a Retokenize pipe while training NER model
【Posted】: 2019-11-13 17:18:28
【Question】:

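I am currently trying to train an NER model centered on property descriptions. I can get a fully trained model that works to my liking, but I now want to add a retokenize pipe to the model so that I can set it up to train other things.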

From here, I have been having trouble getting the retokenize pipe to actually work. Here is the definition:

from spacy.attrs import intify_attrs

def retok(doc):
    # Collect entity boundaries first, then merge them inside one retokenize block.
    ents = [(ent.start, ent.end, ent.label) for ent in doc.ents]
    with doc.retokenize() as retok:
        string_store = doc.vocab.strings
        for start, end, label in ents:
            retok.merge(
                doc[start:end],
                attrs=intify_attrs({'ent_type': label}, string_store))
    return doc

I add it to my training pipeline like this:

nlp.add_pipe(retok, after="ner")

And I add it to the language factories like this:

Language.factories['retok'] = lambda nlp, **cfg: retok(nlp)

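The error I keep running into is "AttributeError: 'English' object has no attribute 'ents'". I assume I get this error because the argument passed to the function is not a Doc but the nlp model itself. I am not sure how to get a Doc to flow into this pipe during training, and at this point I don't know where to go from here to make the pipe behave the way I want.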

Any help is appreciated, thank you.

【Comments】:

Tags: spacy

【Solution 1】:

You can use the built-in merge_entities pipeline component: https://spacy.io/api/pipeline-functions#merge_entities

Example copied from the docs:

texts = [t.text for t in nlp("I like David Bowie")]
assert texts == ["I", "like", "David", "Bowie"]

merge_ents = nlp.create_pipe("merge_entities")
nlp.add_pipe(merge_ents)

texts = [t.text for t in nlp("I like David Bowie")]
assert texts == ["I", "like", "David Bowie"]
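To make the effect of the merge concrete without spaCy, here is a minimal, library-free sketch of what merging entity spans does to a token sequence. The helper name `merge_spans` is hypothetical, spans are assumed to be non-overlapping `(start, end)` index pairs with `end > start`, and joining with a single space is a simplification (spaCy keeps the original text and whitespace).

```python
def merge_spans(tokens, spans):
    """Return a new token list in which each (start, end) span is one token.

    Hypothetical helper for illustration only; it is not part of spaCy.
    """
    # Map each span start to its (exclusive) end for quick lookup.
    ends_by_start = {start: end for start, end in spans}
    merged = []
    i = 0
    while i < len(tokens):
        if i in ends_by_start:
            end = ends_by_start[i]
            # Collapse the whole span into a single token.
            merged.append(" ".join(tokens[i:end]))
            i = end
        else:
            merged.append(tokens[i])
            i += 1
    return merged


tokens = ["I", "like", "David", "Bowie"]
merged = merge_spans(tokens, [(2, 4)])  # the entity covers tokens 2 and 3
```

In the real component the spans come from doc.ents, which is why it has to run after the "ner" component in the pipeline.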

If you need to customize it further, the current implementation of merge_entities (v2.2) is a good starting point:

def merge_entities(doc):
    """Merge entities into a single token.

    doc (Doc): The Doc object.
    RETURNS (Doc): The Doc object with merged entities.

    DOCS: https://spacy.io/api/pipeline-functions#merge_entities
    """
    with doc.retokenize() as retokenizer:
        for ent in doc.ents:
            attrs = {"tag": ent.root.tag, "dep": ent.root.dep, "ent_type": ent.label}
            retokenizer.merge(ent, attrs=attrs)
    return doc

P.S. You are passing nlp to retok() in the line below, which is where the error comes from:

Language.factories['retok'] = lambda nlp, **cfg: retok(nlp)
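A library-free sketch of why that line fails, and of one likely fix, assuming the v2-style convention in which a factory receives nlp and returns the component, and the pipeline later calls that component on each Doc. `FakeDoc` and `FakeNLP` are hypothetical stand-ins, not spaCy classes; they only mimic the presence or absence of an `ents` attribute.

```python
class FakeDoc:
    """Stand-in for a Doc: it has an `ents` attribute."""

    def __init__(self, ents):
        self.ents = ents


class FakeNLP:
    """Stand-in for a Language object: it has no `ents` attribute."""


def retok(doc):
    # A pipeline component: takes a doc, reads doc.ents, returns the doc.
    _ = list(doc.ents)
    return doc


def broken_factory(nlp, **cfg):
    # Mirrors the line from the question: the component is *called* on nlp.
    return retok(nlp)


def fixed_factory(nlp, **cfg):
    # Returns the component itself; it is called on each doc later.
    return retok


nlp = FakeNLP()

try:
    broken_factory(nlp)
    broken_error = None
except AttributeError as err:
    # Same kind of failure as "'English' object has no attribute 'ents'".
    broken_error = str(err)

component = fixed_factory(nlp)
doc = component(FakeDoc(ents=["David Bowie"]))
```

Under that assumption, the corresponding registration in the question would be `Language.factories['retok'] = lambda nlp, **cfg: retok`, without calling `retok` inside the lambda.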

See the related question: Spacy - Save custom pipeline

【Comments】:

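This is great. I must have glossed over this, because I didn't immediately make the connection that the NER component is what produces the entities. After adding it, I get a new problem: gold = GoldParse.from_annot_tuples(doc, zip(*gold.orig_annot)) File "gold.pyx", line 540, in spacy.gold.GoldParse.from_annot_tuples ValueError: need more than 6 values to unpack
There isn't enough context to be sure, but I think this is a known bug. It should be a separate question, or search the issue tracker first. (The answer is probably to upgrade and not to use gold_preproc.)
Can you clarify what I should upgrade? I am using the latest version of spacy. Also, the training code doesn't seem to reference gold_preproc.
This is getting a bit off-topic: a new question or a bug report would be better!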