How to convert combined spaCy NER tags to BIO format? Answers

【Question title】: How to convert combined spacy ner tags to BIO format?
【Posted】: 2020-09-23 03:12:46
【Question】:

How can I convert this to BIO format? I tried spaCy's biluo_tags_from_offsets, but it fails to capture all of the entities, and I think I know why.

tags = biluo_tags_from_offsets(doc, annot['entities'])

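In BSc(Bachelor of science) the two entities are written together with no space between them, but spaCy splits the text where there are spaces. So the words now look like (BSc(Bachelor, of, science), which is why spacy biluo_tags_from_offsets fails and returns -

Now, when it checks (80, 83, 'Degree'), it cannot find the word BSc on its own. Likewise, (84, 103, 'Degree') fails as well.

How can I handle cases like these? If anyone can help, please do.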

EDUCATION: · Master of Computer Applications (MCA) from NV, *********, *****. · BSc(Bachelor of science) from NV, *********, *****

{'entities': [(13, 44, 'Degree'), (46, 49, 'Degree'), (80, 83, 'Degree'), (84, 103, 'Degree')]}

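【Comments】:

Could you try merging the tokens with Doc.retokenize(), e.g. as in stackoverflow.com/a/63982729/4317058? It would be interesting to see whether the pretrained model still recognizes the new merged tokens.
@SergeyBushmanov Could you provide a working example? I can't follow that link properly. What exactly does retokenize() do?
@SergeyBushmanov I read online that spaCy does not support overlapping entities. Is there any way to work around this? I can't find a good article on how to handle it. If you are familiar with this, please help.
You might also want to look at spacy.io/api/pipeline-functions#merge_entities
@SergeyBushmanov I read that during my research. But in my case the overlapping entities have two different labels. How do I merge two entities into one word? I can't work out how to build an NER around that. If you are familiar with the workflow, please help; I have been stuck for weeks. My dataset has two problems: the one I listed above, and overlapping entities.

Tags: python python-3.x nlp spacy named-entity-recognition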

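【Solution 1】:

Typically, you pass your data to biluo_tags_from_offsets(doc, entities), where entities looks like [(14, 44, 'ORG'), (51, 54, 'ORG')]. You can edit this argument as needed (you can start from doc.ents and go from there). You can add, remove, or combine any entities in this list, as in the following example: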

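import spacy

# spaCy v2 exposes this helper as spacy.gold.biluo_tags_from_offsets;
# spaCy v3 renamed it to spacy.training.offsets_to_biluo_tags.
try:
    from spacy.gold import biluo_tags_from_offsets
except ImportError:
    from spacy.training import offsets_to_biluo_tags as biluo_tags_from_offsets

nlp = spacy.load("en_core_web_md")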

text = "I have a BSc (Bachelors of Computer Sciences) from NYU"
doc = nlp(text)
print("Entities before adding new entity:", doc.ents)

# Collect the entities found by the pretrained model as (start_char, end_char, label) tuples.
entities = []
for ent in doc.ents:
    entities.append((ent.start_char, ent.end_char, ent.label_))
print("BILUO before adding new entity:", biluo_tags_from_offsets(doc, entities))

# Add a desired entity: characters 9 to 12 cover "BSc".
entities.append((9, 12, 'ORG'))

print("BILUO after adding new entity:", biluo_tags_from_offsets(doc, entities))

Entities before adding new entity: (Bachelors of Computer Sciences, NYU)
BILUO before adding new entity: ['O', 'O', 'O', 'O', 'O', 'B-ORG', 'I-ORG', 'I-ORG', 'L-ORG', 'O', 'O', 'U-ORG']
BILUO after adding new entity: ['O', 'O', 'O', 'U-ORG', 'O', 'B-ORG', 'I-ORG', 'I-ORG', 'L-ORG', 'O', 'O', 'U-ORG']
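
The tags above are in BILUO format, whereas the question asks for BIO. As a minimal sketch, the conversion only needs to rename prefixes: L- becomes I- and U- becomes B-, while O and the misalignment marker - pass through unchanged. The helper name biluo_to_bio below is hypothetical and is not part of spaCy; as far as I know, spaCy v3 also provides spacy.training.biluo_to_iob for the same purpose.

```python
def biluo_to_bio(tags):
    """Convert a list of BILUO tags to BIO tags (hypothetical helper, not a spaCy API)."""
    prefix_map = {"L": "I", "U": "B"}  # B- and I- keep their prefix
    bio = []
    for tag in tags:
        # Split only at the first hyphen so labels such as WORK-OF-ART stay intact.
        prefix, sep, label = tag.partition("-")
        if sep and label and prefix in prefix_map:
            bio.append(prefix_map[prefix] + "-" + label)
        else:
            # "O", "-", "B-..." and "I-..." are copied unchanged.
            bio.append(tag)
    return bio


biluo = ['O', 'O', 'O', 'U-ORG', 'O', 'B-ORG', 'I-ORG', 'I-ORG', 'L-ORG', 'O', 'O', 'U-ORG']
print(biluo_to_bio(biluo))
```

Applied to the "after adding new entity" tags, both single-token U-ORG entities become B-ORG and the closing L-ORG becomes I-ORG.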

If you want the merging of entities to be rule-based, you can try the EntityRuler, as in the following simplified example (taken from the link above):

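from spacy.lang.en import English
from spacy.pipeline import EntityRuler

# This is the spaCy v2 API. In spaCy v3 the ruler is created with
# ruler = nlp.add_pipe("entity_ruler") instead of the two lines marked "v2 only".
nlp = English()
ruler = EntityRuler(nlp)  # v2 only
# A string pattern matches exact text; a list of dicts matches token by token.
patterns = [{"label": "ORG", "pattern": "Apple"},
            {"label": "GPE", "pattern": [{"LOWER": "san"}, {"LOWER": "francisco"}]}]
ruler.add_patterns(patterns)
nlp.add_pipe(ruler)  # v2 only

doc = nlp("Apple is opening its first big office in San Francisco.")
print([(ent.text, ent.label_) for ent in doc.ents])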

Then pass the list of redefined (in your case, merged) entities to biluo_tags_from_offsets again, as in the first code snippet.
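
The misalignment described in the question comes from token boundaries, so editing the entity list alone may not be enough for BSc(Bachelor of science). The sketch below illustrates the idea with a plain regex tokenizer instead of spaCy's: every punctuation mark becomes its own token, so the offsets (80, 83) and (84, 103) line up with token boundaries. The function name offsets_to_bio and the regex are assumptions made for illustration; with spaCy itself, a comparable effect might be achieved by customizing the tokenizer's infix rules so that parentheses split tokens.

```python
import re


def offsets_to_bio(text, entities):
    """Tag regex tokens with BIO labels from (start, end, label) character offsets.

    Hypothetical helper for illustration; it does not use spaCy's tokenizer.
    """
    # Runs of word characters and single punctuation marks are separate tokens,
    # so "BSc(Bachelor" is split into "BSc", "(", "Bachelor".
    tokens = [(m.group(), m.start(), m.end())
              for m in re.finditer(r"\w+|[^\w\s]", text)]
    tags = ["O"] * len(tokens)
    for start, end, label in entities:
        inside = [i for i, (_, s, e) in enumerate(tokens) if s >= start and e <= end]
        # Skip entities whose boundaries do not coincide with token boundaries.
        if not inside or tokens[inside[0]][1] != start or tokens[inside[-1]][2] != end:
            continue
        tags[inside[0]] = "B-" + label
        for i in inside[1:]:
            tags[i] = "I-" + label
    return [tok for tok, _, _ in tokens], tags


text = ("EDUCATION: · Master of Computer Applications (MCA) from NV, *********, *****. "
        "· BSc(Bachelor of science) from NV, *********, *****")
entities = [(13, 44, 'Degree'), (46, 49, 'Degree'), (80, 83, 'Degree'), (84, 103, 'Degree')]

tokens, tags = offsets_to_bio(text, entities)
# Show only the tokens that received an entity tag.
print([(tok, tag) for tok, tag in zip(tokens, tags) if tag != "O"])
```

Because BSc and Bachelor of science are adjacent entities with the same label, each one starts with its own B- tag, which is what keeps them distinguishable in BIO format.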

【Comments】:

@user_12 Did this answer your question? Was it helpful? Please consider stackoverflow.com/help/someone-answers