Process of performing NER (Named Entity Recognition) - NLP

【Question title】: Process of performing NER (Named Entity Recognition) - NLP
【Posted】: 2019-11-25 21:01:24
【Question】:

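So my text looks like this:

He also may have recurrent seizures which should be treated with ativan IV or IM and do not neccessarily indicate patient needs to return to hospital unless they continue for greater than 5 minutes or he has multiple recurrent seizures or complications such as aspiration.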

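There is also an annotation file, for example:

T1 Reason 16 33 recurrent seizures

The annotation above gives the entity's ID, its span (character positions), and the entity text itself. My goal is to perform NER (Named Entity Recognition) on this data. From my research, I understand that I need to tag the data with BIO (Beginning, Inside, Outside) labels, which would make it look like this: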

O - also O - may O - have B - recurrent I - seizures
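For illustration, here is a minimal sketch of how character-offset annotations could be turned into BIO tags. The helper names `parse_annotation` and `bio_tags` are hypothetical, tokens are split on whitespace only (a real tokenizer would handle punctuation better), and the offsets 17 and 35 are recomputed for the shortened sample string below, assuming a zero-based start and an exclusive end.

```python
import re

def parse_annotation(line):
    """Parse a line such as 'T1 Reason 17 35 recurrent seizures'."""
    ann_id, label, start, end, text = line.split(maxsplit=4)
    return ann_id, label, int(start), int(end), text

def bio_tags(text, spans):
    """Return (token, tag) pairs; spans are (start, end, label), end exclusive."""
    tagged = []
    started = set()  # spans that already received their B- tag
    for match in re.finditer(r"\S+", text):
        tag = "O"
        for span in spans:
            start, end, label = span
            # A token belongs to a span only if it lies fully inside it.
            if match.start() >= start and match.end() <= end:
                prefix = "I" if span in started else "B"
                started.add(span)
                tag = f"{prefix}-{label}"
                break
        tagged.append((match.group(), tag))
    return tagged

# Shortened sample sentence (assumption: offsets recomputed for this string).
text = "He also may have recurrent seizures which should be treated with ativan IV or IM"
_, label, start, end, _ = parse_annotation("T1 Reason 17 35 recurrent seizures")
tagged = bio_tags(text, [(start, end, label)])
print(tagged[:6])
```

Carrying the entity type in the tag (`B-Reason` rather than a bare `B`) lets a single tagger distinguish several entity types.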

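After BIO tagging, I want to use the data to obtain word embeddings and feed them into a classifier, which would then give me the entity types on test data.

Is the process I have outlined correct, or can anyone explain how I should approach this problem?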
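As a rough sketch of the last step in the outline above: once a classifier emits one BIO tag per token, the tags still have to be merged back into typed entities. The function name `bio_to_entities` and the sample predictions are hypothetical; treating a stray `I-` tag as the start of a new entity is one common convention, assumed here.

```python
def bio_to_entities(tokens, tags):
    """Merge per-token BIO tags into (label, entity_text) pairs."""
    entities = []
    current_label, current_tokens = None, []
    for token, tag in zip(tokens, tags):
        prefix, _, label = tag.partition("-")
        if prefix == "B" or (prefix == "I" and label != current_label):
            # Start a new entity; a stray I- tag is treated like B-.
            if current_tokens:
                entities.append((current_label, " ".join(current_tokens)))
            current_label, current_tokens = label, [token]
        elif prefix == "I":
            # Continue the entity that is currently open.
            current_tokens.append(token)
        else:
            # "O": close any open entity.
            if current_tokens:
                entities.append((current_label, " ".join(current_tokens)))
            current_label, current_tokens = None, []
    if current_tokens:
        entities.append((current_label, " ".join(current_tokens)))
    return entities

# Hypothetical classifier output for the first six tokens of the sample text.
tokens = ["He", "also", "may", "have", "recurrent", "seizures"]
predicted = ["O", "O", "O", "O", "B-Reason", "I-Reason"]
entities = bio_to_entities(tokens, predicted)
print(entities)
```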

【Comments】:

Tags: python machine-learning nlp

【Solution 1】:

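The approach you describe would work, but rather than building the BIO tagging and classification steps by hand, a more robust route is to use a library that ships a trained statistical model. You may want to look at the spaCy library for this kind of NLP task. spaCy can predict whether a word (called a token in NLP terms) in a given text (called a document in NLP terms) is part of an entity and, if so, of what type. To perform NER on your document with this library, you can proceed as follows: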

# pip install spacy
# python -m spacy download en_core_web_sm

import spacy

# Load the small English pipeline (tokenizer, tagger, parser, NER)
nlp = spacy.load("en_core_web_sm")

# Process your document
text = ("He also may have recurrent seizures which should be treated with ativan IV or IM and do not neccessarily indicate patient needs to return to hospital unless they continue for greater than 5 minutes or he has multiple recurrent seizures or complications such as aspiration.")
doc = nlp(text)

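# Find named entities in the document. doc.ents holds the entity spans;
# iterating over doc itself yields tokens, which have no label_ attribute.
for entity in doc.ents:
    print(entity.text, entity.label_)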

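Be sure to check the spaCy documentation to understand the output you get after processing the document. spaCy also provides a dictionary of what each possible label stands for.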
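One caveat worth keeping in mind: the pretrained `en_core_web_sm` model only predicts its own general-purpose labels (such as PERSON, ORG, or DATE), so a custom type like the one in the annotation file would most likely require training on your own data. As a hedged sketch of a first step in that direction, character-offset annotations can be attached to a spaCy `Doc` with `Doc.char_span`. The blank pipeline, the shortened sample string, and the offsets 17 and 35 (recomputed for that string) are assumptions made for illustration.

```python
import spacy

# A blank English pipeline: tokenizer only, so no model download is needed.
nlp = spacy.blank("en")

# Shortened sample sentence (assumption: offsets recomputed for this string).
text = "He also may have recurrent seizures which should be treated with ativan IV or IM"
doc = nlp(text)

# char_span returns None when the offsets do not fall on token boundaries.
span = doc.char_span(17, 35, label="Reason")
if span is None:
    raise ValueError("annotation offsets do not align with token boundaries")
doc.ents = [span]

ents = [(ent.text, ent.label_, ent.start_char, ent.end_char) for ent in doc.ents]
print(ents)
```

Documents annotated this way expose the same `doc.ents` interface as model predictions, which makes it straightforward to compare gold annotations against model output.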

【Comments】: