Performing named-entity recognition on sentences that are poorly cased to extract company names: answers

【Question title】: Performing named-entity recognition on sentences that are poorly cased to extract company names
【Posted】: 2021-03-31 19:52:41
【Question】:

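I have a database of sentences from which I am trying to extract all company names. So far I have been using spaCy's named-entity recognition, and it works well on sentences with standard capitalization. The problem arises when I try to do the same on sentences without standard capitalization. In particular, performance is poor on the subset of the database that uses "title case" (i.e. every word is capitalized except prepositions, articles, etc.).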

Below are some examples of such sentences, along with the results I currently get from spaCy and the results I want:

Sentence	Current Extraction	Desired Extraction(s)
Caribbean Airlines Transforms its Revenue Accounting Process	Caribbean Airlines Transforms its Revenue Accounting	Caribbean Airlines
Scoular Drives Employee Development With Absorb LMS	Scoular Drives Employee Development With Absorb	(Scoular, Absorb LMS)
Oracle Solution Reduces Operating Costs by 25 Percent	Oracle Solution Reduces Operating Costs	Oracle
Pandora CFO Cuts Procurement Time with Coupa	Pandora CFO Cuts Procurement Time	(Pandora, Coupa)
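
To compare alternatives on these examples, the rows of the table above could be encoded as test cases and scored. The following is only a minimal sketch: `score`, `CASES`, and the placeholder extractor `first_word` are hypothetical names introduced here, and exact string match is assumed as the success criterion.

```python
def score(extract, cases):
    """Return (precision, recall) of extracted names against the desired names."""
    true_pos = n_extracted = n_desired = 0
    for sentence, desired in cases:
        extracted = set(extract(sentence))
        true_pos += len(extracted & desired)  # exact string matches only
        n_extracted += len(extracted)
        n_desired += len(desired)
    precision = true_pos / n_extracted if n_extracted else 0.0
    recall = true_pos / n_desired if n_desired else 0.0
    return precision, recall


# Sentences and desired extractions copied from the table above.
CASES = [
    ("Caribbean Airlines Transforms its Revenue Accounting Process", {"Caribbean Airlines"}),
    ("Scoular Drives Employee Development With Absorb LMS", {"Scoular", "Absorb LMS"}),
    ("Oracle Solution Reduces Operating Costs by 25 Percent", {"Oracle"}),
    ("Pandora CFO Cuts Procurement Time with Coupa", {"Pandora", "Coupa"}),
]


def first_word(sentence):
    """Placeholder extractor (hypothetical): treats the first word as the company."""
    return [sentence.split()[0]]


print(score(first_word, CASES))  # prints (0.75, 0.5)
```

Any function that maps a sentence to a list of names can be passed in place of `first_word`, for example one that wraps the spaCy loop and returns the text of each ORG entity.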

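As you can see, the excessive capitalization makes spaCy think the entity names contain many more words than they actually do. So my question is: how can I mitigate this? Is there another library that is less sensitive to this kind of capitalization, or can I preprocess the sentences by "truecasing" them? What is the standard procedure?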

For completeness, here is how I am using the spaCy library:

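import spacy

nlp = spacy.load("en_core_web_md")
for sentence in sentences:  # sentences: strings read from the database
    doc = nlp(sentence)
    for ent in doc.ents:
        if ent.label_ == "ORG":  # keep only organization entities
            ...  # store ent.text in database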

【Comments】:

Tags: python nlp spacy named-entity-recognition

【Solution 1】:

You seem to already have a good idea of the possible solutions...

My suggestions:

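Preprocess and truecase the sentences (quite complex/difficult)
Use another framework that has uncased models (e.g. BERT)
Retrain spaCy's model with enough data (which must be annotated)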
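
As an illustration of the first suggestion, a very simple truecasing baseline might look like the sketch below. It is not a full truecaser: it assumes a reference corpus of normally cased text is available, tokenizes on whitespace only, and replaces each known word with the casing seen most often in that corpus. The names `build_casing_table` and `truecase`, and the tiny `reference` corpus, are made up for this example.

```python
from collections import Counter, defaultdict


def build_casing_table(reference_sentences):
    """Map each lowercased word to its most frequent casing in the reference text."""
    counts = defaultdict(Counter)
    for sentence in reference_sentences:
        # Skip the first token: it is capitalized whatever the word is.
        for token in sentence.split()[1:]:
            counts[token.lower()][token] += 1
    return {word: forms.most_common(1)[0][0] for word, forms in counts.items()}


def truecase(sentence, casing_table):
    """Recase known tokens; leave unknown tokens (often names) unchanged."""
    return " ".join(casing_table.get(tok.lower(), tok) for tok in sentence.split())


# Hypothetical reference corpus; a real one would be far larger.
reference = [
    "The airline transforms its revenue accounting process every year",
    "Our team reduces operating costs with better accounting",
    "She works at Oracle and drives employee development",
]
table = build_casing_table(reference)

print(truecase("Caribbean Airlines Transforms its Revenue Accounting Process", table))
# prints: Caribbean Airlines transforms its revenue accounting process
```

The recased sentence could then be passed to `nlp(...)` in place of the original. Whether that actually improves the extracted entities depends on how well the reference corpus covers the vocabulary, so it is worth checking against a few known examples first.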

【Comments】: