Extending NLP entity extraction: answers

【Question title】: Extending NLP entity extraction
【Posted】: 2017-11-29 12:24:13
【Question description】:

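We want to identify neighborhoods and streets in various cities through a simple search. We work not only with English but also with several languages written in Cyrillic script, and we need to be able to recognize misspelled location names. While looking at Python libraries, I found this one: http://polyglot.readthedocs.io/en/latest/NamedEntityRecognition.html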

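We tried using it but could not find a way to extend its entity recognition database. How can this be done?
If it cannot, are there any other suggestions for multilingual NLP that would help with spell checking and with extracting entities that match a custom database?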

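【Question comments】:

From their documentation: "Polyglot requires a model for each task and language. These models are essential for the library to function." Unfortunately, I don't see any reference on how to train additional models.
That is exactly my question: how do you train these models yourself...
- We provide training datasets for several languages, which you can augment with any new data sources you have: sites.google.com/site/rmyeid/projects/polylgot-ner
- We provide the word embeddings used as features: sites.google.com/site/rmyeid/projects/polyglot
- If you need to train a new model, replicate the work described here: arxiv.org/abs/1410.3791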

Tags: python machine-learning nlp polyglot named-entity-extraction

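【Solution 1】:

Take a look at the pretrained models available from HuggingFace.

There is a multilingual NER model trained on 40 languages, including Cyrillic-script languages such as Russian. It is a fine-tuned version of XLM-RoBERTa, so its accuracy appears to be very good. See the details here: https://huggingface.co/jplu/tf-xlm-r-ner-40-lang
There is also a multilingual DistilBERT model for typo detection, trained on the GitHub Typo Corpus. The corpus appears to contain typos from 15 different languages, including Russian. See the details here: https://huggingface.co/mrm8488/distilbert-base-multi-cased-finetuned-typo-detection

Here is some example code from the documentation, slightly modified for your use case: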

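from transformers import pipeline

# Token-classification pipeline that labels each word piece as "ok" or "typo"
typo_checker = pipeline("ner", model="mrm8488/distilbert-base-multi-cased-finetuned-typo-detection",
                        tokenizer="mrm8488/distilbert-base-multi-cased-finetuned-typo-detection")

result = typo_checker("я живу в Мосве")
result[1:-1]  # drop the first and last entries of the pipeline output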

 #[{'word': 'я', 'score': 0.7886862754821777, 'entity': 'ok', 'index': 1},
 #{'word': 'жив', 'score': 0.6303715705871582, 'entity': 'ok', 'index': 2},
 #{'word': '##у', 'score': 0.7259598970413208, 'entity': 'ok', 'index': 3},
 #{'word': 'в', 'score': 0.7102937698364258, 'entity': 'ok', 'index': 4},
 #{'word': 'М', 'score': 0.5045614242553711, 'entity': 'ok', 'index': 5},
 #{'word': '##ос', 'score': 0.560469925403595, 'entity': 'typo', 'index': 6},
 #{'word': '##ве', 'score': 0.8228507041931152, 'entity': 'ok', 'index': 7}]

result = typo_checker("I live in Moskkow")
result[1:-1]

 #[{'word': 'I', 'score': 0.7598089575767517, 'entity': 'ok', 'index': 1},
 #{'word': 'live', 'score': 0.8173692226409912, 'entity': 'ok', 'index': 2},
 #{'word': 'in', 'score': 0.8289134502410889, 'entity': 'ok', 'index': 3},
 #{'word': 'Mo', 'score': 0.7344270944595337, 'entity': 'ok', 'index': 4},
 #{'word': '##sk', 'score': 0.6559176445007324, 'entity': 'ok', 'index': 5},
 #{'word': '##kow', 'score': 0.8762879967689514, 'entity': 'ok', 'index': 6}]

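Unfortunately, it does not always seem to work (in the second example the misspelled "Moskkow" is labelled "ok" throughout), but it may be good enough for your use case.

Another option is spaCy. It does not have models for as many languages, but spaCy's EntityRuler makes it easy to define new entities by hand, i.e. to "extend the entity recognition database".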
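
To make the EntityRuler suggestion concrete, here is a minimal sketch, assuming spaCy v3 (where the component is added with nlp.add_pipe("entity_ruler")) and a blank multi-language pipeline ("xx") so that no model download is needed. The labels STREET and NEIGHBORHOOD, the place names, and the helper extract_places are made up for illustration; in practice the patterns would be generated from your own database.

```python
import spacy

# Blank multi-language pipeline: tokenizer only, no pretrained model required.
nlp = spacy.blank("xx")

# Assumes spaCy v3; in v2 the EntityRuler was constructed and added differently.
ruler = nlp.add_pipe("entity_ruler")

# Hypothetical patterns standing in for rows of a custom location database.
ruler.add_patterns([
    # A plain string is matched exactly, token by token.
    {"label": "NEIGHBORHOOD", "pattern": "Хамовники"},
    # Token patterns on LOWER make the match case-insensitive.
    {"label": "STREET", "pattern": [{"LOWER": "арбат"}]},
    {"label": "STREET", "pattern": [{"LOWER": "tverskaya"}, {"LOWER": "street"}]},
])


def extract_places(text):
    """Return (text, label) pairs for every entity the ruler matched."""
    doc = nlp(text)
    return [(ent.text, ent.label_) for ent in doc.ents]


print(extract_places("Я живу на улице арбат в районе Хамовники"))
print(extract_places("I live on Tverskaya Street"))
```

Note that these patterns only match exact tokens (ignoring case where LOWER is used), so on their own they will not catch misspellings such as "Мосве"; they would need to be combined with a typo-detection or fuzzy-matching step.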

【Comments】: