Finding the Start and End char indices in Spacy

【Question title】: Finding the Start and End char indices in Spacy
【Posted】: 2021-02-18 12:47:44
【Question】:

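I am training a custom model in Spacy to extract custom entities. The training data I provide has to contain my entities together with their index positions, and I would like to know whether there is a faster way to assign those index values to the keywords I am looking for in a given sentence of my training data.

An example of my training data: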

TRAIN_DATA = [
    ('Behaviour Skills include Communication, Conflict Resolution, Work Life Balance',
     {'entities': [(25, 37, 'BS'), (40, 60, 'BS'), (62, 79, 'BS')]}),
]

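To pass the index positions of specific keywords in my training data, I am currently counting them by hand.

For example, in the first line, "Behaviour Skills include Communication ...", I manually worked out the index position of the word "Communication" as 25,37.

I am sure there must be a way to identify these index positions other than counting them manually. Any ideas on how to achieve this?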

【Comments】:

Have you tried Python's str.find()?
No. But to use it, wouldn't I have to look up each word separately?

Tags: python-3.x nlp spacy indices named-entity-recognition

【Solution 1】:

Using str.find() can help. However, you have to loop over both the sentences and the keywords:

keywords = ['Communication', 'Conflict Resolution', 'Work Life Balance']
texts = ['Behaviour Skills include Communication, Conflict Resolution, Work Life Balance', 
        'Some sentence where lower case conflict resolution is included']

LABEL = 'BS'
TRAIN_DATA = []

for text in texts:
    entities = []
    t_low = text.lower()
    for keyword in keywords:
        k_low = keyword.lower()
        begin = t_low.find(k_low) # index if substring found and -1 otherwise
        if begin != -1:
            end = begin + len(keyword)
            entities.append((begin, end, LABEL))
    TRAIN_DATA.append((text, {'entities': entities}))

Output:

[('Behaviour Skills include Communication, Conflict Resolution, Work Life Balance', 
{'entities': [(25, 38, 'BS'), (40, 59, 'BS'), (61, 78, 'BS')]}), 
('Some sentence where lower case conflict resolution is included', 
{'entities': [(31, 50, 'BS')]})]

I added str.lower() in case you need the matching to be case-insensitive.
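
Note that str.find() only reports the first occurrence of each keyword. If a keyword can appear more than once in a sentence, a variant along the following lines may help. It is only a sketch using the standard re module: the helper name find_entities and the sample sentence are made up for illustration, and the \b word boundaries assume that every keyword starts and ends with a letter or digit.

```python
import re

def find_entities(text, keywords, label):
    """Return sorted (start, end, label) tuples for every keyword occurrence."""
    entities = []
    for keyword in keywords:
        # re.escape guards against regex metacharacters inside a keyword;
        # \b stops a keyword from matching in the middle of a longer word.
        pattern = re.compile(r'\b' + re.escape(keyword) + r'\b', re.IGNORECASE)
        for match in pattern.finditer(text):
            entities.append((match.start(), match.end(), label))
    return sorted(entities)

text = 'Communication matters: communication and Conflict Resolution'
entities = find_entities(text, ['Communication', 'Conflict Resolution'], 'BS')
print(entities)

# Sanity check: slicing the text with each offset pair gives back the match.
for start, end, _ in entities:
    print(repr(text[start:end]))
```

Slicing the text with the computed offsets, as in the last loop, is a cheap way to confirm that each span really covers the intended keyword before the data is used for training. If one keyword can be contained in another, the resulting spans may overlap and would need to be filtered first.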
