Deleting the sentence and updating the index

【Question title】: Deleting the sentence and updating the index
【Posted】: 2021-10-22 08:27:52
【Question description】:

I am working with data in the following format.

data = [{"content":'''Hello I am Aniyya. I enjoy playing Football.
I love eating grapes''',"annotations":[{"id":1,"start":11,"end":17,"tag":"name"},
                                {"id":2,"start":59,"end":65,"tag":"fruit"}]}]

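I want the data in the format below. Sentences that contain no entity must be removed, and the start and end of the remaining entities must be updated to account for the removed sentences.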

result_data = [{"content":'''Hello I am Aniyya. I love eating grapes''',"annotations":[{"id":1,"start":11,"end":17,"tag":"name"},
                                {"id":2,"start":33,"end":39,"tag":"fruit"}]}]
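The offsets in this format are character positions into `content`, with `end` exclusive, so `content[start:end]` should return the annotated text. A quick check along these lines (the helper name `annotated_text` is only illustrative and not part of the question) confirms that the indices in both the input and the desired output line up:

```python
def annotated_text(row):
    """Return the substring that each annotation's start/end points at."""
    return [row["content"][a["start"]:a["end"]] for a in row["annotations"]]


data = [{"content": "Hello I am Aniyya. I enjoy playing Football.\nI love eating grapes",
         "annotations": [{"id": 1, "start": 11, "end": 17, "tag": "name"},
                         {"id": 2, "start": 59, "end": 65, "tag": "fruit"}]}]

result_data = [{"content": "Hello I am Aniyya. I love eating grapes",
                "annotations": [{"id": 1, "start": 11, "end": 17, "tag": "name"},
                                {"id": 2, "start": 33, "end": 39, "tag": "fruit"}]}]

print(annotated_text(data[0]))         # ['Aniyya', 'grapes']
print(annotated_text(result_data[0]))  # ['Aniyya', 'grapes']
```

The same check is a convenient way to validate any candidate solution: after processing, every slice should still equal the original entity text.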

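I could not come up with any specific logic for this. I know this is like asking for the code to be written for me, but I would be grateful if someone has time to help. I am somewhat stuck on this. I asked a similar question earlier, but it did not solve my problem, so I decided to describe it in more detail. A solution would help everyone preparing datasets for NLP tasks. Thanks in advance.

The visualization was done with spaCy displacy; the code is in "visualizing NER training data and entity using displacy".

【Question comments】:

Tags: python python-3.x string nlp spacy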

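【Solution 1】:

From what I see in the question, sentences are separated by a delimiter, namely "." (a period). You can use it to split the content into separate sentences, then check for each sentence whether it contains an annotation; if it does not, remove that sentence from the string.

I have written a draft solution along these lines, and it does the job. Feel free to suggest changes. You may also need to adjust it to your exact requirements.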

data = [{"content":'''Hello I am Aniyya. I enjoy playing Football.
I love eating grapes''',"annotations":[{"id":1,"start":11,"end":17,"tag":"name"},
                                {"id":2,"start":59,"end":65,"tag":"fruit"}]}]
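identifier = '#'  # marker character; assumed not to occur in the content

def processRow(row):
    # Overwrite the last character of every annotated span with the marker.
    # The marked copy keeps the same length, so its "." positions are unchanged.
    temp = row["content"]
    for annotation in row["annotations"]:
        end = annotation["end"] - 1
        temp = temp[:end] + identifier + temp[end+1:]

    # Split the marked copy and the original content on "." in parallel.
    markedSentences = temp.split(".")
    originalSentences = row["content"].split(".")

    # Keep only the original sentences whose marked copy contains the marker.
    result = ""
    for marked, original in zip(markedSentences, originalSentences):
        if identifier in marked:
            result = result + original + "."

    return result

def processData(data):
    # Note: only "content" is rewritten; the annotation offsets are left as they were.
    for row in data:
        row["content"] = processRow(row)
    print(data)


processData(data)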

【Comments】:

The start and end of the annotation in the second sentence are not updated for the new content. The rest is great.
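As the comment above notes, the draft drops the sentence but leaves `start` and `end` untouched. A minimal sketch of one way to do both in a single pass follows. It is not the answerer's code: the function name `drop_unannotated_sentences` is made up for illustration, and it assumes that sentences end at a "." (so abbreviations such as "Dr." would be split incorrectly) and that kept sentences should be rejoined with a single space.

```python
import re


def drop_unannotated_sentences(row):
    """Keep only sentences that contain an annotation and shift the offsets."""
    content = row["content"]
    kept_sentences = []   # text of the sentences that survive
    new_annotations = []  # annotations with offsets into the rebuilt content
    new_len = 0           # length of the content rebuilt so far

    # Assumption: a sentence is a run of non-period characters, an optional
    # closing period, and any whitespace that follows it.
    for match in re.finditer(r"[^.]+\.?\s*", content):
        s_start, s_end = match.span()
        inside = [a for a in row["annotations"]
                  if s_start <= a["start"] and a["end"] <= s_end]
        if not inside:
            continue  # no entity in this sentence: drop it

        # Move each offset from the old sentence position to the new one.
        shift = new_len - s_start
        for a in inside:
            new_annotations.append(
                dict(a, start=a["start"] + shift, end=a["end"] + shift))

        sentence = match.group().rstrip()
        kept_sentences.append(sentence)
        new_len += len(sentence) + 1  # +1 for the joining space

    return {"content": " ".join(kept_sentences), "annotations": new_annotations}


data = [{"content": "Hello I am Aniyya. I enjoy playing Football.\nI love eating grapes",
         "annotations": [{"id": 1, "start": 11, "end": 17, "tag": "name"},
                         {"id": 2, "start": 59, "end": 65, "tag": "fruit"}]}]

result = drop_unannotated_sentences(data[0])
print(result["content"])      # Hello I am Aniyya. I love eating grapes
print(result["annotations"])  # grapes now spans 33 to 39
```

Working from the original offsets rather than searching for the entity text again avoids ambiguity when the same word appears in more than one sentence. For real text, a proper sentence segmenter would be more robust than splitting on periods.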

【Solution 2】:

import re

data = [{"content":'''Hello I am Aniyya. I enjoy playing Football.
I love eating grapes. Aniyya is great.''',"annotations":[{"id":1,"start":11,"end":17,"tag":"name"},
                                {"id":2,"start":59,"end":65,"tag":"fruit"},
                                {"id":3,"start":67,"end":73,"tag":"name"}]}]

# Store the annotated text on each annotation so it can be located again later.
for each in data[0]['annotations']:
    each['word'] = data[0]['content'][each['start']:each['end']]

# Split into sentences; 'checked' marks sentences already copied to the output.
sentences = [ {'sentence':x.strip() + '.','checked':False} for x in data[0]['content'].split('.')]

new_data = [{'content':'', 'annotations':[]}]
for each in data[0]['annotations']:
    for sentence in sentences:
        if sentence['checked']:
            continue
        temp = each.copy()
        check_word = temp['word']
        # Whole-word search; re.escape guards against regex metacharacters in the word.
        match = re.search(r'\b({})\b'.format(re.escape(check_word)), sentence['sentence'])
        if match:
            # Offset inside this sentence only.
            start_idx = match.start()
            end_idx = start_idx + len(check_word)

            # Length of the content rebuilt so far, used to convert the
            # per-sentence offset into an offset in the full new content.
            current_len = len(new_data[0]['content'])

            new_data[0]['content'] += sentence['sentence'] + ' '
            temp.update({'start':start_idx + current_len, 'end':end_idx + current_len})
            new_data[0]['annotations'].append(temp)

            sentence['checked'] = True
            break

Output:

print(new_data)
[{'content': 'Hello I am Aniyya. I love eating grapes. Aniyya is great. ', 'annotations': [{'id': 1, 'start': 11, 'end': 17, 'tag': 'name', 'word': 'Aniyya'}, {'id': 2, 'start': 33, 'end': 39, 'tag': 'fruit', 'word': 'grapes'}, {'id': 3, 'start': 41, 'end': 47, 'tag': 'name', 'word': 'Aniyya'}]}]

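【Comments】:

Nice work. But there is a small problem with the start and end positions for the second sentence. [{'content': 'Hello I am Aniyya. I love eating grapes. ', 'annotations': [{'id': 1, 'start': 11, 'end': 17, 'tag': 'name', 'word': 'Aniyya'}, {'id': 2, 'start': 14, 'end': 20, 'tag': 'fruit', 'word': 'grapes'}]}]
Nearly perfect. As @Nebu-Lin pointed out, the start and end keys for the second sentence are not updated correctly.
Ah. Yes, I see it. I was using the index within the individual sentence, whereas it needs to be the index within the full content. Give me a minute to fix it.
@aniyya08, updated. It works now.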
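The fix discussed in these comments comes down to one addition: an entity's position in the rebuilt content is its position inside its own sentence plus the length of everything already appended. A small worked example, with variable names chosen only for illustration:

```python
kept_so_far = "Hello I am Aniyya. "     # content already rebuilt (19 characters)
sentence = "I love eating grapes."

local_start = sentence.index("grapes")  # offset inside the sentence only
global_start = len(kept_so_far) + local_start
global_end = global_start + len("grapes")

new_content = kept_so_far + sentence
print(local_start, global_start, global_end)  # 14 33 39
print(new_content[global_start:global_end])   # grapes
```

Using `local_start` directly gives the 14 and 20 reported in the first comment; adding the length of the preceding content gives the expected 33 and 39.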
I have posted another solution for this here