【Question title】: Deleting the sentence and updating the index
【Posted】: 2021-10-22 08:27:52
【Question description】:

I am working with data in the following format.

data = [{"content":'''Hello I am Aniyya. I enjoy playing Football.
I love eating grapes''',"annotations":[{"id":1,"start":11,"end":17,"tag":"name"},
                                {"id":2,"start":59,"end":65,"tag":"fruit"}]}]

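I want the data in the following format. Sentences that do not contain any entity must be removed, and the start and end of the remaining entities must be updated to account for the removed sentences.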

result_data = data = [{"content":'''Hello I am Aniyya. I love eating grapes''',"annotations":[{"id":1,"start":11,"end":17,"tag":"name"},
                                {"id":2,"start":33,"end":39,"tag":"fruit"}]}]

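I couldn't come up with any specific logic. I know this is like asking for the code to be written for me, but I would be grateful if someone has time to help. I'm somewhat stuck on this. I asked a similar question before, but it did not solve my problem either, so I decided to describe it in more detail. A solution would help everyone preparing datasets for NLP tasks. Thanks in advance.

The visualization was done with spaCy displacy; the code is in visualizing NER training data and entity using displacy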

【Discussion】:

    Tags: python python-3.x string nlp spacy


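    【Solution 1】:

    From what I can see in the question, there is a delimiter that separates sentences, namely "." (a period). With it you can split the text into separate units, then check for each sentence whether it carries an annotation, and otherwise delete or splice that sentence out of the string.

    I've written a draft solution for this, and it gets the job done. Feel free to suggest any changes. You may also need to adapt it to your exact requirements.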

    # The "\n" after "Football." matches the question's data, so offsets 59-65 point at "grapes".
    data = [{"content": "Hello I am Aniyya. I enjoy playing Football.\nI love eating grapes",
             "annotations": [{"id": 1, "start": 11, "end": 17, "tag": "name"},
                             {"id": 2, "start": 59, "end": 65, "tag": "fruit"}]}]
    identifier = '#'  # marker character; assumed not to occur in the content

    def processRow(row):
        # Mark the last character of every annotated span with the identifier.
        temp = row["content"]
        for annotation in row["annotations"]:
            end = annotation["end"] - 1
            temp = temp[:end] + identifier + temp[end+1:]

        # Split the marked and the original text on "." so the pieces line up.
        marked = temp.split(".")
        original = row["content"].split(".")

        # Keep only the original sentences whose marked copy contains the identifier.
        result = ""
        for markedSentence, sentence in zip(marked, original):
            if identifier in markedSentence:
                result = result + sentence + "."
        return result

    def processData(data):
        # Only "content" is rewritten here; the annotation offsets are left unchanged.
        for row in data:
            row["content"] = processRow(row)
        print(data)


    processData(data)
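
    The draft above filters the sentences but leaves the annotation offsets untouched. As a rough sketch of how the offsets could be shifted as well, the following records each sentence's character span and moves every kept annotation by the amount its sentence moved. The function name `drop_unannotated_sentences` is made up for illustration, and two things are assumptions rather than part of the question: sentences are split naively on ".", and kept sentences are rejoined with a single space.

```python
import re

def drop_unannotated_sentences(row):
    """Keep only sentences that contain an annotation and shift the offsets."""
    content = row["content"]
    kept_sentences = []
    new_annotations = []
    new_length = 0  # length of the rebuilt content so far

    # Naive segmentation (assumption): a sentence runs up to and including the next ".".
    for match in re.finditer(r"[^.]+\.?", content):
        sent_start, sent_end = match.span()
        inside = [a for a in row["annotations"]
                  if sent_start <= a["start"] and a["end"] <= sent_end]
        if not inside:
            continue  # no entity in this sentence, so drop it

        raw = match.group()
        text = raw.strip()
        # Position of the first non-whitespace character in the original content.
        text_start = sent_start + (len(raw) - len(raw.lstrip()))
        # Kept sentences are joined with a single space (assumption).
        new_start = new_length + 1 if kept_sentences else 0
        shift = new_start - text_start

        for a in inside:
            new_annotations.append(
                {**a, "start": a["start"] + shift, "end": a["end"] + shift})
        kept_sentences.append(text)
        new_length = new_start + len(text)

    return {"content": " ".join(kept_sentences), "annotations": new_annotations}


data = [{"content": "Hello I am Aniyya. I enjoy playing Football.\nI love eating grapes",
         "annotations": [{"id": 1, "start": 11, "end": 17, "tag": "name"},
                         {"id": 2, "start": 59, "end": 65, "tag": "fruit"}]}]
result = [drop_unannotated_sentences(row) for row in data]
print(result)
```

    With the question's data this yields the content 'Hello I am Aniyya. I love eating grapes' and moves the "fruit" annotation to 33-39. Splitting on "." will mis-segment abbreviations such as "Dr.", so a proper sentence splitter may be needed for real text.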
    

    【Comments】:

    • The start and end offsets of the second sentence are not updated for the new content. The rest is great.
    【Solution 2】:
    import re
    
    data = [{"content":'''Hello I am Aniyya. I enjoy playing Football.
    I love eating grapes. Aniyya is great.''',"annotations":[{"id":1,"start":11,"end":17,"tag":"name"},
                                    {"id":2,"start":59,"end":65,"tag":"fruit"},
                                    {"id":3,"start":67,"end":73,"tag":"name"}]}]
             
             
             
    # Record the annotated text itself so it can be located again later.
    for each in data[0]['annotations']:
        each['word'] = data[0]['content'][each['start']:each['end']]

    # Split on '.', trim whitespace, and track which sentences were already used.
    sentences = [{'sentence': x.strip() + '.', 'checked': False} for x in data[0]['content'].split('.')]

    new_data = [{'content': '', 'annotations': []}]
    for each in data[0]['annotations']:
        for sentence in sentences:
            if sentence['checked']:
                continue
            temp = each.copy()
            check_word = temp['word']
            # re.escape keeps regex metacharacters in the word from being interpreted.
            match = re.search(r'\b{}\b'.format(re.escape(check_word)), sentence['sentence'])
            if match:
                start_idx = match.start()
                end_idx = start_idx + len(check_word)

                # Offsets must be relative to the full new content, not the single sentence.
                current_len = len(new_data[0]['content'])

                new_data[0]['content'] += sentence['sentence'] + ' '
                temp.update({'start': start_idx + current_len, 'end': end_idx + current_len})
                new_data[0]['annotations'].append(temp)

                sentence['checked'] = True
                break
    

    Output:

    print(new_data)
    [{'content': 'Hello I am Aniyya. I love eating grapes. Aniyya is great. ', 'annotations': [{'id': 1, 'start': 11, 'end': 17, 'tag': 'name', 'word': 'Aniyya'}, {'id': 2, 'start': 33, 'end': 39, 'tag': 'fruit', 'word': 'grapes'}, {'id': 3, 'start': 41, 'end': 47, 'tag': 'name', 'word': 'Aniyya'}]}]
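
    One way to sanity-check a result like this is to slice the new content with each annotation's offsets and compare the slice with the stored 'word'. The helper name `misaligned_annotations` below is hypothetical, and the check assumes every annotation carries the 'word' key that the code above adds.

```python
def misaligned_annotations(row):
    """Return the annotations whose [start:end] slice differs from their 'word'."""
    content = row["content"]
    return [a for a in row["annotations"]
            if content[a["start"]:a["end"]] != a["word"]]


new_data = [{'content': 'Hello I am Aniyya. I love eating grapes. Aniyya is great. ',
             'annotations': [
                 {'id': 1, 'start': 11, 'end': 17, 'tag': 'name', 'word': 'Aniyya'},
                 {'id': 2, 'start': 33, 'end': 39, 'tag': 'fruit', 'word': 'grapes'},
                 {'id': 3, 'start': 41, 'end': 47, 'tag': 'name', 'word': 'Aniyya'}]}]

print(misaligned_annotations(new_data[0]))  # prints [] because every offset lines up
```

    An empty list means every offset points at the text it claims to annotate; any annotation that comes back has start/end values that need another look.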
    

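    【Comments】:

    • Good job. But there is a small problem with the start and end positions of the second sentence. [{'content': 'Hello I am Aniyya. I love eating grapes. ', 'annotations': [{'id': 1, 'start': 11, 'end': 17, 'tag': 'name', 'word': 'Aniyya'}, {'id': 2, 'start': 14, 'end': 20, 'tag': 'fruit', 'word': 'grapes'}]}]
    • Nearly perfect. As @Nebu-Lin pointed out, the start and end keys of the second sentence are not updated correctly.
    • Ah. Yes, I see. I used the index within the single sentence as the start, whereas it needs to be the index within the full content. Give me a minute to fix it.
    • @aniyya08, updated. It works now.
    • I've provided another solution for this here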