【Question title】: How can I get indexes after getting NER results?
【Posted】: 2021-12-08 09:54:24
【Question description】:
import torch
from transformers import AutoModelForTokenClassification, BertTokenizer

model = AutoModelForTokenClassification.from_pretrained("dbmdz/bert-large-cased-finetuned-conll03-english")
tokenizer = BertTokenizer.from_pretrained("bert-base-cased")



label_list = [
    "O",       # Outside of a named entity
    "B-MISC",  # Beginning of a miscellaneous entity right after another miscellaneous entity
    "I-MISC",  # Miscellaneous entity
    "B-PER",   # Beginning of a person's name right after another person's name
    "I-PER",   # Person's name
    "B-ORG",   # Beginning of an organisation right after another organisation
    "I-ORG",   # Organisation
    "B-LOC",   # Beginning of a location right after another location
    "I-LOC"    # Location
]

sequence = "Hugging Face Inc. is a company based in New York City. Its headquarters are in DUMBO, therefore very" \
           "close to the Manhattan Bridge."

# Bit of a hack to get the tokens with the special tokens
tokens = tokenizer.tokenize(tokenizer.decode(tokenizer.encode(sequence)))
inputs = tokenizer.encode(sequence, return_tensors="pt")

outputs = model(inputs)[0]
predictions = torch.argmax(outputs, dim=2)

print([(token, label_list[prediction]) for token, prediction in zip(tokens, predictions[0].tolist())])

Output:

[('[CLS]', 'O'), ('Hu', 'I-ORG'), ('##gging', 'I-ORG'), ('Face', 'I-ORG'), ('Inc', 'I-ORG'),
 ('.', 'O'), ('is', 'O'), ('a', 'O'), ('company', 'O'), ('based', 'O'), ('in', 'O'),
 ('New', 'I-LOC'), ('York', 'I-LOC'), ('City', 'I-LOC'), ('.', 'O'), ('Its', 'O'), ('headquarters', 'O'),
 ('are', 'O'), ('in', 'O'), ('D', 'I-LOC'), ('##UM', 'I-LOC'), ('##BO', 'I-LOC'), (',', 'O'),
 ('therefore', 'O'), ('very', 'O'), ('##c', 'O'), ('##lose', 'O'), ('to', 'O'), ('the', 'O'),
 ('Manhattan', 'I-LOC'), ('Bridge', 'I-LOC'), ('.', 'O'), ('[SEP]', 'O')]

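I took an example from the Hugging Face Transformers documentation to understand how the library works, but I have run into a problem that I have not been able to solve for a long time. After getting the output of the print call, I want to get the indexes of the recognized entities within the "sequence" variable. How can I do this? I could not find any method for it in the documentation. Am I missing something?

For example:

('Hu', 'I-ORG'), ('##gging', 'I-ORG'), ('Face', 'I-ORG'), ('Inc', 'I-ORG') ---> (start: 0, end: 16)

Another question: should I strip the ## from my results (e.g. ('##gging', 'I-ORG')), or is it fine to leave it?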

【Comments on the question】:

    Tags: python huggingface-transformers


    【Answer 1】:

    Everything you want to achieve is already available as the TokenClassificationPipeline:

    from transformers import pipeline
    
    ner = pipeline('token-classification', model='dbmdz/bert-large-cased-finetuned-conll03-english', tokenizer='dbmdz/bert-large-cased-finetuned-conll03-english')
    
    sentence = "Hugging Face Inc. is a company based in New York City. Its headquarters are in DUMBO, therefore very" \
               "close to the Manhattan Bridge."
    
    ner(sentence)
    

    Output:

    [{'end': 2,
      'entity': 'I-ORG',
      'index': 1,
      'score': 0.9995108,
      'start': 0,
      'word': 'Hu'},
     {'end': 7,
      'entity': 'I-ORG',
      'index': 2,
      'score': 0.98959744,
      'start': 2,
      'word': '##gging'},
     {'end': 12,
      'entity': 'I-ORG',
      'index': 3,
      'score': 0.9979704,
      'start': 8,
      'word': 'Face'},
     {'end': 16,
      'entity': 'I-ORG',
      'index': 4,
      'score': 0.9993759,
      'start': 13,
      'word': 'Inc'},
     {'end': 43,
      'entity': 'I-LOC',
      'index': 11,
      'score': 0.9993406,
      'start': 40,
      'word': 'New'},
     {'end': 48,
      'entity': 'I-LOC',
      'index': 12,
      'score': 0.99919283,
      'start': 44,
      'word': 'York'},
     {'end': 53,
      'entity': 'I-LOC',
      'index': 13,
      'score': 0.99934113,
      'start': 49,
      'word': 'City'},
     {'end': 80,
      'entity': 'I-LOC',
      'index': 19,
      'score': 0.9863364,
      'start': 79,
      'word': 'D'},
     {'end': 82,
      'entity': 'I-LOC',
      'index': 20,
      'score': 0.939624,
      'start': 80,
      'word': '##UM'},
     {'end': 84,
      'entity': 'I-LOC',
      'index': 21,
      'score': 0.9121385,
      'start': 82,
      'word': '##BO'},
     {'end': 122,
      'entity': 'I-LOC',
      'index': 29,
      'score': 0.983919,
      'start': 113,
      'word': 'Manhattan'},
     {'end': 129,
      'entity': 'I-LOC',
      'index': 30,
      'score': 0.99242425,
      'start': 123,
      'word': 'Bridge'}]
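
To illustrate how the per-token `start`/`end` values relate to whole entities, here is a minimal hand-rolled sketch that merges adjacent tokens with the same label into one span. It is not the pipeline's own implementation; the helper name `group_entities` and the bookkeeping key `last_index` are hypothetical, and the input is a copy of the token-level output above with the scores left out. The sentence is reproduced exactly as in the question, including the missing space between "very" and "close".

```python
from typing import Dict, List

sentence = "Hugging Face Inc. is a company based in New York City. Its headquarters are in DUMBO, therefore very" \
           "close to the Manhattan Bridge."

# Token-level results copied from the output above (scores omitted for brevity).
token_results = [
    {"entity": "I-ORG", "index": 1, "start": 0, "end": 2, "word": "Hu"},
    {"entity": "I-ORG", "index": 2, "start": 2, "end": 7, "word": "##gging"},
    {"entity": "I-ORG", "index": 3, "start": 8, "end": 12, "word": "Face"},
    {"entity": "I-ORG", "index": 4, "start": 13, "end": 16, "word": "Inc"},
    {"entity": "I-LOC", "index": 11, "start": 40, "end": 43, "word": "New"},
    {"entity": "I-LOC", "index": 12, "start": 44, "end": 48, "word": "York"},
    {"entity": "I-LOC", "index": 13, "start": 49, "end": 53, "word": "City"},
    {"entity": "I-LOC", "index": 19, "start": 79, "end": 80, "word": "D"},
    {"entity": "I-LOC", "index": 20, "start": 80, "end": 82, "word": "##UM"},
    {"entity": "I-LOC", "index": 21, "start": 82, "end": 84, "word": "##BO"},
    {"entity": "I-LOC", "index": 29, "start": 113, "end": 122, "word": "Manhattan"},
    {"entity": "I-LOC", "index": 30, "start": 123, "end": 129, "word": "Bridge"},
]


def group_entities(tokens: List[Dict], text: str) -> List[Dict]:
    """Merge consecutive tokens that share a label into character spans (hypothetical helper)."""
    groups: List[Dict] = []
    for tok in tokens:
        label = tok["entity"].split("-", 1)[-1]  # "I-ORG" -> "ORG"
        prev = groups[-1] if groups else None
        continues_prev = (
            prev is not None
            and tok["index"] == prev["last_index"] + 1  # directly follows the previous token
            and prev["entity_group"] == label
            and not tok["entity"].startswith("B-")      # a B- tag always opens a new entity
        )
        if continues_prev:
            prev["end"] = tok["end"]
            prev["last_index"] = tok["index"]
        else:
            groups.append({"entity_group": label, "start": tok["start"],
                           "end": tok["end"], "last_index": tok["index"]})
    # Slice the original string instead of joining word pieces, so no "##" handling is needed.
    return [{"entity_group": g["entity_group"], "start": g["start"], "end": g["end"],
             "word": text[g["start"]:g["end"]]} for g in groups]


spans = group_entities(token_results, sentence)
for span in spans:
    print(span)
```

Because the entity text is taken by slicing the original string with the merged offsets, the `##` prefixes never need to be stripped by hand in this sketch.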
    

    You can also group the tokens into whole entities by specifying an aggregation strategy:

    ner(sentence, aggregation_strategy='simple')
    

    Output:

    [{'end': 16,
      'entity_group': 'ORG',
      'score': 0.9966136,
      'start': 0,
      'word': 'Hugging Face Inc'},
     {'end': 53,
      'entity_group': 'LOC',
      'score': 0.9992916,
      'start': 40,
      'word': 'New York City'},
     {'end': 84,
      'entity_group': 'LOC',
      'score': 0.946033,
      'start': 79,
      'word': 'DUMBO'},
     {'end': 129,
      'entity_group': 'LOC',
      'score': 0.98817164,
      'start': 113,
      'word': 'Manhattan Bridge'}]
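
As a quick sanity check, the `start` and `end` values can be used to slice the input string directly. The sketch below hard-codes the aggregated output shown above (scores omitted) rather than calling the model, and the variable names `entities` and `recovered` are only illustrative. Note that the offsets refer to the string exactly as it was passed in: the two string literals are concatenated without a space, so the text contains "veryclose", and "Manhattan" therefore starts at 113.

```python
sentence = "Hugging Face Inc. is a company based in New York City. Its headquarters are in DUMBO, therefore very" \
           "close to the Manhattan Bridge."

# Aggregated results copied from the output above (scores omitted for brevity).
entities = [
    {"entity_group": "ORG", "start": 0, "end": 16, "word": "Hugging Face Inc"},
    {"entity_group": "LOC", "start": 40, "end": 53, "word": "New York City"},
    {"entity_group": "LOC", "start": 79, "end": 84, "word": "DUMBO"},
    {"entity_group": "LOC", "start": 113, "end": 129, "word": "Manhattan Bridge"},
]

# `end` is exclusive, so ordinary Python slicing recovers the surface text.
recovered = [(e["entity_group"], sentence[e["start"]:e["end"]]) for e in entities]

for (label, text), e in zip(recovered, entities):
    print(f"{label}: {text!r} at ({e['start']}, {e['end']})")
```

For this example the sliced text matches the `word` field of every entity; in general it seems safer to treat the slice of the original string as the reference, since `word` is reconstructed from the tokens.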
    

    【Comments】:
