【Question title】: How to extract document embeddings from HuggingFace Longformer
【Posted】: 2020-12-21 19:08:57
【Question】:

I want to do something similar to the following:

import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')
input_ids = torch.tensor(tokenizer.encode("Hello, my dog is cute")).unsqueeze(0)  # Batch size 1
outputs = model(input_ids)
last_hidden_states = outputs[0]  # The last hidden state is the first element of the output tuple

(taken from this thread), but using Longformer.

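The documentation example seems to do something similar, but it is confusing. In particular, I am unsure how to set the attention mask: I assume I want global attention on the [CLS] token, whereas the example sets global attention on what look to me like arbitrary positions.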

>>> import torch
>>> from transformers import LongformerModel, LongformerTokenizer

>>> model = LongformerModel.from_pretrained('allenai/longformer-base-4096', return_dict=True)
>>> tokenizer = LongformerTokenizer.from_pretrained('allenai/longformer-base-4096')

>>> SAMPLE_TEXT = ' '.join(['Hello world! '] * 1000)  # long input document
>>> input_ids = torch.tensor(tokenizer.encode(SAMPLE_TEXT)).unsqueeze(0)  # batch of size 1

>>> # Attention mask values -- 0: no attention, 1: local attention, 2: global attention
>>> attention_mask = torch.ones(input_ids.shape, dtype=torch.long, device=input_ids.device) # initialize to local attention
>>> attention_mask[:, [1, 4, 21,]] = 2  # Set global attention based on the task. For example,
...                                     # classification: the <s> token
...                                     # QA: question tokens
...                                     # LM: potentially on the beginning of sentences and paragraphs
>>> outputs = model(input_ids, attention_mask=attention_mask)
>>> sequence_output = outputs.last_hidden_state
>>> pooled_output = outputs.pooler_output

(taken from here)

【Comments】:

  • Did you ever figure this out?

Tags: huggingface-transformers


【Solution 1】:

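You don't need to touch these values (unless you want to tune how Longformer attends to different tokens). The example you listed forces global attention on the tokens at indices 1, 4, and 21. The documentation uses arbitrary numbers there, but sometimes you may want global attention on a particular kind of token, such as the question tokens in a sequence (e.g. for an input of the form question + context, attending globally only to the first part).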
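As a rough sketch of the classification case mentioned in the documentation comment (global attention on the `<s>` token only), the mask can be built as below. This assumes a more recent `transformers` release, where `LongformerModel` takes a separate `global_attention_mask` argument (1 = global, 0 = local) rather than the value 2 inside `attention_mask`. The helper name `build_global_attention_mask` and the token ids are made up for illustration.

```python
import torch


def build_global_attention_mask(input_ids, global_positions=(0,)):
    """Return a 0/1 mask with 1 at every position that should attend globally.

    Hypothetical helper: position 0 is the <s> token when the tokenizer
    adds special tokens, which is the usual choice for classification.
    """
    mask = torch.zeros_like(input_ids, dtype=torch.long)  # 0 = local attention
    mask[:, list(global_positions)] = 1                   # 1 = global attention
    return mask


# Placeholder ids standing in for tokenizer output (batch of size 1).
input_ids = torch.tensor([[0, 11, 12, 13, 2]])
global_attention_mask = build_global_attention_mask(input_ids)

# With a loaded model, the call would then look like:
# outputs = model(input_ids, global_attention_mask=global_attention_mask)
```

Passing other indices in `global_positions` (for example, the positions of the question tokens) gives the QA-style setup described above.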

If you are only looking for embeddings, you can follow what we discussed here: The last layers of longformer for document embeddings
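The linked discussion is not reproduced here. As a minimal sketch, two common ways to reduce `outputs.last_hidden_state` to one vector per document are taking the hidden state of the first token (`<s>`) or mean-pooling over the non-padding tokens. Which one works better depends on the task, so treat both as assumptions to evaluate rather than a recommendation from the answer. The function names are hypothetical, and a random tensor stands in for the model output.

```python
import torch


def first_token_embedding(last_hidden_state):
    """Use the hidden state of the first token (<s>) as the document vector."""
    return last_hidden_state[:, 0, :]


def mean_pooled_embedding(last_hidden_state, attention_mask):
    """Average hidden states over real tokens, ignoring padding (mask == 0)."""
    mask = attention_mask.unsqueeze(-1).to(last_hidden_state.dtype)
    summed = (last_hidden_state * mask).sum(dim=1)
    counts = mask.sum(dim=1).clamp(min=1.0)  # avoid division by zero
    return summed / counts


# Random stand-in for outputs.last_hidden_state: (batch=2, seq_len=4, hidden=8).
hidden = torch.randn(2, 4, 8)
attention_mask = torch.tensor([[1, 1, 1, 1],
                               [1, 1, 0, 0]])  # second document has 2 padding tokens

cls_vectors = first_token_embedding(hidden)                   # shape (2, 8)
mean_vectors = mean_pooled_embedding(hidden, attention_mask)  # shape (2, 8)
```

With a real model, `hidden` would be `outputs.last_hidden_state` and `attention_mask` the padding mask returned by the tokenizer.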

【Comments】:
