【Posted】: 2021-05-06 20:24:06
【Question】:
I am following this post to extract sentence embeddings. For a single sentence, the steps are described as follows:
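import torch
from transformers import BertTokenizer, BertModel

# Assumed: the tokenizer that matches the model loaded below.
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
text = "After stealing money from the bank vault, the bank robber was seen " \
       "fishing on the Mississippi river bank."
# Add the special tokens.
marked_text = "[CLS] " + text + " [SEP]"
# Split the sentence into tokens.
tokenized_text = tokenizer.tokenize(marked_text)
# Map the token strings to their vocabulary indices.
indexed_tokens = tokenizer.convert_tokens_to_ids(tokenized_text)
# Mark each of the 22 tokens as belonging to sentence "1".
segments_ids = [1] * len(tokenized_text)
# Convert inputs to PyTorch tensors
tokens_tensor = torch.tensor([indexed_tokens])
segments_tensors = torch.tensor([segments_ids])
# Load pre-trained model (weights)
model = BertModel.from_pretrained('bert-base-uncased',
                                  output_hidden_states=True,
                                  )
# Put the model in "evaluation" mode, meaning feed-forward operation.
model.eval()
with torch.no_grad():
    # Pass the segment ids by keyword: in transformers, the second
    # positional argument of BertModel.forward is attention_mask.
    outputs = model(tokens_tensor, token_type_ids=segments_tensors)
    hidden_states = outputs[2]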
I want to do this for a batch of sequences. Here is my example code:
seql = ['this is an example', 'today was sunny and', 'today was']
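# padding='max_length' with truncation=True replaces the deprecated pad_to_max_length=True.
encoded = [tokenizer.encode(seq, max_length=5, padding='max_length', truncation=True) for seq in seql]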
encoded
[[2, 2511, 1840, 3251, 3],
[2, 1663, 2541, 1957, 3],
[2, 1663, 2541, 3, 0]]
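Because I am working with a batch, the sequences need to have the same length. So I introduced a padding token (see the third sentence), which leaves me confused about the following points: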
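- What should the segment id of the pad_token (0) be?
- Should I use attention masking when feeding the tensors to the model, so that the padding is ignored? In the example, only the token and segment tensors are used:
  outputs = model(tokens_tensor, segments_tensors)
- If I process single sentences instead of a batch, I probably don't need the padding token. Would that be better than batching?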
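To make the second point concrete, here is a minimal sketch of what an attention mask for the padded batch above looks like, and of how such a mask can keep the padded position out of a mean-pooled sentence vector. It assumes a pad id of 0, as in the encoded output. `attention_mask_from_ids` and `masked_mean` are hypothetical helpers written for illustration, not library functions, and the vectors are made-up numbers rather than model output.

```python
from typing import List

PAD_ID = 0  # assumption: pad id 0, as in the encoded output above


def attention_mask_from_ids(batch_ids: List[List[int]], pad_id: int = PAD_ID) -> List[List[int]]:
    """Return 1 for real tokens and 0 for padding positions."""
    return [[0 if tok == pad_id else 1 for tok in row] for row in batch_ids]


def masked_mean(token_vectors: List[List[float]], mask: List[int]) -> List[float]:
    """Average the token vectors of one sequence, skipping masked-out positions."""
    kept = [vec for vec, keep in zip(token_vectors, mask) if keep]
    if not kept:
        raise ValueError("mask excludes every position")
    dim = len(kept[0])
    return [sum(vec[i] for vec in kept) / len(kept) for i in range(dim)]


encoded = [[2, 2511, 1840, 3251, 3],
           [2, 1663, 2541, 1957, 3],
           [2, 1663, 2541, 3, 0]]
attention_mask = attention_mask_from_ids(encoded)

# Toy 2-dimensional "hidden states" for the third sequence (made-up numbers,
# not real model output); the last row stands in for the padded position.
toy_vectors = [[1.0, 0.0], [3.0, 2.0], [5.0, 4.0], [7.0, 6.0], [100.0, 100.0]]
pooled = masked_mean(toy_vectors, attention_mask[2])

print(attention_mask)
print(pooled)
```

In Hugging Face transformers, a mask of this shape is what models accept through the `attention_mask` argument, and the tokenizers can return it directly when called on a list of sentences with padding enabled.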
【Comments】:
Tags: pytorch bert-language-model