Taking the last state from BiLSTM (BiGRU) in PyTorch

【Question title】: Taking the last state from BiLSTM (BiGRU) in PyTorch
【Posted】: 2018-11-24 04:46:21
【Question】:

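After reading several articles, I am still confused about whether my implementation of getting the last hidden states from a BiLSTM is correct.

The approach from the last source (4) seems the cleanest to me, but I am still not sure I understood that thread correctly. Am I using the right final hidden states from the forward LSTM and the reversed LSTM? This is my implementation: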

# pos contains indices of words in embedding matrix
# seqlengths contains info about sequence lengths
# so for instance, if batch_size is 2 and pos=[4,6,9,3,1] and 
# seqlengths contains [3,2], we have batch with samples
# of variable length [4,6,9] and [3,1]

all_in_embs = self.in_embeddings(pos)
in_emb_seqs = pack_sequence(torch.split(all_in_embs, seqlengths, dim=0))
output,lasthidden = self.rnn(in_emb_seqs)
if not self.data_processor.use_gru:
    lasthidden = lasthidden[0]
# u_emb_batch has shape batch_size x embedding_dimension
# sum last state from forward and backward  direction
u_emb_batch = lasthidden[-1,:,:] + lasthidden[-2,:,:]

Is this correct?

【Comments】:

Tags: python lstm pytorch

【Answer 1】:

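In general, if you want to build your own BiLSTM network, you need to create two regular LSTMs and feed one with the input sequence in its normal order and the other with the reversed input sequence. Once both sequences have been fed, you take the last state from each network and combine the two in some way (sum or concatenate).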
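The manual construction described above can be sketched as follows. This is only an illustration under stated assumptions: the tensor sizes and the names `fwd_lstm`, `bwd_lstm`, `last_sum`, and `last_cat` are made up for the example, and all sequences in the batch are assumed to have the same length (with padded batches, flipping the time axis would also move the padding to the front).

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Illustrative sizes (assumptions, not taken from the question).
seq_len, batch, input_size, hidden_size = 5, 2, 4, 3

fwd_lstm = nn.LSTM(input_size, hidden_size)  # reads t = 0 .. seq_len - 1
bwd_lstm = nn.LSTM(input_size, hidden_size)  # reads t = seq_len - 1 .. 0

x = torch.randn(seq_len, batch, input_size)  # (seq_len, batch, input_size)

_, (h_fwd, _) = fwd_lstm(x)
_, (h_bwd, _) = bwd_lstm(torch.flip(x, dims=[0]))  # reverse the time axis

# Each h_* has shape (1, batch, hidden_size); index 0 drops the layer axis.
last_sum = h_fwd[0] + h_bwd[0]                     # (batch, hidden_size)
last_cat = torch.cat([h_fwd[0], h_bwd[0]], dim=1)  # (batch, 2 * hidden_size)
```

Whether to sum or concatenate is a modelling choice: summing keeps the embedding at `hidden_size`, concatenating keeps the two directions separate at twice the width.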

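As I understand it, you are using the built-in BiLSTM as in this example (setting bidirectional=True in the nn.LSTM constructor). In that case you get the concatenated output after feeding the batch, because PyTorch handles all the hassle for you.

If that is the case and you want to sum the hidden states, then you have to use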

u_emb_batch = (lasthidden[0, :, :] + lasthidden[1, :, :])

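assuming you have only one layer. If you have more layers, your variant looks better, because lasthidden[-2] and lasthidden[-1] select the forward and backward states of the last layer.

This follows from how the result is structured (see the documentation):

h_n of shape (num_layers * num_directions, batch, hidden_size): tensor containing the hidden state for t = seq_len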
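To make that layout concrete, here is a small sketch with a multi-layer bidirectional LSTM. The sizes and the names `h_n_by_layer`, `top_forward`, and `top_backward` are assumptions for illustration; the reshape into (num_layers, num_directions, batch, hidden_size) follows the separation described in the nn.LSTM documentation.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Illustrative sizes (assumptions, not taken from the question).
num_layers, hidden_size, input_size = 3, 4, 5
seq_len, batch = 6, 2

rnn = nn.LSTM(input_size, hidden_size, num_layers=num_layers, bidirectional=True)
x = torch.randn(seq_len, batch, input_size)
output, (h_n, c_n) = rnn(x)

# h_n: (num_layers * num_directions, batch, hidden_size)
# Separate the layer axis from the direction axis.
h_n_by_layer = h_n.reshape(num_layers, 2, batch, hidden_size)

top_forward = h_n_by_layer[-1, 0]   # same values as h_n[-2]
top_backward = h_n_by_layer[-1, 1]  # same values as h_n[-1]

# Sum of the final states of the last layer, as in the question.
u_emb_batch = top_forward + top_backward  # (batch, hidden_size)
```

With a single layer, `h_n[-2]` and `h_n[-1]` are the same entries as `h_n[0]` and `h_n[1]`, so indexing from the end works for any number of layers.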

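By the way,

u_emb_batch_2 = output[-1, :, :HIDDEN_DIM] + output[-1, :, HIDDEN_DIM:]

agrees with this only in the forward half, and only when every sequence in the batch has full length. output[-1, :, HIDDEN_DIM:] is the backward direction's state after it has read just the last element, whereas lasthidden[1] is its state after reading the whole sequence, so the two sums differ in general.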

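【Comments】:

Your assumption is correct: I am using the built-in BiLSTM (or BiGRU, respectively) to avoid the hassle, and I am experimenting with multi-layer architectures. Thank you for your answer. Regarding your "by the way": yes, but to use "output" I would first have to unpack it from a packed sequence into a padded sequence, which I would also like to avoid. I have a couple more questions. @1: Why would my variant not work for one layer? @2: Is there anywhere in the documentation (or another official source) that I missed which states how the hidden states are ordered in lasthidden for a BiRNN?
1. It is fine; I was just surprised to see the unusual indices. 2. For every layer, the normal RNN comes first (index 0), then the reversed one. Take a look at the source code of nn.RNNBase.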

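【Answer 2】:

Here is a detailed explanation for those working with unpacked sequences:

output has shape (seq_len, batch, num_directions * hidden_size) (see the documentation). This means the outputs of the GRU's forward and backward passes are concatenated along the third dimension.

Assuming batch=2 and hidden_size=256 as in your example, you can easily separate the outputs of the forward and backward passes like this: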

output = output.view(-1, 2, 2, 256)   # (seq_len, batch_size, num_directions, hidden_size)
output_forward = output[:, :, 0, :]   # (seq_len, batch_size, hidden_size)
output_backward = output[:, :, 1, :]  # (seq_len, batch_size, hidden_size)

(Note: -1 tells PyTorch to infer that dimension from the others. See this question.)

Equivalently, you can apply torch.chunk to the original output of shape (seq_len, batch, num_directions * hidden_size):

# Split in 2 tensors along dimension 2 (num_directions)
output_forward, output_backward = torch.chunk(output, 2, 2)
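As a quick sanity check of that equivalence, the sketch below compares the two ways of splitting. The random tensor is only a stand-in for a real RNN output, and the sizes and variable names are illustrative assumptions.

```python
import torch

torch.manual_seed(0)

# Illustrative sizes (assumptions); `output` stands in for an RNN output
# of shape (seq_len, batch, num_directions * hidden_size).
seq_len, batch, hidden_size = 4, 2, 3
output = torch.randn(seq_len, batch, 2 * hidden_size)

# Variant 1: expose the direction axis with view, then index it.
viewed = output.view(seq_len, batch, 2, hidden_size)
fwd_view, bwd_view = viewed[:, :, 0, :], viewed[:, :, 1, :]

# Variant 2: split the last dimension into two equal chunks.
fwd_chunk, bwd_chunk = torch.chunk(output, 2, dim=2)

same = torch.equal(fwd_view, fwd_chunk) and torch.equal(bwd_view, bwd_chunk)
print(same)
```

torch.chunk has the small advantage of not needing the batch size or hidden size spelled out.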

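Now you can torch.gather the last hidden state of the forward pass using seqlengths (after reshaping it), and pick the last hidden state of the backward pass by selecting the element at position 0: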

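# First we unsqueeze seqlengths two times so it has the same number
# of dimensions as output_forward
# (batch_size) -> (1, batch_size, 1)
# torch.as_tensor also covers the case where seqlengths is a Python list
lengths = torch.as_tensor(seqlengths, device=output_forward.device).unsqueeze(0).unsqueeze(2)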

# Then we expand it accordingly
# (1, batch_size, 1) -> (1, batch_size, hidden_size) 
lengths = lengths.expand((1, -1, output_forward.size(2)))

last_forward = torch.gather(output_forward, 0, lengths - 1).squeeze(0)
last_backward = output_backward[0, :, :]

Note that I subtracted 1 from lengths because of 0-based indexing.

At this point, both last_forward and last_backward have shape (batch_size, hidden_dim).
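Putting the steps together, here is a minimal end-to-end sketch on a packed, variable-length batch. The sizes, the random inputs, and the names `seqs`, `idx`, and `matches_h_n` are assumptions for illustration; the final comparison against `h_n` is there only to show that gathering from the padded output recovers the same states the RNN already returns.

```python
import torch
import torch.nn as nn
from torch.nn.utils.rnn import pack_sequence, pad_packed_sequence

torch.manual_seed(0)

# Illustrative sizes (assumptions). Lengths are in decreasing order,
# which pack_sequence expects by default.
input_size, hidden_size = 4, 3
seqlengths = [3, 2]

rnn = nn.GRU(input_size, hidden_size, bidirectional=True)
seqs = [torch.randn(n, input_size) for n in seqlengths]

packed_out, h_n = rnn(pack_sequence(seqs))
# output: (max_len, batch, 2 * hidden_size), lengths: (batch,)
output, lengths = pad_packed_sequence(packed_out)

output_forward, output_backward = torch.chunk(output, 2, dim=2)

# Index of the last valid step of each sequence, repeated over hidden_size.
idx = (lengths - 1).view(1, -1, 1).expand(1, -1, hidden_size)
last_forward = torch.gather(output_forward, 0, idx).squeeze(0)
last_backward = output_backward[0]

# For a single layer, h_n[0] / h_n[1] are the forward / backward final states.
matches_h_n = (torch.allclose(last_forward, h_n[0])
               and torch.allclose(last_backward, h_n[1]))
print(matches_h_n)
```

If only the final states are needed, reading them from `h_n` avoids the unpacking step entirely; the gather route is useful when the padded output is needed anyway.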

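【Comments】:

I think you made a mistake above. Shouldn't this be: output = output.view(-1, 2, 2, 256) # (seq_len, batch_size, num_directions, hidden_size)? And then: output_forward = output[:, :, 0, :] # (seq_len, batch_size, direction) output_backward = output[:, :, 1, :] # (seq_len, batch_size, direction)
I wrote this answer when PyTorch was at v0.4.0. If you look at the documentation from that time (pytorch.org/docs/0.4.0/nn.html#gru), the output dimensions were seq_len, batch, hidden_size * num_directions, but in the current version they are seq_len, batch, num_directions * hidden_size. I have updated the answer to reflect the new order. Thanks!

【Answer 3】:

I tested the biLSTM output against h_n: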

# biLSTM is assumed to be a single-layer nn.LSTM built with
# batch_first=True and bidirectional=True
# shape of x is (batch_size, time_steps, input_size)
# shape of output is (batch_size, time_steps, hidden_size * num_directions)
# shape of h_n is (num_directions, batch_size, hidden_size)
output, (h_n, _c_n) = biLSTM(x)

# torch.allclose returns a single bool, so each print shows True or False
print('first step (element) of output from reverse == h_n from reverse?',
    torch.allclose(output[:, 0, hidden_size:], h_n[1]))
print('last step (element) of output from reverse == h_n from reverse?',
    torch.allclose(output[:, -1, hidden_size:], h_n[1]))

Output:

first step (element) of output from reverse == h_n from reverse? True
last step (element) of output from reverse == h_n from reverse? False

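This confirms that h_n for the reverse direction is the hidden state at the first time step.

So, if you really need the hidden state at the last time step from both the forward and the reverse direction, you should use: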

sum_lasthidden = output[:, -1, :hidden_size] + output[:, -1, hidden_size:]

rather than

h_n[0,:,:] + h_n[1,:,:]

because h_n[1,:,:] is the reverse direction's hidden state at the first time step.

So @igrinis's answer

u_emb_batch = (lasthidden[0, :, :] + lasthidden[1, :, :])

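is incorrect if the last time step is what you need.

In theory, though, the reverse direction's hidden state at the last time step contains information only from the last element of the sequence, whereas h_n[1] summarizes the whole sequence read backward.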
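That last point can be checked empirically. The sketch below is an illustration under assumed sizes: it replaces every input step except the last one and compares what changes. The names `x_changed`, `reverse_last_same`, and `reverse_final_same` are hypothetical.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Illustrative sizes (assumptions).
input_size, hidden_size, time_steps = 4, 3, 5
biLSTM = nn.LSTM(input_size, hidden_size, batch_first=True, bidirectional=True)

x = torch.randn(1, time_steps, input_size)
x_changed = x.clone()
# Replace every step except the last one with new random values.
x_changed[:, :-1, :] = torch.randn(1, time_steps - 1, input_size)

with torch.no_grad():
    out_a, (h_a, _) = biLSTM(x)
    out_b, (h_b, _) = biLSTM(x_changed)

# Reverse half at the last time step: it has only seen the last input,
# so it is unaffected by the changes to earlier steps.
reverse_last_same = torch.allclose(out_a[:, -1, hidden_size:],
                                   out_b[:, -1, hidden_size:])
# Reverse final state h_n[1]: it has read the whole sequence backward,
# so it is expected to change.
reverse_final_same = torch.allclose(h_a[1], h_b[1])
print(reverse_last_same, reverse_final_same)
```

Which of the two is appropriate depends on the goal: for a summary of the whole sequence from both directions, the final states in `h_n` are the usual choice.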

【Comments】: