Is concatenated matrix multiplication faster than multiple non-concatenated matrix multiplications? If so, why?

【Question title】: Is concatenated matrix multiplication faster than multiple non-concatenated matmul? If so, why?
【Posted】: 2019-02-21 04:42:39
【Question description】:

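The definition of an LSTM cell involves 4 matrix multiplications with the input and 4 matrix multiplications with the output. We can simplify the expression by concatenating the 4 small matrices and using a single matrix multiplication (the matrix is now 4 times larger).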
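To make the rewrite described above concrete, here is a minimal NumPy sketch showing that four separate products and one product with the stacked matrix give the same numbers. The sizes and the random values are assumptions chosen only for illustration; the sketch checks equivalence of the results and says nothing about speed.

```python
import numpy as np

rng = np.random.default_rng(0)
input_size, hidden_size = 3, 5  # hypothetical sizes, for illustration only

# Four separate input-hidden weight matrices, one per gate (i, f, g, o).
W_ii, W_if, W_ig, W_io = (
    rng.standard_normal((hidden_size, input_size)) for _ in range(4)
)
x = rng.standard_normal(input_size)

# Variant 1: four small matrix-vector products.
separate = [W @ x for W in (W_ii, W_if, W_ig, W_io)]

# Variant 2: one product with the matrices stacked along the output dimension.
W_cat = np.concatenate([W_ii, W_if, W_ig, W_io], axis=0)  # (4*hidden_size, input_size)
fused = W_cat @ x                                          # (4*hidden_size,)

# Splitting the fused result recovers the four per-gate results.
fused_parts = np.split(fused, 4)
same = all(np.allclose(a, b) for a, b in zip(separate, fused_parts))
```

Both variants perform the same number of scalar multiplications; the only difference is whether the work is issued as four calls or one.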

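My question is: does this make the matrix multiplication more efficient? If so, why? Is it because we can put the matrices in contiguous memory? Or is it just for conciseness of the code?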

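Whether or not we concatenate the matrices, the number of items we multiply does not change (so the complexity should not change). So I am wondering why we would do this.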

Here is an excerpt from the PyTorch documentation for torch.nn.LSTM(*args, **kwargs). W_ii, W_if, W_ig, W_io are concatenated.

weight_ih_l[k] – the learnable input-hidden weights of the k-th layer (W_ii|W_if|W_ig|W_io), of shape (4*hidden_size x input_size)

weight_hh_l[k] – the learnable hidden-hidden weights of the k-th layer (W_hi|W_hf|W_hg|W_ho), of shape (4*hidden_size x hidden_size)

bias_ih_l[k] – the learnable input-hidden bias of the k-th layer (b_ii|b_if|b_ig|b_io), of shape (4*hidden_size)

bias_hh_l[k] – the learnable hidden-hidden bias of the k-th layer (b_hi|b_hf|b_hg|b_ho), of shape (4*hidden_size)
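As a rough illustration of how parameters with these shapes can be used, the following NumPy sketch runs one LSTM time step with the concatenated (i|f|g|o) layout. The function name `lstm_step`, the helper `sigmoid`, and the sizes are hypothetical, and this is not how PyTorch implements the layer internally; it only follows the standard LSTM gate equations.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h, c, weight_ih, weight_hh, bias_ih, bias_hh):
    """One LSTM time step with parameters stored in the (i|f|g|o) layout."""
    # Two matmuls produce the pre-activations of all four gates at once.
    gates = weight_ih @ x + bias_ih + weight_hh @ h + bias_hh  # (4*hidden_size,)
    i, f, g, o = np.split(gates, 4)                            # each (hidden_size,)
    i, f, o = sigmoid(i), sigmoid(f), sigmoid(o)
    g = np.tanh(g)
    c_next = f * c + i * g           # new cell state
    h_next = o * np.tanh(c_next)     # new hidden state
    return h_next, c_next

# Hypothetical sizes and random parameters, for illustration only.
rng = np.random.default_rng(0)
input_size, hidden_size = 3, 5
weight_ih = rng.standard_normal((4 * hidden_size, input_size))
weight_hh = rng.standard_normal((4 * hidden_size, hidden_size))
bias_ih = rng.standard_normal(4 * hidden_size)
bias_hh = rng.standard_normal(4 * hidden_size)

x = rng.standard_normal(input_size)
h = np.zeros(hidden_size)
c = np.zeros(hidden_size)
h_next, c_next = lstm_step(x, h, c, weight_ih, weight_hh, bias_ih, bias_hh)
```

With separate per-gate matrices the same step would need eight matmul calls instead of two, followed by the same element-wise operations.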

【Question discussion】:

Tags: tensorflow matrix lstm pytorch gpu

【Solution 1】:

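The structure of the LSTM is not meant to improve multiplication efficiency; it is meant to get around vanishing/exploding gradients (https://stats.stackexchange.com/questions/185639/how-does-lstm-prevent-the-vanishing-gradient-problem). Research is ongoing on mitigating the effect of vanishing gradients, and GRU / LSTM cells + peepholes are a few of the attempts to mitigate it.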

【Discussion】:

You did not understand my question... I am asking why we concatenate the LSTM parameters for the matmul.