Implementing an LSTM with Theano scan is much slower than using loops

【Question title】: Implementing LSTM with Theano scan, way slower than using loops
【Posted】: 2015-07-01 18:30:12
【Question】:

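I am implementing an LSTM model in my own network using Theano/Pylearn2. However, I found that Theano's scan is much slower than using plain loops. I ran the Theano profiler, which reports the following time per class: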

<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Class name>
  95.4%    95.4%      25.255s       4.31e-02s     Py     586       3   theano.scan_module.scan_op.Scan
   1.8%    97.2%       0.466s       4.72e-05s     C     9864      41   theano.sandbox.cuda.basic_ops.GpuElemwise
   0.8%    97.9%       0.199s       8.75e-05s     C     2276      10   theano.sandbox.cuda.basic_ops.GpuAlloc
   0.7%    98.7%       0.196s       1.14e-04s     C     1724       8   theano.sandbox.cuda.blas.GpuDot22
   0.3%    99.0%       0.087s       1.06e-04s     C      828       3   theano.sandbox.cuda.basic_ops.GpuIncSubtensor
   0.2%    99.2%       0.051s       1.66e-04s     Py     310       2   theano.sandbox.cuda.basic_ops.GpuAdvancedSubtensor1

and per op:

<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Op name>
  77.2%    77.2%      20.433s       7.40e-02s     Py     276        1   forall_inplace,gpu,grad_of_lstm__layers}
  18.2%    95.4%       4.822s       1.56e-02s     Py     310        2   forall_inplace,gpu,lstm__layers}

So a lot of time is spent in Scan (which is somewhat expected, but I did not expect it to be this slow).

The main body of my code is:

        # Assumes `import theano.tensor as T` at module level.
        def fprop(self, state_below, state_prev=None, cell_prev=None):
            # Fall back to the stored states when none are passed in.
            if state_prev is None:
                state_prev = self.state_prev
            if cell_prev is None:
                cell_prev = self.cell_prev
            # Input and forget gates.
            i_gate = T.nnet.sigmoid(T.dot(state_below, self.Wi) +
                                    T.dot(state_prev, self.Ui))
            f_gate = T.nnet.sigmoid(T.dot(state_below, self.Wf) +
                                    T.dot(state_prev, self.Uf))
            # Candidate cell value, then the updated cell state.
            C = T.tanh(T.dot(state_below, self.Wc) +
                       T.dot(state_prev, self.Uc))
            C = i_gate * C + f_gate * cell_prev
            # Output gate, with a peephole term from the new cell state.
            o_gate = T.nnet.sigmoid(T.dot(state_below, self.Wo) +
                                    T.dot(state_prev, self.Uo) +
                                    T.dot(C, self.Vo))
            h_out = o_gate * T.tanh(C)
            return h_out, C
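
As a numeric cross-check of the gate equations in `fprop`, here is a minimal plain-Python sketch of one step for a single hidden unit, with scalars in place of the weight matrices. The names `lstm_step_scalar` and `weights`, and the dictionary layout, are illustrative assumptions; they are not part of the original code, Theano, or Pylearn2.

```python
import math


def sigmoid(x):
    """Logistic function, the scalar counterpart of T.nnet.sigmoid."""
    return 1.0 / (1.0 + math.exp(-x))


def lstm_step_scalar(x, h_prev, c_prev, w):
    """One LSTM step for a single unit.

    `w` is a dict of scalar weights keyed like the attributes in fprop
    ('Wi', 'Ui', ...). This layout is an assumption made for illustration.
    """
    i_gate = sigmoid(x * w["Wi"] + h_prev * w["Ui"])
    f_gate = sigmoid(x * w["Wf"] + h_prev * w["Uf"])
    c_cand = math.tanh(x * w["Wc"] + h_prev * w["Uc"])
    c = i_gate * c_cand + f_gate * c_prev
    # The output gate also sees the new cell state (the Vo term in fprop).
    o_gate = sigmoid(x * w["Wo"] + h_prev * w["Uo"] + c * w["Vo"])
    h = o_gate * math.tanh(c)
    return h, c


# With all weights at zero every gate is 0.5 and the candidate is 0,
# so the new cell state is simply half of the previous one.
weights = {name: 0.0 for name in
           ("Wi", "Ui", "Wf", "Uf", "Wc", "Uc", "Wo", "Uo", "Vo")}
h, c = lstm_step_scalar(1.0, 0.0, 2.0, weights)
```

A scalar version like this can help when checking by hand that the symbolic graph computes what is intended before worrying about its speed.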

I wrote my scan as:

[h, c, out], _ = theano.scan(fn=self.fprop_with_output,
                             sequences=[X.T, Y[:, 1:].T],
                             outputs_info=[dict(initial=h_, taps=[-1]),
                                           dict(initial=c_, taps=[-1]),
                                           None],
                             n_steps=X.shape[1] - 1)

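One thing I noticed is that the profiler lists the Theano Scan op with type Py, i.e. a Python implementation (?). Is that why it is so ridiculously slow, or am I doing something wrong? Why is Scan implemented in Python rather than C?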

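(I said that using loops is faster, but that applies only at run time; for large models I did not manage to compile the loop version in a reasonable amount of time.)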

【Question comments】:

Tags: machine-learning theano pylearn lstm

【Solution 1】:

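It takes time for the Theano developers to implement scan and the gradient of scan in C and on the GPU, because they are much more complicated than other functions. That is why, when you profile it, you see GpuElemwise, GpuGemv, GpuDot22 and so on, but no GpuScan or GpuGradofScan.

In the meantime, you can only fall back to for loops.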
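
Falling back to a loop means calling the step function once per time step in ordinary Python and carrying the state by hand. A minimal sketch follows; `unroll` and `toy_step` are hypothetical names used only for illustration. Because the helper does nothing but call `step` and append to lists, the same pattern should carry over to symbolic Theano variables, assuming the sequence length is a fixed Python integer. The price is a graph that grows with the number of steps, which is consistent with the long compile times mentioned in the question.

```python
def unroll(step, xs, h0, c0):
    """Apply `step` to each element of `xs`, carrying (h, c) forward.

    `step(x, h_prev, c_prev)` must return the new (h, c) pair.
    Returns the list of all h values and the list of all c values.
    """
    h, c = h0, c0
    hs, cs = [], []
    for x in xs:
        h, c = step(x, h, c)
        hs.append(h)
        cs.append(c)
    return hs, cs


def toy_step(x, h_prev, c_prev):
    """Stand-in recurrence (not an LSTM): accumulate x, emit half of it."""
    c = c_prev + x
    h = 0.5 * c
    return h, c


hs, cs = unroll(toy_step, [1.0, 2.0, 3.0], 0.0, 0.0)
```

In a real model, `toy_step` would be replaced by the layer's own step function, with `h0` and `c0` set to the initial hidden and cell states.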

【Comments】:

【Solution 2】:

This was asked a while ago, but I had (and still have) the same problem. The answer is that scan is slow on the GPU.

See: https://github.com/Theano/Theano/issues/1168

【Comments】: