SVD in TensorFlow is slower than in numpy

【Question title】: SVD in TensorFlow is slower than in numpy
【Posted】: 2018-03-03 10:36:38
【Question description】:

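I have observed that SVD in TensorFlow runs noticeably slower than in numpy on my machine. I have a GTX 1080 GPU and expected the SVD to be at least as fast as when the code runs on the CPU (numpy).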

Environment information

Operating system

lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description:    Ubuntu 16.10
Release:    16.10
Codename:   yakkety

Installed versions of CUDA and cuDNN:

ls -l /usr/local/cuda-8.0/lib64/libcud*
-rw-r--r-- 1 root      root    556000 Feb 22  2017 /usr/local/cuda-8.0/lib64/libcudadevrt.a
lrwxrwxrwx 1 root      root        16 Feb 22  2017 /usr/local/cuda-8.0/lib64/libcudart.so -> libcudart.so.8.0
lrwxrwxrwx 1 root      root        19 Feb 22  2017 /usr/local/cuda-8.0/lib64/libcudart.so.8.0 -> libcudart.so.8.0.61
-rwxr-xr-x 1 root      root    415432 Feb 22  2017 /usr/local/cuda-8.0/lib64/libcudart.so.8.0.61
-rw-r--r-- 1 root      root    775162 Feb 22  2017 /usr/local/cuda-8.0/lib64/libcudart_static.a
lrwxrwxrwx 1 voldemaro users       13 Nov  6  2016 /usr/local/cuda-8.0/lib64/libcudnn.so -> libcudnn.so.5
lrwxrwxrwx 1 voldemaro users       18 Nov  6  2016 /usr/local/cuda-8.0/lib64/libcudnn.so.5 -> libcudnn.so.5.1.10
-rwxr-xr-x 1 voldemaro users 84163560 Nov  6  2016 /usr/local/cuda-8.0/lib64/libcudnn.so.5.1.10
-rw-r--r-- 1 voldemaro users 70364814 Nov  6  2016 /usr/local/cuda-8.0/lib64/libcudnn_static.a

TensorFlow setup

python -c "import tensorflow; print(tensorflow.__version__)"
I tensorflow/stream_executor/dso_loader.cc:135] successfully opened CUDA library libcublas.so.8.0 locally
I tensorflow/stream_executor/dso_loader.cc:135] successfully opened CUDA library libcudnn.so.5 locally
I tensorflow/stream_executor/dso_loader.cc:135] successfully opened CUDA library libcufft.so.8.0 locally
I tensorflow/stream_executor/dso_loader.cc:135] successfully opened CUDA library libcuda.so.1 locally
I tensorflow/stream_executor/dso_loader.cc:135] successfully opened CUDA library libcurand.so.8.0 locally
1.0.0

Code:

'''
Created on Sep 21, 2017

@author: voldemaro
'''
import time

import numpy as np
import numpy.linalg as NLA
import tensorflow as tf

N = 1534

# Random N x N matrix, cast to complex (complex128 in numpy)
svd_array = np.random.random_sample((N, N))
svd_array = svd_array.astype(complex)

# Note: the TensorFlow variable is stored as complex64
specVar = tf.Variable(svd_array, dtype=tf.complex64)

# tf.svd returns the singular values first, then the singular vectors
[D2, E1, E2] = tf.svd(specVar)

init_OP = tf.global_variables_initializer()

with tf.Session() as sess:
    # Initialize all tensorflow variables
    start = time.time()
    sess.run(init_OP)
    print('initializing variables: {} s'.format(time.time() - start))

    start_time = time.time()
    [d, e1, e2] = sess.run([D2, E1, E2])
    print("Tensorflow SVD ---: {} s".format(time.time() - start_time))


# Equivalent numpy
start = time.time()

u, s, v = NLA.svd(svd_array)
print('numpy SVD  ---: {} s'.format(time.time() - start))

Output trace:

W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.2 instructions, but these are available on your machine and could speed up CPU computations.
W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX instructions, but these are available on your machine and could speed up CPU computations.
W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX2 instructions, but these are available on your machine and could speed up CPU computations.
W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use FMA instructions, but these are available on your machine and could speed up CPU computations.
I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:910] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
I tensorflow/core/common_runtime/gpu/gpu_device.cc:885] Found device 0 with properties: 
name: GeForce GTX 1080
major: 6 minor: 1 memoryClockRate (GHz) 1.7335
pciBusID 0000:01:00.0
Total memory: 7.92GiB
Free memory: 7.11GiB
I tensorflow/core/common_runtime/gpu/gpu_device.cc:906] DMA: 0 
I tensorflow/core/common_runtime/gpu/gpu_device.cc:916] 0:   Y 
I tensorflow/core/common_runtime/gpu/gpu_device.cc:975] Creating TensorFlow device (/gpu:0) -> (device: 0, name: GeForce GTX 1080, pci bus id: 0000:01:00.0)
initializing variables: 0.230546951294 s
Tensorflow SVD ---: 6.56117296219 s
numpy SVD  ---: 4.41714000702 s

【Question comments】:

Tags: python numpy tensorflow svd

【Solution 1】:

GPU execution usually outperforms the CPU only when the computation parallelizes well.

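However, parallelizing the SVD algorithm is still an area of active research, which means that no parallel version has yet been found to be clearly superior to serial implementations.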

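The NumPy version is most likely based on a very well-optimized FORTRAN implementation, whereas I believe TensorFlow has its own C++ implementation, which is apparently not as well optimized as the code NumPy calls.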
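One way to check this on your own machine is to print which BLAS/LAPACK build NumPy is linked against and to time its SVD in isolation. The sketch below is only illustrative: the helper name `time_numpy_svd`, the matrix size, the dtype, and the seed are assumptions and not part of the original post.

```python
import time

import numpy as np


def time_numpy_svd(n=256, dtype=np.complex64, seed=0):
    """Time one np.linalg.svd call on a random n x n matrix.

    Hypothetical helper for illustration; n, dtype and seed are arbitrary.
    """
    rng = np.random.RandomState(seed)
    a = rng.random_sample((n, n)).astype(dtype)

    start = time.time()
    u, s, vh = np.linalg.svd(a)
    elapsed = time.time() - start
    return elapsed, u, s, vh, a


if __name__ == "__main__":
    # Prints the BLAS/LAPACK libraries this NumPy build was linked against.
    np.show_config()

    elapsed, u, s, vh, a = time_numpy_svd()
    print("numpy SVD ({}x{}): {:.4f} s".format(a.shape[0], a.shape[1], elapsed))
```

Using the same dtype on both sides (for example complex64, as in the TensorFlow variable in the question) keeps the comparison like for like.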

Edit: you are probably not the first to observe poorer performance of TensorFlow's SVD compared with a FORTRAN implementation.

【Comments】:

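When I profiled the code, I saw that numpy spreads the load across all 8 CPU cores (Intel i7), so I was rather expecting to see a benefit from having so many (2560) CUDA cores.
It looks like there was an earlier effort to use the GPU that showed a 5x performance improvement over Intel MKL - s3.amazonaws.com/academia.edu.documents/30806706/…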

【Solution 2】:

It looks like the TensorFlow op implements gesvd, whereas MKL-enabled numpy/scipy (i.e., what you get if you use conda) defaults to the faster (but numerically less robust) gesdd.

You can try comparing against gesvd in scipy:

from scipy import linalg
u0, s0, vt0 = linalg.svd(target0, lapack_driver="gesvd")
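A minimal timing sketch of that comparison, assuming SciPy is installed. The helper name `compare_lapack_drivers`, the 300 x 300 matrix size, and the seed are illustrative choices and not from the original answer; `lapack_driver` is the actual `scipy.linalg.svd` parameter, with `"gesdd"` as its default and `"gesvd"` as the alternative.

```python
import time

import numpy as np
from scipy import linalg


def compare_lapack_drivers(n=300, seed=0):
    """Run scipy.linalg.svd with both LAPACK drivers on the same matrix.

    Hypothetical helper; returns {driver: (elapsed_seconds, singular_values)}.
    """
    rng = np.random.RandomState(seed)
    a = rng.random_sample((n, n))

    results = {}
    for driver in ("gesdd", "gesvd"):
        start = time.time()
        u, s, vt = linalg.svd(a, lapack_driver=driver)
        results[driver] = (time.time() - start, s)
    return results


if __name__ == "__main__":
    results = compare_lapack_drivers()
    for driver, (elapsed, s) in results.items():
        print("{}: {:.4f} s".format(driver, elapsed))

    # Both drivers should agree on the singular values up to rounding.
    print(np.allclose(results["gesdd"][1], results["gesvd"][1]))
```

A single run like this is noisy; repeating the measurement a few times and using a matrix of the size you actually care about gives a more trustworthy picture.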

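I have also seen better results with the MKL version, so I have been using this helper class to switch transparently between the TensorFlow and numpy versions of SVD, using tf.Variable to store the results.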

You use it like this:

result = SvdWrapper(tensor)
result.update()
sess.run([result.u, result.s, result.v])

More details on the slowness are in this issue: https://github.com/tensorflow/tensorflow/issues/13222

【Comments】: