为什么这个脚本在 Python 中这么慢？答案

【问题标题】：Why is this script so slow in Python?为什么这个脚本在 Python 中这么慢？
【发布时间】：2018-10-08 12:33:17
【问题描述】：

我编写了一个在 MATLAB 中训练一维 Kohonen 网络的脚本，它的作用就像一个魅力。然后我尝试将它翻译成 Python 2.7，这是一种我很新的语言，而且脚本需要很长时间才能运行。

我会解释我正在做什么，看看这里是否有人可以对此事有所了解。我在矩阵y 中有一个给定的数据集，我想用它训练不同的 SOM。 SOM 是一维的（一条线），其神经元数量各不相同。我首先训练了一个大小为 N=2 的 SOM，最后训练了 N=NMax，总共得到了 NMax-2+1 SOM。对于每个 SOM，我想在训练结束后存储权重，然后再转到下一个 SOM。

在 MATLAB 中，NMax = 5 和 iterMax = 50 需要 9.74 秒。在 Python 中，54.04 秒。这种差异是巨大的，而实际的数据集、SOM 数量和迭代次数甚至更大，因此 Python 代码需要永远结束。

我当前的代码如下：

import numpy as np
import time
y = np.random.rand(2500,3) # Create random dataset to test
def A(d,s): # Neighborhood function
    return np.exp(-d**2 / (2*s**2))
sigma_0 = float(5) # Initial standard deviation for A
eta_0 = float(1) # Initial learning rate
iterMax = 250 # Maximum number of iterations
NMax = 10 # Maximum number of neurons
w = range(NMax - 1) # Initialize the size of the weight matrix (it will store NMax-2+1 sets of weights, each of varying size depending on the value of N)
#%% KOHONEN 1D
t = time.time() # Start time
for N in np.arange(2,NMax + 1): # Size of the network
    w[N - 2] = np.random.uniform(0,1,(N,np.size(y,axis=1))) - 0.5 # Initial weights
    iterCount = 1; # Iteration counter
    while iterCount < iterMax:
        # Mix the datapoints to choose them in random order
        mixInputs = y[np.random.permutation(np.size(y,axis = 0)),:]
        # Decrease the value of the variance and the learning rate
        sigma = sigma_0 - (sigma_0/(iterMax + 1)) * iterCount
        eta = eta_0 - (eta_0/(iterMax + 1)) * iterCount
        for kk in range(np.size(mixInputs,axis = 0)): # Picking one datapoint at a time
            selectedInput = mixInputs[kk,:]
            # These two lines calculate the weight that is the nearest to the datapoint selected
            aux = np.absolute(np.array(np.kron(np.ones((N,1)),selectedInput)) - np.array(w[N - 2]))
            aux = np.sum(np.abs(aux)**2,axis=-1)
            ii = np.argmin(aux) # The node ii is the winner
            for jj in range(N):
                dist = min(np.absolute(ii-jj) , np.absolute(np.absolute(ii-jj)-N)) # Centering the neighborhood function in the winner
                w[N - 2][jj,:] = w[N - 2][jj,:] + eta * A(dist,sigma) * (selectedInput - w[N - 2][jj,:]) # Updating the weights
        print(N,iterCount)
        iterCount = iterCount + 1 

elapsedTime = time.time() - t

在 MATLAB 中，每次迭代（每次变量 iterCount 增加 1）几乎都是即时的。在 Python 中，每一个都需要很长时间。我不知道为什么它们的表现如此不同，但我想看看是否可以加快 Python 版本的速度。有什么建议吗？

编辑：根据 cmets 的要求，这里是更快的 MATLAB 版本的代码。

y = rand(2500,3) % Random dataset
A = @(d,s) exp(-d^2 / (2*s^2));
sigma_0 = 5;
eta_0 = 1;
iterMax = 250;
NMax = 10;
w = cell(NMax - 1,1);

%% KOHONEN 1D
tic();
for N = 2 : NMax 
    w{N - 1} = rand(N,size(y,2)) - 0.5;
    iterCount = 1; 
    while (iterCount < iterMax)
        mixInputs = y(randperm(size(y,1)),:);
        sigma = sigma_0 - (sigma_0/(iterMax + 1)) * iterCount;
        eta = eta_0 - (eta_0/(iterMax + 1)) * iterCount;
        for kk = 1 : size(mixInputs,1)
            input = mixInputs(kk,:);
            % [~,ii] = min(pdist2(input,w{N - 1}));
            aux = abs(repmat(input,N,1) - w{N - 1});
            [~,ii] = min((sum(aux.^2,2))); 
            for jj = 1 : N
                dist = min(abs(ii-jj) , abs(abs(ii-jj)-N));
                w{N - 1}(jj,:) = w{N - 1}(jj,:) + eta * A(dist,sigma) * (input - w{N - 1}(jj,:));
            end
        end
        N % Show N
        iterCount = iterCount + 1 % Show iterCount
    end
end
toc();

【问题讨论】：

你使用的是什么版本的东西？我在 Python 3.4.5 上，收到错误 Traceback (most recent call last): File "so.py", line 19, in <module> w[N - 2] = np.random.uniform(0,1,(N,np.size(y,axis=1))) - 0.5 # Initial weights
@Prune 我忘了写下来，抱歉。我正在使用 Python 2.7。我已将该信息添加到问题中，谢谢！
您是否对它进行了检测/分析？我现在正在跑步，如果可以的话，我会在答案中提供分析。 docs.python.org/2/library/profile.html
@Attie 不，我没有（我什至不知道这些词的存在！）。我会检查那个链接，看看我可以在问题中添加什么，谢谢！
@Cleb 我已经添加了！

标签： python matlab optimization

【解决方案1】：

这里是一些加速的快速测试 - 我认为输出是相同的，但肯定需要一些时间来仔细检查它：

import numpy as np
import time

np.random.seed(1234)
y = np.random.rand(2500,3) # Create random dataset to test

sigma_0 = float(5) # Initial standard deviation for A
eta_0 = float(1) # Initial learning rate
iterMax = 10 # Maximum number of iterations
NMax = 10 # Maximum number of neurons
w = {} # Initialize the size of the weight matrix (it will store NMax-2+1 sets of weights, each of varying size depending on the value of N)
#%% KOHONEN 1D
t = time.time() # Start time
for N in np.arange(2,NMax + 1): # Size of the network
    w[N - 2] = np.random.uniform(0,1,(N,np.size(y,axis=1))) - 0.5 # Initial weights
    iterCount = 1; # Iteration counter
    while iterCount < iterMax:
        # Mix the datapoints to choose them in random order
        mixInputs = y[np.random.permutation(np.size(y,axis = 0)),:]
        # Decrease the value of the variance and the learning rate
        sigma = sigma_0 - (sigma_0/(iterMax + 1)) * iterCount
        s2 = 2*sigma**2
        eta = eta_0 - (eta_0/(iterMax + 1)) * iterCount
        for kk in range(np.size(mixInputs,axis = 0)): # Picking one datapoint at a time
            selectedInput = mixInputs[kk,:]
            # These two lines calculate the weight that is the nearest to the datapoint selected
            aux = np.sum((selectedInput - np.array(w[N - 2]))**2, axis = -1)
            ii = np.argmin(aux)
            jjs = np.abs(ii - np.arange(N))
            dists = np.min(np.vstack([jjs , np.absolute(jjs-N)]), axis = 0)
            w[N - 2] = w[N - 2] + eta * np.exp((-dists**2)/s2).T[:,np.newaxis] * (selectedInput - w[N - 2]) # Updating the weights

        print(N,iterCount)
        iterCount = iterCount + 1 

elapsedTime = time.time() - t

主要加速主要在于使用广播来减少循环/函数调用。

我们可以换行：

 aux = np.absolute(np.array(np.kron(np.ones((N,1)),selectedInput)) - np.array(w[N - 2]))

与：

aux = np.abs(selectedInput - np.array(w[N - 2]))

（我已将其进一步填充到接下来的几个步骤中）。 Numpy 广播给我们同样的结果，而不必使用 kron 产品。

举个例子：

np.kron(np.ones((3,1)), np.array([6,5,4])) - np.arange(-9,0).reshape(3,3)

与以下输出相同：

np.array([6,5,4]) - np.arange(-9,0).reshape(3,3)

kron(np.ones(N,1),x) 给出了一个带有 N 个 x 副本的 N * x.shape[0] 数组。广播以更便宜的方式解决了这个问题。

另一个主要的加速是减少：

for jj in range(N):

到矩阵运算。我们在每个循环中计算一次2*sigma**2，将 A 函数替换为原生 numpy 调用，然后向量化其余部分。

【讨论】：

有趣...你能量化加速吗？我也对简化的aux 分配很感兴趣，如果您介意进一步解释吗？ - 删除 kron() 会带来很好的提升...
对我来说大约是原始时间的 1/3。查看编辑了解克朗的解释
您好，感谢您的详细解答。不过，我没有得到相同的结果。在ii= np.argmin(np.sum(selectedInput - np.array(w[N - 2])**2, axis = -1)) 行中，np.argmin() 的参数没有返回我在脚本中调用的aux。尽管如此，提升还是很大的！我得到了 21 秒，而我的 54 秒。但是，我仍然不明白为什么这与 MATLAB（9 秒）相比如此缓慢，即使使用您的改进版本也是如此。这种差异有解释吗？
很好 - 我错过了一个括号 - 在新的编辑中，我使用相同的种子得到了完全相同的答案。不确定 9 秒对 21 秒，可能完全重写会使它变得更好。你可以看到 numpy 有一些技巧 - 可能你知道 matlab 技巧。
非常感谢，这很有帮助。最后一个问题。在 MATLAB 中，将事物向量化总是更好。蟒蛇呢？是否也建议避免使用循环？还是仅在使用 numpy 时？

【解决方案2】：

使用profiling module 找出正在调用的函数，以及它们需要多长时间。

在下面的输出中，各列的含义如下：

ncalls 对于通话次数，

总时间在给定函数中花费的总时间（不包括调用子函数的时间）

每次通话是 tottime 除以 ncalls 的商

业余时间是在这个和所有子函数中花费的累积时间（从调用到退出）。即使对于递归函数，这个数字也是准确的。

每次通话是 cumtime 除以原始调用的商

文件名:lineno(函数) 提供了各个函数各自的数据

您似乎多次多次调用A()...通常使用相同的值。

python2.7 -m cProfile -s tottime ${YOUR_SCRIPT}的输出

         5481855 function calls (5481734 primitive calls) in 4.986 seconds

   Ordered by: internal time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    1.572    1.572    4.986    4.986 x.py:1(<module>)
   214500    0.533    0.000    0.533    0.000 x.py:8(A)
   107251    0.462    0.000    1.986    0.000 shape_base.py:686(kron)
   107251    0.345    0.000    0.456    0.000 numeric.py:1015(outer)
   214502    0.266    0.000    0.563    0.000 {sorted}
...

尝试缓存值：

A_vals = {}

def A(d,s): # Neighborhood function
    t = (d,s)
    if t in A_vals:
        return A_vals[t]
    ret = np.exp(-d**2 / (2*s**2))
    A_vals[t] = ret
    return ret

现在我们看到了：

         6206113 function calls (6205992 primitive calls) in 4.986 seconds

   Ordered by: internal time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    1.727    1.727    4.986    4.986 x.py:1(<module>)
   121451    0.491    0.000    2.180    0.000 shape_base.py:686(kron)
   121451    0.371    0.000    0.496    0.000 numeric.py:1015(outer)
   242902    0.293    0.000    0.621    0.000 {sorted}
   121451    0.265    0.000    0.265    0.000 {method 'reduce' of 'numpy.ufunc' objects}
...
   242900    0.091    0.000    0.091    0.000 x.py:7(A)
...

此时它变成了一个简单的优化练习！...

你名单上的下一个是kron() - 享受吧！

您还会发现将脚本分解为更小的函数很有帮助（从样式的角度和分析）。我做了以下纯粹是出于分析的原因 - 所以你最好使用合理的名称，并可能做出更好的分割。

import numpy as np
import time

y = np.random.rand(2500,3) # Create random dataset to test

A_vals = {}
def A(d,s): # Neighborhood function
    t = (d,s)
    if t in A_vals:
        return A_vals[t]
    ret = np.exp(-d**2 / (2*s**2))
    A_vals[t] = ret
    return ret

def a():
    sigma_0 = float(5) # Initial standard deviation for A
    eta_0 = float(1) # Initial learning rate
    iterMax = 250 # Maximum number of iterations
    NMax = 10 # Maximum number of neurons
    w = range(NMax - 1) # Initialize the size of the weight matrix (it will store NMax-2+1 sets of weights, each of varying size depending on the value of N)
    #%% KOHONEN 1D
    t = time.time() # Start time
    for N in np.arange(2,NMax + 1): # Size of the network
        b(w, N, sigma_0, eta_0, iterMax)

    elapsedTime = time.time() - t

def b(w, N, sigma_0, eta_0, iterMax):
    w[N - 2] = np.random.uniform(0,1,(N,np.size(y,axis=1))) - 0.5 # Initial weights
    for iterCount in range(1, iterMax):
        c(N, sigma_0, eta_0, iterMax, iterCount, w)

def c(N, sigma_0, eta_0, iterMax, iterCount, w):
    # Mix the datapoints to choose them in random order
    mixInputs = y[np.random.permutation(np.size(y,axis = 0)),:]
    # Decrease the value of the variance and the learning rate
    sigma = sigma_0 - (sigma_0/(iterMax + 1)) * iterCount
    eta = eta_0 - (eta_0/(iterMax + 1)) * iterCount
    for kk in range(np.size(mixInputs,axis = 0)): # Picking one datapoint at a time
        d(N, w, mixInputs, sigma, eta, kk)
    print(N,iterCount)

def d(N, w, mixInputs, sigma, eta, kk):
    selectedInput = mixInputs[kk,:]
    # These two lines calculate the weight that is the nearest to the datapoint selected
    aux = np.absolute(np.array(np.kron(np.ones((N,1)),selectedInput)) - np.array(w[N - 2]))
    aux = np.sum(np.abs(aux)**2,axis=-1)
    ii = np.argmin(aux) # The node ii is the winner
    for jj in range(N):
        e(N, w, sigma, eta, selectedInput, ii, jj)

def e(N, w, sigma, eta, selectedInput, ii, jj):
    dist = min(np.absolute(ii-jj) , np.absolute(np.absolute(ii-jj)-N)) # Centering the neighborhood function in the winner
    f(N, w, sigma, eta, selectedInput, jj, dist)

def f(N, w, sigma, eta, selectedInput, jj, dist):
    w[N - 2][jj,:] = w[N - 2][jj,:] + eta * A(dist,sigma) * (selectedInput - w[N - 2][jj,:]) # Updating the weights

a()

         6701974 function calls (6701853 primitive calls) in 4.985 seconds

   Ordered by: internal time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
   238921    0.691    0.000    0.777    0.000 x.py:56(f)
   119461    0.613    0.000    4.923    0.000 x.py:43(d)
   119461    0.498    0.000    2.144    0.000 shape_base.py:686(kron)
   238921    0.462    0.000    1.280    0.000 x.py:52(e)
   119461    0.369    0.000    0.495    0.000 numeric.py:1015(outer)

这将f() 标识为最大时间。

【讨论】：

tottime 和cumtime 有什么区别？为什么模块的tottime 会随着你的改进而上升？为什么cumtime 保持不变？
有趣的问题 - 我已经更新了答案。我怀疑时间分布由于修改而发生了变化，但我没想到......
您好，阿蒂，感谢您的回答。正如我在 jeremycg 的回答中评论的那样，按照您在此处推荐的操作非常有帮助。然而，MATLAB 的版本仍然快得多，即使它也调用了很多函数（脚本完全相同，只是语法发生了变化）。为什么是这样？对于这段特定的代码，有没有办法在 Python 中实现 MATLAB 的性能？
@Tendero：据我了解，Python 的性能仍然远不及 MATLAB。只要一切都在 Python 中矢量化，您可能不会注意到太多差异，但使用循环等您会注意到。如果您想要原始性能，您需要编写从 Python 调用的编译代码（例如，查看 Cython 或 Pybind11），或者转向具有良好 JIT 的脚本语言（例如，查看 Julia）。
我无法对 MATLAB 发表评论，但我认为它可以编译成比 Python 更高效的东西——可能会在此过程中进行优化。如果您追求原始效率，那么 python 并不是真正适合您的平台 - 可以相当轻松地使用 C 函数来增强 python，这可能会提供您所追求的速度提升...