Cython speed vs numpy

【Question title】: Cython speed vs numpy
【Posted】: 2015-06-09 17:24:50
【Question description】:

This is my first attempt at Cython. I am trying to convert a function from pure numpy to Cython.

Here are the two functions:

from __future__ import division
import numpy as np
cimport numpy as np

DTYPEf = np.float64
ctypedef np.float64_t DTYPEf_t

DTYPEi = np.int64
ctypedef  np.int64_t DTYPEi_t

DTYPEu = np.uint8
ctypedef np.uint8_t DTYPEu_t

cimport cython

@cython.boundscheck(False)
@cython.wraparound(False)
def twodcitera(np.ndarray[DTYPEf_t, ndim=3] data, int res, int indexl, int indexu, float radius1, float radius2, output, float height1, float height2 ):
    '''
    Function to return correlation for fixed radius using Cython
    '''
    cdef float sum_mask = 0
    cdef int i,j,k
    cdef int a, b, c
    cdef np.ndarray[DTYPEi_t, ndim=3] x
    cdef np.ndarray[DTYPEi_t, ndim=3] y
    cdef np.ndarray[DTYPEi_t, ndim=3] z
    cdef np.ndarray[DTYPEu_t, ndim=3, cast=True] R

    a,b,c = res//2,res//2,res//2
    x,y,z = np.ogrid[-a:a,-b:b,-c:c]

    for i in xrange(indexl,indexu):
        for j in xrange(1):
            for k in xrange(1):
                R = np.roll(np.roll(np.roll(np.logical_and(np.logical_or(np.logical_and(z>height1,z<=height2), np.logical_and(z<-height1,z>=-height2)), np.logical_and(x**2 + y**2<= radius2**2, x**2 + y**2 > radius1**2)), (i-a), axis =0), (j-a), axis =1), (k-a), axis =2)
                sum_mask += (data[i][j][k] * np.average(data[R]))

    output.put(sum_mask)

And the numpy implementation:

def no_twodcitera(data, res, indexl, indexu, radius1, radius2, output, height1, height2 ):
    '''
    Function to return correlation for fixed radius
    '''
    a,b,c = res/2,res/2,res/2
    x,y,z = np.ogrid[-a:a,-b:b,-c:c]
    sum_mask = 0
    for i in xrange(indexl,indexu):
        for j in xrange(1):
            for k in xrange(1):
                R = np.roll(np.roll(np.roll(np.logical_and(np.logical_or(np.logical_and(z>height1,z<=height2), np.logical_and(z<-height1,z>=-height2)), np.logical_and(x**2 + y**2<= radius2**2, x**2 + y**2 > radius1**2)), (i-a), axis =0), (j-a), axis =1), (k-a), axis =2)
                sum_mask += (data[i][j][k] * np.average(data[R]))

    output.put(sum_mask)

Both functions actually take the same time to run.

%timeit -n200 -r10 twodcitera(dd, tes_res,in1,in2,r[k],r[k+1], output, r[l], r[l+1])
200 loops, best of 10: 1.57 ms per loop

%timeit -n200 -r10 no_twodcitera(dd, tes_res,in1,in2,r[k],r[k+1], output, r[l], r[l+1])
200 loops, best of 10: 1.57 ms per loop

I would like to know what I am doing wrong, or what I have misunderstood about using Cython. The inputs are:

import multiprocessing as mp

dd  = np.random.randn(64,64,64)
res = 64
r   = np.arange(0,21,2)
in1 = 0
in2 = 1
l   = 5
k   = 7
output = mp.Queue()

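Thanks for pointing out my misunderstanding.

【Discussion】:

The xrange(1) for j and k is only there for testing; eventually it will be for j in xrange(res) and for k in xrange(res).
Did you try running the code through cython -a? docs.cython.org/src/quickstart/…
Also, what are in1, in2, etc.?
So r = np.arange(0,21,2). in1 and in2 are the two indices used to walk through the array, e.g. in1 = 0, in2 = 1. radius1 and radius2 are the radii of the circular shell, and height1 and height2 are the heights of the cylindrical shell. I have parallelized the code to run on multiple cores, which is why output is used, i.e. output = mp.Queue(). I hope this helps in understanding the code better.
np.roll, np.logical_and, etc.: all of these are called as ordinary Python functions, so compiling the code that calls them with Cython will not give much of a speedup. If you really want a speedup, express the same operation as explicit loops over every index of the R array. Also, data[i][j][k] is bad for Cython speed; use data[i,j,k] instead.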
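The last comment can be taken one step further in plain numpy, before any Cython is involved. The sketch below is only an illustration, not the asker's or the commenter's code: the helper names shell_mask, shell_average_roll and shell_average_offsets are hypothetical, the data is assumed to be a cube, and np.roll with tuple shifts assumes NumPy 1.12 or newer. It relies on the fact that the unrolled mask does not depend on i, j, k, so it can be built once, and that rolling the mask is equivalent to shifting the coordinates of its True entries modulo the grid size.

```python
import numpy as np


def shell_mask(res, radius1, radius2, height1, height2):
    """Boolean cylindrical-shell mask centred in a res**3 grid (hypothetical helper)."""
    a = res // 2
    x, y, z = np.ogrid[-a:a, -a:a, -a:a]
    rho2 = x**2 + y**2
    in_height = ((z > height1) & (z <= height2)) | ((z < -height1) & (z >= -height2))
    in_ring = (rho2 <= radius2**2) & (rho2 > radius1**2)
    return in_height & in_ring


def shell_average_roll(data, mask, i, j, k):
    """Reference version: roll the precomputed mask, then index with it."""
    a = data.shape[0] // 2
    rolled = np.roll(mask, shift=(i - a, j - a, k - a), axis=(0, 1, 2))
    return data[rolled].mean()


def shell_average_offsets(data, offsets, i, j, k):
    """Same average without np.roll: shift the mask coordinates modulo n."""
    n = data.shape[0]  # assumption: data is a cube of side n
    a = n // 2
    return data[(offsets[0] + i - a) % n,
                (offsets[1] + j - a) % n,
                (offsets[2] + k - a) % n].mean()


rng = np.random.default_rng(0)
data = rng.standard_normal((16, 16, 16))
mask = shell_mask(16, radius1=2, radius2=4, height1=1, height2=3)
offsets = np.nonzero(mask)  # computed once, reused for every (i, j, k)

via_roll = shell_average_roll(data, mask, 3, 5, 7)
via_offsets = shell_average_offsets(data, offsets, 3, 5, 7)
print(np.isclose(via_roll, via_offsets))
```

Whether the offset version is actually faster depends on how sparse the mask is, so it is worth timing on the real 64x64x64 input. The same coordinate arithmetic is also what an explicit Cython loop over the mask points would compute.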

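Tags: python numpy optimization cython

【Solution 1】:

Without knowing your inputs and outputs, the following compiled for me by following the cython guide. If you explain how to create the test input, I may be able to help more.

Edit: My first thought was that something might be wrong with the Cython compilation, but I could not find anything really useful. So this answer does not really help with the speed problem. I am leaving it here anyway for anyone interested in testing and understanding.

Put the code in test.pyx: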

cimport cython
import numpy as np
cimport numpy as np

DTYPEf = np.float64
ctypedef np.float64_t DTYPEf_t

DTYPEi = np.int64
ctypedef  np.int64_t DTYPEi_t

DTYPEu = np.uint8
ctypedef np.uint8_t DTYPEu_t


@cython.boundscheck(False)
@cython.wraparound(False)
def twodcitera(np.ndarray[DTYPEf_t, ndim=3] data, int res, int indexl, int indexu, float radius1, float radius2, output, float height1, float height2 ):
    '''
    Function to return correlation for fixed radius using Cython
    '''
    cdef float sum_mask = 0
    cdef int i,j,k
    cdef int a, b, c
    cdef np.ndarray[DTYPEi_t, ndim=3] x
    cdef np.ndarray[DTYPEi_t, ndim=3] y
    cdef np.ndarray[DTYPEi_t, ndim=3] z
    cdef np.ndarray[DTYPEu_t, ndim=3, cast=True] R
    a,b,c = res//2,res//2,res//2
    x,y,z = np.ogrid[-a:a,-b:b,-c:c]
    for i in xrange(indexl,indexu):
        for j in xrange(1):
            for k in xrange(1):
                R = np.roll(np.roll(np.roll(np.logical_and(np.logical_or(np.logical_and(z>height1,z<=height2), np.logical_and(z<-height1,z>=-height2)), np.logical_and(x**2 + y**2<= radius2**2, x**2 + y**2 > radius1**2)), (i-a), axis =0), (j-a), axis =1), (k-a), axis =2)
                sum_mask += (data[i][j][k] * np.average(data[R]))
    output.put(sum_mask)

Create the build file setup.py containing:

from distutils.core import setup
from Cython.Build import cythonize

setup(
    name = "testapp",
    ext_modules = cythonize('test.pyx'),  # accepts a glob pattern
    )

Go to a shell and compile it:

$ python setup.py build_ext --inplace

Go to ipython and import it:

import test

It ran for me.

The speed test shows:

In [28]: %timeit -n200 -r10 no_twodcitera(dd, res,in1,in2,r[k],r[k+1], output, r[l], r[l+1])
200 loops, best of 10: 1.29 ms per loop

In [29]: %timeit -n200 -r10 test.twodcitera(dd, res,in1,in2,r[k],r[k+1], output, r[l], r[l+1])
200 loops, best of 10: 1.31 ms per loop

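So the results are the same, with no real difference. I also ran cProfile to see whether anything stood out in the run time of the call stack. Admittedly, a cProfile run at the millisecond scale is hard to interpret! Let's try anyway.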

In [34]: cProfile.run("""no_twodcitera(dd, res,in1,in2,r[k],r[k+1], output, r[l], r[l+1])""")
         82 function calls in 0.004 seconds

   Ordered by: standard name

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.001    0.001    0.004    0.004 <ipython-input-27-663e142d15fb>:1(no_twodcitera)
        1    0.000    0.000    0.004    0.004 <string>:1(<module>)
        1    0.000    0.000    0.000    0.000 _methods.py:43(_count_reduce_items)
        1    0.000    0.000    0.000    0.000 _methods.py:53(_mean)
        1    0.000    0.000    0.000    0.000 function_base.py:436(average)
        1    0.000    0.000    0.000    0.000 index_tricks.py:151(__getitem__)
        3    0.000    0.000    0.002    0.001 numeric.py:1279(roll)
        1    0.000    0.000    0.000    0.000 numeric.py:394(asarray)
        4    0.000    0.000    0.000    0.000 numeric.py:464(asanyarray)
        1    0.000    0.000    0.000    0.000 queues.py:99(put)
        1    0.000    0.000    0.000    0.000 threading.py:299(_is_owned)
        1    0.000    0.000    0.000    0.000 threading.py:372(notify)
        1    0.000    0.000    0.000    0.000 threading.py:63(_note)
        1    0.000    0.000    0.000    0.000 {hasattr}
       18    0.000    0.000    0.000    0.000 {isinstance}
        1    0.000    0.000    0.000    0.000 {issubclass}
        5    0.000    0.000    0.000    0.000 {len}
        3    0.000    0.000    0.000    0.000 {math.ceil}
        1    0.000    0.000    0.000    0.000 {method 'acquire' of '_multiprocessing.SemLock' objects}
        2    0.000    0.000    0.000    0.000 {method 'acquire' of 'thread.lock' objects}
        1    0.000    0.000    0.000    0.000 {method 'append' of 'collections.deque' objects}
        3    0.000    0.000    0.000    0.000 {method 'append' of 'list' objects}
        1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}
        1    0.000    0.000    0.000    0.000 {method 'mean' of 'numpy.ndarray' objects}
        1    0.000    0.000    0.000    0.000 {method 'reduce' of 'numpy.ufunc' objects}
        1    0.000    0.000    0.000    0.000 {method 'release' of 'thread.lock' objects}
        3    0.002    0.001    0.002    0.001 {method 'take' of 'numpy.ndarray' objects}
        9    0.000    0.000    0.000    0.000 {numpy.core.multiarray.arange}
        5    0.000    0.000    0.000    0.000 {numpy.core.multiarray.array}
        3    0.000    0.000    0.000    0.000 {numpy.core.multiarray.concatenate}
        4    0.000    0.000    0.000    0.000 {range}
        1    0.000    0.000    0.000    0.000 {zip}



In [35]: cProfile.run("""test.twodcitera(dd, res,in1,in2,r[k],r[k+1], output, r[l], r[l+1])""")
         82 function calls in 0.003 seconds

   Ordered by: standard name

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000    0.003    0.003 <string>:1(<module>)
        1    0.000    0.000    0.000    0.000 _methods.py:43(_count_reduce_items)
        1    0.000    0.000    0.000    0.000 _methods.py:53(_mean)
        1    0.000    0.000    0.000    0.000 function_base.py:436(average)
        1    0.000    0.000    0.000    0.000 index_tricks.py:151(__getitem__)
        3    0.000    0.000    0.001    0.000 numeric.py:1279(roll)
        1    0.000    0.000    0.000    0.000 numeric.py:394(asarray)
        4    0.000    0.000    0.000    0.000 numeric.py:464(asanyarray)
        1    0.000    0.000    0.000    0.000 queues.py:99(put)
        1    0.000    0.000    0.000    0.000 threading.py:299(_is_owned)
        1    0.000    0.000    0.000    0.000 threading.py:372(notify)
        1    0.000    0.000    0.000    0.000 threading.py:63(_note)
        1    0.000    0.000    0.000    0.000 {hasattr}
       18    0.000    0.000    0.000    0.000 {isinstance}
        1    0.000    0.000    0.000    0.000 {issubclass}
        5    0.000    0.000    0.000    0.000 {len}
        3    0.000    0.000    0.000    0.000 {math.ceil}
        1    0.000    0.000    0.000    0.000 {method 'acquire' of '_multiprocessing.SemLock' objects}
        2    0.000    0.000    0.000    0.000 {method 'acquire' of 'thread.lock' objects}
        1    0.000    0.000    0.000    0.000 {method 'append' of 'collections.deque' objects}
        3    0.000    0.000    0.000    0.000 {method 'append' of 'list' objects}
        1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}
        1    0.000    0.000    0.000    0.000 {method 'mean' of 'numpy.ndarray' objects}
        1    0.000    0.000    0.000    0.000 {method 'reduce' of 'numpy.ufunc' objects}
        1    0.000    0.000    0.000    0.000 {method 'release' of 'thread.lock' objects}
        3    0.001    0.000    0.001    0.000 {method 'take' of 'numpy.ndarray' objects}
        9    0.000    0.000    0.000    0.000 {numpy.core.multiarray.arange}
        5    0.000    0.000    0.000    0.000 {numpy.core.multiarray.array}
        3    0.000    0.000    0.000    0.000 {numpy.core.multiarray.concatenate}
        4    0.000    0.000    0.000    0.000 {range}
        1    0.001    0.001    0.003    0.003 {test.twodcitera}
        1    0.000    0.000    0.000    0.000 {zip}

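Sadly, nothing stands out. In both profiles the only entries with measurable time are the function itself and the three np.roll calls (through ndarray.take), which cost the same in the Cython and the numpy version. I would conclude that numpy is already well implemented and that most of the time is not lost in the nested loops. Also, Cython mainly benefits from static typing. Since the work here is done inside numpy calls, that is probably not a big benefit.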
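One way to make a millisecond-scale profile easier to read is to accumulate many calls in a single profiler and sort by cumulative time. The following is a minimal sketch using only the standard library cProfile and pstats modules; profile_repeated and roll_workload are hypothetical names, and the workload is a small stand-in for the functions above, not the original code.

```python
import cProfile
import io
import pstats

import numpy as np


def profile_repeated(func, repeats=200, top=5):
    """Profile `repeats` calls of func and return the `top` entries by cumulative time."""
    profiler = cProfile.Profile()
    profiler.enable()
    for _ in range(repeats):
        func()
    profiler.disable()
    stream = io.StringIO()
    stats = pstats.Stats(profiler, stream=stream)
    stats.sort_stats("cumulative").print_stats(top)
    return stream.getvalue()


def roll_workload():
    """Stand-in workload: three nested np.roll calls followed by boolean indexing."""
    mask = np.zeros((32, 32, 32), dtype=bool)
    mask[10:20, 10:20, 10:20] = True
    data = np.ones((32, 32, 32))
    rolled = np.roll(np.roll(np.roll(mask, 3, axis=0), 3, axis=1), 3, axis=2)
    return data[rolled].mean()


report = profile_repeated(roll_workload)
print(report)
```

With the times summed over 200 calls, the columns no longer round to 0.000, so the share spent in np.roll versus the indexing and the mean becomes visible.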

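【Discussion】:

As @moarningsun said, I was able to compile; I am just looking into how to improve performance. I have edited my question to state clearly which inputs I use. Thanks.
You guys are right, sorry for the confusion. I first thought the cause might be the compilation, but I had no test data. I also looked at the run-time profile, but there was not much in it. Maybe someone could try scipy weave or a Fortran implementation to speed this up.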