Python：并行运行嵌套循环，2D移动窗口答案

【问题标题】：Python: Running nested loop, 2D moving window, in ParallelPython：并行运行嵌套循环，2D移动窗口
【发布时间】：2020-02-18 19:09:49
【问题描述】：

我使用地形数据。对于一个特定的问题，我用 Python 编写了一个函数，它使用特定大小的移动窗口来压缩矩阵（高程网格）。然后我必须对此窗口进行分析，并将窗口中心的单元格设置为结果值。

我的最终输出是一个与我的原始矩阵大小相同的矩阵，该矩阵已根据我的分析进行了更改。这个问题需要 11 个小时才能在一个小区域上运行，所以我认为并行化内部循环会加快速度。 另外，也可能有一个聪明的矢量化解决方案......

请看下面我的函数，DEM 是一个 2D numpy 数组，w 是窗口的大小。

def RMSH_det(DEM, w):
    import numpy as np
    from scipy import signal
    [nrows, ncols] = np.shape(DEM)

    #create an empty array to store result
    rms = DEM*np.nan

#    nw=(w*2)**2
#    x = np.arange(0,nw)

    for i in np.arange(w+1,nrows-w):


        for j in np.arange(w+1,ncols-w):

            d1 = np.int64(np.arange(i-w,i+w))
            d2 = np.int64(np.arange(j-w,j+w))

            win = DEM[d1[0]:d1[-1],d2[0]:d2[-1]]

            if np.max(np.isnan(win)) == 1:
                rms[i,j] = np.nan

            else:
                win = signal.detrend(win, type = 'linear')
                z = np.reshape(win,-1)
                nz = np.size(z)
                rootms = np.sqrt(1 / (nz - 1) * np.sum((z-np.mean(z))**2))
                rms[i,j] = rootms


    return(rms)

我已经在 SO/SE 中搜索了我的问题的解决方案，并遇到了许多嵌套 for 循环并尝试并行运行它们的示例。我一直在努力调整我的代码以匹配示例，并希望得到一些帮助。这个问题的解决方案将帮助我使用我拥有的其他几个移动窗口函数。

到目前为止，我已经将内部循环移到了它自己的函数中，可以从外部循环中调用它：

def inLoop(i, w, DEM,rms,ncols):
        for j in np.arange(w+1,ncols-w):

            d1 = np.int64(np.arange(i-w,i+w))
            d2 = np.int64(np.arange(j-w,j+w))

            win = DEM[d1[0]:d1[-1],d2[0]:d2[-1]]

            if np.max(np.isnan(win)) == 1:
                rms[i,j] = np.nan

            else:
                win = signal.detrend(win, type = 'linear')
                z = np.reshape(win,-1)
                nz = np.size(z)
                rootms = np.sqrt(1 / (nz - 1) * np.sum((z-np.mean(z))**2))
                rms[i,j] = rootms


        return(rms)

但我不确定使用需要输入内部循环的必要变量对 Pool 调用进行编码的正确方法。请参阅下面的外循环：

 for i in np.arange(w+1,nrows-w):
        number_of_workers = 8

        with Pool(number_of_workers) as p:
            #call the pool
            p.starmap(inLoop, [i, w, DEM, rms, ncols])

剩下的问题：

甚至可以通过并行化优化这段代码吗？
如何成功存储并行化嵌套 for 循环的结果？

【问题讨论】：

您介意更具体一些吗（未能在此处发布 MCVE - 无法重现）？ [nrows, ncols], dem.dtype, dem.flags 和 w 的用例显式值是什么，[hours] 中的目标运行时间是什么，您的目标是设计您的解决方案吗？
[nrows, ncols] = 1355, 1165 dem.dtype = float32 dem.flags = C_CONTIGUOUS : True F_CONTIGUOUS : False OWNDATA : True WRITEABLE : True ALIGNED : True WRITEBACKIFCOPY : False UPDATEIFCOPY : False w = 10 目前运行需要 11 小时，我希望看到结果在 6 或更快。
是否也允许使用 Numba 解决方案？可以在 10 秒内完成（稍微重写）。
@max9111 您是否介意检查以下代码中有关性能提示的注释并调整您建议的代码以不破坏 DEM 中的原始数据（通过“通过” np.view()- only ) 并且只有在这被纠正之后 - 你介意在更新的 DEM 尺度上测试它吗？ 13k1 x 11k4 x 4B float32 数据类型？感谢您的重新考虑和 numba 编译的 scipy.signal.detrend() 技巧。 先生，对此表示敬意。
isnan(win) 中出现另一个错误。 np.nan==np.nan 总是错误的......

标签： python-3.x performance numpy parallel-processing nested-loops

【解决方案1】：

Q : 这个问题需要 11 小时 在小范围内运行，... _{敬请期待，我们可以我们将在 20 [分钟] 内完成！}

提供了适当的解释，我感谢O / P作者：

# DEM.shape = [nrows, ncols] = [ 1355, 1165 ]
# DEM.dtype = float32 
#    .flags = C_CONTIGUOUS    : True
#             F_CONTIGUOUS    : False
#             OWNDATA         : True
#             WRITEABLE       : True
#             ALIGNED         : True
#             WRITEBACKIFCOPY : False
#             UPDATEIFCOPY    : False

我尝试审查代码并设置一个更高效的代码模型，然后再将所有流行和即用型 numpy + numba 类固醇放入其中numpy-only result works
在[100,100] DEM-grid的样本上在上述内核窗口宽度~ 6 [s]上工作@987654328 @

同样，对于[200,200] DEM-grid，采用 ~ 36 [s] - 显然，缩放比例是 ~ O( N^2 )

同样，对于 [1000,1000] DEM-grid，采用 ~ 1077 [s] ~ 17.6 [min] wow !

.jit[1000,1000] DEM-grid 上的字段目前正在测试中，一旦完成就会更新帖子 + 一旦 numba.jit() 代码将享受运行进一步加速的结果

到目前为止，很有希望，不是吗？

如果您 @morrismc 现在在 [100,100]-matrix 上测试您的原样代码，我们已经可以猜测主体加速的实现范围，甚至在运行测试完成之前。

>>> pass;    import numpy as np
>>> from zmq import Stopwatch; clk = Stopwatch()
>>>
>>> size =  100; demF32 = np.random.random( ( size, size ) ).astype( np.float32 ); resF32 = demF32.copy(); clk.start(); _ = RMSH_det( demF32, 10, resF32 ); t = clk.stop(); print( "{1:>13d} [us]\nNumOf_np.nan-s was {0:d}".format( _, t ) )
      6492192 [us]
NumOf_np.nan-s was 0

>>> size =  200; demF32 = np.random.random( ( size, size ) ).astype( np.float32 ); resF32 = demF32.copy(); clk.start(); _ = RMSH_det( demF32, 10, resF32 ); t = clk.stop(); print( "{1:>13d} [us]\nNumOf_np.nan-s was {0:d}".format( _, t ) )
     35650629 [us]
NumOf_np.nan-s was 0

>>> size = 1000; demF32 = np.random.random( ( size, size ) ).astype( np.float32 ); resF32 = demF32.copy(); clk.start(); _ = RMSH_det( demF32, 10, resF32 ); t = clk.stop(); print( "{1:>13d} [us]\nNumOf_np.nan-s was {0:d}".format( _, t ) )
   1058702889 [us]
NumOf_np.nan-s was 0

所有这些都在 scipy 1.2.1 上，因此没有 1.3.1 可能进一步加速的好处

`numba.jit()` LLVM 编译的代码。哎呀，更慢？

numba.jit()-acceleration 在[100,100] DEM-grid 上显示了大约 200 [ms] 更糟 运行时间，并指定了签名（因此没有临时分析此处产生了成本）和nogil = True（'0.43.1+0.g8dabe7abe.dirty' 还不是最新的）

猜猜这里没有更多的收获，无需将游戏移动到编译的 Cython 区域，但只有大约几十分钟而不是几十小时，Alea Iacta Est - 只是 numpy 智能矢量化代码规则！

结语：

如果原始算法是正确的（并且对于任何进一步的改进工作在源代码中留下了一些疑问），任何尝试运行其他形式的[PARALLEL] code-execution-flow 在这里都无济于事 （ kernel-windows[w,w] 是 DEM 网格内存布局的非常小的非连续区域，内存 I/O 成本是这里运行时预算的主要部分，还有一些更好的索引可能会提高缓存行的重用率，但总体工作量远远超出预算，因为从~ 11 [hrs] 下降到大约~ 6 [hrs] 的目标已经成功实现关于[1300,1100] float32 DEM-grids 可实现的 ~ 20 [min] 运行时

代码保持原样（非 PEP-8），因为 [DOC.me], [TEST.me] 和 [PERF.me] 的所有附加教学价值QA 阶段，所以所有的 PEP-isto-evangelisators 都支持 O/P 作者对左侧全屏宽度布局的看法，以便理解 WHY 并改进代码, 剥离 cmets 将失去她/他进一步提高代码性能的前进方向。谢谢。

@jit( [ "int32( float32[:,:], int32, float32[:,:] )", ], nogil    = True )                  # numba.__version__ '0.43.1+0.g8dabe7abe.dirty'
def RMSH_det_jit( DEMf32, w, rmsRESULTf32 ):                            # pre-allocate rmsRESULTf32[:,:] externally
    #import numpy as np
    #from scipy import signal
    #
    # [nrows, ncols] = np.shape( DEM )                                  # avoid ~ [ 1355, 1165 ]
    #                                                                   # DEM.dtype = float32 
    #                                                                   #    .flags = C_CONTIGUOUS    : True
    #                                                                   #             F_CONTIGUOUS    : False
    #                                                                   #             OWNDATA         : True
    #                                                                   #             WRITEABLE       : True
    #                                                                   #             ALIGNED         : True
    #                                                                   #             WRITEBACKIFCOPY : False
    #                                                                   #             UPDATEIFCOPY    : False
    #
    rmsRESULTf32[:,:] = np.nan                                          #        .STO[:,:] np.nan-s, using in-place assignment into the by-ref passed, externally pre-allocated np.ndarray
    dtdWIN            = np.ones( ( 2 * w - 1,                           #        .ALLOC once, re-use 1M+ times
                                   2 * w - 1 ) )
    a_div_by_nz_minus1 = 1. / ( dtdWIN.size - 1  )                      #        .SET float CONST with about a ~1M+ re-use
    a_num_of_NaNs      = 0                                              #        .SET i4 bonus value, ret'd as a side-effect of the signature ... 
    # rms = DEM*np.nan                                                  # avoid ( pre-alloc rmsRESULTf32 ) externally create and pass a right-sized, empty array to store all results
    # nw  = ( w * 2 )**2
    # x   = np.arange( 0, nw )

    #                        11..1344
    #or     i in np.arange( w+1,           nrows-w ):                   # w ~ 10 -> [11:1344, 11:1154]
    for     i in np.arange( w+1, DEMf32.shape[0]-w ):                   #         ??? never touches DEM-row/column[0]?? or off-by-one indexing error ???
        fromI = i - w                                                   #        .UPD ALAP
        tillI = i + w - 1                                               #        .UPD ALAP upper bound index excluded ( this is how a code in [ np.arange(...)[0]:np.arange(...)[-1] ] works )
        #                    11..1154
        #or j in np.arange( w+1,           ncols-w ):
        for j in np.arange( w+1, DEMf32.shape[1]-w ):
            fromJ = j - w                                               #        .UPD ALAP
            tillJ = j + w - 1                                           #        .UPD ALAP upper bound index excluded ( this is how a code in [ np.arange(...)[0]:np.arange(...)[-1] ] works )
            #                       1..1334:21..1354                    #         ??? never touches first/last DEM-row/column??
            # d1 = np.int64( np.arange( i-w, i+w ) )                    # AVOID: 1M+ times allocated, yet never consumed, but their edge values
            # d2 = np.int64( np.arange( j-w, j+w ) )                    # AVOID: 1M+ times allocated, yet never consumed, but their edge values

            # win = DEM[ d1[0]:d1[-1],                                  # AVOID: while a .view-only, no need to 1M+ times instantiate a "kernel"-win(dow] ( this will create a np.view into the original DEM, not a copy ! )
            #            d2[0]:d2[-1]                                   # ?.or.?   NOT a .view-only, but a new .copy() instantiated, so as to call .detrend() w/o in-place modifying DEMf32 ???
            #            ]                                              # ?.or.?   NOT a .view-only, but a new .copy() instantiated, so as to call .detrend() w/o in-place modifying DEMf32 ???
            dtdWIN[:,:] = DEMf32[fromI:tillI, fromJ:tillJ]              #          NOT a .view-only, but a     .copy() re-populated into a just once and only once pre-allocated dtdWIN, via an in-place copy
            #f np.max( np.isnan(    win ) ) == 1:                       # AVOID: 1M+ times full-range scan, while any first np.nan decides the game and no need to scan "the rest"
            if np.any( np.isnan( dtdWIN ) ):                            #        "density" of np.nan-s determine, if this is a good idea to pre-store
               a_num_of_NaNs += 1                                       # .INC
               continue                                                 #        .NOP/LOOP from here, already pre-stored np.nan-s for this case
               # rms[i,j] = np.nan                                      # DUP ( already stored in initialisation ... )
            else:
               #in    = signal.detrend(    win, type = 'linear' )       # REALLY?: in-place modification of DEM-matrix ???
               dtdWIN = signal.detrend( dtdWIN, type = 'linear'   )     #    in scipy-v1.3.1+ can mod in-place,   overwrite_data = True ) # REMOVE OLS-fit-linear trend
               dtdWIN = signal.detrend( dtdWIN, type = 'constant' )     #    in scipy-v1.3.1+ can mod in-place,   overwrite_data = True ) # REMOVE mean
               #z  = np.reshape( win, -1 )                              # AVOID:~1M+ re-counting constant value, known from w directly
               #nz = np.size( z )                                       # AVOID:~1M+ re-counting constant value, known from w directly
               #rootms    = np.sqrt( 1 / ( nz - 1 ) * np.sum( ( z - np.mean( z ) )**2 ) )
               #rms[i,j]  = rootms
               rmsRESULTf32[i,j] = np.sqrt( a_div_by_nz_minus1          # .STO a "scaled"
                                          * np.dot(   dtdWIN,
                                                      dtdWIN.T
                                                      ).sum()
                                          # np.sum( ( dtdWIN            #         SUM of
                                          #       # - dtdWIN.mean()     #               mean-removed ( ALREADY done via scipy.signal.detrend( 'const' ) above )
                                          #           )**2              #               SQUARES
                                          #         )
                                            )                           #      ROOT
    return( a_num_of_NaNs )                                             # ret i4

【讨论】：

感谢您的建议。我目前正在尝试运行此代码，并意识到 1165 x 1365 示例使用我现有的代码在大约 7 分钟内运行。当我通过您提供的代码示例运行相同的 DEM 时，它会在大约 6.7 分钟内运行。所以不是一个数量级或数量级的改进。如果有帮助，我可以提供 DEM 文件。
我刚刚在一个随机的 1000 x 1000 网格上尝试了您的代码，并且该函数的运行速度明显更快（4.1 分钟）。不知道为什么加速，也许是我原始矩阵的形状。此外，耗时 11 小时的完整网格为 13078 x 11437。
好的，13k1 x 11k4 x 4B 开始进入分而治之的区域，以便更快地处理和使用编译代码 - Cython + OpenMP 工具的组合将变得方便，但缓存 -行一致性增长以增加其影响力（或不利影响，如果在缓存行中工作，则重用错误对齐的顺序）。使用 split-tiles 和 8 个 CPU 内核，肯定可以实现比目标更好的性能，但 CPU/RAM 内存-I/O 通道代表了主要的上限。
所以我们回到关于并行化的讨论，以及如何使用现有代码或修改现有代码来实现这一点。欢迎对此提出任何见解。
@morrismc 请参阅重新编译 .detrend()-substitude 的建议，这进一步加快了提议的“numba.jit”方式。如果需要，使用 Cython 化代码的更高级别的基于 OpenMP 的并行性是下一个级别，以将 DEM 显式拆分为块，以便有效地利用所有 CPU 内核（请记住，RAM / CPU 内存 I/O如果没有单线程、GIL 锁步（是的，纯 SEQUENCE，重新 [SERIALISED] 代码执行），瓶颈仍将控制性能上限。有关 OpenMP / Cython docs.cython.org/en/latest/src/userguide/… 的详细信息

【解决方案2】：

使用 Numba 的解决方案

在某些情况下，如果支持您使用的所有功能，这很容易做到。在您的代码中，win = signal.detrend(win, type = 'linear') 是您必须在 Numba 中实现的部分，因为不支持此功能。

在 Numba 中实施去趋势

如果您查看 detrend 的 source-code，并提取您的问题的相关部分，它可能看起来像这样：

@nb.njit()
def detrend(w):
    Npts=w.shape[0]
    A=np.empty((Npts,2),dtype=w.dtype)
    for i in range(Npts):
        A[i,0]=1.*(i+1) / Npts
        A[i,1]=1.

    coef, resids, rank, s = np.linalg.lstsq(A, w.T)
    out=w.T- np.dot(A, coef)
    return out.T

我还为np.max(np.isnan(win)) == 1 实现了一个更快的解决方案

@nb.njit()
def isnan(win):
    for i in range(win.shape[0]):
        for j in range(win.shape[1]):
            if np.isnan(win[i,j]):
                return True
    return False

主要功能

因为我这里用的是Numba，所以并行化很简单，只是在外层循环和

import numpy as np
import numba as nb

@nb.njit(parallel=True)
def RMSH_det_nb(DEM, w):
    [nrows, ncols] = np.shape(DEM)

    #create an empty array to store result
    rms = DEM*np.nan

    for i in nb.prange(w+1,nrows-w):
        for j in range(w+1,ncols-w):
            win = DEM[i-w:i+w-1,j-w:j+w-1]

            if isnan(win):
                rms[i,j] = np.nan
            else:
                win = detrend(win)
                z = win.flatten()
                nz = z.size
                rootms = np.sqrt(1 / (nz - 1) * np.sum((z-np.mean(z))**2))
                rms[i,j] = rootms

    return rms

时间安排（小例子）

w = 10
DEM=np.random.rand(100, 100).astype(np.float32)

res1=RMSH_det(DEM, w)
res2=RMSH_det_nb(DEM, w)
print(np.allclose(res1,res2,equal_nan=True))
#True

%timeit res1=RMSH_det(DEM, w)
#1.59 s ± 72 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit res2=RMSH_det_nb(DEM, w) #approx. 55 times faster
#29 ms ± 1.85 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

大型阵列的时间

w = 10
DEM=np.random.rand(1355, 1165).astype(np.float32)
%timeit res2=RMSH_det_nb(DEM, w)
#6.63 s ± 21.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

[编辑] 使用正规方程的实现

Overdetermined system

此方法数值精度较低。虽然这个解决方案要快得多。

@nb.njit()
def isnan(win):
    for i in range(win.shape[0]):
        for j in range(win.shape[1]):
            if win[i,j]==np.nan:
                return True
    return False

@nb.njit()
def detrend(w):
    Npts=w.shape[0]
    A=np.empty((Npts,2),dtype=w.dtype)
    for i in range(Npts):
        A[i,0]=1.*(i+1) / Npts
        A[i,1]=1.

    coef, resids, rank, s = np.linalg.lstsq(A, w.T)
    out=w.T- np.dot(A, coef)
    return out.T

@nb.njit()
def detrend_2(w,T1,A):
    T2=np.dot(A.T,w.T)
    coef=np.linalg.solve(T1,T2)

    out=w.T- np.dot(A, coef)

    return out.T

@nb.njit(parallel=True)
def RMSH_det_nb_normal_eq(DEM,w):
    [nrows, ncols] = np.shape(DEM)

    #create an empty array to store result
    rms = DEM*np.nan

    Npts=w*2-1
    A=np.empty((Npts,2),dtype=DEM.dtype)
    for i in range(Npts):
        A[i,0]=1.*(i+1) / Npts
        A[i,1]=1.

    T1=np.dot(A.T,A)

    nz = Npts**2
    for i in nb.prange(w+1,nrows-w):
        for j in range(w+1,ncols-w):
            win = DEM[i-w:i+w-1,j-w:j+w-1]

            if isnan(win):
                rms[i,j] = np.nan
            else:
                win = detrend_2(win,T1,A)
                rootms = np.sqrt(1 / (nz - 1) * np.sum((win-np.mean(win))**2))
                rms[i,j] = rootms

    return rms

时间

w = 10
DEM=np.random.rand(100, 100).astype(np.float32)

res1=RMSH_det(DEM, w)
res2=RMSH_det_nb(DEM, w)
print(np.allclose(res1,res2,equal_nan=True))
#True

%timeit res1=RMSH_det(DEM, w)
#1.59 s ± 72 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit res2=RMSH_det_nb_normal_eq(DEM,w)
#7.97 ms ± 89.4 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)

使用正规方程的优化解

重用临时数组以避免昂贵的内存分配，并使用矩阵乘法的自定义实现。这仅推荐用于非常小的矩阵，在大多数其他情况下，np.dot (sgeemm) 会快很多。

@nb.njit()
def matmult_2(A,B,out):
    for j in range(B.shape[1]):
        acc1=nb.float32(0)
        acc2=nb.float32(0)
        for k in range(B.shape[0]):
            acc1+=A[0,k]*B[k,j]
            acc2+=A[1,k]*B[k,j]
        out[0,j]=acc1
        out[1,j]=acc2
    return out

@nb.njit(fastmath=True)
def matmult_mod(A,B,w,out):
    for j in range(B.shape[1]):
        for i in range(A.shape[0]):
            acc=nb.float32(0)
            acc+=A[i,0]*B[0,j]+A[i,1]*B[1,j]
            out[j,i]=acc-w[j,i]
    return out

@nb.njit()
def detrend_2_opt(w,T1,A,Tempvar_1,Tempvar_2):
    T2=matmult_2(A.T,w.T,Tempvar_1)
    coef=np.linalg.solve(T1,T2)
    return matmult_mod(A, coef,w,Tempvar_2)

@nb.njit(parallel=True)
def RMSH_det_nb_normal_eq_opt(DEM,w):
    [nrows, ncols] = np.shape(DEM)

    #create an empty array to store result
    rms = DEM*np.nan

    Npts=w*2-1
    A=np.empty((Npts,2),dtype=DEM.dtype)
    for i in range(Npts):
        A[i,0]=1.*(i+1) / Npts
        A[i,1]=1.

    T1=np.dot(A.T,A)

    nz = Npts**2
    for i in nb.prange(w+1,nrows-w):
        Tempvar_1=np.empty((2,Npts),dtype=DEM.dtype)
        Tempvar_2=np.empty((Npts,Npts),dtype=DEM.dtype)
        for j in range(w+1,ncols-w):
            win = DEM[i-w:i+w-1,j-w:j+w-1]

            if isnan(win):
                rms[i,j] = np.nan
            else:
                win = detrend_2_opt(win,T1,A,Tempvar_1,Tempvar_2)
                rootms = np.sqrt(1 / (nz - 1) * np.sum((win-np.mean(win))**2))
                rms[i,j] = rootms

    return rms

时间

w = 10
DEM=np.random.rand(100, 100).astype(np.float32)

res1=RMSH_det(DEM, w)
res2=RMSH_det_nb_normal_eq_opt(DEM, w)
print(np.allclose(res1,res2,equal_nan=True))
#True

%timeit res1=RMSH_det(DEM, w)
#1.59 s ± 72 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit res2=RMSH_det_nb_normal_eq_opt(DEM,w)
#4.66 ms ± 87.2 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

isnan 的时间

这个函数是一个完全不同的实现。如果 NaN 位于数组的开头，它会快得多，但无论如何，即使没有，也会有一些加速。我用小数组（大约窗口大小）和@user3666197 建议的大大小对它进行了基准测试。

case_1=np.full((20,20),np.nan)
case_2=np.full((20,20),0.)
case_2[10,10]=np.nan
case_3=np.full((20,20),0.)

case_4 = np.full( ( int( 1E4 ), int( 1E4 ) ),np.nan)
case_5 = np.ones( ( int( 1E4 ), int( 1E4 ) ) )

%timeit np.any(np.isnan(case_1))
%timeit np.any(np.isnan(case_2))
%timeit np.any(np.isnan(case_3))
%timeit np.any(np.isnan(case_4))
%timeit np.any(np.isnan(case_5))
#2.75 µs ± 73.1 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
#2.75 µs ± 46.5 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
#2.76 µs ± 32.9 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
#81.3 ms ± 2.97 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
#86.7 ms ± 2.26 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

%timeit isnan(case_1)
%timeit isnan(case_2)
%timeit isnan(case_3)
%timeit isnan(case_4)
%timeit isnan(case_5)
#244 ns ± 5.02 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
#357 ns ± 1.07 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
#475 ns ± 9.28 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
#235 ns ± 0.933 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
#58.8 ms ± 2.08 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

【讨论】：

您确定代码正确吗？ SLOC win = DEM[i-w:i+w-1,j-w:j+w-1] 将 win 分配为 np.view() 到原始 DEM 的一个区域中，其中来自 win = signal.detrend(win, type = 'linear') 的修改实际上是 销毁DEM 中的原始数据+ 其余部分正在处理已经损坏/破坏的DEM-data ... 这是O / P的意图吗？ +1 用于为numba 实现.detrend() 方法
@user3666197 dtrend 中的否是分配的另一个数组（输出）。因此，win = detrend(win) win 行之后指向另一个内存区域。也许我应该重命名变量以使其更清晰。
“working-kernel-win”的预分配（一次性成本）将提高性能 - [1k3,1k1]-demo 执行约 1M + 次，~150M+ 次为 [13k1,11k4]-real-DEM 执行， 所以值得清理所有可能的开销 :o)
感谢@max9111，提供此解决方案。我目前正在尝试实施它。首先一个问题，我是否需要在与主 RMSH 函数不同的文件中定义 detrend 和 max 函数？或者我可以使用其中定义的那些子函数调用单个函数吗？
@max9111 我能够成功实现您的解决方案，并且在我的测试网格 (1165 x 1355) 上花费了 7 秒，并且似乎复制了原始算法的结果。速度增加的根源是平行外循环吗？

到目前为止，很有希望，不是吗？

numba.jit() LLVM 编译的代码。哎呀，更慢？

结语：

使用 Numba 的解决方案

`numba.jit()` LLVM 编译的代码。哎呀，更慢？