Understanding shared memory use for improvement in Numba

[Question title]: Understanding shared memory use for improvement in Numba
[Posted]: 2021-09-15 18:53:23
[Question]:

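I'm trying to learn more about using shared memory to improve the performance of some CUDA kernels in Numba. To that end, I looked at the matrix multiplication example in the Numba documentation and tried to implement it to see the gain.

This is my test implementation. I'm aware that the example in the documentation has some issues, which are described in a linked answer, so I copied the fixed example code from there.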

from timeit import default_timer as timer
import numba
from numba import cuda, jit, int32, int64, float64, float32
import numpy as np
from numpy import *


@cuda.jit
def matmul(A, B, C):
    """Perform square matrix multiplication of C = A * B
    """
    i, j = cuda.grid(2)
    if i < C.shape[0] and j < C.shape[1]:
        tmp = 0.
        for k in range(A.shape[1]):
            tmp += A[i, k] * B[k, j]
        C[i, j] = tmp

# Controls threads per block and shared memory usage.
# The computation will be done on blocks of TPBxTPB elements.
TPB = 16

@cuda.jit
def fast_matmul(A, B, C):
    # Define an array in the shared memory
    # The size and type of the arrays must be known at compile time
    sA = cuda.shared.array(shape=(TPB, TPB), dtype=float32)
    sB = cuda.shared.array(shape=(TPB, TPB), dtype=float32)

    x, y = cuda.grid(2)

    tx = cuda.threadIdx.x
    ty = cuda.threadIdx.y
    bpg = cuda.gridDim.x    # blocks per grid

    # Each thread computes one element in the result matrix.
    # The dot product is chunked into dot products of TPB-long vectors.
    tmp = 0.
    for i in range(bpg):
        # Preload data into shared memory
        sA[ty, tx] = 0
        sB[ty, tx] = 0
        if y < A.shape[0] and (tx+i*TPB) < A.shape[1]:
            sA[ty, tx] = A[y, tx + i * TPB]
        if x < B.shape[1] and (ty+i*TPB) < B.shape[0]:
            sB[ty, tx] = B[ty + i * TPB, x]

        # Wait until all threads finish preloading
        cuda.syncthreads()

        # Computes partial product on the shared memory
        for j in range(TPB):
            tmp += sA[ty, j] * sB[j, tx]

        # Wait until all threads finish computing
        cuda.syncthreads()
    if y < C.shape[0] and x < C.shape[1]:
        C[y, x] = tmp

size = 1024*4
tpbx,tpby = 16, 16
tpb = (tpbx,tpby)
bpgx, bpgy = int(size/tpbx), int(size/tpby)
bpg = (bpgx, bpgy)
a_in = cuda.to_device(np.arange(size*size, dtype=np.float32).reshape((size, size)))
b_in = cuda.to_device(np.ones(size*size, dtype=np.float32).reshape((size, size)))
c_out1 = cuda.device_array_like(a_in)
c_out2 = cuda.device_array_like(a_in)

s = timer()
cuda.synchronize()
matmul[bpg,tpb](a_in, b_in, c_out1);
cuda.synchronize()
gpu_time = timer() - s

print(gpu_time)
c_host1 = c_out1.copy_to_host()
print(c_host1)

s = timer()
cuda.synchronize()
fast_matmul[bpg,tpb](a_in, b_in, c_out2);
cuda.synchronize()
gpu_time = timer() - s

print(gpu_time)
c_host2 = c_out2.copy_to_host()
print(c_host2)

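The execution times of the two kernels above are essentially the same; in fact, matmul becomes faster for some larger input matrices. I would like to know what I'm missing in order to see the gain the documentation suggests.

Thanks, Bruno.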

[Question comments]:

Tags: python cuda numba

[Solution 1]:

There was a performance bug in the code I posted in that other answer. I have now fixed it. In short, this line:

    tmp = 0.

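caused Numba to create a 64-bit floating-point variable tmp. That in turn promoted the other arithmetic in the kernel from 32-bit to 64-bit floating point. This is inconsistent with the rest of the arithmetic, and also with the intent of the demonstration in the other answer. The bug affects both kernels.

When I change it, in both kernels, to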

    tmp = float32(0.)

both kernels get noticeably faster, and on my GTX960 GPU your test case shows the shared code running about 2x faster than the non-shared code (but see below).
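The promotion itself can be reproduced on the CPU. The following is only a sketch: it uses NumPy's promotion rules as a stand-in and assumes that Numba types a Python float literal such as `0.` as float64 and promotes float32 combined with float64 to float64 in the same way. The names `a`, `b`, `acc64` and `acc32` are made up for the illustration.

```python
import numpy as np

# Assumption: NumPy scalars stand in for the values inside the kernel.
# No GPU is needed to run this.
a = np.float32(1.5)   # plays the role of an element of A
b = np.float32(2.0)   # plays the role of an element of B

acc64 = np.float64(0.0)   # what `tmp = 0.` amounts to (64-bit accumulator)
acc32 = np.float32(0.0)   # what `tmp = float32(0.)` amounts to

# The product a * b is float32 in both cases; the accumulator decides
# whether the addition is carried out in 32-bit or 64-bit arithmetic.
acc64 = acc64 + a * b
acc32 = acc32 + a * b

print(acc64.dtype, acc32.dtype)   # float64 float32
```

On a consumer GPU such as the GTX960, 64-bit arithmetic throughput is much lower than 32-bit throughput, which is consistent with both kernels speeding up after the change.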

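The non-shared kernel also has a performance problem related to its memory access pattern. Similar to the index swap in that other answer, and for this particular scenario only, we can fix it by reversing the order in which the indices are assigned: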

    j, i = cuda.grid(2)

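in the non-shared kernel. This allows that kernel to perform about as well as it can, and with that change the shared kernel runs about 2x faster than the non-shared kernel. Without that additional change to the non-shared kernel, the performance of the non-shared kernel is much worse.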
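To see why the index order matters, it may help to compute which memory locations neighbouring threads touch. The sketch below is plain Python and needs no GPU. It assumes row-major (C-ordered) arrays, as created in the question, and looks at the 16 threads of a block that share `threadIdx.y` and have consecutive `threadIdx.x` (half a warp with the 16x16 block used there). The helpers `flat_offset` and `neighbour_stride` are made up for the illustration.

```python
N = 4096     # row length of the square matrices (size in the question)
TPB_X = 16   # threads with consecutive threadIdx.x and the same threadIdx.y

def flat_offset(row, col, ncols=N):
    """Element offset of [row, col] in a row-major (C-ordered) array."""
    return row * ncols + col

def neighbour_stride(offsets):
    """Distance, in elements, between the offsets of adjacent threads."""
    steps = {b - a for a, b in zip(offsets, offsets[1:])}
    assert len(steps) == 1   # every pair of neighbours is equally far apart
    return steps.pop()

xs = range(TPB_X)   # first value returned by cuda.grid(2), one per thread
y = 0               # second value, identical for these threads
k = 0               # any single iteration of the inner loop

# Original order: i, j = cuda.grid(2), so i = x and j = y
orig = {
    "A[i, k]": neighbour_stride([flat_offset(x, k) for x in xs]),
    "B[k, j]": neighbour_stride([flat_offset(k, y) for x in xs]),
    "C[i, j]": neighbour_stride([flat_offset(x, y) for x in xs]),
}

# Swapped order: j, i = cuda.grid(2), so j = x and i = y
swapped = {
    "A[i, k]": neighbour_stride([flat_offset(y, k) for x in xs]),
    "B[k, j]": neighbour_stride([flat_offset(k, x) for x in xs]),
    "C[i, j]": neighbour_stride([flat_offset(y, x) for x in xs]),
}

print(orig)     # {'A[i, k]': 4096, 'B[k, j]': 0, 'C[i, j]': 4096}
print(swapped)  # {'A[i, k]': 0, 'B[k, j]': 1, 'C[i, j]': 1}
```

A stride of 1 means adjacent threads touch adjacent elements, which the hardware can combine into a few wide transactions (coalescing). A stride of 0 means all threads read the same element. A stride of 4096 means every thread lands in a different row, so the accesses cannot be combined. With the original order the loads from A and the stores to C are of that last kind; with the swapped order none are.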

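[Comments]:

OK, that's definitely something I would never have realized, thanks :)
Could you explain why the index order in the non-shared kernel can cause a performance problem? It's not very clear to me.
You may wish to avail yourself of resources like this for an orderly introduction to CUDA (or read the programming guide). The index order creates one situation in which coalescing of global loads takes place, and another in which it does not. This is covered extensively on the cuda tag. The uncoalesced case is worse for performance. The same observation is made here. I intended to help you there, but the code you posted was not runnable.
OK @Robert, thanks a lot for the explanation and the link to the training series, I will definitely take a look. I had just posted an attempt to run the code in multiple streams because I thought that would improve performance, but now I know it's better to launch just one kernel. I work with LBM and am trying to speed up some simulations. I have already gotten about a 45x speedup over the regular Python code, but I'm looking for other places where I can improve performance. Thanks for all the help.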