【问题标题】：How is performance affected by using numba.cuda.local.array() compared with numba.cuda.to_device()?与 numba.cuda.to_device() 相比，使用 numba.cuda.local.array() 对性能有何影响？
【发布时间】：2021-10-22 11:00:18
【问题描述】：

在Numba Docs: Memory Management 下，对于numba.cuda.local.array() 它声明：

本地内存

本地内存是每个线程私有的内存区域。使用本地当标量局部变量时，内存有助于分配一些暂存器区域还不够。内存在运行期间被分配一次内核，不同于传统的动态内存管理。

numba.cuda.local.array（形状，类型）

分配一个本地数组设备上给定的形状和类型。 shape 是一个整数或一个表示数组维度的整数元组，必须是简单的常量表达式。 type 是元素的 Numba 类型需要存储在数组中。该数组是当前私有的线。返回一个可以读写的类数组对象喜欢任何标准数组（例如通过索引）。

^{请注意，关于“分配”一词的使用可能需要更新 Numba 文档（请参阅Numba PR 和下面的评论）}

在to_device() 下，它声明：

numba.cuda.to_device(obj, stream=0, copy=True, to=None)

分配和将 numpy ndarray 或结构化标量传输到设备。

复制主机->设备一个numpy数组：

ary = np.arange(10)

d_ary = cuda.to_device(ary)

使用其中一种对性能（运行时和内存）有何影响？

【问题讨论】：

本地数组永远不会“分配”。 CUDA 本地内存（这是在 Numba 中映射的内容）始终由编译器静态定义。运行内核时没有分配开销，内存取自静态保留的本地 DRAM 池，当 GPU 函数加载到上下文中时，根据编译器在编译代码中保留的内容和最大并发数目标 GPU 的线程。它们在内核运行时是零分配成本
@talonmies 查看我的编辑。

标签： python arrays sorting cuda numba

【解决方案1】：

TL;DR

cuda.local.array() 的初始化和赋值比 to_device() 传输快约 3 倍；然而，在一个使用 QuickSort 的简单真实示例中，它的速度要慢约 2 倍。不知道为什么这两个基准测试案例之间存在约 6 倍的差异。

对于排序示例和参考，NumPy 与 cuda.local.array() 大致相同，CuPy 比 to_device() 快约 4 倍。我想 CuPy 的优势是因为我没有实现共享内存。

基准代码

第一个基准测试是一个虚拟示例，其中向量（通过cuda.local.array() 或to_device() 被复制到out。第二个基准测试是一个简单的实际示例，在 1,000,000 行和 100 列上使用逐行快速排序.

初始化和赋值

以下是对这两种情况的简单测试：cuda.local.array()，其中值在设备本地初始化为零，to_device()，其中值在主机代码中初始化为零。然后将这些值分配给out。

# -*- coding: utf-8 -*-
"""
Created on Sat Aug 21 11:39:58 2021

@author: sterg
"""

import os

# needs to appear before cuda import
os.environ["NUMBA_ENABLE_CUDASIM"] = "0"
# set to "1" for more debugging, but slower performance
os.environ["NUMBA_CUDA_DEBUGINFO"] = "0"

from numba import cuda
from numba.types import int32, float32

import numpy as np

from time import time

bits = 32
nb_float = float32
nb_int = int32
np_float = np.float32
np_int = np.int32

rows = 1000000
# local array shape needs to be defined globally due to lack of dynamic array allocation support. Don't wrap with np.int32, etc. see https://github.com/numba/numba/issues/7314
cols = 100

shape = (rows, cols)
test_vals = np.zeros(shape)
print(rows)


@cuda.jit("void(float{0}[:,:])".format(bits))
def local_array(out):
    # CUDA iterator
    idx = cuda.grid(1)
    nrows = out.shape[0]

    if idx < nrows:
        # allocate local arrays
        vec = cuda.local.array(cols, nb_float)

        # foo stuff for MWE
        for i in range(cols):
            vec[i] = 0.0
            out[idx, i] = vec[i]


@cuda.jit("void(float{0}[:,:], float{0}[:,:])".format(bits))
def device_array(mat, out):
    # CUDA iterator
    idx = cuda.grid(1)
    nrows = mat.shape[0]

    if idx < nrows:
        vec = mat[idx]

        # foo stuff for MWE
        for i in range(cols):
            out[idx, i] = vec[i]


def local_array_driver(nrows):
    # block, grid
    block_dim = 5
    grid_dim = int(nrows / block_dim + 1)

    # CUDA
    stream = cuda.stream()
    cuda_out = cuda.device_array((nrows, cols), np_float)
    local_array[grid_dim, block_dim](cuda_out)

    out = cuda_out.copy_to_host(stream=stream)

    return out


def device_array_driver(mat):
    # block, grid
    block_dim = 5
    nrows = mat.shape[0]
    grid_dim = int(nrows / block_dim + 1)

    # CUDA
    stream = cuda.stream()
    cuda_out = cuda.device_array((nrows, cols), np_float)
    cuda_mat = cuda.to_device(np.asarray(mat, dtype=np_float), stream=stream)
    device_array[grid_dim, block_dim](cuda_mat, cuda_out)

    out = cuda_out.copy_to_host(stream=stream)

    return out


# modified from https://numba.pydata.org/numba-doc/dev/user/5minguide.html
# compilation
local_array_driver(100)
device_array_driver(test_vals[0:100])

# NOW THE FUNCTION IS COMPILED, RE-TIME IT EXECUTING FROM CACHE
start = time()
out = local_array_driver(rows)
end = time()
print("Elapsed (local: after compilation) = %s" % (end - start))

# NOW THE FUNCTION IS COMPILED, RE-TIME IT EXECUTING FROM CACHE
start = time()
out = device_array_driver(test_vals)
end = time()
print("Elapsed (device: after compilation) = %s" % (end - start))

排序示例

以下基准测试以下两种情况并执行一个简单的例程（快速排序）：通过 cuda.local.array() 在本地分配 3 个数组或通过 cuda.to_device() 传递预分配的数组。

# -*- coding: utf-8 -*-
"""
Created on Fri Aug 20 16:19:34 2021

@author: sterg
"""

import os

# needs to appear before cuda import
os.environ["NUMBA_ENABLE_CUDASIM"] = "0"
# set to "1" for more debugging, but slower performance
os.environ["NUMBA_CUDA_DEBUGINFO"] = "0"

from numba import njit, cuda
from numba.types import int32, float32

import numpy as np

from time import time
from warnings import warn
from pdb import set_trace

bits = 32
nb_float = float32
nb_int = int32
np_float = np.float32
np_int = np.int32

rows = 2000000
cols = 100
test_vals = np.random.rand(rows, cols)
print(rows)
# local array shape needs to be defined globally due to lack of dynamic array allocation support. Don't wrap with np.int32, etc. see https://github.com/numba/numba/issues/7314

verbose = True


@cuda.jit(device=True)
def partition(arr, ids, l, h):
    """
    Partition using pivot.

    Function takes last element as pivot, places the pivot element at its correct
    position in sorted array, and places all smaller (smaller than pivot) to left of
    pivot and all greater elements to right of pivot

    Source: Modified from https://www.geeksforgeeks.org/iterative-quick-sort/
    which was contributed by Mohit Kumra.

    Parameters
    ----------
    arr : vector of floats
        The array to be sorted.
    ids : vector of ints
        The unsorted IDs corresponding to arr, in other words range(len(arr)).
    l : int
        Starting index for sorting.
    h : int
        Ending index for sorting.

    Returns
    -------
    int
        the new pivot?

    """
    # index of smaller element
    i = l - 1

    pivot = arr[h]

    for j in range(l, h):

        # If current element is smaller than or equal to pivot
        if arr[j] <= pivot:

            # increment index of smaller element
            i += 1
            arr[i], arr[j] = arr[j], arr[i]
            ids[i], ids[j] = ids[j], ids[i]

    arr[i + 1], arr[h] = arr[h], arr[i + 1]
    ids[i + 1], ids[h] = ids[h], ids[i + 1]

    return i + 1


@cuda.jit(device=True)
def quickSortIterative(arr, stack, ids):
    """
    Perform iterative quicksort on array and an unsorted ID list of the array.

    Source: Modified from https://www.geeksforgeeks.org/iterative-quick-sort/
    which was contributed by Mohit Kumra.

    Parameters
    ----------
    arr : vector of floats
        The array to be sorted.
    stack : vector of ints
        Array initialized with 0's
    ids : vector of ints
        The unsorted IDs corresponding to arr, in other words range(len(arr)).

    Returns
    -------
    None.

    """
    # low and high indices.
    l, h = (0, len(arr) - 1)
    # stack = [0] * size
    # ids = list(range(len(arr)))

    # initialize top of stack
    top = -1

    # fill ids with range(len(arr))
    for i in range(len(arr)):
        ids[i] = i
        stack[i] = 0

    # push initial values of l and h to stack
    top = top + 1
    stack[top] = l
    top = top + 1
    stack[top] = h

    # Keep popping from stack while is not empty
    while top >= 0:

        # Pop h and l
        h = stack[top]
        top = top - 1
        l = stack[top]
        top = top - 1

        # Set pivot element at its correct position in
        # sorted array
        p = partition(arr, ids, l, h)

        # If there are elements on left side of pivot,
        # then push left side to stack
        if p - 1 > l:
            top = top + 1
            stack[top] = l
            top = top + 1
            stack[top] = p - 1

        # If there are elements on right side of pivot,
        # then push right side to stack
        if p + 1 < h:
            top = top + 1
            stack[top] = p + 1
            top = top + 1
            stack[top] = h


@cuda.jit(device=True)
def quickSortIterativeDevice(arr, stack, ids):
    """
    Perform iterative quicksort on array and an unsorted ID list of the array.

    Source: Modified from https://www.geeksforgeeks.org/iterative-quick-sort/
    which was contributed by Mohit Kumra.

    Parameters
    ----------
    arr : vector of floats
        The array to be sorted.
    stack : vector of ints
        Array initialized with 0's
    ids : vector of ints
        The unsorted IDs corresponding to arr, in other words range(len(arr)).

    Returns
    -------
    None.

    """
    # low and high indices.
    l, h = (0, len(arr) - 1)
    # stack = [0] * size
    # ids = list(range(len(arr)))

    # initialize top of stack
    top = -1

    # push initial values of l and h to stack
    top = top + 1
    stack[top] = l
    top = top + 1
    stack[top] = h

    # Keep popping from stack while is not empty
    while top >= 0:

        # Pop h and l
        h = stack[top]
        top = top - 1
        l = stack[top]
        top = top - 1

        # Set pivot element at its correct position in
        # sorted array
        p = partition(arr, ids, l, h)

        # If there are elements on left side of pivot,
        # then push left side to stack
        if p - 1 > l:
            top = top + 1
            stack[top] = l
            top = top + 1
            stack[top] = p - 1

        # If there are elements on right side of pivot,
        # then push right side to stack
        if p + 1 < h:
            top = top + 1
            stack[top] = p + 1
            top = top + 1
            stack[top] = h


# Controls threads per block and shared memory usage.
# The computation will be done on blocks of TPBxTPB elements.
TPBx = 16
TPBy = cols


@cuda.jit("void(float{0}[:,:], float{0}[:,:])".format(bits))
def local_array_sort(mat, out):
    # CUDA iterator
    idx = cuda.grid(1)
    nrows = mat.shape[0]

    if idx < nrows:
        vec = mat[idx]
        vals = cuda.local.array(cols, nb_float)
        sorter = cuda.local.array(cols, nb_int)
        stack = cuda.local.array(cols, nb_int)

        # fill vals with data
        for i in range(cols):
            vals[i] = vec[i]

        # sorting
        quickSortIterative(vals, stack, sorter)

        # foo stuff using vals and sorter, in this case just outputting the values (MWE)
        for i in range(cols):
            out[idx, i] = vals[i]

        # # for debugging
        # x = cuda.threadIdx.x
        # bx = cuda.blockIdx.x
        # if x == 0 and bx == 0:
        #     set_trace()


@cuda.jit("void(float{0}[:,:], int{0}[:,:], int{0}[:,:], float{0}[:,:])".format(bits))
def device_array_sort(mat, stack, sorter, out):
    # CUDA iterator
    idx = cuda.grid(1)
    nrows = mat.shape[0]

    if idx < nrows:
        vec = mat[idx]
        substack = stack[idx]
        subsorter = sorter[idx]

        # sorting (substack is already initialized to 0's, subsorter to list(range(cols)))
        quickSortIterativeDevice(vec, substack, subsorter)

        # foo stuff using vals and sorter, in this case just outputting the values (MWE)
        for i in range(cols):
            out[idx, i] = vec[i]

        # # for debugging
        # x = cuda.threadIdx.x
        # bx = cuda.blockIdx.x
        # if x == 0 and bx == 0:
        #     set_trace()


def local_array_driver(mat):
    # block, grid
    block_dim = 5
    nrows = mat.shape[0]

    grid_dim = int(nrows / block_dim + 1)

    # CUDA
    stream = cuda.stream()
    cuda_mat = cuda.to_device(np.asarray(mat, dtype=np_float), stream=stream)
    cuda_out = cuda.device_array((nrows, cols), np_float)

    local_array_sort[grid_dim, block_dim](cuda_mat, cuda_out)

    out = cuda_out.copy_to_host(stream=stream)

    return out


def device_array_driver(mat):
    # block, grid
    block_dim = 5
    nrows = mat.shape[0]
    grid_dim = int(nrows / block_dim + 1)

    # CUDA
    stream = cuda.stream()

    cuda_mat = cuda.to_device(np.asarray(mat, dtype=np_float), stream=stream)
    cuda_out = cuda.device_array((nrows, cols), np_float)

    stack = np.zeros((nrows, cols))
    sorter = np.tile(list(range(cols)), (nrows, 1))
    cuda_stack = cuda.to_device(np.asarray(stack, dtype=np_int), stream=stream)
    cuda_sorter = cuda.to_device(np.asarray(sorter, dtype=np_int), stream=stream)

    device_array_sort[grid_dim, block_dim](cuda_mat, cuda_stack, cuda_sorter, cuda_out)

    out = cuda_out.copy_to_host(stream=stream)

    return out


# modified from https://numba.pydata.org/numba-doc/dev/user/5minguide.html
# compilation
local_array_driver(test_vals[0:100])
device_array_driver(test_vals[0:100])

# NOW THE FUNCTION IS COMPILED, RE-TIME IT EXECUTING FROM CACHE
start = time()
out = local_array_driver(test_vals)
end = time()
print("Elapsed (local: after compilation) = %s" % (end - start))

# Check if a numpy array is sorted: https://stackoverflow.com/questions/47004506
issorted = np.all(np.all(np.diff(out) >= 0, axis=1))
if not issorted:
    raise ValueError("out is not sorted")
elif verbose:
    print("out is sorted")

# NOW THE FUNCTION IS COMPILED, RE-TIME IT EXECUTING FROM CACHE
start = time()
out = device_array_driver(test_vals)
end = time()
print("Elapsed (device: after compilation) = %s" % (end - start))

# Check if a numpy array is sorted: https://stackoverflow.com/questions/47004506
issorted = np.all(np.all(np.diff(out) >= 0, axis=1))
if not issorted:
    warn("out is not sorted")
elif verbose:
    print("out is sorted")

start = time()
out = np.sort(test_vals)
end = time()
print("Elapsed (NumPy) = %s" % (end - start))


# # not fully implemented/tested
# @cuda.jit("void(float{0}[:,:], float{0}[:,:])".format(bits))
# def local_array_shared_sort(mat, out):

#     # Define an array in the shared memory
#     # The size and type of the arrays must be known at compile time
#     smat = cuda.shared.array(shape=(TPBx, TPBy), dtype=float32)

#     x, y = cuda.grid(2)

#     tx = cuda.threadIdx.x
#     ty = cuda.threadIdx.y
#     bpg = cuda.gridDim.x  # blocks per grid

#     nrows = mat.shape[0]
#     if x >= nrows and y >= cols:
#         # Quit if (x, y) is outside of valid C boundary
#         return

#     # allocate local arrays
#     vals = cuda.local.array(cols, nb_float)
#     sorter = cuda.local.array(cols, nb_int)
#     stack = cuda.local.array(cols, nb_int)

#     for i in range(bpg):
#         # Preload data into shared memory
#         smat[tx, ty] = mat[x, ty + i * TPBx]

#         # Wait until all threads finish preloading
#         cuda.syncthreads()

#         # fill vals with some data (just for MWE)
#         for i in range(cols):
#             vals[i] = smat[i]

#         # sorting
#         quickSortIterative(vals, stack, sorter)

#         # foo stuff using vals and sorter, in this case just outputting the values (MWE)
#         for i in range(cols):
#             out[x, i] = vals[i]

#         # Wait until all threads finish computing (is this one unnecessary?)
#         cuda.syncthreads()

基准测试结果

初始化和赋值

对于这个虚拟示例，使用本地数组要快约 3 倍。

1000000
Elapsed (local: after compilation) = 0.14660310745239258
Elapsed (device: after compilation) = 0.4358341693878174

排序

但是，使用 1,000,000 行和 100 列实现 QuickSort 会得到：

1000000
Elapsed (local: after compilation) = 2.4355175495147705
Elapsed (device: after compilation) = 1.1668529510498047
out is sorted
Elapsed (NumPy) = 2.304871082305908

并且有 2,000,000 行：

> 2000000
> Elapsed (local: after compilation) = 4.839058876037598
> Elapsed (device: after compilation) = 2.2948694229125977
> out is sorted
> Elapsed (NumPy) = 4.541851282119751

请注意，device_array_sort 的内存要求大约高出 2-3 倍。所以看起来我在对一两百万个 100 元素向量的排序中使用 3 个本地数组的速度降低了约 2 倍。换句话说，预先分配数组并将它们发送到设备更丑陋（更多参数）并且内存效率更低，但总体上更快。

CuPy 比较

import cupy as cp
from time import time
rows = 1000000
cols = 100
test_vals = cp.random.rand(rows, cols)
t = time()
cp.sort(test_vals)
print("Elapsed (CuPy) = %s" % (time() - t))

Elapsed (CuPy) = 0.5325748920440674

内存不足 2,000,000 个案例。

关于共享内存的注意事项

此示例中未使用共享内存，但将其合并应该会进一步提高速度（可能与 CuPy 实现相当）。有关实现共享内存的示例，请参阅Numba Docs: CUDA: Examples: Matrix Multiplication。截至 2021 年 8 月 21 日，这有一些错误已在 How to generalize fast matrix multiplication on GPU using numba 中描述和修复。

【讨论】：