TL;DR
cuda.local.array() 的初始化和赋值比 to_device() 传输快约 3 倍;然而,在一个使用 QuickSort 的简单真实示例中,它的速度要慢约 2 倍。不知道为什么这两个基准测试案例之间存在约 6 倍的差异。
对于排序示例和参考,NumPy 与 cuda.local.array() 大致相同,CuPy 比 to_device() 快约 4 倍。我想 CuPy 的优势是因为我没有实现共享内存。
基准代码
第一个基准测试是一个虚拟示例,其中向量(通过cuda.local.array() 或to_device() 被复制到out。第二个基准测试是一个简单的实际示例,在 1,000,000 行和 100 列上使用逐行快速排序.
初始化和赋值
以下是对这两种情况的简单测试:cuda.local.array(),其中值在设备本地初始化为零,to_device(),其中值在主机代码中初始化为零。然后将这些值分配给out。
# -*- coding: utf-8 -*-
"""
Created on Sat Aug 21 11:39:58 2021
@author: sterg
"""
import os
# needs to appear before cuda import
os.environ["NUMBA_ENABLE_CUDASIM"] = "0"
# set to "1" for more debugging, but slower performance
os.environ["NUMBA_CUDA_DEBUGINFO"] = "0"
from numba import cuda
from numba.types import int32, float32
import numpy as np
from time import time
bits = 32
nb_float = float32
nb_int = int32
np_float = np.float32
np_int = np.int32
rows = 1000000
# local array shape needs to be defined globally due to lack of dynamic array allocation support. Don't wrap with np.int32, etc. see https://github.com/numba/numba/issues/7314
cols = 100
shape = (rows, cols)
test_vals = np.zeros(shape)
print(rows)
@cuda.jit("void(float{0}[:,:])".format(bits))
def local_array(out):
# CUDA iterator
idx = cuda.grid(1)
nrows = out.shape[0]
if idx < nrows:
# allocate local arrays
vec = cuda.local.array(cols, nb_float)
# foo stuff for MWE
for i in range(cols):
vec[i] = 0.0
out[idx, i] = vec[i]
@cuda.jit("void(float{0}[:,:], float{0}[:,:])".format(bits))
def device_array(mat, out):
# CUDA iterator
idx = cuda.grid(1)
nrows = mat.shape[0]
if idx < nrows:
vec = mat[idx]
# foo stuff for MWE
for i in range(cols):
out[idx, i] = vec[i]
def local_array_driver(nrows):
# block, grid
block_dim = 5
grid_dim = int(nrows / block_dim + 1)
# CUDA
stream = cuda.stream()
cuda_out = cuda.device_array((nrows, cols), np_float)
local_array[grid_dim, block_dim](cuda_out)
out = cuda_out.copy_to_host(stream=stream)
return out
def device_array_driver(mat):
# block, grid
block_dim = 5
nrows = mat.shape[0]
grid_dim = int(nrows / block_dim + 1)
# CUDA
stream = cuda.stream()
cuda_out = cuda.device_array((nrows, cols), np_float)
cuda_mat = cuda.to_device(np.asarray(mat, dtype=np_float), stream=stream)
device_array[grid_dim, block_dim](cuda_mat, cuda_out)
out = cuda_out.copy_to_host(stream=stream)
return out
# modified from https://numba.pydata.org/numba-doc/dev/user/5minguide.html
# compilation
local_array_driver(100)
device_array_driver(test_vals[0:100])
# NOW THE FUNCTION IS COMPILED, RE-TIME IT EXECUTING FROM CACHE
start = time()
out = local_array_driver(rows)
end = time()
print("Elapsed (local: after compilation) = %s" % (end - start))
# NOW THE FUNCTION IS COMPILED, RE-TIME IT EXECUTING FROM CACHE
start = time()
out = device_array_driver(test_vals)
end = time()
print("Elapsed (device: after compilation) = %s" % (end - start))
排序示例
以下基准测试以下两种情况并执行一个简单的例程(快速排序):通过 cuda.local.array() 在本地分配 3 个数组或通过 cuda.to_device() 传递预分配的数组。
# -*- coding: utf-8 -*-
"""
Created on Fri Aug 20 16:19:34 2021
@author: sterg
"""
import os
# needs to appear before cuda import
os.environ["NUMBA_ENABLE_CUDASIM"] = "0"
# set to "1" for more debugging, but slower performance
os.environ["NUMBA_CUDA_DEBUGINFO"] = "0"
from numba import njit, cuda
from numba.types import int32, float32
import numpy as np
from time import time
from warnings import warn
from pdb import set_trace
bits = 32
nb_float = float32
nb_int = int32
np_float = np.float32
np_int = np.int32
rows = 2000000
cols = 100
test_vals = np.random.rand(rows, cols)
print(rows)
# local array shape needs to be defined globally due to lack of dynamic array allocation support. Don't wrap with np.int32, etc. see https://github.com/numba/numba/issues/7314
verbose = True
@cuda.jit(device=True)
def partition(arr, ids, l, h):
"""
Partition using pivot.
Function takes last element as pivot, places the pivot element at its correct
position in sorted array, and places all smaller (smaller than pivot) to left of
pivot and all greater elements to right of pivot
Source: Modified from https://www.geeksforgeeks.org/iterative-quick-sort/
which was contributed by Mohit Kumra.
Parameters
----------
arr : vector of floats
The array to be sorted.
ids : vector of ints
The unsorted IDs corresponding to arr, in other words range(len(arr)).
l : int
Starting index for sorting.
h : int
Ending index for sorting.
Returns
-------
int
the new pivot?
"""
# index of smaller element
i = l - 1
pivot = arr[h]
for j in range(l, h):
# If current element is smaller than or equal to pivot
if arr[j] <= pivot:
# increment index of smaller element
i += 1
arr[i], arr[j] = arr[j], arr[i]
ids[i], ids[j] = ids[j], ids[i]
arr[i + 1], arr[h] = arr[h], arr[i + 1]
ids[i + 1], ids[h] = ids[h], ids[i + 1]
return i + 1
@cuda.jit(device=True)
def quickSortIterative(arr, stack, ids):
"""
Perform iterative quicksort on array and an unsorted ID list of the array.
Source: Modified from https://www.geeksforgeeks.org/iterative-quick-sort/
which was contributed by Mohit Kumra.
Parameters
----------
arr : vector of floats
The array to be sorted.
stack : vector of ints
Array initialized with 0's
ids : vector of ints
The unsorted IDs corresponding to arr, in other words range(len(arr)).
Returns
-------
None.
"""
# low and high indices.
l, h = (0, len(arr) - 1)
# stack = [0] * size
# ids = list(range(len(arr)))
# initialize top of stack
top = -1
# fill ids with range(len(arr))
for i in range(len(arr)):
ids[i] = i
stack[i] = 0
# push initial values of l and h to stack
top = top + 1
stack[top] = l
top = top + 1
stack[top] = h
# Keep popping from stack while is not empty
while top >= 0:
# Pop h and l
h = stack[top]
top = top - 1
l = stack[top]
top = top - 1
# Set pivot element at its correct position in
# sorted array
p = partition(arr, ids, l, h)
# If there are elements on left side of pivot,
# then push left side to stack
if p - 1 > l:
top = top + 1
stack[top] = l
top = top + 1
stack[top] = p - 1
# If there are elements on right side of pivot,
# then push right side to stack
if p + 1 < h:
top = top + 1
stack[top] = p + 1
top = top + 1
stack[top] = h
@cuda.jit(device=True)
def quickSortIterativeDevice(arr, stack, ids):
"""
Perform iterative quicksort on array and an unsorted ID list of the array.
Source: Modified from https://www.geeksforgeeks.org/iterative-quick-sort/
which was contributed by Mohit Kumra.
Parameters
----------
arr : vector of floats
The array to be sorted.
stack : vector of ints
Array initialized with 0's
ids : vector of ints
The unsorted IDs corresponding to arr, in other words range(len(arr)).
Returns
-------
None.
"""
# low and high indices.
l, h = (0, len(arr) - 1)
# stack = [0] * size
# ids = list(range(len(arr)))
# initialize top of stack
top = -1
# push initial values of l and h to stack
top = top + 1
stack[top] = l
top = top + 1
stack[top] = h
# Keep popping from stack while is not empty
while top >= 0:
# Pop h and l
h = stack[top]
top = top - 1
l = stack[top]
top = top - 1
# Set pivot element at its correct position in
# sorted array
p = partition(arr, ids, l, h)
# If there are elements on left side of pivot,
# then push left side to stack
if p - 1 > l:
top = top + 1
stack[top] = l
top = top + 1
stack[top] = p - 1
# If there are elements on right side of pivot,
# then push right side to stack
if p + 1 < h:
top = top + 1
stack[top] = p + 1
top = top + 1
stack[top] = h
# Controls threads per block and shared memory usage.
# The computation will be done on blocks of TPBxTPB elements.
TPBx = 16
TPBy = cols
@cuda.jit("void(float{0}[:,:], float{0}[:,:])".format(bits))
def local_array_sort(mat, out):
# CUDA iterator
idx = cuda.grid(1)
nrows = mat.shape[0]
if idx < nrows:
vec = mat[idx]
vals = cuda.local.array(cols, nb_float)
sorter = cuda.local.array(cols, nb_int)
stack = cuda.local.array(cols, nb_int)
# fill vals with data
for i in range(cols):
vals[i] = vec[i]
# sorting
quickSortIterative(vals, stack, sorter)
# foo stuff using vals and sorter, in this case just outputting the values (MWE)
for i in range(cols):
out[idx, i] = vals[i]
# # for debugging
# x = cuda.threadIdx.x
# bx = cuda.blockIdx.x
# if x == 0 and bx == 0:
# set_trace()
@cuda.jit("void(float{0}[:,:], int{0}[:,:], int{0}[:,:], float{0}[:,:])".format(bits))
def device_array_sort(mat, stack, sorter, out):
# CUDA iterator
idx = cuda.grid(1)
nrows = mat.shape[0]
if idx < nrows:
vec = mat[idx]
substack = stack[idx]
subsorter = sorter[idx]
# sorting (substack is already initialized to 0's, subsorter to list(range(cols)))
quickSortIterativeDevice(vec, substack, subsorter)
# foo stuff using vals and sorter, in this case just outputting the values (MWE)
for i in range(cols):
out[idx, i] = vec[i]
# # for debugging
# x = cuda.threadIdx.x
# bx = cuda.blockIdx.x
# if x == 0 and bx == 0:
# set_trace()
def local_array_driver(mat):
# block, grid
block_dim = 5
nrows = mat.shape[0]
grid_dim = int(nrows / block_dim + 1)
# CUDA
stream = cuda.stream()
cuda_mat = cuda.to_device(np.asarray(mat, dtype=np_float), stream=stream)
cuda_out = cuda.device_array((nrows, cols), np_float)
local_array_sort[grid_dim, block_dim](cuda_mat, cuda_out)
out = cuda_out.copy_to_host(stream=stream)
return out
def device_array_driver(mat):
# block, grid
block_dim = 5
nrows = mat.shape[0]
grid_dim = int(nrows / block_dim + 1)
# CUDA
stream = cuda.stream()
cuda_mat = cuda.to_device(np.asarray(mat, dtype=np_float), stream=stream)
cuda_out = cuda.device_array((nrows, cols), np_float)
stack = np.zeros((nrows, cols))
sorter = np.tile(list(range(cols)), (nrows, 1))
cuda_stack = cuda.to_device(np.asarray(stack, dtype=np_int), stream=stream)
cuda_sorter = cuda.to_device(np.asarray(sorter, dtype=np_int), stream=stream)
device_array_sort[grid_dim, block_dim](cuda_mat, cuda_stack, cuda_sorter, cuda_out)
out = cuda_out.copy_to_host(stream=stream)
return out
# modified from https://numba.pydata.org/numba-doc/dev/user/5minguide.html
# compilation
local_array_driver(test_vals[0:100])
device_array_driver(test_vals[0:100])
# NOW THE FUNCTION IS COMPILED, RE-TIME IT EXECUTING FROM CACHE
start = time()
out = local_array_driver(test_vals)
end = time()
print("Elapsed (local: after compilation) = %s" % (end - start))
# Check if a numpy array is sorted: https://stackoverflow.com/questions/47004506
issorted = np.all(np.all(np.diff(out) >= 0, axis=1))
if not issorted:
raise ValueError("out is not sorted")
elif verbose:
print("out is sorted")
# NOW THE FUNCTION IS COMPILED, RE-TIME IT EXECUTING FROM CACHE
start = time()
out = device_array_driver(test_vals)
end = time()
print("Elapsed (device: after compilation) = %s" % (end - start))
# Check if a numpy array is sorted: https://stackoverflow.com/questions/47004506
issorted = np.all(np.all(np.diff(out) >= 0, axis=1))
if not issorted:
warn("out is not sorted")
elif verbose:
print("out is sorted")
start = time()
out = np.sort(test_vals)
end = time()
print("Elapsed (NumPy) = %s" % (end - start))
# # not fully implemented/tested
# @cuda.jit("void(float{0}[:,:], float{0}[:,:])".format(bits))
# def local_array_shared_sort(mat, out):
# # Define an array in the shared memory
# # The size and type of the arrays must be known at compile time
# smat = cuda.shared.array(shape=(TPBx, TPBy), dtype=float32)
# x, y = cuda.grid(2)
# tx = cuda.threadIdx.x
# ty = cuda.threadIdx.y
# bpg = cuda.gridDim.x # blocks per grid
# nrows = mat.shape[0]
# if x >= nrows and y >= cols:
# # Quit if (x, y) is outside of valid C boundary
# return
# # allocate local arrays
# vals = cuda.local.array(cols, nb_float)
# sorter = cuda.local.array(cols, nb_int)
# stack = cuda.local.array(cols, nb_int)
# for i in range(bpg):
# # Preload data into shared memory
# smat[tx, ty] = mat[x, ty + i * TPBx]
# # Wait until all threads finish preloading
# cuda.syncthreads()
# # fill vals with some data (just for MWE)
# for i in range(cols):
# vals[i] = smat[i]
# # sorting
# quickSortIterative(vals, stack, sorter)
# # foo stuff using vals and sorter, in this case just outputting the values (MWE)
# for i in range(cols):
# out[x, i] = vals[i]
# # Wait until all threads finish computing (is this one unnecessary?)
# cuda.syncthreads()
基准测试结果
初始化和赋值
对于这个虚拟示例,使用本地数组要快约 3 倍。
1000000
Elapsed (local: after compilation) = 0.14660310745239258
Elapsed (device: after compilation) = 0.4358341693878174
排序
但是,使用 1,000,000 行和 100 列实现 QuickSort 会得到:
1000000
Elapsed (local: after compilation) = 2.4355175495147705
Elapsed (device: after compilation) = 1.1668529510498047
out is sorted
Elapsed (NumPy) = 2.304871082305908
并且有 2,000,000 行:
> 2000000
> Elapsed (local: after compilation) = 4.839058876037598
> Elapsed (device: after compilation) = 2.2948694229125977
> out is sorted
> Elapsed (NumPy) = 4.541851282119751
请注意,device_array_sort 的内存要求大约高出 2-3 倍。所以看起来我在对一两百万个 100 元素向量的排序中使用 3 个本地数组的速度降低了约 2 倍。换句话说,预先分配数组并将它们发送到设备更丑陋(更多参数)并且内存效率更低,但总体上更快。
CuPy 比较
import cupy as cp
from time import time
rows = 1000000
cols = 100
test_vals = cp.random.rand(rows, cols)
t = time()
cp.sort(test_vals)
print("Elapsed (CuPy) = %s" % (time() - t))
Elapsed (CuPy) = 0.5325748920440674
内存不足 2,000,000 个案例。
关于共享内存的注意事项
此示例中未使用共享内存,但将其合并应该会进一步提高速度(可能与 CuPy 实现相当)。有关实现共享内存的示例,请参阅Numba Docs: CUDA: Examples: Matrix Multiplication。截至 2021 年 8 月 21 日,这有一些错误已在 How to generalize fast matrix multiplication on GPU using numba 中描述和修复。