PyCUDA在使用多个块处理矩阵运算时，为什么矩阵大小必须能被块大小整除？答案

【问题标题】：PyCUDA when using multiple blocks to deal with matrix operation, why does matrix size have to be divisible by the block size?PyCUDA在使用多个块处理矩阵运算时，为什么矩阵大小必须能被块大小整除？
【发布时间】：2019-06-02 03:08:24
【问题描述】：

我正在 PyCUDA 上学习 GPU 编程。我对块上的矩阵运算的计算有点困惑。像下面的例子，我想重做计算

a = np.array([1,2,3,4,5,6])
c = a[:,np.newaxis] - a

应该是

c = [[0,-1,-2,-3,-4,-5],
     [1,0,-1,-2,-3,-4],
     [2,1,0,-1,-2,-3],
     [3,2,1,0,-1,-2]]

在 GPU 上。

如果我为矩阵和块分配相同的大小，请按照下面的代码。一切正常。但是为了测试多个块中的计算，我为块大小分配了 4，结果出错了。我已经检查了输出 c 中每个条目的 blockDim。它显示一些条目的 blockDim 为 0，但它们应该都是 4。

array([[4., 4., 4., 4., 4., 4.],
   [0., 0., 4., 4., 4., 4.],
   [0., 0., 4., 4., 4., 4.],
   [0., 0., 4., 4., 4., 4.],
   [4., 4., 4., 4., 4., 4.],
   [4., 4., 4., 4., 4., 4.]], dtype=float32)

并且threadIdx.x在同一位置显示错误的数字。

array([[0., 1., 2., 3., 0., 1.],
   [0., 0., 2., 3., 0., 1.],
   [0., 0., 2., 3., 0., 1.],
   [0., 0., 2., 3., 0., 1.],
   [0., 1., 2., 3., 0., 1.],
   [0., 1., 2., 3., 0., 1.]], dtype=float32)

这很奇怪。

可重复的代码如下。

import numpy as np
from pycuda import compiler, gpuarray, tools
import pycuda.driver as drv

# -- initialize the device
import pycuda.autoinit

kernel_code_template = """
__global__ void com_t(float *a, float *c)
{

// 2D Thread ID 
int tx = blockDim.x*blockIdx.x + threadIdx.x; // Compute row index
int ty = blockDim.y*blockIdx.y + threadIdx.y; // Compute column index

// Pvalue is used to store the element of the matrix
// that is computed by the thread
float Pvalue = 0;

// Each thread loads one row of M and one column of N, 
//   to produce one element of P.
float Aelement = blockDim.x;
float Belement = 0;
Pvalue = Aelement - Belement;

// Write the matrix to device memory;
// each thread writes one element
c[ty * %(MATRIX_SIZE)s + tx] = Pvalue;
}
"""

MATRIX_SIZE = 6
BLOCK_SIZE = 6
start = drv.Event()
end = drv.Event()

# # create a random vector
a_cpu = np.array([i for i in range(MATRIX_SIZE)]).astype(np.float32)

# compute reference on the CPU to verify GPU computation
start.record() # start timing
start.synchronize()
c_cpu = a_cpu[:,np.newaxis] - a_cpu
end.record() # end timing
# calculate the run length
end.synchronize()
secs = start.time_till(end)*1e-3
print("CPU time:")
print("%fs" % (secs))

# transfer host (CPU) memory to device (GPU) memory
a_gpu = gpuarray.to_gpu(a_cpu)

# create empty gpu array for the result (C = A * B)
c_gpu = gpuarray.empty((MATRIX_SIZE, MATRIX_SIZE), np.float32)

# get the kernel code from the template
# by specifying the constant MATRIX_SIZE
kernel_code = kernel_code_template % {
    'MATRIX_SIZE': MATRIX_SIZE
    }

# compile the kernel code
mod = compiler.SourceModule(kernel_code)

# get the kernel function from the compiled module
matrixmul = mod.get_function("com_t")

start.record() # start timing

# set grid size
if MATRIX_SIZE%BLOCK_SIZE != 0:
    grid=(MATRIX_SIZE//BLOCK_SIZE+1,MATRIX_SIZE//BLOCK_SIZE+1,1)
else:
    grid=(MATRIX_SIZE//BLOCK_SIZE,MATRIX_SIZE//BLOCK_SIZE,1)

# call the kernel on the card
matrixmul(
    # inputs
    a_gpu,
    # output
    c_gpu,
    grid = grid,
    # (only one) block of MATRIX_SIZE x MATRIX_SIZE threads
    block = (BLOCK_SIZE, BLOCK_SIZE, 1),
    )
end.record() # end timing
end.synchronize()
secs = start.time_till(end)*1e-3
print("GPU time:")
print("%fs" % (secs))


# print the results
print("-" * 80)
print("Matrix A (GPU):")
print(a_gpu.get())

print("-" * 80)
print("Matrix C (GPU):")
print(c_gpu.get())

print("-" * 80)
print("CPU-GPU difference:")
print(c_cpu - c_gpu.get())

np.allclose(c_cpu, c_gpu.get())

【问题讨论】：

标签： python matrix gpu pycuda

【解决方案1】：

问题解决了。矩阵 c 的求值应该放在类似的约束中

if((ty <matrixsize) && (tx < matrixsize))

否则，将调用请求过多的线程并挤出c的正确条目。

【讨论】：