How to maximise the use of the GPU without having blocks waiting to be scheduled? Answers

【Question title】: How to maximise the use of the GPU without having blocks waiting to be scheduled?
【Posted】: 2018-01-08 01:49:51
【Question】:

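The device query on my Titan-XP shows that I have 30 multiprocessors, each with a maximum of 2048 threads. Is it correct to think that the maximum number of threads that can physically execute at the same time on the hardware is 30 * 2048? That is, would a kernel configuration like the following take advantage of this?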

kernel<<<60, 1024>>>(...);

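I really want to have the maximum number of blocks physically executing at the same time, while avoiding having blocks waiting to be scheduled. Here is the full output of the device query: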

CUDA Device Query (Runtime API) version (CUDART static linking)

Detected 1 CUDA Capable device(s)

Device 0: "TITAN Xp"
  CUDA Driver Version / Runtime Version          9.0 / 9.0
  CUDA Capability Major/Minor version number:    6.1
  Total amount of global memory:                 12190 MBytes (12781682688 bytes)
  (30) Multiprocessors, (128) CUDA Cores/MP:     3840 CUDA Cores
  GPU Max Clock rate:                            1582 MHz (1.58 GHz)
  Memory Clock rate:                             5705 Mhz
  Memory Bus Width:                              384-bit
  L2 Cache Size:                                 3145728 bytes
  Maximum Texture Dimension Size (x,y,z)         1D=(131072), 2D=(131072, 65536), 3D=(16384, 16384, 16384)
  Maximum Layered 1D Texture Size, (num) layers  1D=(32768), 2048 layers
  Maximum Layered 2D Texture Size, (num) layers  2D=(32768, 32768), 2048 layers
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       49152 bytes
  Total number of registers available per block: 65536
  Warp size:                                     32
  Maximum number of threads per multiprocessor:  2048
  Maximum number of threads per block:           1024
  Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
  Max dimension size of a grid size    (x,y,z): (2147483647, 65535, 65535)
  Maximum memory pitch:                          2147483647 bytes
  Texture alignment:                             512 bytes
  Concurrent copy and kernel execution:          Yes with 2 copy engine(s)
  Run time limit on kernels:                     No
  Integrated GPU sharing Host Memory:            No
  Support host page-locked memory mapping:       Yes
  Alignment requirement for Surfaces:            Yes
  Device has ECC support:                        Disabled
  Device supports Unified Addressing (UVA):      Yes
  Device PCI Domain ID / Bus ID / location ID:   0 / 4 / 0
  Compute Mode:
     < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >

deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 9.0, CUDA Runtime Version = 9.0, NumDevs = 1, Device0 = TITAN Xp
Result = PASS

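【Comments on the question】:

"I really want to have the maximum number of blocks physically executing at the same time, while avoiding having blocks waiting to be scheduled": probably not. That is, trying to achieve this is unlikely to be useful to you. @RobertCrovella's answer points in the right direction.

Tags: cuda gpu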

【Answer 1】:

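Yes, your conclusion is correct. For all GPUs supported by CUDA 9 or CUDA 9.1, the maximum number of threads that can be "in flight" is 2048 * (number of SMs). (Fermi GPUs, which are supported by CUDA 8, are slightly lower at 1536 * (number of SMs).)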
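For the TITAN Xp in the question, the bound works out to 30 * 2048 = 61440 threads, which is exactly what a `<<<60, 1024>>>` launch supplies (60 * 1024 = 61440). As a minimal sketch, the same bound can be computed at run time from `cudaGetDeviceProperties`. It assumes the GPU of interest is device 0.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    const int device = 0;  // assumption: the GPU of interest is device 0
    cudaDeviceProp prop;
    cudaError_t err = cudaGetDeviceProperties(&prop, device);
    if (err != cudaSuccess) {
        std::fprintf(stderr, "cudaGetDeviceProperties failed: %s\n",
                     cudaGetErrorString(err));
        return 1;
    }

    // Upper bound on the number of threads resident on the whole GPU at once.
    const int maxResident =
        prop.multiProcessorCount * prop.maxThreadsPerMultiProcessor;
    std::printf("%s: %d SMs x %d threads/SM = %d resident threads (upper bound)\n",
                prop.name, prop.multiProcessorCount,
                prop.maxThreadsPerMultiProcessor, maxResident);

    // Number of maximum-size blocks needed to reach that bound.
    std::printf("blocks of %d threads needed to reach the bound: %d\n",
                prop.maxThreadsPerBlock, maxResident / prop.maxThreadsPerBlock);
    return 0;
}
```

Compiling this with `nvcc` and running it on the target machine avoids hard-coding the 30 and 2048 figures, which differ between GPU models.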

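This is an upper bound. The specifics of your kernel (its resource usage) may mean that fewer threads than this can actually be "resident" or "in flight". This is the general topic of GPU occupancy. CUDA includes an occupancy calculator spreadsheet and a programmatic occupancy API to help determine this for your specific kernel.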
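As a rough illustration of the programmatic occupancy API, the sketch below asks the runtime how many blocks of 1024 threads of a given kernel can be resident on one SM. The kernel `scale` is a hypothetical placeholder, not the asker's kernel, and device 0 with no dynamic shared memory is assumed. The result depends on the registers and shared memory the real kernel uses.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel, present only so the occupancy API has something to inspect.
__global__ void scale(float *data, float factor, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}

int main() {
    const int blockSize = 1024;      // threads per block, as in the question
    const size_t dynamicSmem = 0;    // assumption: no dynamic shared memory
    int blocksPerSM = 0;

    // How many blocks of this kernel fit on one SM at this block size?
    cudaError_t err = cudaOccupancyMaxActiveBlocksPerMultiprocessor(
        &blocksPerSM, scale, blockSize, dynamicSmem);
    if (err != cudaSuccess) {
        std::fprintf(stderr, "occupancy query failed: %s\n", cudaGetErrorString(err));
        return 1;
    }

    cudaDeviceProp prop;
    err = cudaGetDeviceProperties(&prop, 0);  // assumption: device 0
    if (err != cudaSuccess) {
        std::fprintf(stderr, "cudaGetDeviceProperties failed: %s\n",
                     cudaGetErrorString(err));
        return 1;
    }

    // Blocks that can be resident on the whole GPU for this kernel.
    const int residentBlocks = blocksPerSM * prop.multiProcessorCount;
    std::printf("resident blocks per SM: %d, per GPU: %d (%d threads)\n",
                blocksPerSM, residentBlocks, residentBlocks * blockSize);
    return 0;
}
```

If the reported per-GPU block count is smaller than the grid size you launch, some blocks will wait to be scheduled. Launching exactly that many blocks is one way to size a grid so that every block can be resident.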

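The usual kernel strategy for having a limited number of threads (e.g. 60 * 1024 in your case) process a data set of arbitrary size is to use a construct known as a grid striding loop.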
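A minimal sketch of such a loop follows. The kernel name `saxpy`, the array size, and the use of managed memory are illustrative assumptions. Only the `<<<60, 1024>>>` launch configuration comes from the question.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Each thread starts at its global index and advances by the total number of
// threads in the grid, so a fixed-size grid covers any n.
__global__ void saxpy(int n, float a, const float *x, float *y) {
    int stride = blockDim.x * gridDim.x;
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += stride) {
        y[i] = a * x[i] + y[i];
    }
}

int main() {
    const int n = 1 << 24;  // illustrative size, much larger than 60 * 1024
    float *x = nullptr, *y = nullptr;
    if (cudaMallocManaged(&x, n * sizeof(float)) != cudaSuccess ||
        cudaMallocManaged(&y, n * sizeof(float)) != cudaSuccess) {
        std::fprintf(stderr, "allocation failed\n");
        return 1;
    }
    for (int i = 0; i < n; ++i) { x[i] = 1.0f; y[i] = 2.0f; }

    // Launch configuration from the question: 60 blocks of 1024 threads.
    saxpy<<<60, 1024>>>(n, 2.0f, x, y);
    cudaError_t err = cudaDeviceSynchronize();
    if (err != cudaSuccess) {
        std::fprintf(stderr, "kernel failed: %s\n", cudaGetErrorString(err));
        return 1;
    }

    // Every element should now hold 2.0f * 1.0f + 2.0f.
    int mismatches = 0;
    for (int i = 0; i < n; ++i) {
        if (y[i] != 4.0f) ++mismatches;
    }
    std::printf("mismatches: %d\n", mismatches);

    cudaFree(x);
    cudaFree(y);
    return mismatches == 0 ? 0 : 1;
}
```

Because the loop bound is `n` rather than the grid size, the same kernel stays correct if the grid is later resized, for example to the block count reported by the occupancy API.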

【Comments】: