Executing Thread Blocks
- Threads are assigned to Streaming Multiprocessors (SMs) at block granularity
- Up to 8 blocks per SM, as resources allow
- A Fermi SM can hold up to 1536 threads
- The SM maintains the thread and block indices (threadIdx, blockIdx)
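To see how the two limits above interact, here is a small host-side sketch. It only considers the 8-block and 1536-thread limits quoted in these notes. The helper name `residentBlocksPerSM` is hypothetical, and the 1024-threads-per-block cap is the Fermi limit, added here as an extra assumption. Real hardware also limits residency by register and shared memory usage, which this ignores.

```cuda
#include <algorithm>

// Limits quoted in the notes for a Fermi SM.
constexpr int kMaxBlocksPerSM  = 8;
constexpr int kMaxThreadsPerSM = 1536;
// Assumption not stated in the notes: Fermi allows at most 1024 threads per block.
constexpr int kMaxThreadsPerBlock = 1024;

// Hypothetical helper: how many blocks of the given size can be resident on
// one SM, looking only at the block-count and thread-count limits.
inline int residentBlocksPerSM(int threadsPerBlock) {
    if (threadsPerBlock <= 0 || threadsPerBlock > kMaxThreadsPerBlock) return 0;
    return std::min(kMaxBlocksPerSM, kMaxThreadsPerSM / threadsPerBlock);
}
```

With 256 threads per block the thread limit binds (6 blocks, 1536 threads resident). With 64 threads per block the block limit binds (8 blocks, only 512 threads resident), so small blocks can leave the SM underused.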
Warps as Scheduling Units
- Each block is executed as 32-thread warps (each warp runs as a SIMD unit)
- The SM implements zero-overhead warp scheduling
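The mapping from threads to warps can be made visible with a short sketch. It assumes a 1D block, where warps are formed from consecutive `threadIdx.x` values. The names `recordWarpAndLane`, `warpsPerBlock` and `warpIdsForOneBlock` are hypothetical, and error checking is omitted for brevity.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Each thread records which warp of its block it belongs to and its lane
// (position) within that warp. Assumes a 1D block.
__global__ void recordWarpAndLane(int *warpId, int *laneId, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        warpId[i] = threadIdx.x / warpSize;  // warp index within the block
        laneId[i] = threadIdx.x % warpSize;  // lane index within the warp
    }
}

// Number of 32-thread warps needed for one block; the last warp may be
// only partly filled when the block size is not a multiple of 32.
inline int warpsPerBlock(int threadsPerBlock) {
    return (threadsPerBlock + 31) / 32;
}

// Hypothetical host helper: launches a single block and returns the warp
// index recorded by each thread.
std::vector<int> warpIdsForOneBlock(int threadsPerBlock) {
    size_t bytes = threadsPerBlock * sizeof(int);
    int *dWarp = nullptr, *dLane = nullptr;
    cudaMalloc(&dWarp, bytes);
    cudaMalloc(&dLane, bytes);
    recordWarpAndLane<<<1, threadsPerBlock>>>(dWarp, dLane, threadsPerBlock);
    std::vector<int> warp(threadsPerBlock);
    cudaMemcpy(warp.data(), dWarp, bytes, cudaMemcpyDeviceToHost);
    cudaFree(dWarp);
    cudaFree(dLane);
    return warp;
}
```

A block of 100 threads occupies 4 warps, so 28 lanes of the last warp do no useful work. Block sizes that are multiples of 32 avoid this waste.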
Control Divergence
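- If threads in a block depend on each other's results, call __syncthreads() between the dependent steps to get correct results. Every thread in the block must reach the same __syncthreads() call, so do not place it on a divergent branch.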
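A minimal sketch of such a dependency: each thread writes one element into shared memory and then reads an element written by a different thread. The kernel name `reverseInBlock`, the wrapper `reverseOnDevice` and the block size of 64 are illustrative assumptions, and error checking is omitted.

```cuda
#include <cuda_runtime.h>
#include <vector>

constexpr int kBlockSize = 64;  // illustrative block size (assumption)

// Reverses a kBlockSize-element array in place using a single block.
// Thread t reads the element written by thread (kBlockSize - 1 - t), so a
// barrier is required between the write phase and the read phase.
__global__ void reverseInBlock(int *data) {
    __shared__ int tile[kBlockSize];
    int t = threadIdx.x;
    tile[t] = data[t];
    __syncthreads();  // placed outside any branch, so every thread reaches it
    data[t] = tile[kBlockSize - 1 - t];
}

// Hypothetical host wrapper; expects exactly kBlockSize elements.
std::vector<int> reverseOnDevice(std::vector<int> host) {
    size_t bytes = host.size() * sizeof(int);
    int *dev = nullptr;
    cudaMalloc(&dev, bytes);
    cudaMemcpy(dev, host.data(), bytes, cudaMemcpyHostToDevice);
    reverseInBlock<<<1, kBlockSize>>>(dev);
    cudaMemcpy(host.data(), dev, bytes, cudaMemcpyDeviceToHost);
    cudaFree(dev);
    return host;
}
```

Without the barrier, a thread could read `tile[kBlockSize - 1 - t]` before the thread in another warp has written it. Note that __syncthreads() only synchronizes the threads of one block, not the whole grid.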
Memory Model
- Global memory (dynamically allocated with cudaMalloc, or statically declared with __device__)
- Registers (automatic variables)
- Shared memory (__shared__)
- Constant memory (__constant__)

- __device__ is optional when used with __shared__ or __constant__
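The declarations for each memory space might look like the sketch below. All names (`kScale`, `gOffset`, `scaleAndShift`, `runScaleAndShift`) are hypothetical, the staging through shared memory is only there to show the declaration, and error checking is omitted. The wrapper assumes a non-empty input.

```cuda
#include <cuda_runtime.h>
#include <vector>

__constant__ float kScale;        // constant memory, written from the host
__device__ float gOffset = 1.0f;  // statically allocated global memory

constexpr int kThreads = 128;     // threads per block (assumption)

__global__ void scaleAndShift(const float *in, float *out, int n) {
    __shared__ float staged[kThreads];              // shared memory, one copy per block
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // automatic variable, held in a register
    if (i < n) {
        // Each thread only reads back its own slot, so no barrier is needed here.
        staged[threadIdx.x] = in[i];
        out[i] = staged[threadIdx.x] * kScale + gOffset;
    }
}

// Hypothetical host wrapper: computes host[i] * scale + 1 on the device.
std::vector<float> runScaleAndShift(const std::vector<float> &host, float scale) {
    int n = static_cast<int>(host.size());
    size_t bytes = n * sizeof(float);
    float *dIn = nullptr, *dOut = nullptr;
    cudaMalloc(&dIn, bytes);   // dynamically allocated global memory
    cudaMalloc(&dOut, bytes);
    cudaMemcpy(dIn, host.data(), bytes, cudaMemcpyHostToDevice);
    cudaMemcpyToSymbol(kScale, &scale, sizeof(float));  // fill constant memory
    int blocks = (n + kThreads - 1) / kThreads;
    scaleAndShift<<<blocks, kThreads>>>(dIn, dOut, n);
    std::vector<float> result(n);
    cudaMemcpy(result.data(), dOut, bytes, cudaMemcpyDeviceToHost);
    cudaFree(dIn);
    cudaFree(dOut);
    return result;
}
```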

- Partition the data into subsets (tiles) that fit into shared memory
- Use one thread block to handle each tile by:
  - Loading the tile from global memory into shared memory, using multiple threads
  - Performing the computation on the tile from shared memory, which reduces traffic to global memory
  - Upon completion, writing the results from shared memory to global memory
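Tiled matrix multiplication is a common instance of this strategy, sketched below for square row-major matrices. The tile width of 16 and the names `tiledMatMul` and `matMulOnDevice` are assumptions for illustration, and error checking is omitted. One deviation from the steps above: each thread accumulates its result in a register and writes it straight to global memory, rather than staging the output in shared memory.

```cuda
#include <cuda_runtime.h>
#include <vector>

constexpr int TILE = 16;  // tile width (assumption; two tiles fit easily in shared memory)

// C = A * B for square n x n row-major matrices.
__global__ void tiledMatMul(const float *A, const float *B, float *C, int n) {
    __shared__ float As[TILE][TILE];
    __shared__ float Bs[TILE][TILE];

    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float acc = 0.0f;

    for (int t = 0; t < (n + TILE - 1) / TILE; ++t) {
        int aCol = t * TILE + threadIdx.x;
        int bRow = t * TILE + threadIdx.y;
        // Step 1: all threads of the block load one tile of A and one of B;
        // entries outside the matrix are padded with zero.
        As[threadIdx.y][threadIdx.x] = (row < n && aCol < n) ? A[row * n + aCol] : 0.0f;
        Bs[threadIdx.y][threadIdx.x] = (bRow < n && col < n) ? B[bRow * n + col] : 0.0f;
        __syncthreads();  // tiles fully loaded before any thread reads them

        // Step 2: compute from shared memory.
        for (int k = 0; k < TILE; ++k)
            acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();  // all reads done before the tiles are overwritten
    }
    // Step 3: write the result to global memory.
    if (row < n && col < n) C[row * n + col] = acc;
}

// Hypothetical host wrapper; A and B hold n * n floats each, with n > 0.
std::vector<float> matMulOnDevice(const std::vector<float> &A,
                                  const std::vector<float> &B, int n) {
    size_t bytes = static_cast<size_t>(n) * n * sizeof(float);
    float *dA = nullptr, *dB = nullptr, *dC = nullptr;
    cudaMalloc(&dA, bytes);
    cudaMalloc(&dB, bytes);
    cudaMalloc(&dC, bytes);
    cudaMemcpy(dA, A.data(), bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(dB, B.data(), bytes, cudaMemcpyHostToDevice);
    dim3 block(TILE, TILE);
    dim3 grid((n + TILE - 1) / TILE, (n + TILE - 1) / TILE);
    tiledMatMul<<<grid, block>>>(dA, dB, dC, n);
    std::vector<float> C(static_cast<size_t>(n) * n);
    cudaMemcpy(C.data(), dC, bytes, cudaMemcpyDeviceToHost);
    cudaFree(dA);
    cudaFree(dB);
    cudaFree(dC);
    return C;
}
```

Each element of A and B is read from global memory once per tile rather than once per multiply, so global traffic drops by roughly a factor of TILE compared with a kernel that reads every operand directly from global memory.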