sjtujoe
  1. Executing Thread Blocks
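  • Threads are assigned to Streaming Multiprocessors (SMs) at block granularity
    • Up to 8 blocks per SM, as resources allow
    • A Fermi SM can hold up to 1536 threads, for example:
      • 6 blocks of 256 threads (256*6)
      • 3 blocks of 512 threads (512*3)
  • The SM maintains the thread and block indices (threadIdx, blockIdx)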
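The residency limits above can be checked with a little host-side arithmetic. This is a minimal sketch that assumes the thread limit (1536) and the block limit (8) are the only constraints; real hardware also limits registers and shared memory per SM. The helper name `residentBlocksPerSM` is hypothetical.

```cuda
// Fermi-era limits quoted in the notes (hard-coded here, not queried from the device).
constexpr int kMaxThreadsPerSM = 1536;
constexpr int kMaxBlocksPerSM  = 8;

// Hypothetical helper: how many blocks of a given size can be resident on one SM
// when only the thread limit and the block limit are considered.
inline int residentBlocksPerSM(int threadsPerBlock) {
    int byThreads = kMaxThreadsPerSM / threadsPerBlock;  // limit from the 1536-thread cap
    return byThreads < kMaxBlocksPerSM ? byThreads : kMaxBlocksPerSM;
}

// Threads actually resident on the SM for that block size.
inline int residentThreadsPerSM(int threadsPerBlock) {
    return residentBlocksPerSM(threadsPerBlock) * threadsPerBlock;
}
```

With 64-thread blocks the 8-block cap is reached first, so only 8 * 64 = 512 of the 1536 thread slots are used, which is why very small blocks tend to under-fill an SM.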
  • Warps as Scheduling Units
    • Each block is executed as 32-thread warps (each warp runs as a SIMD unit)
    • The SM implements zero-overhead warp scheduling
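To make the warp partitioning concrete, the sketch below maps a thread's index within a 1D block to its warp and its lane inside that warp. The helper names and the `recordWarpLane` kernel are illustrative assumptions, and a 1D block is assumed throughout.

```cuda
#include <cuda_runtime.h>

constexpr int kWarpSize = 32;  // warp width from the notes (device code can also read warpSize)

// Number of warps a block is split into; a partially filled last warp still counts as a warp.
__host__ __device__ inline int warpsPerBlock(int threadsPerBlock) {
    return (threadsPerBlock + kWarpSize - 1) / kWarpSize;
}

// Warp index and lane index of a thread, given its linear index within the block.
__host__ __device__ inline int warpIdOf(int linearThreadIdx) { return linearThreadIdx / kWarpSize; }
__host__ __device__ inline int laneIdOf(int linearThreadIdx) { return linearThreadIdx % kWarpSize; }

// Illustrative kernel: every thread records which warp and lane it belongs to.
// warpId and laneId must each hold gridDim.x * blockDim.x ints.
__global__ void recordWarpLane(int *warpId, int *laneId) {
    int t = threadIdx.x;                       // index within the block
    int g = blockIdx.x * blockDim.x + t;       // index within the grid
    warpId[g] = warpIdOf(t);
    laneId[g] = laneIdOf(t);
}
```

A block of 100 threads therefore occupies 4 warps, with the last warp only partly populated.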
  • Control Divergence
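    • If threads in a block depend on each other's results, call __syncthreads() between the dependent steps to get correct results. Every thread in the block must reach the same __syncthreads(), so do not place it inside a divergent branch.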
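A small sketch of such a dependency: each block reverses its own chunk of an array, so every thread reads a value that a different thread wrote into shared memory. The kernel name, the wrapper `reverseChunksOnGpu`, and the block size of 256 are assumptions for illustration; the wrapper assumes a non-empty input whose length is a multiple of the block size and omits error checking.

```cuda
#include <cuda_runtime.h>
#include <cstddef>
#include <vector>

constexpr int kBlockSize = 256;  // assumed block size for this sketch

// Each block reverses its own kBlockSize-element chunk in place.
__global__ void reverseChunk(int *data) {
    __shared__ int buf[kBlockSize];
    int t = threadIdx.x;
    int base = blockIdx.x * kBlockSize;
    buf[t] = data[base + t];
    // Barrier: every write to buf must be finished before any thread
    // reads the slot written by another thread.
    __syncthreads();
    data[base + t] = buf[kBlockSize - 1 - t];
}

// Hypothetical host wrapper (assumes v.size() is a non-zero multiple of kBlockSize).
std::vector<int> reverseChunksOnGpu(const std::vector<int> &v) {
    int *d = nullptr;
    std::size_t bytes = v.size() * sizeof(int);
    cudaMalloc(&d, bytes);
    cudaMemcpy(d, v.data(), bytes, cudaMemcpyHostToDevice);
    reverseChunk<<<static_cast<unsigned>(v.size() / kBlockSize), kBlockSize>>>(d);
    std::vector<int> out(v.size());
    cudaMemcpy(out.data(), d, bytes, cudaMemcpyDeviceToHost);  // also waits for the kernel
    cudaFree(d);
    return out;
}
```

The barrier sits outside any branch, so all 256 threads of the block reach it.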
  • Memory Model
    • Global memory (dynamically allocated with cudaMalloc, or statically declared with __device__)
    • Registers (automatic variables)
    • Shared Memory (__shared__)
    • Constant Memory (__constant__)
    • __device__ is optional when used with __shared__ or __constant__
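The memory spaces listed above might be declared as in the following sketch. All names (`kScale`, `gOffset`, `scaleAndShift`, `scaleAndShiftOnGpu`) are hypothetical, the block size of 128 is an assumption, and error checking is omitted. Staging through shared memory is not needed for this computation; it is there only to show the declaration.

```cuda
#include <cuda_runtime.h>
#include <cstddef>
#include <vector>

__constant__ float kScale;          // constant memory: read-only in kernels, written from the host
__device__   float gOffset = 1.0f;  // statically allocated global memory

__global__ void scaleAndShift(const float *in, float *out, int n) {
    __shared__ float tile[128];     // shared memory: one copy per block (blockDim.x <= 128 assumed)
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // automatic variable, held in a register
    if (i < n) tile[threadIdx.x] = in[i];
    __syncthreads();                // reached by every thread, since it is outside the branch
    if (i < n) out[i] = tile[threadIdx.x] * kScale + gOffset;
}

// Hypothetical host wrapper; in and out live in dynamically allocated global memory.
std::vector<float> scaleAndShiftOnGpu(const std::vector<float> &h, float scale) {
    int n = static_cast<int>(h.size());
    std::size_t bytes = n * sizeof(float);
    float *dIn = nullptr, *dOut = nullptr;
    cudaMalloc(&dIn, bytes);
    cudaMalloc(&dOut, bytes);
    cudaMemcpy(dIn, h.data(), bytes, cudaMemcpyHostToDevice);
    cudaMemcpyToSymbol(kScale, &scale, sizeof(float));  // fill constant memory
    scaleAndShift<<<(n + 127) / 128, 128>>>(dIn, dOut, n);
    std::vector<float> out(n);
    cudaMemcpy(out.data(), dOut, bytes, cudaMemcpyDeviceToHost);
    cudaFree(dIn);
    cudaFree(dOut);
    return out;
}
```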
    • Tiling: partition the data into subsets (tiles) that fit into shared memory
    • Use one thread block to handle each tile by:
      • Loading the tile from global memory into shared memory, using multiple threads
      • Performing the computation on the tile in shared memory, which reduces traffic to global memory
      • Upon completion, writing the results from shared memory back to global memory
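The three tiling steps above can be sketched with a tiled matrix multiplication. Matrix multiplication is chosen here only as a familiar illustration; the notes do not name a specific workload. Assumptions: square n x n row-major matrices, a tile width of 16, and hypothetical names `tiledMatMul` and `tiledMatMulOnGpu`. In this sketch each thread keeps its partial result in a register and writes it straight to global memory, rather than staging the output in shared memory. Error checking is omitted.

```cuda
#include <cuda_runtime.h>
#include <cstddef>
#include <vector>

constexpr int TILE = 16;  // assumed tile width; two 16x16 float tiles use 2 KB of shared memory

// C = A * B, with one TILE x TILE output tile per thread block.
__global__ void tiledMatMul(const float *A, const float *B, float *C, int n) {
    __shared__ float As[TILE][TILE];
    __shared__ float Bs[TILE][TILE];
    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float acc = 0.0f;
    for (int t = 0; t < (n + TILE - 1) / TILE; ++t) {
        int aCol = t * TILE + threadIdx.x;
        int bRow = t * TILE + threadIdx.y;
        // Step 1: all threads cooperate to load one tile of A and one of B (zero-padded at edges).
        As[threadIdx.y][threadIdx.x] = (row < n && aCol < n) ? A[row * n + aCol] : 0.0f;
        Bs[threadIdx.y][threadIdx.x] = (bRow < n && col < n) ? B[bRow * n + col] : 0.0f;
        __syncthreads();
        // Step 2: compute from shared memory; each loaded element is reused TILE times.
        for (int k = 0; k < TILE; ++k) acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();  // do not overwrite the tiles while other threads still read them
    }
    // Step 3: write the result back to global memory.
    if (row < n && col < n) C[row * n + col] = acc;
}

// Hypothetical host wrapper (assumes n > 0 and that A and B each hold n * n floats).
std::vector<float> tiledMatMulOnGpu(const std::vector<float> &A, const std::vector<float> &B, int n) {
    std::size_t bytes = static_cast<std::size_t>(n) * n * sizeof(float);
    float *dA = nullptr, *dB = nullptr, *dC = nullptr;
    cudaMalloc(&dA, bytes);
    cudaMalloc(&dB, bytes);
    cudaMalloc(&dC, bytes);
    cudaMemcpy(dA, A.data(), bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(dB, B.data(), bytes, cudaMemcpyHostToDevice);
    dim3 block(TILE, TILE);
    dim3 grid((n + TILE - 1) / TILE, (n + TILE - 1) / TILE);
    tiledMatMul<<<grid, block>>>(dA, dB, dC, n);
    std::vector<float> C(static_cast<std::size_t>(n) * n);
    cudaMemcpy(C.data(), dC, bytes, cudaMemcpyDeviceToHost);
    cudaFree(dA);
    cudaFree(dB);
    cudaFree(dC);
    return C;
}
```

Without tiling, each thread would read 2n values from global memory; with tiling, each value loaded into shared memory is reused by TILE threads, so global traffic drops by roughly a factor of TILE.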