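Avoid atomic operations whenever you possibly can, because their impact on performance is dramatic. In one case, for example, throughput plunged from 800 MBps to 110 MBps.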

While browsing a forum I came across a passage that someone had relayed, and I am recording it here:

honestly, if you're trying to do this you're probably going down the wrong path, but general rules of thumb are

- don't have multiple threads within a warp contending for a lock, that leads to all sorts of confusing issues for most people because inter-warp branches are not the same as intra-warp branches
- avoid global memory contention as much as possible (e.g., if you need to have a critical section among all warps in all CTAs, do per-CTA shared memory locks then a global lock)
- traditional threading primitives implemented with atomics are a pretty terrible idea, if you can avoid atomics as much as possible (or entirely) you can get a big perf win (and there are very interesting ways you can do this, and when I say big perf win, I mean on the order of 5-10x)

("well," you think, "it sounds like tim is speaking from experience!" oh yes, I am)
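
To make the second and third rules of thumb concrete, here is a minimal CUDA sketch of one way to cut down contention on a single global counter. It is not the technique the quote hints at (that is never spelled out), just a common pattern: reduce inside each warp with register shuffles, let one lane per warp update a per-CTA shared-memory value, then let one thread per CTA touch global memory. The kernel names `sum_naive_atomic` and `sum_hierarchical`, the helper `run_sum`, and the block size of 256 are all hypothetical choices made for this illustration.

```cuda
#include <vector>
#include <cuda_runtime.h>

// Baseline: every thread contends on the same global address.
__global__ void sum_naive_atomic(const int* in, int n, int* out) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) atomicAdd(out, in[i]);
}

// Hierarchical: warp shuffle, then one shared-memory atomic per warp,
// then one global atomic per CTA.
// Assumes blockDim.x is a multiple of warpSize (32), so the full mask is valid.
__global__ void sum_hierarchical(const int* in, int n, int* out) {
    __shared__ int block_sum;
    if (threadIdx.x == 0) block_sum = 0;
    __syncthreads();

    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int v = (i < n) ? in[i] : 0;  // out-of-range threads contribute 0 but stay active

    // Reduce within the warp in registers; no atomics and no locks here.
    for (int offset = warpSize / 2; offset > 0; offset /= 2)
        v += __shfl_down_sync(0xffffffffu, v, offset);

    // Only lane 0 of each warp touches shared memory.
    if (threadIdx.x % warpSize == 0) atomicAdd(&block_sum, v);
    __syncthreads();

    // Only one thread per CTA touches global memory.
    if (threadIdx.x == 0) atomicAdd(out, block_sum);
}

typedef void (*SumKernel)(const int*, int, int*);

// Hypothetical host helper: copies data to the device, runs one kernel,
// and returns the sum. Expects a non-empty input.
// Error checking is omitted to keep the sketch short.
int run_sum(SumKernel kernel, const std::vector<int>& host) {
    int n = static_cast<int>(host.size());
    int* d_in = nullptr;
    int* d_out = nullptr;
    cudaMalloc(&d_in, n * sizeof(int));
    cudaMalloc(&d_out, sizeof(int));
    cudaMemcpy(d_in, host.data(), n * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemset(d_out, 0, sizeof(int));

    const int threads = 256;  // multiple of 32, as sum_hierarchical assumes
    const int blocks = (n + threads - 1) / threads;
    kernel<<<blocks, threads>>>(d_in, n, d_out);

    int result = 0;
    cudaMemcpy(&result, d_out, sizeof(int), cudaMemcpyDeviceToHost);  // waits for the kernel
    cudaFree(d_in);
    cudaFree(d_out);
    return result;
}
```

Calling `run_sum(sum_naive_atomic, data)` and `run_sum(sum_hierarchical, data)` on the same vector should return the same sum. With 256 threads per CTA, the hierarchical version issues one global atomic per CTA instead of up to 256, plus 8 shared-memory atomics (one per warp). How much that helps depends on the GPU and the access pattern, so measure it rather than assuming the 5-10x from the quote.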
