On the recent CUDA atomic operations problem

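Atomic operations really, really have to be avoided, because their impact on performance is just too obvious. For example, throughput plunged from 800 MBps to 110 MBps.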

While reading a forum I came across a string of remarks that someone had relayed. Recording them below:

honestly, if you're trying to do this you're probably going down the wrong path, but general rules of thumb are

- don't have multiple threads within a warp contending for a lock, that leads to all sorts of confusing issues for most people because inter-warp branches are not the same as intra-warp branches
- avoid global memory contention as much as possible (e.g., if you need to have a critical section among all warps in all CTAs, do per-CTA shared memory locks then a global lock)
- traditional threading primitives implemented with atomics are a pretty terrible idea, if you can avoid atomics as much as possible (or entirely) you can get a big perf win (and there are very interesting ways you can do this, and when I say big perf win, I mean on the order of 5-10x)
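To make the second rule of thumb concrete, here is a minimal sketch of the "per-CTA first, then global" idea. It applies the hierarchy to a simple counter rather than to a lock, which is my own simplification and not something the quote spells out. The kernel names `countNaive` and `countPerBlock`, the host helper `countAbove`, and the block size of 256 are all hypothetical choices made for illustration, and error checking is left out for brevity.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Baseline: every matching thread contends for the same global counter.
__global__ void countNaive(const int* data, int n, int threshold,
                           unsigned int* total) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n && data[i] > threshold) {
        atomicAdd(total, 1u);
    }
}

// Hierarchical: threads contend in shared memory inside their CTA,
// then a single thread per CTA issues one global atomic.
__global__ void countPerBlock(const int* data, int n, int threshold,
                              unsigned int* total) {
    __shared__ unsigned int blockCount;
    if (threadIdx.x == 0) {
        blockCount = 0;
    }
    __syncthreads();

    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n && data[i] > threshold) {
        atomicAdd(&blockCount, 1u);  // shared-memory atomic, CTA-local
    }
    __syncthreads();

    if (threadIdx.x == 0 && blockCount > 0) {
        atomicAdd(total, blockCount);  // one global atomic per CTA
    }
}

// Hypothetical host helper (illustration only, no error checking):
// counts the elements strictly greater than `threshold`.
unsigned int countAbove(const std::vector<int>& host, int threshold,
                        bool hierarchical) {
    const int n = static_cast<int>(host.size());
    if (n == 0) {
        return 0;
    }
    int* dData = nullptr;
    unsigned int* dTotal = nullptr;
    cudaMalloc(&dData, n * sizeof(int));
    cudaMalloc(&dTotal, sizeof(unsigned int));
    cudaMemcpy(dData, host.data(), n * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemset(dTotal, 0, sizeof(unsigned int));

    const int threads = 256;  // assumed block size
    const int blocks = (n + threads - 1) / threads;
    if (hierarchical) {
        countPerBlock<<<blocks, threads>>>(dData, n, threshold, dTotal);
    } else {
        countNaive<<<blocks, threads>>>(dData, n, threshold, dTotal);
    }

    unsigned int result = 0;
    cudaMemcpy(&result, dTotal, sizeof(unsigned int), cudaMemcpyDeviceToHost);
    cudaFree(dData);
    cudaFree(dTotal);
    return result;
}
```

Both kernels should return the same count. The difference is that the second issues at most one global atomic per CTA instead of one per matching thread. How much that helps depends on the GPU and the data, so it is worth timing both versions before drawing conclusions.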

("well," you think, "it sounds like tim is speaking from experience!" oh yes, I am)
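The quote does not say which atomic-free techniques the author had in mind. One common way to drop atomics entirely, at least for a reduction, is to give every CTA its own output slot and combine the slots in a second pass. The sketch below assumes that setup. The names `partialSums`, `sumNoAtomics`, `kThreads`, and the cap of 64 blocks are hypothetical choices for illustration, and error checking is omitted.

```cuda
#include <cuda_runtime.h>
#include <algorithm>
#include <vector>

constexpr int kThreads = 256;  // assumed block size; must be a power of two

// Pass 1: each CTA reduces its share of the input in shared memory and
// writes the result to its own slot, so no two CTAs touch the same address.
__global__ void partialSums(const int* data, int n, long long* partial) {
    __shared__ long long scratch[kThreads];

    // Grid-stride loop: each thread accumulates a private running sum.
    long long local = 0;
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
         i += gridDim.x * blockDim.x) {
        local += data[i];
    }
    scratch[threadIdx.x] = local;
    __syncthreads();

    // Tree reduction within the CTA, synchronized with barriers only.
    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (threadIdx.x < stride) {
            scratch[threadIdx.x] += scratch[threadIdx.x + stride];
        }
        __syncthreads();
    }
    if (threadIdx.x == 0) {
        partial[blockIdx.x] = scratch[0];
    }
}

// Hypothetical host helper (illustration only, no error checking).
// Pass 2 runs on the host because the number of partial sums is small.
long long sumNoAtomics(const std::vector<int>& host) {
    const int n = static_cast<int>(host.size());
    if (n == 0) {
        return 0;
    }
    const int blocks = std::min((n + kThreads - 1) / kThreads, 64);

    int* dData = nullptr;
    long long* dPartial = nullptr;
    cudaMalloc(&dData, n * sizeof(int));
    cudaMalloc(&dPartial, blocks * sizeof(long long));
    cudaMemcpy(dData, host.data(), n * sizeof(int), cudaMemcpyHostToDevice);

    partialSums<<<blocks, kThreads>>>(dData, n, dPartial);

    std::vector<long long> partial(blocks);
    cudaMemcpy(partial.data(), dPartial, blocks * sizeof(long long),
               cudaMemcpyDeviceToHost);
    cudaFree(dData);
    cudaFree(dPartial);

    long long total = 0;
    for (long long p : partial) {
        total += p;
    }
    return total;
}
```

The design choice here is to trade contention for a little extra memory and a second pass: every write goes to a distinct address, so nothing needs to be atomic. Whether this gets anywhere near the 5-10x mentioned in the quote is something I have not measured.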