GTS 250 和 Fermi 设备之间的 CUDA 块同步差异答案

【问题标题】：CUDA block synchronization differences between GTS 250 and Fermi devicesGTS 250 和 Fermi 设备之间的 CUDA 块同步差异
【发布时间】：2011-04-04 14:48:57
【问题描述】：

所以我一直在研究在全局内存中创建哈希表的程序。该代码在作为 Compute 1.1 设备的 GTS250 上完全正常运行（尽管速度较慢）。但是，在 Compute 2.0 设备（C2050 或 C2070）上，哈希表已损坏（数据不正确，指针有时错误）。

基本上，当只使用一个块（两个设备）时，代码可以正常工作。但是，当使用 2 个或更多块时，它仅适用于 GTS250，不适用于任何 Fermi 设备。

我知道两个平台之间的 warp 调度和内存架构是不同的，我在开发代码时会考虑到这一点。据我了解，使用__theadfence() 应该确保任何全局写入都已提交并且对其他块可见，但是，从损坏的哈希表来看，它们似乎不是。

我还在 NVIDIA CUDA 开发者论坛上发布了这个问题，可以在 here 找到。

相关代码如下：

__device__ void lock(int *mutex) {
    while(atomicCAS(mutex, 0, 1) != 0);
}

__device__ void unlock(int *mutex) {
    atomicExch(mutex, 0);
}

__device__ void add_to_global_hash_table(unsigned int key, unsigned int count, unsigned int sum, unsigned int sumSquared, Table table, int *globalHashLocks, int *globalFreeLock, int *globalFirstFree)
{
    // Find entry if it exists
    unsigned int hashValue = hash(key, table.count);

    lock(&globalHashLocks[hashValue]);

    int bucketHead = table.entries[hashValue];
    int currentLocation = bucketHead;

    bool found = false;
    Entry currentEntry;

    while (currentLocation != -1 && !found) {
        currentEntry = table.pool[currentLocation];
        if (currentEntry.data.x == key) {
            found = true;
        } else {
            currentLocation = currentEntry.next;
        }
    }

    if (currentLocation == -1) {
        // If entry does not exist, create entry
        lock(globalFreeLock);
        int newLocation = (*globalFirstFree)++;
        __threadfence();
        unlock(globalFreeLock);

        Entry newEntry;
        newEntry.data.x = key;
        newEntry.data.y = count;
        newEntry.data.z = sum;
        newEntry.data.w = sumSquared;
        newEntry.next = bucketHead;

        // Add entry to table
        table.pool[newLocation] = newEntry;
        table.entries[hashValue] = newLocation;
    } else {
        currentEntry.data.y += count;
        currentEntry.data.z += sum;
        currentEntry.data.w += sumSquared;
        table.pool[currentLocation] = currentEntry;
    }

    __threadfence();
    unlock(&globalHashLocks[hashValue]);
}

【问题讨论】：

该设备功能不足以说明问题所在。设备功能的正确运行取决于锁和解锁的完整性，正确保护临界区，但它们是如何工作的呢？
刚刚添加了锁定/解锁代码。

标签： cuda synchronization gpgpu nvidia

【解决方案1】：

正如LSChien 在此post 中所指出的，问题在于L1 缓存一致性。虽然使用__threadfence() 将保证共享和全局内存写入对其他线程可见，因为它不是原子的，所以block 1 中的thread x 可能会达到缓存的内存值，直到block 0 中的thread y 执行到threadfence操作说明。相反，LSChien 在他的帖子中建议使用atomicCAS() 来强制线程从全局内存而不是缓存值中读取。正确的做法是将内存声明为volatile，要求对该内存的每次写入都立即对网格中的所有其他线程可见。

【讨论】：

在 CUDA 中 volatile 实际上意味着一个值必须立即从寄存器写回内存。它不保证写入何时对另一个线程可见的及时性。其效果是防止编译器应用任何会导致内存存储指令从代码中删除以及在寄存器中从操作到操作的中间结果的优化。

【解决方案2】：

__threadfence 保证在返回之前对全局内存的写入对当前块中的其他线程可见。那和“对全局内存的写操作完成”不一样！考虑在每个多核上进行缓存。

【讨论】：

不，这就是 __threadfence_block 所做的。 __threadfence 确保全局内存事务对设备级别的所有线程都是可见的。