Race condition when dereferencing pointers: answers

【Question title】: Race condition when dereferencing pointers
【Posted】: 2017-10-31 14:58:25
【Question】:

We are working on a very simple memory pool and have found a very interesting bug that we have not been able to solve.

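The idea of the algorithm is as follows: there is a pile of "available" memory chunks, and each chunk holds a pointer to the next available chunk. To avoid a secondary data structure, we decided to store that pointer in the chunk's own memory. The next available chunk is therefore obtained by dereferencing the chunk itself: void *nextChunk = *((void **)chunk)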
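The layout described above is an intrusive free list. The following is a minimal single-threaded sketch of it, not the poster's actual pool: the names pool_init, pool_get and pool_return and the sizes CHUNK_SIZE and CHUNK_COUNT are hypothetical and chosen only for illustration. It shows how the first bytes of each free chunk double as the "next" link.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical sizes, for illustration only. */
#define CHUNK_SIZE  64   /* must be at least sizeof(void *) */
#define CHUNK_COUNT 4

static void *top_chunk;  /* head of the list of free chunks */

/* Carve one buffer into chunks and link each chunk to the next one. */
static void *pool_init(void)
{
    char *buffer = malloc((size_t)CHUNK_SIZE * CHUNK_COUNT);
    if (buffer == NULL)
        return NULL;
    for (size_t i = 0; i < CHUNK_COUNT; i++) {
        void *chunk = buffer + i * CHUNK_SIZE;
        void *next = NULL;
        if (i + 1 < CHUNK_COUNT)
            next = buffer + (i + 1) * CHUNK_SIZE;
        *(void **)chunk = next;  /* a free chunk stores the link in its own memory */
    }
    top_chunk = buffer;
    return buffer;
}

/* Pop: the new head is whatever the old head pointed to. */
static void *pool_get(void)
{
    void *chunk = top_chunk;
    if (chunk != NULL)
        top_chunk = *(void **)chunk;
    return chunk;
}

/* Push: store the old head inside the chunk, then make the chunk the head. */
static void pool_return(void *chunk)
{
    *(void **)chunk = top_chunk;
    top_chunk = chunk;
}
```

Once a chunk has been handed out, the caller is free to overwrite the link, which is why the link is only meaningful while the chunk is on the free list.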

The code was originally implemented with C++ atomics, but we were able to simplify it and reproduce the problem with the C atomic builtins:

void *_topChunk;

void *getChunk()
{
    void *chunk;

    // Try to reserve a chunk (for these tests, it has been forced that _topChunk can never be null)
    do {
        chunk = _topChunk;
    } while(!__sync_bool_compare_and_swap(&_topChunk, chunk, *((void **)chunk)));

    return chunk;
}

void returnChunk(void *chunk)
{
    do {
        *((void **)chunk) = _topChunk;
    } while (!__sync_bool_compare_and_swap(&_topChunk, *((void **)chunk), chunk));
}

For the tests we have been running to debug this, we spawn several threads that each execute the following:

while (1) {
    void *ptr = getChunk();
    *((void **)ptr) = (void *)~0ULL;
    returnChunk(ptr);
}

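At some point during execution, getChunk() segfaults because it tries to dereference a 0xfff... pointer. But judging from what returnChunk() writes, *((void **)chunk) should never be 0xfff...; it should be a valid pointer taken from the stack of chunks. Why does this not work?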

We also tried using an intermediate void * instead of dereferencing directly, with exactly the same result.

【Comments】:

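Do what you want to do; there is no need to ask our permission. :)
"To avoid a secondary data structure, we decided to store that pointer in the chunk's own memory. The next available chunk is therefore obtained by dereferencing the chunk itself: void *nextChunk = *((void **)chunk)" This seems fundamentally broken. How does it prevent two or more threads from seeing the same old value at the same time and then writing the same new value?
@AndrewHenle In getChunk(), the compare-and-swap guarantees that each thread accessing the pool concurrently gets a different chunk. In returnChunk(), the pointer to the next chunk is updated before the chunk "becomes public" (is pushed onto the stack). At least, that is what we think should happen in the posted code.
A simple way to avoid the race condition is to use a mutex. Suggestion: 1) lock the mutex 2) manipulate the pointers 3) unlock the mutex
@user3629249 We are trying to make this memory pool as fast and lean as possible, so we would prefer just two atomic operations over a full mutex. If we cannot make it work with atomic operations, we will fall back to a mutex.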

Tags: c multithreading pointers atomic

【Solution 1】:

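I think the problem is in the function getChunk. The third argument of __sync_bool_compare_and_swap may be stale by the time the swap runs. Let's look at a slightly modified version of getChunk: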

void *getChunk()
{
    void *chunk;
    void *chunkNext;

    // Try to reserve a chunk (for these tests, it has been forced that _topChunk can never be null)
    do {
        chunk = _topChunk;
        chunkNext = *(void **)chunk;
        //chunkNext might have been changed meanwhile, but chunk is the same!! 
    } while(!__sync_bool_compare_and_swap(&_topChunk, chunk, chunkNext));
    return chunk;
}

Assume we have a simple chain of three chunks at addresses 0x100, 0x200, and 0x300. We need three threads (A, B, and C) to break the chain:

//The Chain: TOP -> 0x100 -> 0x200 -> 0x300 -> NIL
 Thread   
 A      chnk     = top;         //is 0x100
 A      chnkNext = *chnk;       //is 0x200
   B       chnk = top           //is 0x100
   B       chnkNext = *chnk;    //is 0x200
   B       syncSwap();          //okay, swap takes place
   B       return chnk;         //is 0x100
   /*** The Chain Now: TOP -> 0x200 -> 0x300 -> NIL ***/
     C        chnk = top;      //is 0x200
     C        chnkNext = *chnk //is 0x300
     C        syncSwap         //okay, swap takes place
     C        return chnk;     //is 0x200
   /*** The Chain Now: TOP -> 0x300 -> NIL ***/
   B       returnChunk(0x100); 
   /*** The Chain Now: TOP -> 0x100 -> 0x300 -> NIL ***/
 A      syncSwap(&Top, 0x100, 0x200 /*WRONG, *chnk IS NOW 0x300!!!!*/  );
 A      return chnk;
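The interleaving above can be replayed deterministically on a single thread. The sketch below is only an illustration under assumptions: storage, CHUNK_100, CHUNK_200, CHUNK_300, push, pop and replay_aba are hypothetical names, and the three "chunks" are just pointer-sized slots standing in for the addresses 0x100, 0x200 and 0x300. The steps of threads A, B and C are executed in the order of the trace.

```c
#include <assert.h>
#include <stddef.h>

/* Three pointer-sized slots stand in for the chunks at 0x100, 0x200, 0x300. */
static void *storage[3];
#define CHUNK_100 ((void *)&storage[0])
#define CHUNK_200 ((void *)&storage[1])
#define CHUNK_300 ((void *)&storage[2])

static void *top;

/* Same logic as returnChunk() in the question. */
static void push(void *chunk)
{
    do {
        *(void **)chunk = top;
    } while (!__sync_bool_compare_and_swap(&top, *(void **)chunk, chunk));
}

/* Same logic as the modified getChunk() in this answer. */
static void *pop(void)
{
    void *chunk, *next;
    do {
        chunk = top;
        next = *(void **)chunk;
    } while (!__sync_bool_compare_and_swap(&top, chunk, next));
    return chunk;
}

/* Returns 1 if thread A's stale compare-and-swap succeeds and corrupts the chain. */
static int replay_aba(void)
{
    /* Chain: top -> CHUNK_100 -> CHUNK_200 -> CHUNK_300 -> NULL */
    storage[2] = NULL;
    storage[1] = CHUNK_300;
    storage[0] = CHUNK_200;
    top = CHUNK_100;

    /* Thread A reads top and the link, then is preempted before its CAS. */
    void *a_chunk = top;               /* CHUNK_100 */
    void *a_next = *(void **)a_chunk;  /* CHUNK_200 */

    /* Thread B pops CHUNK_100, then thread C pops CHUNK_200. */
    void *b_chunk = pop();
    void *c_chunk = pop();

    /* Thread B returns its chunk: top -> CHUNK_100 -> CHUNK_300 -> NULL */
    push(b_chunk);

    /* Thread A resumes. top equals a_chunk again, so the CAS succeeds
       even though a_next is no longer the successor of CHUNK_100. */
    int swapped = __sync_bool_compare_and_swap(&top, a_chunk, a_next);

    /* top now points to CHUNK_200, which thread C still owns. */
    return swapped && top == c_chunk && c_chunk == CHUNK_200;
}
```

After this point, top refers to a chunk that thread C is still using. In the test loop from the question, C writes ~0ULL into that chunk, a later getChunk() installs 0xfff... as the new top, and the next call dereferences it, which matches the reported segfault.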

【Discussion】:

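@markmb No, __sync_bool_compare_and_swap (type *ptr, type oldval, type newval) returns true, because the pointer (once again) points to 0x100.
Yes, this is most likely the problem. I could not see this corner case myself. Thank you very much. @user5329483: I deleted my comment because I saw my mistake, but I had not seen your reply to it before deleting; sorry about that.
Unless you use a real lock, I am afraid your problem cannot be solved.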
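For comparison, the lock-based variant suggested in the comments could look roughly like the sketch below. It assumes POSIX threads; the names getChunkLocked, returnChunkLocked and pool_lock are hypothetical and not part of the original code. Because the read of the link and the update of the top pointer happen under the same lock, no other thread can pop and re-push the chunk in between.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

static void *top_chunk;  /* head of the free list, protected by pool_lock */
static pthread_mutex_t pool_lock = PTHREAD_MUTEX_INITIALIZER;

/* Pop a chunk; returns NULL if the pool is empty. */
static void *getChunkLocked(void)
{
    pthread_mutex_lock(&pool_lock);
    void *chunk = top_chunk;
    if (chunk != NULL)
        top_chunk = *(void **)chunk;  /* link is read while the lock is held */
    pthread_mutex_unlock(&pool_lock);
    return chunk;
}

/* Push a chunk back onto the free list. */
static void returnChunkLocked(void *chunk)
{
    pthread_mutex_lock(&pool_lock);
    *(void **)chunk = top_chunk;
    top_chunk = chunk;
    pthread_mutex_unlock(&pool_lock);
}
```

This trades the two atomic operations the poster wanted for a lock and unlock on every call, so it is a correctness baseline rather than the fast path they were aiming for.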