Answers: segfault on Linux in C++ std::condition_variable::notify_all()

【Question title】: segfault on linux on C++ std::condition_variable::notify_all()
【Posted】: 2022-01-25 07:38:48
【Question】:

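I'm trying to get up to speed on the C++11 changes, and I'm writing a thread-safe wrapper around std::queue called SafeQueue. There are two conditions that can block: the queue being full and the queue being empty, and I use a std::condition_variable for each. Unfortunately, on Linux the notify_all() call on my "empty" condition segfaults. It runs fine on a Mac with clang. The segfault happens in the enqueue() method: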

#ifndef mqueue_hpp
#define mqueue_hpp

#include <condition_variable>
#include <cstddef>
#include <mutex>
#include <queue>

//////////////////////////////////////////////////////////////////////
// SafeQueue - A thread-safe templated queue.                       //
//////////////////////////////////////////////////////////////////////
template<class T>
class SafeQueue
{
public:
    // Instantiate a new queue. 0 maxsize means unlimited.
    SafeQueue(unsigned int maxsize = 0);
    ~SafeQueue(void);
    // Enqueue a new T. If enqueue would cause it to exceed maxsize,
    // block until there is room on the queue.
    void enqueue(const T& item);
    // Dequeue a new T and return it. If the queue is empty, wait on it
    // until it is not empty.
    T& dequeue(void);
    // Return size of the queue.
    size_t size(void);
    // Return the maxsize of the queue.
    size_t maxsize(void) const;
private:
    std::mutex m_mutex;
    std::condition_variable m_empty;
    std::condition_variable m_full;
    std::queue<T> m_queue;
    size_t m_maxsize;
};

template<class T>
SafeQueue<T>::SafeQueue(unsigned int maxsize) : m_maxsize(maxsize) { }

template<class T>
SafeQueue<T>::~SafeQueue() { }

template<class T>
void SafeQueue<T>::enqueue(const T& item) {
    // Synchronize.
    if ((m_maxsize != 0) && (size() == m_maxsize)) {
        // Queue full. Can't push more on. Block until there's room.
        std::unique_lock<std::mutex> lock(m_mutex);
        m_full.wait(lock);
    }
    {
        std::lock_guard<std::mutex> lock(m_mutex);
        // Add to m_queue and notify the reader if it's waiting.
        m_queue.push(item);
    }
    m_empty.notify_all();
}

template<class T>
T& SafeQueue<T>::dequeue(void) {
    // Synchronize. No unlock needed due to unique lock.
    if (size() == 0) {
        // Wait until something is put on it.
        std::unique_lock<std::mutex> lock(m_mutex);
        m_empty.wait(lock);
    }
    std::lock_guard<std::mutex> lock(m_mutex);
    // Pull the item off and notify writer if it's waiting on full cond.
    T& item = m_queue.front();
    m_queue.pop();
    m_full.notify_all();
    return item;
}

template<class T>
size_t SafeQueue<T>::size(void) {
    std::lock_guard<std::mutex> lock(m_mutex);
    return m_queue.size();
}

template<class T>
size_t SafeQueue<T>::maxsize(void) const {
    return m_maxsize;
}

#endif /* mqueue_hpp */

Obviously I'm doing something wrong, but I can't figure out what. Output from gdb:

Core was generated by `./test'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0  0x0000000000000000 in ?? ()
(gdb) bt
#0  0x0000000000000000 in ?? ()
#1  0x0000000000414739 in std::condition_variable::notify_all() ()
#2  0x00000000004054c4 in SafeQueue<int>::enqueue (this=0x7ffee06b3470,
    item=@0x7ffee06b355c: 1) at ../mqueue.hpp:59
#3  0x0000000000404ab6 in testsafequeue () at test.cpp:13
#4  0x0000000000404e99 in main () at test.cpp:49
(gdb) frame 2
#2  0x00000000004054c4 in SafeQueue<int>::enqueue (this=0x7ffee06b3470,
    item=@0x7ffee06b355c: 1) at ../mqueue.hpp:59
59          m_empty.notify_all();
(gdb) info locals
No locals.
(gdb) this.m_empty
Undefined command: "this.m_empty".  Try "help".
(gdb) print this->m_empty
$1 = {_M_cond = {__data = {{__wseq = 0, __wseq32 = {__low = 0,
          __high = 0}}, {__g1_start = 0, __g1_start32 = {__low = 0,
          __high = 0}}, __g_refs = {0, 0}, __g_size = {0, 0},
      __g1_orig_size = 0, __wrefs = 0, __g_signals = {0, 0}},
    __size = '\000' <repeats 47 times>, __align = 0}}

Any help is appreciated.

My sample test that crashes:

SafeQueue<int> queue(10);
queue.enqueue(1);

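It segfaults in notify_all() inside enqueue().

【Question discussion】:

There is definitely a data race in your implementation: enqueue() should acquire the lock only once, check whatever it needs to check under that lock, and block on the condition variable if it has to. However, these are semantic errors in the implementation and they shouldn't cause a crash. My guess is that the crash comes from how the class is being used.
You should delete your copy/move constructors and assignment operators.
@o11c They are already implicitly defined as deleted or not implicitly declared at all.
I showed above how I use the implementation. It's very simple.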

Tags: c++ linux

【Solution 1】:

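In both functions you use one lock to wait on the condition variable, but as soon as the wait finishes you destroy that lock and then acquire a new one to actually operate on the queue.

Between releasing the first lock and acquiring the second, another thread can lock the mutex and, for example, remove the very object the original thread meant to take out of the queue, which can lead to calling front on an empty queue.

You need to acquire a single lock in each function and perform every operation while holding it. The lock is released (automatically) only while wait is executing.

Also, returning from wait does not mean that notify_* was called: wait can wake up spuriously. In addition, notify_all may wake several threads to tell them that one new element is available. You need to call wait in a loop and check the condition your operation requires before leaving the loop.

wait also has an overload that takes that condition as a predicate in its second argument, which lets you avoid writing the loop explicitly.

Besides that,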
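As a minimal sketch of the predicate overload, here is a free function that blocks until a plain std::queue is non-empty and then pops one element. The helper name `wait_and_pop` and its parameter list are hypothetical, chosen only for illustration; `wait(lock, pred)` itself is the standard overload and behaves like `while (!pred()) cv.wait(lock);`.

```cpp
#include <cassert>
#include <condition_variable>
#include <mutex>
#include <queue>
#include <utility>

// Hypothetical helper for illustration: waits until `q` is non-empty,
// then removes and returns its front element.
template <class T>
T wait_and_pop(std::queue<T>& q, std::mutex& m, std::condition_variable& cv) {
    std::unique_lock<std::mutex> lock(m);
    // Predicate overload: re-checks the condition after every wakeup,
    // so spurious wakeups and items taken by other threads are handled.
    cv.wait(lock, [&q] { return !q.empty(); });
    assert(!q.empty());  // guaranteed by the predicate; the lock is held here
    T item = std::move(q.front());
    q.pop();
    return item;
}
```

Because the predicate is evaluated before the first wait, the call returns immediately if an item is already queued.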

T& item = m_queue.front();
m_queue.pop();
//...
return item;

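also causes undefined behavior. pop destroys the object that item refers to, which leaves a dangling reference, and using the returned reference is undefined behavior.

You need to copy (or move) the object out of the queue instead of keeping a reference to it: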

T item = m_queue.front();
m_queue.pop();
//...
return item;

Consequently, dequeue must also return T, not T&.

【Discussion】:

【Solution 2】:
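A small sketch of taking the front element out by value, using a move rather than a copy so that types like std::string are not copied twice. The helper name `take_front` is hypothetical; in the SafeQueue context the caller would also need to hold the mutex.

```cpp
#include <cassert>
#include <queue>
#include <string>
#include <utility>

// Hypothetical helper: removes the front element of `q` and returns it by value.
template <class T>
T take_front(std::queue<T>& q) {
    assert(!q.empty());             // precondition: front() on an empty queue is UB
    T item = std::move(q.front());  // move out while the element is still alive
    q.pop();                        // destroys the (moved-from) element in the queue
    return item;                    // `item` is our own object: no dangling reference
}
```

For example, with a queue holding "first" and "second", take_front returns "first" and leaves one element behind.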

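First, trying to make a collection class thread-safe by wrapping all of its public methods in a mutex is a mistake. See my previous answer on this topic for the race conditions that are inherent in trying to build thread-safe containers.

As for your current code, these lines look suspicious: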

T& item = m_queue.front();
m_queue.pop();

Your m_queue.front call returns a reference to the item in the queue, but the pop method then destroys the very instance that reference points to.

Better:

T SafeQueue<T>::dequeue(void) { // return a copy of the item in the queue
    ... 
    T item = m_queue.front(); // make the copy to be returned
    m_queue.pop();

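As for the rest of your implementation, it has several race conditions and other problems. You should never assume that the condition you waited for still holds just because wait returned (that's a race condition!). Instead, hold the lock for the entire function. When you call wait, it releases the lock until wait returns.

As user17732522 pointed out in the comments, locking a std::mutex recursively (for example, by calling size() from enqueue or dequeue while already holding the lock) is undefined behavior and will most likely deadlock. You could use a recursive mutex, but it's better to avoid this pattern altogether.

Here is an improved version: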
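One common way to avoid the recursive lock without a recursive mutex is sketched below: the public size() takes the lock, and members that already hold the lock call a private helper that assumes the caller owns the mutex. The class name `CountedQueue` and the `try_push`/`size_locked` members are hypothetical and only illustrate the pattern.

```cpp
#include <cassert>
#include <cstddef>
#include <mutex>
#include <queue>

// Hypothetical class illustrating the "locked public API, unlocked private helper" pattern.
template <class T>
class CountedQueue {
public:
    std::size_t size() {
        std::lock_guard<std::mutex> lock(m_mutex);
        return size_locked();
    }

    // Pushes only if fewer than `limit` items are queued (0 means unlimited).
    // The check and the push happen under one lock, so nothing can change in between.
    bool try_push(const T& item, std::size_t limit) {
        std::lock_guard<std::mutex> lock(m_mutex);
        if (limit != 0 && size_locked() >= limit) {
            return false;
        }
        m_queue.push(item);
        return true;
    }

private:
    // Caller must already hold m_mutex. Calling size() here instead would lock
    // m_mutex a second time from the same thread, which is undefined behavior.
    std::size_t size_locked() const { return m_queue.size(); }

    std::mutex m_mutex;
    std::queue<T> m_queue;
};
```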

template<class T>
void SafeQueue<T>::enqueue(const T& item) {

    std::unique_lock<std::mutex> lock(m_mutex);

    while ((m_maxsize != 0) && (m_queue.size() >= m_maxsize)) {
        // Queue full. Can't push more on. Block until there's room.
        m_full.wait(lock); // this will atomically unlock the mutex and wait for the cv to get notified
    }
    m_queue.push(item);
    m_empty.notify_all();
}

template<class T>
T SafeQueue<T>::dequeue(void) {

    std::unique_lock<std::mutex> lock(m_mutex);

    while (m_queue.size() == 0) {
        // Wait until something is put on it.
        m_empty.wait(lock);  // this will atomically unlock the mutex and wait for the cv to get notified
    }

    // Pull the item off and notify writer if it's waiting on full cond.
    T item = m_queue.front();
    m_queue.pop();
    m_full.notify_all();
    return item;
}
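For reference, a self-contained sketch that assembles the suggestions from both answers into one compilable class (one lock per operation, the predicate form of wait, dequeue returning T by value), plus a small producer/consumer driver. The driver `run_producer_consumer` is hypothetical, and the build command is an assumption: on Linux something like `g++ -std=c++17 -pthread`.

```cpp
#include <cassert>
#include <condition_variable>
#include <cstddef>
#include <mutex>
#include <queue>
#include <thread>
#include <utility>

template <class T>
class SafeQueue {
public:
    explicit SafeQueue(std::size_t maxsize = 0) : m_maxsize(maxsize) {}

    void enqueue(const T& item) {
        std::unique_lock<std::mutex> lock(m_mutex);
        // Block while full; maxsize 0 means unlimited.
        m_full.wait(lock, [this] { return m_maxsize == 0 || m_queue.size() < m_maxsize; });
        m_queue.push(item);
        m_empty.notify_all();  // wake consumers waiting for data
    }

    T dequeue() {
        std::unique_lock<std::mutex> lock(m_mutex);
        m_empty.wait(lock, [this] { return !m_queue.empty(); });
        T item = std::move(m_queue.front());
        m_queue.pop();
        m_full.notify_all();  // wake producers waiting for room
        return item;
    }

    std::size_t size() {
        std::lock_guard<std::mutex> lock(m_mutex);
        return m_queue.size();
    }

private:
    std::mutex m_mutex;
    std::condition_variable m_empty;  // signalled after an item is added
    std::condition_variable m_full;   // signalled after an item is removed
    std::queue<T> m_queue;
    std::size_t m_maxsize;
};

// Hypothetical driver: one producer thread pushes 1..n through a queue of
// capacity `cap` while the calling thread consumes; returns the sum received.
inline long long run_producer_consumer(int n, std::size_t cap) {
    SafeQueue<int> q(cap);
    std::thread producer([&q, n] {
        for (int i = 1; i <= n; ++i) q.enqueue(i);
    });
    long long sum = 0;
    for (int i = 0; i < n; ++i) sum += q.dequeue();
    producer.join();
    assert(q.size() == 0);  // everything produced was consumed
    return sum;
}
```

With a capacity of 10 the producer regularly blocks on m_full, which exercises both condition variables.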

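Again, I stand by my first paragraph: creating thread-safe containers is a design fallacy. Instead, make the scenario that uses a non-thread-safe container thread-safe!

【Discussion】: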
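A sketch of what "making the scenario thread-safe" can look like: the code that shares a plain std::queue owns the mutex and decides what one atomic step is. Here a step is "wait for work, then drain everything currently queued", a compound operation that a container with per-method locking cannot offer without races. `WorkChannel`, `post`, `drain` and `close_channel` are hypothetical names for illustration.

```cpp
#include <cassert>
#include <condition_variable>
#include <mutex>
#include <queue>
#include <vector>

// Hypothetical scenario-level synchronization around a plain std::queue.
struct WorkChannel {
    std::mutex mutex;
    std::condition_variable ready;
    std::queue<int> pending;  // plain, non-thread-safe container
    bool closed = false;
};

inline void post(WorkChannel& ch, int job) {
    {
        std::lock_guard<std::mutex> lock(ch.mutex);
        ch.pending.push(job);
    }
    ch.ready.notify_one();
}

inline void close_channel(WorkChannel& ch) {
    {
        std::lock_guard<std::mutex> lock(ch.mutex);
        ch.closed = true;
    }
    ch.ready.notify_all();
}

// Blocks until there is work or the channel is closed, then drains all queued
// jobs under a single lock. Returns an empty vector only if closed and empty.
inline std::vector<int> drain(WorkChannel& ch) {
    std::unique_lock<std::mutex> lock(ch.mutex);
    ch.ready.wait(lock, [&ch] { return !ch.pending.empty() || ch.closed; });
    std::vector<int> batch;
    while (!ch.pending.empty()) {
        batch.push_back(ch.pending.front());
        ch.pending.pop();
    }
    return batch;
}
```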

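Since the OP's size() locks the mutex, calling size() while already holding the lock can cause undefined behavior or a deadlock.
@user17732522 - Whoa! I missed that. That would definitely deadlock. Good catch.
I'll have to read your post. The whole point is to use a thread-safe queue for inter-thread communication.
Your code looks like my first implementation, before I read cppreference.com and based mine on their example code. It still crashes, though. My test is fairly simple: a queue of ints. I'll add it to my post above.
While I appreciate everyone's comments, I should point out that they are unrelated to the problem at hand, which is the segfault. Your suggestions improved the implementation but did not fix the problem: it still segfaults in notify_all(), seemingly at random, during either dequeue or enqueue. If anyone has a comment relevant to that, please share it.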