C++11 async is using only one core

【Question title】: C++11 async is using only one core
【Posted】: 2015-02-23 16:44:57
【Question】:

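I'm trying to parallelise a long-running function in C++, but with std::async it only uses one core.

It's not that the function's running time is too short: the test data I'm currently using takes about 10 minutes to run.

My logic is as follows. I create NThreads worth of futures (each takes a slice of the loop rather than a single cell, so each is a nicely long-running task), and each one dispatches an async task. Once they have all been created, the program waits for them to complete. But it always uses one core?!

This isn't me glancing at top and saying it looks roughly like one CPU, either. My ZSH config prints the CPU % of the last command, and it is always exactly 100%, never higher.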

auto NThreads = 12;
auto BlockSize = (int)std::ceil((int)(NThreads / PathCountLength));

std::vector<std::future<std::vector<unsigned __int128>>> Futures;

for (auto I = 0; I < NThreads; ++I) {
    std::cout << "HERE" << std::endl;
    unsigned __int128 Min = I * BlockSize;
    unsigned __int128 Max = I * BlockSize + BlockSize;

    if (I == NThreads - 1)
        Max = PathCountLength;

    Futures.push_back(std::async(
        [](unsigned __int128 WMin, unsigned __int128 Min, unsigned __int128 Max,
           std::vector<unsigned __int128> ZeroChildren,
           std::vector<unsigned __int128> OneChildren,
           unsigned __int128 PathCountLength)
            -> std::vector<unsigned __int128> {
            std::vector<unsigned __int128> LocalCount;
            for (unsigned __int128 I = Min; I < Max; ++I)
                LocalCount.push_back(KneeParallel::pathCountOrStatic(
                    WMin, I, ZeroChildren, OneChildren, PathCountLength));
            return LocalCount;
        },
        WMin, Min, Max, ZeroChildInit, OneChildInit, PathCountLength));
}

for (auto &Future : Futures) {
    Future.get();
}
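
One thing that may be worth checking in the snippet above is the block size. Assuming PathCountLength is an integer type (as the lambda's parameter list suggests), NThreads / PathCountLength is an integer division and evaluates to 0 whenever PathCountLength is larger than NThreads; std::ceil cannot recover the lost fraction afterwards. Every task except the last would then receive an empty range, and the last one (whose Max is reset to PathCountLength) would do all the work, which would look like a single busy core. The following is a minimal sketch of ceiling division for the chunking, not the poster's code: splitRange and blockSizeAsPosted are hypothetical helpers, and std::uint64_t stands in for unsigned __int128.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <utility>
#include <vector>

// Hypothetical helper: mirrors the posted expression NThreads / PathCountLength
// with the casts dropped, to show what integer division does to it.
std::uint64_t blockSizeAsPosted(std::uint64_t total, std::uint64_t parts) {
    assert(total > 0);
    return parts / total;  // 0 whenever total > parts
}

// Hypothetical helper: split [0, total) into `parts` contiguous half-open
// ranges, using ceiling division for the block size.
std::vector<std::pair<std::uint64_t, std::uint64_t>>
splitRange(std::uint64_t total, std::uint64_t parts) {
    assert(parts > 0);
    const std::uint64_t block = (total + parts - 1) / parts;  // ceil(total / parts)
    std::vector<std::pair<std::uint64_t, std::uint64_t>> ranges;
    ranges.reserve(parts);
    for (std::uint64_t i = 0; i < parts; ++i) {
        // Clamp both ends so trailing ranges are empty rather than out of bounds.
        const std::uint64_t lo = std::min(i * block, total);
        const std::uint64_t hi = std::min(lo + block, total);
        ranges.emplace_back(lo, hi);
    }
    return ranges;
}
```

Calling splitRange(PathCountLength, NThreads) and handing each (first, second) pair to one task as its Min and Max would be one way to use it.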

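Does anyone have any insights?

I'm compiling with clang and LLVM on Arch Linux. Are there any compile flags I need? As far as I can tell, C++11 standardised the thread library.

Edit: In case it helps anyone offer further hints: when I comment out the local vector, it runs on all cores as it should; when I put it back in, it drops back to one core.

Edit 2: So I pinned down a solution, but it seems very bizarre. Returning the vector from the lambda function pinned it to one core, so now I get around this by passing in a shared_ptr to the output vector and operating on that. And hey presto, it fires up on all the cores!

I figured there was no point using futures now, since I don't have a return value, so I'd use threads instead. Nope! Using threads with no return value also uses one core. Weird, eh?

Fine, back to using futures, just returning an int to throw away or something. Yes, you guessed it: even returning an int from the thread sticks the program to one core. Except futures can't have void lambda functions. So my solution is to pass a pointer, used to store the output, into an int lambda function that never returns anything. Yeah, it feels like duct tape, but I can't see a better solution.

It seems so... bizarre? As if the compiler were somehow interpreting the lambda incorrectly. Could it be because I'm using the dev version of LLVM rather than a stable branch...?

Anyway, here is my solution, because I hate nothing more than finding my problem on here with no answer: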

auto NThreads = 4;
auto BlockSize = (int)std::ceil((int)(NThreads / PathCountLength));

auto Futures = std::vector<std::future<int>>(NThreads);
auto OutputVectors =
    std::vector<std::shared_ptr<std::vector<unsigned __int128>>>(
        NThreads, std::make_shared<std::vector<unsigned __int128>>());

for (auto I = 0; I < NThreads; ++I) {
  unsigned __int128 Min = I * BlockSize;
  unsigned __int128 Max = I * BlockSize + BlockSize;

  if (I == NThreads - 1)
    Max = PathCountLength;

  Futures[I] = std::async(
      std::launch::async,
      [](unsigned __int128 WMin, unsigned __int128 Min, unsigned __int128 Max,
         std::vector<unsigned __int128> ZeroChildren,
         std::vector<unsigned __int128> OneChildren,
         unsigned __int128 PathCountLength,
         std::shared_ptr<std::vector<unsigned __int128>> OutputVector)
          -> int {
        for (unsigned __int128 I = Min; I < Max; ++I) {
          OutputVector->push_back(KneeParallel::pathCountOrStatic(
              WMin, I, ZeroChildren, OneChildren, PathCountLength));
        }
      },
      WMin, Min, Max, ZeroChildInit, OneChildInit, PathCountLength,
      OutputVectors[I]);
}

for (auto &Future : Futures) {
  Future.get();
}
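
For anyone adapting the workaround above, three details may be worth a second look. The fill constructor std::vector<...>(NThreads, std::make_shared<...>()) evaluates make_shared once and copies that one shared_ptr NThreads times, so every task pushes into the same vector concurrently. A lambda declared -> int that flows off its end without a return statement has undefined behaviour. And std::future<void> is a valid type, so a task with no result does not need a dummy int. Below is a hedged sketch of the same structure with those points addressed. It is not the poster's final code: runInBlocks and work are hypothetical names, work stands in for KneeParallel::pathCountOrStatic, and std::uint64_t replaces unsigned __int128.

```cpp
#include <algorithm>
#include <cstdint>
#include <future>
#include <vector>

// Hypothetical stand-in for KneeParallel::pathCountOrStatic.
std::uint64_t work(std::uint64_t i) { return i * i; }

// Hypothetical helper: compute work(i) for every i in [0, total) using
// `threads` tasks, each writing into its own output vector.
std::vector<std::uint64_t> runInBlocks(std::uint64_t total, unsigned threads) {
    std::vector<std::uint64_t> result;
    if (total == 0 || threads == 0) return result;

    const std::uint64_t block = (total + threads - 1) / threads;  // ceil division

    // One independent vector per task; declared before the futures so it
    // outlives them.
    std::vector<std::vector<std::uint64_t>> outputs(threads);
    std::vector<std::future<void>> futures;
    futures.reserve(threads);

    for (unsigned t = 0; t < threads; ++t) {
        const std::uint64_t lo = std::min<std::uint64_t>(t * block, total);
        const std::uint64_t hi = std::min<std::uint64_t>(lo + block, total);
        std::vector<std::uint64_t>* out = &outputs[t];
        // A lambda with no return value yields std::future<void>.
        futures.push_back(std::async(std::launch::async, [lo, hi, out] {
            out->reserve(hi - lo);
            for (std::uint64_t i = lo; i < hi; ++i) out->push_back(work(i));
        }));
    }

    for (auto& f : futures) f.get();  // wait, and rethrow any task exception

    // Concatenate the per-task results in index order.
    for (const auto& part : outputs)
        result.insert(result.end(), part.begin(), part.end());
    return result;
}
```

Returning each block's vector by value from the lambda (as in the first snippet of the question) would work equally well here; the per-task output vectors are kept only to stay close to the workaround's shape. On Linux, builds typically need -pthread for the tasks to start.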

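【Comments】:

Async calls have nothing to do with multiprocessing.
@texasbruce Do you mean that std::async is not the right way to do multiprocessing? Which facility would make it use multiple cores?

Tags: c++ multithreading c++11 asynchronous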

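【Solution 1】:

By supplying a launch policy as the first argument to async, you can configure it to run deferred (std::launch::deferred), to run in its own thread (std::launch::async), or to let the system decide between the two (std::launch::async | std::launch::deferred). The last of these is the default behaviour.

So, to force it to run in another thread, change your call to std::async(std::launch::async, /*...*/).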
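
As a small illustration of the difference between the two policies, the sketch below reports which thread a task ran on. threadUsedBy is a hypothetical helper written for this example, not part of the standard library.

```cpp
#include <future>
#include <thread>

// Hypothetical helper: launch a trivial task under the given policy and
// return the id of the thread that executed it.
std::thread::id threadUsedBy(std::launch policy) {
    auto fut = std::async(policy, [] { return std::this_thread::get_id(); });
    return fut.get();  // a deferred task runs here, on the calling thread
}
```

With std::launch::deferred the returned id equals the caller's std::this_thread::get_id(), because the task only runs inside get(). With std::launch::async it differs, because the task runs on a new thread. With the default policy either outcome is allowed, and the choice is left to the implementation.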

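【Comments】:

Thanks for the suggestion, but it didn't help...? I added std::launch::async at the beginning, but it exhibits the same behaviour. Snippet from zsh after a short run: "13.92s user 5.28s system 100% cpu 19.184 total"
I've edited the question to add more information; see if that adds any insight.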