Crash in thrust sorting example

【Question title】: Crash in thrust sorting example
【Posted】: 2014-08-22 09:32:55
【Question】:

I am trying the first example from the official site, https://developer.nvidia.com/thrust, with the vector size changed to 32 << 23:

#include <thrust/host_vector.h>
#include <thrust/device_vector.h>
#include <thrust/generate.h>
#include <thrust/sort.h>
#include <thrust/copy.h>
#include <algorithm>
#include <cstdlib>
#include <time.h>
#include <iostream>

using namespace std;

int main(void){
  // generate random numbers serially
  thrust::host_vector<int> h_vec(32 << 23);
  std::generate(h_vec.begin(), h_vec.end(), rand);
  std::cout << "1." << time(NULL) << endl;

  // transfer data to the device
  thrust::device_vector<int> d_vec = h_vec;
  cout << "2." << time(NULL) << endl;
  // sort data on the device (846M keys per second on GeForce GTX 480)
  thrust::sort(d_vec.begin(), d_vec.end());
  // transfer data back to host
  thrust::copy(d_vec.begin(), d_vec.end(), h_vec.begin());
  std::cout << "3." << time(NULL) << endl;

  return 0;
}

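But the program crashes when it reaches the thrust::sort line. I tried std::vector and std::sort instead, and they work fine.

Is this a bug in Thrust? I am using Thrust 1.7 + CUDA 6.5 + Visual Studio 2013 Update 2.

My GPU is a GeForce GT 740M with 2048 MB of total memory.

I monitored the process with Process Explorer and saw that it allocated 1.0 GB of memory. But I have 2 GB of GPU memory and 16 GB of main CPU memory.

The error message is "A problem caused the program to stop working correctly. Windows will close the program and notify you if a solution is available. [Debug] [Close program]". After clicking [Debug], I can see the call stack. The problem is at this line: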

thrust::device_vector<int> d_vec = h_vec;

The last CUDA frame in the call stack is:

testcuda.exe!thrust::system::cuda::detail::malloc<thrust::system::cuda::detail::tag>(thrust::system::cuda::detail::execution_policy<thrust::system::cuda::detail::tag> & __formal, unsigned __int64 n) Line 48  C++

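This looks like a memory allocation problem. But I have 2 GB of GPU memory and 16 GB of main CPU memory. Why?

To Robert:

The original example runs fine, even for 32

My test code is here: https://github.com/henrywoo/wufuheng/blob/master/testcuda.cu

In my tests no exception is reported, only a runtime error.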

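【Comments】:

h_vec(32 << 23) will try to allocate an array of about 270 million elements. Was an out-of-memory error thrown?
Maybe your hardware cannot handle a 1 GB vector.
To write a better question, instead of saying "the program crashes", paste the actual error output into your question (you can edit your own question). Also state which GPU you are running on. Does the code work correctly with the original vector size of 32<<20? If so, your GPU is most likely running out of memory.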

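Tags: c++ thrust

【Solution 1】:

sizeof(int) * (32 << 23) means you are allocating about 1 GB of GPU RAM. Most likely your card cannot handle that many elements. This could be because:

There is not enough GPU RAM overall
There is not enough contiguous free GPU RAM (required because a vector must fit in a contiguous block of memory)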

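【Comments】:

I have 2 GB of GPU memory and 16 GB of main CPU memory. How can I check whether there is enough contiguous free GPU RAM?
I'm afraid I don't know of a way. However, in my experience, with 2 GB of RAM it is very unlikely that you will find 1 GB of free RAM. Besides, given that the failure occurs in malloc, this is very likely the cause.
Sorting on the device requires O(N) temporary storage. Sorting a 1 GB vector requires an additional 1 GB of temporary storage, i.e. about 2 GB in total. Because of display overhead and other things, your 2 GB GPU does not have that much free. You can query the free memory with a CUDA API call, but because of fragmentation it may not all be available in a single allocation.
As I mentioned in the post, the crash happens while copying memory from main RAM to GPU RAM, before the sort function is called.
Most likely you do not have enough free memory. You never answered my question about whether the original size shown in the sample code (32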