Understanding Thrust (CUDA) memory usage

【Question title】: Understanding Thrust (CUDA) memory usage
【Posted】: 2016-03-25 06:15:35
【Question】:

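I am using the cuda/thrust library to run some Monte Carlo simulations. This works very well up to a certain number of simulations, at which point I get a bad_alloc exception. That seems reasonable, because more simulations in my code means handling larger and larger device_vectors, so I expect this kind of exception to show up at some point.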

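What I would like to do now is set an upper limit on the number of simulations based on the memory available on my GPU. I could then split the workload into batches of simulations.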

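So I have been trying to size my problem before launching a set of simulations. Unfortunately, when I try to understand how memory is managed using simple examples, I get surprising results.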

Here is an example of the code I have been testing:

#include <iostream>
#include <cuda.h>
#include <thrust/system_error.h>
#include <thrust/device_vector.h>
#include <thrust/execution_policy.h>
#include <cuda_profiler_api.h>

int main() 
{
    size_t freeMem, totalMem;

    cudaDeviceReset();
    cudaSetDevice(0);

    cudaMemGetInfo(&freeMem, &totalMem);
    std::cout << "Total Memory | Free Memory "<< std::endl;
    std::cout << totalMem << ", " << freeMem << std::endl;

    thrust::device_vector<float> vec1k(1000, 0);

    cudaMemGetInfo(&freeMem, &totalMem);
    std::cout << totalMem << ", " << freeMem << std::endl;

    thrust::device_vector<float> vec100k(100000, 0);

    cudaMemGetInfo(&freeMem, &totalMem);
    std::cout << totalMem << ", " << freeMem << std::endl;

    thrust::device_vector<float> vec1M(1000000, 0);

    cudaMemGetInfo(&freeMem, &totalMem);
    std::cout << totalMem << ", " << freeMem << std::endl;

    return 0;
}

Here are the results I get:

Total Memory | Free Memory
2147483648, 2080542720
2147483648, 2079494144
2147483648, 2078445568
2147483648, 2074382336

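So, basically:

The 1,000-element vector (plus everything else it needs) uses 1,048,576 bytes.
The 100,000-element vector also uses 1,048,576 bytes!
The 1,000,000-element vector uses 4,063,232 bytes.

I expected memory usage to scale roughly in proportion to the number of elements, but I get "4x" where I expected "10x", and the relationship does not hold at all between 1,000 and 100,000 elements.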

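So, my two questions are:

Can anyone help me understand these numbers?
If I cannot estimate how much memory my code will use, what is a good strategy to make sure my program fits in memory?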

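Edit

Following Mai Longdong's comment, I tried two vectors, one of 262144 floats (4 bytes each) and the other of 262145. Unfortunately, things do not look like a straightforward "one 1 MB page per allocation":

Size of the first vector (262144 floats): 1048576 bytes
Size of the second vector (262145 floats): 1179648 bytes

The difference between the two is 131072 bytes (128 KB). Is the page size variable? Does this make sense?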
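
The five measurements above (three from the first test, two from this edit) are all reproduced by one simple rounding rule, sketched below. This is only a hypothesis fitted to these numbers, not documented driver behaviour: it assumes a 1 MiB minimum cost per allocation and rounding up to 128 KiB multiples above that. The names `round_up`, `modelled_footprint`, `kMinChunk` and `kGranule` are made up for this illustration.

```cpp
#include <cstddef>

// ASSUMPTIONS fitted to the measurements in the question (undocumented,
// likely to differ on other GPUs, drivers and operating systems):
constexpr std::size_t kMinChunk = std::size_t(1) << 20;  // 1 MiB minimum cost
constexpr std::size_t kGranule  = std::size_t(1) << 17;  // 128 KiB rounding

// Round n up to the next multiple of `multiple` (multiple must be non-zero).
constexpr std::size_t round_up(std::size_t n, std::size_t multiple)
{
    return ((n + multiple - 1) / multiple) * multiple;
}

// Hypothetical model of how much free memory one allocation consumes.
constexpr std::size_t modelled_footprint(std::size_t requested_bytes)
{
    return requested_bytes == 0         ? 0
         : requested_bytes <= kMinChunk ? kMinChunk
                                        : round_up(requested_bytes, kGranule);
}

// Compile-time checks against the observed drops in free memory
// (4 bytes per float, as stated in the question).
static_assert(modelled_footprint(1000u * 4u)    == 1048576u, "1,000 floats");
static_assert(modelled_footprint(100000u * 4u)  == 1048576u, "100,000 floats");
static_assert(modelled_footprint(1000000u * 4u) == 4063232u, "1,000,000 floats");
static_assert(modelled_footprint(262144u * 4u)  == 1048576u, "262144 floats");
static_assert(modelled_footprint(262145u * 4u)  == 1179648u, "262145 floats");
```

Under this model the "4x" is not mysterious: 4,000,000 requested bytes round up to 31 blocks of 128 KiB, which is 4,063,232 bytes, while the two smaller vectors both hit the assumed 1 MiB floor. Five data points cannot confirm the rule, so treat it as a way to read the numbers rather than as a specification.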

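【Comments】:

Have you read devtalk.nvidia.com/default/topic/525397/…?
Well, I just did, and it is very informative, thanks a lot. It seems the memory allocation process is far from linear.
Or this one on Stack Overflow
OK, I think the main idea is that a fairly large default block of memory is allocated as long as the device_vector is below some size threshold and enough memory is available.
Re your edit: actually, any multiple of 64 kB would make sense. I answered a question in the past where the page size was 64 kB. In any case, this is not documented.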

Tags: memory-management cuda thrust

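【Solution 1】:

Thrust does nothing special with memory management. The default allocator is just cudaMalloc, and what you are seeing is the driver memory manager's page size selection algorithm at work. This is undocumented, and there is no indication that the behaviour is consistent across platforms and hardware versions.

That said, if I expand your code into something a bit more useful: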

#include <iostream>
#include <vector>
#include <thrust/system_error.h>
#include <thrust/device_vector.h>
#include <thrust/execution_policy.h>

void report_mem(size_t allocd, bool first=false)
{
    size_t freeMem, totalMem;
    cudaMemGetInfo(&freeMem, &totalMem);
    if (first) 
        std::cout << "Allocated | Total Memory | Free Memory "<< std::endl;
    std::cout << allocd << ", " << totalMem << ", " << freeMem << std::endl;
}

int main() 
{
    cudaSetDevice(0);

    report_mem(0, true);
    std::vector<size_t> asizes;
    const int nallocs = 10;
    for(int i=0; i < nallocs; i++) asizes.push_back(1<<14);
    for(int i=0; i < nallocs; i++) asizes.push_back(1<<16);
    for(int i=0; i < nallocs; i++) asizes.push_back(1<<18);
    for(int i=0; i < nallocs; i++) asizes.push_back(1<<20);
    for(int i=0; i < nallocs; i++) asizes.push_back(1<<22);

    typedef thrust::device_vector<float> dvecf_t;
    // The vectors are never freed, so every allocation stays live
    // while the free-memory figure is sampled.
    std::vector<dvecf_t*> allocs;
    auto it = asizes.begin();
    for(; it != asizes.end(); ++it) {
        dvecf_t* v = new dvecf_t(*it);
        allocs.push_back(v);
        report_mem(v->capacity() * sizeof(float));
    }
    return 0;
}

and run it on a compute capability 2.1 device on 64-bit Windows, I get this:

Allocated | Total Memory | Free Memory 
0, 1073741824, 1007849472
65536, 1073741824, 1006800896
65536, 1073741824, 1006800896
65536, 1073741824, 1006800896
65536, 1073741824, 1006800896
65536, 1073741824, 1006800896
65536, 1073741824, 1006800896
65536, 1073741824, 1006800896
65536, 1073741824, 1006800896
65536, 1073741824, 1006800896
65536, 1073741824, 1006800896
262144, 1073741824, 1005752320
262144, 1073741824, 1005752320
262144, 1073741824, 1005752320
262144, 1073741824, 1005752320
262144, 1073741824, 1004703744
262144, 1073741824, 1004703744
262144, 1073741824, 1004703744
262144, 1073741824, 1004703744
262144, 1073741824, 1003655168
262144, 1073741824, 1003655168
1048576, 1073741824, 1002606592
1048576, 1073741824, 1001558016
1048576, 1073741824, 1000509440
1048576, 1073741824, 999460864
1048576, 1073741824, 998412288
1048576, 1073741824, 997363712
1048576, 1073741824, 996315136
1048576, 1073741824, 995266560
1048576, 1073741824, 994217984
1048576, 1073741824, 993169408
4194304, 1073741824, 988975104
4194304, 1073741824, 984780800
4194304, 1073741824, 980586496
4194304, 1073741824, 976392192
4194304, 1073741824, 972197888
4194304, 1073741824, 968003584
4194304, 1073741824, 963809280
4194304, 1073741824, 959614976
4194304, 1073741824, 955420672
4194304, 1073741824, 951226368
16777216, 1073741824, 934449152
16777216, 1073741824, 917671936
16777216, 1073741824, 900894720
16777216, 1073741824, 884117504
16777216, 1073741824, 867340288
16777216, 1073741824, 850563072
16777216, 1073741824, 833785856
16777216, 1073741824, 817008640
16777216, 1073741824, 800231424

I interpret this as meaning that the allocation granularity is 1 MiB (1048576 or 2^20 bytes) on the platform I tested. Your platform may differ.
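
For the second question (how to make sure the program fits in memory), one conservative approach is to budget every vector as if it were rounded up to the allocation granularity, then derive a batch size from the free memory reported by cudaMemGetInfo. The sketch below is host-only arithmetic under stated assumptions: the 1 MiB granularity is taken from the output above and may be wrong on other platforms, and `max_sims_per_batch`, `batch_count`, `round_down` and the `reserve_bytes` safety margin are hypothetical names, not part of Thrust or CUDA.

```cpp
#include <cstddef>

// ASSUMPTION: 1 MiB granularity, as observed on the platform tested above.
// This value is undocumented and platform-specific; measure it yourself.
constexpr std::size_t kGranularity = std::size_t(1) << 20;

// Round n down to a multiple of `multiple` (multiple must be non-zero).
constexpr std::size_t round_down(std::size_t n, std::size_t multiple)
{
    return (n / multiple) * multiple;
}

// Largest number of simulations per batch such that `n_vectors` device
// vectors, each storing `bytes_per_sim` bytes per simulation, fit into
// `free_bytes` minus a safety margin of `reserve_bytes`.
// Each vector gets an equal share of the budget, rounded DOWN to a whole
// number of granules, so rounding each allocation UP cannot exceed it.
constexpr std::size_t max_sims_per_batch(std::size_t free_bytes,
                                         std::size_t reserve_bytes,
                                         std::size_t n_vectors,
                                         std::size_t bytes_per_sim)
{
    return (n_vectors == 0 || bytes_per_sim == 0 || free_bytes <= reserve_bytes)
         ? 0
         : round_down((free_bytes - reserve_bytes) / n_vectors, kGranularity)
               / bytes_per_sim;
}

// Number of batches needed to cover `total_sims` (ceiling division).
constexpr std::size_t batch_count(std::size_t total_sims,
                                  std::size_t sims_per_batch)
{
    return sims_per_batch == 0
         ? 0
         : (total_sims + sims_per_batch - 1) / sims_per_batch;
}
```

As a worked case, with 10 MiB free, a 1 MiB reserve, two float vectors and 4 bytes per simulation in each, every vector gets a 4 MiB share, so `max_sims_per_batch` returns 1048576 and 10,000,000 simulations need `batch_count` = 10 batches. The estimate ignores any temporary buffers that individual algorithms may allocate, so the reserve should be generous, and catching bad_alloc and retrying with a smaller batch remains a sensible fallback.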

【Comments】: