【Posted】: 2020-11-10 23:58:34
【Question】:
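I found the "vectorized/batch sort" and "nested sort" methods at the link below: How to use Thrust to sort the rows of a matrix?
When I tried these methods on 500 rows of 1000 elements each, the results were:
- Vectorized/batch sort: 66 ms
- Nested sort: 3290 ms
I am running this on a 1080 Ti HOF, but compared with your results it takes far too long.
In the link below, however, it is said to take less than 10 ms, possibly close to 100 microseconds.
(How to find median value in 2d array for each column with CUDA?)
Could you recommend how to optimize this approach to reduce the run time?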
#include <thrust/device_vector.h>
#include <thrust/device_ptr.h>
#include <thrust/host_vector.h>
#include <thrust/sort.h>
#include <thrust/execution_policy.h>
#include <thrust/generate.h>
#include <thrust/equal.h>
#include <thrust/sequence.h>
#include <thrust/for_each.h>
#include <iostream>
#include <stdlib.h>
#define NSORTS 500
#define DSIZE 1000
// Generator for segment ids: element k gets row id k / DSIZE.
int my_mod_start = 0;
int my_mod() {
    return (my_mod_start++) / DSIZE;
}
// Returns true if both device vectors hold the same values.
bool validate(thrust::device_vector<int> &d1, thrust::device_vector<int> &d2) {
    return thrust::equal(d1.begin(), d1.end(), d2.begin());
}
// Sorts one row of length dsize; used by the nested sort.
struct sort_functor
{
    thrust::device_ptr<int> data;
    int dsize;
    __host__ __device__
    void operator()(int start_idx)
    {
        thrust::sort(thrust::device, data + (dsize*start_idx), data + (dsize*(start_idx + 1)));
    }
};
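#include <time.h>
#include <windows.h>
// Returns the time elapsed since `start` in milliseconds (despite "usec" in the name).
// Both timestamps are wSecond*1000 + wMilliseconds, which wraps every 60 s, so a
// negative difference is corrected by one minute. Only valid for intervals under 60 s.
unsigned long long dtime_usec(LONG start) {
    SYSTEMTIME timer2;
    GetSystemTime(&timer2);
    LONG end = (timer2.wSecond * 1000) + timer2.wMilliseconds;
    LONG elapsed = end - start;
    if (elapsed < 0) elapsed += 60000;
    return elapsed;
}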
int main() {
    // Set the device heap size once, before any kernel that may use device-side malloc.
    cudaDeviceSetLimit(cudaLimitMallocHeapSize, (16 * DSIZE*NSORTS));
    for (int i = 0; i < 3; i++) {
        SYSTEMTIME timer1;
        thrust::host_vector<int> h_data(DSIZE*NSORTS);
        thrust::generate(h_data.begin(), h_data.end(), rand);
        thrust::device_vector<int> d_data = h_data;
        // method 1: one thrust::sort call per row, issued from a host loop
        thrust::device_vector<int> d_result1 = d_data;
        thrust::device_ptr<int> r1ptr = thrust::device_pointer_cast<int>(d_result1.data());
        GetSystemTime(&timer1);
        LONG time_ms1 = (timer1.wSecond * 1000) + timer1.wMilliseconds;
        for (int j = 0; j < NSORTS; j++)
            thrust::sort(r1ptr + (j*DSIZE), r1ptr + ((j + 1)*DSIZE));
        cudaDeviceSynchronize();
        time_ms1 = dtime_usec(time_ms1);
        std::cout << "loop time: " << time_ms1 << "ms" << std::endl;
        // method 2: vectorized sort. Sort all values while carrying their row ids
        // along, then stable-sort by row id so each row is regrouped in sorted order.
        thrust::device_vector<int> d_result2 = d_data;
        thrust::host_vector<int> h_segments(DSIZE*NSORTS);
        thrust::generate(h_segments.begin(), h_segments.end(), my_mod);
        thrust::device_vector<int> d_segments = h_segments;
        GetSystemTime(&timer1);
        time_ms1 = (timer1.wSecond * 1000) + timer1.wMilliseconds;
        thrust::stable_sort_by_key(d_result2.begin(), d_result2.end(), d_segments.begin());
        thrust::stable_sort_by_key(d_segments.begin(), d_segments.end(), d_result2.begin());
        cudaDeviceSynchronize();
        time_ms1 = dtime_usec(time_ms1);
        std::cout << "vectorized time: " << time_ms1 << "ms" << std::endl;
        if (!validate(d_result1, d_result2)) std::cout << "mismatch 1!" << std::endl;
        // method 3: nested sort. thrust::for_each over row indices, each calling thrust::sort
        thrust::device_vector<int> d_result3 = d_data;
        sort_functor f = { d_result3.data(), DSIZE };
        thrust::device_vector<int> idxs(NSORTS);
        thrust::sequence(idxs.begin(), idxs.end());
        GetSystemTime(&timer1);
        time_ms1 = (timer1.wSecond * 1000) + timer1.wMilliseconds;
        thrust::for_each(idxs.begin(), idxs.end(), f);
        cudaDeviceSynchronize();
        time_ms1 = dtime_usec(time_ms1);
        std::cout << "nested time: " << time_ms1 << "ms" << std::endl;
        if (!validate(d_result1, d_result3)) std::cout << "mismatch 2!" << std::endl;
    }
    return 0;
}
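The `GetSystemTime`-based helper in the listing reports whole milliseconds of wall-clock time, which is coarse for runs that take only a few milliseconds. As a hedged alternative, here is a minimal sketch of a microsecond timer built on `std::chrono::steady_clock`. The name `elapsed_usec` is hypothetical and not part of the original code.

```cpp
#include <cassert>
#include <chrono>

// Hypothetical helper (not from the original listing): runs `work` once and
// returns the elapsed time in microseconds, measured with a monotonic clock.
template <typename Work>
long long elapsed_usec(Work work) {
    const auto start = std::chrono::steady_clock::now();
    work();
    const auto stop = std::chrono::steady_clock::now();
    const long long usec =
        std::chrono::duration_cast<std::chrono::microseconds>(stop - start).count();
    assert(usec >= 0);  // steady_clock never goes backwards
    return usec;
}
```

For GPU work the callable would need to end with `cudaDeviceSynchronize()` so that the kernels are included in the measurement, for example `elapsed_usec([&] { thrust::sort(d.begin(), d.end()); cudaDeviceSynchronize(); })`, where `d` stands for any device vector.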
【Comments】:
-
You are evidently on Windows. Are you building a debug project?
-
@RobertCrovella Yes. I built this project in Visual Studio 2017. Does it also need to be built in a Linux environment?
-
You need to switch from building a debug project in Visual Studio to building a release project in Visual Studio. Then rerun the code and check the timing results.
-
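@RobertCrovella Thanks. I changed the project to release, and it now shows 2 ms and 19 ms. Is it impossible to get it down to what you measured (~100 us)? Is the difference caused by the hardware?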
-
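A Tesla V100 is considerably faster than a 1080 Ti, and my statement was about CUB, not Thrust. I have reason to believe that a CUB segmented sort will be faster than a Thrust sort. Nevertheless, your Thrust time of about 2 ms is now well within the 10 ms estimate. In the future, you should never do performance analysis on a debug build in Visual Studio. I will provide an answer showing how to do a CUB segmented sort on 500 arrays of 1024 elements each, so you can compare it with the Thrust sort here.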
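The CUB segmented sort mentioned in the last comment could look roughly like the following. This is not the answer promised there, only a minimal sketch that assumes `cub::DeviceSegmentedRadixSort::SortKeys` from the CUB library shipped with the CUDA Toolkit. The sizes `kRows` and `kCols`, the variable names, and the use of CUDA events for timing are illustrative choices, and error checking is omitted for brevity.

```cuda
#include <cub/cub.cuh>
#include <cuda_runtime.h>
#include <algorithm>
#include <cstdio>
#include <cstdlib>
#include <vector>

// Illustrative sizes matching the comment above: 500 arrays of 1024 ints.
constexpr int kRows = 500;
constexpr int kCols = 1024;

int main() {
    const int num_items = kRows * kCols;

    std::vector<int> h_in(num_items);
    for (int &v : h_in) v = std::rand();

    // Segment r covers the half-open range [h_offsets[r], h_offsets[r + 1]).
    std::vector<int> h_offsets(kRows + 1);
    for (int r = 0; r <= kRows; ++r) h_offsets[r] = r * kCols;

    int *d_in = nullptr, *d_out = nullptr, *d_offsets = nullptr;
    cudaMalloc(&d_in, num_items * sizeof(int));
    cudaMalloc(&d_out, num_items * sizeof(int));
    cudaMalloc(&d_offsets, (kRows + 1) * sizeof(int));
    cudaMemcpy(d_in, h_in.data(), num_items * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemcpy(d_offsets, h_offsets.data(), (kRows + 1) * sizeof(int), cudaMemcpyHostToDevice);

    // A call with a null workspace pointer only reports the required size.
    void *d_temp = nullptr;
    size_t temp_bytes = 0;
    cub::DeviceSegmentedRadixSort::SortKeys(d_temp, temp_bytes, d_in, d_out,
                                            num_items, kRows, d_offsets, d_offsets + 1);
    cudaMalloc(&d_temp, temp_bytes);

    // Warm-up run so one-time startup costs are not part of the timed run.
    cub::DeviceSegmentedRadixSort::SortKeys(d_temp, temp_bytes, d_in, d_out,
                                            num_items, kRows, d_offsets, d_offsets + 1);
    cudaDeviceSynchronize();

    // Timed run, measured on the GPU with CUDA events.
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start);
    cub::DeviceSegmentedRadixSort::SortKeys(d_temp, temp_bytes, d_in, d_out,
                                            num_items, kRows, d_offsets, d_offsets + 1);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);

    // Check against a host-side sort of each row.
    std::vector<int> h_out(num_items);
    cudaMemcpy(h_out.data(), d_out, num_items * sizeof(int), cudaMemcpyDeviceToHost);
    for (int r = 0; r < kRows; ++r)
        std::sort(h_in.begin() + r * kCols, h_in.begin() + (r + 1) * kCols);
    std::printf("segmented sort: %.3f ms, %s\n", ms, (h_in == h_out) ? "match" : "mismatch");

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d_temp);
    cudaFree(d_offsets);
    cudaFree(d_out);
    cudaFree(d_in);
    return 0;
}
```

Assuming the file is saved under a name such as `segsort.cu`, it could be built in an optimized configuration with something like `nvcc -O3 -o segsort segsort.cu`. The measured time will depend on the GPU and toolkit version, so no figure is quoted here.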