尽管代码中没有双精度，但 Cuda 将双精度降级为浮点错误答案

【问题标题】：Cuda demoting double to float error despite no doubles in code尽管代码中没有双精度，但 Cuda 将双精度降级为浮点错误
【发布时间】：2012-01-08 11:56:09
【问题描述】：

我正在使用 PyCUDA 编写内核。我的 GPU 设备只支持计算能力 1.1 (arch sm_11)，所以我只能在我的代码中使用浮点数。我付出了巨大的努力来确保我用浮点数做所有事情，但尽管如此，我的代码中有一个特定的行一直导致编译器错误。

这段代码是：

  // Gradient magnitude, so 1 <= x <= width, 1 <= y <= height. 
  if( j > 0 && j < im_width && i > 0 && i < im_height){
    gradient_mag[idx(i,j)] = float(sqrt(x_gradient[idx(i,j)]*x_gradient[idx(i,j)] + y_gradient[idx(i,j)]*y_gradient[idx(i,j)]));
  }

这里，idx() 是一个__device__ 辅助函数，它根据像素索引i 和j 返回一个线性索引，它只适用于整数。我自始至终都在使用它，它不会在其他任何地方出错，所以我强烈怀疑它不是idx()。 sqrt() 调用只是来自支持浮点数的标准 C 数学函数。所有涉及的数组x_gradient、y_gradient 和gradient_mag 都是float*，它们是我的函数输入的一部分（即在Python 中声明，然后转换为设备变量等）。

我已经尝试删除额外的演员表以浮动在上面的代码中，但没有成功。我也尝试过像这样完全愚蠢的事情：

 // Gradient magnitude, so 1 <= x <= width, 1 <= y <= height. 
 if( j > 0 && j < im_width && i > 0 && i < im_height){
    gradient_mag[idx(i,j)] = 3.0f; // also tried float(3.0) here
  }

所有这些变体都会产生相同的错误：

 pycuda.driver.CompileError: nvcc said it demoted types in source code it compiled--this is likely not what you want.
 [command: nvcc --cubin -arch sm_11 -I/usr/local/lib/python2.7/dist-packages/pycuda-2011.1.2-py2.7-linux-x86_64.egg/pycuda/../include/pycuda kernel.cu]
 [stderr:
 ptxas /tmp/tmpxft_00004329_00000000-2_kernel.ptx, line 128; warning : Double is not supported. Demoting to float
 ]

有什么想法吗？我已经调试了我的代码中的许多错误，并希望今晚让它工作，但事实证明这是一个我无法理解的错误。

已添加 -- 这是内核的截断版本，它在我的机器上产生与上述相同的错误。

 every_pixel_hog_kernel_source = \
 """
 #include <math.h>
 #include <stdio.h>

 __device__ int idx(int ii, int jj){
     return gridDim.x*blockDim.x*ii+jj;
 }

 __device__ int bin_number(float angle_val, int total_angles, int num_bins){ 

     float angle1;   
     float min_dist;
     float this_dist;
     int bin_indx;

     angle1 = 0.0;
     min_dist = abs(angle_val - angle1);
     bin_indx = 0;

     for(int kk=1; kk < num_bins; kk++){
         angle1 = angle1 + float(total_angles)/float(num_bins);
         this_dist = abs(angle_val - angle1);
         if(this_dist < min_dist){
             min_dist = this_dist;
             bin_indx = kk;
         }
     }

     return bin_indx;
 }

 __device__ int hist_number(int ii, int jj){

     int hist_num = 0;

     if(jj >= 0 && jj < 11){ 
         if(ii >= 0 && ii < 11){ 
             hist_num = 0;
         }
         else if(ii >= 11 && ii < 22){
             hist_num = 3;
         }
         else if(ii >= 22 && ii < 33){
             hist_num = 6;
         }
     }
     else if(jj >= 11 && jj < 22){
         if(ii >= 0 && ii < 11){ 
             hist_num = 1;
         }
         else if(ii >= 11 && ii < 22){
             hist_num = 4;
         }
         else if(ii >= 22 && ii < 33){
             hist_num = 7;
         }
     }
     else if(jj >= 22 && jj < 33){
         if(ii >= 0 && ii < 11){ 
             hist_num = 2;
         }
         else if(ii >= 11 && ii < 22){
             hist_num = 5;
         }
         else if(ii >= 22 && ii < 33){
             hist_num = 8;
         }
     }

     return hist_num;
 }

  __global__ void every_pixel_hog_kernel(float* input_image, int im_width, int im_height, float* gaussian_array, float* x_gradient, float* y_gradient, float* gradient_mag, float* angles, float* output_array)
  {    
      /////
      // Setup the thread indices and linear offset.
      /////
      int i = blockDim.y * blockIdx.y + threadIdx.y;
      int j = blockDim.x * blockIdx.x + threadIdx.x;
      int ang_limit = 180;
      int ang_bins = 9;
      float pi_val = 3.141592653589f; //91

      /////
      // Compute a Gaussian smoothing of the current pixel and save it into a new image array
      // Use sync threads to make sure everyone does the Gaussian smoothing before moving on.
      /////
      if( j > 1 && i > 1 && j < im_width-2 && i < im_height-2 ){

            // Hard-coded unit standard deviation 5-by-5 Gaussian smoothing filter.
            gaussian_array[idx(i,j)] = float(1.0/273.0) *(
            input_image[idx(i-2,j-2)] + float(4.0)*input_image[idx(i-2,j-1)] + float(7.0)*input_image[idx(i-2,j)] + float(4.0)*input_image[idx(i-2,j+1)] + input_image[idx(i-2,j+2)] + 
            float(4.0)*input_image[idx(i-1,j-2)] + float(16.0)*input_image[idx(i-1,j-1)] + float(26.0)*input_image[idx(i-1,j)] + float(16.0)*input_image[idx(i-1,j+1)] + float(4.0)*input_image[idx(i-1,j+2)] +
            float(7.0)*input_image[idx(i,j-2)] + float(26.0)*input_image[idx(i,j-1)] + float(41.0)*input_image[idx(i,j)] + float(26.0)*input_image[idx(i,j+1)] + float(7.0)*input_image[idx(i,j+2)] +
            float(4.0)*input_image[idx(i+1,j-2)] + float(16.0)*input_image[idx(i+1,j-1)] + float(26.0)*input_image[idx(i+1,j)] + float(16.0)*input_image[idx(i+1,j+1)] + float(4.0)*input_image[idx(i+1,j+2)] +
            input_image[idx(i+2,j-2)] + float(4.0)*input_image[idx(i+2,j-1)] + float(7.0)*input_image[idx(i+2,j)] + float(4.0)*input_image[idx(i+2,j+1)] + input_image[idx(i+2,j+2)]);
     }
     __syncthreads();

     /////
     // Compute the simple x and y gradients of the image and store these into new images
     // again using syncthreads before moving on.
     /////

     // X-gradient, ensure x is between 1 and width-1
     if( j > 0 && j < im_width){
         x_gradient[idx(i,j)] = float(input_image[idx(i,j)] - input_image[idx(i,j-1)]);
     }
     else if(j == 0){
         x_gradient[idx(i,j)] = float(0.0);
     }

    // Y-gradient, ensure y is between 1 and height-1
    if( i > 0 && i < im_height){
         y_gradient[idx(i,j)] = float(input_image[idx(i,j)] - input_image[idx(i-1,j)]);
    }
    else if(i == 0){
        y_gradient[idx(i,j)] = float(0.0);
    }
    __syncthreads();

    // Gradient magnitude, so 1 <= x <= width, 1 <= y <= height. 
    if( j < im_width && i < im_height){

        gradient_mag[idx(i,j)] = float(sqrt(x_gradient[idx(i,j)]*x_gradient[idx(i,j)] + y_gradient[idx(i,j)]*y_gradient[idx(i,j)]));
    }
    __syncthreads();

    /////
    // Compute the orientation angles
    /////
    if( j < im_width && i < im_height){
        if(ang_limit == 360){
            angles[idx(i,j)] = float((atan2(y_gradient[idx(i,j)],x_gradient[idx(i,j)])+pi_val)*float(180.0)/pi_val);
        }
        else{
            angles[idx(i,j)] = float((atan( y_gradient[idx(i,j)]/x_gradient[idx(i,j)] )+(pi_val/float(2.0)))*float(180.0)/pi_val);
        }
    }
    __syncthreads();

    // Compute the HoG using the above arrays. Do so in a 3x3 grid, with 9 angle bins for each grid.
    // forming an 81-vector and then write this 81 vector as a row in the large output array.

    int top_bound, bot_bound, left_bound, right_bound, offset;
    int window = 32;

    if(i-window/2 > 0){
        top_bound = i-window/2;
        bot_bound = top_bound + window;
    }
    else{
        top_bound = 0;
        bot_bound = top_bound + window;
    }

    if(j-window/2 > 0){
        left_bound = j-window/2;
        right_bound = left_bound + window;
    }
    else{
        left_bound = 0;
        right_bound = left_bound + window;
    }

    if(bot_bound - im_height > 0){
        offset = bot_bound - im_height;
        top_bound = top_bound - offset;
        bot_bound = bot_bound - offset;
    }

    if(right_bound - im_width > 0){
        offset = right_bound - im_width;
        right_bound = right_bound - offset;
        left_bound = left_bound - offset;
    }

    int counter_i = 0;
    int counter_j = 0;
    int bin_indx, hist_indx, glob_col_indx, glob_row_indx;
    int row_width = 81; 

    for(int pix_i = top_bound; pix_i < bot_bound; pix_i++){
        for(int pix_j = left_bound; pix_j < right_bound; pix_j++){

            bin_indx = bin_number(angles[idx(pix_i,pix_j)], ang_limit, ang_bins);
            hist_indx = hist_number(counter_i,counter_j);

            glob_col_indx = ang_bins*hist_indx + bin_indx;
            glob_row_indx = idx(i,j);

            output_array[glob_row_indx*row_width + glob_col_indx] = float(output_array[glob_row_indx*row_width + glob_col_indx] + float(gradient_mag[idx(pix_i,pix_j)]));


            counter_j = counter_j + 1; 
        }
        counter_i = counter_i + 1;
        counter_j = 0;
    }

}
"""

【问题讨论】：

试试sqrtf() 或者std::sqrt()。 Python 与此有什么关系？
感谢您的建议，但我刚刚尝试过，sqrtf() 没有帮助。我怀疑这是 PyCUDA 特有的，但认为包含该细节是相关的，以防它碰巧与 PyCUDA 中使用设备变量的方式有关。
你确定你在这里张贴的线路正确吗？
这里我支持@KerrekSB，它报告的行号是针对 .ptx 文件的，所以你可能找错地方了。
@EMS：不是那条线。在该行的 C 代码中引入语法错误并不能证明任何事情 - 您所询问的错误是由汇编器生成的，而不是编译器。如果您需要帮助，请发布完整的内核代码，因为目前您找错地方了。

标签： cuda gpu pycuda

【解决方案1】：

（评论，不是答案，但是太大了不能当评论）

您能否在发生错误的行周围提供 PTX 代码？

我尝试使用您提供的代码编译一个简单的内核：

__constant__ int im_width;
__constant__ int im_height;

__device__ int idx(int i,int j) {
    return i+j*im_width;
}

__global__ void kernel(float* gradient_mag, float* x_gradient, float* y_gradient) {
    int i = threadIdx.x;
    int j = threadIdx.y;
  // Gradient magnitude, so 1 <= x <= width, 1 <= y <= height.
  if( j > 0 && j < im_width && i > 0 && i < im_height){
    gradient_mag[idx(i,j)] = float(sqrt(x_gradient[idx(i,j)]*x_gradient[idx(i,j)] + y_gradient[idx(i,j)]*y_gradient[idx(i,j)]));
  }
}

使用：

nvcc.exe -m32 -maxrregcount=32 -gencode=arch=compute_11,code=\"sm_11,compute_11\" --compile -o "Debug\main.cu.obj" main.cu

没有错误。

使用 CUDA 4.1 beta 编译器

更新

我尝试编译您的新代码（我在 CUDA/C++ 中工作，而不是 PyCUDA，但这不重要）。也没有发现错误！使用 CUDA 4.1 和 CUDA 4.0。您的 CUDA 安装版本是什么？

C:\>nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2011 NVIDIA Corporation
Built on Wed_Oct_19_23:13:02_PDT_2011
Cuda compilation tools, release 4.1, V0.2.1221

【讨论】：

我不确定 ptx 代码是什么意思。我正在使用 PyCUDA，还没有听说过这个概念。 PyCUDA 是否像 CUDA 一样生成这些中间内容？
从您的错误中可以看出 nvcc 正在幕后运行； nvcc --cubin -arch sm_11 -I/usr/local/lib/python2.7/dist-packages/pycuda-2011.1.2-py2.7-linux-x86_64.egg/pycuda/../include/pycuda kernel.cu 跳到link 的实际第 19 页（编号为 17）
从那个链接你可以看到一个幕后编译的例子；在实际的第 21 页（编号为 19）上，您可以看到 pxtas 的来源，获取一个 ptx 文件并生成一个 cubin。有关示例编译的图表，请参见实际第 22 页（编号为 20）。无论如何，您应该看到生成错误的 pxtas 在链中很靠后。在编译的早期阶段生成错误不会帮助您获得正确的行号。
它看起来是 3.2 版。它在学术实验室的计算机上，因此在我正在研究的短期内不太可能将其更改为更新的版本。
安装并试用了 v3.2。仍然无法重现错误（在 C++ 环境中）。您确定它是-this-而不是任何其他代码导致它吗？如果您可以控制 pyCUDA 中的编译标志，也许您可以尝试“--keep”标志，然后检查生成的 PTX 代码？ (--keep 防止删除中间文件)

【解决方案2】：

你的问题在这里：

angle1 = 0.0;

0.0 是一个双精度常数。 0.0f 是单精度常数。

【讨论】：

这是不正确的。请注意，我将 angle1 声明为浮点数，因此将其分配为 0.0 会自动将其转换为浮点数。即使我按照您的建议添加 f，我也会得到相同的降级双精度浮点错误，具有相同的行号 128 详细信息。
他的观点是正确的；你不应该有 0.0，它应该是 0.0f。到处都是如此。你不应该使用双重文字，然后将它们扔掉；你应该使用浮点文字。
啊，可能是你的问题：float(1.0/273.0)。您正在做双除法，然后转换为浮点数。

【解决方案3】：

这是一个使用双打的明确案例：

 gaussian_array[idx(i,j)] = float(1.0/273.0) *

看到被分割的双重文字了吗？

但实际上，使用浮点字面量而不是双重字面量转换为浮点数 - 转换很难看，我建议它们会隐藏这样的错误。

-------编辑 1/Dec----------

首先，感谢@CygnusX1，不断折叠会阻止这种计算——我什至没有想到。

我试图重现错误的环境：我安装了 CUDA SDK 3.2（@EMS 提到他们似乎在实验室中使用），编译上面截断的内核版本，并且确实 nvopencc 确实优化了上面计算（感谢@CygnusX1），实际上它没有在生成的 PTX 代码中的任何地方使用双精度数。此外，ptxas 没有给出@EMS 收到的错误。从那以后，我认为问题出在every_pixel_hog_kernel_source 代码本身之外，可能在 PyCUDA 中。但是，使用 PyCUDA 2011.1.2 并使用该 still 进行编译不会像@EMS 的问题那样产生警告。我可以得到问题中的错误，但它是通过引入双重计算，例如从gaussian_array[idx(i,j)] = float(1.0/273.0) *移除演员表

要获得相同的 python 案例，以下是否会产生错误：

import pycuda.driver as cuda
from pycuda.compiler import compile

x=compile("""put your truncated kernel code here""",options=[],arch="sm_11",keep=True)

在我的情况下它不会产生错误，所以有可能我根本无法复制您的结果。不过，我可以给一些建议。当使用compile（或SourceModule）时，如果你使用keep=True，python将在显示错误消息之前打印出正在生成ptx文件的文件夹。然后，如果您可以检查该文件夹中生成的 ptx 文件并查看 .f64 出现的位置，它应该可以了解什么被视为双重 - 但是，破译原始内核中的代码是困难的 - 拥有产生错误的最简单示例将对您有所帮助。

【讨论】：

将所有双精度字面值更改为浮点字面值后，我仍然得到完全相同的错误。
那些是常数。编译器能够优化它！
非常感谢keep=True 的评论。我还没有解决这个问题，但这应该会有所帮助。
如果双精度是一个常数，它希望以这种方式出现：mov.f64 %fd2, 0d3f6e01e01e01e01e; // 0.003663（存储双精度值 0.003663 以供计算）。像 CygnusX1 指出的那样，找到原始等效项时的部分问题是常量折叠。（这个例子是在我编译 1.0/273.0 时生成的，没有进行浮点转换 - 生成的 ptx 使用了该计算的双精度结果 0.003663，而不是实际进行计算）。 .f64 操作是使用双精度数的操作。