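【Question title】: Cudafy kernel does not compile
【Posted】: 2015-03-09 15:19:47
【Question description】:

I'm taking my first steps with Cudafy and trying to write a function that reads its thread's position and, based on that, stores some int values into array elements. My code: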

[Cudafy]
public static void GenerateRipples(GThread thread, int[] results)
{
  int threadPosInBlockX = thread.threadIdx.x;
  int threadPosInBlockY = thread.threadIdx.y;

  int blockPosInGridX = thread.blockIdx.x;
  int blockPosInGridY = thread.blockIdx.y;

  int gridSizeX = thread.gridDim.x;
  int gridSizeY = thread.gridDim.y;

  int blockSizeX = thread.blockDim.x;
  int blockSizeY = thread.blockDim.y;

  //int threadX = blockSizeX*blockPosInGridX + threadPosInBlockX;

  //if i use only one variable, everything is fine:
  int threadY = blockSizeY;

  //if i add or multiply anything, it cannot compile:
  //int threadY = blockSizeY*blockPosInGridY + threadPosInBlockY;


//  results[gridSizeX*blockSizeX*threadY + threadX] = 255;
}

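So I can't compute threadY here. As soon as I use more than one variable in the calculation, the Cudafy translator throws an error (CudafyModule cm = CudafyTranslator.Cudafy(); throws Cudafy.CudafyLanguageException).

What am I doing wrong?

Update: here is the code that runs the kernel on the GPU: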

public void RunTest2()
{
    GPGPU gpu = CudafyHost.GetDevice(CudafyModes.Target, CudafyModes.DeviceId);
    CudafyModule km = CudafyTranslator.Cudafy();
    gpu.LoadModule(km);

    int size = 20 * 20;
    int[] allPixels = new int[size];

    int[] dev_result = gpu.Allocate<int>(size);

    dim3 blocksInGrid = new dim3(5, 5);
    dim3 threadsPerBlock = new dim3(4, 4);

    gpu.Launch(blocksInGrid, threadsPerBlock).GenerateRipples(dev_result);
    gpu.CopyFromDevice(dev_result, allPixels);

    gpu.FreeAll();
}

【Discussion】:

    Tags: c# cuda cudafy.net


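    【Solution 1】:

    We need to see how you launch the kernel; the code above should run fine. I created a test class that works, and it also shows how to set up the grid/block/thread dimensions for a kernel. If you want to see good examples, download the Cudafy source, build the CudafyExamples project, and look at how they set up and use CUDAfy's features.

    ** Note: I must have been smoking something good before I posted the first version of the class; I neglected to verify that it didn't cause memory access violations!!

    The fixed class is below, with no violations.

    You can also find good examples on CodeProject and StackOverflow.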

    using System;
    using System.Collections.Generic;
    using System.Diagnostics;
    using System.Linq;
    using System.Text;
    
    using Cudafy;
    using Cudafy.Host;
    using Cudafy.Translator;
    
    namespace FxKernelTest 
    { 
        public class FxKernTest  
        {
            public GPGPU fxgpu;
    
            public const int N = 1024 * 64;
    
            public void ExeTestKernel()
            {
                GPGPU gpu = CudafyHost.GetDevice(CudafyModes.Target, 0);
                eArchitecture arch = gpu.GetArchitecture();
                CudafyModule km = CudafyTranslator.Cudafy(arch);
    
                gpu.LoadModule(km);
    
                int[] host_results = new int[N];
    
                // Either assign a new block of memory to hold results on device
                var dev_results = gpu.Allocate<int>(N);
                gpu.Set<int>(dev_results);
    
                // Or fill your array with values first and then
                for (int i = 0; i < N; i++) host_results[i] = i * 3;
    
                // Copy array with ints to device
                //var dev_filled_results = gpu.CopyToDevice(host_results);
    
                // 64*16 = 1024 threads per block (which is max for sm_30)
                dim3 threadsPerBlock = new dim3(64, 16);    
    
                // 8*8 = 64 blocks per grid, 1024 threads per block = 65536 threads in total (one per element of N)
                dim3 blocksPerGrid = new dim3(8, 8); 
    
                //var threadsPerBlock = 1024; // this will only give you blockDim.x = 1024, .y = 1, .z = 1
                //var blocksPerGrid = 1;      // just for show
    
                gpu.Launch(blocksPerGrid, threadsPerBlock, "GenerateRipples", dev_results);
    
                gpu.CopyFromDevice(dev_results, host_results); 
    
                // Test our results
                for (int index = 0; index < N; index++)
                    if (host_results[index] != index)
                        throw new Exception("Check your indexing math, genius!!!");
            }
    
            [Cudafy]
            public static void GenerateRipples(GThread thread, int[] results)
            {
                var blockSize = thread.blockDim.x * thread.blockDim.y;
    
                var offsetToGridY = blockSize * thread.gridDim.x;
    
                // This took me a few tries, I've never used 4 dimensions to index into a 1D array before :)
    
                var tid = thread.blockIdx.y * offsetToGridY +       // each Grid Y is 8192 in size
                          thread.blockIdx.x * blockSize +           // each Grid X is 1024 in size
                          thread.threadIdx.y * thread.blockDim.x +  // each Block Y is 64 in size
                          thread.threadIdx.x;                       // index into block
    
    
                var threadPosInBlockX = thread.threadIdx.x;
    
                var threadPosInBlockY = thread.threadIdx.y;
    
                var blockPosInGridX = thread.blockIdx.x;
    
                var blockPosInGridY = thread.blockIdx.y;
    
                var gridSizeX = thread.gridDim.x;
    
                var gridSizeY = thread.gridDim.y;
    
                var blockSizeX = thread.blockDim.x;
    
                var blockSizeY = thread.blockDim.y;
    
                // this is your code, see how I calculate the actual thread ID above!
                var threadX = blockSizeX * blockPosInGridX + threadPosInBlockX;
    
                //if i use only one variable, everything is fine:
                var threadY = blockSizeY;
    
                // this calculates just fine
                threadY = blockSizeY * blockPosInGridY + threadPosInBlockY;
    
                // hint: use NSight for Visual Studio and look at the NSight output, 
                // it reports access violations and tells you where...
    
                // if our threadId is within bounds of array size
                // we cause access violation if not
                // (class constants are automatically passed to kernels)
                if (tid < N)
                    results[tid] = tid;
    
            }
    
        }
    }
    

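    ptxas info : 0 bytes gmem
    ptxas info : Compiling entry function 'GenerateRipples' for 'sm_30'
    ptxas info : Function properties for GenerateRipples
        0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
    ptxas info : Used 5 registers, 328 bytes cmem[0]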
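The index arithmetic in the kernel can be checked on the CPU, without a GPU or Cudafy at all. The sketch below is plain C# and simulates every (block, thread) pair for a given launch configuration. `IndexCheck`, `BlockMajor`, `RowMajor` and `CoversEveryIndexOnce` are hypothetical names introduced only for this illustration. `BlockMajor` repeats the `tid` arithmetic from the answer's kernel, and `RowMajor` repeats the question's commented-out `gridSizeX*blockSizeX*threadY + threadX` expression.

```csharp
using System;

public static class IndexCheck
{
    // Block-major flattening: same arithmetic as the answer's tid.
    public static int BlockMajor(int bx, int by, int tx, int ty,
                                 int gridX, int blockX, int blockY)
    {
        int blockSize = blockX * blockY;        // threads per block
        int offsetToGridY = blockSize * gridX;  // threads per row of blocks
        return by * offsetToGridY + bx * blockSize + ty * blockX + tx;
    }

    // Row-major flattening: the question's commented-out expression.
    public static int RowMajor(int bx, int by, int tx, int ty,
                               int gridX, int blockX, int blockY)
    {
        int threadX = blockX * bx + tx;
        int threadY = blockY * by + ty;
        return gridX * blockX * threadY + threadX;
    }

    // Returns true if every simulated thread lands on a distinct, in-range index.
    public static bool CoversEveryIndexOnce(
        Func<int, int, int, int, int, int, int, int> index,
        int gridX, int gridY, int blockX, int blockY)
    {
        int total = gridX * gridY * blockX * blockY;
        var hits = new int[total];

        for (int by = 0; by < gridY; by++)
            for (int bx = 0; bx < gridX; bx++)
                for (int ty = 0; ty < blockY; ty++)
                    for (int tx = 0; tx < blockX; tx++)
                    {
                        int i = index(bx, by, tx, ty, gridX, blockX, blockY);
                        if (i < 0 || i >= total) return false; // would be an access violation
                        hits[i]++;
                    }

        foreach (int h in hits)
            if (h != 1) return false; // an index was skipped or written twice
        return true;
    }

    public static void Main()
    {
        // Answer's launch: 8x8 blocks of 64x16 threads.
        Console.WriteLine(CoversEveryIndexOnce(BlockMajor, 8, 8, 64, 16));
        // Question's launch: 5x5 blocks of 4x4 threads.
        Console.WriteLine(CoversEveryIndexOnce(RowMajor, 5, 5, 4, 4));
    }
}
```

Both schemes touch each element exactly once for these launch sizes. They differ only in ordering: the block-major form keeps each block's threads contiguous in memory, while the row-major form treats the whole grid as one 2D image, which is probably the more natural layout for the 20 x 20 pixel buffer in the question.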

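    【Comments】:

    • I ran the code above. It just filled every element of the int[] results with the number 255, but the calculation itself worked fine.
    • Well, my indexing wasn't so hot at first; I had never used a four-dimensional grid to index a one-dimensional array before. I added a test to make sure the indexing is correct. I hope it helps, and even if it didn't help you, it certainly helped me ;)
    • I've added the code that runs the kernel (I created a new project and it works there now, but it still doesn't work in the old project, and I still don't know why).
    • I'm glad it works for you :). I have run into this in the past: two identical projects, one of which compiled, ran and could be debugged, while the other simply refused to compile cleanly or run. At that point I started cleaning the whole project, manually deleting all .cdfy/.cu and temp files from the bin/Debug folder, and then clearing the debug symbol cache folder. It hasn't happened to me for a while, but I have definitely had some similarly frustrating experiences! Do me a favor and mark my answer as accepted? :D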
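The manual cleanup described in the last comment could be scripted. The following is only a sketch under assumptions: `StaleArtifacts` and `Delete` are hypothetical names, the default `bin/Debug` path is a guess at a typical project layout, and the only file patterns taken from the comment are `.cdfy` and `.cu`. It does not touch the debug symbol cache, whose location the comment does not give.

```csharp
using System;
using System.IO;

public static class StaleArtifacts
{
    // Deletes *.cdfy and *.cu files directly inside `folder`.
    // Returns the number of files removed; a missing folder counts as zero.
    public static int Delete(string folder)
    {
        if (!Directory.Exists(folder)) return 0;

        string[] patterns = { "*.cdfy", "*.cu" };
        int deleted = 0;

        foreach (string pattern in patterns)
        {
            foreach (string file in Directory.GetFiles(folder, pattern))
            {
                File.Delete(file);
                deleted++;
            }
        }
        return deleted;
    }

    public static void Main(string[] args)
    {
        // Assumed default output folder; pass a different path as the first argument.
        string folder = args.Length > 0 ? args[0] : Path.Combine("bin", "Debug");
        Console.WriteLine("Removed " + Delete(folder) + " stale file(s) from " + folder);
    }
}
```

Running it before a rebuild should force the translated files to be regenerated rather than reused from an earlier build.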