How to break an array into blocks (answers)

【Question title】: How to break an array into blocks
【Posted】: 2017-06-12 16:37:56
【Question】:

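I have an array that represents points in a cuboid. It is a one-dimensional array, and it uses the following indexing function to emulate three dimensions: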

int getCellIndex(int ix, int iy, int iz) {
    return ix + (iy * numCellsX) + (iz * numCellsX * numCellsY);
}

The number of cells in the domain is:

numCells = (numX + 2) * (numY + 2) * (numZ + 2)

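where numX/numY/numZ are the number of cells in the X/Y/Z directions. The +2 in each direction creates padding cells around the outside of the domain. The number of cells in each direction is given by: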

numX = 5 * numY
numZ = numY/2
numY = userInput

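For each cell, I want to compute a new value based on the values of its neighbours (i.e. a stencil), where the neighbours are above, below, left, right, in front and behind. However, I only want to do this calculation for cells that are not bad. I have a boolean array that tracks whether a cell is bad. This is what the computation currently looks like: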

for(int z = 1; z < numZ+1; z++) {
    for(int y = 1; y < numY+1; y++) {
        for(int x = 1; x < numX+1; x++) {
            if(!isBadCell[ getCellIndex(x,y,z) ]) {
                // Do stencil Computation
            }
        }
    }
}

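This does not perform well. I would like to vectorize the loop to improve performance, but I cannot because of the if statement. I know in advance whether each cell is bad, and this does not change during the computation. I would like to split the domain into blocks, preferably 4x4x4 blocks, so that I can compute a priori whether each block contains bad cells. If it does, process it as usual; if it does not, use an optimized function that can take advantage of vectorization, for example: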

for(block : blocks) {
    if(isBadBlock[block]) {
        slowProcessBlock(block) // As above
    } else {
        fastVectorizedProcessBlock(block)
    }
}

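Note: the blocks do not need to exist physically, i.e. this could be achieved by changing the indexing function and looping over the array with different indices. I am open to whatever works best.

The fastVectorizedProcessBlock() function would look similar to slowProcessBlock(), but with the if statement removed (since we know the block contains no bad cells) and with a vectorization pragma.

How can I split my domain into blocks to accomplish this? It seems tricky because a) the number of cells in each direction is not equal, and b) we need to account for the padding cells, since we must never attempt to compute their values: that would lead to out-of-bounds memory accesses.

How can I then process the blocks that contain no bad cells without using an if statement?

EDIT:

This was my initial idea: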

for(int i = 0; i < numBlocks; i++) { // use blocks of 4x4x4 = 64
    if(!isBadBlock[i]) {
        // vectorization pragma here
        for(int z = 0; z < 4; z++) {
            for(int y = 0; y < 4; y++) {
                for(int x = 0; x < 4; x++) {
                    // calculate stencil using getCellIndex(x,y,z)*i
                }
            }
        }
    } else {
        for(int z = 0; z < 4; z++) {
            for(int y = 0; y < 4; y++) {
                for(int x = 0; x < 4; x++) {
                    if(!isBadCell[i*getCellIndex(x,y,z)]) {
                        // calculate stencil using getCellIndex(x,y,z)*i
                    }
                }
            }
        }
    }
}

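The cells would now be stored in blocks, i.e. all cells of the first 4x4x4 block would be stored at positions 0-63, all cells of the second block at positions 64-127, and so on.

However, I do not think this works if the numX/numY/numZ values are unfavourable. For example, what if numY = 2, numZ = 1 and numX = 10? The for loops expect the z direction to be at least 4 cells deep. Is there a good way around this?

Update 2 - here is what the stencil computation looks like: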

if ( isBadCell[ getCellIndex(x,y,z) ] ) {
  double temp = someOtherArray[ getCellIndex(x,y,z) ] +
                    1.0/CONSTANT/CONSTANT*
                    (
                      - 1.0 * cells[ getCellIndex(x-1,y,z) ]
                      - 1.0 * cells[ getCellIndex(x+1,y,z) ]
                      - 1.0 * cells[ getCellIndex(x,y-1,z) ]
                      - 1.0 * cells[ getCellIndex(x,y+1,z) ]
                      - 1.0 * cells[ getCellIndex(x,y,z-1) ]
                      - 1.0 * cells[ getCellIndex(x,y,z+1) ]
                      + 6.0 * cells[ getCellIndex(x,y,z) ]
                      );
  globalTemp += temp * temp;
  cells[ getCellIndex(x,y,z) ] += -omega * temp / 6.0 * CONSTANT * CONSTANT;
}

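【Comments】:

Tip: instead of ix + (iy * numCellsX) + (iz * numCellsX * numCellsY), use the chained computation ix + numCellsX*(iy + iz * numCellsY): one fewer *.
If not all blocks are 4x4x4, how do you determine which ones are not? Do you allocate an array of exactly X*Y*Z, or do you round up so that you end up with an exact number of (X*Y*Z) / (4*4*4) cubes? I think you need to sort that out before the question can be answered.
@Lundin Blocks that are not 4x4x4 could be padded out to 4x4x4, but then when I loop over each block I need to know which cells are padding cells, and that requires an if statement, which stops vectorization. Any ideas how to get around this?
I assume these are set up something like int cells[numCellsZ][numCellsY][numCellsX]... in which case, can't you get the offset from cells[0][0][0] by doing something like (int *) &cells[z][y][x] - (int *) &cells[0][0][0]?
@technosaurus The cells are declared as cells[numCells], where numCells = (numX+2) * (numY+2) * (numZ+2). See the indexing function for the layout. Could you explain your solution? I think it would just give me the global index of a cell, which is what the get-index function already does.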
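
The tip about the chained form can be checked with a small sketch. The names `cellIndexExpanded` and `cellIndexChained`, and passing the dimensions as parameters, are assumptions made for illustration only. Both functions return the same offset; the chained one uses two multiplications instead of three.

```c
#include <assert.h>

/* Expanded form from the question: three multiplications. */
static inline int cellIndexExpanded(int ix, int iy, int iz,
                                    int numCellsX, int numCellsY) {
    assert(numCellsX > 0 && numCellsY > 0);
    return ix + (iy * numCellsX) + (iz * numCellsX * numCellsY);
}

/* Chained (Horner-style) form from the comment: two multiplications. */
static inline int cellIndexChained(int ix, int iy, int iz,
                                   int numCellsX, int numCellsY) {
    assert(numCellsX > 0 && numCellsY > 0);
    return ix + numCellsX * (iy + iz * numCellsY);
}
```

Whether the saved multiplication is measurable depends on the compiler and target; an optimizing compiler may well produce the same code for both.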

Tags: c arrays optimization multidimensional-array tiling

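【Solution 1】:

Where does getCellIndex() get the values of numCellsX and numCellsY? It is better to pass them as arguments than to rely on global variables, and to make the function static inline so the compiler can optimize it.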

static inline int getCellIndex(int ix, int iy, int iz, int numCellsX, int numCellsY) {
    return ix + (iy * numCellsX) + (iz * numCellsX * numCellsY);
}

for (int z = 1; z <= numZ; z++) {
    for (int y = 1; y <= numY; y++) {
        for (int x = 1; x <= numX; x++) {
            if (!isBadCell[getCellIndex(x, y, z, numX + 2, numY + 2)]) {
                // Do stencil Computation
            }
        }
    }
}

You can also remove all the multiplications by using some local variables:

int index = (numY + 2) * (numX + 2);  // skip top padding plane
for (int z = 1; z <= numZ; z++) {
    index += numX + 2;  // skip first padding row
    for (int y = 1; y <= numY; y++) {
        index += 1;   // skip first padding col
        for (int x = 1; x <= numX; x++, index++) {
            if (!isBadCell[index]) {
                // Do stencil Computation
            }
        }
        index += 1;   // skip last padding col
    }
    index += numX + 2;   // skip last padding row
}
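
One way to gain confidence in the running index is to compare it against the closed-form index for every interior cell. The following is a minimal sketch; the function name `countIndexMismatches` is hypothetical, and it assumes the layout from the question (one padding cell on each side, x varying fastest).

```c
#include <assert.h>

/* Returns the number of interior cells whose running index differs
   from the closed-form index; 0 means the two schemes agree. */
static int countIndexMismatches(int numX, int numY, int numZ) {
    assert(numX > 0 && numY > 0 && numZ > 0);
    const int numCellsX = numX + 2, numCellsY = numY + 2;
    int mismatches = 0;
    int index = numCellsY * numCellsX;          /* skip the z = 0 padding plane */
    for (int z = 1; z <= numZ; z++) {
        index += numCellsX;                     /* skip the y = 0 padding row */
        for (int y = 1; y <= numY; y++) {
            index += 1;                         /* skip the x = 0 padding column */
            for (int x = 1; x <= numX; x++, index++) {
                int expected = x + numCellsX * (y + z * numCellsY);
                if (index != expected)
                    mismatches++;
            }
            index += 1;                         /* skip the last padding column */
        }
        index += numCellsX;                     /* skip the last padding row */
    }
    return mismatches;
}
```

Calling it with the awkward sizes from the question (numX = 10, numY = 2, numZ = 1) is a cheap sanity check before the stencil body is plugged in.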

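Whether these directions are promising depends heavily on the actual computation performed to obtain the stencil value. You should post that as well.

If you can change the format of the boolean array of bad cells, it would be useful to pad the rows to a multiple of 8 and use a horizontal padding of 8 columns to improve alignment. Making the boolean array a bit array allows 8, 16, 32 or even 64 cells to be checked at once with a single test.

You can adjust the array pointers to use 0-based coordinates.

Here is how it would work: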

int numCellsX = 8 + ((numX + 7) & ~7) + 8;
int numCellsY = 1 + numY + 1;
int numCellsXY = numCellsX * numCellsY;
// adjusted array_pointer
array_pointer = allocated_pointer + 8 + numCellsX + numCellsXY;
// assuming the isBadCell array is 0 based too.
for (int z = 0, indexZ = 0; z < numZ; z++, indexZ += numCellsXY) {
    for (int y = 0, indexY = indexZ; y < numY; y++, indexY += numCellsX) {
        int x = 0, index = indexY;  // declared here so the tail loop below can reuse them
        for (; x <= numX - 8; x += 8, index += 8) {
            int mask = isBadCell[index >> 3];
            if (mask == 0) {
                // let the compiler unroll the computation for 8 cells
                for (int i = 0; i < 8; i++) {
                   // compute stencil value for x+i,y,z at index+i
                }
            } else {
                for (int i = 0; i < 8; i++, mask >>= 1) {
                    if (!(mask & 1)) {
                       // compute stencil value for x+i,y,z at index+i
                    }
                }
            }
        }
        int mask = isBadCell[index >> 3];
        for (; x < numX; x++, index++, mask >>= 1) {
            if (!(mask & 1)) {
                // compute stencil value for x,y,z at index
            }
        }
    }
}
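
The loop above reads isBadCell as a packed bit array, one bit per cell, with the lowest bit of each byte tested first. A minimal sketch of how such an array could be built from a byte-per-cell flag array follows; `packBadCells` and `groupIsClean` are hypothetical names, and the bit order is an assumption chosen to match the `mask & 1` / `mask >>= 1` tests above.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Packs one flag per byte into one flag per bit.
   Cell i ends up in bit (i & 7) of packed[i >> 3], so the lowest bit
   of each byte belongs to the cell with the smallest index. */
static void packBadCells(const unsigned char *flags, size_t numCells,
                         unsigned char *packed) {
    assert(flags != NULL && packed != NULL);
    memset(packed, 0, (numCells + 7) / 8);
    for (size_t i = 0; i < numCells; i++) {
        if (flags[i])
            packed[i >> 3] |= (unsigned char)(1u << (i & 7));
    }
}

/* Returns nonzero if none of the 8 cells starting at `index` is bad.
   `index` must be a multiple of 8. */
static int groupIsClean(const unsigned char *packed, size_t index) {
    assert((index & 7) == 0);
    return packed[index >> 3] == 0;
}
```

The packed array needs (numCells + 7) / 8 bytes. The single-test check only works when each group of 8 cells starts at a multiple of 8, which is why the answer pads the rows.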

EDIT:

The stencil function makes too many calls to getCellIndex. Here is how to optimize it using the index value computed in the code above:

// index is the offset of cell x,y,z
// numCellsX, numCellsY are the dimensions of the plane
// numCellsXY is the offset between planes: numCellsX * numCellsY

if (isBadCell[index]) {
    double temp = someOtherArray[index] +
                1.0 / CONSTANT / CONSTANT *
                ( - 1.0 * cells[index - 1]
                  - 1.0 * cells[index + 1]
                  - 1.0 * cells[index - numCellsX]
                  - 1.0 * cells[index + numCellsX]
                  - 1.0 * cells[index - numCellsXY]
                  - 1.0 * cells[index + numCellsXY]
                  + 6.0 * cells[index]
                );
    cells[index] += -omega * temp / 6.0 * CONSTANT * CONSTANT;
    globalTemp += temp * temp;
}

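Precomputing &cells[index] as a pointer might improve the code, but the compiler should be able to detect this common subexpression and generate efficient code already.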
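
For illustration, the pointer variant could look like the sketch below. The function name `stencilResidual` and its parameter list are assumptions, `h` stands in for CONSTANT, and only the computation of `temp` is shown (the update of `cells` and `globalTemp` is left out).

```c
#include <assert.h>
#include <stddef.h>

/* Computes the `temp` value of the stencil for one cell through a
   pointer to the centre cell, so the base address is formed only once. */
static double stencilResidual(const double *cells, const double *someOtherArray,
                              ptrdiff_t index, ptrdiff_t numCellsX,
                              ptrdiff_t numCellsXY, double h) {
    assert(h != 0.0);
    const double *c = &cells[index];      /* centre cell */
    return someOtherArray[index] +
           1.0 / h / h *
           (- c[-1]          - c[1]
            - c[-numCellsX]  - c[numCellsX]
            - c[-numCellsXY] - c[numCellsXY]
            + 6.0 * c[0]);
}
```

As the answer says, a compiler will usually do this transformation itself, so it is worth measuring before committing to the less readable form.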

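EDIT2:

Here is a tiled approach. You can add the missing arguments; most sizes are assumed to be global, but you should probably pass a pointer to a context structure holding all these values. It uses isBadTile[] and isGoodTile[]: boolean arrays telling whether a given tile has all cells bad and all cells good, respectively.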

void handle_tile(int x, int y, int z, int nx, int ny, int nz) {
    int index0 = x + y * numCellsX + z * numCellsXY;
    // skipping a tile with all cells bad.
    if (isBadTile[index0] && nx == 4 && ny == 4 && nz == 4)
        return;
    // handling a 4x4x4 tile with all cells OK.
    if (isGoodTile[index0] && nx == 4 && ny == 4 && nz == 4) {
        for (int iz = 0; iz < 4; iz++) {
            for (int iy = 0; iy < 4; iy++) {
                for (int ix = 0; ix < 4; ix++) {
                    int index = index0 + ix + iy * numCellsX + iz * numCellsXY;
                    // Do stencil computation using `index`
                }
            }
        }
    } else {
        for (int iz = 0; iz < nz; iz++) {
            for (int iy = 0; iy < ny; iy++) {
                for (int ix = 0; ix < nx; ix++) {
                    int index = index0 + ix + iy * numCellsX + iz * numCellsXY;
                    if (!isBadCell[index]) {
                        // Do stencil computation using `index`
                    }
                }
            }
        }
    }
}

void handle_cells() {
    int x, y, z;
    for (z = 1; z <= numZ; z += 4) {
        int nz = min(numZ + 1 - z, 4);
        for (y = 1; y <= numY; y += 4) {
            int ny = min(numY + 1 - y, 4);
            for (x = 1; x <= numX; x += 4) {
                int nx = min(numX + 1 - x, 4);
                handle_tile(x, y, z, nx, ny, nz);
            }
        }
    }
}
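
handle_cells() relies on min(), which standard C does not provide for ints. The sketch below supplies one and isolates the clipping of edge tiles; `minInt` and `tileExtent` are hypothetical helper names, not part of the answer.

```c
#include <assert.h>

/* C has no standard min() for ints, so handle_cells() needs a helper. */
static inline int minInt(int a, int b) {
    return a < b ? a : b;
}

/* Number of interior cells covered by the tile that starts at `start`
   (1-based, as in handle_cells) along an axis with `num` interior cells. */
static inline int tileExtent(int start, int num, int tileSize) {
    assert(start >= 1 && start <= num && tileSize > 0);
    return minInt(num + 1 - start, tileSize);
}
```

With the sizes from the question (numX = 10, numY = 2, numZ = 1), the tiles along X get extents 4, 4 and 2, while every tile is only 2 cells high and 1 cell deep, so none of them takes the full 4x4x4 fast path.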

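Here is a function to compute the isGoodTile[] array. The only offsets computed correctly correspond to values of x that are a multiple of 4 plus 1, with y and z less than 3 from their maximum values.

This implementation is sub-optimal, as fewer elements could be computed. Incomplete border tiles (fewer than 4 cells from the edge) could be flagged as not good, so that the all-good case can be detected with a single test. The test for bad tiles could work for these edge tiles too if the isBadTile array were computed correctly for them, which is currently not the case.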

void computeGoodTiles() {
    int start = 1 + numCellsX + numCellsXY;
    int stop = numCellsXY * numCellsZ - 1 - numCellsX - numCellsXY;

    memset(isGoodTile, 0, sizeof(*isGoodTile) * numCellsXY * numCellsZ);
    for (int i = start; i < stop; i += 4) {
        isGoodTile[i] = (isBadCell[i + 0] | isBadCell[i + 1] |
                         isBadCell[i + 2] | isBadCell[i + 3]) ^ 1;
    }
    for (int i = start; i < stop - 3 * numCellsX; i += 4) {
        isGoodTile[i] = isGoodTile[i + 0 * numCellsX] &
                        isGoodTile[i + 1 * numCellsX] &
                        isGoodTile[i + 2 * numCellsX] &
                        isGoodTile[i + 3 * numCellsX];
    }
    for (int i = start; i < stop - 3 * numCellsXY; i += 4) {
        isGoodTile[i] = isGoodTile[i + 0 * numCellsXY] &
                        isGoodTile[i + 1 * numCellsXY] &
                        isGoodTile[i + 2 * numCellsXY] &
                        isGoodTile[i + 3 * numCellsXY];
    }
}

void computeBadTiles() {
    int start = 1 + numCellsX + numCellsXY;
    int stop = numCellsXY * numCellsZ - 1 - numCellsX - numCellsXY;

    memset(isBadTile, 0, sizeof(*isBadTile) * numCellsXY * numCellsZ);
    for (int i = start; i < stop; i += 4) {
        isBadTile[i] = isBadCell[i + 0] & isBadCell[i + 1] &
                       isBadCell[i + 2] & isBadCell[i + 3];
    }
    for (int i = start; i < stop - 3 * numCellsX; i += 4) {
        isBadTile[i] = isBadTile[i + 0 * numCellsX] &
                       isBadTile[i + 1 * numCellsX] &
                       isBadTile[i + 2 * numCellsX] &
                       isBadTile[i + 3 * numCellsX];
    }
    for (int i = start; i < stop - 3 * numCellsXY; i += 4) {
        isBadTile[i] = isBadTile[i + 0 * numCellsXY] &
                       isBadTile[i + 1 * numCellsXY] &
                       isBadTile[i + 2 * numCellsXY] &
                       isBadTile[i + 3 * numCellsXY];
    }
}
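
For cross-checking the flag arrays, a slower but more direct alternative is to scan each tile cell by cell. This is a sketch of a different technique from the in-place reduction above; `TileKind` and `classifyTile` are hypothetical names, and the layout is assumed to be the one from the question (x fastest, one padding cell per side).

```c
#include <assert.h>
#include <stdbool.h>

typedef enum { TILE_MIXED, TILE_ALL_GOOD, TILE_ALL_BAD } TileKind;

/* Classifies the tile whose first interior cell is (x0, y0, z0) and that
   spans nx * ny * nz cells, by scanning every cell it covers. */
static TileKind classifyTile(const bool *isBadCell,
                             int numCellsX, int numCellsY,
                             int x0, int y0, int z0,
                             int nx, int ny, int nz) {
    assert(nx > 0 && ny > 0 && nz > 0);
    int bad = 0;
    for (int z = z0; z < z0 + nz; z++)
        for (int y = y0; y < y0 + ny; y++)
            for (int x = x0; x < x0 + nx; x++)
                if (isBadCell[x + numCellsX * (y + z * numCellsY)])
                    bad++;
    if (bad == 0)
        return TILE_ALL_GOOD;
    if (bad == nx * ny * nz)
        return TILE_ALL_BAD;
    return TILE_MIXED;
}
```

Because the bad-cell map never changes during the computation, this scan runs once up front, so its cost matters little. It also handles clipped edge tiles without special cases.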

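【Comments】:

Thanks for your answer, I will go through it in detail, but I just wanted to ask how it could be extended so that we can skip over whole blocks of cells. The main reason for the block optimization is that we can work out before the whole computation whether a region/block contains bad cells; if not, we can process it with an optimized function, and in other computations not shown here we can actually skip those blocks entirely. Given that bad cells make up only a small part of the domain, this should make the code much faster. Does that make sense?
@JC2188: I updated the code for horizontal blocks of 8 pixels. You can easily extend it to 16, 32 or 64 pixels, but it is row oriented. If the bad blocks are clusters that are small in the X dimension but extend in the other dimensions, a more elaborate approach may yield good results, but I suspect you would need a different structure for the isBadCell array.
@JC2188: Optimizing the inner loop should already give a substantial improvement. Try that first, without the bit-oriented stuff.
OK, I think I understand how it works. Would it be possible to do this tiled rather than row oriented? I think that would give better cache behaviour.
@JC2188: It may or may not give better cache behaviour; only careful benchmarking will tell you which performs better, and only for the architecture, sizes and compiler tested. A tiled approach can be tried by adding more loops, but the incremental index computation becomes more complicated and costly, and it may not be worth it. I will post such an alternative.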

【Solution 2】:

I think you could nest a couple of similar sets of loops. Something like this:

for(int z = 1; z < numZ+1; z+=4) {
    for(int y = 1; y < numY+1; y+=4) {
        for(int x = 1; x < numX+1; x+=4) {
            if(!isBadBlock[ getBlockIndex(x>>2,y>>2,z>>2) ]) {
                for(int zz = z; zz < z + 4 && zz < numZ+1; zz++) {
                   for(int yy = y; yy < y + 4 && yy < numY+1; yy++) {
                      for(int xx = x; xx < x + 4 && xx < numX+1; xx++) {
                         if(!isBadCell[ getCellIndex(xx,yy,zz) ]) {
                             // Do stencil Computation
                            }
                        }
                    }
                }
            }
        }
    }
}
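
The loop calls getBlockIndex(), which the answer does not define. One plausible definition, mirroring getCellIndex() with x varying fastest, is sketched below; the block counts passed as parameters and the helper `numBlocksAlong` are assumptions.

```c
#include <assert.h>

/* Number of 4-cell blocks needed to cover `num` interior cells. */
static inline int numBlocksAlong(int num) {
    assert(num > 0);
    return (num + 3) / 4;
}

/* Linear index of block (bx, by, bz), x fastest, mirroring getCellIndex. */
static inline int getBlockIndex(int bx, int by, int bz,
                                int numBlocksX, int numBlocksY) {
    assert(numBlocksX > 0 && numBlocksY > 0);
    return bx + numBlocksX * (by + bz * numBlocksY);
}
```

Since the loops start at 1 and step by 4, the shifted coordinates x>>2, y>>2, z>>2 evaluate to 0, 1, 2, and so on, which are valid block coordinates for this indexing.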

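【Comments】:

In the code above, isBadBlock is a boolean array indicating that a block has all cells bad. The OP seems to assume that there are more blocks in which all cells are good, and wants to optimize those first. I guess the two approaches could be combined, depending on the distribution of bad cells.

【Solution 3】:

The way you currently have it set up, you could simply get the index using a 3D array, like this: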

#include <sys/types.h>
#define numX 256
#define numY 128
#define numZ 64
//Note the use of powers of 2 - it will simplify things a lot

int cells[numX][numY][numZ];

size_t getindex(size_t x, size_t y,size_t z){
  return (int*)&cells[x][y][z]-(int*)&cells[0][0][0];
}

This lays out the cells like this:

[0,0,0][0,0,1][0,0,2]...[0,0,numZ-1]
[0,1,0][0,1,1][0,1,2]...[0,1,numZ-1]
...
[0,numY-1,0][0,numY-1,1]...[0,numY-1,numZ-1]
...
[1,0,0][1,0,1][1,0,2]...[1,0,numZ-1]
[1,1,0][1,1,1][1,1,2]...[1,1,numZ-1]
...
[numX-1,numY-1,0][numX-1,numY-1,1]...[numX-1,numY-1,numZ-1]

So efficient loops would look like:

for(size_t x=0;x<numX;x++)
  for(size_t y=0;y<numY;y++)
    for(size_t z=0;z<numZ;z++)
      //vector operations on z values

However, if you want to split it into 4x4x4 blocks, you can use a 3D array of 4x4x4 blocks, for example:

#include <sys/types.h>
#define numX 256 
#define numY 128
#define numZ 64

typedef int block[4][4][4];
block blocks[numX][numY][numZ];
//add a compiler specific 64 byte alignment to  help with cache misses?

size_t getblockindex(size_t x, size_t y,size_t z){
  return (block *)&blocks[x][y][z]-(block *)&blocks[0][0][0];
}

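I reordered the indices to x,y,z so that I could keep them straight in my head, but make sure you order them so that the last one is the one you iterate over in the innermost for loop.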
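
The same blocked layout can be expressed on a flat array, which is closer to the storage the question uses. The sketch below maps 0-based cell coordinates to an offset in a grid of 4x4x4 blocks, each stored as 64 contiguous cells. The name `blockedCellIndex` is hypothetical, and unlike the declarations above it lets x vary fastest, following the question's convention.

```c
#include <assert.h>
#include <stddef.h>

/* Offset of cell (x, y, z) when storage is a grid of 4x4x4 blocks, each
   block holding its 64 cells contiguously. Coordinates are 0-based and
   numBlocksX/numBlocksY count blocks, not cells. x varies fastest. */
static size_t blockedCellIndex(size_t x, size_t y, size_t z,
                               size_t numBlocksX, size_t numBlocksY) {
    assert(x / 4 < numBlocksX && y / 4 < numBlocksY);
    size_t block = (x / 4) + numBlocksX * ((y / 4) + numBlocksY * (z / 4));
    size_t local = (x % 4) + 4 * ((y % 4) + 4 * (z % 4));
    return block * 64 + local;
}
```

The price of this layout is that stencil neighbours across a block face are no longer a fixed stride away, so the simple index - 1 / index + numCellsX neighbour arithmetic from Solution 1 stops working at block boundaries.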

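【Comments】:

【Solution 4】:

Although the OP asks for an approach using blocking, I would recommend against it.

You see, each consecutive run of cells (a 1D row of cells along the X axis) is already such a block. Rather than making the problem simpler, blocking just replaces the original problem with smaller, fixed-size copies of it, repeated over and over.

Simply put, blocking does nothing to help with the real problem at hand. It should not be a required feature of the solution at all.

Instead, I would suggest avoiding the root problem altogether, just in a different way.

You see, instead of having a "bad cell" flag that you need to test for every cell (once per cell, no less), you can keep a (sorted) list of bad cell indices. Then you can process the whole data set in one go, followed by a fix-up loop over the cells listed in the bad cell index list.

Also note that unless you work on a copy of the cell values, the order in which you compute the new cell values will affect the result. This is almost certainly not what you want.

So, here is my suggestion: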

#include <stdlib.h>
#include <errno.h>

typedef struct {
    /* Core cells in the state, excludes border cells */
    size_t   xsize;
    size_t   ysize;
    size_t   zsize;

    /* Index calculation: x + y * ystride + z * zstride */
    /* x is always linear in memory; xstride = 1 */
    size_t   ystride; /* = xsize + 2 */
    size_t   zstride; /* = ystride * (ysize + 2) */

    /* Cell data, points to cell (0,0,0) */
    double  *current;
    double  *previous;

    /* Bad cells */
    size_t   fixup_cells;  /* Number of bad cells */
    size_t  *fixup_index;  /* Array of bad cells' indexes */

    /* Dynamically allocated memory */
    void    *mem[3];
} lattice;

void lattice_free(lattice *const ref)
{
    if (ref) {
        /* Free dynamically allocated memory, */
        free(ref->mem[0]);
        free(ref->mem[1]);
        free(ref->mem[2]);
        /* then initialize/poison the contents. */
        ref->xsize = 0;
        ref->ysize = 0;
        ref->zsize = 0;
        ref->ystride = 0;
        ref->zstride = 0;
        ref->previous = NULL;
        ref->current = NULL;
        ref->fixup_cells = 0;
        ref->fixup_index = NULL;
        ref->mem[0] = NULL;
        ref->mem[1] = NULL;
        ref->mem[2] = NULL;
    }
}


int lattice_init(lattice *const ref, const size_t xsize, const size_t ysize, const size_t zsize)
{
    const size_t  xtotal = xsize + 2;
    const size_t  ytotal = ysize + 2;
    const size_t  ztotal = zsize + 2;
    const size_t  ntotal = xtotal * ytotal * ztotal;
    const size_t  double_bytes = ntotal * sizeof (double);
    const size_t  size_bytes = xsize * ysize * zsize * sizeof (size_t);

    /* NULL reference to the variable to initialize? */
    if (!ref)
        return EINVAL;

    /* Initialize/poison the lattice variable. */
    ref->xsize = 0;
    ref->ysize = 0;
    ref->zsize = 0;
    ref->ystride = 0;
    ref->zstride = 0;
    ref->previous = NULL;
    ref->current = NULL;
    ref->fixup_cells = 0;
    ref->fixup_index = NULL;
    ref->mem[0] = NULL;
    ref->mem[1] = NULL;
    ref->mem[2] = NULL;

    /* Verify size is nonzero */
    if (xsize < 1 || ysize < 1 || zsize < 1)
        return EINVAL;        

    /* Verify size is not too large */
    if (xtotal <= xsize || ytotal <= ysize || ztotal <= zsize ||
        ntotal / xtotal / ytotal != ztotal ||
        ntotal / xtotal / ztotal != ytotal ||
        ntotal / ytotal / ztotal != xtotal ||
        double_bytes / ntotal != sizeof (double) ||
        size_bytes / xsize / ysize / zsize != sizeof (size_t))
        return ENOMEM;

    /* Allocate the dynamic memory needed. */
    ref->mem[0] = malloc(double_bytes);
    ref->mem[1] = malloc(double_bytes);
    ref->mem[2] = malloc(size_bytes);
    if (!ref->mem[0] || !ref->mem[1] || !ref->mem[2]) {
        free(ref->mem[2]);
        ref->mem[2] = NULL;
        free(ref->mem[1]);
        ref->mem[1] = NULL;
        free(ref->mem[0]);
        ref->mem[0] = NULL;
        return ENOMEM;
    }

    ref->xsize = xsize;
    ref->ysize = ysize;
    ref->zsize = zsize;

    ref->ystride = xtotal;
    ref->zstride = xtotal * ytotal;

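    /* Cell (0,0,0) lies one cell in from the border along each axis. */
    ref->current = (double *)ref->mem[0] + 1 + xtotal + xtotal * ytotal;
    ref->previous = (double *)ref->mem[1] + 1 + xtotal + xtotal * ytotal;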

    ref->fixup_cells = 0;
    ref->fixup_index = (size_t *)ref->mem[2];

    return 0;
}

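Note that I prefer the x + ystride * y + zstride * z form of index calculation over x + xtotal * (y + ytotal * z), because the two multiplications in the former can be done in parallel (in a superscalar pipeline, on architectures that can perform two unrelated integer multiplications simultaneously on a single CPU core), whereas in the latter the multiplications must be sequential.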

Note that ref->current[-1 - ystride - zstride] refers to the current value of the cell at (-1, -1, -1), i.e. the border cell diagonal to the origin cell (0, 0, 0). In other words, if you have a cell (x, y, z) at index i, then
i-1 is the cell at (x-1, y, z)
i+1 is the cell at (x+1, y, z)
i-ystride is the cell at (x, y-1, z)
i+ystride is the cell at (x, y+1, z)
i-zstride is the cell at (x, y, z-1)
i+zstride is the cell at (x, y, z+1)
i-1-ystride-zstride is the cell at (x-1, y-1, z-1)
i+1+ystride+zstride is the cell at (x+1, y+1, z+1)
and so on.
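
The neighbour relations listed above follow directly from the index formula, as this small sketch shows. The helper name `cellOffset` is hypothetical, and it uses signed ptrdiff_t arguments (an assumption) so that border coordinates such as -1 can be expressed.

```c
#include <assert.h>
#include <stddef.h>

/* Offset of cell (x, y, z) relative to cell (0, 0, 0). Coordinates may
   be -1 or equal to the size in order to address border cells. */
static ptrdiff_t cellOffset(ptrdiff_t x, ptrdiff_t y, ptrdiff_t z,
                            ptrdiff_t ystride, ptrdiff_t zstride) {
    assert(ystride > 0 && zstride >= ystride);
    return x + y * ystride + z * zstride;
}
```

For a lattice with xsize = 5 and ysize = 4, ystride is 7 and zstride is 42, so the corner border cell (-1, -1, -1) sits at offset -50 from cell (0, 0, 0).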

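The ref->fixup_index array is large enough to list every cell except the border cells. It is a good idea to keep it sorted (or to sort it once it is built), as that helps with cache locality.

If your lattice has periodic boundary conditions, you can use six 2D loops, twelve 1D loops and eight copies to copy the first and last valid cells to the border before starting a new update.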
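
A simpler, though slower, alternative to writing the 26 separate face, edge and corner copies is to visit every cell of the padded volume and copy from the wrapped coordinate whenever the cell is a border cell. This is a sketch of that alternative, not the scheme described above; `wrap` and `fill_periodic_border` are hypothetical names, and `cell` is assumed to point at cell (0, 0, 0).

```c
#include <assert.h>
#include <stddef.h>

/* Wraps a coordinate c in [-1, size] into [0, size - 1]. */
static ptrdiff_t wrap(ptrdiff_t c, ptrdiff_t size) {
    return c < 0 ? c + size : (c >= size ? c - size : c);
}

/* Fills every border cell with the value of its periodic image. */
static void fill_periodic_border(double *cell,
                                 ptrdiff_t xsize, ptrdiff_t ysize, ptrdiff_t zsize,
                                 ptrdiff_t ystride, ptrdiff_t zstride) {
    assert(cell && xsize > 0 && ysize > 0 && zsize > 0);
    for (ptrdiff_t z = -1; z <= zsize; z++)
        for (ptrdiff_t y = -1; y <= ysize; y++)
            for (ptrdiff_t x = -1; x <= xsize; x++) {
                ptrdiff_t sx = wrap(x, xsize);
                ptrdiff_t sy = wrap(y, ysize);
                ptrdiff_t sz = wrap(z, zsize);
                /* Only border cells differ from their wrapped position. */
                if (sx != x || sy != y || sz != z)
                    cell[x + y * ystride + z * zstride] =
                        cell[sx + sy * ystride + sz * zstride];
            }
}
```

This touches the interior cells only to test them, so for large lattices the dedicated loops from the answer should be faster; the generic version is mainly useful as a reference to test them against.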

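Your update cycle is therefore essentially:

1. Compute or fill the borders in ->current.
2. Swap ->current and ->previous.
3. Compute all cells in ->current using data from ->previous.
4. Loop over the ->fixup_cells indices in ->fixup_index, and recalculate the corresponding ->current cells.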

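Note that in step 3 you can process all indices between 0 and xsize-1 + (ysize-1)*ystride + (zsize-1)*zstride, inclusive, linearly; that is, including about 67% of the border cells. They are relatively few compared to the whole volume, and having a single linear loop is likely faster than skipping over the border cells, especially if you can vectorize the computation. (In this case, that matters.)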
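
The swap, linear sweep and fix-up steps can be sketched as follows. Everything here is illustrative: `grid_view` is a hypothetical, trimmed-down stand-in for the lattice structure, `update_rule` is a placeholder (a plain neighbour average, not the OP's stencil), and the fix-up assumes that a bad cell simply keeps its previous value.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, trimmed-down view of the lattice structure. */
typedef struct {
    size_t        xsize, ysize, zsize;
    size_t        ystride, zstride;
    double       *current;      /* points to cell (0,0,0) */
    double       *previous;     /* points to cell (0,0,0) */
    size_t        fixup_cells;  /* number of bad cells */
    const size_t *fixup_index;  /* their indexes */
} grid_view;

/* Placeholder update rule (assumption): average of the six neighbours. */
static double update_rule(const double *p, ptrdiff_t ys, ptrdiff_t zs) {
    return (p[-1] + p[1] + p[-ys] + p[ys] + p[-zs] + p[zs]) / 6.0;
}

static void grid_step(grid_view *g) {
    assert(g && g->current && g->previous);

    /* Swap the buffers: the old current becomes the read-only input. */
    double *tmp = g->current;
    g->current = g->previous;
    g->previous = tmp;

    const ptrdiff_t ys = (ptrdiff_t)g->ystride, zs = (ptrdiff_t)g->zstride;
    const size_t last = (g->xsize - 1) + (g->ysize - 1) * g->ystride
                      + (g->zsize - 1) * g->zstride;

    /* One branch-free linear sweep; it also writes the border cells that
       lie between rows, which are refilled before the next step. */
    for (size_t i = 0; i <= last; i++)
        g->current[i] = update_rule(g->previous + i, ys, zs);

    /* Fix-up (assumption): bad cells keep their previous value. */
    for (size_t k = 0; k < g->fixup_cells; k++) {
        const size_t i = g->fixup_index[k];
        g->current[i] = g->previous[i];
    }
}
```

The sweep contains no test on the bad-cell map, so it is a candidate for vectorization. Filling the borders is left to the caller before each call.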

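You can even split the work among multiple threads by giving each thread a contiguous set of indices to process. Because you read from ->previous and write to ->current, the threads will not trample on each other, although there may be some cache-line ping-pong if one thread reaches the end of its region while another is at the start of its region. Because of how the data is oriented (and a cache line is only a few cells in size, typically 2, 4 or 8), that ping-pong should not be an issue in practice. (Obviously, no locks are needed.)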
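
Handing each thread a contiguous range comes down to a little arithmetic, sketched below. The function name `thread_range` is hypothetical, and thread creation itself is left out.

```c
#include <assert.h>
#include <stddef.h>

/* Half-open range [begin, end) of indexes assigned to thread `t` out of
   `nthreads`, splitting `count` indexes as evenly as possible. */
static void thread_range(size_t count, size_t nthreads, size_t t,
                         size_t *begin, size_t *end) {
    assert(nthreads > 0 && t < nthreads && begin && end);
    const size_t base  = count / nthreads;
    const size_t extra = count % nthreads;  /* first `extra` threads get one more */
    *begin = t * base + (t < extra ? t : extra);
    *end   = *begin + base + (t < extra ? 1 : 0);
}
```

Each thread would then run the linear sweep over its own [begin, end) slice of ->current. The fix-up loop has to wait until all threads have finished the sweep.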

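This particular problem is not really new in any way. Modelling Conway's Game of Life or the square- or cubic-lattice Ising model, and implementing many other lattice models, involve the same problem (but usually with boolean data rather than doubles, and without "bad cells").

【Comments】: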