How to accumulate arrays of data efficiently in C: answers

【Question title】: How to accumulate arrays of data efficiently in C
【Posted】: 2016-05-24 18:54:46
【Question description】:

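The problem is that I have a huge matrix A and a (fairly large) array of integers. For example, suppose my matrix is: [0,0,0,0,0,0,0,0, 1,1,1,1,1,1,1,1, 2,2,2,2,2,2,2,2, 3,3,3,3,3,3,3,3, 4,4,4,4,4,4,4,4, ......]

and the integer array is [0, 2, 4].

Then the desired result is [6,6,6,6,6,6,6,6], obtained by accumulating the rows [0,0,0,0,0,0,0,0], [2,2,2,2,2,2,2,2], [4,4,4,4,4,4,4,4].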
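The example above can be written out as a short baseline. This is only a minimal sketch: the function names, the use of `int` as the element type, and the 5 x 8 test matrix are assumptions made for illustration, not part of the original question.

```c
#include <stddef.h>

/* Sums the selected rows of a flattened, row-major matrix A into out[0..ncol-1].
 * `stride` is the number of elements between the starts of consecutive rows.
 * Assumption: elements are plain int; the question leaves its Dtype undefined. */
void accumulate_rows_naive(const int *A, size_t stride, size_t ncol,
                           const int *rows, size_t num_rows, int *out)
{
    for (size_t c = 0; c < ncol; c++)
        out[c] = 0;

    for (size_t k = 0; k < num_rows; k++) {
        const int *row = A + (size_t)rows[k] * stride;
        for (size_t c = 0; c < ncol; c++)
            out[c] += row[c];
    }
}

/* Worked example from the question: a 5 x 8 matrix whose row r holds the
 * value r in every column, with rows 0, 2 and 4 selected.
 * Fills out[0..7] and returns out[0]. */
int example_from_question(int out[8])
{
    int A[5 * 8];
    for (int r = 0; r < 5; r++)
        for (int c = 0; c < 8; c++)
            A[r * 8 + c] = r;

    const int rows[] = {0, 2, 4};
    accumulate_rows_naive(A, 8, 8, rows, 3, out);
    return out[0];
}
```

Every element of `out` ends up as 0 + 2 + 4 = 6, matching the expected result.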

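This is a simple problem, but a naive C implementation seems to be slow, especially when accumulating a large number of rows.

Manual loop unrolling does not seem to help. I am not familiar with inline assembly; any suggestions? I would also like to know whether there are known libraries for this kind of operation.

Below is my current implementation: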

void accumulateRows(int* js, int num_j, Dtype* B, int nrow, int ncol, int incRowB, Dtype* buffer){

int i = 0;
int num_accumulated_rows = (num_j / 8) * 8;
int remaining_rows = num_j - num_accumulated_rows;

// unrolling factor of 8, each time, accumulate 8 rows  
for(; i < num_accumulated_rows; i+=8){
    int r1 = js[i];
    int r2 = js[i+1];
    int r3 = js[i+2];
    int r4 = js[i+3];
    int r5 = js[i+4];
    int r6 = js[i+5];
    int r7 = js[i+6];
    int r8 = js[i+7];
    register Dtype* B1_row = &B[r1*incRowB];
    register Dtype* B2_row = &B[r2*incRowB];
    register Dtype* B3_row = &B[r3*incRowB];
    register Dtype* B4_row = &B[r4*incRowB];
    register Dtype* B5_row = &B[r5*incRowB];
    register Dtype* B6_row = &B[r6*incRowB];
    register Dtype* B7_row = &B[r7*incRowB];
    register Dtype* B8_row = &B[r8*incRowB];
    for(int j = 0; j < ncol; j+=1){
        register Dtype temp = B1_row[j] + B2_row[j] + B3_row[j] + B4_row[j];
        temp += B5_row[j] + B6_row[j] + B7_row[j] + B8_row[j];
        buffer[j] += temp;
    }
}

// left over from the loop unrolling: handle the final (num_j % 8) rows
for(; i < num_j; i++){
    int r = js[i];
    Dtype* B_row = &B[r*incRowB];
    for(int j = 0; j < ncol; j++){
        buffer[j] += B_row[j];
    }
}

}

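Edit: I think this kind of accumulation is very common in databases, for example when we want to query the total sales on any Monday, Tuesday, and so on.

I know gcc supports Intel SSE, and I am learning how to apply it to this problem, since it looks very much like a SIMD workload.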
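As a rough illustration of the SSE idea mentioned above, the row addition can be written with SSE2 intrinsics from `<emmintrin.h>`. This is a sketch under stated assumptions, not the asker's code: it assumes 32-bit `int` elements, uses unaligned loads so no alignment is required, and the function names `add_row_sse2` and `accumulate_rows_sse2` are made up for this example. A scalar fallback is compiled when `__SSE2__` is not defined.

```c
#include <stddef.h>
#if defined(__SSE2__)
#include <emmintrin.h> /* SSE2 integer intrinsics */
#endif

/* Adds row[0..n-1] element-wise into acc[0..n-1]. */
void add_row_sse2(int *acc, const int *row, size_t n)
{
    size_t j = 0;
#if defined(__SSE2__)
    /* Process four 32-bit integers per iteration. */
    for (; j + 4 <= n; j += 4) {
        __m128i a = _mm_loadu_si128((const __m128i *)(acc + j));
        __m128i b = _mm_loadu_si128((const __m128i *)(row + j));
        _mm_storeu_si128((__m128i *)(acc + j), _mm_add_epi32(a, b));
    }
#endif
    /* Scalar tail (or the whole row when SSE2 is unavailable). */
    for (; j < n; j++)
        acc[j] += row[j];
}

/* Adds the rows of B selected by js[0..num_j-1] into buffer[0..ncol-1].
 * Like the question's code, this adds to whatever buffer already holds. */
void accumulate_rows_sse2(const int *B, size_t stride, size_t ncol,
                          const int *js, size_t num_j, int *buffer)
{
    for (size_t k = 0; k < num_j; k++)
        add_row_sse2(buffer, B + (size_t)js[k] * stride, ncol);
}
```

Whether this beats what the compiler generates on its own depends on the compiler, flags, and matrix size, so it is worth measuring both.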

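【Comments】:

Are you going to share your "slow, naive C implementation"?
It is a bit unclear what you mean when you say you want the rows of A in a single buffer. Can you give an example?
Is this amount of redundancy common or expected in your matrix? If so, just sum the integer array to get 6 and then broadcast the result to 6,6,6,6,.... If not, your C code looks like it might auto-vectorize well. As long as you are loading contiguous data, you should be fine. Gathers are slow, but contiguous indices of an array can be loaded efficiently.
There is no 2D array (aka matrix) in your code.
@Olaf According to Wikipedia, a matrix can have a dimension of size 1, and it is clear from the question that the 2D version has been flattened into a 1D one. Drop the nitpicking and focus on the question.

Tags: c arrays database sse inline-assembly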
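The two suggestions in the comment about redundancy and contiguous loads can be sketched as follows. Both functions are illustrative assumptions (names, `int` element type, and signatures are not from the thread). The first applies only when every row holds a single repeated value; the second is the general loop, with `restrict` added so the compiler knows `buffer` does not overlap `B` or `js`, which may help it vectorize when optimization is enabled (for example gcc's `-O3`).

```c
#include <stddef.h>

/* Shortcut for the redundant case: if every row holds one repeated value,
 * sum the first element of each selected row and broadcast the total. */
void accumulate_constant_rows(const int *B, size_t stride, size_t ncol,
                              const int *js, size_t num_j, int *buffer)
{
    int total = 0;
    for (size_t k = 0; k < num_j; k++)
        total += B[(size_t)js[k] * stride];
    for (size_t c = 0; c < ncol; c++)
        buffer[c] += total;
}

/* General case: each selected row is read contiguously.
 * Assumption: buffer does not overlap B or js, as promised by restrict. */
void accumulate_rows_contiguous(const int *restrict B, size_t stride,
                                size_t ncol, const int *restrict js,
                                size_t num_j, int *restrict buffer)
{
    for (size_t k = 0; k < num_j; k++) {
        const int *row = B + (size_t)js[k] * stride;
        for (size_t c = 0; c < ncol; c++)
            buffer[c] += row[c];
    }
}
```

Only the row selection through `js` is an indirect access; within a row the loads are sequential, which is the case the comment describes as efficient.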

【Solution 1】:

Here is one way to implement the function, along with some suggestions for speeding it up further:

#include <stdlib.h> // size_t

typedef int Dtype;

// Note:
// following function assumes a 'contract' with the caller
//    that no entry in 'whichRows[]'
//    is larger than (number of rows in 'BaseArray[][]' - 1)

void accumulateRows(
    // describe source 2d array
    /* size_t numRows */ size_t numCols, Dtype BaseArray[][ numCols ],

    // describe row selector array
    size_t numSelectRows, size_t whichRows[ numSelectRows ],

    // describe result array
    Dtype resultArray[ numCols ] )
{
    size_t colIndex;
    size_t selectorIndex;

    // initialize resultArray to all 0
    for( colIndex = 0; colIndex < numCols; colIndex++ )
    {
        resultArray[colIndex] = 0;
    }

    // accumulate totals for each column of selected rows
    for( selectorIndex = 0; selectorIndex < numSelectRows; selectorIndex++ )
    {
        for( colIndex = 0; colIndex < numCols; colIndex++ )
        {
            resultArray[colIndex] += BaseArray[ whichRows[selectorIndex] ][colIndex];
        } // end for each column
    } // end for each selected row
}

#if 0
// you might want to unroll the "initialize resultArray" loop
//    by replacing the loop with
    resultArray[0] = 0;
    resultArray[1] = 0;
    resultArray[2] = 0;
    resultArray[3] = 0;
    resultArray[4] = 0;
    resultArray[5] = 0;
    resultArray[6] = 0;
    resultArray[7] = 0;
// however, that puts a constraint on the number of columns always being 8
#endif

#if 0
// you might want to unroll the 'sum of columns' loop by replacing the loop with
    resultArray[0] += BaseArray[ whichRows[selectorIndex] ][0];
    resultArray[1] += BaseArray[ whichRows[selectorIndex] ][1];
    resultArray[2] += BaseArray[ whichRows[selectorIndex] ][2];
    resultArray[3] += BaseArray[ whichRows[selectorIndex] ][3];
    resultArray[4] += BaseArray[ whichRows[selectorIndex] ][4];
    resultArray[5] += BaseArray[ whichRows[selectorIndex] ][5];
    resultArray[6] += BaseArray[ whichRows[selectorIndex] ][6];
    resultArray[7] += BaseArray[ whichRows[selectorIndex] ][7];
// however, that puts a constraint on the number of columns always being 8
#endif

#if 0
// on Texas Instruments DSPs,
//    could use a #pragma to unroll the loop
//    or (better)
//    make use of the built-in loop table
//    to massively speed up the execution of the loop(s)
#endif
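Because the function above takes a variable-length-array parameter (`BaseArray[][ numCols ]`), calling it with a flat, heap-allocated matrix like the one in the question needs a pointer-to-row cast. The sketch below shows one way to do that. The answer's function is repeated in condensed form only so the block compiles on its own; `sumSelectedRowsFlat` and its test matrix are hypothetical additions for illustration. It assumes a compiler with C99 VLA support (optional since C11, available in gcc and clang).

```c
#include <stddef.h>
#include <stdlib.h>

typedef int Dtype;

/* Condensed copy of the answer's function, repeated for self-containment. */
void accumulateRows(size_t numCols, Dtype BaseArray[][numCols],
                    size_t numSelectRows, size_t whichRows[numSelectRows],
                    Dtype resultArray[numCols])
{
    for (size_t c = 0; c < numCols; c++)
        resultArray[c] = 0;
    for (size_t s = 0; s < numSelectRows; s++)
        for (size_t c = 0; c < numCols; c++)
            resultArray[c] += BaseArray[whichRows[s]][c];
}

/* Hypothetical helper: builds a flat numRows x numCols matrix whose row r
 * holds the value r in every column, then sums the selected rows.
 * Returns 0 on success, -1 if the allocation fails. */
int sumSelectedRowsFlat(size_t numRows, size_t numCols,
                        size_t numSelectRows, size_t whichRows[],
                        Dtype resultArray[])
{
    Dtype *flat = malloc(numRows * numCols * sizeof *flat);
    if (flat == NULL)
        return -1;

    for (size_t r = 0; r < numRows; r++)
        for (size_t c = 0; c < numCols; c++)
            flat[r * numCols + c] = (Dtype)r;

    /* View the flat buffer as an array of numCols-wide rows. */
    Dtype (*matrix)[numCols] = (Dtype (*)[numCols])flat;
    accumulateRows(numCols, matrix, numSelectRows, whichRows, resultArray);

    free(flat);
    return 0;
}
```

With `numRows = 5`, `numCols = 8`, and `whichRows = {0, 2, 4}`, every element of `resultArray` is 6, the same result as in the question's example.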

【Comments】: