【Posted】: 2018-08-17 02:39:45
【Question】:
Edit:
ICC output (after adding -qopt-report=5 -qopt-report-phase:vec):
LOOP BEGIN at 4.c(107,2)
remark #15344: loop was not vectorized: vector dependence prevents vectorization
remark #15346: vector dependence: assumed OUTPUT dependence between c[i][j] (110:5) and c[i][j] (110:5)
remark #15346: vector dependence: assumed OUTPUT dependence between c[i][j] (110:5) and c[i][j] (110:5)
LOOP BEGIN at 4.c(108,3)
remark #15344: loop was not vectorized: vector dependence prevents vectorization
remark #15346: vector dependence: assumed OUTPUT dependence between c[i][j] (110:5) and c[i][j] (110:5)
remark #15346: vector dependence: assumed OUTPUT dependence between c[i][j] (110:5) and c[i][j] (110:5)
LOOP BEGIN at 4.c(109,4)
remark #15344: loop was not vectorized: vector dependence prevents vectorization
remark #15346: vector dependence: assumed FLOW dependence between c[i][j] (110:5) and c[i][j] (110:5)
remark #15346: vector dependence: assumed ANTI dependence between c[i][j] (110:5) and c[i][j] (110:5)
LOOP END
LOOP BEGIN at 4.c(109,4)
<Remainder>
LOOP END
LOOP END
LOOP END
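It seems that, if the loop were vectorized, c[i][j] would be read before it is written (as in the reduction I am doing). The question is: why does the reduction become legal once a local variable (tmp) is introduced?
Original question:
I have the C snippet below, which performs matrix multiplication. a and b are the operands, c is the result of a*b, and n is the number of rows and columns.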
double **c = create_matrix(...); // initialize n*n matrix with zeroes
double **a = fill_matrix(...);   // fills n*n matrix with random doubles
double **b = fill_matrix(...);   // fills n*n matrix with random doubles
for (i = 0; i < n; i++) {
    for (j = 0; j < n; j++) {
        for (k = 0; k < n; k++) {
            c[i][j] += a[i][k] * b[k][j];
        }
    }
}
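For anyone who wants to reproduce this, here is a minimal self-contained sketch of the setup. The helpers `alloc_matrix`, `free_matrix`, and `fill_random` are hypothetical stand-ins for the elided `create_matrix(...)` and `fill_matrix(...)` calls; the original implementations are not shown in the question, so the array-of-row-pointers layout is an assumption based on the `double **` types.

```c
#include <stdlib.h>

/* Hypothetical helper (not from the question): allocate an n*n matrix
   as an array of row pointers, zero-initialized. Returns NULL on failure. */
double **alloc_matrix(int n) {
    double **m = malloc((size_t)n * sizeof *m);
    if (!m) return NULL;
    for (int i = 0; i < n; i++) {
        m[i] = calloc((size_t)n, sizeof **m);
        if (!m[i]) {
            while (i-- > 0) free(m[i]);  /* release rows allocated so far */
            free(m);
            return NULL;
        }
    }
    return m;
}

/* Hypothetical helper: release a matrix created by alloc_matrix. */
void free_matrix(double **m, int n) {
    if (!m) return;
    for (int i = 0; i < n; i++) free(m[i]);
    free(m);
}

/* Hypothetical helper: fill with pseudo-random doubles in [0, 1]. */
void fill_random(double **m, int n) {
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++)
            m[i][j] = (double)rand() / RAND_MAX;
}

/* The kernel from the question: accumulates directly into c[i][j],
   so c must be zero-initialized by the caller. */
void matmul(double **c, double **a, double **b, int n) {
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++)
            for (int k = 0; k < n; k++)
                c[i][j] += a[i][k] * b[k][j];
}
```

Calling `matmul(c, a, b, n)` with `c` from `alloc_matrix` and `a`, `b` filled by `fill_random` should match the loop nest above.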
ICC (version 18.0.0.1) fails to vectorize the inner loop, even with the -O3 flag.
ICC output:
LOOP BEGIN at 4.c(107,2)
remark #25460: No loop optimizations reported
LOOP BEGIN at 4.c(108,3)
remark #25460: No loop optimizations reported
LOOP BEGIN at 4.c(109,4)
remark #25460: No loop optimizations reported
LOOP END
LOOP BEGIN at 4.c(109,4)
<Remainder>
LOOP END
LOOP END
LOOP END
However, with the following change, the compiler vectorizes the inner loop.
// OLD
for (k = 0; k < n; k++) {
    c[i][j] += a[i][k] * b[k][j];
}
// TO (NEW)
double tmp = 0;
for (k = 0; k < n; k++) {
    tmp += a[i][k] * b[k][j];
}
c[i][j] = tmp;
ICC output for the vectorized version:
LOOP BEGIN at 4.c(119,2)
remark #25460: No loop optimizations reported
LOOP BEGIN at 4.c(120,3)
remark #25460: No loop optimizations reported
LOOP BEGIN at 4.c(134,4)
<Peeled loop for vectorization>
LOOP END
LOOP BEGIN at 4.c(134,4)
remark #15300: LOOP WAS VECTORIZED
LOOP END
LOOP BEGIN at 4.c(134,4)
<Alternate Alignment Vectorized Loop>
LOOP END
LOOP BEGIN at 4.c(134,4)
<Remainder loop for vectorization>
LOOP END
LOOP END
LOOP END
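Instead of accumulating the products directly into the cell of matrix c, the result is accumulated in a separate variable and assigned afterwards.
Why doesn't the compiler optimize the first version? Could it be due to potential aliasing of a and/or b with elements of c (a read-after-write problem)?
【Discussion】: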
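To make the aliasing concern concrete, here is a small sketch showing that the two forms are not interchangeable when c overlaps a. `matmul_promoted` is a hypothetical hand-written version of what the compiler would have to do on its own (keep c[i][j] in a local for the whole k loop), and `aliased_results_differ` is an illustrative driver; neither appears in the original code. The matrices are fixed 2x2 values chosen so every result is exact.

```c
/* Kernel from the question: every k iteration reads and writes c[i][j]. */
void matmul_direct(double **c, double **a, double **b, int n) {
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++)
            for (int k = 0; k < n; k++)
                c[i][j] += a[i][k] * b[k][j];
}

/* Hypothetical transformation: load c[i][j] once, accumulate in a local,
   store once. Equivalent to matmul_direct only if c does not overlap a or b. */
void matmul_promoted(double **c, double **a, double **b, int n) {
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++) {
            double tmp = c[i][j];
            for (int k = 0; k < n; k++)
                tmp += a[i][k] * b[k][j];
            c[i][j] = tmp;
        }
}

/* Runs both kernels with c deliberately aliased to a (same row pointers).
   Copies the results out and returns 1 if they differ, 0 otherwise. */
int aliased_results_differ(double out_direct[2][2], double out_promoted[2][2]) {
    double b0[2] = {1, 1}, b1[2] = {1, 1};
    double *b[2] = {b0, b1};

    double x0[2] = {1, 2}, x1[2] = {3, 4};
    double *x[2] = {x0, x1};
    matmul_direct(x, x, b, 2);      /* c and a are the same matrix */

    double y0[2] = {1, 2}, y1[2] = {3, 4};
    double *y[2] = {y0, y1};
    matmul_promoted(y, y, b, 2);    /* same aliasing, promoted form */

    int differ = 0;
    for (int i = 0; i < 2; i++)
        for (int j = 0; j < 2; j++) {
            out_direct[i][j] = x[i][j];
            out_promoted[i][j] = y[i][j];
            if (x[i][j] != y[i][j]) differ = 1;
        }
    return differ;
}
```

With these inputs the direct form leaves 12 in element [0][1] while the promoted form leaves 8, because the direct form re-reads a[0][1] after it has already been updated through c[0][1]. Since the compiler cannot rule this situation out for arbitrary `double **` arguments, it has to assume the dependence.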
-
Please provide a minimal reproducible example. In the code you show here, a, b, and c could even refer to exactly the same memory.
-
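I agree that a, b, and c could refer to exactly the same memory. That was my assumption for why the block is not vectorized. But why does it get vectorized once I introduce a separate accumulator? I never make c, a, and b refer to the same memory location, although the compiler may think that I could.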
-
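The compiler must generate correct code even if a or b overlaps c, unless it can prove that this cannot happen. If you provide a complete example, a detailed explanation will follow.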
-
Good suggestion. I have updated the code with the initialization example.
-
I have edited my post. It seems the problem is a data dependence.
Tags: c++ c performance vectorization compiler-optimization
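Following up on the overlap point raised in the comments: one way to state in the source that c does not overlap a or b is the C99 `restrict` qualifier. The sketch below is only an illustration under two assumptions that are not part of the question: the matrices are stored as flat row-major arrays rather than `double **`, and the function name `matmul_restrict` is made up. `restrict` is an unchecked promise by the programmer, and whether ICC 18 actually vectorizes this form has not been verified here.

```c
#include <stddef.h>

/* Assumption: flat row-major n*n arrays, unlike the double** layout in the
   question. restrict promises that memory written through c is not accessed
   through a or b; breaking that promise is undefined behavior. */
void matmul_restrict(double *restrict c,
                     const double *restrict a,
                     const double *restrict b,
                     size_t n) {
    for (size_t i = 0; i < n; i++)
        for (size_t j = 0; j < n; j++)
            for (size_t k = 0; k < n; k++)
                c[i * n + j] += a[i * n + k] * b[k * n + j];
}
```

As in the question, c has to be zeroed before the call, and the file must be compiled as C99 or later for the `restrict` keyword to be accepted.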