【Posted】: 2019-11-24 03:01:51
【Question】:
I am trying to measure memory access time for sequential and random memory reads and writes. Here is the code:
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/time.h>
#include <time.h>

#define PRINT_EXCECUTION_TIME(msg, code)                    \
    do {                                                    \
        struct timeval t1, t2;                              \
        double elapsed;                                     \
        gettimeofday(&t1, NULL);                            \
        do {                                                \
            code;                                           \
        } while (0);                                        \
        gettimeofday(&t2, NULL);                            \
        elapsed = (t2.tv_sec - t1.tv_sec) * 1000.0;         \
        elapsed += (t2.tv_usec - t1.tv_usec) / 1000.0;      \
        printf(msg " time: %f ms\n", elapsed);              \
    } while (0);

const int RUNS = 20;
const int N = (1 << 27) - 1;
int *data;

int seqR() {
    register int res = 0;
    register int *data_p = data;
    register int pos = 0;
    for (register int j = 0; j < RUNS; j++) {
        for (register int i = 0; i < N; i++) {
            pos = (pos + 1) & N;
            res = data_p[pos];
        }
    }
    return res;
}

int seqW() {
    register int res = 0;
    register int *data_p = data;
    register int pos = 0;
    for (register int j = 0; j < RUNS; j++) {
        for (register int i = 0; i < N; i++) {
            pos = (pos + 1) & N;
            data_p[pos] = res;
        }
    }
    return res;
}

int rndR() {
    register int res = 0;
    register int *data_p = data;
    register int pos = 0;
    for (register int j = 0; j < RUNS; j++) {
        for (register int i = 0; i < N; i++) {
            pos = (pos + i) & N;
            res = data_p[pos];
        }
    }
    return res;
}

int rndW() {
    register int res = 0;
    register int *data_p = data;
    register int pos = 0;
    for (register int j = 0; j < RUNS; j++) {
        for (register int i = 0; i < N; i++) {
            pos = (pos + i) & N;
            data_p[pos] = res;
        }
    }
    return res;
}

int main() {
    data = (int *)malloc(sizeof(int) * (N + 1));
    assert(data);
    for (int i = 0; i < N; i++) {
        data[i] = i;
    }
    for (int i = 0; i < 10; i++) {
        PRINT_EXCECUTION_TIME("seqR", seqR());
        PRINT_EXCECUTION_TIME("seqW", seqW());
        PRINT_EXCECUTION_TIME("rndR", rndR());
        PRINT_EXCECUTION_TIME("rndW", rndW());
    }
    return 0;
}
I compiled with gcc 6.5.0 and -O0 to prevent optimization, and got the following results:
seqR time: 2538.010000 ms
seqW time: 2394.991000 ms
rndR time: 40625.169000 ms
rndW time: 46184.652000 ms
seqR time: 2411.038000 ms
seqW time: 2309.115000 ms
rndR time: 41575.063000 ms
rndW time: 46206.275000 ms
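It is easy to understand that sequential access is much faster than random access. However, it makes no sense to me that random writes are slower than random reads while sequential writes are faster than sequential reads. What could cause this?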
Also, is it safe to say that the memory bandwidth for seqR is (20 * ((1 << 27) - 1) * 4 / (1024 * 1024 * 1024)) GB / 2.538 s ≈ 3.94 GB/s?
【Discussion】:
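-
What processor are you running the experiment on?
-
It looks to me like data_p[N] can be accessed in all four functions. You probably want to allocate N+1 ints.
-
Thanks, it should be N + 1. I am using an Intel Xeon E5-2695 v4 (Broadwell).
-
@zingdle: Oh. Many-core Xeons are notoriously bad for single-threaded memory bandwidth compared to quad-core desktops. Still, 4GB/s is lower than I would expect, so I still think you are CPU-bound (not memory-bound) in the sequential functions. See "Why is Skylake so much better than Broadwell-E for single-threaded memory throughput?"
-
@PeterCordes Yes, it looks like the original code is CPU-bound. I changed the array to volatile, removed register, and recompiled with O3. The assembly shows that it uses some xmm registers as an optimization, but the run time is about the same as before. If I drop pos = (pos + i) & N; and access the array as data[i], the time is halved. However, I can then no longer compare random and sequential memory access time directly, as I could before.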
Tags: performance x86-64 cpu-cache memory-bandwidth