【Posted】: 2019-12-14 08:52:49
【Question】:
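I have been playing around with Math.max to see whether it is affected by branch prediction (it is not, at least on the JDK for x64, where it compiles to a cmovl) and whether bitwise implementations can compete with the default one. All tests look like this: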
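import java.util.concurrent.TimeUnit;

import lombok.val; // assumption: `val` is Lombok's local-variable type inference
import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.BenchmarkMode;
import org.openjdk.jmh.annotations.Mode;
import org.openjdk.jmh.annotations.OutputTimeUnit;
import org.openjdk.jmh.annotations.Scope;
import org.openjdk.jmh.annotations.Setup;
import org.openjdk.jmh.annotations.State;
import org.openjdk.jmh.annotations.Threads;
import org.openjdk.jmh.infra.Blackhole;

@Threads(4)
@State(Scope.Thread)
@BenchmarkMode({Mode.AverageTime, Mode.SampleTime})
@OutputTimeUnit(TimeUnit.MILLISECONDS)
public class CoreMaximum {
    private int[][] corpus;

    @Setup
    public void setUp() {
        corpus = Corpus.create();
    }

    // Math.max always receives the same pair (corpus[0]); the loop variable is only consumed.
    @Benchmark
    public void constant(Blackhole blackhole) {
        val arguments = corpus[0];
        for (val payload : corpus) {
            blackhole.consume(arguments[0]);
            blackhole.consume(arguments[1]);
            blackhole.consume(payload[0]);
            blackhole.consume(payload[1]);
            blackhole.consume(Math.max(arguments[0], arguments[1]));
        }
    }

    // Math.max receives a different random pair on every iteration; corpus[0] is only consumed.
    @Benchmark
    public void random(Blackhole blackhole) {
        val payload = corpus[0];
        for (val arguments : corpus) {
            blackhole.consume(arguments[0]);
            blackhole.consume(arguments[1]);
            blackhole.consume(payload[0]);
            blackhole.consume(payload[1]);
            blackhole.consume(Math.max(arguments[0], arguments[1]));
        }
    }
}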
Here Math.max can be replaced with a call to another implementation, and Corpus.create() returns an int[1_000_000][2] filled by SecureRandom.
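For readers without the sources, here is a minimal sketch of what the two pieces mentioned above might look like. The names `CorpusSketch`, `create(int)` and `bitwiseMax` are hypothetical, and the bodies are assumptions rather than the code that was actually benchmarked. The bitwise variant is just one common branch-free formulation.

```java
import java.security.SecureRandom;

// Hypothetical stand-in for the Corpus helper and one bitwise max variant.
// Names and bodies are assumptions, not the benchmarked code.
final class CorpusSketch {

    // Builds `size` pairs of ints drawn from SecureRandom.
    static int[][] create(int size) {
        SecureRandom random = new SecureRandom();
        int[][] corpus = new int[size][2];
        for (int[] pair : corpus) {
            pair[0] = random.nextInt();
            pair[1] = random.nextInt();
        }
        return corpus;
    }

    // Branch-free max: the difference is widened to long so it cannot overflow.
    static int bitwiseMax(int a, int b) {
        long diff = (long) a - b;
        long mask = diff >> 63;           // all ones when a < b, zero otherwise
        return (int) (a - (diff & mask)); // a - (a - b) == b when a < b, else a
    }

    public static void main(String[] args) {
        // Cross-check the bitwise variant against Math.max on random pairs.
        for (int[] pair : create(1_000)) {
            if (bitwiseMax(pair[0], pair[1]) != Math.max(pair[0], pair[1])) {
                throw new AssertionError(pair[0] + ", " + pair[1]);
            }
        }
        System.out.println("bitwiseMax agrees with Math.max on 1000 random pairs");
    }
}
```

In the benchmark, a call such as `CorpusSketch.bitwiseMax(arguments[0], arguments[1])` would take the place of the `Math.max` call, with everything else left unchanged.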
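The problem is that even though I am sure the code under test is not affected by branch prediction, and the constant and random benchmarks perform the same number of loads and consumes, I still get a similar difference between the two for every implementation: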
CoreMaximum.constant avgt 25 13.080 ± 0.680 ms/op
CoreMaximum.constant:CPI avgt 5 0.528 ± 0.027 #/op
CoreMaximum.constant:L1-dcache-load-misses avgt 5 478734.008 ± 2419.011 #/op
CoreMaximum.constant:L1-dcache-loads avgt 5 49990187.380 ± 114908.845 #/op
CoreMaximum.constant:L1-dcache-stores avgt 5 17998192.002 ± 42008.496 #/op
CoreMaximum.constant:L1-icache-load-misses avgt 5 2142.398 ± 526.619 #/op
CoreMaximum.constant:LLC-load-misses avgt 5 28553.636 ± 1338.175 #/op
CoreMaximum.constant:LLC-loads avgt 5 33148.939 ± 667.526 #/op
CoreMaximum.constant:LLC-store-misses avgt 5 150.218 ± 26.488 #/op
CoreMaximum.constant:LLC-stores avgt 5 271.536 ± 113.444 #/op
CoreMaximum.constant:branch-misses avgt 5 187.060 ± 123.697 #/op
CoreMaximum.constant:branches avgt 5 17001028.964 ± 32923.938 #/op
CoreMaximum.constant:cycles avgt 5 57063715.464 ± 2900664.885 #/op
CoreMaximum.constant:dTLB-load-misses avgt 5 13153.047 ± 1808.179 #/op
CoreMaximum.constant:dTLB-loads avgt 5 49999483.367 ± 94718.665 #/op
CoreMaximum.constant:dTLB-store-misses avgt 5 36.217 ± 7.357 #/op
CoreMaximum.constant:dTLB-stores avgt 5 17999664.120 ± 23160.612 #/op
CoreMaximum.constant:iTLB-load-misses avgt 5 32.138 ± 4.584 #/op
CoreMaximum.constant:iTLB-loads avgt 5 16.571 ± 20.613 #/op
CoreMaximum.constant:instructions avgt 5 107989860.816 ± 240202.175 #/op
CoreMaximum.random avgt 25 14.082 ± 0.717 ms/op
CoreMaximum.random:CPI avgt 5 0.503 ± 0.037 #/op
CoreMaximum.random:L1-dcache-load-misses avgt 5 479117.110 ± 2632.690 #/op
CoreMaximum.random:L1-dcache-loads avgt 5 56030755.475 ± 120501.598 #/op
CoreMaximum.random:L1-dcache-stores avgt 5 24015559.169 ± 51480.836 #/op
CoreMaximum.random:L1-icache-load-misses avgt 5 2473.731 ± 968.508 #/op
CoreMaximum.random:LLC-load-misses avgt 5 29106.351 ± 1251.508 #/op
CoreMaximum.random:LLC-loads avgt 5 34274.838 ± 1178.339 #/op
CoreMaximum.random:LLC-store-misses avgt 5 156.975 ± 29.332 #/op
CoreMaximum.random:LLC-stores avgt 5 268.092 ± 128.106 #/op
CoreMaximum.random:branch-misses avgt 5 169.811 ± 102.783 #/op
CoreMaximum.random:branches avgt 5 18007736.925 ± 42002.060 #/op
CoreMaximum.random:cycles avgt 5 61431988.502 ± 4506086.004 #/op
CoreMaximum.random:dTLB-load-misses avgt 5 13157.184 ± 1496.518 #/op
CoreMaximum.random:dTLB-loads avgt 5 56026614.485 ± 144375.149 #/op
CoreMaximum.random:dTLB-store-misses avgt 5 37.321 ± 6.870 #/op
CoreMaximum.random:dTLB-stores avgt 5 24011292.571 ± 54348.034 #/op
CoreMaximum.random:iTLB-load-misses avgt 5 38.059 ± 19.549 #/op
CoreMaximum.random:iTLB-loads avgt 5 18.290 ± 36.399 #/op
CoreMaximum.random:instructions avgt 5 122045528.539 ± 313318.615 #/op
(I am not experienced enough to draw conclusions, but the random benchmark clearly shows higher counter values than constant, especially for instructions and branches.)
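To make the gap easier to read, the counters can be turned into per-iteration differences. The sketch below uses the scores from the table above and assumes that one benchmark operation walks all 1,000,000 pairs of the corpus. `CounterDelta` and `perIteration` are hypothetical names introduced only for this illustration.

```java
// Per-iteration differences between the two benchmarks, using scores from the table.
// Assumption: one benchmark op iterates over all 1,000,000 pairs of the corpus.
final class CounterDelta {
    static final int ITERATIONS_PER_OP = 1_000_000;

    // Extra events per loop iteration that `random` incurs compared to `constant`.
    static double perIteration(double constant, double random) {
        return (random - constant) / ITERATIONS_PER_OP;
    }

    public static void main(String[] args) {
        System.out.printf("instructions:     %.2f%n", perIteration(107989860.816, 122045528.539));
        System.out.printf("branches:         %.2f%n", perIteration(17001028.964, 18007736.925));
        System.out.printf("L1-dcache-loads:  %.2f%n", perIteration(49990187.380, 56030755.475));
        System.out.printf("L1-dcache-stores: %.2f%n", perIteration(17998192.002, 24015559.169));
    }
}
```

Under that assumption, random executes roughly 14 more instructions, 6 more loads, 6 more stores and 1 more branch per loop iteration than constant, which suggests the two loop bodies are not being compiled to equivalent code.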
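random always takes longer and executes more instructions, which clearly suggests that the benchmark fails to account for something, but I cannot figure out what is wrong. Apart from the thread count, I did not tune anything (I am not sure why, but it was using a single thread rather than one per core). The warmups should be doing their job (I can see C2 compiling the code during the first warmup iteration), and dumping the ASM (using print in a CompileCommandFile) shows no significant difference apart from a few nops and the placement of specific instructions. What am I missing?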
【Discussion】:
Tags: java jvm performance-testing jmh