Following the instructions for using operf more or less, I created a file with the single line sapply(1:1e5, sum) and ran
$ operf ~/bin/R-3-1-branch/bin/R -f sum.R
$ opreport -l ~/bin/R-3-1-branch/lib/libR.so |less
producing
CPU: Intel Sandy Bridge microarchitecture, speed 2.401e+06 MHz (estimated)
Counted CPU_CLK_UNHALTED events (Clock cycles when not halted) with a unit mask of 0x00 (No unit mask) count 100000
samples % image name symbol name
835882 93.0929 libR.so RunGenCollect
27731 3.0884 libR.so SortNodes
9323 1.0383 libR.so AgeNodeAndChildren
2038 0.2270 libR.so CheckFinalizers
1593 0.1774 libR.so Rf_allocVector3
1222 0.1361 libR.so duplicate1
...
and so on. Most of the time is spent in the garbage collector (RunGenCollect -- run the generational garbage collector). So I ran
$ R -d gdb R
(gdb) run
> sapply(1:1e5, sum)
^C
(gdb) break RunGenCollect
(gdb) continue
Continuing.
Breakpoint 1, RunGenCollect (size_needed=50000) at /home/mtmorgan/src/R-3-1-branch/src/main/memory.c:1504
1504 bad_sexp_type_seen = 0;
(gdb) where
which produced
#0 RunGenCollect (size_needed=50000) at /home/mtmorgan/src/R-3-1-branch/src/main/memory.c:1504
#1 0x00007ffff789d354 in R_gc_internal (size_needed=50000) at /home/mtmorgan/src/R-3-1-branch/src/main/memory.c:2825
#2 0x00007ffff789e99b in Rf_allocVector3 (type=13, length=100000, allocator=0x0) at /home/mtmorgan/src/R-3-1-branch/src/main/memory.c:2563
#3 0x00007ffff788e1a5 in Rf_allocVector (type=13, length=100000) at /home/mtmorgan/src/R-3-1-branch/src/include/Rinlinedfuns.h:189
#4 0x00007ffff7831787 in duplicate1 (s=0x7ffff3b0b010, deep=TRUE) at /home/mtmorgan/src/R-3-1-branch/src/main/duplicate.c:335
#5 0x00007ffff783371a in duplicate_child (s=0x7ffff3b0b010, deep=TRUE) at /home/mtmorgan/src/R-3-1-branch/src/main/duplicate.c:199
#6 0x00007ffff783357a in duplicate_list (s=0x2c98b30, deep=TRUE) at /home/mtmorgan/src/R-3-1-branch/src/main/duplicate.c:261
#7 0x00007ffff7830fc2 in duplicate1 (s=0x2c98b30, deep=TRUE) at /home/mtmorgan/src/R-3-1-branch/src/main/duplicate.c:308
#8 0x00007ffff783371a in duplicate_child (s=0x2c98b30, deep=TRUE) at /home/mtmorgan/src/R-3-1-branch/src/main/duplicate.c:199
#9 0x00007ffff783357a in duplicate_list (s=0x2c98a88, deep=TRUE) at /home/mtmorgan/src/R-3-1-branch/src/main/duplicate.c:261
#10 0x00007ffff7830fc2 in duplicate1 (s=0x2c98a88, deep=TRUE) at /home/mtmorgan/src/R-3-1-branch/src/main/duplicate.c:308
#11 0x00007ffff7830c7f in Rf_duplicate (s=0x2c98a88) at /home/mtmorgan/src/R-3-1-branch/src/main/duplicate.c:132
#12 0x00007ffff79257f4 in do_summary (call=0x2c98a88, op=0x6259a0, args=0x303cf88, env=0x2c97f48) at /home/mtmorgan/src/R-3-1-branch/src/main/summary.c:462
...
The relevant line here is line 462 of summary.c
(gdb) up 12
#12 0x00007ffff79257f4 in do_summary (call=0x2c98a88, op=0x6259a0, args=0x303cf88, env=0x2c97f48) at /home/mtmorgan/src/R-3-1-branch/src/main/summary.c:462
462 PROTECT(call2 = duplicate(call));
(gdb) list
457 return ans;
458 }
459
460 /* match to foo(..., na.rm=FALSE) */
461 PROTECT(args = fixup_NaRm(args));
462 PROTECT(call2 = duplicate(call));
463 SETCDR(call2, args);
464
465 if (DispatchGroup("Summary", call2, op, args, env, &ans)) {
466 UNPROTECT(2);
and the call being duplicated is
(gdb) call Rf_PrintValue(call)
FUN(1:100000[[5339L]], ...)
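for each iteration of the loop, which triggers garbage collection. Similar code is not executed for cumsum. This has been in place for a long time, for reasons that are not 100% obvious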
$ svn annotate ~/src/R-3-1-branch/src/main/summary.c |less
...
42643 ripley /* match to foo(..., na.rm=FALSE) */
42643 ripley PROTECT(args = fixup_NaRm(args));
42643 ripley PROTECT(call2 = duplicate(call));
42643 ripley SETCDR(call2, args);
...
$ svn log -r42643
------------------------------------------------------------------------
r42643 | ripley | 2007-08-25 23:09:50 -0700 (Sat, 25 Aug 2007) | 1 line
make the rest of the group generics primitive
------------------------------------------------------------------------
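It would be interesting to take this up on the R-devel mailing list. It is not that sum is particularly slow, but rather that the calls to the garbage collector come to dominate execution time.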
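One rough way to check the garbage-collection share from within R, without a system profiler, is to compare gc.time() before and after the call. This is only a sketch: time_with_gc is a hypothetical helper name of mine, not part of R, and the GC timer has limited resolution, so the numbers are approximate.

```r
# Hypothetical helper (not part of R): report total elapsed time of an
# expression alongside the elapsed time R spent in garbage collection.
time_with_gc <- function(expr) {
  gc()                              # start from a freshly collected heap
  gc_before <- gc.time()            # cumulative GC time so far (also enables GC timing)
  # 'expr' is a promise, so it is evaluated inside system.time();
  # gcFirst = FALSE avoids an extra collection we would otherwise count
  total <- system.time(expr, gcFirst = FALSE)
  gc_spent <- gc.time() - gc_before
  c(total_elapsed = total[["elapsed"]],
    gc_elapsed    = unname(gc_spent[3]))   # third element is elapsed time
}

# Use a smaller input than 1e5 for a quick check
res <- time_with_gc(sapply(1:1e4, sum))
print(res)
```

If gc_elapsed is a large fraction of total_elapsed, that agrees with the operf profile above. On R versions where this behaviour has changed, the fraction may be small.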
Hmm, on reflection
sapply(1:1e5, function(x) sum(x))
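runs in the same ballpark as cumsum. I think this is because the duplicate at line 462 in the original version makes a copy of all 1e5 elements, in preparation for selecting the i-th element to sum. In contrast, in function(x) sum(x) the vector has already been subsetted, so only the i-th element is duplicated. Duplicating the original vector also explains why 1e5 elements is so much slower than 1e4 elements, and why as.list(1:1e5) is relatively performant (only the list element is actually duplicated, or maybe not even that). The duplication during the call to sum has to do with the fact that it belongs to the (S3) Summary group generic, see ?"group generic".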
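The three variants discussed above can be timed side by side with something like the following. This is a minimal sketch; the variable names are mine, and the absolute numbers depend on machine and R version (on versions of R where the duplication no longer happens, the first timing may not stand out at all).

```r
x <- 1:1e5

# sum passed directly: the call that gets duplicated carries the whole vector
t_direct  <- system.time(r1 <- sapply(x, sum))

# sum wrapped in a closure: the vector is already subsetted when sum is called
t_wrapped <- system.time(r2 <- sapply(x, function(i) sum(i)))

# list input: each element is a separate length-1 vector
t_list    <- system.time(r3 <- sapply(as.list(x), sum))

# all three compute the same result
stopifnot(identical(r1, r2), identical(r1, r3))

timings <- c(direct  = t_direct[["elapsed"]],
             wrapped = t_wrapped[["elapsed"]],
             as_list = t_list[["elapsed"]])
print(timings)
```

Repeating this with 1:1e4 instead of 1:1e5 should show whether the direct version scales worse than linearly, as the copy-the-whole-vector explanation would predict.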