【发布时间】:2014-04-16 02:49:34
【问题描述】:
我有一个长时间运行的 node.js 进程,有时会跳转到 100% CPU 并停止响应请求。最近一次这样做是我将strace 附加到进程中,这就是我所看到的:
Process 19523 attached - interrupt to quit
gettimeofday({1394558574, 234192}, NULL) = 0
gettimeofday({1394558574, 235381}, NULL) = 0
gettimeofday({1394558574, 306460}, NULL) = 0
mmap(0x3edab823a000, 2097152, PROT_NONE, MAP_PRIVATE|MAP_ANONYMOUS|MAP_NORESERVE, -1, 0) = 0x3edab823a000
munmap(0x3edab823a000, 811008) = 0
munmap(0x3edab8400000, 237568) = 0
mmap(0x3edab8300000, 1048576, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x3edab8300000
gettimeofday({1394558574, 316435}, NULL) = 0
gettimeofday({1394558574, 316576}, NULL) = 0
gettimeofday({1394558574, 316677}, NULL) = 0
gettimeofday({1394558574, 316775}, NULL) = 0
gettimeofday({1394558574, 316879}, NULL) = 0
gettimeofday({1394558574, 316978}, NULL) = 0
gettimeofday({1394558574, 317439}, NULL) = 0
gettimeofday({1394558574, 317537}, NULL) = 0
gettimeofday({1394558574, 318139}, NULL) = 0
gettimeofday({1394558574, 318234}, NULL) = 0
gettimeofday({1394558574, 318343}, NULL) = 0
gettimeofday({1394558574, 318437}, NULL) = 0
gettimeofday({1394558574, 318530}, NULL) = 0
gettimeofday({1394558574, 318624}, NULL) = 0
gettimeofday({1394558574, 319135}, NULL) = 0
gettimeofday({1394558574, 319648}, NULL) = 0
gettimeofday({1394558574, 319769}, NULL) = 0
gettimeofday({1394558574, 319975}, NULL) = 0
futex(0x7f5b380008c8, FUTEX_WAKE_PRIVATE, 1) = 1
gettimeofday({1394558574, 322266}, NULL) = 0
gettimeofday({1394558574, 322426}, NULL) = 0
gettimeofday({1394558574, 322520}, NULL) = 0
gettimeofday({1394558574, 322759}, NULL) = 0
gettimeofday({1394558574, 322853}, NULL) = 0
gettimeofday({1394558574, 322995}, NULL) = 0
futex(0x7f5b380008c8, FUTEX_WAKE_PRIVATE, 1) = 1
gettimeofday({1394558574, 417614}, NULL) = 0
gettimeofday({1394558575, 386566}, NULL) = 0
gettimeofday({1394558575, 387704}, NULL) = 0
gettimeofday({1394558575, 463410}, NULL) = 0
mmap(0x24cfd260f000, 2097152, PROT_NONE, MAP_PRIVATE|MAP_ANONYMOUS|MAP_NORESERVE, -1, 0) = 0x24cfd260f000
munmap(0x24cfd260f000, 987136) = 0
munmap(0x24cfd2800000, 61440) = 0
mmap(0x24cfd2700000, 1048576, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x24cfd2700000
gettimeofday 的电话很多,其他的不多!什么可能导致 node.js 像这样卡住?
更新:我从旧版本的 node.js 升级到了 10.29(我认为),然后就消失了。我刚刚升级到 10.33,问题又回来了。这次我在调试它方面取得了更多进展。第一条:
$ sudo strace -p 11812 -c
Process 11812 attached - interrupt to quit
Process 11812 detached
% time seconds usecs/call calls errors syscall
------ ----------- ----------- --------- --------- ----------------
100.00 0.000192 0 2780 gettimeofday
0.00 0.000000 0 1390 getrusage
0.00 0.000000 0 31 futex
0.00 0.000000 0 1390 clock_gettime
------ ----------- ----------- --------- --------- ----------------
100.00 0.000192 5591 total
从Node.js: How to attach to a running process and to debug the server with a console? 我发现了如何将节点调试器附加到正在运行的进程,这就是我得到的地方:
$ sudo kill -s SIGUSR1 11812
$ sudo node debug -p 11812
connecting... ok
break in timers.js:79
77 }
78
79 function listOnTimeout() {
80 var msecs = this.msecs;
81 var list = this;
debug> bt
#0 timers.js:79:23
所以看起来它肯定与计时器有关,但我不知道如何进一步向上移动堆栈以找出问题在我的代码中被触发的位置。
【问题讨论】:
-
您使用的是“时间条件变量”吗? npmjs.org/package/condvar
-
@AlexejMagura 这是一个大型应用程序,所以我不能全部发布,而且我不知道从哪里开始将其追溯到负责的代码。
-
我建议使用分析器找出泄漏的来源,而不是通过更改代码手动搜索。看看:github.com/felixge/node-memory-leak-tutorial 和/或 github.com/node-inspector/node-inspector
-
您是否尝试过通过 JSHint 或其他分析工具运行所有代码?我怀疑这是一个永无止境的循环。
-
从目前为止所说的一般情况下考虑这一点:您是否有任何类型的队列正在备份“重试”类型的事件,这些事件最终会得到处理?您说它最终恢复正常的事实让我认为某处(在您的代码或库中)可能是每个都在争夺/检查某种资源的可用性的事物的瓶颈。在正常操作中可能微不足道,但如果出现后备情况......在使用量等方面出现问题时是否有任何线索?
标签: javascript linux node.js strace