【发布时间】:2019-12-19 19:24:46
【问题描述】:
我们有两个 TypeScript 应用程序,都是通过 CRA 创建的,还有一个 CI 管道,它运行一系列 npm 命令来运行测试/lint 并为后续阶段构建应用程序:
time npm install --no-optional --unsafe-perm
npm test -- --coverage
npm run tsc
npm run lint
export REACT_APP_VERSION=$VERSION
export REACT_APP_COMMIT=$GIT_COMMIT
npm run build
npm run build-storybook
我们的 CI 管道在 Jenkins 中运行,我们使用 kubernetes 插件来按需获取执行器。
该脚本通过Jenkinsfile 中的以下逻辑为app1 和app2 并行运行:
stage('Frontend - App1') {
agent {
kubernetes {
label 'node'
defaultContainer 'jnlp'
yamlFile 'infrastructure/scripts/ci/pod-templates/node.yaml'
idleMinutes 30
}
}
environment {
CI = 'true'
NPMRC_SECRET_FILE_PATH = credentials('verdaccio-npmrc')
}
steps {
dir('frontend/app1') {
container('node') {
sh 'cp $NPMRC_SECRET_FILE_PATH ~/.npmrc'
sh 'chmod u+rw ~/.npmrc'
sh '../../infrastructure/scripts/ci/build-frontend.sh'
}
publishHTML(target: [
allowMissing : false,
alwaysLinkToLastBuild: false,
keepAll : true,
reportDir : 'coverage',
reportFiles : 'index.html',
reportName : "Coverage Report (app1)"
])
junit 'testing/junit.xml'
stash includes: 'build/**/*', name: 'app1-build'
stash includes: 'storybook-static/**/*', name: 'app1-storybook-build'
}
}
}
那么,看看我们所看到的。昨天我们反复看到相同的症状:app1 的前端阶段将完成(两者中较小的一个),而app2 将神秘地停止在运行测试的中间(登录 Jenkins 的最后一行总是@987654329 @,但并不总是相同的测试)。在被我们的管道超时(或无聊的开发人员)杀死之前,它将保持这种状态整整一个小时。
我们运行kubectl exec -it node-blah sh 以进入运行卡住阶段的吊舱并进行一些诊断。运行ps aux | cat 给了我们这个:
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
node 1 0.0 0.0 4396 720 ? Ss+ 08:51 0:00 cat
node 17 0.0 0.0 0 0 ? Z 08:51 0:00 [sh] <defunct>
node 32 0.0 0.0 0 0 ? Z 08:51 0:00 [sh] <defunct>
node 47 0.0 0.0 0 0 ? Z 08:51 0:00 [sh] <defunct>
node 664 0.0 0.0 0 0 ? Z 09:04 0:00 [sh] <defunct>
.
.
.
node 6760 0.0 0.0 4340 108 ? S 10:36 0:00 sh -c (pid=$$; { while [ \( -d /proc/$pid -o \! -d /proc/$$ \) -a -d '/home/jenkins/workspace/app_master/frontend/app2@tmp/durable-f617acc8' -a \! -f '/home/jenkins/workspace/app_master/frontend/app2@tmp/durable-f617acc8/jenkins-result.txt' ]; do touch '/home/jenkins/workspace/app_master/frontend/app2@tmp/durable-f617acc8/jenkins-log.txt'; sleep 3; done } & jsc=durable-508a7912908a6919b577783c49df638d; JENKINS_SERVER_COOKIE=$jsc 'sh' -xe '/home/jenkins/workspace/app_master/frontend/app2@tmp/durable-f617acc8/script.sh' > '/home/jenkins/workspace/app_master/frontend/app2@tmp/durable-f617acc8/jenkins-log.txt' 2>&1; echo $? > '/home/jenkins/workspace/app_master/frontend/app2@tmp/durable-f617acc8/jenkins-result.txt.tmp'; mv '/home/jenkins/workspace/app_master/frontend/app2@tmp/durable-f617acc8/jenkins-result.txt.tmp' '/home/jenkins/workspace/app_master/frontend/app2@tmp/durable-f617acc8/jenkins-result.txt'; wait) >&- 2>&- &
node 6761 0.0 0.0 4340 1060 ? S 10:36 0:00 sh -c (pid=$$; { while [ \( -d /proc/$pid -o \! -d /proc/$$ \) -a -d '/home/jenkins/workspace/app_master/frontend/app2@tmp/durable-f617acc8' -a \! -f '/home/jenkins/workspace/app_master/frontend/app2@tmp/durable-f617acc8/jenkins-result.txt' ]; do touch '/home/jenkins/workspace/app_master/frontend/app2@tmp/durable-f617acc8/jenkins-log.txt'; sleep 3; done } & jsc=durable-508a7912908a6919b577783c49df638d; JENKINS_SERVER_COOKIE=$jsc 'sh' -xe '/home/jenkins/workspace/app_master/frontend/app2@tmp/durable-f617acc8/script.sh' > '/home/jenkins/workspace/app_master/frontend/app2@tmp/durable-f617acc8/jenkins-log.txt' 2>&1; echo $? > '/home/jenkins/workspace/app_master/frontend/app2@tmp/durable-f617acc8/jenkins-result.txt.tmp'; mv '/home/jenkins/workspace/app_master/frontend/app2@tmp/durable-f617acc8/jenkins-result.txt.tmp' '/home/jenkins/workspace/app_master/frontend/app2@tmp/durable-f617acc8/jenkins-result.txt'; wait) >&- 2>&- &
node 6762 0.0 0.0 4340 812 ? S 10:36 0:00 sh -xe /home/jenkins/workspace/app_master/frontend/app2@tmp/durable-f617acc8/script.sh
node 6764 0.0 0.0 20096 2900 ? S 10:36 0:00 /bin/bash ../../infrastructure/scripts/ci/build-frontend.sh
node 6804 0.0 0.5 984620 38552 ? Sl 10:37 0:00 npm
node 6816 0.0 0.0 4356 836 ? S 10:37 0:00 sh -c react-app-rewired test --reporters default --reporters jest-junit "--coverage"
node 6817 0.0 0.4 877704 30220 ? Sl 10:37 0:00 node /home/jenkins/workspace/app_master/frontend/app2/node_modules/.bin/react-app-rewired test --reporters default --reporters jest-junit --coverage
node 6823 0.4 1.3 1006148 97108 ? Sl 10:37 0:06 node /home/jenkins/workspace/app_master/frontend/app2/node_modules/react-app-rewired/scripts/test.js --reporters default --reporters jest-junit --coverage
node 6881 2.8 2.6 1065992 194076 ? Sl 10:37 0:41 /usr/local/bin/node /home/jenkins/workspace/app_master/frontend/app2/node_modules/jest-worker/build/child.js
node 6886 2.8 2.6 1067004 195748 ? Sl 10:37 0:40 /usr/local/bin/node /home/jenkins/workspace/app_master/frontend/app2/node_modules/jest-worker/build/child.js
node 6898 2.9 2.5 1058872 187360 ? Sl 10:37 0:43 /usr/local/bin/node /home/jenkins/workspace/app_master/frontend/app2/node_modules/jest-worker/build/child.js
node 6905 2.8 2.4 1054256 183492 ? Sl 10:37 0:42 /usr/local/bin/node /home/jenkins/workspace/app_master/frontend/app2/node_modules/jest-worker/build/child.js
node 6910 2.8 2.6 1067812 196344 ? Sl 10:37 0:41 /usr/local/bin/node /home/jenkins/workspace/app_master/frontend/app2/node_modules/jest-worker/build/child.js
node 6911 2.7 2.6 1063680 191088 ? Sl 10:37 0:40 /usr/local/bin/node /home/jenkins/workspace/app_master/frontend/app2/node_modules/jest-worker/build/child.js
node 6950 0.8 1.9 1018536 145396 ? Sl 10:38 0:11 /usr/local/bin/node /home/jenkins/workspace/app_master/frontend/app2/node_modules/jest-worker/build/child.js
node 7833 0.0 0.0 4340 804 ? Ss 10:59 0:00 sh
node 7918 0.0 0.0 4240 652 ? S 11:01 0:00 sleep 3
node 7919 0.0 0.0 17508 2048 ? R+ 11:01 0:00 ps aux
node 7920 0.0 0.0 4396 716 ? S+ 11:01 0:00 cat
来自ps上的手册:
S interruptible sleep (waiting for an event to complete)
l is multi-threaded (using CLONE_THREAD, like NPTL pthreads do)
所以我认为这表明测试已经开始正常运行,产生子进程以并行运行它们,然后在 40 秒左右后出于任何原因,这些进程都进入睡眠状态并且不再做任何事情.
我们很困惑如何进一步调查这个问题,特别是因为我们很难将我们喜欢的任何东西安装到 pod 中以进行进一步调查(没有简单的 root 访问权限)......但是任何建议的理论/ 下一步将受到欢迎!
** 编辑 **
似乎idleMinutes 不是罪魁祸首,因为今天我们已经多次看到该问题在恢复后再次发生。我已经能够验证该脚本是否在 kubernetes 中的一个全新节点中运行,该节点以前没有被任何其他构建使用过。所以现在我不知道最近发生了什么变化以使这种情况开始发生:(
【问题讨论】:
标签: node.js jenkins npm jestjs create-react-app