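Supervisor exception on log rollover causes app server to freeze? Answers

【Question title】: Supervisor exception on log rollover causes app server to freeze?
【Posted】: 2014-04-02 20:41:27
【Question description】:

I'm running a Flask app with gunicorn on an EC2 server. I use supervisord to monitor and restart the app server. Yesterday the server stopped responding to HTTP requests. We checked the status with supervisorctl, and it showed the process as RUNNING. We then looked at the supervisor logs and saw the following error: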

CRIT uncaptured python exception, closing channel <POutputDispatcher at 34738328
for <Subprocess at 34314576 with name flask in state RUNNING> (stdout)>
(<type 'exceptions.OSError'>:[Errno 2] No such file or directory

[/usr/local/lib/python2.7/dist-packages/supervisor/supervisord.py|runforever|233] 
[/usr/local/lib/python2.7/dist-packages/supervisor/dispatchers.py|handle_read_event|231] 
[/usr/local/lib/python2.7/dist-packages/supervisor/dispatchers.py|record_output|165] 
[/usr/local/lib/python2.7/dist-packages/supervisor/dispatchers.py|_log|141]
[/usr/local/lib/python2.7/dist-packages/supervisor/loggers.py|info|273] 
[/usr/local/lib/python2.7/dist-packages/supervisor/loggers.py|log|291] 
[/usr/local/lib/python2.7/dist-packages/supervisor/loggers.py|emit|186]
[/usr/local/lib/python2.7/dist-packages/supervisor/loggers.py|doRollover|220])

Restarting supervisord fixed the problem for us. Here is the relevant part of our supervisor configuration:

[supervisord]
childlogdir = /var/log/supervisord/
logfile = /var/log/supervisord/supervisord.log
logfile_maxbytes = 50MB
logfile_backups = 10
loglevel = info
pidfile = /var/log/supervisord/supervisord.pid
umask = 022
nodaemon = false
nocleanup = false

[program:flask]
directory=%(here)s
environment=PATH="/home/ubuntu/.virtualenvs/flask/bin"
command=newrelic-admin run-program gunicorn app:app -c gunicorn_conf.py
autostart=true
autorestart=true
redirect_stderr=true

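Oddly, we have 2 servers running behind an ELB, and both hit the same problem within 10 minutes of each other. My guess is that the logs on both reached the size limit at about the same time (plausible, since they see roughly the same traffic) and the rollover failed. Any ideas why this would happen?

【Comments】:

I updated my answer based on your comment.

Tags: python logging flask gunicorn supervisord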

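【Solution 1】:

AFAIK supervisor uses its own logging implementation rather than the one in the Python standard library, although the class and method names are very similar.

There may be a race condition when files are deleted during rollover. You would need to check the source code of your specific supervisor version and compare it with the latest supervisor release (if they differ). Here is an excerpt from the supervisor code on my system (in the doRollover() method):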

try:
    os.remove(dfn)
except OSError, why:
    # catch race condition (already deleted)
    if why[0] != errno.ENOENT:
        raise

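If your rollover code doesn't do this, you may need to upgrade your supervisor version.

Update: if the error occurs on the rename, it may be a race condition that isn't caught yet. Consider asking on the supervisor mailing list.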

【Comments】:

It does do that. The line that raises the exception is os.rename(sfn, dfn). It looks like sfn is missing, and the code doesn't check whether it exists.
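
The failure reported in this comment can be illustrated with a small sketch. This is not supervisor's actual code: `safe_rename` is a hypothetical helper, written for Python 3, that applies the same ENOENT-tolerant pattern from the `os.remove` excerpt to the rename step, so a missing source file no longer raises out of the rollover.

```python
import errno
import os
import tempfile


def safe_rename(sfn, dfn):
    """Rename sfn to dfn; return False instead of raising if sfn is missing.

    Hypothetical helper for illustration only, not part of supervisor.
    """
    try:
        os.rename(sfn, dfn)
    except OSError as why:
        # Tolerate only "No such file or directory"; re-raise anything else.
        if why.errno != errno.ENOENT:
            raise
        return False
    return True


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        log = os.path.join(tmp, "flask-stdout.log")
        backup = log + ".1"
        with open(log, "w") as f:
            f.write("line\n")
        print(safe_rename(log, backup))  # source file exists
        print(safe_rename(log, backup))  # source file is already gone
```

Swallowing the error only hides the symptom. It does not explain why the source log file disappeared in the first place, which is still worth investigating.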

【Solution 2】:

In the supervisor program section (e.g. [program:flask]), you need to set:

stdout_logfile_maxbytes=0
stderr_logfile_maxbytes=0
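
A value of 0 for these options means an unlimited log size, so supervisor never rolls the child log over. As a rough sketch of how one might check a config for this, the snippet below parses supervisor-style INI text with Python's standard `configparser` and lists the program sections whose stdout log can still be rotated. The `SAMPLE` text, the `worker` program, and the `programs_with_rotation` helper are all hypothetical, and the sketch does not handle inline `;` comments after a value.

```python
import configparser

# Hypothetical config text for illustration; "worker" is an invented program.
SAMPLE = """
[program:flask]
directory=%(here)s
command=gunicorn app:app -c gunicorn_conf.py
redirect_stderr=true
stdout_logfile_maxbytes=0
stderr_logfile_maxbytes=0

[program:worker]
command=python worker.py
"""


def programs_with_rotation(config_text):
    """Return names of [program:x] sections where stdout rotation is not disabled."""
    # interpolation=None leaves supervisor's %(here)s expansion untouched.
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(config_text)
    flagged = []
    for section in parser.sections():
        if not section.startswith("program:"):
            continue
        value = parser.get(section, "stdout_logfile_maxbytes", fallback=None)
        # An unset option falls back to supervisor's default, which rotates.
        if value is None or value.strip() != "0":
            flagged.append(section.split(":", 1)[1])
    return flagged


if __name__ == "__main__":
    print(programs_with_rotation(SAMPLE))
```

Because the question's config sets redirect_stderr=true, stderr output goes to the stdout log, so only the stdout setting is checked here.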

【Comments】: