【Question title】: Python app in Docker container doesn't stop/remove Docker container when app fails
【Posted】: 2021-09-15 18:12:48
【Question description】:

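I have a Python application that polls a queue for new data and inserts it into a TimescaleDB database (TimescaleDB is an extension of PostgreSQL). This application must stay running at all times.

The problem is that the Python program may fail from time to time, and I expect Docker Swarm to restart the container. However, the container keeps running even after a failure. Why doesn't my container fail and then get restarted by Docker Swarm?

The Python application looks like this: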

def main():
    try:
        conn = get_db_conn()
        insert_data(conn)
    except Exception:
        logger.exception("Error with main inserter.py function")
        send_email_if_error()
        raise
    finally:
        try:
            conn.close()
            del conn
        except Exception:
            pass

        return 0


if __name__ == "__main__":
    main()

The Dockerfile looks like this:

FROM python:3.8-slim-buster

# Configure apt and install packages
RUN apt-get update && \
    apt-get -y --no-install-recommends install cron nano procps

# Install Python requirements.
RUN pip3 install --upgrade pip && \
    pip3 install poetry==1.0.10

COPY poetry.lock pyproject.toml /
RUN poetry config virtualenvs.create false && \
  poetry install --no-interaction --no-ansi

# Copy everything to the / folder inside the container
COPY . /

# Make /var/log the default directory in the container
WORKDIR /var/log

# Start Python app on container startup
CMD ["python3", "/inserter/inserter.py"]

The Docker Compose file:

version: '3.7'
services:
  inserter13:
    # Name and tag of image the Dockerfile creates
    image: mccarthysean/ijack:timescale
    depends_on: 
      - timescale13
    env_file: .env
    environment: 
      POSTGRES_HOST: timescale13
    networks:
      - traefik-public
    deploy:
      # Either global (exactly one container per physical node) or
      # replicated (a specified number of containers). The default is replicated
      mode: replicated
      # For stateless applications using "replicated" mode,
      # the total number of replicas to create
      replicas: 2
      restart_policy:
        on-failure # default is 'any'

  timescale13:
    image: timescale/timescaledb:2.3.0-pg13
    volumes: 
      - type: volume
        source: ijack-timescale-db-pg13
        target: /var/lib/postgresql/data # the location in the container where the data are stored
        read_only: false
      # Custom postgresql.conf file will be mounted (see command: as well)
      - type: bind
        source: ./postgresql_custom.conf
        target: /postgresql_custom.conf
        read_only: false
    env_file: .env
    command: ["-c", "config_file=/postgresql_custom.conf"]
    ports:
      - 0.0.0.0:5432:5432
    networks:
      traefik-public:
    deploy:
      # Either global (exactly one container per physical node) or
      # replicated (a specified number of containers). The default is replicated
      mode: replicated
      # For stateless applications using "replicated" mode,
      # the total number of replicas to create
      replicas: 1
      placement:
        constraints:
          # Since this is for the stateful database,
          # only run it on the swarm manager, not on workers
          - "node.role==manager"
      restart_policy:
        condition: on-failure # default is 'any'


# Use a named external volume to persist our data
volumes:
  ijack-timescale-db-pg13:
    external: true

networks:
  # Use the previously created public network "traefik-public", shared with other
  # services that need to be publicly available via this Traefik
  traefik-public:
    external: true

The "docker-compose.build.yml" file I use to build the "inserter.py" container image:

version: '3.7'
services:
  inserter:
    # Name and tag of image the Dockerfile creates
    image: mccarthysean/ijack:timescale
    build:
      # context: where should docker-compose look for the Dockerfile?
      # i.e. either a path to a directory containing a Dockerfile, or a url to a git repository
      context: .
      dockerfile: Dockerfile.inserter
    environment: 
      POSTGRES_HOST: timescale

The Bash script I run, which builds, pushes, and deploys the database and inserter containers with Docker Swarm:

#!/bin/bash

# Build and tag image locally in one step. 
# No need for docker tag <image> mccarthysean/ijack:<tag>
echo ""
echo "Building the image locally..."
echo "docker-compose -f docker-compose.build.yml build"
docker-compose -f docker-compose.build.yml build

# Push to Docker Hub
# docker login --username=mccarthysean
echo ""
echo "Pushing the image to Docker Hub..."
echo "docker push mccarthysean/ijack:timescale"
docker push mccarthysean/ijack:timescale

# Deploy to the Docker swarm and send login credentials 
# to other nodes in the swarm with "--with-registry-auth"
echo ""
echo "Deploying to the Docker swarm..."
echo "docker stack deploy --with-registry-auth -c docker-compose.prod13.yml timescale13"
docker stack deploy --with-registry-auth -c docker-compose.prod13.yml timescale13

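When the Python inserter fails (a database connection problem or something else), it sends me an email alert, then raises the error and fails. At that point I expect the Docker container to fail and be restarted under Docker Swarm's restart_policy: on-failure. However, after the error, docker service ls shows 0/2 replicas: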

ID                  NAME                                        MODE                REPLICAS            IMAGE                                         PORTS
u354h0uj4ug6        timescale13_inserter13                      replicated          0/2                 mccarthysean/ijack:timescale
o0rbfx5n2z4h        timescale13_timescale13                     replicated          1/1                 timescale/timescaledb:2.3.0-pg13              *:5432->5432/tcp

When it is healthy (most of the time), it shows 2/2 replicas. Why doesn't my container fail and then get restarted by Docker Swarm?

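【Comments】:

  • Can you try replacing CMD with ENTRYPOINT in your Dockerfile?
  • @edijon I made a fake Python app to run and used ENTRYPOINT instead of CMD in the Dockerfile. It made no difference. In fact, whether I use ENTRYPOINT or CMD, my fake Python program keeps restarting as desired... I can't seem to reproduce my earlier restart failure (i.e. replicas 0/2), even though it has happened twice in the past month...
  • I suggested ENTRYPOINT to make sure Docker monitors the real PID of your process. Another check is to test whether the script ends with a non-zero exit status. Another approach is to start a shell in your container and check your script's PID.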
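
One way to run the exit-status check suggested in the last comment is to launch the script in a child process and inspect its return code. This is only a minimal sketch: the inline snippets stand in for the real script, and the helper name exit_status_of is hypothetical.

```python
import subprocess
import sys


def exit_status_of(source: str) -> int:
    """Run a Python snippet in a child interpreter and return its exit status."""
    result = subprocess.run(
        [sys.executable, "-c", source],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,  # hide the child's traceback
    )
    return result.returncode


# Stand-in for a script whose error escapes: the interpreter exits non-zero.
FAILING = "raise RuntimeError('boom')"

# Stand-in for a script that catches the error and then finishes normally.
SWALLOWING = """
try:
    raise RuntimeError('boom')
except Exception:
    pass
"""

if __name__ == "__main__":
    print("failing script exit status:", exit_status_of(FAILING))
    print("swallowing script exit status:", exit_status_of(SWALLOWING))
```

For a container that has already stopped, the same number appears in the STATUS column of docker ps -a, for example "Exited (0)".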

Tags: python docker docker-compose docker-swarm


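【Solution 1】:

I figured it out, and I've updated my question with more detail about my try: except: failure routine.

Here is the error that occurred (actually two errors, as you can see):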

Here's the error information: 
Traceback (most recent call last):
  File "/inserter/inserter.py", line 357, in execute_sql
    cursor.execute(sql, values)
psycopg2.errors.AdminShutdown: terminating connection due to administrator command SSL connection has been closed unexpectedly


During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/inserter/inserter.py", line 911, in main
    insert_alarm_log_rds(
  File "/inserter/inserter.py", line 620, in insert_alarm_log_rds
    rc = execute_sql(
  File "/inserter/inserter.py", line 364, in execute_sql
    conn.rollback()
psycopg2.InterfaceError: connection already closed

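As you can see, a psycopg2.errors.AdminShutdown error occurred first, and it was raised inside my first try: except: routine. However, it was followed by a second error, psycopg2.InterfaceError, which actually occurred in my finally: cleanup code, followed by the pass statement and return 0. So I guess the earlier error was not re-raised and the code finished with exit code 0 instead of the non-zero code needed to trigger a restart.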
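
The effect described above can be reproduced without Docker or a database. The following is a minimal sketch with hypothetical function names. It shows that a return inside finally discards an exception that is still propagating, so the caller never sees it:

```python
def swallowing_main():
    """Mirrors the original main(): re-raise in except, then return in finally."""
    try:
        raise RuntimeError("simulated database failure")
    except Exception:
        # Re-raise so the caller (and ultimately the interpreter) sees the error.
        raise
    finally:
        # A return here replaces the in-flight exception with a normal return.
        return 0


def propagating_main():
    """Same idea without a return in finally: the exception escapes."""
    try:
        raise RuntimeError("simulated database failure")
    finally:
        pass  # cleanup would go here


if __name__ == "__main__":
    print("swallowing_main returned:", swallowing_main())
    try:
        propagating_main()
    except RuntimeError as exc:
        print("propagating_main raised:", exc)
```

Because the exception never leaves swallowing_main, a script built this way ends normally and the interpreter exits with status 0, which an on-failure restart policy does not treat as a failure.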

@edijon's comment about needing a non-zero exit code helped me solve this.

I needed to re-raise the error in the finally: routine, like so:

def main():
    try:
        conn = get_db_conn()
        insert_data(conn)
    except Exception:
        logger.exception("Error with main inserter.py function")
        send_email_if_error()
        raise
    finally:
        try:
            conn.close()
            del conn
        except Exception:
            # previously the following was just 'pass' 
            # and I changed it to 'raise' to ensure errors
            # cause a non-zero error code for Docker's 'restart_policy'
            raise

        # The following was previously "return 0". A return inside
        # "finally" discards any exception that is still propagating,
        # so the script exited with status 0 and the container was
        # not restarted. It is commented out so the error propagates.
        # return 0


if __name__ == "__main__":
    main()
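
A different way to structure the entry point, offered only as a sketch and not as the author's code, is to keep return out of finally entirely and have main() report an explicit status. Here get_db_conn and insert_data are stand-in stubs for the functions of the same name in the question, while FakeConn and the simulate_failure flag are hypothetical and exist only to make the example runnable.

```python
import logging

logger = logging.getLogger("inserter")


class FakeConn:
    """Hypothetical stand-in for a database connection."""

    def close(self):
        pass


def get_db_conn():
    # Stub: the real function would open a connection to TimescaleDB.
    return FakeConn()


def insert_data(conn, simulate_failure):
    # Stub: optionally simulate the kind of failure reported in the question.
    if simulate_failure:
        raise RuntimeError("simulated insert failure")


def main(simulate_failure=False):
    conn = None
    try:
        conn = get_db_conn()
        insert_data(conn, simulate_failure)
    except Exception:
        logger.exception("Error with main inserter.py function")
        return 1  # non-zero status signals failure to the caller
    finally:
        # Cleanup only: no return here, so nothing above is overridden.
        if conn is not None:
            try:
                conn.close()
            except Exception:
                logger.warning("Could not close the connection", exc_info=True)
    return 0


if __name__ == "__main__":
    print("status on success:", main())
    print("status on failure:", main(simulate_failure=True))
```

In a real script the entry point would call sys.exit(main()) under the usual if __name__ == "__main__": guard, so that the return value becomes the process exit status that the restart policy reacts to.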
