[Question title]: Docker container random segmentation fault
[Posted]: 2020-07-07 13:43:37
[Question]:

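I am trying to run an application in a Docker container, but the program segfaults at random. Sometimes the code runs as expected. Other times, when I interrupt it (Ctrl + C) and run it again, it crashes with a segmentation fault.

Below are my Dockerfile and the gdb output. I can see that the problem comes down to cv2.VideoCapture, but the fixes I have tried (such as setting locales) did not work. On the host (that is, outside the container), the code runs fine. Any help would be greatly appreciated.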

Dockerfile:

FROM nvidia/cuda:10.2-devel

ENV DEBIAN_FRONTEND noninteractive
RUN apt-get update && apt-get install -y \
        python3-opencv locales gdb python3-dbg ca-certificates python3-dev git wget sudo unzip vim \
        libx11-dev libxfixes-dev libxi-dev \
        libxcb1-dev libx11-xcb-dev libxcb-glx0-dev \
        libdbus-1-dev libxkbcommon-dev libxkbcommon-x11-dev \
        zlib1g-dev libgl1-mesa-dev libfontconfig1-dev \
        cmake ninja-build protobuf-compiler libprotobuf-dev build-essential wget libssl1.0-dev > /dev/null && \
  rm -rf /var/lib/apt/lists/*
RUN ln -sv /usr/bin/python3 /usr/bin/python

RUN locale-gen en_US.UTF-8
ENV LANG en_US.UTF-8
ENV LANGUAGE en_US:en
ENV LC_ALL en_US.UTF-8

# create a non-root user
ARG USER_ID=1000
RUN useradd -m --no-log-init --system  --uid ${USER_ID} appuser -g sudo
RUN echo '%sudo ALL=(ALL) NOPASSWD:ALL' >> /etc/sudoers
USER appuser
WORKDIR /home/appuser

ENV PATH="/home/appuser/.local/bin:${PATH}"
RUN wget https://bootstrap.pypa.io/get-pip.py && \
        python3 get-pip.py --user && \
        rm get-pip.py

# install dependencies
# See https://pytorch.org/ for other options if you use a different version of CUDA
RUN pip install --user tensorboard cython
RUN pip install --user torch torchvision
RUN pip install --user 'git+https://github.com/cocodataset/cocoapi.git#subdirectory=PythonAPI'
RUN pip install --user pandas scipy
RUN pip install --user flask flask-cors requests azure-iot-device paho-mqtt
RUN pip install --user PyQt5 imutils QDarkStyle

RUN pip install --user 'git+https://github.com/facebookresearch/fvcore'
# install detectron2
RUN git clone https://github.com/facebookresearch/detectron2 detectron2_repo
# the following lines are used to check out a specific version
WORKDIR /home/appuser/detectron2_repo
RUN git checkout 3bdf3ab4a4626985b3581da0a5b9e8c534b56980
WORKDIR /home/appuser/

# set FORCE_CUDA because during `docker build` cuda is not accessible
ENV FORCE_CUDA="1"
# This will by default build detectron2 for all common cuda architectures and take a lot more time,
# because inside `docker build`, there is no way to tell which architecture will be used.
ARG TORCH_CUDA_ARCH_LIST="Kepler;Kepler+Tesla;Maxwell;Maxwell+Tegra;Pascal;Volta;Turing"
ENV TORCH_CUDA_ARCH_LIST="${TORCH_CUDA_ARCH_LIST}"

RUN pip install --user -e detectron2_repo


#-----------------------------------------------------
VOLUME /jvision

# Copy changed files accordingly
COPY detectron_jvision/core/detectron2 /home/appuser/detectron2_repo/detectron2
COPY demo/ /home/appuser/detectron2_repo/demo
RUN mkdir -p /home/appuser/detectron2_repo/imgs
COPY detectron_jvision/configs/ /home/appuser/detectron2_repo/configs/
COPY detectron_jvision/tools/ /home/appuser/detectron2_repo/tools/
#COPY testeqt.py /home/appuser/detectron2_repo

#RUN sudo cp -R /home/appuser/detectron2_repo/tools /home/appuser/detectron2_repo/detectron2/
#RUN sudo cp -R /home/appuser/detectron2_repo/configs /home/appuser/detectron2_repo/detectron2/

COPY testcases/ jvision/testcases
COPY run.sh /home/appuser/detectron2_repo/demo
COPY run_mask.sh /home/appuser/detectron2_repo
COPY run_mask2.sh /home/appuser/detectron2_repo

RUN /bin/bash -c 'sudo chmod +x /home/appuser/detectron2_repo/demo/run.sh'
RUN /bin/bash -c 'sudo chmod +x /home/appuser/detectron2_repo/run_mask.sh'
RUN /bin/bash -c 'sudo chmod +x /home/appuser/detectron2_repo/run_mask2.sh'
RUN /bin/bash -c 'sudo chmod 777 /home/appuser/detectron2_repo/demo/imgs'
RUN /bin/bash -c 'sudo chmod 777 /home/appuser/detectron2_repo/imgs'
#-----------------------------------------------------

# Update numpy
RUN pip install --user -U numpy

# Set a fixed model cache directory.
ENV FVCORE_CACHE="/tmp"
WORKDIR /home/appuser/detectron2_repo/demo
RUN wget # here I download some weights for my model
RUN sudo unzip -o weights_models.zip
RUN rm weights_models.zip
CMD ["/home/appuser/detectron2_repo/demo/run.sh"]

I run it with: docker run -it -p 5000:5000 --gpus all --name vision --memory='16g' jvision:latest /bin/bash

Then I try to run my code inside the container, and that is when the random segfaults start.

Here is the gdb output:

#0  __strcmp_ssse3 () at ../sysdeps/x86_64/multiarch/../strcmp.S:948
#1  0x00007f9d25b59bd7 in ?? () from /usr/lib/x86_64-linux-gnu/libltdl.so.7
#2  0x00007f9d25b5b984 in ?? () from /usr/lib/x86_64-linux-gnu/libltdl.so.7
#3  0x00007f9d25b5c1fa in lt_dlopenadvise () from /usr/lib/x86_64-linux-gnu/libltdl.so.7
#4  0x00007f9d25b5c2c0 in lt_dlopenext () from /usr/lib/x86_64-linux-gnu/libltdl.so.7
#5  0x00007f9d33dd5aa7 in gp_abilities_list_load_dir () from /usr/lib/x86_64-linux-gnu/libgphoto2.so.6
#6  0x00007f9d33dd5e19 in gp_abilities_list_load () from /usr/lib/x86_64-linux-gnu/libgphoto2.so.6
#7  0x00007f9d33dd8334 in gp_camera_autodetect () from /usr/lib/x86_64-linux-gnu/libgphoto2.so.6
#8  0x00007f9d39eecdd5 in ?? () from /usr/lib/x86_64-linux-gnu/libopencv_videoio.so.3.2
#9  0x00007f9d39ef2d65 in ?? () from /usr/lib/x86_64-linux-gnu/libopencv_videoio.so.3.2
#10 0x00007f9d39ef2e98 in ?? () from /usr/lib/x86_64-linux-gnu/libopencv_videoio.so.3.2
#11 0x00007f9d39ed43ea in cv::VideoCapture::open(cv::String const&, int) () from /usr/lib/x86_64-linux-gnu/libopencv_videoio.so.3.2
#12 0x00007f9d39ed45fe in cv::VideoCapture::VideoCapture(cv::String const&) () from /usr/lib/x86_64-linux-gnu/libopencv_videoio.so.3.2
#13 0x00007f9d3df071f0 in ?? () from /usr/lib/python3/dist-packages/cv2.cpython-36m-x86_64-linux-gnu.so
#14 0x000000000050a635 in _PyCFunction_FastCallDict (kwargs=<optimized out>, nargs=<optimized out>, args=<optimized out>, func_obj=<built-in function VideoCapture>) at ../Objects/methodobject.c:231
#15 _PyCFunction_FastCallKeywords (kwnames=<optimized out>, nargs=<optimized out>, stack=<optimized out>, func=<optimized out>) at ../Objects/methodobject.c:294
#16 call_function.lto_priv () at ../Python/ceval.c:4851
#17 0x000000000050bfb4 in _PyEval_EvalFrameDefault () at ../Python/ceval.c:3335
#18 0x0000000000507d64 in PyEval_EvalFrameEx (throwflag=0, f=Frame 0x7f9c72e281f8, for file /home/appuser/detectron2_repo/demo/multicam_new.py, line 64, in load_network_stream_thread ())
    at ../Python/ceval.c:754
#19 _PyEval_EvalCodeWithName.lto_priv.1820 () at ../Python/ceval.c:4166
#20 0x0000000000588d41 in PyEval_EvalCodeEx (closure=<optimized out>, kwdefs=<optimized out>, defcount=0, defs=0x0, kwcount=0, kws=0x7f9d3f81d060, argcount=<optimized out>, args=0x7f9d3f81d060, 
    locals=0x0, globals=<optimized out>, _co=<optimized out>) at ../Python/ceval.c:4187
#21 function_call.lto_priv () at ../Objects/funcobject.c:604
#22 0x000000000059fc4e in PyObject_Call () at ../Objects/abstract.c:2261
#23 0x000000000050d356 in do_call_core (kwdict={}, callargs=(), func=<function at remote 0x7f9c61cddb70>) at ../Python/ceval.c:5120
#24 _PyEval_EvalFrameDefault () at ../Python/ceval.c:3404
#25 0x0000000000509758 in PyEval_EvalFrameEx (throwflag=0, 
    f=Frame 0x7f9c72e2b238, for file /usr/lib/python3.6/threading.py, line 864, in run (self=<Thread(_target=<function at remote 0x7f9c61cddb70>, _name='Thread-1', _args=(), _kwargs={}, _daemonic=True, _ident=140309919069952, _tstate_lock=<_thread.lock at remote 0x7f9c72eada58>, _started=<Event(_cond=<Condition(_lock=<_thread.lock at remote 0x7f9c79db0418>, acquire=<built-in method acquire of _thread.lock object at remote 0x7f9c79db0418>, release=<built-in method release of _thread.lock object at remote 0x7f9c79db0418>, _waiters=<collections.deque at remote 0x7f9c7303fa08>) at remote 0x7f9cefecf828>, _flag=True) at remote 0x7f9d3e333fd0>, _is_stopped=False, _initialized=True, _stderr=<_io.TextIOWrapper at remote 0x7f9d3f804708>) at remote 0x7f9c731140b8>)) at ../Python/ceval.c:754
#26 _PyFunction_FastCall (globals=<optimized out>, nargs=140309919085112, args=<optimized out>, co=<optimized out>) at ../Python/ceval.c:4933
#27 fast_function.lto_priv () at ../Python/ceval.c:4968
#28 0x000000000050a48d in call_function.lto_priv () at ../Python/ceval.c:4872
#29 0x000000000050bfb4 in _PyEval_EvalFrameDefault () at ../Python/ceval.c:3335
#30 0x0000000000509758 in PyEval_EvalFrameEx (throwflag=0, 
    f=Frame 0x7f9c3c000b38, for file /usr/lib/python3.6/threading.py, line 916, in _bootstrap_inner (self=<Thread(_target=<function at remote 0x7f9c61cddb70>, _name='Thread-1', _args=(), _kwargs={}, _daemonic=True, _ident=140309919069952, _tstate_lock=<_thread.lock at remote 0x7f9c72eada58>, _started=<Event(_cond=<Condition(_lock=<_thread.lock at remote 0x7f9c79db0418>, acquire=<built-in method acquire of _thread.lock object at remote 0x7f9c79db0418>, release=<built-in method release of _thread.lock object at remote 0x7f9c79db0418>, _waiters=<collections.deque at remote 0x7f9c7303fa08>) at remote 0x7f9cefecf828>, _flag=True) at remote 0x7f9d3e333fd0>, _is_stopped=False, _initialized=True, _stderr=<_io.TextIOWrapper at remote 0x7f9d3f804708>) at remote 0x7f9c731140b8>)) at ../Python/ceval.c:754
#31 _PyFunction_FastCall (globals=<optimized out>, nargs=140308998261560, args=<optimized out>, co=<optimized out>) at ../Python/ceval.c:4933
#32 fast_function.lto_priv () at ../Python/ceval.c:4968
#33 0x000000000050a48d in call_function.lto_priv () at ../Python/ceval.c:4872
#34 0x000000000050bfb4 in _PyEval_EvalFrameDefault () at ../Python/ceval.c:3335
#35 0x0000000000508e55 in PyEval_EvalFrameEx (throwflag=0, 
    f=Frame 0x7f9c72e4d7e8, for file /usr/lib/python3.6/threading.py, line 884, in _bootstrap (self=<Thread(_target=<function at remote 0x7f9c61cddb70>, _name='Thread-1', _args=(), _kwargs={}, _daemonic=True, _ident=140309919069952, _tstate_lock=<_thread.lock at remote 0x7f9c72eada58>, _started=<Event(_cond=<Condition(_lock=<_thread.lock at remote 0x7f9c79db0418>, acquire=<built-in method acquire of _thread.lock object at remote 0x7f9c79db0418>, release=<built-in method release of _thread.lock object at remote 0x7f9c79db0418>, _waiters=<collections.deque at remote 0x7f9c7303fa08>) at remote 0x7f9cefecf828>, _flag=True) at remote 0x7f9d3e333fd0>, _is_stopped=False, _initialized=True, _stderr=<_io.TextIOWrapper at remote 0x7f9d3f804708>) at remote 0x7f9c731140b8>)) at ../Python/ceval.c:754
#36 _PyFunction_FastCall (globals=<optimized out>, nargs=<optimized out>, args=<optimized out>, co=<optimized out>) at ../Python/ceval.c:4933
#37 _PyFunction_FastCallDict () at ../Python/ceval.c:5035
#38 0x0000000000594931 in _PyObject_FastCallDict (kwargs=0x0, nargs=1, args=0x7f9c72e26df0, func=<function at remote 0x7f9d3e24e510>) at ../Objects/abstract.c:2310
#39 _PyObject_Call_Prepend (kwargs=0x0, args=<optimized out>, obj=<optimized out>, func=<function at remote 0x7f9d3e24e510>) at ../Objects/abstract.c:2373
#40 method_call.lto_priv () at ../Objects/classobject.c:314
#41 0x000000000059fc4e in PyObject_Call () at ../Objects/abstract.c:2261
#42 0x00000000005e11c2 in t_bootstrap () at ../Modules/_threadmodule.c:1000
#43 0x00000000006319a4 in pythread_wrapper (arg=<optimized out>) at ../Python/thread_pthread.h:205
#44 0x00007f9d3f04f6db in start_thread (arg=0x7f9c72e27700) at pthread_create.c:463
#45 0x00007f9d3f38888f in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

Line 64 of multicam_new.py (frame #18) is a cv2.VideoCapture call. I use threads to read from multiple camera streams and process their frames accordingly.
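The setup described here, one thread per camera stream, could look roughly like the sketch below. It is not the asker's multicam_new.py: the names `start_stream_reader`, `open_capture`, `on_frame`, and `stop_event` are hypothetical. Because the backtrace shows the crash inside the call that opens the capture, the sketch also takes a lock around that one call so that only one thread opens a stream at a time. Whether that changes the behaviour here is an untested assumption; the thread itself reports a different fix.

```python
import threading

# Shared by all reader threads so that only one capture is opened at a time.
_open_lock = threading.Lock()


def start_stream_reader(source, open_capture, on_frame, stop_event):
    """Start a daemon thread that reads frames from one stream.

    open_capture is any callable that returns an object with read() and
    release(), for example cv2.VideoCapture. It is passed in so the sketch
    does not depend on a particular OpenCV build.
    """
    def run():
        # Serialize only the open call; reading happens in parallel.
        with _open_lock:
            cap = open_capture(source)
        try:
            while not stop_event.is_set():
                ok, frame = cap.read()
                if not ok:
                    break  # stream ended or read failed
                on_frame(source, frame)
        finally:
            cap.release()

    thread = threading.Thread(target=run, daemon=True)
    thread.start()
    return thread
```

With OpenCV installed, a call could look like `start_stream_reader(url, cv2.VideoCapture, handle_frame, threading.Event())`, where `url` and `handle_frame` are placeholders for a stream address and a frame callback.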

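[Comments]:

  • It looks like OpenCV is trying to use libgphoto2 as its camera backend, and something related to it is breaking. Either make sure it is installed correctly, or change the code that opens the OpenCV VideoCapture so that it explicitly does not use gPhoto; see docs.opencv.org/3.4/d4/d15/… for a list.
  • Thanks, I appreciate the help. I tried using GSTREAMER and FFMPEG, but the segfault persisted. What did solve the problem was the headless build of OpenCV (pip install opencv-python-headless). I installed it and everything works as expected.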
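For reference, requesting a specific backend when opening a capture could look like the minimal sketch below. The helper name `open_stream` is hypothetical; `cv2.CAP_FFMPEG` and `cv2.CAP_GSTREAMER` are OpenCV's backend constants, passed as the second argument to `cv2.VideoCapture`. Note that the asker reports this approach did not remove the segfault in their container, so treat it as an illustration of the suggestion rather than a fix.

```python
def open_stream(source, backend_name="CAP_FFMPEG"):
    """Open a video source while asking OpenCV for one specific backend.

    backend_name is the name of a cv2 constant such as "CAP_FFMPEG" or
    "CAP_GSTREAMER". cv2 is imported inside the function so this module
    can still be imported on machines where OpenCV is not installed.
    """
    import cv2

    backend = getattr(cv2, backend_name)
    cap = cv2.VideoCapture(source, backend)
    if not cap.isOpened():
        cap.release()
        raise RuntimeError(
            "could not open %r with backend %s" % (source, backend_name)
        )
    return cap
```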

Tags: python docker opencv containers


[Solution 1]:

For anyone who runs into this issue: use the headless build of OpenCV, pip install opencv-python-headless

That is what finally fixed the random segmentation faults.
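Since the Dockerfile in the question also installs python3-opencv through apt, it may be worth checking which build Python actually imports after installing the headless wheel. The sketch below is one way to do that; the helper names `gphoto2_status` and `report_cv2_build` are hypothetical. It relies on `cv2.__file__`, `cv2.__version__`, and `cv2.getBuildInformation()`, and it assumes the build report contains a line labelled "gPhoto2:", which may not hold for every OpenCV version.

```python
def gphoto2_status(build_info):
    """Return the text after 'gPhoto2:' in an OpenCV build report, or None.

    Assumes the report uses 'label: value' lines; the exact label is an
    assumption and may be missing in some builds.
    """
    for line in build_info.splitlines():
        label, sep, value = line.strip().partition(":")
        if sep and label.strip().lower() == "gphoto2":
            return value.strip()
    return None


def report_cv2_build():
    """Print where cv2 was loaded from and what its build report says."""
    import cv2  # imported here so the parser above works without OpenCV

    print("cv2 loaded from:", cv2.__file__)
    print("OpenCV version:", cv2.__version__)
    print("gPhoto2 in build info:", gphoto2_status(cv2.getBuildInformation()))
```

Calling `report_cv2_build()` inside the container shows the path of the loaded module. A path under /usr/lib/python3/dist-packages, as in the backtrace above, would indicate that the apt package is still the one in use.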
