【Question title】:Airflow KubernetesExecutor scheduler kube watch process dies
【Posted】:2019-11-27 11:34:29
【Question description】:

I have a K8s cluster on AWS and am trying to deploy the Airflow webserver + scheduler with the KubernetesExecutor. Unfortunately, every time I trigger a DAG from the webserver, the scheduler raises this error once the read_timeout interval (defined in airflow.cfg) has elapsed:

[2019-11-27 11:25:26,607] {kubernetes_executor.py:440} ERROR - Error while health checking kube watcher process. Process died for unknown reasons
[2019-11-27 11:25:26,617] {kubernetes_executor.py:344} INFO - Event: and now my watch begins starting at resource_version: 0
[2019-11-27 11:26:26,700] {kubernetes_executor.py:335} ERROR - Unknown error in KubernetesJobWatcher. Failing
Traceback (most recent call last):
  File "/usr/local/lib/python3.7/site-packages/urllib3/contrib/pyopenssl.py", line 294, in recv_into
    return self.connection.recv_into(*args, **kwargs)
  File "/usr/local/lib/python3.7/site-packages/OpenSSL/SSL.py", line 1840, in recv_into
    self._raise_ssl_error(self._ssl, result)
  File "/usr/local/lib/python3.7/site-packages/OpenSSL/SSL.py", line 1646, in _raise_ssl_error
    raise WantReadError()
OpenSSL.SSL.WantReadError

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.7/site-packages/urllib3/response.py", line 360, in _error_catcher
    yield
  File "/usr/local/lib/python3.7/site-packages/urllib3/response.py", line 666, in read_chunked
    self._update_chunk_length()
  File "/usr/local/lib/python3.7/site-packages/urllib3/response.py", line 598, in _update_chunk_length
    line = self._fp.fp.readline()
  File "/usr/local/lib/python3.7/socket.py", line 589, in readinto
    return self._sock.recv_into(b)
  File "/usr/local/lib/python3.7/site-packages/urllib3/contrib/pyopenssl.py", line 307, in recv_into
    raise timeout('The read operation timed out')
socket.timeout: The read operation timed out

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.7/site-packages/airflow/contrib/executors/kubernetes_executor.py", line 333, in run
    self.worker_uuid, self.kube_config)
  File "/usr/local/lib/python3.7/site-packages/airflow/contrib/executors/kubernetes_executor.py", line 357, in _run
    **kwargs):
  File "/usr/local/lib/python3.7/site-packages/kubernetes/watch/watch.py", line 144, in stream
    for line in iter_resp_lines(resp):
  File "/usr/local/lib/python3.7/site-packages/kubernetes/watch/watch.py", line 48, in iter_resp_lines
    for seg in resp.read_chunked(decode_content=False):
  File "/usr/local/lib/python3.7/site-packages/urllib3/response.py", line 694, in read_chunked
    self._original_response.close()
  File "/usr/local/lib/python3.7/contextlib.py", line 130, in __exit__
    self.gen.throw(type, value, traceback)
  File "/usr/local/lib/python3.7/site-packages/urllib3/response.py", line 365, in _error_catcher
    raise ReadTimeoutError(self._pool, None, 'Read timed out.')
urllib3.exceptions.ReadTimeoutError: HTTPSConnectionPool(host='100.64.0.1', port=443): Read timed out.
Process KubernetesJobWatcher-16:
Traceback (most recent call last):
  File "/usr/local/lib/python3.7/site-packages/urllib3/contrib/pyopenssl.py", line 294, in recv_into
    return self.connection.recv_into(*args, **kwargs)
  File "/usr/local/lib/python3.7/site-packages/OpenSSL/SSL.py", line 1840, in recv_into
    self._raise_ssl_error(self._ssl, result)
  File "/usr/local/lib/python3.7/site-packages/OpenSSL/SSL.py", line 1646, in _raise_ssl_error
    raise WantReadError()
OpenSSL.SSL.WantReadError

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.7/site-packages/urllib3/response.py", line 360, in _error_catcher
    yield
  File "/usr/local/lib/python3.7/site-packages/urllib3/response.py", line 666, in read_chunked
    self._update_chunk_length()
  File "/usr/local/lib/python3.7/site-packages/urllib3/response.py", line 598, in _update_chunk_length
    line = self._fp.fp.readline()
  File "/usr/local/lib/python3.7/socket.py", line 589, in readinto
    return self._sock.recv_into(b)
  File "/usr/local/lib/python3.7/site-packages/urllib3/contrib/pyopenssl.py", line 307, in recv_into
    raise timeout('The read operation timed out')
socket.timeout: The read operation timed out

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.7/multiprocessing/process.py", line 297, in _bootstrap
    self.run()
  File "/usr/local/lib/python3.7/site-packages/airflow/contrib/executors/kubernetes_executor.py", line 333, in run
    self.worker_uuid, self.kube_config)
  File "/usr/local/lib/python3.7/site-packages/airflow/contrib/executors/kubernetes_executor.py", line 357, in _run
    **kwargs):
  File "/usr/local/lib/python3.7/site-packages/kubernetes/watch/watch.py", line 144, in stream
    for line in iter_resp_lines(resp):
  File "/usr/local/lib/python3.7/site-packages/kubernetes/watch/watch.py", line 48, in iter_resp_lines
    for seg in resp.read_chunked(decode_content=False):
  File "/usr/local/lib/python3.7/site-packages/urllib3/response.py", line 694, in read_chunked
    self._original_response.close()
  File "/usr/local/lib/python3.7/contextlib.py", line 130, in __exit__
    self.gen.throw(type, value, traceback)
  File "/usr/local/lib/python3.7/site-packages/urllib3/response.py", line 365, in _error_catcher
    raise ReadTimeoutError(self._pool, None, 'Read timed out.')
urllib3.exceptions.ReadTimeoutError: HTTPSConnectionPool(host='100.64.0.1', port=443): Read timed out.
[2019-11-27 11:26:26,898] {kubernetes_executor.py:440} ERROR - Error while health checking kube watcher process. Process died for unknown reasons
[2019-11-27 11:26:26,968] {kubernetes_executor.py:344} INFO - Event: and now my watch begins starting at resource_version: 0

PostgreSQL was installed via a Helm chart.

kubectl version

Client Version: version.Info{Major:"1", Minor:"16", GitVersion:"v1.16.3", GitCommit:"b3cbbae08ec52a7fc73d334838e18d17e8512749", GitTreeState:"clean", BuildDate:"2019-11-14T04:24:29Z", GoVersion:"go1.12.13", Compiler:"gc", Platform:"darwin/amd64"}
Server Version: version.Info{Major:"1", Minor:"11", GitVersion:"v1.11.8", GitCommit:"4e209c9383fa00631d124c8adcc011d617339b3c", GitTreeState:"clean", BuildDate:"2019-02-28T18:40:05Z", GoVersion:"go1.10.8", Compiler:"gc", Platform:"linux/amd64"}

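100.64.0.1 is the kubernetes service (cluster IP).

Any suggestions?

【Comments】:

  • Have you set up your postgres_default connection (github.com/puckel/docker-airflow)? Does the problem occur for all DAGs or only for a specific one?
  • @Nick Hi Nick! The connection to postgres is established successfully; I defined the default postgres connection parameters in the Dockerfile.
  • Yes, the problem affects all DAGs.
  • You could post that as an answer so that other users can benefit from it and upvote your question and answer.
  • I have the same problem with Airflow v1.10.0. What issue did you have with the postgres PVC?

Tags: python docker kubernetes airflow airflow-scheduler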


【Solution 1】:

According to a comment on an issue I filed, this problem does not affect running pods. However, the issue does exist.
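
The log above shows the pattern behind that comment: the watcher process dies on a read timeout, the health check notices, and a new watch begins. As a rough illustration only (this is not Airflow's code), the same restart-on-timeout idea can be sketched in plain Python. `watch_with_restart`, `open_stream`, and `fake_stream_factory` are hypothetical names invented for this sketch; with the real client, `open_stream` would presumably wrap `kubernetes.watch.Watch().stream(...)`, and the exception to catch would be `urllib3.exceptions.ReadTimeoutError` rather than `socket.timeout`.

```python
import socket
from typing import Callable, Iterable, Iterator, List, Tuple, Type


def watch_with_restart(
    open_stream: Callable[[str], Iterable[dict]],
    timeout_errors: Tuple[Type[BaseException], ...] = (socket.timeout,),
    max_restarts: int = 3,
) -> Iterator[dict]:
    """Yield events, reopening the stream after each read timeout.

    open_stream(resource_version) is a hypothetical factory that returns
    an iterable of event dicts, each carrying a "resource_version" key.
    """
    resource_version = "0"
    restarts = 0
    while True:
        try:
            for event in open_stream(resource_version):
                # Remember the position so a restart resumes from here.
                resource_version = event["resource_version"]
                yield event
            return  # the stream ended normally
        except timeout_errors:
            restarts += 1
            if restarts > max_restarts:
                raise  # give up after too many consecutive restarts


def fake_stream_factory() -> Tuple[Callable[[str], Iterator[dict]], List[str]]:
    """Build a fake stream that times out once, for demonstration only."""
    calls: List[str] = []

    def open_stream(resource_version: str) -> Iterator[dict]:
        calls.append(resource_version)
        if len(calls) == 1:
            yield {"resource_version": "1", "type": "ADDED"}
            raise socket.timeout("The read operation timed out")
        yield {"resource_version": "2", "type": "MODIFIED"}

    return open_stream, calls


if __name__ == "__main__":
    stream, seen_versions = fake_stream_factory()
    for ev in watch_with_restart(stream):
        print(ev["type"], ev["resource_version"])
    print("streams opened at resource versions:", seen_versions)
```

In this sketch an idle watch that times out is treated as a routine event instead of a fatal one, which would match the observation that pods keep running while the error is logged.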

【Discussion】:
