【Question title】: Kubernetes DNS Troubleshooting
【Posted】: 2020-11-09 00:38:06
【Question】:

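I am trying to troubleshoot a DNS issue in our Kubernetes cluster (v1.19). There are 3 nodes (1 controller, 2 workers), all running vanilla Ubuntu 20.04, using Calico networking with MetalLB for inbound load balancing. Everything is hosted on premises and has full internet access. A proxy server (Traefik) sits in front of it and handles SSL for the Kubernetes cluster and for other services in the infrastructure.

The problem appeared when I upgraded the Helm chart for the pods that were (and still are) connecting to the redis pod. Before that, everything had been running happily for 36 days.

The logs of one of the pods show an error indicating that it cannot work out where the redis pod(s) are: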

2020-11-09 00:00:00 [1] [verbose]:      [Cache] Attempting connection to redis.
2020-11-09 00:00:00 [1] [verbose]:      [Cache] Successfully connected to redis.
2020-11-09 00:00:00 [1] [verbose]:      [PubSub] Attempting connection to redis.
2020-11-09 00:00:00 [1] [verbose]:      [PubSub] Successfully connected to redis.
2020-11-09 00:00:00 [1] [warn]:         Secret key is weak. Please consider lengthening it for better security.
2020-11-09 00:00:00 [1] [verbose]:      [Database] Connecting to database...
2020-11-09 00:00:00 [1] [info]:         [Database] Successfully connected .
2020-11-09 00:00:00 [1] [verbose]:      [Database] Ran 0 migration(s).
2020-11-09 00:00:00 [1] [verbose]:      Sending request for public key.
Error: getaddrinfo EAI_AGAIN oct-2020-redis-master
    at GetAddrInfoReqWrap.onlookup [as oncomplete] (dns.js:67:26) {
  errno: -3001,
  code: 'EAI_AGAIN',
  syscall: 'getaddrinfo',
  hostname: 'oct-2020-redis-master'
}
[ioredis] Unhandled error event: Error: getaddrinfo EAI_AGAIN oct-2020-redis-master
    at GetAddrInfoReqWrap.onlookup [as oncomplete] (dns.js:67:26)
Error: connect ETIMEDOUT
    at Socket.<anonymous> (/app/node_modules/ioredis/built/redis/index.js:307:37)
    at Object.onceWrapper (events.js:421:28)
    at Socket.emit (events.js:315:20)
    at Socket.EventEmitter.emit (domain.js:486:12)
    at Socket._onTimeout (net.js:483:8)
    at listOnTimeout (internal/timers.js:554:17)
    at processTimers (internal/timers.js:497:7) {
  errorno: 'ETIMEDOUT',
  code: 'ETIMEDOUT',
  syscall: 'connect'
}

I have gone through the steps listed at https://kubernetes.io/docs/tasks/administer-cluster/dns-debugging-resolution/

ubuntu@k8-01:~$ kubectl exec -i -t dnsutils -- nslookup kubernetes.default
;; connection timed out; no servers could be reached

command terminated with exit code 1
ubuntu@k8-01:~$ kubectl get pods --namespace=kube-system -l k8s-app=kube-dns
NAME                      READY   STATUS    RESTARTS   AGE
coredns-f9fd979d6-lfm5t   1/1     Running   17         37d
coredns-f9fd979d6-sw2qp   1/1     Running   18         37d
ubuntu@k8-01:~$ kubectl logs --namespace=kube-system -l k8s-app=kube-dns
CoreDNS-1.7.0
linux/amd64, go1.14.4, f59c03d
[INFO] Reloading
[INFO] plugin/health: Going into lameduck mode for 5s
[INFO] plugin/reload: Running configuration MD5 = 3d3f6363f05ccd60e0f885f0eca6c5ff
[INFO] Reloading complete
[INFO] 10.244.210.238:34288 - 28733 "A IN oct-2020-redis-master.default.svc.cluster.local. udp 75 false 512" NOERROR qr,aa,rd 148 0.001300712s
[INFO] 10.244.210.238:44532 - 12032 "A IN oct-2020-redis-master.default.svc.cluster.local. udp 75 false 512" NOERROR qr,aa,rd 148 0.001279312s
[INFO] 10.244.210.235:44595 - 65094 "A IN oct-2020-redis-master.default.svc.cluster.local. udp 75 false 512" NOERROR qr,aa,rd 148 0.000163001s
[INFO] 10.244.210.235:55945 - 20758 "A IN oct-2020-redis-master.default.svc.cluster.local. udp 75 false 512" NOERROR qr,aa,rd 148 0.000141202s
ubuntu@k8-01:~$ kubectl get services --all-namespaces
NAMESPACE     NAME                                               TYPE           CLUSTER-IP       EXTERNAL-IP     PORT(S)                      AGE
default       oct-2020-api                                       ClusterIP      10.107.89.213    <none>          80/TCP                       37d
default       oct-2020-nginx-ingress-controller                  LoadBalancer   10.110.235.175   192.168.2.150   80:30194/TCP,443:31514/TCP   37d
default       oct-2020-nginx-ingress-default-backend             ClusterIP      10.98.147.246    <none>          80/TCP                       37d
default       oct-2020-redis-headless                            ClusterIP      None             <none>          6379/TCP                     37d
default       oct-2020-redis-master                              ClusterIP      10.109.58.236    <none>          6379/TCP                     37d
default       oct-2020-webclient                                 ClusterIP      10.111.204.251   <none>          80/TCP                       37d
default       kubernetes                                         ClusterIP      10.96.0.1        <none>          443/TCP                      37d
kube-system   coredns                                            NodePort       10.101.104.114   <none>          53:31245/UDP                 15h
kube-system   kube-dns                                           ClusterIP      10.96.0.10       <none>          53/UDP,53/TCP,9153/TCP       37d

When I exec into the pod:

/app # grep "nameserver" /etc/resolv.conf
nameserver 10.96.0.10
/app # nslookup
BusyBox v1.31.1 () multi-call binary.

Usage: nslookup [-type=QUERY_TYPE] [-debug] HOST [DNS_SERVER]

Query DNS about HOST

QUERY_TYPE: soa,ns,a,aaaa,cname,mx,txt,ptr,any
/app # ping 10.96.0.10
PING 10.96.0.10 (10.96.0.10): 56 data bytes
^C
--- 10.96.0.10 ping statistics ---
3 packets transmitted, 0 packets received, 100% packet loss
/app # nslookup oct-20-redis-master
;; connection timed out; no servers could be reached

Any ideas on how to troubleshoot this would be greatly appreciated.

【Comments】:

Tags: kubernetes redis coredns


【Solution 1】:

To answer my own question: I deleted the DNS pod and it started working again. The command was:

kubectl delete pod coredns-f9fd979d6-sw2qp --namespace=kube-system
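
Deleting a pod works because its controller recreates it. A minimal sketch for restarting every CoreDNS pod in one step follows. It assumes the pods are managed by a Deployment named coredns in kube-system (the pod names in the question suggest this, but it is not confirmed there), and restart_coredns is a hypothetical helper name.

```shell
#!/bin/sh
# Sketch only: restart all CoreDNS pods and wait for the new ones to be ready.
# Assumes a Deployment named "coredns" in the kube-system namespace.

restart_coredns() {
  # Triggers a rolling restart, then blocks until it completes or times out.
  kubectl -n kube-system rollout restart deployment/coredns &&
    kubectl -n kube-system rollout status deployment/coredns --timeout=120s
}

# Usage: restart_coredns
```

Like the manual delete, this only clears the symptom; it does not explain why resolution stopped.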

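Deleting the pod does not address the underlying problem of why this happened, or why Kubernetes did not detect that something was wrong with these pods and recreate them. I will keep digging into this and add more instrumentation to the DNS pods to find the actual cause.

If anyone has ideas about what instrumentation to hook up, or what specifically to look at, they would be appreciated.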
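
One thing that may be worth looking at is whether the failure sits at the kube-dns Service address or at the CoreDNS pods themselves. The sketch below queries each CoreDNS pod IP directly and then the Service's cluster IP. It assumes the dnsutils pod from the Kubernetes DNS debugging guide is still running in the current namespace; dns_probe is a hypothetical helper name.

```shell
#!/bin/sh
# Sketch only: compare DNS answers from each CoreDNS pod IP with the Service VIP.
# Assumes a pod named "dnsutils" exists in the current namespace.

dns_probe() {
  name="${1:-kubernetes.default.svc.cluster.local}"

  # Pod IPs currently backing the kube-dns Service.
  pod_ips=$(kubectl get endpoints kube-dns -n kube-system \
    -o jsonpath='{.subsets[*].addresses[*].ip}')

  # The Service's virtual IP (10.96.0.10 in the question).
  vip=$(kubectl get svc kube-dns -n kube-system -o jsonpath='{.spec.clusterIP}')

  for ip in $pod_ips $vip; do
    if kubectl exec dnsutils -- nslookup "$name" "$ip" >/dev/null 2>&1; then
      echo "$ip ok"
    else
      echo "$ip FAIL"
    fi
  done
}

# Usage: dns_probe oct-2020-redis-master.default.svc.cluster.local
```

If the pod IPs answer but the cluster IP does not, the Service path (kube-proxy rules) is the more likely suspect; if the pod IPs fail as well, pod-to-pod networking or CoreDNS itself would be the next thing to check. A failed ping to the cluster IP is not conclusive on its own, since a ClusterIP typically does not answer ICMP when kube-proxy runs in iptables mode.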

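【Comments】:

  • Did you ever find anything on this? I am running into a similar problem.
  • Unfortunately not. What we did was switch from vanilla Kubernetes to MicroK8s; we now have multiple clusters with around 6 months of uptime and no DNS problems.
【Solution 2】: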

This is how we test DNS.

Create the headless Service and StatefulSet below:

apiVersion: v1
kind: Service
metadata:
  name: nginx
  labels:
    app: nginx
spec:
  ports:
  - port: 80
    name: web
  clusterIP: None
  selector:
    app: nginx
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: web
  labels:
    app: nginx
spec:
  serviceName: "nginx"
  replicas: 2
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        app: nginx
    spec:
      containers:
      - name: nginx
        image: k8s.gcr.io/nginx-slim:0.8
        ports:
        - containerPort: 80
          name: web
        volumeMounts:
        - name: www
          mountPath: /usr/share/nginx/html
      volumes:
      - name: www
        emptyDir: {}

Then run the following tests:

master $ kubectl get po
NAME      READY     STATUS    RESTARTS   AGE
web-0     1/1       Running   0          1m
web-1     1/1       Running   0          1m

master $ kubectl get svc
NAME         TYPE        CLUSTER-IP   EXTERNAL-IP   PORT(S)   AGE
kubernetes   ClusterIP   10.96.0.1    <none>        443/TCP   35m
nginx        ClusterIP   None         <none>        80/TCP    2m

master $ kubectl run -i --tty --image busybox:1.28 dns-test --restart=Never --rm
If you don't see a command prompt, try pressing enter.
/ # nslookup nginx
Server:    10.96.0.10
Address 1: 10.96.0.10 kube-dns.kube-system.svc.cluster.local

Name:      nginx
Address 1: 10.40.0.1 web-0.nginx.default.svc.cluster.local
Address 2: 10.40.0.2 web-1.nginx.default.svc.cluster.local
/ #


/ # nslookup web-0.nginx
Server:    10.96.0.10
Address 1: 10.96.0.10 kube-dns.kube-system.svc.cluster.local

Name:      web-0.nginx
Address 1: 10.40.0.1 web-0.nginx.default.svc.cluster.local


/ # nslookup web-0.nginx.default.svc.cluster.local
Server:    10.96.0.10
Address 1: 10.96.0.10 kube-dns.kube-system.svc.cluster.local

Name:      web-0.nginx.default.svc.cluster.local
Address 1: 10.40.0.1 web-0.nginx.default.svc.cluster.local
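
The same three lookups can be run non-interactively, which is handy for repeating the check after a change. This is a rough sketch rather than part of the original answer: dns_smoke_test and the dns-test-N pod names are hypothetical, and it relies on kubectl run passing through the container's exit status when --restart=Never is used.

```shell
#!/bin/sh
# Sketch only: repeat the lookups above from short-lived busybox pods.
# Assumes the nginx Service and web StatefulSet from this answer are deployed.

dns_smoke_test() {
  rc=0
  n=0
  for name in nginx web-0.nginx web-0.nginx.default.svc.cluster.local; do
    n=$((n + 1))
    # -i attaches to the pod so that --rm can delete it once nslookup exits.
    if kubectl run "dns-test-$n" --image=busybox:1.28 --restart=Never --rm -i \
         -- nslookup "$name" >/dev/null 2>&1; then
      echo "PASS $name"
    else
      echo "FAIL $name"
      rc=1
    fi
  done
  return "$rc"
}

# Usage: dns_smoke_test || echo "cluster DNS looks unhealthy"
```

Each lookup uses its own pod name so that a pod still being deleted from the previous iteration cannot cause a name clash.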

【Comments】:

  • It looks like it is the DNS service:

      kubectl run -i --tty --image busybox:1.28 dns-test --restart=Never --rm
      If you don't see a command prompt, try pressing enter.
      / # nslookup nginx
      Server: 10.96.0.10
      Address 1: 10.96.0.10
      nslookup: can't resolve 'nginx'
      / # nslookup web-0.nginx
      Server: 10.96.0.10
      Address 1: 10.96.0.10
      nslookup: can't resolve 'web-0.nginx'
      / # nslookup web-0.nginx.default.svc.cluster.local
      Server: 10.96.0.10
      Address 1: 10.96.0.10
      nslookup: can't resolve 'web-0.nginx.default.svc.cluster.local'

    Any ideas on what to do next?