Prometheus (node_exporter) issue when updating from GKE 1.15 to 1.16: answers

【Question title】: Prometheus (node_exporter) issue when updating from GKE 1.15 to 1.16
【Posted】: 2020-09-04 13:14:52
【Question description】:

For several months I have been running the Prometheus and Grafana applications on Kubernetes in Google GKE. For example, I used to monitor container_cpu_usage_seconds_total in Grafana.

But since I upgraded the GKE nodes from 1.15 to 1.16, I have lost the container_* metrics.

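To test this, I created a new cluster on version 1.15. I installed Prometheus from the Google Marketplace and upgraded GKE step by step until the problem appeared. Again, the container_* monitoring stopped at version 1.16.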

The graph of container_cpu_usage_seconds_total stops at the point where I upgraded the nodes (there are 3 nodes).

Am I the only one with this problem? Has anyone found a solution?

Thanks for your help :)

Valentin

【Comments】:

Have you checked the logs in the prometheus/grafana containers?
In node_exporter I have this: 2020-09-08T09:35:26.426156249Z time="2020-09-08T09:35:26Z" level=error msg="ERROR: diskstats collector failed after 0.100237s: invalid line for /host/proc/diskstats for sdl" source="collector.go:123" In Prometheus I have this: level=warn ts=2020-09-08T09:32:12.538Z caller=klog.go:86 component=k8s_client_runtime func=Warningf msg="/app/discovery/kubernetes/kubernetes.go:263: watch of *v1.Endpoints ended with: too old resource version: 183350035 (183351611)"
Could you share the exact version of the GKE cluster you are using, and which specific GCP Marketplace application you are using?
GKE cluster: 1.16.13-gke.400. Marketplace application: Prometheus & Grafana (v2.2) (node_exporter: v0.15.2; prometheus: 2.11.0). Thanks

Tags: kubernetes google-kubernetes-engine prometheus-node-exporter

【Solution 1】:

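I found out what was wrong. With Docker or Kubernetes, node-exporter does not send pod metrics (container_*). cAdvisor has to be installed (in the Google Marketplace application, cAdvisor is installed in the node-exporter image). Starting with Kubernetes 1.16, the cAdvisor configuration is wrong. You need to edit that configuration to fix the problem.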

All the details are in this post: Prometheus not receiving metrics from cadvisor in GKE
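To check whether the container_* metrics are reaching Prometheus before and after editing the configuration, one option is to ask the Prometheus HTTP API which scrape jobs currently expose container_cpu_usage_seconds_total. The following is only a minimal sketch, not part of the original answer: it assumes Prometheus is reachable at http://localhost:9090 (for example through a port-forward), and the helper names are made up for illustration.

```python
import json
import urllib.parse
import urllib.request

# Assumption: Prometheus is reachable at this address (e.g. via a port-forward).
PROMETHEUS_URL = "http://localhost:9090"


def build_query_url(base_url: str, promql: str) -> str:
    """Build an instant-query URL for the Prometheus HTTP API (/api/v1/query)."""
    query_string = urllib.parse.urlencode({"query": promql})
    return f"{base_url.rstrip('/')}/api/v1/query?{query_string}"


def series_per_job(response: dict) -> dict:
    """Map each scrape job to its series count from a `count by (job)` result."""
    if response.get("status") != "success":
        raise ValueError(f"query failed: {response.get('error', 'unknown error')}")
    return {
        # Each sample value is a [timestamp, "string value"] pair.
        sample["metric"].get("job", "<none>"): int(float(sample["value"][1]))
        for sample in response["data"]["result"]
    }


def check_container_metrics(base_url: str = PROMETHEUS_URL) -> dict:
    """Ask Prometheus which jobs expose container_cpu_usage_seconds_total."""
    url = build_query_url(
        base_url, "count by (job) (container_cpu_usage_seconds_total)"
    )
    with urllib.request.urlopen(url, timeout=10) as resp:
        return series_per_job(json.load(resp))
```

Calling check_container_metrics() returns a dictionary of job names and series counts. An empty dictionary means Prometheus currently holds no container_cpu_usage_seconds_total series, which matches the symptom described in the question.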
