【Question title】: Test Rules AlertManager FAILED: yaml: unmarshal errors: line 1: field groups not found in type main.unitTestFile
【Posted】: 2019-09-22 03:00:20
【Question description】:

Please help: I get the error message below when testing my Alertmanager alert rules.

 promtool check rules /etc/prometheus/alert.rules.yml
 Checking /etc/prometheus/alert.rules.yml
 SUCCESS: 3 rules found

 promtool test rules /etc/prometheus/alert.rules.yml
 Unit Testing:  /etc/prometheus/alert.rules.yml
 FAILED:
 yaml: unmarshal errors:
 line 1: field groups not found in type main.unitTestFile

My alert.rules configuration is as follows:

      cat /etc/prometheus/alert.rules.yml
      groups:
      - alert: MemoryFree10%
        expr: node_exporter:node_memory_free:memory_used_percents >= 90
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Instance {{ $labels.instance }} hight memory usage"
          description: "{{ $labels.instance }} has more than 90% of its memory used."
      - alert: DiskSpace10%Free
        expr: node_exporter:node_filesystem_free:fs_used_percents >= 90
        labels:
          severity: moderate
        annotations:
          summary: "Instance {{ $labels.instance }} is low on disk space"
          description: "{{ $labels.instance }} has only {{ $value }}% free."
      - alert: ExporterDown
        expr: up == 0
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Exporter down (instance {{ $labels.instance }})"
          description: "Prometheus exporter down\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}"

Is anything missing or incorrect in our alert rules file?

Please help.

Thanks

【Discussion】:

    Tags: alert prometheus prometheus-alertmanager


    【Solution 1】:

    To check the syntax of the configuration file with promtool, use "./promtool check config prometheus.yml". Here prometheus.yml is the parent file, which references the Prometheus rules file prometheus_rules.yml. To check the syntax of the rules file itself, use "./promtool check rules prometheus_rules.yml".
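
    As a minimal sketch of the parent/child relationship described above, prometheus.yml lists the rules file under `rule_files`. The file name prometheus_rules.yml comes from this answer; the `evaluation_interval` value is only an assumed example.

    ```yaml
    # prometheus.yml: the parent configuration file.
    global:
      evaluation_interval: 1m   # assumed value, for illustration only

    # Rule files loaded by Prometheus. Each one can also be checked
    # on its own with: ./promtool check rules prometheus_rules.yml
    rule_files:
      - prometheus_rules.yml
    ```

    Running "./promtool check config prometheus.yml" on a file like this also checks the rule files it lists.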

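    【Discussion】:

      【Solution 2】:

      You are running unit tests against the alert rules file itself. promtool test rules parses its argument as a unit test file (type main.unitTestFile), which has no top-level groups field, hence the error. You should first write a test file, then run the unit tests against it with promtool test rules test.yml.

      Here is the demo from https://prometheus.io/docs/prometheus/latest/configuration/unit_testing_rules/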

      alerts.yml

      # This is the rules file.
      
      groups:
      - name: example
        rules:
      
        - alert: InstanceDown
          expr: up == 0
          for: 5m
          labels:
              severity: page
          annotations:
              summary: "Instance {{ $labels.instance }} down"
              description: "{{ $labels.instance }} of job {{ $labels.job }} has been down for more than 5 minutes."
      
        - alert: AnotherInstanceDown
          expr: up == 0
          for: 10m
          labels:
              severity: page
          annotations:
              summary: "Instance {{ $labels.instance }} down"
              description: "{{ $labels.instance }} of job {{ $labels.job }} has been down for more than 5 minutes."
      

      test.yml

      # This is the main input for unit testing.
      # Only this file is passed as command line argument.
      
      rule_files:
          - alerts.yml
      
      evaluation_interval: 1m
      
      tests:
          # Test 1.
          - interval: 1m
            # Series data.
            input_series:
                - series: 'up{job="prometheus", instance="localhost:9090"}'
                  values: '0 0 0 0 0 0 0 0 0 0 0 0 0 0 0'
                - series: 'up{job="node_exporter", instance="localhost:9100"}'
                  values: '1+0x6 0 0 0 0 0 0 0 0' # 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0
                - series: 'go_goroutines{job="prometheus", instance="localhost:9090"}'
                  values: '10+10x2 30+20x5' # 10 20 30 30 50 70 90 110 130
                - series: 'go_goroutines{job="node_exporter", instance="localhost:9100"}'
                  values: '10+10x7 10+30x4' # 10 20 30 40 50 60 70 80 10 40 70 100 130
      
            # Unit test for alerting rules.
            alert_rule_test:
                # Unit test 1.
                - eval_time: 10m
                  alertname: InstanceDown
                  exp_alerts:
                      # Alert 1.
                      - exp_labels:
                            severity: page
                            instance: localhost:9090
                            job: prometheus
                        exp_annotations:
                            summary: "Instance localhost:9090 down"
                            description: "localhost:9090 of job prometheus has been down for more than 5 minutes."
            # Unit tests for promql expressions.
            promql_expr_test:
                # Unit test 1.
                - expr: go_goroutines > 5
                  eval_time: 4m
                  exp_samples:
                      # Sample 1.
                      - labels: 'go_goroutines{job="prometheus",instance="localhost:9090"}'
                        value: 50
                      # Sample 2.
                      - labels: 'go_goroutines{job="node_exporter",instance="localhost:9100"}'
                        value: 50
      

      Then you can run promtool test rules test.yml, and you will get a result like this:

      Unit Testing:  test.yml
        SUCCESS
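
      Applied to the question, a test file for the ExporterDown alert might look like the sketch below. It assumes /etc/prometheus/alert.rules.yml defines ExporterDown inside a named group with a `rules` list, with `for: 5m`, `severity: warning`, and only the `summary` annotation. The file name exporter_down_test.yml and the series labels are hypothetical.

      ```yaml
      # exporter_down_test.yml (hypothetical name): passed to
      # "promtool test rules" instead of the rules file itself.
      rule_files:
        - /etc/prometheus/alert.rules.yml

      evaluation_interval: 1m

      tests:
        - interval: 1m
          # The exporter is down for the whole window (samples at 0m..10m).
          input_series:
            - series: 'up{job="node_exporter", instance="localhost:9100"}'
              values: '0 0 0 0 0 0 0 0 0 0 0'

          alert_rule_test:
            # With "for: 5m", the alert should be firing by 10m.
            - eval_time: 10m
              alertname: ExporterDown
              exp_alerts:
                - exp_labels:
                    severity: warning
                    instance: localhost:9100
                    job: node_exporter
                  exp_annotations:
                    summary: "Exporter down (instance localhost:9100)"
      ```

      If the rule also carries a `description` annotation, its rendered value would have to be listed under `exp_annotations` as well.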
      

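      【Discussion】:

        【Solution 3】:

        Your configuration is missing `rules`: each entry under `groups` needs a `name` and a `rules` list.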

            groups:
            - name: alert.rules
              rules:
              - alert: HighRequestLatency
              .....
        

        https://prometheus.io/docs/prometheus/latest/configuration/alerting_rules/
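
        As a sketch of what that means for the file in the question, the three alerts would move under a named group. The group name alert.rules is taken from the snippet above and is arbitrary; the "hight" typo in the first summary is corrected here.

        ```yaml
        groups:
          - name: alert.rules
            rules:
              - alert: MemoryFree10%
                expr: node_exporter:node_memory_free:memory_used_percents >= 90
                for: 5m
                labels:
                  severity: critical
                annotations:
                  summary: "Instance {{ $labels.instance }} high memory usage"
                  description: "{{ $labels.instance }} has more than 90% of its memory used."
              - alert: DiskSpace10%Free
                expr: node_exporter:node_filesystem_free:fs_used_percents >= 90
                labels:
                  severity: moderate
                annotations:
                  summary: "Instance {{ $labels.instance }} is low on disk space"
                  description: "{{ $labels.instance }} has only {{ $value }}% free."
              - alert: ExporterDown
                expr: up == 0
                for: 5m
                labels:
                  severity: warning
                annotations:
                  summary: "Exporter down (instance {{ $labels.instance }})"
                  description: "Prometheus exporter down\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}"
        ```

        This only fixes the structure of the rules file. promtool test rules still expects a separate test file, so pointing it at this file will keep producing the main.unitTestFile error.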

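        【Discussion】:

        • Hi amjad, thanks for your reply. But I have already added rules and the test still fails: groups: - name: alerting.rules rules: - alert: ExporterDown expr: up == 0 for: 5m labels: severity: warning annotations: summary: "Exporter down (instance {{ $labels.instance }})" promtool test rules /etc/prometheus/alert.rules.yml Unit Testing: /etc/prometheus/alert.rules.yml FAILED: yaml: unmarshal errors: line 1: field groups not found in type main.unitTestFile
        • I added your complete yml file and ran promtool, and everything succeeded. Are you sure your yaml is correct and properly formatted? ```groups: - name: rules rules: - alert: MemoryFree10% expr: node_exporter:node_memory_free:memory_used_percents >= 90 for: 5m labels: severity: critical annotations: summary: "Instance {{ $labels.instance }} hight memory usage" description: "{{ $labels.instance }} has more than 90% of its memory used." ```
        • Hmm, strange, I always get the error FAILED: yaml: unmarshal errors: line 1: field groups not found in type main.unitTestFile
        • Can you make a gist?
        • sorry @MichaelDoubez what do you mean by gist?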