[Prometheus] Sample rules.yml for node_exporter and windows_exporter

코_노 2023. 3. 31. 16:18

To cover every server in a company with Prometheus, Linux and Windows Server hosts need separate rules, because each OS uses a different exporter and therefore exposes different metric names.

 

The samples below are rules.yml files for receiving alerts based on CPU usage, memory usage, and disk space.

Note that the rules.yml file can be given any name you like!
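
As a rough sketch of how the two rule files in this post might be loaded, the rule_files section of prometheus.yml could look like the fragment below. The rules/ directory follows the path used in the promtool command later in this post; the 1m evaluation interval is an assumed value for illustration, not one taken from the original setup.

```yaml
# prometheus.yml (fragment)
global:
  evaluation_interval: 1m   # assumed value; how often rules are evaluated

rule_files:
  # Any file name works as long as it is listed (or matched by a glob) here.
  - "rules/node_exporter_rules.yml"
  - "rules/windows_exporter_rules.yml"
```

Relative paths here are resolved against the directory that contains prometheus.yml.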


Linux) Metrics are collected through node_exporter, so the node_exporter_rules.yml below applies.

# 'node_exporter_rules.yml'

groups:
  - name: custom_rules
    rules:
      - record: node_CPU_Usage_percent
        expr: 100 - (avg by (instance) (irate(node_cpu_seconds_total{mode="idle"}[5m]) * 100))

      - record: node_Memory_Usage_percent
        expr: 100 - ((node_memory_MemAvailable_bytes * 100) / node_memory_MemTotal_bytes)

      - record: node_Filesystem_Free_percent
        expr: 100 * node_filesystem_free_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"}

  - name: alert_rules
    rules:
      - alert: InstanceDown
        expr: up == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Instance [{{ $labels.instance }}] down"
          description: "[{{ $labels.instance }}] of job [{{ $labels.job }}] has been down for more than 1 minute."
     
      - alert: High_CPU_Usage_80
        expr: node_CPU_Usage_percent >= 80
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Instance [{{ $labels.instance }}] CPU Usage High - 80%"
          description: "[{{ $labels.instance }}] has exceeded 80% CPU usage for the last 5 minutes."
      
      - alert: High_CPU_Usage_90
        expr: node_CPU_Usage_percent >= 90
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Instance [{{ $labels.instance }}] CPU Usage High - 90%"
          description: "[{{ $labels.instance }}] has exceeded 90% CPU usage for the last 5 minutes."

      - alert: High_Memory_Usage_80
        expr: node_Memory_Usage_percent >= 80
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Instance [{{ $labels.instance }}] Memory usage High - 80%"
          description: "[{{ $labels.instance }}] of job [{{ $labels.job }}] is at {{ $value }}% memory usage."
       
      - alert: High_Memory_Usage_90
        expr: node_Memory_Usage_percent >= 90
        for: 3m
        labels:
          severity: critical
        annotations:
          summary: "Instance [{{ $labels.instance }}] Memory usage High - 90%"
          description: "[{{ $labels.instance }}] of job [{{ $labels.job }}] is at {{ $value }}% memory usage."
         
      - alert: DiskSpace_Free_10
        expr: node_Filesystem_Free_percent <= 10
        labels:
          severity: critical
        annotations:
          summary: "Instance [{{ $labels.instance }}] has 10% or less Free disk space"
          description: "[{{ $labels.instance }}] has only {{ $value }}% free disk space."
      
      - alert: DiskSpace_Free_20
        expr: node_Filesystem_Free_percent <= 20
        labels:
          severity: warning
        annotations:
          summary: "Instance [{{ $labels.instance }}] has 20% or less Free disk space"
          description: "[{{ $labels.instance }}] has only {{ $value }}% free disk space."
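
The disk rules above only watch the root mountpoint. If other mountpoints matter too, one possible variation is to drop the mountpoint filter and exclude pseudo filesystems by fstype instead. This is only a sketch: the group, record, and alert names, the fstype list, and the 10m hold time are all hypothetical choices. It uses node_filesystem_avail_bytes rather than node_filesystem_free_bytes because the former reports the space available to non-root users.

```yaml
groups:
  - name: filesystem_sketch   # hypothetical group name
    rules:
      # Percentage of space still available, per instance and mountpoint.
      - record: node_Filesystem_Avail_percent
        expr: |
          100 * node_filesystem_avail_bytes{fstype!~"tmpfs|overlay|squashfs"}
              / node_filesystem_size_bytes{fstype!~"tmpfs|overlay|squashfs"}

      - alert: DiskSpace_Avail_10_AnyMount
        expr: node_Filesystem_Avail_percent <= 10
        for: 10m   # assumed hold time to avoid firing on short spikes
        labels:
          severity: critical
        annotations:
          summary: "Instance [{{ $labels.instance }}] mountpoint [{{ $labels.mountpoint }}] has 10% or less available disk space"
          description: "[{{ $labels.mountpoint }}] on [{{ $labels.instance }}] has {{ $value }}% available."
```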

Windows Server) Metrics are collected through windows_exporter, so the windows_exporter_rules.yml below applies.

# 'windows_exporter_rules.yml'

groups:
  - name: custom_record
    rules:
      - record: Windows_Server_CPU_Usage
        expr: 100 - (avg by (instance) (rate(windows_cpu_time_total{mode="idle"}[2m])) * 100)

      - record: Windows_Server_Memory_Usage
        expr: 100 - ((windows_os_physical_memory_free_bytes / windows_cs_physical_memory_bytes) * 100)

      - record: Windows_Server_DiskSpace_Usage
        expr: 100 - (windows_logical_disk_free_bytes{volume!~"HarddiskVolume.+"} / windows_logical_disk_size_bytes{volume!~"HarddiskVolume.+"}) * 100


  - name: alert_rules
    rules:
      - alert: InstanceDown
        expr: up == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Instance [{{ $labels.instance }}] down"
          description: "[{{ $labels.instance }}] of job [{{ $labels.job }}] has been down for more than 1 minute."
      
      - alert: Windows_Server_CPU_Usage_90
        expr: Windows_Server_CPU_Usage >= 90
        for: 3m
        labels:
          severity: critical
        annotations:
          summary: "Instance [{{ $labels.instance }}] CPU Usage High - 90%"
          description: "[{{ $labels.instance }}] has exceeded 90% CPU usage for the last 3 minutes."
      
      - alert: Windows_Server_CPU_Usage_80
        expr: Windows_Server_CPU_Usage >= 80
        for: 3m
        labels:
          severity: warning
        annotations:
          summary: "Instance [{{ $labels.instance }}] CPU Usage High - 80%"
          description: "[{{ $labels.instance }}] has exceeded 80% CPU usage for the last 3 minutes."
       
      - alert: Windows_Server_Memory_Usage_90
        expr: Windows_Server_Memory_Usage >= 90
        for: 3m
        labels:
          severity: critical
        annotations:
          summary: "Instance [{{ $labels.instance }}] Memory Usage High - 90%"
          description: "[{{ $labels.instance }}] has exceeded 90% memory usage for the last 3 minutes."
      
      - alert: Windows_Server_Memory_Usage_80
        expr: Windows_Server_Memory_Usage >= 80 unless on(instance) {instance="<IP:Port of the host to exclude>"}
        for: 3m
        labels:
          severity: warning
        annotations:
          summary: "Instance [{{ $labels.instance }}] Memory Usage High - 80%"
          description: "[{{ $labels.instance }}] has exceeded 80% memory usage for the last 3 minutes."

      
      - alert: Windows_Server_DiskSpace_Usage_90
        expr: Windows_Server_DiskSpace_Usage >= 90
        labels:
          severity: critical
        annotations:
          summary: "Instance [{{ $labels.instance }}] has 10% or less Free disk space"
          description: "Disk usage is more than 90%\n LABELS = {{ $labels }}\n VALUE = {{ $value }}"
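
Both rule files define the same InstanceDown alert on up == 0. Since up is not exporter-specific, each copy matches every down target regardless of its OS. If separate handling per OS is wanted, one option is to scope each alert by job. The sketch below assumes the scrape jobs are named node_exporter and windows_exporter; these job names, the alert names, and the os label are hypothetical, so substitute the job_name values from your own scrape_configs.

```yaml
groups:
  - name: instance_down_by_os   # hypothetical group name
    rules:
      - alert: LinuxInstanceDown
        expr: up{job="node_exporter"} == 0      # job name is an assumption
        for: 1m
        labels:
          severity: critical
          os: linux
        annotations:
          summary: "Linux instance [{{ $labels.instance }}] down"

      - alert: WindowsInstanceDown
        expr: up{job="windows_exporter"} == 0   # job name is an assumption
        for: 1m
        labels:
          severity: critical
          os: windows
        annotations:
          summary: "Windows instance [{{ $labels.instance }}] down"
```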

The alert named 'Windows_Server_Memory_Usage_80' in the sample above shows that an unless on clause can also be used in expr.

 

unless on([label from the result of the preceding PromQL expression]) {labelname="value to exclude"}

Used this way, the clause takes the result of the expression before unless, matches on the chosen label, and drops every series whose label has the given value, so the alert is not applied to those series.

 

The label selector ({...}) also accepts regular expressions through the =~ operator. For example:

unless on(instance) {instance=~"(A_HOSTNAME)|(B_HOSTNAME)"}
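
A selector without a metric name matches every series that carries the given label value, so it can be broader than needed. Two narrower variations are sketched below. A_HOSTNAME and B_HOSTNAME are the placeholders from the example above, and the group and alert names are hypothetical. The first keeps unless but restricts the right-hand side to the up metric; the second drops unless and excludes the hosts with a negative regex matcher (!~) on the recorded metric itself.

```yaml
groups:
  - name: exclusion_sketch   # hypothetical group name
    rules:
      # Variation 1: keep `unless`, but match only the `up` series of the excluded hosts.
      - alert: Windows_Server_Memory_Usage_80_unless
        expr: Windows_Server_Memory_Usage >= 80 unless on(instance) up{instance=~"(A_HOSTNAME)|(B_HOSTNAME)"}
        for: 3m
        labels:
          severity: warning

      # Variation 2: filter the excluded hosts out of the left-hand side directly.
      - alert: Windows_Server_Memory_Usage_80_filtered
        expr: Windows_Server_Memory_Usage{instance!~"(A_HOSTNAME)|(B_HOSTNAME)"} >= 80
        for: 3m
        labels:
          severity: warning
```

The second form is usually easier to read when the exclusion depends only on labels that the alerting metric already has.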


Checking that the contents of a rules.yml file are syntactically valid

bin/promtool check rules rules/node_exporter_rules.yml
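
Beyond the syntax check, promtool can also unit-test rules with its test rules subcommand. The following is a minimal sketch of a test file for the InstanceDown alert. It assumes the file is saved next to the rule file, for example as rules/node_exporter_rules_test.yml (a hypothetical name), and run with bin/promtool test rules rules/node_exporter_rules_test.yml. The job name node and the instance address are made-up test inputs.

```yaml
# rules/node_exporter_rules_test.yml (hypothetical file name)
rule_files:
  - node_exporter_rules.yml   # resolved relative to this test file

evaluation_interval: 1m

tests:
  - interval: 1m
    # Simulated scrape results: the target reports up == 0 for five samples.
    input_series:
      - series: 'up{job="node", instance="10.0.0.1:9100"}'
        values: '0 0 0 0 0'

    alert_rule_test:
      # By 3m the condition has held longer than `for: 1m`, so the alert should be firing.
      - eval_time: 3m
        alertname: InstanceDown
        exp_alerts:
          - exp_labels:
              severity: critical
              job: node
              instance: "10.0.0.1:9100"
            exp_annotations:
              summary: "Instance [10.0.0.1:9100] down"
              description: "[10.0.0.1:9100] of job [node] has been down for more than 1 minute."
```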

 

As I come across more useful rules.yml syntax while working with PromQL, I plan to add it here.


References

https://prometheus.io/docs/prometheus/latest/configuration/alerting_rules/

 


https://awesome-prometheus-alerts.grep.to/rules.html

 


https://stackoverflow.com/questions/69435005/use-conditional-operator-in-prometheus-alert-rules-to-set-severity