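To cover all of a company's servers with Prometheus, Linux and Windows Server hosts need separate rules, because each platform uses a different exporter and therefore exposes different metric names.
The samples below are rules.yml files that raise alerts based on CPU usage, memory usage, and disk space.
Note that a rules file can be given any name; rules.yml is not required.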
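Because the file name is arbitrary, Prometheus only picks up the files that are listed under rule_files in prometheus.yml. A minimal sketch, assuming both rules files live in a rules/ directory next to prometheus.yml (the directory layout is an assumption, taken from the promtool command used later in this post):

```yaml
# prometheus.yml (excerpt)
# Paths are assumptions; adjust them to where the rules files actually live.
rule_files:
  - "rules/node_exporter_rules.yml"
  - "rules/windows_exporter_rules.yml"
  # A glob such as "rules/*.yml" would load every rules file in that directory instead.
```

Both sample files in this post define an InstanceDown alert on up == 0, which matches every scrape target regardless of exporter, so when both files are loaded it is probably enough to keep that alert in only one of them.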
Linux) Metrics are collected through node_exporter, so the following node_exporter_rules.yml applies.
# 'node_exporter_rules.yml'
groups:
  # Recording rules: precompute usage percentages that the alerts below refer to.
  - name: custom_rules
    rules:
      - record: node_CPU_Usage_percent
        expr: 100 - (avg by (instance) (irate(node_cpu_seconds_total{mode="idle"}[5m]) * 100))
      - record: node_Memory_Usage_percent
        expr: 100 - ((node_memory_MemAvailable_bytes * 100) / node_memory_MemTotal_bytes)
      # Free space on the root filesystem only.
      - record: node_Filesystem_Free_percent
        expr: 100 * node_filesystem_free_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"}
  - name: alert_rules
    rules:
      - alert: InstanceDown
        expr: up == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Instance [{{ $labels.instance }}] down"
          description: "[{{ $labels.instance }}] of job [{{ $labels.job }}] has been down for more than 1 minute."
      - alert: High_CPU_Usage_80
        expr: node_CPU_Usage_percent >= 80
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Instance [{{ $labels.instance }}] CPU Usage High - 80%"
          description: "[{{ $labels.instance }}] has exceeded 80% CPU usage for the last 5 minutes."
      - alert: High_CPU_Usage_90
        expr: node_CPU_Usage_percent >= 90
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Instance [{{ $labels.instance }}] CPU Usage High - 90%"
          description: "[{{ $labels.instance }}] has exceeded 90% CPU usage for the last 5 minutes."
      - alert: High_Memory_Usage_80
        expr: node_Memory_Usage_percent >= 80
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Instance [{{ $labels.instance }}] Memory usage High - 80%"
          description: "[{{ $labels.instance }}] of job [{{ $labels.job }}] is at {{ $value }}% memory usage."
      - alert: High_Memory_Usage_90
        expr: node_Memory_Usage_percent >= 90
        for: 3m
        labels:
          severity: critical
        annotations:
          summary: "Instance [{{ $labels.instance }}] Memory usage High - 90%"
          description: "[{{ $labels.instance }}] of job [{{ $labels.job }}] is at {{ $value }}% memory usage."
      - alert: DiskSpace_Free_10
        expr: node_Filesystem_Free_percent <= 10
        labels:
          severity: critical
        annotations:
          summary: "Instance [{{ $labels.instance }}] has 10% or less Free disk space"
          description: "[{{ $labels.instance }}] has only {{ $value }}% free disk space."
      - alert: DiskSpace_Free_20
        expr: node_Filesystem_Free_percent <= 20
        labels:
          severity: warning
        annotations:
          summary: "Instance [{{ $labels.instance }}] has 20% or less Free disk space"
          description: "[{{ $labels.instance }}] has only {{ $value }}% free disk space."
Windows Server) Metrics are collected through windows_exporter, so the following windows_exporter_rules.yml applies.
# 'windows_exporter_rules.yml'
groups:
  # Recording rules: precompute usage percentages that the alerts below refer to.
  - name: custom_record
    rules:
      - record: Windows_Server_CPU_Usage
        expr: 100 - (avg by (instance) (rate(windows_cpu_time_total{mode="idle"}[2m])) * 100)
      - record: Windows_Server_Memory_Usage
        expr: 100 - ((windows_os_physical_memory_free_bytes / windows_cs_physical_memory_bytes) * 100)
      # Volumes named HarddiskVolume* are excluded.
      - record: Windows_Server_DiskSpace_Usage
        expr: 100 - (windows_logical_disk_free_bytes{volume!~"HarddiskVolume.+"} / windows_logical_disk_size_bytes{volume!~"HarddiskVolume.+"}) * 100
  - name: alert_rules
    rules:
      - alert: InstanceDown
        expr: up == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Instance [{{ $labels.instance }}] down"
          description: "[{{ $labels.instance }}] of job [{{ $labels.job }}] has been down for more than 1 minute."
      - alert: Windows_Server_CPU_Usage_90
        expr: Windows_Server_CPU_Usage >= 90
        for: 3m
        labels:
          severity: critical
        annotations:
          summary: "Instance [{{ $labels.instance }}] CPU Usage High - 90%"
          description: "[{{ $labels.instance }}] has exceeded 90% CPU usage for the last 3 minutes."
      - alert: Windows_Server_CPU_Usage_80
        expr: Windows_Server_CPU_Usage >= 80
        for: 3m
        labels:
          severity: warning
        annotations:
          summary: "Instance [{{ $labels.instance }}] CPU Usage High - 80%"
          description: "[{{ $labels.instance }}] has exceeded 80% CPU usage for the last 3 minutes."
      - alert: Windows_Server_Memory_Usage_90
        expr: Windows_Server_Memory_Usage >= 90
        for: 3m
        labels:
          severity: critical
        annotations:
          summary: "Instance [{{ $labels.instance }}] Memory Usage High - 90%"
          description: "[{{ $labels.instance }}] has exceeded 90% memory usage for the last 3 minutes."
      - alert: Windows_Server_Memory_Usage_80
        # The unless clause exempts one host from this alert.
        expr: Windows_Server_Memory_Usage >= 80 unless on(instance) {instance="<IP:Port of the host to exclude>"}
        for: 3m
        labels:
          severity: warning
        annotations:
          summary: "Instance [{{ $labels.instance }}] Memory Usage High - 80%"
          description: "[{{ $labels.instance }}] has exceeded 80% memory usage for the last 3 minutes."
      - alert: Windows_Server_DiskSpace_Usage_90
        expr: Windows_Server_DiskSpace_Usage >= 90
        labels:
          severity: critical
        annotations:
          summary: "Instance [{{ $labels.instance }}] has 10% or less Free disk space"
          description: "Disk usage is more than 90%\n LABELS = {{ $labels }}\n VALUE = {{ $value }}"
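The alert named 'Windows_Server_Memory_Usage_80' in the sample above shows that an unless on clause can also be used in expr.
unless on([label from the result of the preceding PromQL expression]) {labelname="value to exclude"}
Used this way, the clause takes the series returned by the expression before unless, compares them on the given label, and drops every series whose label has the specified value, so the alert is not applied to those hosts.
The label matcher ({...}) also accepts the =~ operator, so the excluded values can be written as a regular expression. For example:
unless on(instance) {instance=~"(A_HOSTNAME)|(B_HOSTNAME)"}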
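When the label used for the exclusion is carried by the alerting metric itself, as instance is on Windows_Server_Memory_Usage, a negative regex matcher (!~) on that metric should give the same result without a second selector. This is a possible alternative rather than the approach used above; the group name is hypothetical, and A_HOSTNAME and B_HOSTNAME are the same placeholders as in the example.

```yaml
groups:
  - name: exclusion_example   # hypothetical group name, for illustration only
    rules:
      - alert: Windows_Server_Memory_Usage_80
        # Prometheus regex matchers are fully anchored, so only series whose
        # instance label is exactly A_HOSTNAME or B_HOSTNAME are excluded.
        expr: Windows_Server_Memory_Usage{instance!~"(A_HOSTNAME)|(B_HOSTNAME)"} >= 80
        for: 3m
        labels:
          severity: warning
```

The unless form is still the one to use when the exclusion depends on a label or series that the alerting metric does not carry.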
Checking that the contents of a rules.yml file are syntactically valid
bin/promtool check rules rules/node_exporter_rules.yml
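promtool check rules validates syntax only. To check that an alert actually fires on given data, promtool also supports unit tests for rules. The following is a minimal sketch for the InstanceDown alert; the test file name (node_exporter_rules_test.yml), its location next to the rules file, and the instance and job values are all assumptions made for illustration.

```yaml
# rules/node_exporter_rules_test.yml (hypothetical file name)
rule_files:
  - node_exporter_rules.yml   # assumed to sit in the same directory as this test file

evaluation_interval: 1m

tests:
  - interval: 1m
    # Made-up target that reports up == 0 on every scrape.
    input_series:
      - series: 'up{job="node", instance="host1:9100"}'
        values: '0 0 0 0 0'
    alert_rule_test:
      # At 3m the alert has been pending longer than its 1m "for" duration.
      - eval_time: 3m
        alertname: InstanceDown
        exp_alerts:
          - exp_labels:
              severity: critical
              instance: host1:9100
              job: node
            exp_annotations:
              summary: "Instance [host1:9100] down"
              description: "[host1:9100] of job [node] has been down for more than 1 minute."
```

Under these assumptions the test would be run with bin/promtool test rules rules/node_exporter_rules_test.yml.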
As I come across more useful rules.yml syntax while working with PromQL, I plan to add it here.
References
https://prometheus.io/docs/prometheus/latest/configuration/alerting_rules/
https://awesome-prometheus-alerts.grep.to/rules.html