wiki/grafana/grafana-setup.md

58 KiB

title description published date tags editor dateCreated
Docker Compose Setup für Grafana mit Prometheus, cAdvisor, blackbox und node-exporter true 2024-06-17T10:36:41.442Z markdown 2024-06-17T10:36:38.989Z

Docker Compose Setup für Grafana mit Prometheus, cAdvisor, blackbox und node-exporter

Diese Docker Compose-Konfiguration ermöglicht die einfache Bereitstellung von Grafana in Verbindung mit Prometheus, cAdvisor, blackbox und node-exporter mit einem Alertmanager und Ntfy unter Verwendung von Docker.

Voraussetzungen

Vor der Verwendung dieser Konfiguration stellen Sie sicher, dass Sie Docker, Docker Compose, Watchtower für automatische Aktualisierungen, Keycloak für die Authentifizierung und Flame als Dashboard mit Nginx als Reverse-Proxy auf Ihrem System installiert und konfiguriert haben.

Die Programm die dazu benötigt werden sind in folgenden Anleitungen Dokumentiert:

nginx-installation

ntfy-einrichten

Dienste

grafana

  • Grafana-Dashboard mit Konfiguration für Prometheus und Loki.
  • Authentifizierung über Keycloak.
  • Konfiguriert für die Verwendung mit Nginx als Reverse-Proxy.

Prometheus

  • Es sammelt Metriken von verschiedenen Quellen, speichert diese in einer Zeitreihendatenbank und ermöglicht die Abfrage und Analyse dieser Metriken.

Alertmanager

  • Der Alertmanager ist ein Bestandteil des Prometheus-Ökosystems und dient der Verwaltung von Alarmmeldungen, die von Prometheus generiert werden.
  • Er ermöglicht die Definition, Routing und Benachrichtigung von Alarmen an verschiedene Empfänger wie E-Mail, PagerDuty oder Slack.

Ntfy

  • ntfy ist ein Befehlszeilenwerkzeug, das Benachrichtigungen über den Abschluss von Befehlen oder anderen Ereignissen auf dem System sendet.
  • Nach Ausführung eines Befehls kann ntfy.sh Benachrichtigungen über verschiedene Kanäle wie Slack, Telegram oder den Systembenachrichtigungsmechanismus senden, um den Benutzer über den Abschluss des Befehls, oder Ereignisses vom Alertmanager zu informieren.

node-exporter

  • Exponiert Host-Metriken für Prometheus.

cadvisor

  • Überwacht Container und gibt Metriken für Prometheus aus.

blackbox

  • Blackbox ist ein Open-Source-Tool, das für die externe Überwachung von Netzwerkdiensten verwendet wird.
  • Es ermöglicht das Überprüfen der Erreichbarkeit und Integrität von Diensten durch das Senden von Anfragen und Auswerten der Antworten, um sicherzustellen, dass Dienste ordnungsgemäß funktionieren.

loki

  • Log-Aggregationsservice für Grafana.

promtail

  • Agent, der Log-Daten an Loki sendet.

Anwendung starten

Erstellen sie einen neuen Ordner für die Grafana Installation und wechseln sie dorthin:

mkdir -p /opt/containers/grafana
cd /opt/containers/grafana

Erstellen sie die folgenden Dateien und passen sie Die Variablen an:

nvim ${FILENAME}

docker-compose.yml

version: "3.8"

services:

  alertmanager:
    image: prom/alertmanager:latest
    volumes:
      - ./alertmanager/config.yml:/etc/alertmanager/config.yml
      - ./alertmanager/web.yml:/etc/alertmanager/web.yml
    command:
      - '--config.file=/etc/alertmanager/config.yml'
      - '--web.config.file=/etc/alertmanager/web.yml'
      - '--web.external-url=https://alertmanager.brothertec.eu/'
    restart: unless-stopped
    #ports:
    #  - 9093:9093
    networks:
      default:
      proxy:
      edge-tier:
      backend:
    environment:
      - VIRTUAL_HOST=alertmanager.${DOMAIN}
      - VIRTUAL_PORT=9093
      - LETSENCRYPT_HOST=alertmanager.${DOMAIN}
      - LETSENCRYPT_EMAIL=admin@${DOMAIN}
    labels:
      - "com.centurylinklabs.watchtower.enable=true"
      - flame.type=application
      - flame.name=Alertmanager
      - flame.url=https://alertmanager.${DOMAIN}
      - flame.icon=monitor-dashboard
	  
	ntfy-alertmanager:
    image: xenrox/ntfy-alertmanager:latest
    #build: builds/ntfy-alertmanager/docker/.
    container_name: ntfy-alertmanager
    volumes:
      - ./ntfy-alertmanager/config.scfg:/etc/ntfy-alertmanager/config
    #ports:
    #  - 127.0.0.1:8080:8080
    restart: unless-stopped
    networks:
      - default
	  
  node-exporter:
    image: prom/node-exporter:latest
    container_name: node-exporter
    restart: unless-stopped
    volumes:
      - /proc:/host/proc:ro
      - /sys:/host/sys:ro
      - /:/rootfs:ro
      - ./node_exporter/textfile_collector/:/node_exporter/textfile_collector:ro
    command:
      - '--path.procfs=/host/proc'
      - '--path.sysfs=/host/sys'
      - '--path.rootfs=/rootfs'
      - '--collector.filesystem.mount-points-exclude="^(/rootfs|/host|)/(sys|proc|dev|host|etc)($$|/)"'
      - '--collector.filesystem.fs-types-exclude="^(sys|proc|auto|cgroup|devpts|ns|au|fuse\.lxc|mqueue)(fs|)$$"'
      - '--collector.textfile.directory=/node_exporter/textfile_collector/'
    networks:
      - default
    labels:
      - "com.centurylinklabs.watchtower.enable=true"

  cadvisor:
    image: gcr.io/cadvisor/cadvisor
    #image: gcr.io/cadvisor/cadvisor-arm64:v0.47.2
    container_name: cadvisor
    restart: unless-stopped
    #ports:
    #  - '8889:8080'
    volumes:
      - /:/rootfs:ro
      - /var/run:/var/run:rw
      - /sys:/sys:ro
      - /var/lib/docker/:/var/lib/docker:ro
    networks:
      - default
      - proxy
      - edge-tier
    environment:
      - VIRTUAL_HOST=cadvisor.${DOMAIN}
      - VIRTUAL_PORT=8080
      - LETSENCRYPT_HOST=cadvisor.${DOMAIN}
      - LETSENCRYPT_EMAIL=admin@${DOMAIN}
    labels:
      - "com.centurylinklabs.watchtower.enable=true"
      - flame.type=application
      - flame.name=Cadvisor-Dashboard
      - flame.url=https://cadvisor.${DOMAIN}
      - flame.icon=monitor-dashboard
	  
	blackbox:
    image: 'prom/blackbox-exporter:latest'
    #ports:
    #  - 9115/tcp
    command:
      - '--config.file=/config/blackbox.yml'
    container_name: blackbox_exporter
    restart: always
    volumes:
      - './blackbox:/config'
    networks:
      default:
      dns:
        ipv4_address: 172.28.0.5
    labels:
      - "com.centurylinklabs.watchtower.enable=true"

  loki:
    image: grafana/loki:latest
    volumes:
      #- './loki-data:/loki'
      - ./loki/loki-config.yaml:/etc/loki/local-config.yaml
    ports:
      - "127.0.0.1:3100:3100"
    command: -config.file=/etc/loki/local-config.yaml
    restart: always
    depends_on:
      - promtail
    networks:
      - default
    labels:
      - "com.centurylinklabs.watchtower.enable=true"

  promtail:
    image: grafana/promtail:latest
    volumes:
      - /var/log:/var/log
      - /var/lib/docker/containers:/var/lib/docker/containers:ro
      - /opt/containers/nginx-proxy/logs/:/config/log/nginx/:ro
      - ./promtail/promtail-config.yaml:/etc/promtail/promtail-config.yaml
    command: -config.file=/etc/promtail/promtail-config.yaml
    restart: always
    networks:
      - default
    labels:
      - "com.centurylinklabs.watchtower.enable=true"
	  
	prometheus:
    image: prom/prometheus:latest
    container_name: prometheus
    restart: unless-stopped
    volumes:
      - ./prometheus/prometheus.yml:/etc/prometheus/prometheus.yml
      - ./prometheus/rules.yml:/etc/prometheus/rules.yml
      - ./prometheus/test.yml:/etc/prometheus/test.yml
      #- ./prom-data:/prometheus
      - ./from-docker-labels.json:/tmp/from-docker-labels.json
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.retention.size=20GB'
    depends_on:
      - node-exporter
      - cadvisor
      - blackbox
    networks:
      default:
      proxy:
      edge-tier:
      backend:
      dns:
        ipv4_address: 172.28.0.8
    environment:
      - VIRTUAL_HOST=prometheus.${DOMAIN}
      - VIRTUAL_PORT=9090
      - LETSENCRYPT_HOST=prometheus.${DOMAIN}
      - LETSENCRYPT_EMAIL=admin@${DOMAIN}
    labels:
      - "com.centurylinklabs.watchtower.enable=true"
      - flame.type=application
      - flame.name=Prometheus-Dashboard
      - flame.url=https://prometheus.${DOMAIN}
      - flame.icon=monitor-dashboard

  grafana:
    image: grafana/grafana-oss
    container_name: grafana
    restart: unless-stopped
    environment:
      GF_SERVER_ROOT_URL: https://dashboard.${DOMAIN}/
      GF_INSTALL_PLUGINS: grafana-clock-panel
      GF_AUTH_ANONYMOUS_ENABLED: true
      GF_AUTH_ANONYMOUS_ORG_NAME: Public
      GF_AUTH_ANONYMOUS_ORG_ROLE: Viewer
      GF_AUTH_GENERIC_OAUTH_ENABLED: "true"
      GF_AUTH_GENERIC_OAUTH_NAME: "SingleSignOn"
      GF_AUTH_GENERIC_OAUTH_ALLOW_SIGN_UP: "true"
      GF_AUTH_GENERIC_OAUTH_CLIENT_ID: "Grafana"
      GF_AUTH_GENERIC_OAUTH_CLIENT_SECRET: "${SECRET}"
      GF_AUTH_GENERIC_OAUTH_SCOPES: "openid profile email offline_access roles Grafana-Mapper"
      GF_AUTH_GENERIC_OAUTH_AUTH_URL: "https://auth.${DOMAIN}/realms/master/protocol/openid-connect/auth"
      GF_AUTH_GENERIC_OAUTH_TOKEN_URL: "https://auth.${DOMAIN}/realms/master/protocol/openid-connect/token"
      GF_AUTH_GENERIC_OAUTH_API_URL: "https://auth.${DOMAIN}/realms/master/protocol/openid-connect/userinfo"
      #GF_AUTH_GENERIC_OAUTH_ROLE_ATTRIBUTE_PATH: "contains(roles[*], 'Admin') && 'Admin' || contains(roles[*], 'Editor') && 'Editor' || 'Viewer'"
      GF_AUTH_GENERIC_OAUTH_ROLE_ATTRIBUTE_PATH: "contains(realm_access.roles[*], 'grafana-admin') && 'Admin' || contains(realm_access.roles[*], 'Editor') && 'Editor' || 'Viewer'"
      GF_ALLOW_ASSIGN_GRAFANA_ADMIN: true
      GF_SMTP_ENABLED: true
      GF_SMTP_HOST: mail.${DOMAIN}:587
      GF_SMTP_USER: grafana@${DOMAIN}
      GF_SMTP_PASSWORD: ${SMTP_PASSWORD}
      GF_SMTP_FROM_ADDRESS: grafana@${DOMAIN}
      GF_FEATURE_TOGGLES_ENABLE: publicDashboards
      VIRTUAL_HOST: dashboard.${DOMAIN}
      VIRTUAL_PORT: 3000
      LETSENCRYPT_HOST: dashboard.${DOMAIN}
      LETSENCRYPT_EMAIL: admin@${DOMAIN}

    #ports:
    # - '3000:3000'
    volumes:
      - './grafana_storage:/var/lib/grafana'

    depends_on:
      - prometheus
      - loki

    labels:
      - "com.centurylinklabs.watchtower.enable=true"
      - flame.type=application
      - flame.name=Grafana-Dashboard
      - flame.url=https://dashboard.${DOMAIN}
      - flame.icon=monitor-dashboard

    networks:
      default:
      proxy:
      edge-tier:
      dns:
        ipv4_address: 172.28.0.27

networks:
  proxy:
    name: nginx-proxy
    external: true
  edge-tier:
    name: edge
    external: true
  dns:
    name: dns
    external: true
  backend:
    driver: bridge

prometheus/prometheus.yml

global:
  scrape_interval: 15s
  evaluation_interval: 15s

rule_files:
  - rules.yml
alerting:
  alertmanagers:
  - scheme: http
    static_configs:
      - targets:
        - 'alertmanager:9093'

scrape_configs:
  - job_name: 'alertmanager'
    static_configs:
      - targets: ['alertmanager:9093']

  - job_name: 'prometheus'
    static_configs:
      - targets: ['prometheus:9090']

  - job_name: 'node-exporter'
    static_configs:
      - targets: ['node-exporter:9100']

  - job_name: 'cadvisor'
    static_configs:
      - targets: ['cadvisor:8080']
	  
  - job_name: 'blackbox'
    metrics_path: /probe
    params:
      module: [http_2xx]  # Look for a HTTP 200 response.
    static_configs:
      - targets:
        - https://${DOMAIN}      # Target to probe with https.
        - https://dashboard.${DOMAIN}
        - https://ntfy.${DOMAIN}
        - https://auth.${DOMAIN}
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: blackbox:9115  # The blackbox exporter's real hostname:port.

prometheus/rules.yml

groups:
- name: alert.rules
  rules:
  - alert: InstanceDown
    expr: up == 0
    for: 1m
    labels:
      severity: "Critical"
    annotations:
      summary: "Endpoint {{ $labels.instance }} down"
      description: "{{ $labels.instance }} of job {{ $labels.job }} has been down for more than 1 minutes."

  - alert: EndpointDown
    expr: probe_success == 0
    for: 10s
    labels:
      severity: "critical"
    annotations:
      summary: "Endpoint {{ $labels.instance }} down"

### Prometheus

  - alert: PrometheusJobMissing
    expr: absent(up{job="prometheus"})
    for: 0m
    labels:
      severity: warning
    annotations:
      summary: Prometheus job missing (instance {{ $labels.instance }})
      description: "A Prometheus job has disappeared\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

  - alert: PrometheusTargetMissing
    expr: up == 0
    for: 0m
    labels:
      severity: critical
    annotations:
      summary: Prometheus target missing (instance {{ $labels.instance }})
      description: "A Prometheus target has disappeared. An exporter might be crashed.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

  - alert: PrometheusAllTargetsMissing
    expr: sum by (job) (up) == 0
    for: 0m
    labels:
      severity: critical
    annotations:
      summary: Prometheus all targets missing (instance {{ $labels.instance }})
      description: "A Prometheus job does not have living target anymore.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

  - alert: PrometheusTargetMissingWithWarmupTime
    expr: sum by (instance, job) ((up == 0) * on (instance) group_right(job) (node_time_seconds - node_boot_time_seconds > 600))
    for: 0m
    labels:
      severity: critical
    annotations:
      summary: Prometheus target missing with warmup time (instance {{ $labels.instance }})
      description: "Allow a job time to start up (10 minutes) before alerting that it's down.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

  - alert: PrometheusConfigurationReloadFailure
    expr: prometheus_config_last_reload_successful != 1
    for: 0m
    labels:
      severity: warning
    annotations:
      summary: Prometheus configuration reload failure (instance {{ $labels.instance }})
      description: "Prometheus configuration reload error\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

  - alert: PrometheusTooManyRestarts
    expr: changes(process_start_time_seconds{job=~"prometheus|pushgateway|alertmanager"}[15m]) > 2
    for: 0m
    labels:
      severity: warning
    annotations:
      summary: Prometheus too many restarts (instance {{ $labels.instance }})
      description: "Prometheus has restarted more than twice in the last 15 minutes. It might be crashlooping.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

  - alert: PrometheusAlertmanagerJobMissing
    expr: absent(up{job="alertmanager"})
    for: 0m
    labels:
      severity: warning
    annotations:
      summary: Prometheus AlertManager job missing (instance {{ $labels.instance }})
      description: "A Prometheus AlertManager job has disappeared\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

  - alert: PrometheusAlertmanagerConfigurationReloadFailure
    expr: alertmanager_config_last_reload_successful != 1
    for: 0m
    labels:
      severity: warning
    annotations:
      summary: Prometheus AlertManager configuration reload failure (instance {{ $labels.instance }})
      description: "AlertManager configuration reload error\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

  - alert: PrometheusAlertmanagerConfigNotSynced
    expr: count(count_values("config_hash", alertmanager_config_hash)) > 1
    for: 0m
    labels:
      severity: warning
    annotations:
      summary: Prometheus AlertManager config not synced (instance {{ $labels.instance }})
      description: "Configurations of AlertManager cluster instances are out of sync\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

  #- alert: PrometheusAlertmanagerE2eDeadManSwitch
  #  expr: vector(1)
  #  for: 0m
  #  labels:
  #    severity: critical
  #  annotations:
  #    summary: Prometheus AlertManager E2E dead man switch (instance {{ $labels.instance }})
  #    description: "Prometheus DeadManSwitch is an always-firing alert. It's used as an end-to-end test of Prometheus through the Alertmanager.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

  - alert: PrometheusNotConnectedToAlertmanager
    expr: prometheus_notifications_alertmanagers_discovered < 1
    for: 0m
    labels:
      severity: critical
    annotations:
      summary: Prometheus not connected to alertmanager (instance {{ $labels.instance }})
      description: "Prometheus cannot connect the alertmanager\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

  - alert: PrometheusRuleEvaluationFailures
    expr: increase(prometheus_rule_evaluation_failures_total[3m]) > 0
    for: 0m
    labels:
      severity: critical
    annotations:
      summary: Prometheus rule evaluation failures (instance {{ $labels.instance }})
      description: "Prometheus encountered {{ $value }} rule evaluation failures, leading to potentially ignored alerts.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

  - alert: PrometheusTemplateTextExpansionFailures
    expr: increase(prometheus_template_text_expansion_failures_total[3m]) > 0
    for: 0m
    labels:
      severity: critical
    annotations:
      summary: Prometheus template text expansion failures (instance {{ $labels.instance }})
      description: "Prometheus encountered {{ $value }} template text expansion failures\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

  - alert: PrometheusRuleEvaluationSlow
    expr: prometheus_rule_group_last_duration_seconds > prometheus_rule_group_interval_seconds
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: Prometheus rule evaluation slow (instance {{ $labels.instance }})
      description: "Prometheus rule evaluation took more time than the scheduled interval. It indicates a slower storage backend access or too complex query.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

  - alert: PrometheusNotificationsBacklog
    expr: min_over_time(prometheus_notifications_queue_length[10m]) > 0
    for: 0m
    labels:
      severity: warning
    annotations:
      summary: Prometheus notifications backlog (instance {{ $labels.instance }})
      description: "The Prometheus notification queue has not been empty for 10 minutes\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

  - alert: PrometheusAlertmanagerNotificationFailing
    expr: rate(alertmanager_notifications_failed_total[1m]) > 0
    for: 0m
    labels:
      severity: critical
    annotations:
      summary: Prometheus AlertManager notification failing (instance {{ $labels.instance }})
      description: "Alertmanager is failing sending notifications\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"


  - alert: PrometheusTargetEmpty
    expr: prometheus_sd_discovered_targets == 0
    for: 0m
    labels:
      severity: critical
    annotations:
      summary: Prometheus target empty (instance {{ $labels.instance }})
      description: "Prometheus has no target in service discovery\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

  - alert: PrometheusTargetScrapingSlow
    expr: prometheus_target_interval_length_seconds{quantile="0.9"} / on (interval, instance, job) prometheus_target_interval_length_seconds{quantile="0.5"} > 1.05
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: Prometheus target scraping slow (instance {{ $labels.instance }})
      description: "Prometheus is scraping exporters slowly since it exceeded the requested interval time. Your Prometheus server is under-provisioned.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

  - alert: PrometheusLargeScrape
    expr: increase(prometheus_target_scrapes_exceeded_sample_limit_total[10m]) > 10
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: Prometheus large scrape (instance {{ $labels.instance }})
      description: "Prometheus has many scrapes that exceed the sample limit\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

  - alert: PrometheusTargetScrapeDuplicate
    expr: increase(prometheus_target_scrapes_sample_duplicate_timestamp_total[5m]) > 0
    for: 0m
    labels:
      severity: warning
    annotations:
      summary: Prometheus target scrape duplicate (instance {{ $labels.instance }})
      description: "Prometheus has many samples rejected due to duplicate timestamps but different values\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

  - alert: PrometheusTsdbCheckpointCreationFailures
    expr: increase(prometheus_tsdb_checkpoint_creations_failed_total[1m]) > 0
    for: 0m
    labels:
      severity: critical
    annotations:
      summary: Prometheus TSDB checkpoint creation failures (instance {{ $labels.instance }})
      description: "Prometheus encountered {{ $value }} checkpoint creation failures\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

  - alert: PrometheusTsdbCheckpointDeletionFailures
    expr: increase(prometheus_tsdb_checkpoint_deletions_failed_total[1m]) > 0
    for: 0m
    labels:
      severity: critical
    annotations:
      summary: Prometheus TSDB checkpoint deletion failures (instance {{ $labels.instance }})
      description: "Prometheus encountered {{ $value }} checkpoint deletion failures\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

  - alert: PrometheusTsdbCompactionsFailed
    expr: increase(prometheus_tsdb_compactions_failed_total[1m]) > 0
    for: 0m
    labels:
      severity: critical
    annotations:
      summary: Prometheus TSDB compactions failed (instance {{ $labels.instance }})
      description: "Prometheus encountered {{ $value }} TSDB compactions failures\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

  - alert: PrometheusTsdbHeadTruncationsFailed
    expr: increase(prometheus_tsdb_head_truncations_failed_total[1m]) > 0
    for: 0m
    labels:
      severity: critical
    annotations:
      summary: Prometheus TSDB head truncations failed (instance {{ $labels.instance }})
      description: "Prometheus encountered {{ $value }} TSDB head truncation failures\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

  - alert: PrometheusTsdbReloadFailures
    expr: increase(prometheus_tsdb_reloads_failures_total[1m]) > 0
    for: 0m
    labels:
      severity: critical
    annotations:
      summary: Prometheus TSDB reload failures (instance {{ $labels.instance }})
      description: "Prometheus encountered {{ $value }} TSDB reload failures\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

  - alert: PrometheusTsdbWalCorruptions
    expr: increase(prometheus_tsdb_wal_corruptions_total[1m]) > 0
    for: 0m
    labels:
      severity: critical
    annotations:
      summary: Prometheus TSDB WAL corruptions (instance {{ $labels.instance }})
      description: "Prometheus encountered {{ $value }} TSDB WAL corruptions\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

  - alert: PrometheusTsdbWalTruncationsFailed
    expr: increase(prometheus_tsdb_wal_truncations_failed_total[1m]) > 0
    for: 0m
    labels:
      severity: critical
    annotations:
      summary: Prometheus TSDB WAL truncations failed (instance {{ $labels.instance }})
      description: "Prometheus encountered {{ $value }} TSDB WAL truncation failures\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

  - alert: PrometheusTimeseriesCardinality
    expr: label_replace(count by(__name__) ({__name__=~".+"}), "name", "$1", "__name__", "(.+)") > 10000
    for: 0m
    labels:
      severity: warning
    annotations:
      summary: Prometheus timeseries cardinality (instance {{ $labels.instance }})
      description: "The \"{{ $labels.name }}\" timeseries cardinality is getting very high: {{ $value }}\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

### Node-Exporter

  - alert: HostOutOfMemory
    expr: (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes * 100 < 10) * on(instance) group_left (nodename) node_uname_info{nodename=~".+"}
    for: 2m
    labels:
      severity: warning
    annotations:
      summary: Host out of memory (instance {{ $labels.instance }})
      description: "Node memory is filling up (< 10% left)\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

  - alert: HostMemoryUnderMemoryPressure
    expr: (rate(node_vmstat_pgmajfault[1m]) > 1000) * on(instance) group_left (nodename) node_uname_info{nodename=~".+"}
    for: 2m
    labels:
      severity: warning
    annotations:
      summary: Host memory under memory pressure (instance {{ $labels.instance }})
      description: "The node is under heavy memory pressure. High rate of major page faults\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

  # You may want to increase the alert manager 'repeat_interval' for this type of alert to daily or weekly
  - alert: HostMemoryIsUnderutilized
    expr: (100 - (rate(node_memory_MemAvailable_bytes[30m]) / node_memory_MemTotal_bytes * 100) < 20) * on(instance) group_left (nodename) node_uname_info{nodename=~".+"}
    for: 1w
    labels:
      severity: info
    annotations:
      summary: Host Memory is underutilized (instance {{ $labels.instance }})
      description: "Node memory is < 20% for 1 week. Consider reducing memory space. (instance {{ $labels.instance }})\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

  - alert: HostUnusualNetworkThroughputIn
    expr: (sum by (instance) (rate(node_network_receive_bytes_total[2m])) / 1024 / 1024 > 100) * on(instance) group_left (nodename) node_uname_info{nodename=~".+"}
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: Host unusual network throughput in (instance {{ $labels.instance }})
      description: "Host network interfaces are probably receiving too much data (> 100 MB/s)\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

  - alert: HostUnusualNetworkThroughputOut
    expr: (sum by (instance) (rate(node_network_transmit_bytes_total[2m])) / 1024 / 1024 > 100) * on(instance) group_left (nodename) node_uname_info{nodename=~".+"}
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: Host unusual network throughput out (instance {{ $labels.instance }})
      description: "Host network interfaces are probably sending too much data (> 100 MB/s)\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

  - alert: HostUnusualDiskReadRate
    expr: (sum by (instance) (rate(node_disk_read_bytes_total[2m])) / 1024 / 1024 > 50) * on(instance) group_left (nodename) node_uname_info{nodename=~".+"}
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: Host unusual disk read rate (instance {{ $labels.instance }})
      description: "Disk is probably reading too much data (> 50 MB/s)\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

  - alert: HostUnusualDiskWriteRate
    expr: (sum by (instance) (rate(node_disk_written_bytes_total[2m])) / 1024 / 1024 > 50) * on(instance) group_left (nodename) node_uname_info{nodename=~".+"}
    for: 2m
    labels:
      severity: warning
    annotations:
      summary: Host unusual disk write rate (instance {{ $labels.instance }})
      description: "Disk is probably writing too much data (> 50 MB/s)\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

  # Please add ignored mountpoints in node_exporter parameters like
  # "--collector.filesystem.ignored-mount-points=^/(sys|proc|dev|run)($|/)".
  # Same rule using "node_filesystem_free_bytes" will fire when disk fills for non-root users.
  - alert: HostOutOfDiskSpace
    expr: ((node_filesystem_avail_bytes * 100) / node_filesystem_size_bytes < 10 and ON (instance, device, mountpoint) node_filesystem_readonly == 0) * on(instance) group_left (nodename) node_uname_info{nodename=~".+"}
    for: 2m
    labels:
      severity: warning
    annotations:
      summary: Host out of disk space (instance {{ $labels.instance }})
      description: "Disk is almost full (< 10% left)\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

  # Please add ignored mountpoints in node_exporter parameters like
  # "--collector.filesystem.ignored-mount-points=^/(sys|proc|dev|run)($|/)".
  # Same rule using "node_filesystem_free_bytes" will fire when disk fills for non-root users.
  - alert: HostDiskWillFillIn24Hours
    expr: ((node_filesystem_avail_bytes * 100) / node_filesystem_size_bytes < 10 and ON (instance, device, mountpoint) predict_linear(node_filesystem_avail_bytes{fstype!~"tmpfs"}[1h], 24 * 3600) < 0 and ON (instance, device, mountpoint) node_filesystem_readonly == 0) * on(instance) group_left (nodename) node_uname_info{nodename=~".+"}
    for: 2m
    labels:
      severity: warning
    annotations:
      summary: Host disk will fill in 24 hours (instance {{ $labels.instance }})
      description: "Filesystem is predicted to run out of space within the next 24 hours at current write rate\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

  - alert: HostOutOfInodes
    expr: (node_filesystem_files_free / node_filesystem_files * 100 < 10 and ON (instance, device, mountpoint) node_filesystem_readonly == 0) * on(instance) group_left (nodename) node_uname_info{nodename=~".+"}
    for: 2m
    labels:
      severity: warning
    annotations:
      summary: Host out of inodes (instance {{ $labels.instance }})
      description: "Disk is almost running out of available inodes (< 10% left)\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

  #- alert: HostFilesystemDeviceError
  #  expr: node_filesystem_device_error == 1
  #  for: 0m
  #  labels:
  #    severity: critical
  #  annotations:
  #    summary: Host filesystem device error (instance {{ $labels.instance }})
  #    description: "{{ $labels.instance }}: Device error with the {{ $labels.mountpoint }} filesystem\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

  - alert: HostInodesWillFillIn24Hours
    expr: (node_filesystem_files_free / node_filesystem_files * 100 < 10 and predict_linear(node_filesystem_files_free[1h], 24 * 3600) < 0 and ON (instance, device, mountpoint) node_filesystem_readonly == 0) * on(instance) group_left (nodename) node_uname_info{nodename=~".+"}
    for: 2m
    labels:
      severity: warning
    annotations:
      summary: Host inodes will fill in 24 hours (instance {{ $labels.instance }})
      description: "Filesystem is predicted to run out of inodes within the next 24 hours at current write rate\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

  - alert: HostUnusualDiskReadLatency
    expr: (rate(node_disk_read_time_seconds_total[1m]) / rate(node_disk_reads_completed_total[1m]) > 0.1 and rate(node_disk_reads_completed_total[1m]) > 0) * on(instance) group_left (nodename) node_uname_info{nodename=~".+"}
    for: 2m
    labels:
      severity: warning
    annotations:
      summary: Host unusual disk read latency (instance {{ $labels.instance }})
      description: "Disk latency is growing (read operations > 100ms)\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

  - alert: HostUnusualDiskWriteLatency
    expr: (rate(node_disk_write_time_seconds_total[1m]) / rate(node_disk_writes_completed_total[1m]) > 0.1 and rate(node_disk_writes_completed_total[1m]) > 0) * on(instance) group_left (nodename) node_uname_info{nodename=~".+"}
    for: 2m
    labels:
      severity: warning
    annotations:
      summary: Host unusual disk write latency (instance {{ $labels.instance }})
      description: "Disk latency is growing (write operations > 100ms)\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

  - alert: HostHighCpuLoad
    expr: (sum by (instance) (avg by (mode, instance) (rate(node_cpu_seconds_total{mode!="idle"}[2m]))) > 0.8) * on(instance) group_left (nodename) node_uname_info{nodename=~".+"}
    for: 10m
    labels:
      severity: warning
    annotations:
      summary: Host high CPU load (instance {{ $labels.instance }})
      description: "CPU load is > 80%\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

  # You may want to increase the alert manager 'repeat_interval' for this type of alert to daily or weekly
  #- alert: HostCpuIsUnderutilized
  #  expr: (100 - (rate(node_cpu_seconds_total{mode="idle"}[30m]) * 100) < 20) * on(instance) group_left (nodename) node_uname_info{nodename=~".+"}
  #  for: 1w
  #  labels:
  #    severity: info
  #  annotations:
  #    summary: Host CPU is underutilized (instance {{ $labels.instance }})
  #    description: "CPU load is < 20% for 1 week. Consider reducing the number of CPUs.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

  - alert: HostCpuStealNoisyNeighbor
    expr: (avg by(instance) (rate(node_cpu_seconds_total{mode="steal"}[5m])) * 100 > 10) * on(instance) group_left (nodename) node_uname_info{nodename=~".+"}
    for: 0m
    labels:
      severity: warning
    annotations:
      summary: Host CPU steal noisy neighbor (instance {{ $labels.instance }})
      description: "CPU steal is > 10%. A noisy neighbor is killing VM performances or a spot instance may be out of credit.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

  - alert: HostCpuHighIowait
    expr: (avg by (instance) (rate(node_cpu_seconds_total{mode="iowait"}[5m])) * 100 > 10) * on(instance) group_left (nodename) node_uname_info{nodename=~".+"}
    for: 0m
    labels:
      severity: warning
    annotations:
      summary: Host CPU high iowait (instance {{ $labels.instance }})
      description: "CPU iowait > 10%. A high iowait means that you are disk or network bound.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

  - alert: HostUnusualDiskIo
    expr: (rate(node_disk_io_time_seconds_total[1m]) > 0.5) * on(instance) group_left (nodename) node_uname_info{nodename=~".+"}
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: Host unusual disk IO (instance {{ $labels.instance }})
      description: "Time spent in IO is too high on {{ $labels.instance }}. Check storage for issues.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

  # 10000 context switches is an arbitrary number.
  # The alert threshold depends on the nature of the application.
  # Please read: https://github.com/samber/awesome-prometheus-alerts/issues/58
  - alert: HostContextSwitching
    expr: ((rate(node_context_switches_total[5m])) / (count without(cpu, mode) (node_cpu_seconds_total{mode="idle"})) > 10000) * on(instance) group_left (nodename) node_uname_info{nodename=~".+"}
    for: 0m
    labels:
      severity: warning
    annotations:
      summary: Host context switching (instance {{ $labels.instance }})
      description: "Context switching is growing on the node (> 10000 / s)\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

  - alert: HostSwapIsFillingUp
    expr: ((1 - (node_memory_SwapFree_bytes / node_memory_SwapTotal_bytes)) * 100 > 80) * on(instance) group_left (nodename) node_uname_info{nodename=~".+"}
    for: 2m
    labels:
      severity: warning
    annotations:
      summary: Host swap is filling up (instance {{ $labels.instance }})
      description: "Swap is filling up (>80%)\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

  - alert: HostSystemdServiceCrashed
    expr: (node_systemd_unit_state{state="failed"} == 1) * on(instance) group_left (nodename) node_uname_info{nodename=~".+"}
    for: 0m
    labels:
      severity: warning
    annotations:
      summary: Host systemd service crashed (instance {{ $labels.instance }})
      description: "systemd service crashed\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

  - alert: HostPhysicalComponentTooHot
    expr: ((node_hwmon_temp_celsius * ignoring(label) group_left(instance, job, node, sensor) node_hwmon_sensor_label{label!="tctl"} > 75)) * on(instance) group_left (nodename) node_uname_info{nodename=~".+"}
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: Host physical component too hot (instance {{ $labels.instance }})
      description: "Physical hardware component too hot\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

  - alert: HostNodeOvertemperatureAlarm
    expr: (node_hwmon_temp_crit_alarm_celsius == 1) * on(instance) group_left (nodename) node_uname_info{nodename=~".+"}
    for: 0m
    labels:
      severity: critical
    annotations:
      summary: Host node overtemperature alarm (instance {{ $labels.instance }})
      description: "Physical node temperature alarm triggered\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

  - alert: HostRaidArrayGotInactive
    expr: (node_md_state{state="inactive"} > 0) * on(instance) group_left (nodename) node_uname_info{nodename=~".+"}
    for: 0m
    labels:
      severity: critical
    annotations:
      summary: Host RAID array got inactive (instance {{ $labels.instance }})
      description: "RAID array {{ $labels.device }} is in a degraded state due to one or more disk failures. The number of spare drives is insufficient to fix the issue automatically.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

  - alert: HostRaidDiskFailure
    expr: (node_md_disks{state="failed"} > 0) * on(instance) group_left (nodename) node_uname_info{nodename=~".+"}
    for: 2m
    labels:
      severity: warning
    annotations:
      summary: Host RAID disk failure (instance {{ $labels.instance }})
      description: "At least one device in RAID array on {{ $labels.instance }} failed. Array {{ $labels.md_device }} needs attention and possibly a disk swap\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

  - alert: HostKernelVersionDeviations
    expr: (count(sum(label_replace(node_uname_info, "kernel", "$1", "release", "([0-9]+.[0-9]+.[0-9]+).*")) by (kernel)) > 1) * on(instance) group_left (nodename) node_uname_info{nodename=~".+"}
    for: 6h
    labels:
      severity: warning
    annotations:
      summary: Host kernel version deviations (instance {{ $labels.instance }})
      description: "Different kernel versions are running\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

  - alert: HostOomKillDetected
    expr: (increase(node_vmstat_oom_kill[1m]) > 0) * on(instance) group_left (nodename) node_uname_info{nodename=~".+"}
    for: 0m
    labels:
      severity: warning
    annotations:
      summary: Host OOM kill detected (instance {{ $labels.instance }})
      description: "OOM kill detected\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

  - alert: HostEdacCorrectableErrorsDetected
    expr: (increase(node_edac_correctable_errors_total[1m]) > 0) * on(instance) group_left (nodename) node_uname_info{nodename=~".+"}
    for: 0m
    labels:
      severity: info
    annotations:
      summary: Host EDAC Correctable Errors detected (instance {{ $labels.instance }})
      description: "Host {{ $labels.instance }} has had {{ printf \"%.0f\" $value }} correctable memory errors reported by EDAC in the last 5 minutes.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

  - alert: HostEdacUncorrectableErrorsDetected
    expr: (node_edac_uncorrectable_errors_total > 0) * on(instance) group_left (nodename) node_uname_info{nodename=~".+"}
    for: 0m
    labels:
      severity: warning
    annotations:
      summary: Host EDAC Uncorrectable Errors detected (instance {{ $labels.instance }})
      description: "Host {{ $labels.instance }} has had {{ printf \"%.0f\" $value }} uncorrectable memory errors reported by EDAC in the last 5 minutes.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

  - alert: HostNetworkReceiveErrors
    expr: (rate(node_network_receive_errs_total[2m]) / rate(node_network_receive_packets_total[2m]) > 0.01) * on(instance) group_left (nodename) node_uname_info{nodename=~".+"}
    for: 2m
    labels:
      severity: warning
    annotations:
      summary: Host Network Receive Errors (instance {{ $labels.instance }})
      description: "Host {{ $labels.instance }} interface {{ $labels.device }} has encountered {{ printf \"%.0f\" $value }} receive errors in the last two minutes.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

  - alert: HostNetworkTransmitErrors
    expr: (rate(node_network_transmit_errs_total[2m]) / rate(node_network_transmit_packets_total[2m]) > 0.01) * on(instance) group_left (nodename) node_uname_info{nodename=~".+"}
    for: 2m
    labels:
      severity: warning
    annotations:
      summary: Host Network Transmit Errors (instance {{ $labels.instance }})
      description: "Host {{ $labels.instance }} interface {{ $labels.device }} has encountered {{ printf \"%.0f\" $value }} transmit errors in the last two minutes.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

  - alert: HostNetworkInterfaceSaturated
    expr: ((rate(node_network_receive_bytes_total{device!~"^tap.*|^vnet.*|^veth.*|^tun.*"}[1m]) + rate(node_network_transmit_bytes_total{device!~"^tap.*|^vnet.*|^veth.*|^tun.*"}[1m])) / node_network_speed_bytes{device!~"^tap.*|^vnet.*|^veth.*|^tun.*"} > 0.8 < 10000) * on(instance) group_left (nodename) node_uname_info{nodename=~".+"}
    for: 1m
    labels:
      severity: warning
    annotations:
      summary: Host Network Interface Saturated (instance {{ $labels.instance }})
      description: "The network interface \"{{ $labels.device }}\" on \"{{ $labels.instance }}\" is getting overloaded.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

  - alert: HostNetworkBondDegraded
    expr: ((node_bonding_active - node_bonding_slaves) != 0) * on(instance) group_left (nodename) node_uname_info{nodename=~".+"}
    for: 2m
    labels:
      severity: warning
    annotations:
      summary: Host Network Bond Degraded (instance {{ $labels.instance }})
      description: "Bond \"{{ $labels.device }}\" degraded on \"{{ $labels.instance }}\".\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

  - alert: HostConntrackLimit
    expr: (node_nf_conntrack_entries / node_nf_conntrack_entries_limit > 0.8) * on(instance) group_left (nodename) node_uname_info{nodename=~".+"}
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: Host conntrack limit (instance {{ $labels.instance }})
      description: "The number of conntrack is approaching limit\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

  - alert: HostClockSkew
    expr: ((node_timex_offset_seconds > 0.05 and deriv(node_timex_offset_seconds[5m]) >= 0) or (node_timex_offset_seconds < -0.05 and deriv(node_timex_offset_seconds[5m]) <= 0)) * on(instance) group_left (nodename) node_uname_info{nodename=~".+"}
    for: 10m
    labels:
      severity: warning
    annotations:
      summary: Host clock skew (instance {{ $labels.instance }})
      description: "Clock skew detected. Clock is out of sync. Ensure NTP is configured correctly on this host.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

  - alert: HostClockNotSynchronising
    expr: (min_over_time(node_timex_sync_status[1m]) == 0 and node_timex_maxerror_seconds >= 16) * on(instance) group_left (nodename) node_uname_info{nodename=~".+"}
    for: 2m
    labels:
      severity: warning
    annotations:
      summary: Host clock not synchronising (instance {{ $labels.instance }})
      description: "Clock not synchronising. Ensure NTP is configured on this host.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

  - alert: HostRequiresReboot
    expr: (node_reboot_required > 0) * on(instance) group_left (nodename) node_uname_info{nodename=~".+"}
    for: 4h
    labels:
      severity: info
    annotations:
      summary: Host requires reboot (instance {{ $labels.instance }})
      description: "{{ $labels.instance }} requires a reboot.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

### cAdvisor

  # This rule can be very noisy in dynamic infra with legitimate container start/stop/deployment.
  - alert: ContainerKilled
    expr: time() - container_last_seen > 600
    for: 0m
    labels:
      severity: warning
    annotations:
      summary: Container killed (instance {{ $labels.instance }})
      description: "A container has disappeared\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

  # This rule can be very noisy in dynamic infra with legitimate container start/stop/deployment.
  - alert: ContainerAbsent
    expr: absent(container_last_seen)
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: Container absent (instance {{ $labels.instance }})
      description: "A container is absent for 5 min\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

  - alert: ContainerHighCpuUtilization
    expr: (sum(rate(container_cpu_usage_seconds_total{name!=""}[3m])) BY (instance, name) * 100) > 80
    for: 2m
    labels:
      severity: warning
    annotations:
      summary: Container High CPU utilization (instance {{ $labels.instance }})
      description: "Container CPU utilization is above 80%\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

  # See https://medium.com/faun/how-much-is-too-much-the-linux-oomkiller-and-used-memory-d32186f29c9d
  - alert: ContainerHighMemoryUsage
    expr: (sum(container_memory_working_set_bytes{name!=""}) BY (instance, name) / sum(container_spec_memory_limit_bytes > 0) BY (instance, name) * 100) > 80
    for: 2m
    labels:
      severity: warning
    annotations:
      summary: Container High Memory usage (instance {{ $labels.instance }})
      description: "Container Memory usage is above 80%\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

  #- alert: ContainerVolumeUsage
  #  expr: (1 - (sum(container_fs_inodes_free{name!=""}) BY (instance) / sum(container_fs_inodes_total) BY (instance))) * 100 > 80
  #  for: 2m
  #  labels:
  #    severity: warning
  #  annotations:
  #    summary: Container Volume usage (instance {{ $labels.instance }})
  #    description: "Container Volume usage is above 80%\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

  - alert: ContainerHighThrottleRate
    expr: rate(container_cpu_cfs_throttled_seconds_total[3m]) > 1
    for: 2m
    labels:
      severity: warning
    annotations:
      summary: Container high throttle rate (instance {{ $labels.instance }})
      description: "Container is being throttled\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

  #- alert: ContainerLowCpuUtilization
  #  expr: (sum(rate(container_cpu_usage_seconds_total{name!=""}[3m])) BY (instance, name) * 100) < 20
  #  for: 7d
  #  labels:
  #    severity: info
  #  annotations:
  #    summary: Container Low CPU utilization (instance {{ $labels.instance }})
  #    description: "Container CPU utilization is under 20% for 1 week. Consider reducing the allocated CPU.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

  #- alert: ContainerLowMemoryUsage
  #  expr: (sum(container_memory_working_set_bytes{name!=""}) BY (instance, name) / sum(container_spec_memory_limit_bytes > 0) BY (instance, name) * 100) < 20
  #  for: 7d
  #  labels:
  #    severity: info
  #  annotations:
  #    summary: Container Low Memory usage (instance {{ $labels.instance }})
  #    description: "Container Memory usage is under 20% for 1 week. Consider reducing the allocated memory.\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

### Custom-Shit

  - alert: DataDiskSpace15%Free
#    expr: 100 - ((node_filesystem_avail_bytes{device="/dev/sda1",mountpoint="/opt/data"} * 100) / node_filesystem_size_bytes{device="/dev/sda1",mountpoint="/opt/data"}) > 85
    expr: 100 - (100 * ((node_filesystem_avail_bytes{mountpoint="/opt/data",fstype!="rootfs"} )  / (node_filesystem_size_bytes{mountpoint="/opt/data",fstype!="rootfs"}) )) > 85
    labels:
      severity: moderate
    annotations:
      summary: "Instance {{ $labels.instance }} is low on disk space"
      description: "diskspace on {{ $labels.instance }} is used over {{ $value }}% ."

  - alert: DataDisk2Space15%Free
#    expr: 100 - ((node_filesystem_avail_bytes{device="/dev/sdb1",mountpoint="/opt/data2"} * 100) / node_filesystem_size_bytes{device="/dev/sdb1",mountpoint="/opt/data2"}) > 85
    expr: 100 - (100 * ((node_filesystem_avail_bytes{mountpoint="/opt/data2",fstype!="rootfs"} )  / (node_filesystem_size_bytes{mountpoint="/opt/data2",fstype!="rootfs"}) )) > 85
    labels:
      severity: moderate
    annotations:
      summary: "Instance {{ $labels.instance }} is low on disk space"
      description: "diskspace on {{ $labels.instance }} is used over {{ $value }}% ."

  - alert: RootDiskSpace10%Free
#    expr: 100 - ((node_filesystem_avail_bytes{device="/dev/nvme0n1p3",mountpoint="/etc/hostname"} * 100) / node_filesystem_size_bytes{device="/dev/nvme0n1p3",mountpoint="/etc/hostname"}) > 90
    expr: 100 - (100 * ((node_filesystem_avail_bytes{mountpoint="/",fstype!="rootfs"} )  / (node_filesystem_size_bytes{mountpoint="/",fstype!="rootfs"}) )) > 90
    labels:
      severity: moderate
    annotations:
      summary: "Instance {{ $labels.instance }} is low on disk space"
      description: "diskspace on {{ $labels.instance }} is used over {{ $value }}% ."

alertmanager/config.yml

global:
  resolve_timeout: 1m
  # The smarthost and SMTP sender used for mail notifications.
  #smtp_smarthost: 'mail.{DOMAIN}:587'
  #smtp_from: 'grafana@{DOMAIN}'
  #smtp_auth_username: 'grafana@{DOMAIN}'
  #smtp_auth_password: '${SECRET}'

route:
  receiver: 'ntfy'
  #receiver: 'Mail Alerts'

receivers:
  - name: "ntfy"
    webhook_configs:
        - url: "http://ntfy-alertmanager:8080"
          http_config:
            basic_auth:
              username: "${USER}"
              password: "${SECRET}"
  - name: 'Mail Alerts'
    email_configs:
      - smarthost: '${DOMAIN}:587'
        auth_username: 'grafana@{DOMAIN}'
        auth_password: "${SECRET}"
        from: 'grafana@{DOMAIN}'
        to: '${USER}@{DOMAIN}'
        send_resolved: true
        headers:
          subject: 'Prometheus Mail Alerts'

ntfy-alertmanager/config.scfg

# Public facing base URL of the service (e.g. https://ntfy-alertmanager.xenrox.net)
# This setting is required for the "Silence" feature.
base-url https://ntfy-alertmanager.${DOMAIN}
# http listen address
http-address :8080
# Log level (either debug, info, warning, error)
log-level debug
# Log format (either text or json)
log-format text
# When multiple alerts are grouped together by Alertmanager, they can either be sent
# each on their own (single mode) or be kept together (multi mode) (either single or multi; default is multi)
alert-mode single
# Optionally protect with HTTP basic authentication
#user webhookUser
#password webhookPass

labels {
    order "severity,instance"

    severity "critical" {
        priority 5
        tags "rotating_light"
        icon "https://foo.com/critical.png"
        # Forward messages which severity "critical" to the specified email address.
        #email-address ${USER}@${DOMAIN}
        # Call the specified number. Use `yes` to pick the first of your verified numbers.
        #call yes
    }

    severity "warning" {
        priority 3
    }

    severity "info" {
        priority 1
    }

    instance "brothertec.eu" {
        tags "computer,example"
    }
}

# Settings for resolved alerts
resolved {
    tags "resolved,partying_face"
    icon "https://foo.com/resolved.png"
    priority 1
}

ntfy {
    # URL of the ntfy topic - required
    topic https://ntfy.brothertec.eu/alertmanager-alerts
    # ntfy authentication via Basic Auth (https://docs.ntfy.sh/publish/#username-password)
    user simono41
    password ${SECRET}
    # ntfy authentication via access tokens (https://docs.ntfy.sh/publish/#access-tokens)
    # Either access-token or a user/password combination can be used - not both.
    #access-token foobar
    # When using (self signed) certificates that cannot be verified, you can instead specify
    # the SHA512 fingerprint.
    # openssl can be used to obtain it:
    # openssl s_client -connect HOST:PORT | openssl x509 -fingerprint -sha512 -noout
    # For convenience ntfy-alertmanager will convert the certificate to lower case and remove all colons.
    certificate-fingerprint 13:6D:2B:88:9C:57:36:D0:81:B4:B2:9C:79:09:27:62:92:CF:B8:6A:6B:D3:AD:46:35:CB:70:17:EB:99:6E:28:08:2A:B8:C6:79:4B:F6:2E:81:79:41:98:1D:53:C8:07:B3:5C:24:5F:B1:8E:B6:FB:66:B5:DD:B4:D0:5C:29:91
    # Forward all messages to the specified email address.
    #email-address ${USER}@${DOMAIN}
    # Call the specified number for all alerts. Use `yes` to pick the first of your verified numbers.
    #call +49123456789
}

alertmanager {
    # If set, the ntfy message will contain a "Silence" button, which can be used
    # to create a silence via the Alertmanager API. Because of limitations in ntfy,
    # the request will be proxied through ntfy-alertmanager. Therefore ntfy-alertmanager
    # needs to be exposed to external network requests and base-url has to be set.
    #
    # When alert-mode is set to "single" all alert labels will be used to create the silence.
    # When it is "multi" common labels between all the alerts will be used. WARNING: This
    # could silence unwanted alerts.
    silence-duration 24h
    # Basic authentication (https://prometheus.io/docs/alerting/latest/https/)
    user simono41
    password ${SECRET}
    # By default the Alertmanager URL gets parsed from the webhook. In case that
    # Alertmanger is not reachable under that URL, it can be overwritten here.
    url https://alertmanager.${DOMAIN}
}

# When the alert-mode is set to single, ntfy-alertmanager will cache each single alert
# to avoid sending recurrences.
cache {
    # The type of cache that will be used (either disabled, memory or redis; default is disabled).
    type redis
    # How long messages stay in the cache for
    duration 24h

    # Memory cache settings
    # Interval in which the cache is cleaned up
    #cleanup-interval 1h

    # Redis cache settings
    # URL to connect to redis (default: redis://localhost:6379)
    #redis-url redis://user:password@localhost:6789/3
    redis-url redis://redis:6379
}

blackbox/blackbox.yml

modules:
  http_2xx:
    prober: http
    http:
      preferred_ip_protocol: "ip4"
  http_post_2xx:
    prober: http
    http:
      method: POST
  http_basic_auth:
    prober: http
    timeout: 15s
    http:
      preferred_ip_protocol: "ip4"
      method: POST
      basic_auth:
        username: "${USER}"
        password: "${SECRET}"
  tcp_connect:
    prober: tcp
  pop3s_banner:
    prober: tcp
    tcp:
      query_response:
      - expect: "^+OK"
      tls: true
      tls_config:
        insecure_skip_verify: false
  grpc:
    prober: grpc
    grpc:
      tls: true
      preferred_ip_protocol: "ip4"
  grpc_plain:
    prober: grpc
    grpc:
      tls: false
      service: "service1"
  ssh_banner:
    prober: tcp
    tcp:
      query_response:
      - expect: "^SSH-2.0-"
      - send: "SSH-2.0-blackbox-ssh-check"
  irc_banner:
    prober: tcp
    tcp:
      query_response:
      - send: "NICK prober"
      - send: "USER prober prober prober :prober"
      - expect: "PING :([^ ]+)"
        send: "PONG ${1}"
      - expect: "^:[^ ]+ 001"
  icmp:
    prober: icmp
  icmp_ttl5:
    prober: icmp
    timeout: 5s
    icmp:
      ttl: 5

  smtp_starttls:
    prober: tcp
    timeout: 20s
    tcp:
      preferred_ip_protocol: ip4
      tls_config:
        insecure_skip_verify: true
      query_response:
        - expect: "^220 ([^ ]+) ESMTP (.+)$"
        - send: "EHLO prober"
        - expect: "^250-(.*)"
        - send: "STARTTLS"
        - expect: "^220"
        - starttls: true
        - send: "EHLO prober"
        - expect: "^250-"
        - send: "QUIT"

loki/loki-config.yaml

auth_enabled: false

server:
  http_listen_port: 3100
  grpc_listen_port: 9096

common:
  instance_addr: 127.0.0.1
  path_prefix: /loki
  storage:
    filesystem:
      chunks_directory: /loki/chunks
      rules_directory: /loki/rules
  replication_factor: 1
  ring:
    kvstore:
      store: inmemory

query_range:
  results_cache:
    cache:
      embedded_cache:
        enabled: true
        max_size_mb: 100

limits_config:
   reject_old_samples: true
   reject_old_samples_max_age: 168h
   retention_period: 360h
   max_query_series: 100000
   max_query_parallelism: 2

schema_config:
  configs:
    - from: 2020-10-24
      store: boltdb-shipper
      object_store: filesystem
      schema: v11
      index:
        prefix: index_
        period: 24h

ruler:
  alertmanager_url: http://alertmanager:9093

# By default, Loki will send anonymous, but uniquely-identifiable usage and configuration
# analytics to Grafana Labs. These statistics are sent to https://stats.grafana.org/
#
# Statistics help us better understand how Loki is used, and they show us performance
# levels for most users. This helps us prioritize features and documentation.
# For more information on what's sent, look at
# https://github.com/grafana/loki/blob/main/pkg/usagestats/stats.go
# Refer to the buildReport method to see what goes into a report.
#
# If you would like to disable reporting, uncomment the following lines:
#analytics:
#  reporting_enabled: false

promtail/promtail-config.yaml

server:
  http_listen_port: 9080
  grpc_listen_port: 0

positions:
  filename: /tmp/positions.yaml

clients:
  - url: http://loki:3100/loki/api/v1/push

scrape_configs:
- job_name: system
  static_configs:
  - targets:
      - localhost
    labels:
      job: varlogs
      __path__: /var/log/*log

- job_name: nginx
  static_configs:
  - targets:
      - localhost
    labels:
      job: nginxlogs
      __path__: /config/log/nginx/*log

- job_name: containers
  static_configs:
  - targets:
      - localhost
    labels:
      job: containerlogs
      __path__: /var/lib/docker/containers/*/*log

  pipeline_stages:
  - json:
      expressions:
        output: log
        stream: stream
        attrs:
  - json:
      expressions:
        tag:
      source: attrs
  - regex:
      expression: (?P<image_name>(?:[^|]*[^|])).(?P<container_name>(?:[^|]*[^|])).(?P<image_id>(?:[^|]*[^|])).(?P<container_id>(?:[^|]*[^|]))
      source: tag
  - timestamp:
      format: RFC3339Nano
      source: time
  - labels:
      tag:
      stream:
      image_name:
      container_name:
      image_id:
      container_id:
  - output:
      source: output

Führen Sie den folgenden Befehl im Verzeichnis aus, in dem sich die docker-compose.yml-Datei befindet:

docker-compose up -d

Dies startet die Dienste im Hintergrund.

Anpassungen

Passen Sie die Umgebungsvariablen, Volumes, Netzwerke und Labels in der docker-compose.yml-Datei an Ihre spezifischen Anforderungen an.

Weitere Informationen

Für weitere Informationen zu den einzelnen Diensten und deren Konfigurationen, siehe die entsprechenden Dokumentationen: