Overview
Flying blind on your infrastructure is a liability. Without metrics you are left guessing whether a host is healthy, when a disk will fill, or whether a service silently died overnight. Prometheus and Grafana solve this with a pull-based metrics pipeline that is easy to deploy, cheaply self-hosted, and battle-tested at scale.
By the end of this project you will have a running observability stack that collects Linux host metrics, Docker container stats, and HTTP endpoint health, visualises everything through curated Grafana dashboards, and pages you over email or Slack when something goes wrong. The entire stack runs in Docker Compose — no Kubernetes required — and can be stood up on a single $5/month VPS or an old homelab node.
What you will build
- Prometheus — time-series database that scrapes metrics on a configurable interval
- Node Exporter — exposes CPU, memory, disk, and network metrics from the Linux host
- cAdvisor — exposes per-container resource usage metrics from the Docker daemon
- Blackbox Exporter — probes HTTP/HTTPS endpoints and returns availability and latency metrics
- AlertManager — receives firing alerts from Prometheus and routes them to email or Slack
- Grafana — dashboard and visualisation UI that queries Prometheus as a data source
Architecture
┌────────────────────────────────────────────────────────────────┐
│ Docker host                                                     │
│                                                                 │
│  Node Exporter :9100 ──┐                                        │
│  cAdvisor      :8080 ──┼──► Prometheus :9090 ──► Grafana :3000  │
│  Blackbox Exp  :9115 ──┘           │                            │
│                                    └──► AlertManager :9093      │
│                                              │                  │
└──────────────────────────────────────────────┼──────────────────┘
                                               ▼
                                     Email / Slack webhook
Prometheus pulls (scrapes) metrics from each exporter on a configurable interval (15 s in this stack). This inversion of control means exporters are stateless — they just expose an HTTP endpoint — and Prometheus owns the retention policy. Grafana sits in front of Prometheus's query API and turns PromQL results into panels. AlertManager receives rule violations from Prometheus and handles deduplication, silencing, and routing.
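Each exporter is just an HTTP endpoint serving plain-text metrics, so once the stack from Step 7 is running you can look at exactly what Prometheus scrapes. A quick sketch: the exporter ports are not published on the host in this Compose file, so the request is made through the Prometheus container, whose busybox base image provides wget:

# Fetch the Node Exporter's exposition page from inside the monitoring network
docker exec prometheus wget -qO- http://node-exporter:9100/metrics | grep '^node_load'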
All six services share a single Docker network (monitoring) so they can resolve each other by container name. Only Grafana, Prometheus, and AlertManager publish ports to the host; everything else stays internal to the network.
Prerequisites
- Docker Engine 24+ and Docker Compose v2
- A Linux host (bare-metal, VM, or VPS) — 2 vCPU / 2 GB RAM is sufficient
- A domain or local DNS for the Grafana reverse proxy (optional but recommended)
- An SMTP account or Slack incoming webhook for alert routing
Step 1 — Project layout
Create the directory tree before writing any configs:
mkdir -p monitoring/{prometheus/rules,alertmanager,grafana/provisioning/{datasources,dashboards}}
cd monitoring

Final layout:
monitoring/
├── docker-compose.yml
├── prometheus/
│ ├── prometheus.yml
│ └── rules/
│ └── alerts.yml
├── alertmanager/
│ └── alertmanager.yml
└── grafana/
└── provisioning/
├── datasources/
│ └── prometheus.yml
└── dashboards/
└── dashboards.yml
Step 2 — Docker Compose
docker-compose.yml:
version: "3.9"
networks:
monitoring:
driver: bridge
volumes:
prometheus_data: {}
grafana_data: {}
services:
prometheus:
image: prom/prometheus:v2.52.0
container_name: prometheus
restart: unless-stopped
networks: [monitoring]
ports:
- "9090:9090"
volumes:
- ./prometheus/prometheus.yml:/etc/prometheus/prometheus.yml:ro
- ./prometheus/rules:/etc/prometheus/rules:ro
- prometheus_data:/prometheus
command:
- "--config.file=/etc/prometheus/prometheus.yml"
- "--storage.tsdb.path=/prometheus"
- "--storage.tsdb.retention.time=30d"
- "--web.enable-lifecycle"
node-exporter:
image: prom/node-exporter:v1.8.1
container_name: node-exporter
restart: unless-stopped
networks: [monitoring]
pid: host
volumes:
- /proc:/host/proc:ro
- /sys:/host/sys:ro
- /:/rootfs:ro
    command:
      - "--path.procfs=/host/proc"
      - "--path.sysfs=/host/sys"
      - "--path.rootfs=/rootfs"
      - "--collector.filesystem.mount-points-exclude=^/(sys|proc|dev|host|etc)($$|/)"
cadvisor:
image: gcr.io/cadvisor/cadvisor:v0.49.1
container_name: cadvisor
restart: unless-stopped
networks: [monitoring]
privileged: true
devices:
- /dev/kmsg
volumes:
- /:/rootfs:ro
- /var/run:/var/run:ro
- /sys:/sys:ro
- /var/lib/docker:/var/lib/docker:ro
- /cgroup:/cgroup:ro
blackbox-exporter:
image: prom/blackbox-exporter:v0.25.0
container_name: blackbox-exporter
restart: unless-stopped
networks: [monitoring]
alertmanager:
image: prom/alertmanager:v0.27.0
container_name: alertmanager
restart: unless-stopped
networks: [monitoring]
ports:
- "9093:9093"
volumes:
- ./alertmanager/alertmanager.yml:/etc/alertmanager/alertmanager.yml:ro
grafana:
image: grafana/grafana:11.1.0
container_name: grafana
restart: unless-stopped
networks: [monitoring]
ports:
- "3000:3000"
environment:
- GF_SECURITY_ADMIN_USER=admin
- GF_SECURITY_ADMIN_PASSWORD=changeme # change this
- GF_USERS_ALLOW_SIGN_UP=false
- GF_SERVER_ROOT_URL=https://grafana.example.com
volumes:
- grafana_data:/var/lib/grafana
      - ./grafana/provisioning:/etc/grafana/provisioning:ro

Security note: In production, move GF_SECURITY_ADMIN_PASSWORD to a .env file excluded from version control, and put Grafana behind a reverse proxy with TLS (Traefik or Nginx).
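A minimal sketch of that pattern: keep the secret in a .env file next to docker-compose.yml (Compose reads it automatically for variable interpolation) and reference it from the service definition. The variable name here is an arbitrary choice.

# .env  (add this file to .gitignore)
GRAFANA_ADMIN_PASSWORD=use-a-long-random-string

# in docker-compose.yml, replace the hard-coded value with:
#   - GF_SECURITY_ADMIN_PASSWORD=${GRAFANA_ADMIN_PASSWORD}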
Step 3 — Prometheus configuration
prometheus/prometheus.yml:
global:
scrape_interval: 15s
evaluation_interval: 15s
alerting:
alertmanagers:
- static_configs:
- targets: ["alertmanager:9093"]
rule_files:
- "/etc/prometheus/rules/*.yml"
scrape_configs:
- job_name: prometheus
static_configs:
- targets: ["localhost:9090"]
- job_name: node
static_configs:
- targets: ["node-exporter:9100"]
- job_name: cadvisor
static_configs:
- targets: ["cadvisor:8080"]
- job_name: blackbox_http
metrics_path: /probe
params:
module: [http_2xx]
static_configs:
- targets:
- https://labs.cosmicbytez.ca
- https://grafana.example.com
relabel_configs:
- source_labels: [__address__]
target_label: __param_target
- source_labels: [__param_target]
target_label: instance
- target_label: __address__
        replacement: blackbox-exporter:9115

The blackbox_http job is a multi-target scrape pattern: for each URL in targets, Prometheus relabels the address to point at the Blackbox Exporter and passes the real URL as a query parameter. The exporter performs an HTTP probe and returns PASS/FAIL plus response time.
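The Blackbox Exporter image ships with a built-in config that already defines an http_2xx module, so nothing extra is needed for this guide. If you later want to tune probe behaviour (timeouts, redirect handling, IP protocol), you can mount your own module file; a minimal sketch, assuming you mount it over the image's default path /etc/blackbox_exporter/config.yml:

# blackbox/blackbox.yml (hypothetical path; mount to /etc/blackbox_exporter/config.yml)
modules:
  http_2xx:
    prober: http
    timeout: 5s
    http:
      preferred_ip_protocol: "ip4"   # try IPv4 before IPv6
      follow_redirects: true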
Step 4 — Alert rules
prometheus/rules/alerts.yml:
groups:
- name: host
rules:
- alert: HostDown
expr: up == 0
for: 2m
labels:
severity: critical
annotations:
summary: "Exporter {{ $labels.job }} is down"
description: "Target {{ $labels.instance }} has been unreachable for more than 2 minutes."
- alert: HighCPU
expr: 100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 90
for: 5m
labels:
severity: warning
annotations:
summary: "High CPU on {{ $labels.instance }}"
description: "CPU usage is {{ $value | printf \"%.1f\" }}% (threshold 90%)."
- alert: DiskFillingUp
expr: |
(node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"}) * 100 < 15
for: 10m
labels:
severity: warning
annotations:
summary: "Disk filling up on {{ $labels.instance }}"
description: "Root filesystem has {{ $value | printf \"%.1f\" }}% free space remaining."
- alert: MemoryPressure
expr: |
(node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100 < 10
for: 5m
labels:
severity: warning
annotations:
summary: "Low memory on {{ $labels.instance }}"
description: "Available memory is {{ $value | printf \"%.1f\" }}% of total."
- alert: HTTPEndpointDown
expr: probe_success == 0
for: 3m
labels:
severity: critical
annotations:
summary: "HTTP probe failed for {{ $labels.instance }}"
description: "{{ $labels.instance }} has been unreachable for more than 3 minutes."
- alert: SSLCertExpiringSoon
expr: probe_ssl_earliest_cert_expiry - time() < 86400 * 14
for: 1h
labels:
severity: warning
annotations:
summary: "SSL cert expiring soon for {{ $labels.instance }}"
description: "Certificate expires in {{ $value | humanizeDuration }}."Step 5 — AlertManager routing
alertmanager/alertmanager.yml:
global:
smtp_smarthost: "smtp.example.com:587"
smtp_from: "alerts@example.com"
smtp_auth_username: "alerts@example.com"
smtp_auth_password: "your-smtp-password" # use env var in production
route:
receiver: default
group_by: [alertname, instance]
group_wait: 30s
group_interval: 5m
repeat_interval: 3h
routes:
- match:
severity: critical
receiver: critical-alerts
continue: true
receivers:
- name: default
email_configs:
- to: "you@example.com"
send_resolved: true
- name: critical-alerts
email_configs:
- to: "oncall@example.com"
send_resolved: true
slack_configs:
- api_url: "https://hooks.slack.com/services/YOUR/WEBHOOK/URL"
channel: "#alerts"
title: "{{ .CommonAnnotations.summary }}"
text: "{{ range .Alerts }}{{ .Annotations.description }}\n{{ end }}"
send_resolved: true
inhibit_rules:
- source_match:
severity: critical
target_match:
severity: warning
    equal: [instance]

The inhibit_rules block suppresses warning-level alerts for the same instance when a critical alert is already firing — this prevents alert storms where a downed host generates both a HostDown critical and multiple metric-based warnings simultaneously. (Note that equal lists only instance: requiring alertname to match as well would stop the critical from silencing differently named warnings.)
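You can lint this file before starting the stack with amtool, which is bundled in the AlertManager image; a sketch assuming the directory layout from Step 1:

docker run --rm \
  -v "$PWD/alertmanager:/etc/alertmanager:ro" \
  --entrypoint amtool \
  prom/alertmanager:v0.27.0 \
  check-config /etc/alertmanager/alertmanager.yml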
Step 6 — Grafana provisioning
Grafana supports automatic provisioning of data sources and dashboards via YAML files, eliminating manual UI setup after restarts.
grafana/provisioning/datasources/prometheus.yml:
apiVersion: 1
datasources:
- name: Prometheus
type: prometheus
access: proxy
url: http://prometheus:9090
isDefault: true
jsonData:
timeInterval: "15s"
      httpMethod: POST

grafana/provisioning/dashboards/dashboards.yml:
apiVersion: 1
providers:
- name: default
type: file
updateIntervalSeconds: 30
options:
path: /var/lib/grafana/dashboards
      foldersFromFilesStructure: true

To import community dashboards (JSON files exported from grafana.com), download them and mount under /var/lib/grafana/dashboards. The most useful ones for this stack:
| Dashboard ID | Name |
|---|---|
| 1860 | Node Exporter Full |
| 14282 | Kubernetes / Container (cAdvisor) |
| 7587 | Prometheus Blackbox Exporter |
| 9578 | Node Exporter for Prometheus Dashboard |
Download via the Grafana UI at Dashboards → Import → Enter ID, or pull the JSON directly:
curl -o grafana/dashboards/node-exporter.json \
"https://grafana.com/api/dashboards/1860/revisions/latest/download"Then add a volume mount in docker-compose.yml:
grafana:
volumes:
- ./grafana/dashboards:/var/lib/grafana/dashboards:ro
    - ./grafana/provisioning:/etc/grafana/provisioning:ro

Step 7 — Start the stack
# Pull all images first (optional but faster on first run)
docker compose pull
# Start everything detached
docker compose up -d
# Watch startup logs
docker compose logs -f --tail=50

Check that all six containers are healthy:
docker compose ps

Expected output (all running):
NAME STATUS
alertmanager running
blackbox-exporter running
cadvisor running
grafana running
node-exporter running
prometheus running
Testing
1. Verify Prometheus targets
Open http://<host>:9090/targets. Every target across the four scrape jobs (prometheus, node, cadvisor, blackbox_http) should show a green UP status; the blackbox_http job lists one target per probed URL. A red DOWN means a network or port issue; check docker compose logs <container>.
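The same health information is available from the Prometheus HTTP API, which is handy for scripting a check; a sketch, assuming jq is installed on the host:

curl -s http://localhost:9090/api/v1/targets \
  | jq -r '.data.activeTargets[] | "\(.labels.job)  \(.labels.instance)  \(.health)"'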
2. Run a manual PromQL query
In the Prometheus UI at http://<host>:9090/graph, run:
100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)

You should see a time-series value representing current CPU utilisation percentage. A result of 0.8 means the host is 0.8% busy — healthy for an idle node.
3. Verify Grafana login
Navigate to http://<host>:3000, log in with the admin credentials from docker-compose.yml, and confirm the Prometheus data source shows a green checkmark under Connections → Data sources → Prometheus → Test.
4. Import and verify the Node Exporter Full dashboard
Go to Dashboards → Import, enter ID 1860, and select the Prometheus data source. You should immediately see CPU, memory, disk I/O, and network panels populated with live data.
5. Trigger a test alert
Temporarily lower the CPU alert threshold to 0 to force it to fire:
expr: 100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 0

Reload the Prometheus config with:

curl -X POST http://localhost:9090/-/reload

Then check http://<host>:9090/alerts — the HighCPU alert should appear in PENDING state (it fires after the for: 5m hold period). Revert the threshold and reload again.
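The /-/reload endpoint (enabled by the --web.enable-lifecycle flag) rejects an invalid config, but it is cheaper to catch mistakes before reloading. A sketch using promtool, which is bundled in the Prometheus image and validates both the main config and every referenced rule file:

docker run --rm \
  -v "$PWD/prometheus:/etc/prometheus:ro" \
  --entrypoint promtool \
  prom/prometheus:v2.52.0 \
  check config /etc/prometheus/prometheus.yml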
6. Verify Blackbox probes
curl "http://localhost:9115/probe?target=https://labs.cosmicbytez.ca&module=http_2xx"Look for probe_success 1 in the output. A 0 means the target is unreachable or returned a non-2xx status.
Hardening for production
Prometheus and AlertManager should not be publicly accessible. Two options:
- Bind only to localhost: In docker-compose.yml, change the port bindings to 127.0.0.1:9090:9090 and 127.0.0.1:9093:9093, then access them over an SSH tunnel or a VPN (see the tunnel sketch after this list).
- Put behind Traefik with basic auth: If you followed the Traefik reverse proxy project, add Traefik labels to the Prometheus service and restrict access with a basic-auth middleware:
prometheus:
labels:
- "traefik.enable=true"
- "traefik.http.routers.prometheus.rule=Host(`prometheus.example.com`)"
- "traefik.http.routers.prometheus.tls.certresolver=letsencrypt"
- "traefik.http.routers.prometheus.middlewares=prometheus-auth"
- "traefik.http.middlewares.prometheus-auth.basicauth.users=admin:$$apr1$$..."Generate the bcrypt hash with: htpasswd -nb admin yourpassword
Grafana hardening checklist:
- Change the default admin password immediately
- Set GF_AUTH_ANONYMOUS_ENABLED=false
- Enable GF_SECURITY_COOKIE_SECURE=true when behind TLS
- Rotate API keys and revoke unused service accounts regularly
- Review Grafana's RBAC — viewers should not have Edit permission on dashboards
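The environment-variable items translate directly into the Grafana service in docker-compose.yml; a sketch of the additions (the cookie flag only makes sense once Grafana is actually served over TLS):

    environment:
      - GF_SECURITY_ADMIN_PASSWORD=${GRAFANA_ADMIN_PASSWORD}   # from .env, see Step 2
      - GF_USERS_ALLOW_SIGN_UP=false
      - GF_AUTH_ANONYMOUS_ENABLED=false
      - GF_SECURITY_COOKIE_SECURE=true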
Retention and storage sizing
Prometheus compresses time-series data aggressively. As a rough guide:
| Hosts monitored | Scrape interval | 30-day storage |
|---|---|---|
| 1 host, 3 exporters | 15 s | ~500 MB |
| 5 hosts, 3 exporters each | 15 s | ~2.5 GB |
| 20 hosts | 15 s | ~10 GB |
The --storage.tsdb.retention.time=30d flag in the Compose file controls how far back Prometheus stores data. Adjust to 15d for smaller disks or 90d if you have the space. For long-term retention (years), add Thanos or VictoriaMetrics as a remote write target.
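To see how much space the TSDB is actually using, two quick checks (the Prometheus image is busybox-based, so du is available inside the container):

# On-disk size of the TSDB volume
docker exec prometheus du -sh /prometheus

# Head-block and series statistics from the Prometheus API
curl -s http://localhost:9090/api/v1/status/tsdb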
Extensions and next steps
Add more exporters:
- mysqld_exporter / postgres_exporter — database metrics
- redis_exporter — cache hit rates and memory
- snmp_exporter — network switch and router metrics via SNMP
- smartctl_exporter — hard disk S.M.A.R.T. data
Upgrade alerting:
- Integrate PagerDuty or OpsGenie for on-call escalation
- Add silence/inhibit rules in AlertManager to suppress planned-maintenance noise
- Use Grafana OnCall (open-source) for a full on-call schedule UI
Long-term storage with VictoriaMetrics: Replace the Prometheus TSDB with VictoriaMetrics as a drop-in remote write target for 10x better compression and multi-year retention without a Thanos sidecar:
victoriametrics:
image: victoriametrics/victoria-metrics:v1.101.0
command:
- "--retentionPeriod=12" # months
volumes:
    - vm_data:/storage

Then add to prometheus.yml:
remote_write:
  - url: http://victoriametrics:8428/api/v1/write

Automate dashboard-as-code with Grafonnet: Manage dashboards in Jsonnet source files instead of JSON blobs, enabling diff-friendly version control and templated multi-environment dashboards.
Pair with a log stack: Prometheus handles metrics; logs need a separate pipeline. Add Loki + Promtail alongside this stack for correlated metrics-and-logs investigations directly inside Grafana — both data sources render in the same Explore view.
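A minimal sketch of what that addition could look like in the same Compose file. The image tags and the promtail config path are assumptions, and Promtail still needs a small config of its own telling it where Loki listens (port 3100 by default) and which log files to ship:

  loki:
    image: grafana/loki:2.9.8
    networks: [monitoring]

  promtail:
    image: grafana/promtail:2.9.8
    networks: [monitoring]
    volumes:
      - /var/log:/var/log:ro
      - ./promtail/promtail.yml:/etc/promtail/config.yml:ro

Then provision a Loki data source pointing at http://loki:3100, alongside the Prometheus one, so both appear in the same Explore view.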