PROJECT · Intermediate

Build a Production Monitoring Stack with Prometheus and Grafana

Deploy a full observability stack — Prometheus metrics collection, Grafana dashboards, AlertManager notifications, and three exporters — all containerized with Docker Compose and secured for homelab or small-team production use.

Dylan H.

Projects

May 13, 2026
8 min read
3-5 hours

Tools & Technologies

Docker · Docker Compose · Prometheus · Grafana · AlertManager · Node Exporter · cAdvisor · Blackbox Exporter

Overview

Flying blind on your infrastructure is a liability. Without metrics you are left guessing whether a host is healthy, when a disk will fill, or whether a service silently died overnight. Prometheus and Grafana solve this with a pull-based metrics pipeline that is easy to deploy, cheap to self-host, and battle-tested at scale.

By the end of this project you will have a running observability stack that collects Linux host metrics, Docker container stats, and HTTP endpoint health, visualises everything through curated Grafana dashboards, and pages you over email or Slack when something goes wrong. The entire stack runs in Docker Compose — no Kubernetes required — and can be stood up on a single $5/month VPS or an old homelab node.

What you will build

  • Prometheus — time-series database that scrapes metrics on a configurable interval
  • Node Exporter — exposes CPU, memory, disk, and network metrics from the Linux host
  • cAdvisor — exposes per-container resource usage metrics from the Docker daemon
  • Blackbox Exporter — probes HTTP/HTTPS endpoints and returns availability and latency metrics
  • AlertManager — receives firing alerts from Prometheus and routes them to email or Slack
  • Grafana — query engine and dashboard UI backed by Prometheus as a data source

Architecture

  ┌────────────────────────────────────────────────────────────────┐
  │  Docker host                                                   │
  │                                                                │
  │  Node Exporter :9100  ──┐                                      │
  │  cAdvisor      :8080  ──┼──► Prometheus :9090 ──► Grafana :3000│
  │  Blackbox Exp  :9115  ──┘         │                            │
  │                                   └──► AlertManager :9093      │
  │                                              │                 │
  └──────────────────────────────────────────────┼─────────────────┘
                                                 ▼
                                       Email / Slack webhook

Prometheus pulls (scrapes) metrics from each exporter on a configurable interval (15 s in this build; Prometheus's own default is 1 m). This inversion of control means exporters are stateless — they just expose an HTTP endpoint — and Prometheus owns the retention policy. Grafana sits in front of Prometheus's query API and turns PromQL results into panels. AlertManager receives rule violations from Prometheus and handles deduplication, silencing, and routing.
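
Under the hood, a scrape is just an HTTP GET against an exporter's /metrics endpoint. Since the exporters publish no host ports, you can test one from inside the Docker network once the stack is up (the prom/prometheus image is busybox-based, so wget is available; the sample output below is illustrative):

docker compose exec prometheus wget -qO- http://node-exporter:9100/metrics | grep ^node_load

# Illustrative output in the Prometheus exposition format:
node_load1 0.42
node_load5 0.31
node_load15 0.24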

All six services share a single Docker network (monitoring) so they can resolve each other by container name. Only Grafana, Prometheus, and AlertManager expose ports to the host; everything else is internal.


Prerequisites

  • Docker Engine 24+ and Docker Compose v2
  • A Linux host (bare-metal, VM, or VPS) — 2 vCPU / 2 GB RAM is sufficient
  • A domain or local DNS for the Grafana reverse proxy (optional but recommended)
  • An SMTP account or Slack incoming webhook for alert routing
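
A quick sanity check that the container runtime prerequisite is met:

docker --version          # want Engine 24+
docker compose version    # want Compose v2.x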

Step 1 — Project layout

Create the directory tree before writing any configs:

mkdir -p monitoring/{prometheus/rules,alertmanager,grafana/provisioning/{datasources,dashboards}}
cd monitoring

Final layout:

monitoring/
├── docker-compose.yml
├── prometheus/
│   ├── prometheus.yml
│   └── rules/
│       └── alerts.yml
├── alertmanager/
│   └── alertmanager.yml
└── grafana/
    └── provisioning/
        ├── datasources/
        │   └── prometheus.yml
        └── dashboards/
            └── dashboards.yml
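
It also pays to pre-create the config files before the first docker compose up: if a bind-mounted source path does not exist, Docker creates it as a directory, and the container that expects a file there will fail to start.

touch prometheus/prometheus.yml \
      prometheus/rules/alerts.yml \
      alertmanager/alertmanager.yml \
      grafana/provisioning/datasources/prometheus.yml \
      grafana/provisioning/dashboards/dashboards.yml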

Step 2 — Docker Compose

docker-compose.yml:

networks:
  monitoring:
    driver: bridge
 
volumes:
  prometheus_data: {}
  grafana_data: {}
 
services:
 
  prometheus:
    image: prom/prometheus:v2.52.0
    container_name: prometheus
    restart: unless-stopped
    networks: [monitoring]
    ports:
      - "9090:9090"
    volumes:
      - ./prometheus/prometheus.yml:/etc/prometheus/prometheus.yml:ro
      - ./prometheus/rules:/etc/prometheus/rules:ro
      - prometheus_data:/prometheus
    command:
      - "--config.file=/etc/prometheus/prometheus.yml"
      - "--storage.tsdb.path=/prometheus"
      - "--storage.tsdb.retention.time=30d"
      - "--web.enable-lifecycle"
 
  node-exporter:
    image: prom/node-exporter:v1.8.1
    container_name: node-exporter
    restart: unless-stopped
    networks: [monitoring]
    pid: host
    volumes:
      - /proc:/host/proc:ro
      - /sys:/host/sys:ro
      - /:/rootfs:ro
    command:
      - "--path.procfs=/host/proc"
      - "--path.sysfs=/host/sys"
      - "--collector.filesystem.mount-points-exclude=^/(sys|proc|dev|host|etc)($$|/)"
 
  cadvisor:
    image: gcr.io/cadvisor/cadvisor:v0.49.1
    container_name: cadvisor
    restart: unless-stopped
    networks: [monitoring]
    privileged: true
    devices:
      - /dev/kmsg
    volumes:
      - /:/rootfs:ro
      - /var/run:/var/run:ro
      - /sys:/sys:ro
      - /var/lib/docker:/var/lib/docker:ro
      - /cgroup:/cgroup:ro
 
  blackbox-exporter:
    image: prom/blackbox-exporter:v0.25.0
    container_name: blackbox-exporter
    restart: unless-stopped
    networks: [monitoring]
 
  alertmanager:
    image: prom/alertmanager:v0.27.0
    container_name: alertmanager
    restart: unless-stopped
    networks: [monitoring]
    ports:
      - "9093:9093"
    volumes:
      - ./alertmanager/alertmanager.yml:/etc/alertmanager/alertmanager.yml:ro
 
  grafana:
    image: grafana/grafana:11.1.0
    container_name: grafana
    restart: unless-stopped
    networks: [monitoring]
    ports:
      - "3000:3000"
    environment:
      - GF_SECURITY_ADMIN_USER=admin
      - GF_SECURITY_ADMIN_PASSWORD=changeme          # change this
      - GF_USERS_ALLOW_SIGN_UP=false
      - GF_SERVER_ROOT_URL=https://grafana.example.com
    volumes:
      - grafana_data:/var/lib/grafana
      - ./grafana/provisioning:/etc/grafana/provisioning:ro

Security note: In production, move GF_SECURITY_ADMIN_PASSWORD to a .env file excluded from version control, and put Grafana behind a reverse proxy with TLS (Traefik or Nginx).
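
One way to do that (GRAFANA_ADMIN_PASSWORD is a name chosen here, not a Grafana built-in): keep the secret in a git-ignored .env file beside docker-compose.yml; Compose interpolates it automatically.

# .env  (add this file to .gitignore)
GRAFANA_ADMIN_PASSWORD=use-a-long-random-string

Then in docker-compose.yml:

environment:
  - GF_SECURITY_ADMIN_PASSWORD=${GRAFANA_ADMIN_PASSWORD}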


Step 3 — Prometheus configuration

prometheus/prometheus.yml:

global:
  scrape_interval: 15s
  evaluation_interval: 15s
 
alerting:
  alertmanagers:
    - static_configs:
        - targets: ["alertmanager:9093"]
 
rule_files:
  - "/etc/prometheus/rules/*.yml"
 
scrape_configs:
  - job_name: prometheus
    static_configs:
      - targets: ["localhost:9090"]
 
  - job_name: node
    static_configs:
      - targets: ["node-exporter:9100"]
 
  - job_name: cadvisor
    static_configs:
      - targets: ["cadvisor:8080"]
 
  - job_name: blackbox_http
    metrics_path: /probe
    params:
      module: [http_2xx]
    static_configs:
      - targets:
          - https://labs.cosmicbytez.ca
          - https://grafana.example.com
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: blackbox-exporter:9115

The blackbox_http job is a multi-target scrape pattern: for each URL in targets, Prometheus relabels the address to point at the Blackbox Exporter and passes the real URL as a query parameter. The exporter performs an HTTP probe and returns a probe_success gauge (1 or 0) plus latency, status-code, and TLS metrics.
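
Concretely, after relabelling, the scrape for the first target becomes a request like:

GET http://blackbox-exporter:9115/probe?target=https://labs.cosmicbytez.ca&module=http_2xx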


Step 4 — Alert rules

prometheus/rules/alerts.yml:

groups:
  - name: host
    rules:
      - alert: HostDown
        expr: up == 0
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Exporter {{ $labels.job }} is down"
          description: "Target {{ $labels.instance }} has been unreachable for more than 2 minutes."
 
      - alert: HighCPU
        expr: 100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 90
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High CPU on {{ $labels.instance }}"
          description: "CPU usage is {{ $value | printf \"%.1f\" }}% (threshold 90%)."
 
      - alert: DiskFillingUp
        expr: |
          (node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"}) * 100 < 15
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Disk filling up on {{ $labels.instance }}"
          description: "Root filesystem has {{ $value | printf \"%.1f\" }}% free space remaining."
 
      - alert: MemoryPressure
        expr: |
          (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100 < 10
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Low memory on {{ $labels.instance }}"
          description: "Available memory is {{ $value | printf \"%.1f\" }}% of total."
 
      - alert: HTTPEndpointDown
        expr: probe_success == 0
        for: 3m
        labels:
          severity: critical
        annotations:
          summary: "HTTP probe failed for {{ $labels.instance }}"
          description: "{{ $labels.instance }} has been unreachable for more than 3 minutes."
 
      - alert: SSLCertExpiringSoon
        expr: probe_ssl_earliest_cert_expiry - time() < 86400 * 14
        for: 1h
        labels:
          severity: warning
        annotations:
          summary: "SSL cert expiring soon for {{ $labels.instance }}"
          description: "Certificate expires in {{ $value | humanizeDuration }}."

Step 5 — AlertManager routing

alertmanager/alertmanager.yml:

global:
  smtp_smarthost: "smtp.example.com:587"
  smtp_from: "alerts@example.com"
  smtp_auth_username: "alerts@example.com"
  smtp_auth_password: "your-smtp-password"   # use env var in production
 
route:
  receiver: default
  group_by: [alertname, instance]
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 3h
  routes:
    - match:
        severity: critical
      receiver: critical-alerts
      continue: true
 
receivers:
  - name: default
    email_configs:
      - to: "you@example.com"
        send_resolved: true
 
  - name: critical-alerts
    email_configs:
      - to: "oncall@example.com"
        send_resolved: true
    slack_configs:
      - api_url: "https://hooks.slack.com/services/YOUR/WEBHOOK/URL"
        channel: "#alerts"
        title: "{{ .CommonAnnotations.summary }}"
        text: "{{ range .Alerts }}{{ .Annotations.description }}\n{{ end }}"
        send_resolved: true
 
inhibit_rules:
  - source_match:
      severity: critical
    target_match:
      severity: warning
    equal: [alertname, instance]

The inhibit_rules block suppresses warning-level alerts for the same instance when a critical alert is already firing — this prevents alert storms where a downed host generates both a HostDown critical and multiple metric-based warnings simultaneously.
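
amtool, bundled in the AlertManager image, can validate this routing tree before the container ever starts:

docker run --rm --entrypoint amtool \
  -v "$(pwd)/alertmanager:/etc/alertmanager:ro" \
  prom/alertmanager:v0.27.0 \
  check-config /etc/alertmanager/alertmanager.yml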


Step 6 — Grafana provisioning

Grafana supports automatic provisioning of data sources and dashboards via YAML files, eliminating manual UI setup after restarts.

grafana/provisioning/datasources/prometheus.yml:

apiVersion: 1
 
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://prometheus:9090
    isDefault: true
    jsonData:
      timeInterval: "15s"
      httpMethod: POST

grafana/provisioning/dashboards/dashboards.yml:

apiVersion: 1
 
providers:
  - name: default
    type: file
    updateIntervalSeconds: 30
    options:
      path: /var/lib/grafana/dashboards
      foldersFromFilesStructure: true

To import community dashboards (JSON files published on grafana.com), download them and mount them under /var/lib/grafana/dashboards. The most useful ones for this stack:

Dashboard ID   Name
1860           Node Exporter Full
14282          Kubernetes / Container (cAdvisor)
7587           Prometheus Blackbox Exporter
9578           Node Exporter for Prometheus Dashboard

Download via the Grafana UI at Dashboards → Import → Enter ID, or pull the JSON directly:

mkdir -p grafana/dashboards
curl -o grafana/dashboards/node-exporter.json \
  "https://grafana.com/api/dashboards/1860/revisions/latest/download"

Then add a volume mount in docker-compose.yml:

grafana:
  volumes:
    - ./grafana/dashboards:/var/lib/grafana/dashboards:ro
    - ./grafana/provisioning:/etc/grafana/provisioning:ro

Step 7 — Start the stack

# Pull all images first (optional but faster on first run)
docker compose pull
 
# Start everything detached
docker compose up -d
 
# Watch startup logs
docker compose logs -f --tail=50

Check that all six containers are up:

docker compose ps

Expected output (all running):

NAME                STATUS
alertmanager        running
blackbox-exporter   running
cadvisor            running
grafana             running
node-exporter       running
prometheus          running

Testing

1. Verify Prometheus targets

Open http://<host>:9090/targets. All four scrape jobs (prometheus itself, node, cadvisor, blackbox_http) should show green UP status; the blackbox_http job lists one target per probed URL. A red DOWN means a network or port issue; check docker compose logs <container>.

2. Run a manual PromQL query

In the Prometheus UI at http://<host>:9090/graph, run:

100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)

You should see a time-series value representing current CPU utilisation percentage. A result of 0.8 means the host is 0.8% busy — healthy for an idle node.
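
Reading that query inside-out:

rate(node_cpu_seconds_total{mode="idle"}[5m])   # per-core fraction of time spent idle over 5m
avg by(instance) (...)                          # collapse the per-core series into one value per host
100 - (... * 100)                               # turn idle fraction into a busy percentage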

3. Verify Grafana login

Navigate to http://<host>:3000, log in with the admin credentials from docker-compose.yml, and confirm the Prometheus data source shows a green checkmark under Connections → Data sources → Prometheus → Test.

4. Import and verify the Node Exporter Full dashboard

Go to Dashboards → Import, enter ID 1860, and select the Prometheus data source. You should immediately see CPU, memory, disk I/O, and network panels populated with live data.

5. Trigger a test alert

Temporarily lower the CPU alert threshold to 0 to force it to fire:

expr: 100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 0

Then reload the Prometheus config; this endpoint works because we started Prometheus with --web.enable-lifecycle:

curl -X POST http://localhost:9090/-/reload

Check http://<host>:9090/alerts — the HighCPU alert should appear in PENDING state (it fires after the for: 5m hold period). Revert the threshold and reload again.
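
While the alert is firing, you can also confirm it reached AlertManager by querying from inside that container:

docker compose exec alertmanager amtool --alertmanager.url=http://localhost:9093 alert query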

6. Verify Blackbox probes

curl "http://localhost:9115/probe?target=https://labs.cosmicbytez.ca&module=http_2xx"

Look for probe_success 1 in the output. A 0 means the target is unreachable or returned a non-2xx status.


Hardening for production

Prometheus and AlertManager should not be publicly accessible. Two options:

  1. Bind only to localhost: In docker-compose.yml, change port bindings to 127.0.0.1:9090:9090 and 127.0.0.1:9093:9093. Access via SSH tunnel or a VPN.

  2. Put behind Traefik with basic auth: If you followed the Traefik reverse proxy project, add Traefik labels to the Prometheus service and restrict with a basic-auth middleware:

prometheus:
  labels:
    - "traefik.enable=true"
    - "traefik.http.routers.prometheus.rule=Host(`prometheus.example.com`)"
    - "traefik.http.routers.prometheus.tls.certresolver=letsencrypt"
    - "traefik.http.routers.prometheus.middlewares=prometheus-auth"
    - "traefik.http.middlewares.prometheus-auth.basicauth.users=admin:$$apr1$$..."

Generate the hash with htpasswd -nb admin yourpassword. Note this emits the $apr1$ (MD5) format shown above; add -B for a bcrypt hash. The doubled $$ in the label escapes the literal $ for Compose.

Grafana hardening checklist:

  • Change the default admin password immediately
  • Set GF_AUTH_ANONYMOUS_ENABLED=false
  • Enable GF_SECURITY_COOKIE_SECURE=true when behind TLS
  • Rotate API keys and revoke unused service accounts regularly
  • Review Grafana's RBAC — viewers should not have Edit permission on dashboards
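
Most of these map straight onto environment variables in the grafana service block; a sketch of the relevant lines:

environment:
  - GF_USERS_ALLOW_SIGN_UP=false
  - GF_AUTH_ANONYMOUS_ENABLED=false
  - GF_SECURITY_COOKIE_SECURE=true    # only once Grafana is actually served over TLS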

Retention and storage sizing

Prometheus compresses time-series data aggressively. As a rough guide:

Hosts monitored         Scrape interval   30-day storage
1 host, 3 exporters     15 s              ~500 MB
5 hosts, 3 exporters    15 s              ~2.5 GB
20 hosts                15 s              ~10 GB

The --storage.tsdb.retention.time=30d flag in the Compose file controls how far back Prometheus stores data. Adjust to 15d for smaller disks or 90d if you have the space. For long-term retention (years), add Thanos or VictoriaMetrics as a remote write target.
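
To see what retention actually costs for your workload, check the live TSDB size after the stack has run for a few days:

docker compose exec prometheus du -sh /prometheus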


Extensions and next steps

Add more exporters:

  • mysqld_exporter / postgres_exporter — database metrics
  • redis_exporter — cache hit rates and memory
  • snmp_exporter — network switch and router metrics via SNMP
  • smartctl_exporter — hard disk S.M.A.R.T. data

Upgrade alerting:

  • Integrate PagerDuty or OpsGenie for on-call escalation
  • Add silence/inhibit rules in AlertManager to suppress planned-maintenance noise
  • Use Grafana OnCall (open source) for a full on-call schedule UI

Long-term storage with VictoriaMetrics: Keep Prometheus for scraping and alerting, but mirror samples to VictoriaMetrics via remote write for much better compression and multi-year retention without a Thanos sidecar:

victoriametrics:
  image: victoriametrics/victoria-metrics:v1.101.0
  container_name: victoriametrics
  restart: unless-stopped
  networks: [monitoring]
  command:
    - "--retentionPeriod=12"    # months
  volumes:
    - vm_data:/storage          # declare vm_data under the top-level volumes: key

Then add to prometheus.yml:

remote_write:
  - url: http://victoriametrics:8428/api/v1/write

Automate dashboard-as-code with Grafonnet: Manage dashboards in Jsonnet source files instead of JSON blobs, enabling diff-friendly version control and templated multi-environment dashboards.

Pair with a log stack: Prometheus handles metrics; logs need a separate pipeline. Add Loki + Promtail alongside this stack for correlated metrics-and-logs investigations directly inside Grafana — both data sources render in the same Explore view.
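
A minimal starting point for the Compose file, as a sketch rather than a tuned setup: the grafana/loki image ships with a usable default configuration, while Promtail needs one written (assumed here at ./promtail/promtail.yml).

loki:
  image: grafana/loki:3.0.0
  container_name: loki
  networks: [monitoring]

promtail:
  image: grafana/promtail:3.0.0
  container_name: promtail
  networks: [monitoring]
  volumes:
    - /var/log:/var/log:ro
    - ./promtail/promtail.yml:/etc/promtail/config.yml:ro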

#Prometheus #Grafana #Monitoring #Observability #AlertManager #Docker #Homelab #DevOps #Infrastructure
