Prometheus High Memory Usage — Fix

The Problem

Updated March 2026: Verified with latest Docker images and configurations.

Prometheus uses an increasing amount of RAM over time, eventually consuming several gigabytes and causing OOM kills or system instability. Common symptoms:

  • Container restarting due to OOM (Out of Memory) kills
  • Server swapping heavily with Prometheus as the top consumer
  • docker stats showing Prometheus using 2-8+ GB RAM
  • Error in logs: storage: no space left on device or out of memory

The Cause

Prometheus stores recent data in memory before writing it to disk. Four factors drive memory usage:

  1. High cardinality — too many unique time series (label combinations)
  2. Long retention — keeping data for months with default settings
  3. Large scrape targets — endpoints returning thousands of metrics per scrape
  4. Head block size — the in-memory block grows proportionally with active series count
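
The head-block point above can be turned into a back-of-the-envelope estimate. The ~3 KB-per-active-series figure below is an assumed ballpark for head memory, not a Prometheus guarantee:

```shell
# Rough head-memory estimate from the active series count.
# ~3 KB per series is an assumed rule of thumb, not an exact figure.
SERIES=200000
echo "$SERIES" | awk '{ printf "~%.1f GB head memory for %d series\n", $1 * 3 / 1024 / 1024, $1 }'
```

Plug in your own prometheus_tsdb_head_series value to see whether your usage is in the expected range.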

The Fix

Method 1: Reduce Retention Period

Prometheus defaults to 15 days of retention. For homelabs, this is often more than needed:

services:
  prometheus:
    image: prom/prometheus:v3.10.0
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.path=/prometheus'
      - '--storage.tsdb.retention.time=7d'
      - '--storage.tsdb.retention.size=5GB'
    restart: unless-stopped
Flag                                 Effect
--storage.tsdb.retention.time=7d     Delete data older than 7 days
--storage.tsdb.retention.size=5GB    Delete oldest data when storage exceeds 5 GB

Both can be set simultaneously — whichever triggers first wins. Restart Prometheus after changing:

docker compose restart prometheus
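
After restarting, you can confirm the new flags took effect through Prometheus's runtime flags endpoint (this assumes Prometheus is reachable at localhost:9090):

```shell
# Show the retention flags Prometheus is actually running with
curl -s http://localhost:9090/api/v1/status/flags \
  | grep -oE '"storage.tsdb.retention.(time|size)":"[^"]*"'
```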

Method 2: Reduce Cardinality

High cardinality is the most common cause. Check your cardinality:

# In Prometheus UI — count total active time series
prometheus_tsdb_head_series

If this number is over 100,000, you likely have a cardinality problem.

Find the culprits:

# Top 10 metrics by cardinality
topk(10, count by (__name__)({__name__=~".+"}))
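
The same top-offenders view is exposed by the TSDB stats endpoint, which is cheaper than querying every series (assumes Prometheus at localhost:9090 and jq installed):

```shell
# Top metrics by series count, from the TSDB head statistics API
curl -s http://localhost:9090/api/v1/status/tsdb | jq '.data.seriesCountByMetricName'
```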

Common high-cardinality offenders:

Metric                                            Typical Cardinality    Fix
container_* (cAdvisor)                            500+ per container     Drop unused metrics
node_cpu_seconds_total                            Per-core × per-mode    Normal, but limit cores scraped
Custom app metrics with high-cardinality labels   Varies                 Remove instance_id, request_id labels

Drop unused metrics with metric_relabel_configs:

# prometheus.yml
scrape_configs:
  - job_name: 'cadvisor'
    static_configs:
      - targets: ['cadvisor:8080']
    metric_relabel_configs:
      # Drop metrics you don't need
      - source_labels: [__name__]
        regex: 'container_tasks_state|container_memory_failures_total|container_blkio.*'
        action: drop

Method 3: Limit Scrape Interval

More frequent scraping = more data in memory:

# prometheus.yml
global:
  scrape_interval: 30s      # Common starter configs use 15s; doubling to 30s halves the data rate
  evaluation_interval: 30s

scrape_configs:
  - job_name: 'node'
    scrape_interval: 60s     # Less critical targets can scrape less often
    static_configs:
      - targets: ['node-exporter:9100']

For homelab monitoring, 30-60 second intervals are perfectly adequate. You don’t need 15-second granularity for tracking server health.
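
The effect is simple arithmetic: ingestion rate is roughly active series divided by scrape interval. A quick sketch, assuming 50,000 active series (an example figure):

```shell
# Approximate ingestion rate at different scrape intervals
# (50,000 active series is an assumed example figure)
for interval in 15 30 60; do
  awk -v s=50000 -v i="$interval" 'BEGIN { printf "%ss interval: ~%d samples/s\n", i, int(s / i) }'
done
```

Going from 15s to 30s roughly halves both the ingestion rate and head-block growth.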

Method 4: Set Memory Limits

Prevent Prometheus from consuming all available RAM:

services:
  prometheus:
    image: prom/prometheus:v3.10.0
    deploy:
      resources:
        limits:
          memory: 2G
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.path=/prometheus'
      - '--storage.tsdb.retention.time=7d'

If Prometheus exceeds the limit, the container is OOM-killed and brought back by its restart policy instead of consuming all system memory. This protects other services.
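
To see how close the container is running to its limit (assuming the container is named prometheus):

```shell
# Current memory usage vs. the configured limit
docker stats --no-stream --format '{{.Name}}: {{.MemUsage}}' prometheus
```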

Method 5: Compact the WAL

If disk usage is high from a large Write-Ahead Log:

# Check WAL size
docker exec prometheus du -sh /prometheus/wal

# Restart Prometheus so the WAL is replayed and checkpointed
docker compose stop prometheus
docker compose start prometheus

On startup Prometheus replays the WAL into memory and truncates it at the next head compaction, so a clean restart often reclaims significant disk space shortly afterwards.

Prevention

  • Set retention.time and retention.size explicitly — don’t rely on defaults
  • Monitor Prometheus’s own metrics (prometheus_tsdb_head_series, process_resident_memory_bytes)
  • Use metric_relabel_configs to drop unused metrics from noisy exporters
  • Increase scrape intervals for non-critical targets
  • Set Docker memory limits to prevent system-wide impact
  • Consider Thanos or VictoriaMetrics for long-term storage instead of increasing Prometheus retention
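
The self-monitoring bullet can be codified as alerting rules. A sketch, assuming a rule file wired in via rule_files; both thresholds are illustrative, not recommendations:

```yaml
# prometheus-self.yml (illustrative thresholds; tune for your setup)
groups:
  - name: prometheus-self
    rules:
      - alert: PrometheusHighSeriesCount
        expr: prometheus_tsdb_head_series > 500000
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "Prometheus head has {{ $value }} active series"
      - alert: PrometheusHighMemory
        expr: process_resident_memory_bytes{job="prometheus"} > 1.5e9
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "Prometheus resident memory above 1.5 GB"
```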
