Self-Hosting ArchiveBox with Docker Compose

What Is ArchiveBox?

ArchiveBox is a self-hosted web archiver that saves snapshots of web pages in multiple formats — HTML, PDF, screenshot, WARC, and more. Feed it URLs from bookmarks, RSS feeds, or browser history, and it builds a searchable, offline-accessible archive. Think of it as your own personal Wayback Machine. It replaces reliance on archive.org, Pocket’s saved pages, and browser bookmark rot.

Official site: archivebox.io | GitHub: github.com/ArchiveBox/ArchiveBox

Docker Compose Configuration

Create a directory for ArchiveBox:

mkdir archivebox && cd archivebox
mkdir data

Create a docker-compose.yml file:

services:
  archivebox:
    image: archivebox/archivebox:0.8.5rc52
    container_name: archivebox
    command: server --quick-init 0.0.0.0:8000
    restart: unless-stopped
    ports:
      - "8000:8000"
    environment:
      - ALLOWED_HOSTS=*
      - PUBLIC_INDEX=true
      - PUBLIC_SNAPSHOTS=true
      - PUBLIC_ADD_VIEW=false
      - SEARCH_BACKEND_ENGINE=ripgrep
      - MEDIA_MAX_SIZE=750m
      - TIMEOUT=60
      - CHECK_SSL_VALIDITY=true
      - SAVE_ARCHIVE_DOT_ORG=true
    volumes:
      - ./data:/data
    networks:
      - archivebox-net

  # Optional: Sonic full-text search (faster than ripgrep for large archives)
  # sonic:
  #   image: valeriansaliou/sonic:v1.4.9
  #   container_name: archivebox-sonic
  #   restart: unless-stopped
  #   environment:
  #     - SEARCH_BACKEND_PASSWORD=changeme_sonic
  #   volumes:
  #     - sonic-data:/var/lib/sonic/store
  #   networks:
  #     - archivebox-net

volumes:
  sonic-data:

networks:
  archivebox-net:

Initialize the archive and create an admin account:

docker compose run --rm archivebox init --setup
docker compose run --rm archivebox manage createsuperuser

Start the server:

docker compose up -d

Prerequisites

  • A Linux server (Ubuntu 22.04+ recommended)
  • Docker and Docker Compose installed (see the Docker install guide)
  • 5 GB of free disk space (grows with your archive)
  • 1 GB of RAM minimum, 2 GB recommended
  • A domain name (optional, for remote access)

Initial Setup

Access ArchiveBox at http://your-server-ip:8000. Log in with the superuser credentials you created during initialization.

To add URLs to your archive:

Via the web UI: Click “Add” in the top bar and paste URLs (one per line).

Via CLI:

# Add a single URL
docker compose run --rm archivebox add "https://example.com/article"

# Add from a bookmarks file
docker compose run --rm archivebox add < bookmarks.html

# Add from an RSS feed
docker compose run --rm archivebox add "https://example.com/feed.xml"
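ArchiveBox can parse bookmarks.html exports directly, as shown above, but if you want to filter or inspect the links first, a small script can pull the URLs out before piping them to `archivebox add`. A minimal sketch in Python using only the standard library (the sample input is illustrative):

```python
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    """Collect absolute http(s) hrefs from <a> tags in a bookmarks export."""
    def __init__(self):
        super().__init__()
        self.urls = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names, so HREF matches here too
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value and value.startswith(("http://", "https://")):
                    self.urls.append(value)

def extract_urls(html_text: str) -> list[str]:
    parser = LinkExtractor()
    parser.feed(html_text)
    return parser.urls

if __name__ == "__main__":
    sample = '<DT><A HREF="https://example.com/article" ADD_DATE="0">Example</A>'
    print("\n".join(extract_urls(sample)))
```

Pipe the output straight into the container, e.g. `python extract.py | docker compose run --rm -T archivebox add`.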

What Gets Archived

For each URL, ArchiveBox saves multiple output formats:

| Format | Tool | Description |
|---|---|---|
| HTML | wget | Full static HTML snapshot with assets |
| PDF | Chrome/Chromium | Rendered PDF of the page |
| Screenshot | Chrome/Chromium | Full-page PNG screenshot |
| WARC | wget | Web Archive format (industry standard) |
| Readability | Mozilla Readability | Clean article text extraction |
| SingleFile | SingleFile | Complete page in one HTML file |
| Git | git | Clone entire git repos |
| Media | yt-dlp | Download videos, audio, playlists |
| Headers | curl | HTTP response headers |

Configuration

Key environment variables:

| Variable | Default | Description |
|---|---|---|
| ALLOWED_HOSTS | * | Restrict access by domain (comma-separated) |
| PUBLIC_INDEX | true | Make the archive index publicly accessible |
| PUBLIC_SNAPSHOTS | true | Make individual snapshots publicly accessible |
| PUBLIC_ADD_VIEW | false | Allow unauthenticated users to add URLs |
| SEARCH_BACKEND_ENGINE | ripgrep | Search backend: ripgrep, sonic, or sqlite |
| MEDIA_MAX_SIZE | 750m | Maximum file size for media downloads |
| TIMEOUT | 60 | Download timeout in seconds per extractor |
| CHECK_SSL_VALIDITY | true | Verify SSL certificates; set to false to archive pages with invalid certs |
| SAVE_ARCHIVE_DOT_ORG | true | Also submit each URL to the Wayback Machine as a backup |

Scheduled Archiving

Add a scheduler service to automatically re-archive URLs on a schedule:

  scheduler:
    image: archivebox/archivebox:0.8.5rc52
    container_name: archivebox-scheduler
    command: schedule --foreground --every=day --depth=0
    restart: unless-stopped
    environment:
      - TIMEOUT=120
    volumes:
      - ./data:/data
    networks:
      - archivebox-net
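The scheduler above re-runs any imports already registered with `archivebox schedule`. To pull a specific source, such as an RSS feed, on a schedule, the `schedule` command also accepts a URL and crawl depth; a sketch, with a placeholder feed URL:

```yaml
  feed-scheduler:
    image: archivebox/archivebox:0.8.5rc52
    container_name: archivebox-feed-scheduler
    # depth=1 archives the feed plus every page it links to
    command: schedule --foreground --every=day --depth=1 "https://example.com/feed.xml"
    restart: unless-stopped
    volumes:
      - ./data:/data
    networks:
      - archivebox-net
```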

Full-Text Search with Sonic

For large archives (10,000+ pages), switch from ripgrep to Sonic for faster full-text search. Uncomment the sonic service in the Compose file and add these to the archivebox service's environment:

SEARCH_BACKEND_ENGINE=sonic
SEARCH_BACKEND_HOST_NAME=sonic
SEARCH_BACKEND_PASSWORD=changeme_sonic

Reverse Proxy

ArchiveBox serves on port 8000. For HTTPS with a reverse proxy, see Reverse Proxy Setup.

Set ALLOWED_HOSTS to your domain name when using a reverse proxy:

ALLOWED_HOSTS=archive.example.com
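As one example, a minimal Caddyfile for this setup could look like the following, assuming Caddy runs on the same host and terminates TLS for archive.example.com (Caddy provisions the certificate automatically):

```Caddyfile
archive.example.com {
    # forward everything to the ArchiveBox container published on port 8000
    reverse_proxy 127.0.0.1:8000
}
```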

Backup

The entire archive lives in the ./data directory. Back up this directory to preserve:

  • ./data/archive/ — all archived page snapshots
  • ./data/index.sqlite3 — the database of all URLs and metadata
  • ./data/ArchiveBox.conf — your configuration

tar czf archivebox-backup-$(date +%Y%m%d).tar.gz ./data
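Before relying on a backup, it's worth checking that the tarball actually contains the critical metadata files. A small Python sketch (the required-file list mirrors the bullets above; the `__main__` demo builds a throwaway tarball just to exercise the check):

```python
import tarfile

# The two files without which a backup cannot be restored meaningfully
REQUIRED = ("index.sqlite3", "ArchiveBox.conf")

def backup_contains(tar_path: str, required=REQUIRED) -> bool:
    """Return True if the tarball contains every required metadata file."""
    with tarfile.open(tar_path, "r:gz") as tf:
        names = tf.getnames()
        return all(any(n.endswith(fname) for n in names) for fname in required)

if __name__ == "__main__":
    import os, tempfile
    with tempfile.TemporaryDirectory() as tmp:
        # Build a dummy ./data layout and tar it, mimicking the backup command
        data = os.path.join(tmp, "data")
        os.makedirs(data)
        for fname in REQUIRED:
            open(os.path.join(data, fname), "w").close()
        backup = os.path.join(tmp, "backup.tar.gz")
        with tarfile.open(backup, "w:gz") as tf:
            tf.add(data, arcname="data")
        print(backup_contains(backup))  # → True
```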

For a comprehensive backup strategy, see Backup Strategy.

Troubleshooting

Chrome/Chromium Fails to Start

Symptom: PDF and screenshot extraction fails with “Chrome not found” or “Failed to launch Chrome.”

Fix: The Docker image ships with Chromium. If you’re running outside Docker, install Chromium:

apt install chromium-browser

“Permission Denied” on Data Directory

Symptom: ArchiveBox can’t write to /data inside the container.

Fix: Set ownership on the host data directory:

sudo chown -R 911:911 ./data

Large Archives Slow Down

Symptom: Search and browsing become slow once the archive reaches roughly 5,000 snapshots.

Fix: Switch from ripgrep to the Sonic search backend. Add the Sonic service and update SEARCH_BACKEND_ENGINE=sonic.

yt-dlp Errors on Media Downloads

Symptom: Video downloads fail with “Unable to extract” or similar errors.

Fix: yt-dlp needs frequent updates as sites change. Update the container image or run:

docker compose exec archivebox pip install --upgrade yt-dlp

Resource Requirements

  • RAM: ~300 MB idle, spikes to 1-2 GB during active archiving (Chrome rendering)
  • CPU: Medium — Chrome PDF/screenshot generation is CPU-intensive
  • Disk: ~1 MB per page average (varies widely — media-heavy pages use much more)
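For capacity planning, the ~1 MB/page average gives a back-of-the-envelope sizing formula, sketched in Python (your real numbers will vary widely, especially with media downloads enabled):

```python
def estimated_archive_size_gb(pages_per_month: int,
                              months: int,
                              avg_mb_per_page: float = 1.0) -> float:
    """Rough disk estimate: pages x average snapshot size, in GB."""
    return pages_per_month * months * avg_mb_per_page / 1024

if __name__ == "__main__":
    # e.g. 500 pages/month for a year at the ~1 MB/page average:
    print(round(estimated_archive_size_gb(500, 12), 1))  # → 5.9 (GB)
```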

Verdict

ArchiveBox is the most comprehensive self-hosted web archiver available. The multi-format approach (HTML + PDF + screenshot + WARC) means you have redundant copies of everything. It’s ideal for researchers, journalists, or anyone who’s lost a crucial bookmark to link rot. The trade-off is resource usage — Chrome-based archiving is heavy. For lighter use cases where you just want to save article text, Wallabag or Hoarder are simpler options.
