Self-Hosting ArchiveBox with Docker Compose

What Is ArchiveBox?

ArchiveBox is a self-hosted web archiver that saves snapshots of web pages in multiple formats — HTML, PDF, screenshot, WARC, and more. Feed it URLs from bookmarks, RSS feeds, or browser history, and it builds a searchable, offline-accessible archive. Think of it as your own personal Wayback Machine. It replaces reliance on archive.org, Pocket’s saved pages, and browser bookmark rot.

Official site: archivebox.io | GitHub: github.com/ArchiveBox/ArchiveBox

Docker Compose Configuration

Create a directory for ArchiveBox:

mkdir archivebox && cd archivebox
mkdir data

Create a docker-compose.yml file:

services:
  archivebox:
    image: archivebox/archivebox:0.8.5rc52
    container_name: archivebox
    command: server --quick-init 0.0.0.0:8000
    restart: unless-stopped
    ports:
      - "8000:8000"
    environment:
      - ALLOWED_HOSTS=*
      - PUBLIC_INDEX=true
      - PUBLIC_SNAPSHOTS=true
      - PUBLIC_ADD_VIEW=false
      - SEARCH_BACKEND_ENGINE=ripgrep
      - MEDIA_MAX_SIZE=750m
      - TIMEOUT=60
      - CHECK_SSL_VALIDITY=true
      - SAVE_ARCHIVE_DOT_ORG=true
    volumes:
      - ./data:/data
    networks:
      - archivebox-net

  # Optional: Sonic full-text search (faster than ripgrep for large archives)
  # sonic:
  #   image: valeriansaliou/sonic:v1.4.9
  #   container_name: archivebox-sonic
  #   restart: unless-stopped
  #   environment:
  #     - SEARCH_BACKEND_PASSWORD=changeme_sonic
  #   volumes:
  #     - sonic-data:/var/lib/sonic/store
  #   networks:
  #     - archivebox-net

volumes:
  sonic-data:

networks:
  archivebox-net:

Initialize the archive and create an admin account:

docker compose run --rm archivebox init --setup
docker compose run --rm archivebox manage createsuperuser

Start the server:

docker compose up -d

Prerequisites

  • A Linux server (Ubuntu 22.04+ recommended)
  • Docker and Docker Compose installed (see the Docker install guide)
  • 5 GB of free disk space (grows with your archive)
  • 1 GB of RAM minimum, 2 GB recommended
  • A domain name (optional, for remote access)

Initial Setup

Access ArchiveBox at http://your-server-ip:8000. Log in with the superuser credentials you created during initialization.

To add URLs to your archive:

Via the web UI: Click “Add” in the top bar and paste URLs (one per line).

Via CLI:

# Add a single URL
docker compose run --rm archivebox add "https://example.com/article"

# Add from a bookmarks file
docker compose run --rm archivebox add < bookmarks.html

# Add from an RSS feed
docker compose run --rm archivebox add "https://example.com/feed.xml"
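ArchiveBox can parse bookmarks.html exports directly, as shown above, but if you want to filter or inspect the links first, a small script can pull the URLs out before piping them to `archivebox add`. A minimal sketch in Python using only the standard library (the sample input is illustrative):

```python
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    """Collect absolute http(s) hrefs from <a> tags in a bookmarks export."""
    def __init__(self):
        super().__init__()
        self.urls = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names, so HREF matches here too
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value and value.startswith(("http://", "https://")):
                    self.urls.append(value)

def extract_urls(html_text: str) -> list[str]:
    parser = LinkExtractor()
    parser.feed(html_text)
    return parser.urls

if __name__ == "__main__":
    sample = '<DT><A HREF="https://example.com/article" ADD_DATE="0">Example</A>'
    print("\n".join(extract_urls(sample)))
```

Pipe the output straight into the container, e.g. `python extract.py | docker compose run --rm -T archivebox add`.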

What Gets Archived

For each URL, ArchiveBox saves multiple output formats:

| Format | Tool | Description |
|---|---|---|
| HTML | wget | Full static HTML snapshot with assets |
| PDF | Chrome/Chromium | Rendered PDF of the page |
| Screenshot | Chrome/Chromium | Full-page PNG screenshot |
| WARC | wget | Web Archive format (industry standard) |
| Readability | Mozilla Readability | Clean article text extraction |
| SingleFile | SingleFile | Complete page in one HTML file |
| Git | git | Clone entire git repos |
| Media | yt-dlp | Download videos, audio, playlists |
| Headers | curl | HTTP response headers |

Configuration

Key environment variables:

| Variable | Default | Description |
|---|---|---|
| ALLOWED_HOSTS | * | Restrict access by domain (comma-separated) |
| PUBLIC_INDEX | true | Make the archive index publicly accessible |
| PUBLIC_SNAPSHOTS | true | Make individual snapshots publicly accessible |
| PUBLIC_ADD_VIEW | false | Allow unauthenticated users to add URLs |
| SEARCH_BACKEND_ENGINE | ripgrep | Search backend: ripgrep, sonic, or sqlite |
| MEDIA_MAX_SIZE | 750m | Maximum file size for media downloads |
| TIMEOUT | 60 | Download timeout in seconds per extractor |
| CHECK_SSL_VALIDITY | true | Verify SSL certificates; set to false to archive pages with invalid certs |
| SAVE_ARCHIVE_DOT_ORG | true | Also submit each URL to the Wayback Machine as a backup |

Scheduled Archiving

Add a scheduler service to automatically re-archive URLs on a schedule:

  scheduler:
    image: archivebox/archivebox:0.8.5rc52
    container_name: archivebox-scheduler
    command: schedule --foreground --every=day --depth=0
    restart: unless-stopped
    environment:
      - TIMEOUT=120
    volumes:
      - ./data:/data
    networks:
      - archivebox-net
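The scheduler above re-runs any imports already registered with `archivebox schedule`. To pull a specific source, such as an RSS feed, on a schedule, the `schedule` command also accepts a URL and crawl depth; a sketch, with a placeholder feed URL:

```yaml
  feed-scheduler:
    image: archivebox/archivebox:0.8.5rc52
    container_name: archivebox-feed-scheduler
    # depth=1 archives the feed plus every page it links to
    command: schedule --foreground --every=day --depth=1 "https://example.com/feed.xml"
    restart: unless-stopped
    volumes:
      - ./data:/data
    networks:
      - archivebox-net
```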

Full-Text Search with Sonic

For large archives (10,000+ pages), switch from ripgrep to Sonic for faster full-text search. Uncomment the sonic service in the Compose file and add these to the archivebox service's environment:

SEARCH_BACKEND_ENGINE=sonic
SEARCH_BACKEND_HOST_NAME=sonic
SEARCH_BACKEND_PASSWORD=changeme_sonic

Reverse Proxy

ArchiveBox serves on port 8000. For HTTPS with a reverse proxy, see Reverse Proxy Setup.

Set ALLOWED_HOSTS to your domain name when using a reverse proxy:

ALLOWED_HOSTS=archive.example.com
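As one example, a minimal Caddyfile for this setup could look like the following, assuming Caddy runs on the same host and terminates TLS for archive.example.com (Caddy provisions the certificate automatically):

```Caddyfile
archive.example.com {
    # forward everything to the ArchiveBox container published on port 8000
    reverse_proxy 127.0.0.1:8000
}
```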

Backup

The entire archive lives in the ./data directory. Back up this directory to preserve:

  • ./data/archive/ — all archived page snapshots
  • ./data/index.sqlite3 — the database of all URLs and metadata
  • ./data/ArchiveBox.conf — your configuration

tar czf archivebox-backup-$(date +%Y%m%d).tar.gz ./data
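Before relying on a backup, it's worth checking that the tarball actually contains the critical metadata files. A small Python sketch (the required-file list mirrors the bullets above; the `__main__` demo builds a throwaway tarball just to exercise the check):

```python
import tarfile

# The two files without which a backup cannot be restored meaningfully
REQUIRED = ("index.sqlite3", "ArchiveBox.conf")

def backup_contains(tar_path: str, required=REQUIRED) -> bool:
    """Return True if the tarball contains every required metadata file."""
    with tarfile.open(tar_path, "r:gz") as tf:
        names = tf.getnames()
        return all(any(n.endswith(fname) for n in names) for fname in required)

if __name__ == "__main__":
    import os, tempfile
    with tempfile.TemporaryDirectory() as tmp:
        # Build a dummy ./data layout and tar it, mimicking the backup command
        data = os.path.join(tmp, "data")
        os.makedirs(data)
        for fname in REQUIRED:
            open(os.path.join(data, fname), "w").close()
        backup = os.path.join(tmp, "backup.tar.gz")
        with tarfile.open(backup, "w:gz") as tf:
            tf.add(data, arcname="data")
        print(backup_contains(backup))  # → True
```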

For a comprehensive backup strategy, see Backup Strategy.

Troubleshooting

Chrome/Chromium Fails to Start

Symptom: PDF and screenshot extraction fails with “Chrome not found” or “Failed to launch Chrome.”

Fix: The Docker image ships with Chromium. If you’re running outside Docker, install Chromium:

apt install chromium-browser

“Permission Denied” on Data Directory

Symptom: ArchiveBox can’t write to /data inside the container.

Fix: Set ownership on the host data directory:

sudo chown -R 911:911 ./data

Large Archives Slow Down

Symptom: Search and browsing become slow once the archive reaches roughly 5,000 snapshots.

Fix: Switch from ripgrep to the Sonic search backend. Add the Sonic service and update SEARCH_BACKEND_ENGINE=sonic.

yt-dlp Errors on Media Downloads

Symptom: Video downloads fail with “Unable to extract” or similar errors.

Fix: yt-dlp needs frequent updates as sites change. Update the container image or run:

docker compose exec archivebox pip install --upgrade yt-dlp

Resource Requirements

  • RAM: ~300 MB idle, spikes to 1-2 GB during active archiving (Chrome rendering)
  • CPU: Medium — Chrome PDF/screenshot generation is CPU-intensive
  • Disk: ~1 MB per page average (varies widely — media-heavy pages use much more)
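For capacity planning, the ~1 MB/page average gives a back-of-the-envelope sizing formula, sketched in Python (your real numbers will vary widely, especially with media downloads enabled):

```python
def estimated_archive_size_gb(pages_per_month: int,
                              months: int,
                              avg_mb_per_page: float = 1.0) -> float:
    """Rough disk estimate: pages x average snapshot size, in GB."""
    return pages_per_month * months * avg_mb_per_page / 1024

if __name__ == "__main__":
    # e.g. 500 pages/month for a year at the ~1 MB/page average:
    print(round(estimated_archive_size_gb(500, 12), 1))  # → 5.9 (GB)
```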

Verdict

ArchiveBox is the most comprehensive self-hosted web archiver available. The multi-format approach (HTML + PDF + screenshot + WARC) means you have redundant copies of everything. It’s ideal for researchers, journalists, or anyone who’s lost a crucial bookmark to link rot. The trade-off is resource usage — Chrome-based archiving is heavy. For lighter use cases where you just want to save article text, Wallabag or Hoarder are simpler options.
