Self-Hosting ArchiveBox with Docker Compose
What Is ArchiveBox?
ArchiveBox is a self-hosted web archiver that saves snapshots of web pages in multiple formats: HTML, PDF, screenshot, WARC, and more. Feed it URLs from bookmarks, RSS feeds, or browser history, and it builds a searchable, offline-accessible archive. Think of it as your own personal Wayback Machine: it reduces your reliance on archive.org and Pocket's saved pages, and it guards against browser bookmark rot.
Official site: archivebox.io | GitHub
Docker Compose Configuration
Create a directory for ArchiveBox:
```bash
mkdir archivebox && cd archivebox
mkdir data
```
Create a docker-compose.yml file:
```yaml
services:
  archivebox:
    image: archivebox/archivebox:0.8.5rc52
    container_name: archivebox
    command: server --quick-init 0.0.0.0:8000
    restart: unless-stopped
    ports:
      - "8000:8000"
    environment:
      - ALLOWED_HOSTS=*
      - PUBLIC_INDEX=true
      - PUBLIC_SNAPSHOTS=true
      - PUBLIC_ADD_VIEW=false
      - SEARCH_BACKEND_ENGINE=ripgrep
      - MEDIA_MAX_SIZE=750m
      - TIMEOUT=60
      - CHECK_SSL_VALIDITY=true
      - SAVE_ARCHIVE_DOT_ORG=true
    volumes:
      - ./data:/data
    networks:
      - archivebox-net

  # Optional: Sonic full-text search (faster than ripgrep for large archives)
  # sonic:
  #   image: valeriansaliou/sonic:v1.4.9
  #   container_name: archivebox-sonic
  #   restart: unless-stopped
  #   environment:
  #     - SEARCH_BACKEND_PASSWORD=changeme_sonic
  #   volumes:
  #     - sonic-data:/var/lib/sonic/store
  #   networks:
  #     - archivebox-net

volumes:
  sonic-data:

networks:
  archivebox-net:
```
Initialize the archive and create an admin account:
```bash
docker compose run --rm archivebox init --setup
docker compose run --rm archivebox manage createsuperuser
```
Start the server:
```bash
docker compose up -d
```
Prerequisites
- A Linux server (Ubuntu 22.04+ recommended)
- Docker and Docker Compose installed (guide)
- 5 GB of free disk space (grows with your archive)
- 1 GB of RAM minimum, 2 GB recommended
- A domain name (optional, for remote access)
Initial Setup
Access ArchiveBox at http://your-server-ip:8000. Log in with the superuser credentials you created during initialization.
To add URLs to your archive:
Via the web UI: Click “Add” in the top bar and paste URLs (one per line).
Via CLI:
```bash
# Add a single URL
docker compose run --rm archivebox add "https://example.com/article"

# Add from a bookmarks file
docker compose run --rm archivebox add < bookmarks.html

# Add from an RSS feed
docker compose run --rm archivebox add "https://example.com/feed.xml"
```
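ArchiveBox deduplicates URLs it has already seen, but it can still help to clean a raw bookmark export before piping it in. A minimal sketch of that pre-processing step; the `normalize` and `dedupe` helpers are hypothetical, not part of ArchiveBox:

```python
# Deduplicate and lightly normalize a URL list before feeding it
# to `archivebox add`. The normalization rules here are illustrative.
from urllib.parse import urlsplit, urlunsplit

def normalize(url: str) -> str:
    """Lowercase the scheme and host, and drop URL fragments."""
    parts = urlsplit(url.strip())
    return urlunsplit((parts.scheme.lower(), parts.netloc.lower(),
                       parts.path, parts.query, ""))

def dedupe(urls):
    """Return URLs in original order, minus normalized duplicates."""
    seen, out = set(), []
    for url in urls:
        clean = normalize(url)
        if clean and clean not in seen:
            seen.add(clean)
            out.append(clean)
    return out

raw = [
    "https://Example.com/article#comments",
    "https://example.com/article",
    "https://example.com/other",
]
print("\n".join(dedupe(raw)))  # prints the two unique URLs
```

The cleaned list can then be piped to `docker compose run --rm -T archivebox add` via stdin, the same way as the bookmarks file above.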
What Gets Archived
For each URL, ArchiveBox saves multiple output formats:
| Format | Tool | Description |
|---|---|---|
| HTML | wget | Full static HTML snapshot with assets |
| PDF | Chrome/Chromium | Rendered PDF of the page |
| Screenshot | Chrome/Chromium | Full-page PNG screenshot |
| WARC | wget | Web Archive format (industry standard) |
| Readability | Mozilla Readability | Clean article text extraction |
| SingleFile | SingleFile | Complete page in one HTML file |
| Git | git | Clone entire git repos |
| Media | yt-dlp | Download videos, audio, playlists |
| Headers | curl | HTTP response headers |
Configuration
Key environment variables:
| Variable | Default | Description |
|---|---|---|
| ALLOWED_HOSTS | * | Restrict access by domain (comma-separated) |
| PUBLIC_INDEX | true | Make the archive index publicly accessible |
| PUBLIC_SNAPSHOTS | true | Make individual snapshots publicly accessible |
| PUBLIC_ADD_VIEW | false | Allow unauthenticated users to add URLs |
| SEARCH_BACKEND_ENGINE | ripgrep | Search backend: ripgrep, sonic, or sqlite |
| MEDIA_MAX_SIZE | 750m | Maximum file size for media downloads |
| TIMEOUT | 60 | Download timeout in seconds per extractor |
| CHECK_SSL_VALIDITY | true | Validate SSL certificates; set to false to archive pages with invalid certs |
| SAVE_ARCHIVE_DOT_ORG | true | Also submit URLs to the Wayback Machine as an off-site backup |
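Some combinations of these settings deserve a second look before you expose the service. A small sanity-check sketch; the rules encode this article's recommendations, not anything enforced by ArchiveBox itself:

```python
# Sanity-check a few ArchiveBox settings before deploying.
# The warning rules below are illustrative recommendations.
def check(settings: dict) -> list[str]:
    warnings = []
    if settings.get("PUBLIC_ADD_VIEW") == "true":
        warnings.append("PUBLIC_ADD_VIEW=true lets anyone submit URLs to archive")
    if settings.get("ALLOWED_HOSTS") == "*" and settings.get("PUBLIC_INDEX") == "true":
        warnings.append("public index with ALLOWED_HOSTS=* — consider restricting hosts")
    if settings.get("SEARCH_BACKEND_ENGINE") == "sonic" and "SEARCH_BACKEND_HOST_NAME" not in settings:
        warnings.append("sonic backend selected but SEARCH_BACKEND_HOST_NAME is unset")
    return warnings

for warning in check({"PUBLIC_ADD_VIEW": "true", "ALLOWED_HOSTS": "*", "PUBLIC_INDEX": "true"}):
    print(f"WARNING: {warning}")
```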
Scheduled Archiving
Add a scheduler service to automatically re-archive URLs on a schedule:
```yaml
  scheduler:
    image: archivebox/archivebox:0.8.5rc52
    container_name: archivebox-scheduler
    command: schedule --foreground --every=day --depth=0
    restart: unless-stopped
    environment:
      - TIMEOUT=120
    volumes:
      - ./data:/data
    networks:
      - archivebox-net
```
Full-Text Search with Sonic
For large archives (10,000+ pages), switch from ripgrep to Sonic for faster full-text search. Uncomment the sonic service in the Compose file and update:
```yaml
      - SEARCH_BACKEND_ENGINE=sonic
      - SEARCH_BACKEND_HOST_NAME=sonic
      - SEARCH_BACKEND_PASSWORD=changeme_sonic
```
Reverse Proxy
ArchiveBox serves on port 8000. For HTTPS with a reverse proxy, see Reverse Proxy Setup.
Set ALLOWED_HOSTS to your domain name when using a reverse proxy:
```yaml
      - ALLOWED_HOSTS=archive.example.com
```
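ALLOWED_HOSTS takes a comma-separated list of hostnames. A quick sketch of how a request's Host header would be matched against it; this is illustrative logic, not ArchiveBox's actual implementation:

```python
# Illustrative host check against a comma-separated ALLOWED_HOSTS value.
def host_allowed(host: str, allowed_hosts: str) -> bool:
    allowed = [h.strip() for h in allowed_hosts.split(",")]
    return "*" in allowed or host in allowed

print(host_allowed("archive.example.com", "archive.example.com"))  # True
print(host_allowed("evil.example.com", "archive.example.com"))     # False
```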
Backup
The entire archive lives in the ./data directory. Back up this directory to preserve:
- ./data/archive/ — all archived page snapshots
- ./data/index.sqlite3 — the database of all URLs and metadata
- ./data/ArchiveBox.conf — your configuration
```bash
tar czf archivebox-backup-$(date +%Y%m%d).tar.gz ./data
```
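It is worth verifying that a backup archive is readable before trusting it. A sketch of that pattern, run here against a throwaway directory so it is safe to try anywhere; substitute your real ./data directory in practice:

```bash
# Demonstration of the backup-and-verify pattern on a temp directory.
set -eu
DATA_DIR=$(mktemp -d)                       # stand-in for ./data
echo "demo" > "$DATA_DIR/index.sqlite3"
OUT="archivebox-backup-$(date +%Y%m%d).tar.gz"
tar czf "$OUT" -C "$(dirname "$DATA_DIR")" "$(basename "$DATA_DIR")"
tar tzf "$OUT" > /dev/null && echo "backup verified: $OUT"
rm -rf "$DATA_DIR" "$OUT"                   # clean up the demo files
```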
For a comprehensive backup strategy, see Backup Strategy.
Troubleshooting
Chrome/Chromium Fails to Start
Symptom: PDF and screenshot extraction fails with “Chrome not found” or “Failed to launch Chrome.”
Fix: The Docker image ships with Chromium. If you’re running outside Docker, install Chromium:
```bash
apt install chromium-browser
```
“Permission Denied” on Data Directory
Symptom: ArchiveBox can’t write to /data inside the container.
Fix: Set ownership on the host data directory:
```bash
sudo chown -R 911:911 ./data
```
Large Archives Slow Down
Symptom: Search and browsing become slow once the archive grows past roughly 5,000 snapshots.
Fix: Switch from ripgrep to the Sonic search backend. Add the Sonic service and update SEARCH_BACKEND_ENGINE=sonic.
yt-dlp Errors on Media Downloads
Symptom: Video downloads fail with “Unable to extract” or similar errors.
Fix: yt-dlp needs frequent updates as sites change. Update the container image or run:
```bash
docker compose exec archivebox pip install --upgrade yt-dlp
```
Resource Requirements
- RAM: ~300 MB idle, spikes to 1-2 GB during active archiving (Chrome rendering)
- CPU: Medium — Chrome PDF/screenshot generation is CPU-intensive
- Disk: ~1 MB per page average (varies widely — media-heavy pages use much more)
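The disk figure above supports a back-of-envelope capacity estimate. A tiny sketch; the ~1 MB/page average is a rough planning number, and media-heavy archives will exceed it substantially:

```python
# Rough disk estimate: snapshot count times average snapshot size.
def estimate_gb(pages: int, mb_per_page: float = 1.0) -> float:
    """Estimated archive size in GB (1024 MB per GB)."""
    return pages * mb_per_page / 1024

# e.g. 20,000 snapshots at the ~1 MB average:
print(f"{estimate_gb(20_000):.1f} GB")  # prints "19.5 GB"
```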
Verdict
ArchiveBox is the most comprehensive self-hosted web archiver available. The multi-format approach (HTML + PDF + screenshot + WARC) means you have redundant copies of everything. It’s ideal for researchers, journalists, or anyone who’s lost a crucial bookmark to link rot. The trade-off is resource usage — Chrome-based archiving is heavy. For lighter use cases where you just want to save article text, Wallabag or Hoarder are simpler options.
Related
- Guide to Self-Hosted Web Archiving
- ArchiveBox vs Kiwix: Which to Self-Host?
- ArchiveBox vs Wallabag: Which Should You Self-Host?
- ArchiveBox vs Wayback Machine: Self-Hosted Archiving
- Self-Hosting Wallabag with Docker Compose
- Self-Hosting Linkwarden with Docker Compose
- Self-Hosting Hoarder with Docker Compose
- Best Self-Hosted Bookmarks & Read Later Tools
- Docker Compose Basics
- Reverse Proxy Setup
- Backup Strategy