Self-Hosted Web Archiving: Complete Guide
What Is Self-Hosted Web Archiving?
Web archiving is the practice of saving copies of web pages, articles, and online content for long-term access. Web content disappears constantly: link rot, deleted pages, paywalled content, and service shutdowns mean that the page you read today may not exist tomorrow.
Updated March 2026: Verified with latest Docker images and configurations.
Self-hosted web archiving gives you a personal Wayback Machine: save pages you care about, access them offline, search through your archive, and ensure important content survives regardless of what happens to the original source.
Why Self-Host Your Archives?
| Concern | Cloud Solution | Self-Hosted Solution |
|---|---|---|
| Privacy | Pocket, Instapaper see everything you save | Your server, your data |
| Availability | Services shut down (Google Reader, Omnivore) | Runs as long as your server does |
| Completeness | Often saves text only | Full page snapshots (HTML, CSS, images, JS) |
| Search | Limited to service’s search | Full-text search across your entire archive |
| Offline Access | Usually requires internet | Access everything locally |
| Cost | $5-10/month subscriptions | One-time server cost |
Prerequisites
- A Linux server with Docker and Docker Compose
- 2 GB of free RAM
- 20+ GB of disk space (archives grow quickly — plan for 100+ GB long-term)
- Basic command-line familiarity
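A quick pre-flight check can confirm these prerequisites before you start; a minimal sketch (the `ARCHIVE_DIR` path and the 20 GB threshold are just the numbers from the list above — adjust to your setup):

```shell
# Quick pre-flight check (thresholds match the prerequisites above; adjust paths).
ARCHIVE_DIR="${ARCHIVE_DIR:-.}"
MIN_DISK_GB=20

# Free space on the filesystem that will hold the archive, in whole GB (GNU df).
avail_gb=$(df -BG --output=avail "$ARCHIVE_DIR" | tail -1 | tr -dc '0-9')

if [ "${avail_gb:-0}" -ge "$MIN_DISK_GB" ]; then
  echo "disk ok: ${avail_gb}G free"
else
  echo "warning: ${avail_gb:-0}G free, prerequisites call for ${MIN_DISK_GB}G+"
fi

# Confirm Docker and the Compose plugin are present.
command -v docker >/dev/null 2>&1 && docker compose version || echo "docker compose not found"
```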
Choosing Your Tools
The self-hosted archiving ecosystem spans several distinct use cases, each served by a different tool:
| Use Case | Best Tool | What It Does |
|---|---|---|
| Save web pages with full fidelity | ArchiveBox | Saves HTML, screenshots, PDFs, WARC files, and media |
| Offline access to entire websites | Kiwix | Serves ZIM files — compressed website snapshots |
| Torrent/DHT content discovery | Bitmagnet | Crawls BitTorrent DHT network for content indexing |
| Read-later with archiving | Wallabag | Saves article text with tagging and reading features |
| Bookmark management | Linkwarden | Saves bookmarks with page snapshots and screenshots |
ArchiveBox: The Personal Wayback Machine
ArchiveBox is the most comprehensive self-hosted archiving tool. For every URL you feed it, it creates multiple archive formats:
- HTML snapshot — full page with CSS and images
- Screenshot — visual capture of the page
- PDF — printable version
- WARC file — web archive standard format
- Article text — extracted readable content
- Git history — for pages hosted on GitHub/GitLab
Best for: Researchers, journalists, anyone who needs reliable copies of web content that might disappear.
Kiwix: Offline Wikipedia and More
Kiwix serves ZIM files — compressed snapshots of entire websites. The Kiwix project maintains pre-built ZIM files for Wikipedia (all languages), Stack Overflow, Project Gutenberg, TED Talks, and hundreds of other resources.
Best for: Offline access to reference material, education in areas with limited internet, disaster preparedness.
Building a Complete Archive Stack
For maximum coverage, combine tools:
```yaml
services:
  # Save individual pages and articles
  archivebox:
    image: archivebox/archivebox:0.8.5rc54
    container_name: archivebox
    ports:
      - "8000:8000"
    volumes:
      - archivebox-data:/data
    environment:
      - ADMIN_USERNAME=admin
      - ADMIN_PASSWORD=${ARCHIVEBOX_PASSWORD}
    restart: unless-stopped

  # Serve offline reference material
  kiwix:
    image: ghcr.io/kiwix/kiwix-serve:3.8.2
    container_name: kiwix
    ports:
      - "8080:8080"
    volumes:
      - ./zim:/data
    command: /data/*.zim
    restart: unless-stopped

volumes:
  archivebox-data:
```
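The `${ARCHIVEBOX_PASSWORD}` reference is interpolated by Docker Compose from an `.env` file in the same directory as the compose file; a minimal example (the value is a placeholder — generate your own):

```shell
# .env — docker compose reads this file automatically from the project directory
ARCHIVEBOX_PASSWORD=change-me-to-a-long-random-string
```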
Archiving Workflows
Manual Page Saving
Add individual URLs to ArchiveBox:
```shell
# Save a single page
docker compose exec archivebox archivebox add "https://example.com/important-article"

# Save multiple URLs from a file (-T disables the TTY so stdin redirection works)
docker compose exec -T archivebox archivebox add --depth=0 < urls.txt

# Follow links one level deep from the starting page
docker compose exec archivebox archivebox add --depth=1 "https://blog.example.com"
```
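For batch imports, the `urls.txt` pattern is easy to script; a minimal sketch (the URLs are placeholders, and the docker step degrades to a message if the stack isn't running):

```shell
# Build a URL list, one per line (these URLs are placeholders).
cat > urls.txt <<'EOF'
https://example.com/article-1
https://example.com/article-2
EOF

# Feed the whole list to ArchiveBox in one run; -T disables the TTY
# so the stdin redirection works inside docker compose exec.
if command -v docker >/dev/null 2>&1; then
  docker compose exec -T archivebox archivebox add --depth=0 < urls.txt \
    || echo "archivebox container not running?"
else
  echo "docker not found; created urls.txt only"
fi
```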
Automated Archiving
Set up automatic archiving of content you care about:
RSS Feed Archiving:
```shell
# Archive all articles from an RSS feed
docker compose exec archivebox archivebox add "https://blog.example.com/rss.xml"
```
Run this on a schedule with cron or a Docker cron job:
```
# Add to crontab: archive RSS feeds every 6 hours
0 */6 * * * docker compose -f /path/to/docker-compose.yml exec -T archivebox archivebox add "https://blog.example.com/rss.xml"
```
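With more than a couple of feeds, a single crontab one-liner gets unwieldy. One approach is to keep feeds in a plain text file and generate a runner script from it, then schedule that script instead; a sketch (feed URLs are placeholders, and the compose path is the same `/path/to/` stand-in used above):

```shell
# feeds.txt: one feed URL per line; '#' lines are comments (URLs are placeholders).
cat > feeds.txt <<'EOF'
# blogs worth keeping
https://blog.example.com/rss.xml
https://news.example.org/feed.xml
EOF

# Generate one archive command per feed into a small runner script.
# Schedule "sh archive-feeds.sh" from cron instead of a long one-liner.
grep -v '^#' feeds.txt | while read -r feed; do
  [ -n "$feed" ] || continue
  echo "docker compose -f /path/to/docker-compose.yml exec -T archivebox archivebox add \"$feed\""
done > archive-feeds.sh
chmod +x archive-feeds.sh
cat archive-feeds.sh
```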
Browser Extension Integration:
ArchiveBox has a companion browser extension (Chrome/Firefox). Save pages with one click while browsing; they're sent to your self-hosted instance automatically.
Downloading ZIM Files for Kiwix
Browse the Kiwix library at library.kiwix.org and download ZIM files:
```shell
# Download English Wikipedia (full, ~100 GB)
wget -P ./zim/ "https://download.kiwix.org/zim/wikipedia/wikipedia_en_all_maxi_2026-01.zim"

# Download a smaller version (no images, ~25 GB)
wget -P ./zim/ "https://download.kiwix.org/zim/wikipedia/wikipedia_en_all_nopic_2026-01.zim"

# Download Stack Overflow (~30 GB)
wget -P ./zim/ "https://download.kiwix.org/zim/stackoverflow/stackoverflow.com_en_all_2026-01.zim"
```
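Downloads this large are worth guarding: check free space before starting, and use `wget -c` so an interrupted transfer resumes instead of restarting. A sketch with a hypothetical `fetch_zim` helper (the function name and size arguments are this guide's invention, not part of Kiwix):

```shell
# Hypothetical helper: check free space before committing to a huge download.
# Usage: fetch_zim <url> <approx-size-GB>
fetch_zim() {
  url="$1"; need_gb="$2"
  mkdir -p ./zim
  avail_gb=$(df -BG --output=avail ./zim | tail -1 | tr -dc '0-9')
  if [ "${avail_gb:-0}" -lt "$need_gb" ]; then
    echo "skip: need ~${need_gb}G but only ${avail_gb:-0}G free under ./zim"
    return 1
  fi
  # -c resumes a partial file if a multi-hour download gets interrupted.
  wget -c -P ./zim/ "$url"
}

# Example (uncomment to run):
# fetch_zim "https://download.kiwix.org/zim/wikipedia/wikipedia_en_all_nopic_2026-01.zim" 25
```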
Storage Planning
Archives grow continuously. Plan your storage accordingly:
| Content Type | Storage Per Item | Example |
|---|---|---|
| Web page (full HTML + assets) | 1-10 MB | News article with images |
| Web page (PDF only) | 200-500 KB | Text-heavy blog post |
| Screenshot | 100-500 KB | Standard 1080p capture |
| WARC file | 1-20 MB | Full page with all assets |
| Wikipedia (English, full) | ~100 GB | Complete encyclopedia |
| Stack Overflow | ~30 GB | All questions and answers |
Recommendation: Start with 100 GB of dedicated storage. Use separate volumes or disks for archive data — this keeps your system disk clean and makes backups simpler.
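To see where the space is actually going, a plain `du` breakdown over the archive directory is usually enough (the `ARCHIVE_ROOT` path is an example — point it at your archive volume or `./zim` directory):

```shell
# Where is the space going? Point this at your archive root (path is an example).
ARCHIVE_ROOT="${ARCHIVE_ROOT:-.}"

du -sh "$ARCHIVE_ROOT"                                  # total size
du -sh "$ARCHIVE_ROOT"/* 2>/dev/null | sort -rh | head  # largest top-level entries
```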
Common Mistakes
- Not planning for storage growth. A few hundred archived pages is manageable; a few thousand with full HTML snapshots, screenshots, and PDFs can consume 50+ GB quickly.
- Archiving only text. The full page context (layout, images, sidebar) matters. Save full HTML snapshots, not just extracted article text.
- Not backing up the archive. Your archive IS the backup of web content, but the archive itself needs backing up too. See Backup Strategy.
- Archiving everything. Be selective. Focus on content that matters to you: articles you reference, documentation that might disappear, research sources.
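The backup itself can be as simple as a dated tarball; a minimal sketch (paths are examples — for the named `archivebox-data` volume from the compose file, run tar inside a throwaway container as shown in the comment):

```shell
# Minimal sketch: tar an archive directory to a dated file (paths are examples).
# For the named archivebox-data volume, run tar inside a throwaway container:
#   docker run --rm -v archivebox-data:/data:ro -v "$PWD/backups":/backup \
#     alpine tar czf "/backup/archivebox-$(date +%F).tar.gz" -C /data .
SRC="${SRC:-./zim}"
BACKUP_DIR="${BACKUP_DIR:-./backups}"
mkdir -p "$SRC" "$BACKUP_DIR"

tar czf "$BACKUP_DIR/archive-$(date +%F).tar.gz" -C "$SRC" .
ls -lh "$BACKUP_DIR"
```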
Next Steps
- Set up ArchiveBox: How to Self-Host ArchiveBox
- Set up Kiwix: How to Self-Host Kiwix
- Try Bitmagnet for DHT indexing: How to Self-Host Bitmagnet
- Compare archiving tools: ArchiveBox vs Kiwix