Self-Hosted Web Archiving: Complete Guide

What Is Self-Hosted Web Archiving?

Web archiving is the practice of saving copies of web pages, articles, and other online content for long-term access. Web content disappears constantly: link rot, deleted pages, paywalls, and service shutdowns mean that the page you read today may not exist tomorrow.

Updated March 2026: Verified with latest Docker images and configurations.

Self-hosted web archiving gives you a personal Wayback Machine: save pages you care about, access them offline, search through your archive, and ensure important content survives regardless of what happens to the original source.

Why Self-Host Your Archives?

| Concern | Cloud Solution | Self-Hosted Solution |
|---|---|---|
| Privacy | Pocket, Instapaper see everything you save | Your server, your data |
| Availability | Services shut down (Google Reader, Omnivore) | Runs as long as your server does |
| Completeness | Often saves text only | Full page snapshots (HTML, CSS, images, JS) |
| Search | Limited to service’s search | Full-text search across your entire archive |
| Offline Access | Usually requires internet | Access everything locally |
| Cost | $5-10/month subscriptions | One-time server cost |

Prerequisites

  • A Linux server with Docker and Docker Compose
  • 2 GB of free RAM
  • 20+ GB of disk space (archives grow quickly — plan for 100+ GB long-term)
  • Basic command-line familiarity
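The list above can be checked from the command line before installing anything. The following is a minimal sketch, assuming a Linux host with /proc/meminfo and a POSIX df; the function names are hypothetical and the thresholds simply mirror the 2 GB RAM and 20 GB disk figures above.

```shell
#!/bin/sh
# Sketch: compare available resources against the minimums listed above.
# All function names here are illustrative, not part of any tool.

# meets_minimum ACTUAL REQUIRED: succeeds when ACTUAL >= REQUIRED (integers).
meets_minimum() {
  [ "$1" -ge "$2" ]
}

# available_ram_mb: MemAvailable from /proc/meminfo, converted from kB to MB.
available_ram_mb() {
  awk '/^MemAvailable:/ { print int($2 / 1024) }' /proc/meminfo
}

# free_disk_gb PATH: free space on the filesystem holding PATH, in whole GB.
free_disk_gb() {
  df -Pk "$1" | awk 'NR == 2 { print int($4 / 1024 / 1024) }'
}

# report LABEL ACTUAL REQUIRED UNIT: print one OK/LOW line per resource.
report() {
  if meets_minimum "$2" "$3"; then
    echo "OK   $1: $2 $4 (need $3)"
  else
    echo "LOW  $1: $2 $4 (need $3)"
  fi
}

# check_prerequisites [PATH]: PATH is where archive data will live (default: .).
check_prerequisites() {
  if command -v docker >/dev/null 2>&1; then
    echo "OK   docker found"
  else
    echo "MISSING docker"
  fi
  report "RAM" "$(available_ram_mb)" 2048 "MB"
  report "Disk" "$(free_disk_gb "${1:-.}")" 20 "GB"
}
```

Calling `check_prerequisites /srv/archive` (the path is only an example) prints one line per requirement, so a LOW or MISSING line is easy to spot before you start pulling images.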

Choosing Your Tools

The self-hosted archiving ecosystem covers several distinct use cases, each served by a different tool:

| Use Case | Best Tool | What It Does |
|---|---|---|
| Save web pages with full fidelity | ArchiveBox | Saves HTML, screenshots, PDFs, WARC files, and media |
| Offline access to entire websites | Kiwix | Serves ZIM files (compressed website snapshots) |
| Torrent/DHT content discovery | Bitmagnet | Crawls BitTorrent DHT network for content indexing |
| Read-later with archiving | Wallabag | Saves article text with tagging and reading features |
| Bookmark management | Linkwarden | Saves bookmarks with page snapshots and screenshots |

ArchiveBox: The Personal Wayback Machine

ArchiveBox is the most comprehensive self-hosted archiving tool. For every URL you feed it, it creates multiple archive formats:

  • HTML snapshot — full page with CSS and images
  • Screenshot — visual capture of the page
  • PDF — printable version
  • WARC file — web archive standard format
  • Article text — extracted readable content
  • Git history — for pages hosted on GitHub/GitLab

Best for: Researchers, journalists, anyone who needs reliable copies of web content that might disappear.

Kiwix: Offline Wikipedia and More

Kiwix serves ZIM files — compressed snapshots of entire websites. The Kiwix project maintains pre-built ZIM files for Wikipedia (all languages), Stack Overflow, Project Gutenberg, TED Talks, and hundreds of other resources.

Best for: Offline access to reference material, education in areas with limited internet, disaster preparedness.

Building a Complete Archive Stack

For maximum coverage, combine tools:

services:
  # Save individual pages and articles
  archivebox:
    image: archivebox/archivebox:0.8.5rc54
    container_name: archivebox
    ports:
      - "8000:8000"
    volumes:
      - archivebox-data:/data
    environment:
      - ADMIN_USERNAME=admin
      - ADMIN_PASSWORD=${ARCHIVEBOX_PASSWORD}
    restart: unless-stopped

  # Serve offline reference material
  kiwix:
    image: ghcr.io/kiwix/kiwix-serve:3.8.2
    container_name: kiwix
    ports:
      - "8080:8080"
    volumes:
      - ./zim:/data
    command: /data/*.zim
    restart: unless-stopped

volumes:
  archivebox-data:
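The Compose file above reads ARCHIVEBOX_PASSWORD from the environment, so the variable has to be defined before the stack starts. One way to do that is a .env file next to docker-compose.yml, which Docker Compose loads automatically. The sketch below assumes that layout; `generate_password` and `write_env_file` are hypothetical helper names.

```shell
#!/bin/sh
# Sketch: create the .env file that supplies ARCHIVEBOX_PASSWORD to Compose.
# The helpers are illustrative; they are not part of ArchiveBox or Docker.

# generate_password: 24 random alphanumeric characters from /dev/urandom.
generate_password() {
  LC_ALL=C tr -dc 'A-Za-z0-9' < /dev/urandom | head -c 24
}

# write_env_file PATH: write the variable once; never overwrite an existing file.
write_env_file() {
  if [ -e "$1" ]; then
    echo "$1 already exists, leaving it untouched" >&2
    return 0
  fi
  # umask 077 inside a subshell makes the new file readable by its owner only.
  ( umask 077 && printf 'ARCHIVEBOX_PASSWORD=%s\n' "$(generate_password)" > "$1" )
}

# Typical use, from the directory that holds docker-compose.yml:
#   write_env_file .env
#   mkdir -p zim            # the kiwix service mounts ./zim
#   docker compose up -d
#   docker compose ps
```

Refusing to overwrite an existing file is deliberate: regenerating the password on every run would leave .env out of step with the password you actually log in with.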

Archiving Workflows

Manual Page Saving

Add individual URLs to ArchiveBox:

# Save a single page
docker compose exec archivebox archivebox add "https://example.com/important-article"

# Save multiple URLs from a file
docker compose exec -T archivebox archivebox add --depth=0 < urls.txt

# Save a page plus every page it links to (depth 1)
docker compose exec archivebox archivebox add --depth=1 "https://blog.example.com"
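URL lists collected by hand tend to contain duplicates, blank lines, and notes. A small filter in front of the bulk import keeps that noise out of the archive. This is a sketch only: `clean_urls` is a hypothetical helper, and it assumes one URL per line in urls.txt.

```shell
#!/bin/sh
# clean_urls: read candidate lines on stdin and print unique http(s) URLs,
# keeping the first occurrence of each in its original order.
clean_urls() {
  tr -d '\r' \
    | sed -e 's/^[[:space:]]*//' -e 's/[[:space:]]*$//' \
    | grep -E '^https?://' \
    | awk '!seen[$0]++'
}

# Possible use with the bulk import shown above (-T because stdin is a pipe):
#   clean_urls < urls.txt | docker compose exec -T archivebox archivebox add --depth=0
```

Lines that do not start with http:// or https://, such as comments or ftp links, are dropped silently, so review the list first if you expect other schemes.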

Automated Archiving

Set up automatic archiving of content you care about:

RSS Feed Archiving:

# Archive all articles from an RSS feed (--depth=1 follows the links listed in the feed)
docker compose exec archivebox archivebox add --depth=1 "https://blog.example.com/rss.xml"

Run this on a schedule with cron or a Docker cron job:

# Add to crontab: archive RSS feeds every 6 hours
0 */6 * * * docker compose -f /path/to/docker-compose.yml exec -T archivebox archivebox add --depth=1 "https://blog.example.com/rss.xml"
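With more than a couple of feeds, one cron line per feed gets hard to maintain. A wrapper that reads feeds from a file is one alternative. The sketch below assumes a plain-text feed list with one URL per line; `archive_feeds` and the ARCHIVE_CMD variable are hypothetical names, and the default command is modelled on the cron entry above.

```shell
#!/bin/sh
# ARCHIVE_CMD: command that receives one feed URL as its last argument.
# Assumption: --depth=1 asks ArchiveBox to follow the links listed in each feed.
: "${ARCHIVE_CMD:=docker compose -f /path/to/docker-compose.yml exec -T archivebox archivebox add --depth=1}"

# archive_feeds FILE: run ARCHIVE_CMD for every feed in FILE.
# Blank lines and lines starting with # are skipped.
# Returns non-zero if any feed failed, so cron can report it.
archive_feeds() {
  failures=0
  while IFS= read -r feed || [ -n "$feed" ]; do
    case "$feed" in
      ''|'#'*) continue ;;
    esac
    # </dev/null stops the command from swallowing the rest of the feed list.
    # $ARCHIVE_CMD is left unquoted on purpose so it splits into words.
    if ! $ARCHIVE_CMD "$feed" </dev/null; then
      echo "failed: $feed" >&2
      failures=$((failures + 1))
    fi
  done < "$1"
  [ "$failures" -eq 0 ]
}

# To use from cron, save this as a script that ends with:
#   archive_feeds "$1"
# and schedule it, for example:
#   0 */6 * * * /path/to/archive-feeds.sh /path/to/feeds.txt
```

Setting `ARCHIVE_CMD="echo would add"` before calling the function gives a dry run that prints each feed instead of archiving it.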

Browser Extension Integration:

ArchiveBox works with the ArchiveBox browser extension (Chrome/Firefox). Save pages with one click while browsing — they’re sent to your self-hosted instance automatically.

Downloading ZIM Files for Kiwix

Browse the Kiwix library at library.kiwix.org and download ZIM files:

# Download English Wikipedia (full, ~100 GB)
# File names include a release date; check the listing for the current one.
wget -P ./zim/ "https://download.kiwix.org/zim/wikipedia/wikipedia_en_all_maxi_2026-01.zim"

# Download a smaller version (no images, ~25 GB)
wget -P ./zim/ "https://download.kiwix.org/zim/wikipedia/wikipedia_en_all_nopic_2026-01.zim"

# Download Stack Overflow (~30 GB); Stack Exchange sites live under stack_exchange/
wget -P ./zim/ "https://download.kiwix.org/zim/stack_exchange/stackoverflow.com_en_all_2026-01.zim"
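Downloads of this size are worth verifying before Kiwix serves them, since a truncated ZIM file may only fail later. The sketch below compares a file against a known SHA-256 digest. `verify_sha256` is a hypothetical helper; where the expected digest comes from is an assumption (the download server is believed to publish a checksum alongside each file, so confirm that for the file you fetch).

```shell
#!/bin/sh
# sha256_of FILE: print only the hex digest of FILE.
sha256_of() {
  sha256sum "$1" | awk '{ print $1 }'
}

# verify_sha256 FILE EXPECTED_HEX
# Returns 0 on match, 1 on mismatch, 2 if the file cannot be read.
verify_sha256() {
  if [ ! -r "$1" ]; then
    echo "cannot read: $1" >&2
    return 2
  fi
  actual=$(sha256_of "$1")
  if [ "$actual" = "$2" ]; then
    echo "OK: $1"
  else
    echo "MISMATCH: $1 (expected $2, got $actual)" >&2
    return 1
  fi
}

# Possible use after an interrupted download (wget -c resumes a partial file):
#   wget -c -P ./zim/ "<zim url>"
#   verify_sha256 ./zim/<file>.zim "<published digest>"
```

If the digest does not match, resuming with `wget -c` is usually cheaper than starting the download again.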

Storage Planning

Archives grow continuously. Plan your storage accordingly:

| Content Type | Storage Per Item | Example |
|---|---|---|
| Web page (full HTML + assets) | 1-10 MB | News article with images |
| Web page (PDF only) | 200-500 KB | Text-heavy blog post |
| Screenshot | 100-500 KB | Standard 1080p capture |
| WARC file | 1-20 MB | Full page with all assets |
| Wikipedia (English, full) | ~100 GB | Complete encyclopedia |
| Stack Overflow | ~30 GB | All questions and answers |

Recommendation: Start with 100 GB of dedicated storage. Use separate volumes or disks for archive data — this keeps your system disk clean and makes backups simpler.
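A rough estimate helps turn the per-item figures above into a disk size. The sketch below multiplies a page count by an assumed per-page size and rounds up to whole gigabytes, using 1 GB = 1000 MB for simplicity. `estimate_gb` is a hypothetical helper, and the per-page size is whatever you pick from the table.

```shell
#!/bin/sh
# estimate_gb PAGES MB_PER_PAGE: storage estimate in whole GB, rounded up.
estimate_gb() {
  echo $(( ($1 * $2 + 999) / 1000 ))
}

# Example: 5000 pages at the table's upper figure of 10 MB per full snapshot.
estimate_gb 5000 10   # prints 50
```

Treat the result as a lower bound when several output formats are enabled for each URL. For measured rather than estimated numbers, `docker system df -v` lists the space used by each Docker volume.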

Common Mistakes

  1. Not planning for storage growth. A few hundred archived pages is manageable. A few thousand with full HTML snapshots, screenshots, and PDFs can consume 50+ GB quickly.

  2. Archiving only text. The full page context (layout, images, sidebar) matters. Save full HTML snapshots, not just extracted article text.

  3. Not backing up the archive. Your archive is your backup of the web, but the archive itself needs a backup too. See Backup Strategy.

  4. Archiving everything. Be selective. Focus on content that matters to you — articles you reference, documentation that might disappear, research sources.
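For the backup point above, a timestamped tarball is about the simplest thing that works. This is a minimal sketch, not a full backup strategy: `backup_dir` is a hypothetical helper, and it assumes the archive data is reachable as a directory on the host.

```shell
#!/bin/sh
# backup_dir SRC_DIR DEST_DIR
# Writes DEST_DIR/archive-YYYYmmdd-HHMMSS.tar.gz and prints its path.
# DEST_DIR should live outside SRC_DIR, ideally on a different disk.
backup_dir() {
  stamp=$(date +%Y%m%d-%H%M%S)
  target="$2/archive-$stamp.tar.gz"
  mkdir -p "$2" || return 1
  tar -czf "$target" -C "$1" . || return 1
  # Listing the tarball is a cheap readability check, not a full restore test.
  tar -tzf "$target" >/dev/null || return 1
  echo "$target"
}

# Possible use (both paths are examples):
#   backup_dir /srv/archive/data /mnt/backup/archivebox
```

The Compose file earlier in this guide keeps ArchiveBox data in a named volume rather than a host directory. In that setup, one option is to mount the volume read-only into a throwaway container and run tar there; check `docker volume ls` for the exact volume name, since Compose usually prefixes it with the project name.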
