How to Self-Host Whisper with Docker Compose

What Is Whisper?

Whisper is OpenAI’s open-source speech-to-text model. It transcribes audio in 99+ languages with high accuracy, including translation to English. Self-hosting Whisper means your audio never leaves your server — no API costs, no data sharing. Several community projects wrap Whisper in a REST API for easy integration.

Prerequisites

  • A Linux server (Ubuntu 22.04+ recommended)
  • Docker and Docker Compose installed
  • 4 GB+ RAM (CPU mode) or NVIDIA GPU with 4+ GB VRAM
  • 5 GB+ free disk space
  • NVIDIA Container Toolkit (for GPU mode)

Docker Compose Configuration

The best Docker-based Whisper deployment is Speaches (formerly Faster Whisper Server), which provides an OpenAI-compatible API:

services:
  whisper:
    image: ghcr.io/speaches-ai/speaches:v0.8.3
    # Formerly fedirz/faster-whisper-server — project renamed to speaches
    container_name: whisper
    ports:
      - "8000:8000"
    volumes:
      - whisper_models:/root/.cache/huggingface
    environment:
      - WHISPER__MODEL=Systran/faster-whisper-large-v3
      # Smaller, faster models:
      # - WHISPER__MODEL=Systran/faster-whisper-medium
      # - WHISPER__MODEL=Systran/faster-whisper-small
      # - WHISPER__MODEL=Systran/faster-whisper-base
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
    restart: unless-stopped

volumes:
  whisper_models:

Start the stack:

docker compose up -d

The model downloads on first start (large-v3 is ~3 GB).
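You can follow the download in the container logs and then confirm the server is up. The `/v1/models` call below assumes Speaches exposes the standard OpenAI-compatible model-listing endpoint:

```shell
# Follow the container logs to watch the model download progress
docker compose logs -f whisper

# Once the download finishes, the OpenAI-compatible models endpoint should respond
curl http://localhost:8000/v1/models
```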

Initial Setup

Test transcription with a curl command:

curl -X POST http://localhost:8000/v1/audio/transcriptions \
  -F "file=@audio.mp3" \
  -F "model=whisper-1"

The response contains the transcribed text as JSON, matching OpenAI’s API response schema.
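Because the API is OpenAI-compatible, the official `openai` Python client can also be pointed at the local server. This is a sketch; the file name is a placeholder, and the `api_key` value is arbitrary since the local server does not check it:

```python
from openai import OpenAI

# Point the official OpenAI client at the local Speaches server.
# The api_key is required by the client but ignored by the server.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="none")

with open("audio.mp3", "rb") as f:  # placeholder file name
    result = client.audio.transcriptions.create(model="whisper-1", file=f)

print(result.text)
```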

Configuration

Model Selection

| Model | Size | VRAM Required | Speed | Accuracy |
|-------|------|---------------|-------|----------|
| faster-whisper-tiny | ~75 MB | ~1 GB | Very fast | Low |
| faster-whisper-base | ~140 MB | ~1 GB | Fast | Medium |
| faster-whisper-small | ~460 MB | ~2 GB | Moderate | Good |
| faster-whisper-medium | ~1.5 GB | ~3 GB | Slow | Better |
| faster-whisper-large-v3 | ~3 GB | ~5 GB | Slowest | Best |

For most use cases, faster-whisper-small or faster-whisper-medium offers the best speed/accuracy trade-off.

API Endpoints

The API is OpenAI-compatible:

| Endpoint | Method | Description |
|----------|--------|-------------|
| /v1/audio/transcriptions | POST | Transcribe audio to text |
| /v1/audio/translations | POST | Translate audio to English |

Translation

Translate any language to English:

curl -X POST http://localhost:8000/v1/audio/translations \
  -F "file=@audio.mp3" \
  -F "model=whisper-1"

Advanced Configuration

Timestamps

Get word-level timestamps:

curl -X POST http://localhost:8000/v1/audio/transcriptions \
  -F "file=@audio.mp3" \
  -F "model=whisper-1" \
  -F "response_format=verbose_json" \
  -F "timestamp_granularities[]=word"
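In the OpenAI-style verbose_json shape, word timestamps arrive as a words array of objects with word, start, and end fields. A minimal sketch of turning that into a readable timeline, using a hypothetical sample response:

```python
# Format word-level timestamps from a verbose_json transcription response.
# `response` is a hypothetical example of the OpenAI-style response shape.
response = {
    "text": "hello world",
    "words": [
        {"word": "hello", "start": 0.0, "end": 0.42},
        {"word": "world", "start": 0.48, "end": 0.95},
    ],
}

def format_words(resp):
    """Render each word with its start time in seconds."""
    return [f'{w["start"]:6.2f}s  {w["word"]}' for w in resp["words"]]

for line in format_words(response):
    print(line)
```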

Integration with Open WebUI

Open WebUI supports Whisper for voice input. Set the Whisper API URL in Open WebUI’s settings to http://whisper:8000/v1 (this hostname resolves only if both containers share a Docker network; otherwise use your server’s IP) to enable voice-to-text in your ChatGPT alternative.

Reverse Proxy

Configure your reverse proxy to forward to port 8000. See Reverse Proxy Setup.

Backup

The models volume stores downloaded Whisper models. These can be re-downloaded, so backups are optional. See Backup Strategy.

Troubleshooting

Transcription Returns Empty

Symptom: API returns empty text. Fix: Check that the audio file is in a supported format (mp3, wav, m4a, flac, ogg, webm). Verify the file isn’t corrupted. Check container logs: docker logs whisper.

Out of Memory

Symptom: Container crashes with OOM error. Fix: Use a smaller model. faster-whisper-small works well on 4 GB VRAM. For CPU mode, ensure sufficient system RAM (model size + 2 GB overhead).

Slow Transcription

Symptom: Transcription takes minutes for short audio. Fix: Ensure GPU mode is active (check nvidia-smi). Use a smaller model. Use the CUDA image variant, not CPU.
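To confirm the container actually sees the GPU, run nvidia-smi inside it:

```shell
# If this prints the GPU table, the container has GPU access;
# an error here means the NVIDIA Container Toolkit or the
# deploy.resources section is misconfigured.
docker exec whisper nvidia-smi
```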

Resource Requirements

  • VRAM: 1-5 GB depending on model size
  • RAM: 2-8 GB (CPU mode depends on model size)
  • CPU: Medium (CPU-only mode is 5-10x slower than GPU)
  • Disk: 100 MB - 3 GB per model

Verdict

Self-hosted Whisper gives you private, unlimited speech-to-text transcription with no API costs. The faster-whisper implementation is significantly faster than the original OpenAI Whisper while maintaining accuracy. For most self-hosters, the small or medium model provides the best speed/accuracy balance.

Choose Whisper for standalone speech-to-text. Choose LocalAI if you want Whisper combined with LLM inference and image generation in one service.

Frequently Asked Questions

Does Whisper require a GPU?

No, but it is strongly recommended. Whisper runs on CPU but is 5-10x slower. A 10-minute audio file that takes 30 seconds on an NVIDIA GPU may take 5+ minutes on CPU. For occasional transcription, CPU works. For regular or batch use, an NVIDIA GPU with 4+ GB VRAM is essential. Remove the deploy.resources section from the Compose file for CPU-only mode.
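A CPU-only variant of the Compose service might look like the sketch below. The exact CPU image tag is an assumption — check the Speaches releases for the tag your version publishes:

```yaml
services:
  whisper:
    image: ghcr.io/speaches-ai/speaches:v0.8.3-cpu  # exact CPU tag may differ; check project releases
    container_name: whisper
    ports:
      - "8000:8000"
    volumes:
      - whisper_models:/root/.cache/huggingface
    environment:
      - WHISPER__MODEL=Systran/faster-whisper-small  # smaller model suits CPU inference
    # no deploy.resources section — CPU-only mode
    restart: unless-stopped

volumes:
  whisper_models:
```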

How does self-hosted Whisper compare to OpenAI’s Whisper API?

Same model, same accuracy — the difference is cost and privacy. OpenAI’s API charges $0.006/minute. Self-hosted Whisper costs only electricity after the initial setup. Your audio never leaves your server. The Speaches wrapper provides an OpenAI-compatible API, so switching between self-hosted and cloud requires only changing the endpoint URL.

What audio formats does Whisper support?

Whisper supports mp3, wav, m4a, flac, ogg, and webm. Files are automatically resampled to 16 kHz mono internally. For best results, provide clean audio with minimal background noise. There is no hard file size limit, but very long files (2+ hours) should be split into chunks for reliable processing.
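For long recordings, ffmpeg can split the audio into fixed-length chunks before uploading (the file name and the 10-minute segment length are illustrative):

```shell
# Split into 10-minute (600 s) chunks without re-encoding
ffmpeg -i long_recording.mp3 -f segment -segment_time 600 -c copy chunk_%03d.mp3
```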

Can Whisper generate subtitles?

Yes. Use the response_format=srt or response_format=vtt parameter to get subtitle files with timestamps. The verbose_json format provides word-level timestamps for precise subtitle alignment. This makes self-hosted Whisper an excellent tool for generating subtitles for video content.
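For example, to save an SRT subtitle file directly (file names illustrative):

```shell
curl -X POST http://localhost:8000/v1/audio/transcriptions \
  -F "file=@video_audio.mp3" \
  -F "model=whisper-1" \
  -F "response_format=srt" \
  -o subtitles.srt
```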

Which Whisper model should I use?

For most use cases, faster-whisper-small offers the best speed-to-accuracy ratio. It runs on 2 GB VRAM and handles English transcription well. Use faster-whisper-medium for multilingual content or higher accuracy needs. Only use large-v3 when maximum accuracy is critical — it requires 5 GB VRAM and is significantly slower.

Can I use Whisper with Home Assistant or other self-hosted apps?

Yes. Any application that supports the OpenAI Whisper API format can connect to your self-hosted instance. Open WebUI supports it natively for voice input. Home Assistant can use it for voice commands via the Whisper add-on or by pointing to your API endpoint.
