How to Self-Host Whisper with Docker Compose

What Is Whisper?

Whisper is OpenAI’s open-source speech-to-text model. It transcribes audio in 99+ languages with high accuracy, including translation to English. Self-hosting Whisper means your audio never leaves your server — no API costs, no data sharing. Several community projects wrap Whisper in a REST API for easy integration.

Prerequisites

  • A Linux server (Ubuntu 22.04+ recommended)
  • Docker and Docker Compose installed
  • 4 GB+ RAM (CPU mode) or NVIDIA GPU with 4+ GB VRAM
  • 5 GB+ free disk space
  • NVIDIA Container Toolkit (for GPU mode)

Docker Compose Configuration

A well-maintained Docker-based Whisper deployment is Speaches (formerly Faster Whisper Server), which wraps the faster-whisper engine in an OpenAI-compatible REST API:

services:
  whisper:
    image: ghcr.io/speaches-ai/speaches:v0.8.3
    # Formerly fedirz/faster-whisper-server — project renamed to speaches
    container_name: whisper
    ports:
      - "8000:8000"
    volumes:
      - whisper_models:/root/.cache/huggingface
    environment:
      - WHISPER__MODEL=Systran/faster-whisper-large-v3
      # Smaller, faster models:
      # - WHISPER__MODEL=Systran/faster-whisper-medium
      # - WHISPER__MODEL=Systran/faster-whisper-small
      # - WHISPER__MODEL=Systran/faster-whisper-base
    deploy:  # GPU access; remove this whole deploy block for CPU-only mode
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
    restart: unless-stopped

volumes:
  whisper_models:

Start the stack:

docker compose up -d

The model downloads on first start (large-v3 is ~3 GB).

Initial Setup

Test transcription with a curl command:

curl -X POST http://localhost:8000/v1/audio/transcriptions \
  -F "[email protected]" \
  -F "model=whisper-1"

The response includes the transcribed text in JSON format, compatible with OpenAI’s API response format.
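If you are calling the API from code rather than curl, any OpenAI-compatible client works. As a minimal dependency-free sketch, the helpers below build the endpoint URL and pull the transcript out of the JSON body; the {"text": ...} shape mirrors OpenAI's documented response, but verify it against your server's actual output:

```python
import json

BASE_URL = "http://localhost:8000"  # adjust to your server's address


def transcription_url(base: str = BASE_URL) -> str:
    """Endpoint for OpenAI-compatible transcription requests.

    POST the audio as a multipart form field named "file",
    plus a "model" field, as in the curl example above.
    """
    return f"{base}/v1/audio/transcriptions"


def extract_text(response_body: str) -> str:
    """Pull the transcript out of the JSON response ({"text": ...})."""
    return json.loads(response_body)["text"]


# Parsing a response shaped like the one the curl call returns:
sample = '{"text": "Hello from Whisper."}'
print(extract_text(sample))  # Hello from Whisper.
```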

Configuration

Model Selection

Model                     Size      VRAM Required   Speed       Accuracy
faster-whisper-tiny       ~75 MB    ~1 GB           Very fast   Low
faster-whisper-base       ~140 MB   ~1 GB           Fast        Medium
faster-whisper-small      ~460 MB   ~2 GB           Moderate    Good
faster-whisper-medium     ~1.5 GB   ~3 GB           Slow        Better
faster-whisper-large-v3   ~3 GB     ~5 GB           Slowest     Best

For most use cases, faster-whisper-small or faster-whisper-medium offers the best speed/accuracy trade-off.
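To make the trade-off concrete, here is an illustrative helper (thresholds taken from the table above, rounded to whole gigabytes) that picks the most accurate model fitting a given amount of VRAM:

```python
# Approximate VRAM needs in GB, most accurate first (from the table above).
MODEL_VRAM_GB = [
    ("Systran/faster-whisper-large-v3", 5.0),
    ("Systran/faster-whisper-medium", 3.0),
    ("Systran/faster-whisper-small", 2.0),
    ("Systran/faster-whisper-base", 1.0),
    ("Systran/faster-whisper-tiny", 1.0),
]


def pick_model(vram_gb: float) -> str:
    """Return the most accurate model that fits in the given VRAM."""
    for model, needed in MODEL_VRAM_GB:
        if vram_gb >= needed:
            return model
    raise ValueError(f"{vram_gb} GB is below the ~1 GB minimum")


print(pick_model(4))  # Systran/faster-whisper-medium
print(pick_model(8))  # Systran/faster-whisper-large-v3
```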

API Endpoints

The API is OpenAI-compatible:

Endpoint                   Method   Description
/v1/audio/transcriptions   POST     Transcribe audio to text
/v1/audio/translations     POST     Translate audio to English

Translation

Translate any language to English:

curl -X POST http://localhost:8000/v1/audio/translations \
  -F "[email protected]" \
  -F "model=whisper-1"

Advanced Configuration

Timestamps

Get word-level timestamps:

curl -X POST http://localhost:8000/v1/audio/transcriptions \
  -F "[email protected]" \
  -F "model=whisper-1" \
  -F "response_format=verbose_json" \
  -F "timestamp_granularities[]=word"
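The verbose_json response carries a words array of {word, start, end} objects (the shape OpenAI documents for word granularity; confirm against your server's output). A small sketch that renders it as timestamped lines:

```python
def format_words(response: dict) -> list[str]:
    """Render word-level timestamps as 'start-end word' lines."""
    return [
        f"{w['start']:6.2f}-{w['end']:6.2f}  {w['word']}"
        for w in response.get("words", [])
    ]


# Sample response in the assumed verbose_json shape:
sample = {
    "text": "Hello world",
    "words": [
        {"word": "Hello", "start": 0.0, "end": 0.42},
        {"word": "world", "start": 0.42, "end": 0.88},
    ],
}
for line in format_words(sample):
    print(line)
```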

Integration with Open WebUI

Open WebUI supports Whisper for voice input. Set the speech-to-text (OpenAI) API URL in Open WebUI’s audio settings to http://whisper:8000/v1 (both containers must share a Docker network so the whisper hostname resolves) to enable voice-to-text in your ChatGPT alternative.
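The same settings can also be baked into Open WebUI's own compose service via environment variables (variable names as documented by Open WebUI; confirm against the version you run):

```yaml
services:
  open-webui:
    # ... your existing Open WebUI configuration ...
    environment:
      - AUDIO_STT_ENGINE=openai
      - AUDIO_STT_OPENAI_API_BASE_URL=http://whisper:8000/v1
      - AUDIO_STT_OPENAI_API_KEY=none  # Speaches does not require a key
      - AUDIO_STT_MODEL=Systran/faster-whisper-large-v3
```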

Reverse Proxy

Configure your reverse proxy to forward to port 8000. See Reverse Proxy Setup.

Backup

The models volume stores downloaded Whisper models. These can be re-downloaded, so backups are optional. See Backup Strategy.

Troubleshooting

Transcription Returns Empty

Symptom: API returns empty text. Fix: Check that the audio file is in a supported format (mp3, wav, m4a, flac, ogg, webm). Verify the file isn’t corrupted. Check container logs: docker logs whisper.
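A quick pre-flight check before uploading can rule out unsupported formats (extension-based only and purely illustrative; it will not catch a corrupted file with a valid extension):

```python
from pathlib import Path

# Formats the API accepts, per the list above.
SUPPORTED = {".mp3", ".wav", ".m4a", ".flac", ".ogg", ".webm"}


def is_supported_audio(path: str) -> bool:
    """True if the file extension is one Whisper accepts."""
    return Path(path).suffix.lower() in SUPPORTED


print(is_supported_audio("meeting.mp3"))  # True
print(is_supported_audio("slides.pdf"))   # False
```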

Out of Memory

Symptom: Container crashes with OOM error. Fix: Use a smaller model. faster-whisper-small works well on 4 GB VRAM. For CPU mode, ensure sufficient system RAM (model size + 2 GB overhead).
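The rule of thumb above (model size plus roughly 2 GB overhead) as a tiny calculator; the sizes come from the model table and the overhead figure is this guide's estimate, not a hard specification:

```python
# Approximate on-disk model sizes in GB, from the model table.
MODEL_SIZE_GB = {
    "tiny": 0.075,
    "base": 0.14,
    "small": 0.46,
    "medium": 1.5,
    "large-v3": 3.0,
}


def cpu_ram_estimate_gb(model: str, overhead_gb: float = 2.0) -> float:
    """Rough CPU-mode RAM estimate: model size + fixed overhead."""
    return MODEL_SIZE_GB[model] + overhead_gb


print(f"{cpu_ram_estimate_gb('small'):.2f} GB")  # 2.46 GB
```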

Slow Transcription

Symptom: Transcription takes minutes for short audio. Fix: Ensure GPU mode is active (check nvidia-smi). Use a smaller model. Use the CUDA image variant, not CPU.

Resource Requirements

  • VRAM: 1-5 GB depending on model size
  • RAM: 2-8 GB (CPU mode depends on model size)
  • CPU: Medium (CPU-only mode is 5-10x slower than GPU)
  • Disk: 100 MB - 3 GB per model

Verdict

Self-hosted Whisper gives you private, unlimited speech-to-text transcription with no API costs. The faster-whisper implementation is significantly faster than the original OpenAI Whisper while maintaining accuracy. For most self-hosters, the small or medium model provides the best speed/accuracy balance.

Choose Whisper for standalone speech-to-text. Choose LocalAI if you want Whisper combined with LLM inference and image generation in one service.