How to Self-Host Whisper with Docker Compose
What Is Whisper?
Whisper is OpenAI’s open-source speech-to-text model. It transcribes audio in 99+ languages with high accuracy and can also translate speech into English. Self-hosting Whisper means your audio never leaves your server: no API costs, no data sharing. Several community projects wrap Whisper in a REST API for easy integration.
Prerequisites
- A Linux server (Ubuntu 22.04+ recommended)
- Docker and Docker Compose installed
- 4 GB+ RAM (CPU mode) or NVIDIA GPU with 4+ GB VRAM
- 5 GB+ free disk space
- NVIDIA Container Toolkit (for GPU mode)
Docker Compose Configuration
The best Docker-based Whisper deployment is Speaches (formerly Faster Whisper Server), which provides an OpenAI-compatible API:
```yaml
services:
  whisper:
    image: ghcr.io/speaches-ai/speaches:v0.8.3
    # Formerly fedirz/faster-whisper-server (project renamed to Speaches)
    container_name: whisper
    ports:
      - "8000:8000"
    volumes:
      - whisper_models:/root/.cache/huggingface
    environment:
      - WHISPER__MODEL=Systran/faster-whisper-large-v3
      # Smaller, faster models:
      # - WHISPER__MODEL=Systran/faster-whisper-medium
      # - WHISPER__MODEL=Systran/faster-whisper-small
      # - WHISPER__MODEL=Systran/faster-whisper-base
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
    restart: unless-stopped

volumes:
  whisper_models:
```
Start the stack:
```sh
docker compose up -d
```
The model downloads on first start (large-v3 is ~3 GB).
Initial Setup
Test transcription with a curl command:
```sh
curl -X POST http://localhost:8000/v1/audio/transcriptions \
  -F "file=@audio.mp3" \
  -F "model=whisper-1"
```
The response includes the transcribed text in JSON format, compatible with OpenAI’s API response format.
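For scripting, the same request can be made from Python with only the standard library. This is a minimal sketch: the endpoint URL and `whisper-1` model name mirror the curl example above, while `build_multipart` is a hypothetical helper (urllib has no built-in multipart support):

```python
import io
import json
import urllib.request
import uuid

API_URL = "http://localhost:8000/v1/audio/transcriptions"  # local Speaches endpoint from the Compose file


def build_multipart(filename: str, audio: bytes, model: str = "whisper-1") -> tuple[str, bytes]:
    """Build a multipart/form-data body equivalent to curl's -F flags."""
    boundary = uuid.uuid4().hex
    buf = io.BytesIO()
    # "model" form field
    buf.write(
        f'--{boundary}\r\nContent-Disposition: form-data; name="model"\r\n\r\n{model}\r\n'.encode()
    )
    # "file" form field carrying the raw audio bytes
    buf.write(
        f'--{boundary}\r\nContent-Disposition: form-data; name="file"; '
        f'filename="{filename}"\r\nContent-Type: application/octet-stream\r\n\r\n'.encode()
    )
    buf.write(audio)
    buf.write(f"\r\n--{boundary}--\r\n".encode())
    return boundary, buf.getvalue()


def transcribe(path: str) -> str:
    """POST an audio file and return the 'text' field of the JSON response."""
    with open(path, "rb") as f:
        boundary, body = build_multipart(path, f.read())
    req = urllib.request.Request(
        API_URL,
        data=body,
        headers={"Content-Type": f"multipart/form-data; boundary={boundary}"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["text"]
```

Because the API is OpenAI-compatible, the official openai Python client should also work if you point its base_url at http://localhost:8000/v1.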
Configuration
Model Selection
| Model | Size | VRAM Required | Speed | Accuracy |
|---|---|---|---|---|
| faster-whisper-tiny | ~75 MB | ~1 GB | Very fast | Low |
| faster-whisper-base | ~140 MB | ~1 GB | Fast | Medium |
| faster-whisper-small | ~460 MB | ~2 GB | Moderate | Good |
| faster-whisper-medium | ~1.5 GB | ~3 GB | Slow | Better |
| faster-whisper-large-v3 | ~3 GB | ~5 GB | Slowest | Best |
For most use cases, faster-whisper-small or faster-whisper-medium offers the best speed/accuracy trade-off.
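If you script deployments across machines with different GPUs, the table can be encoded directly. `pick_model` below is a hypothetical helper (not part of Speaches) that selects the most accurate model fitting the available VRAM:

```python
# Hypothetical helper mirroring the table above; VRAM figures are approximate.
MODELS = [  # (model ID, approximate VRAM needed in GB), smallest first
    ("Systran/faster-whisper-tiny", 1),
    ("Systran/faster-whisper-base", 1),
    ("Systran/faster-whisper-small", 2),
    ("Systran/faster-whisper-medium", 3),
    ("Systran/faster-whisper-large-v3", 5),
]


def pick_model(vram_gb: float) -> str:
    """Return the most accurate model whose VRAM requirement fits."""
    chosen = MODELS[0][0]  # fall back to tiny
    for name, needed in MODELS:
        if needed <= vram_gb:
            chosen = name
    return chosen
```

The chosen ID drops straight into the WHISPER__MODEL environment variable in the Compose file.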
API Endpoints
The API is OpenAI-compatible:
| Endpoint | Method | Description |
|---|---|---|
| /v1/audio/transcriptions | POST | Transcribe audio to text |
| /v1/audio/translations | POST | Translate audio to English |
Translation
Translate any language to English:
```sh
curl -X POST http://localhost:8000/v1/audio/translations \
  -F "file=@audio.mp3" \
  -F "model=whisper-1"
```
Advanced Configuration
Timestamps
Get word-level timestamps:
```sh
curl -X POST http://localhost:8000/v1/audio/transcriptions \
  -F "file=@audio.mp3" \
  -F "model=whisper-1" \
  -F "response_format=verbose_json" \
  -F "timestamp_granularities[]=word"
```
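With word granularity, the verbose_json response carries a words array of {word, start, end} entries. A sketch of turning that into SRT captions; the helper names and the seven-word grouping are my own choices, not part of any API:

```python
def srt_timestamp(seconds: float) -> str:
    """Format seconds as an SRT timestamp (HH:MM:SS,mmm)."""
    ms = round(seconds * 1000)
    h, rem = divmod(ms, 3_600_000)
    m, rem = divmod(rem, 60_000)
    s, ms = divmod(rem, 1000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"


def words_to_srt(words: list[dict], max_words: int = 7) -> str:
    """Group word entries ({'word', 'start', 'end'}) into numbered SRT caption blocks."""
    blocks = []
    for i in range(0, len(words), max_words):
        chunk = words[i:i + max_words]
        text = " ".join(w["word"] for w in chunk)
        blocks.append(
            f"{len(blocks) + 1}\n"
            f"{srt_timestamp(chunk[0]['start'])} --> {srt_timestamp(chunk[-1]['end'])}\n"
            f"{text}"
        )
    return "\n\n".join(blocks)
```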
Integration with Open WebUI
Open WebUI supports Whisper for voice input. Set the Whisper API URL in Open WebUI’s settings to http://whisper:8000/v1 to enable voice-to-text in your ChatGPT alternative.
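At the time of writing, Open WebUI can also be configured through environment variables in its own Compose service; the variable names below may change between releases, so treat them as assumptions to verify against the Open WebUI documentation:

```yaml
services:
  open-webui:
    environment:
      - AUDIO_STT_ENGINE=openai
      - AUDIO_STT_OPENAI_API_BASE_URL=http://whisper:8000/v1
      - AUDIO_STT_OPENAI_API_KEY=none   # Speaches does not require a key by default
      - AUDIO_STT_MODEL=Systran/faster-whisper-large-v3
```

The whisper hostname resolves only if both containers share a Docker network.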
Reverse Proxy
Configure your reverse proxy to forward to port 8000. See Reverse Proxy Setup.
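For nginx, a minimal sketch; the hostname is a placeholder, and the two raised limits matter because audio uploads are large and transcription of long files can exceed nginx's default 60-second timeout:

```nginx
server {
    listen 443 ssl;
    server_name whisper.example.com;   # hypothetical hostname
    # ssl_certificate / ssl_certificate_key omitted

    client_max_body_size 512M;         # allow large audio uploads

    location / {
        proxy_pass http://127.0.0.1:8000;
        proxy_read_timeout 600s;       # long files take a while to transcribe
        proxy_set_header Host $host;
    }
}
```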
Backup
The models volume stores downloaded Whisper models. These can be re-downloaded, so backups are optional. See Backup Strategy.
Troubleshooting
Transcription Returns Empty
Symptom: API returns empty text.
Fix: Check that the audio file is in a supported format (mp3, wav, m4a, flac, ogg, webm). Verify the file isn’t corrupted. Check container logs: docker logs whisper.
Out of Memory
Symptom: Container crashes with OOM error.
Fix: Use a smaller model. faster-whisper-small works well on 4 GB VRAM. For CPU mode, ensure sufficient system RAM (model size + 2 GB overhead).
Slow Transcription
Symptom: Transcription takes minutes for short audio.
Fix: Ensure GPU mode is active (check nvidia-smi). Use a smaller model. Use the CUDA image variant, not CPU.
Resource Requirements
- VRAM: 1-5 GB depending on model size
- RAM: 2-8 GB (CPU mode depends on model size)
- CPU: Medium (CPU-only mode is 5-10x slower than GPU)
- Disk: 100 MB - 3 GB per model
Verdict
Self-hosted Whisper gives you private, unlimited speech-to-text transcription with no API costs. The faster-whisper implementation is significantly faster than the original OpenAI Whisper while maintaining accuracy. For most self-hosters, the small or medium model provides the best speed/accuracy balance.
Choose Whisper for standalone speech-to-text. Choose LocalAI if you want Whisper combined with LLM inference and image generation in one service.
Frequently Asked Questions
Does Whisper require a GPU?
No, but it is strongly recommended. Whisper runs on CPU but is 5-10x slower. A 10-minute audio file that takes 30 seconds on an NVIDIA GPU may take 5+ minutes on CPU. For occasional transcription, CPU works. For regular or batch use, an NVIDIA GPU with 4+ GB VRAM is essential. Remove the deploy.resources section from the Compose file for CPU-only mode.
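For reference, a CPU-only variant of the Compose file might look like this; the -cpu image tag is an assumption, so check the tags published on the project's container registry:

```yaml
services:
  whisper:
    image: ghcr.io/speaches-ai/speaches:v0.8.3-cpu   # assumed CPU tag; verify against released tags
    container_name: whisper
    ports:
      - "8000:8000"
    volumes:
      - whisper_models:/root/.cache/huggingface
    environment:
      - WHISPER__MODEL=Systran/faster-whisper-small   # smaller model suits CPU inference
    restart: unless-stopped

volumes:
  whisper_models:
```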
How does self-hosted Whisper compare to OpenAI’s Whisper API?
Same model, same accuracy — the difference is cost and privacy. OpenAI’s API charges $0.006/minute. Self-hosted Whisper costs only electricity after the initial setup. Your audio never leaves your server. The Speaches wrapper provides an OpenAI-compatible API, so switching between self-hosted and cloud requires only changing the endpoint URL.
What audio formats does Whisper support?
Whisper supports mp3, wav, m4a, flac, ogg, and webm. Files are automatically resampled to 16kHz mono internally. For best results, provide clean audio with minimal background noise. There is no hard file size limit, but very long files (2+ hours) should be split into chunks for reliable processing.
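A sketch of computing chunk boundaries for splitting long files, with a small overlap so words on a boundary are not cut in half; the 30-minute default and 5-second overlap are arbitrary choices, not Whisper requirements:

```python
def chunk_spans(duration_s: float, chunk_s: float = 1800.0, overlap_s: float = 5.0) -> list[tuple[float, float]]:
    """Return (start, end) offsets in seconds covering the whole file,
    where consecutive chunks overlap by overlap_s seconds."""
    spans: list[tuple[float, float]] = []
    start = 0.0
    while start < duration_s:
        end = min(start + chunk_s, duration_s)
        spans.append((start, end))
        # back up by the overlap unless this was the final chunk
        start = end if end == duration_s else end - overlap_s
    return spans
```

Each (start, end) pair can then be cut with ffmpeg, e.g. ffmpeg -i input.mp3 -ss START -to END -c copy chunk.mp3, and the chunks posted to the API one at a time.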
Can Whisper generate subtitles?
Yes. Use the response_format=srt or response_format=vtt parameter to get subtitle files with timestamps. The verbose_json format provides word-level timestamps for precise subtitle alignment. This makes self-hosted Whisper an excellent tool for generating subtitles for video content.
Which Whisper model should I use?
For most use cases, faster-whisper-small offers the best speed-to-accuracy ratio. It runs on 2 GB VRAM and handles English transcription well. Use faster-whisper-medium for multilingual content or higher accuracy needs. Only use large-v3 when maximum accuracy is critical — it requires 5 GB VRAM and is significantly slower.
Can I use Whisper with Home Assistant or other self-hosted apps?
Yes. Any application that supports the OpenAI Whisper API format can connect to your self-hosted instance. Open WebUI supports it natively for voice input. Home Assistant can use it for voice commands via the Whisper add-on or by pointing to your API endpoint.