# How to Self-Host vLLM with Docker Compose
## What Is vLLM?
vLLM is a high-throughput LLM inference engine designed for production serving. It introduced PagedAttention, which manages GPU memory the way an operating system manages virtual memory — dramatically improving throughput for concurrent requests. vLLM serves an OpenAI-compatible API, making it a drop-in backend for applications built against the OpenAI API.
## Prerequisites
- A Linux server (Ubuntu 22.04+ recommended)
- Docker and Docker Compose installed
- NVIDIA GPU with 16+ GB VRAM (required — no CPU mode)
- NVIDIA Container Toolkit installed
- 30 GB+ free disk space (for model downloads)
- HuggingFace account (for gated models like Llama)
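Before writing the Compose file, confirm Docker can actually see the GPU. A quick smoke test (the CUDA image tag here is illustrative; any CUDA base image works):

```bash
# Should print the same GPU table as running nvidia-smi directly on the host
docker run --rm --gpus all nvidia/cuda:12.4.1-base-ubuntu22.04 nvidia-smi
```

If this fails, fix the NVIDIA Container Toolkit installation before continuing.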
## Docker Compose Configuration
Create a `docker-compose.yml` file:
```yaml
services:
  vllm:
    image: vllm/vllm-openai:v0.17.1
    container_name: vllm
    ports:
      - "8000:8000"
    volumes:
      - huggingface_cache:/root/.cache/huggingface
    environment:
      # Required for gated models (Llama, Mistral, etc.); set in .env below
      - HUGGING_FACE_HUB_TOKEN=${HUGGING_FACE_HUB_TOKEN}
    command: >
      --model mistralai/Mistral-7B-Instruct-v0.3
      --max-model-len 4096
      --gpu-memory-utilization 0.9
      --host 0.0.0.0
      --port 8000
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
    ipc: host
    restart: unless-stopped

volumes:
  huggingface_cache:
```
Create a `.env` file:
```env
# Get your token from https://huggingface.co/settings/tokens
HUGGING_FACE_HUB_TOKEN=hf_your_token_here
```
Start the stack:
```bash
docker compose up -d
```
The first start downloads the model from HuggingFace (may take several minutes depending on model size and connection speed).
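You can follow the download and engine startup in the logs; the API only starts answering once the model is fully loaded:

```bash
docker logs -f vllm
```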
## Initial Setup
Verify the API is responding:
```bash
curl http://localhost:8000/v1/models
```
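The response should list the loaded model. It looks roughly like this (exact fields vary by vLLM version):

```json
{
  "object": "list",
  "data": [
    {
      "id": "mistralai/Mistral-7B-Instruct-v0.3",
      "object": "model",
      "owned_by": "vllm"
    }
  ]
}
```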
Test with a chat completion:
```bash
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "mistralai/Mistral-7B-Instruct-v0.3",
    "messages": [{"role": "user", "content": "What is self-hosting?"}],
    "max_tokens": 200
  }'
```
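The endpoint also supports OpenAI-style streaming. Add `"stream": true` and the server responds with server-sent events (`-N` stops curl from buffering the output):

```bash
curl -N http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "mistralai/Mistral-7B-Instruct-v0.3",
    "messages": [{"role": "user", "content": "What is self-hosting?"}],
    "max_tokens": 200,
    "stream": true
  }'
```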
## Configuration
### Key Command-Line Arguments
| Argument | Default | Description |
|---|---|---|
| `--model` | Required | HuggingFace model ID or local path |
| `--max-model-len` | Model's max | Maximum context window length |
| `--gpu-memory-utilization` | 0.9 | Fraction of GPU VRAM to use (0.0-1.0) |
| `--tensor-parallel-size` | 1 | Number of GPUs for tensor parallelism |
| `--dtype` | auto | Data type: `auto`, `float16`, `bfloat16` |
| `--quantization` | None | Quantization method: `awq`, `gptq`, `fp8` |
| `--max-num-seqs` | 256 | Max concurrent sequences |
| `--host` | 0.0.0.0 | Host to bind |
| `--port` | 8000 | Port to listen on |
| `--api-key` | None | API key for authentication |
### Popular Models
| Model | VRAM Required | Notes |
|---|---|---|
| `mistralai/Mistral-7B-Instruct-v0.3` | ~16 GB | Good balance of quality and speed |
| `meta-llama/Llama-3.1-8B-Instruct` | ~18 GB | Requires HF token (gated) |
| `Qwen/Qwen2.5-7B-Instruct` | ~16 GB | Strong multilingual support |
| `TheBloke/Mistral-7B-Instruct-v0.2-AWQ` | ~6 GB | AWQ quantized, less VRAM |
### Multi-GPU Setup
For models that don't fit on a single GPU:
```yaml
command: >
  --model meta-llama/Llama-3.1-70B-Instruct
  --tensor-parallel-size 2
  --max-model-len 4096
  --gpu-memory-utilization 0.9
```
Set `--tensor-parallel-size` to the number of GPUs available.
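After startup, verify the model is actually sharded; every GPU should report a similar memory footprint from the vLLM process:

```bash
# Each GPU index should show roughly equal memory usage once the model is loaded
nvidia-smi --query-gpu=index,memory.used,memory.total --format=csv
```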
## Advanced Configuration
### Quantized Models (Lower VRAM)
Run AWQ or GPTQ quantized models to reduce VRAM requirements:
```yaml
command: >
  --model TheBloke/Mistral-7B-Instruct-v0.2-AWQ
  --quantization awq
  --max-model-len 4096
```
### API Key Authentication
```yaml
command: >
  --model mistralai/Mistral-7B-Instruct-v0.3
  --api-key your-secret-api-key
```
Clients must include `Authorization: Bearer your-secret-api-key` in requests.
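For example, the earlier model-list check becomes:

```bash
curl http://localhost:8000/v1/models \
  -H "Authorization: Bearer your-secret-api-key"
```

Requests without the header are rejected.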
### Speculative Decoding
For faster generation with a draft model (note that speculative decoding flags have changed across vLLM releases; check the documentation for your pinned version):
```yaml
command: >
  --model meta-llama/Llama-3.1-8B-Instruct
  --speculative-model meta-llama/Llama-3.2-1B-Instruct
  --num-speculative-tokens 5
```
## Reverse Proxy
Configure your reverse proxy to forward to port 8000. Streaming responses are delivered as server-sent events, so disable response buffering on this route (e.g. `proxy_buffering off;` in nginx). See Reverse Proxy Setup.
## Backup
The HuggingFace cache volume stores downloaded models. Models can be re-downloaded from HuggingFace, so backups are optional but save time on re-deployment.
```bash
docker run --rm -v huggingface_cache:/data -v $(pwd):/backup alpine \
  tar czf /backup/vllm-models-backup.tar.gz /data
```
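To restore on another host, reverse the operation. Note that Compose usually prefixes named volumes with the project name (e.g. `myproject_huggingface_cache`); check `docker volume ls` and adjust both commands to match:

```bash
# The archive stores paths as data/..., so extracting at / refills the volume
docker run --rm -v huggingface_cache:/data -v $(pwd):/backup alpine \
  tar xzf /backup/vllm-models-backup.tar.gz -C /
```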
See Backup Strategy for a comprehensive approach.
## Troubleshooting
### CUDA Out of Memory
Symptom: `torch.cuda.OutOfMemoryError` on startup.
Fix: Reduce `--max-model-len` (try 2048). Lower `--gpu-memory-utilization` to 0.8. Use a quantized model (AWQ/GPTQ). Use `--tensor-parallel-size 2` if you have multiple GPUs.
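A reduced-footprint variant of the Compose command, combining the first two fixes:

```yaml
command: >
  --model mistralai/Mistral-7B-Instruct-v0.3
  --max-model-len 2048
  --gpu-memory-utilization 0.8
```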
### Model Download Fails
Symptom: 401 or 403 error when downloading gated models.
Fix: Set `HUGGING_FACE_HUB_TOKEN` to a valid HuggingFace token. Accept the model's license on the HuggingFace website first.
### Slow First Request
Symptom: First request takes 30+ seconds after startup.
Fix: This is normal. vLLM compiles CUDA kernels on first use, and subsequent requests are fast. Account for this warm-up period in your health checks, as in the sketch below.
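A Compose healthcheck that tolerates the warm-up might look like this sketch. It assumes vLLM's `/health` endpoint and uses Python (which the image ships, since it runs vLLM) rather than curl, which may not be present in the image:

```yaml
healthcheck:
  # /health returns 200 once the engine is ready; start_period covers model load
  test: ["CMD", "python3", "-c", "import urllib.request; urllib.request.urlopen('http://localhost:8000/health')"]
  interval: 30s
  timeout: 5s
  retries: 3
  start_period: 120s
```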
### Container Crashes Immediately
Symptom: Container exits with code 1.
Fix: Check `docker logs vllm`. Common causes: GPU not detected (install the NVIDIA Container Toolkit), model too large for available VRAM, missing `ipc: host` in the Compose file.
## Resource Requirements
- VRAM: 16-24 GB for 7B models (FP16), 6-8 GB for quantized
- RAM: 16 GB+ system RAM recommended
- CPU: Moderate (GPU does the heavy lifting)
- Disk: 10-50 GB per model
## Verdict
vLLM is the production-grade LLM serving engine. If you need to serve multiple concurrent users with consistent latency, vLLM’s PagedAttention and continuous batching make it 2-5x more efficient than sequential inference engines. The trade-off is a hard GPU requirement and more complex setup.
Choose vLLM if you’re building an application that serves LLM responses to multiple users. Choose Ollama if you want simpler setup for personal use or small teams.
## Frequently Asked Questions
### How does vLLM compare to Ollama?
vLLM is a production inference engine optimized for high-throughput concurrent serving with PagedAttention and continuous batching. Ollama is designed for ease of use — simple CLI, one-command model downloads, and lower setup complexity. Choose vLLM for multi-user production workloads; choose Ollama for personal use or small teams.
### Does vLLM require a GPU?
Yes. vLLM requires an NVIDIA GPU with CUDA support. There is no CPU inference mode. This is a hard requirement — vLLM’s entire value proposition is GPU-accelerated high-throughput serving. For CPU inference, use Ollama with llama.cpp or Text Generation WebUI.
### Is vLLM's API compatible with the OpenAI API?
Yes. vLLM provides an OpenAI-compatible API server out of the box. Applications that support custom OpenAI endpoints can connect directly. It supports the `/v1/completions`, `/v1/chat/completions`, and `/v1/embeddings` endpoints.
### What is PagedAttention and why does it matter?
PagedAttention is vLLM’s key innovation — it manages GPU memory for attention key-value caches the way operating systems manage virtual memory with paging. This eliminates memory waste from pre-allocation, allowing vLLM to serve 2-5x more concurrent requests than sequential inference engines with the same GPU.
### Can vLLM serve multiple models simultaneously?
A single vLLM instance serves one model at a time. To serve multiple models, run separate vLLM instances on different ports. With --tensor-parallel-size, you can split a single large model across multiple GPUs, but you can’t load different models on the same instance.
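A sketch of two instances in one Compose file, each pinned to its own GPU via `device_ids` (service names and models are illustrative; the volume, token, and other settings from the main example still apply):

```yaml
services:
  vllm-mistral:
    image: vllm/vllm-openai:v0.17.1
    ports:
      - "8000:8000"
    command: --model mistralai/Mistral-7B-Instruct-v0.3
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              device_ids: ["0"]
              capabilities: [gpu]
  vllm-qwen:
    image: vllm/vllm-openai:v0.17.1
    ports:
      - "8001:8000"
    command: --model Qwen/Qwen2.5-7B-Instruct
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              device_ids: ["1"]
              capabilities: [gpu]
```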
### What model formats does vLLM support?
vLLM supports HuggingFace Transformers format (safetensors), AWQ quantization, and GPTQ quantization. It does not support GGUF (llama.cpp format). For GGUF models, use Ollama or Text Generation WebUI instead.