How to Self-Host Ollama with Docker Compose
What Is Ollama?
Ollama is a local LLM runtime that lets you run large language models like Llama 3, Mistral, Gemma, and dozens more on your own hardware. It handles model downloading, quantization, and GPU acceleration, and serves an OpenAI-compatible REST API. Think of it as the Docker of LLMs: pull a model, run it, and interact with it through an API or CLI.
Updated March 2026: Verified with latest Docker images and configurations.
Prerequisites
- A Linux server (Ubuntu 22.04+ recommended)
- Docker and Docker Compose installed (guide)
- 8 GB of RAM minimum (16 GB+ recommended for larger models)
- 20-50 GB of free disk space for models
- NVIDIA GPU with CUDA support (optional but strongly recommended for performance)
- If using GPU: NVIDIA drivers 531+ and NVIDIA Container Toolkit installed
Docker Compose Configuration
Create a docker-compose.yml file:
CPU-Only Setup
services:
  ollama:
    image: ollama/ollama:v0.18.2
    container_name: ollama
    ports:
      - "11434:11434"
    volumes:
      # Stores downloaded models and configuration
      - ollama_data:/root/.ollama
    environment:
      # How long to keep models loaded in memory (default 5m, use -1 for always)
      - OLLAMA_KEEP_ALIVE=5m
      # Maximum parallel requests per model
      - OLLAMA_NUM_PARALLEL=1
      # Allowed CORS origins (needed by browser-based web UIs like Open WebUI)
      - OLLAMA_ORIGINS=*
    restart: unless-stopped

volumes:
  ollama_data:
With NVIDIA GPU
services:
  ollama:
    image: ollama/ollama:v0.18.2
    container_name: ollama
    ports:
      - "11434:11434"
    volumes:
      - ollama_data:/root/.ollama
    environment:
      - OLLAMA_KEEP_ALIVE=5m
      # Increase parallel requests when GPU is available
      - OLLAMA_NUM_PARALLEL=4
      # Limit to 1 model loaded at a time (saves VRAM)
      - OLLAMA_MAX_LOADED_MODELS=1
      # Enable flash attention for better performance
      - OLLAMA_FLASH_ATTENTION=1
      - OLLAMA_ORIGINS=*
    restart: unless-stopped
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]

volumes:
  ollama_data:
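To sanity-check that the container actually sees the GPU: if the NVIDIA runtime is wired up correctly, nvidia-smi should be available inside the container as well, and the startup logs report whether a GPU was detected.
# nvidia-smi is injected into GPU containers by the NVIDIA Container Toolkit
docker exec ollama nvidia-smi
# Startup logs mention detected GPUs (or a fallback to CPU)
docker logs ollama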
With AMD GPU (ROCm)
services:
  ollama:
    image: ollama/ollama:rocm
    container_name: ollama
    ports:
      - "11434:11434"
    volumes:
      - ollama_data:/root/.ollama
    devices:
      - /dev/kfd
      - /dev/dri
    environment:
      - OLLAMA_KEEP_ALIVE=5m
      - OLLAMA_NUM_PARALLEL=4
      - OLLAMA_ORIGINS=*
    restart: unless-stopped

volumes:
  ollama_data:
Start the stack:
docker compose up -d
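Once the container is running, a quick way to confirm the API is up is to hit the root endpoint, which returns a short status string:
curl http://localhost:11434
# Expected response: "Ollama is running"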
Initial Setup
Ollama starts with no models. Pull your first model:
# Pull Llama 3.1 8B (4.7 GB)
docker exec ollama ollama pull llama3.1
# Pull a smaller model for testing (2 GB)
docker exec ollama ollama pull phi3:mini
Test it works:
# Interactive chat
docker exec -it ollama ollama run llama3.1
# API request
curl http://localhost:11434/api/generate -d '{
  "model": "llama3.1",
  "prompt": "What is self-hosting?",
  "stream": false
}'
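Besides /api/generate, there is a chat-style endpoint that takes a list of messages, which is handier for multi-turn conversations. A minimal example:
curl http://localhost:11434/api/chat -d '{
  "model": "llama3.1",
  "messages": [
    {"role": "user", "content": "Explain Docker volumes in one sentence."}
  ],
  "stream": false
}'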
List downloaded models:
docker exec ollama ollama list
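To see which models are currently loaded into memory, and whether they are running on CPU or GPU:
docker exec ollama ollama ps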
Configuration
Key Environment Variables
| Variable | Default | Purpose |
|---|---|---|
| OLLAMA_KEEP_ALIVE | 5m | Time to keep models in memory. Use -1 for always, 0 to unload immediately |
| OLLAMA_NUM_PARALLEL | 1 | Max concurrent requests per model. Increase with GPU |
| OLLAMA_MAX_LOADED_MODELS | 0 (unlimited) | Max models loaded simultaneously. Set to 1 on limited VRAM |
| OLLAMA_MAX_QUEUE | 512 | Maximum queued requests |
| OLLAMA_ORIGINS | localhost only | Allowed CORS origins. Set to * so browser-based clients and web UIs on other hosts can call the API |
| OLLAMA_FLASH_ATTENTION | disabled | Enable flash attention (reduces VRAM, improves speed) |
| OLLAMA_CONTEXT_LENGTH | auto | Override default context window (e.g., 8192) |
| OLLAMA_HOST | 0.0.0.0:11434 | Bind address (already set correctly in Docker image) |
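OLLAMA_KEEP_ALIVE only sets the default; individual API requests can override it with a keep_alive field. As a sketch, sending a request with no prompt preloads a model and keeps it resident:
# Preload llama3.1 and keep it in memory until explicitly unloaded
curl http://localhost:11434/api/generate -d '{"model": "llama3.1", "keep_alive": -1}'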
Model Management
# Pull a model
docker exec ollama ollama pull mistral
# Remove a model to free disk space
docker exec ollama ollama rm mistral
# Show model details
docker exec ollama ollama show llama3.1
# Copy/rename a model
docker exec ollama ollama cp llama3.1 my-custom-model
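The same management operations are exposed over the REST API if you prefer not to exec into the container, for example:
# List installed models
curl http://localhost:11434/api/tags
# Pull a model without the CLI
curl http://localhost:11434/api/pull -d '{"name": "mistral"}'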
Popular Models
| Model | Size | Use Case |
|---|---|---|
| llama3.1 | 4.7 GB | General purpose, good balance |
| llama3.1:70b | 40 GB | High quality, needs 48 GB+ VRAM |
| mistral | 4.1 GB | Fast, good for coding |
| codellama | 3.8 GB | Code generation and completion |
| phi3:mini | 2.0 GB | Lightweight, good for low-resource servers |
| gemma2 | 5.4 GB | Google’s model, strong reasoning |
| deepseek-coder-v2 | 8.9 GB | Code-focused, excellent quality |
Creating Custom Models
Create a Modelfile to customize model behavior:
FROM llama3.1
PARAMETER temperature 0.7
PARAMETER num_ctx 8192
SYSTEM "You are a helpful assistant specialized in Linux system administration."
docker exec ollama ollama create sysadmin-helper -f /path/to/Modelfile
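Note that the path passed to -f must exist inside the container, not on the host. One simple approach is to copy the Modelfile in first and then run the new model (the file names and paths here are just examples):
docker cp ./Modelfile ollama:/tmp/Modelfile
docker exec ollama ollama create sysadmin-helper -f /tmp/Modelfile
docker exec -it ollama ollama run sysadmin-helper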
Advanced Configuration (Optional)
GPU Selection
If you have multiple NVIDIA GPUs, select specific ones:
environment:
  - CUDA_VISIBLE_DEVICES=0,1 # Use GPU 0 and 1 only
Spread Model Across Multiple GPUs
environment:
  - OLLAMA_SCHED_SPREAD=1 # Distribute model layers across all GPUs
KV Cache Quantization (Save VRAM)
environment:
  - OLLAMA_KV_CACHE_TYPE=q8_0 # Options: f16 (default), q8_0, q4_0
Using q8_0 reduces VRAM usage with minimal quality loss. q4_0 saves more VRAM but may impact output quality.
Connecting to Open WebUI
Ollama’s API is designed to work with web interfaces. The most popular is Open WebUI:
services:
  ollama:
    image: ollama/ollama:v0.18.2
    container_name: ollama
    volumes:
      - ollama_data:/root/.ollama
    restart: unless-stopped
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]

  open-webui:
    image: ghcr.io/open-webui/open-webui:v0.8.10
    container_name: open-webui
    ports:
      - "3000:8080"
    volumes:
      - open-webui_data:/app/backend/data
    environment:
      - OLLAMA_BASE_URL=http://ollama:11434
      - WEBUI_SECRET_KEY=change-this-to-a-random-string
    depends_on:
      - ollama
    restart: unless-stopped

volumes:
  ollama_data:
  open-webui_data:
See the full Open WebUI guide for detailed setup.
Reverse Proxy
Behind Nginx Proxy Manager or another reverse proxy, forward to port 11434 (API) or port 3000 if using Open WebUI.
For the Ollama API specifically, increase proxy timeouts, since model inference can take 30+ seconds for large prompts; if you also proxy Open WebUI, make sure WebSocket passthrough is enabled. When streaming responses through the proxy, disabling proxy buffering (proxy_buffering off;) lets tokens reach the client as they are generated.
Nginx config snippet:
location /ollama/ {
    proxy_pass http://localhost:11434/;
    proxy_set_header Host $host;
    proxy_set_header X-Real-IP $remote_addr;
    proxy_read_timeout 300s;
    proxy_send_timeout 300s;
}
See Reverse Proxy Setup for full configuration.
Backup
Back up the models volume:
docker compose stop ollama
docker run --rm -v ollama_data:/data -v $(pwd):/backup alpine \
  tar czf /backup/ollama-backup.tar.gz /data
docker compose start ollama
Compose prefixes named volumes with the project name, so the volume may actually be called something like ollama_ollama_data; run docker volume ls and adjust the -v flag accordingly. The /root/.ollama volume contains downloaded models and configuration. Models can be re-downloaded, so backing up is optional if bandwidth isn’t a concern.
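Restoring is the same pattern in reverse: recreate the volume (again using the actual volume name from docker volume ls) and unpack the archive into it, for example:
docker volume create ollama_data
docker run --rm -v ollama_data:/data -v $(pwd):/backup alpine \
  tar xzf /backup/ollama-backup.tar.gz -C /
docker compose up -d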
See Backup Strategy for a comprehensive approach.
Troubleshooting
GPU not detected
Symptom: Ollama falls back to CPU despite having an NVIDIA GPU.
Fix: Verify the NVIDIA Container Toolkit is installed and configured:
nvidia-smi # Should show your GPU
docker run --rm --gpus all ubuntu nvidia-smi # Should also work inside a container
If the second command fails, reconfigure the toolkit:
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker
Model download fails
Symptom: ollama pull hangs or fails.
Fix: Check disk space (df -h) — models are large. Verify network connectivity from inside the container:
docker exec ollama curl -I https://registry.ollama.ai
Out of memory (OOM)
Symptom: Container is killed during inference.
Fix: Use a smaller model or a quantized variant. Add swap space:
sudo fallocate -l 8G /swapfile
sudo chmod 600 /swapfile
sudo mkswap /swapfile
sudo swapon /swapfile
Or limit the model’s context length:
environment:
  - OLLAMA_CONTEXT_LENGTH=4096
Slow inference on CPU
Symptom: Responses take minutes.
Fix: CPU inference is inherently slow for large models. Use smaller models (phi3:mini, tinyllama), reduce context length, or add a GPU. Even a used NVIDIA GTX 1070 with 8 GB VRAM dramatically improves performance.
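To put a number on it for your hardware, the CLI can print timing statistics after each response; the eval rate line is tokens per second:
docker exec -it ollama ollama run phi3:mini --verbose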
Cannot connect from another container
Symptom: Other containers get “connection refused” when hitting Ollama’s API.
Fix: "Connection refused" is a networking problem, not a CORS problem. Ensure both containers are on the same Docker network and use http://ollama:11434 (the service name) as the URL, not localhost. OLLAMA_ORIGINS=* only matters for browser-based clients that send an Origin header.
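One way to test this without touching the client container is to start a throwaway container on the same Compose network and hit the service name. The network name below (ollama_default) just follows Compose's usual <project>_default pattern and may differ on your system; check docker network ls:
docker network ls
docker run --rm --network ollama_default curlimages/curl -s http://ollama:11434
# Expected response: "Ollama is running"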
Resource Requirements
- RAM (CPU mode): Model size + 2-4 GB overhead. A 7B model needs ~8 GB total.
- RAM (GPU mode): System RAM for overhead (~4 GB), VRAM for the model.
- VRAM: Roughly matches the quantized model file size. 8 GB VRAM handles 7B models; 24 GB handles 13B-34B class models; 70B quantized models need ~40-48 GB or will partially offload to CPU.
- CPU: More cores = faster CPU inference. Ollama uses all available cores.
- Disk: 20-100+ GB depending on model collection.
Verdict
Ollama is the best way to run LLMs locally. The Docker setup is dead simple, GPU passthrough works reliably, and the OpenAI-compatible API means it integrates with virtually every AI tool. Pair it with Open WebUI for a ChatGPT-like interface, or use the API directly from your applications.
For a complete self-hosted AI stack, Ollama is the runtime and Open WebUI is the interface. If you need an OpenAI API drop-in replacement with support for multiple model formats, LocalAI is an alternative — but Ollama is simpler and faster for most use cases.
Frequently Asked Questions
How much RAM do I need for Ollama?
It depends on the model size. The rule of thumb: you need roughly 1 GB of RAM per billion parameters when running quantized models (Q4). Llama 3.1 8B needs ~6 GB, Mistral 7B needs ~5 GB, and Llama 3.1 70B needs ~40 GB. These are for the quantized (Q4_K_M) versions — full-precision models need 2x more. If using a GPU, VRAM is what matters. If running on CPU only, system RAM is used. Always leave headroom for the OS and other applications.
Can I run Ollama without a GPU?
Yes. Ollama runs on CPU-only systems. Inference is slower — expect 5-15 tokens/second for a 7B model on a modern 8-core CPU vs 40-80 tokens/second on an NVIDIA RTX 3060. For small models (3B-7B), CPU-only performance is usable for personal chat. For larger models (13B+) or production API serving, a GPU is strongly recommended. CPUs with AVX2 or AVX-512 support perform noticeably better than older CPUs without these instruction sets.
Which GPU should I use with Ollama?
NVIDIA GPUs are the best supported. The RTX 3060 12GB is the community’s top value pick — 12 GB VRAM runs all 7B models and most 13B quantized models for around $250 used. The RTX 3090 (24 GB VRAM) comfortably runs models up to the ~30B class; 70B quantized models need roughly 40-48 GB, so they require two such cards or will partially offload to system RAM. AMD GPU support exists through ROCm but is less stable. Apple Silicon Macs with Metal are well-supported for the native (non-Docker) Ollama install but not inside Docker containers on macOS.
How do I use Ollama with Open WebUI?
Open WebUI provides a ChatGPT-like web interface for Ollama. Run it as a separate container and point it at your Ollama API:
# On Linux, host.docker.internal requires the --add-host mapping below
docker run -d --name open-webui \
  -p 3000:8080 \
  -e OLLAMA_BASE_URL=http://host.docker.internal:11434 \
  --add-host=host.docker.internal:host-gateway \
  --restart unless-stopped \
  ghcr.io/open-webui/open-webui:v0.8.10
Open WebUI supports model switching, conversation history, RAG (document upload), and multi-user accounts. It is the most popular frontend for self-hosted Ollama.
Can I use Ollama as an OpenAI API replacement?
Yes. Ollama exposes an OpenAI-compatible API at /v1/chat/completions. Any application that supports custom OpenAI API endpoints can use Ollama as a drop-in replacement — set the base URL to http://your-server:11434/v1 and use the model name as the model parameter. This works with LangChain, LlamaIndex, Continue (VS Code extension), and many other tools. No API key is needed for local access.
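A minimal request against the OpenAI-compatible endpoint looks like this (no real API key is required for local access):
curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama3.1",
    "messages": [{"role": "user", "content": "Say hello in one sentence."}]
  }'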
How do I download and manage models?
Use the ollama CLI inside the container:
docker exec -it ollama ollama pull llama3.1:8b # Download a model
docker exec -it ollama ollama list # List installed models
docker exec -it ollama ollama rm mistral:7b # Delete a model
docker exec -it ollama ollama show llama3.1:8b # Show model details
Models are stored in the /root/.ollama volume. A 7B model is typically 4-5 GB on disk (quantized). You can also pull models via the API: curl http://localhost:11434/api/pull -d '{"name": "llama3.1:8b"}'.
Can multiple users query Ollama simultaneously?
Yes, but requests are processed sequentially by default for a single model. If two users send requests at the same time, one waits for the other to finish. You can set OLLAMA_NUM_PARALLEL to allow concurrent requests to the same model (requires enough VRAM to hold multiple inference contexts). For true multi-user serving with load balancing and queueing, consider vLLM which is designed for production serving.