How to Self-Host Ollama with Docker Compose
What Is Ollama?
Ollama is a local LLM runtime that lets you run large language models like Llama 3, Mistral, Gemma, and dozens more on your own hardware. It handles model downloading, quantization, and GPU acceleration, and serves an OpenAI-compatible REST API. Think of it as the Docker of LLMs: pull a model, run it, and interact with it through an API or CLI.
Updated March 2026: Verified with latest Docker images and configurations.
Prerequisites
- A Linux server (Ubuntu 22.04+ recommended)
- Docker and Docker Compose installed (guide)
- 8 GB of RAM minimum (16 GB+ recommended for larger models)
- 20-50 GB of free disk space for models
- NVIDIA GPU with CUDA support (optional but strongly recommended for performance)
- If using GPU: NVIDIA drivers 531+ and NVIDIA Container Toolkit installed
Docker Compose Configuration
Create a docker-compose.yml file:
CPU-Only Setup
services:
  ollama:
    image: ollama/ollama:v0.18.2
    container_name: ollama
    ports:
      - "11434:11434"
    volumes:
      # Stores downloaded models and configuration
      - ollama_data:/root/.ollama
    environment:
      # How long to keep models loaded in memory (default 5m, use -1 for always)
      - OLLAMA_KEEP_ALIVE=5m
      # Maximum parallel requests per model
      - OLLAMA_NUM_PARALLEL=1
      # Allowed CORS origins (needed by browser-based web UIs like Open WebUI)
      - OLLAMA_ORIGINS=*
    restart: unless-stopped

volumes:
  ollama_data:
With NVIDIA GPU
services:
  ollama:
    image: ollama/ollama:v0.18.2
    container_name: ollama
    ports:
      - "11434:11434"
    volumes:
      - ollama_data:/root/.ollama
    environment:
      - OLLAMA_KEEP_ALIVE=5m
      # Increase parallel requests when GPU is available
      - OLLAMA_NUM_PARALLEL=4
      # Limit to 1 model loaded at a time (saves VRAM)
      - OLLAMA_MAX_LOADED_MODELS=1
      # Enable flash attention for better performance
      - OLLAMA_FLASH_ATTENTION=1
      - OLLAMA_ORIGINS=*
    restart: unless-stopped
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]

volumes:
  ollama_data:
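To sanity-check that the container actually sees the GPU: if the NVIDIA runtime is wired up correctly, nvidia-smi should be available inside the container as well, and the startup logs report whether a GPU was detected.
# nvidia-smi is injected into GPU containers by the NVIDIA Container Toolkit
docker exec ollama nvidia-smi
# Startup logs mention detected GPUs (or a fallback to CPU)
docker logs ollama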
With AMD GPU (ROCm)
services:
  ollama:
    image: ollama/ollama:rocm
    container_name: ollama
    ports:
      - "11434:11434"
    volumes:
      - ollama_data:/root/.ollama
    devices:
      - /dev/kfd
      - /dev/dri
    environment:
      - OLLAMA_KEEP_ALIVE=5m
      - OLLAMA_NUM_PARALLEL=4
      - OLLAMA_ORIGINS=*
    restart: unless-stopped

volumes:
  ollama_data:
Start the stack:
docker compose up -d
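Once the container is running, a quick way to confirm the API is up is to hit the root endpoint, which returns a short status string:
curl http://localhost:11434
# Expected response: "Ollama is running"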
Initial Setup
Ollama starts with no models. Pull your first model:
# Pull Llama 3.1 8B (4.7 GB)
docker exec ollama ollama pull llama3.1
# Pull a smaller model for testing (2 GB)
docker exec ollama ollama pull phi3:mini
Test it works:
# Interactive chat
docker exec -it ollama ollama run llama3.1
# API request
curl http://localhost:11434/api/generate -d '{
  "model": "llama3.1",
  "prompt": "What is self-hosting?",
  "stream": false
}'
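Besides /api/generate, there is a chat-style endpoint that takes a list of messages, which is handier for multi-turn conversations. A minimal example:
curl http://localhost:11434/api/chat -d '{
  "model": "llama3.1",
  "messages": [
    {"role": "user", "content": "Explain Docker volumes in one sentence."}
  ],
  "stream": false
}'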
List downloaded models:
docker exec ollama ollama list
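To see which models are currently loaded into memory, and whether they are running on CPU or GPU:
docker exec ollama ollama ps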
Configuration
Key Environment Variables
| Variable | Default | Purpose |
|---|---|---|
| OLLAMA_KEEP_ALIVE | 5m | Time to keep models in memory. Use -1 for always, 0 to unload immediately |
| OLLAMA_NUM_PARALLEL | 1 | Max concurrent requests per model. Increase with GPU |
| OLLAMA_MAX_LOADED_MODELS | 0 (unlimited) | Max models loaded simultaneously. Set to 1 on limited VRAM |
| OLLAMA_MAX_QUEUE | 512 | Maximum queued requests |
| OLLAMA_ORIGINS | localhost only | Allowed CORS origins. Set to * so browser-based clients and web UIs on other hosts can call the API |
| OLLAMA_FLASH_ATTENTION | disabled | Enable flash attention (reduces VRAM, improves speed) |
| OLLAMA_CONTEXT_LENGTH | auto | Override default context window (e.g., 8192) |
| OLLAMA_HOST | 0.0.0.0:11434 | Bind address (already set correctly in Docker image) |
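OLLAMA_KEEP_ALIVE only sets the default; individual API requests can override it with a keep_alive field. As a sketch, sending a request with no prompt preloads a model and keeps it resident:
# Preload llama3.1 and keep it in memory until explicitly unloaded
curl http://localhost:11434/api/generate -d '{"model": "llama3.1", "keep_alive": -1}'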
Model Management
# Pull a model
docker exec ollama ollama pull mistral
# Remove a model to free disk space
docker exec ollama ollama rm mistral
# Show model details
docker exec ollama ollama show llama3.1
# Copy/rename a model
docker exec ollama ollama cp llama3.1 my-custom-model
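The same management operations are exposed over the REST API if you prefer not to exec into the container, for example:
# List installed models
curl http://localhost:11434/api/tags
# Pull a model without the CLI
curl http://localhost:11434/api/pull -d '{"name": "mistral"}'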
Popular Models
| Model | Size | Use Case |
|---|---|---|
| llama3.1 | 4.7 GB | General purpose, good balance |
| llama3.1:70b | 40 GB | High quality, needs 48 GB+ VRAM |
| mistral | 4.1 GB | Fast, good for coding |
| codellama | 3.8 GB | Code generation and completion |
| phi3:mini | 2.0 GB | Lightweight, good for low-resource servers |
| gemma2 | 5.4 GB | Google’s model, strong reasoning |
| deepseek-coder-v2 | 8.9 GB | Code-focused, excellent quality |
Creating Custom Models
Create a Modelfile to customize model behavior:
FROM llama3.1
PARAMETER temperature 0.7
PARAMETER num_ctx 8192
SYSTEM "You are a helpful assistant specialized in Linux system administration."
docker exec ollama ollama create sysadmin-helper -f /path/to/Modelfile
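Note that the path passed to -f must exist inside the container, not on the host. One simple approach is to copy the Modelfile in first and then run the new model (the file names and paths here are just examples):
docker cp ./Modelfile ollama:/tmp/Modelfile
docker exec ollama ollama create sysadmin-helper -f /tmp/Modelfile
docker exec -it ollama ollama run sysadmin-helper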
Advanced Configuration (Optional)
GPU Selection
If you have multiple NVIDIA GPUs, select specific ones:
environment:
  - CUDA_VISIBLE_DEVICES=0,1 # Use GPU 0 and 1 only
Spread Model Across Multiple GPUs
environment:
  - OLLAMA_SCHED_SPREAD=1 # Distribute model layers across all GPUs
KV Cache Quantization (Save VRAM)
environment:
  - OLLAMA_KV_CACHE_TYPE=q8_0 # Options: f16 (default), q8_0, q4_0
Using q8_0 reduces VRAM usage with minimal quality loss. q4_0 saves more VRAM but may impact output quality.
Connecting to Open WebUI
Ollama’s API is designed to work with web interfaces. The most popular is Open WebUI:
services:
  ollama:
    image: ollama/ollama:v0.18.2
    container_name: ollama
    volumes:
      - ollama_data:/root/.ollama
    restart: unless-stopped
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]

  open-webui:
    image: ghcr.io/open-webui/open-webui:v0.8.10
    container_name: open-webui
    ports:
      - "3000:8080"
    volumes:
      - open-webui_data:/app/backend/data
    environment:
      - OLLAMA_BASE_URL=http://ollama:11434
      - WEBUI_SECRET_KEY=change-this-to-a-random-string
    depends_on:
      - ollama
    restart: unless-stopped

volumes:
  ollama_data:
  open-webui_data:
See the full Open WebUI guide for detailed setup.
Reverse Proxy
Behind Nginx Proxy Manager or another reverse proxy, forward to port 11434 (API) or port 3000 if using Open WebUI.
For the Ollama API specifically, increase proxy timeouts, since model inference can take 30+ seconds for large prompts; if you also proxy Open WebUI, make sure WebSocket passthrough is enabled. When streaming responses through the proxy, disabling proxy buffering (proxy_buffering off;) lets tokens reach the client as they are generated.
Nginx config snippet:
location /ollama/ {
    proxy_pass http://localhost:11434/;
    proxy_set_header Host $host;
    proxy_set_header X-Real-IP $remote_addr;
    proxy_read_timeout 300s;
    proxy_send_timeout 300s;
}
See Reverse Proxy Setup for full configuration.
Backup
Back up the models volume:
docker compose stop ollama
docker run --rm -v ollama_data:/data -v $(pwd):/backup alpine \
  tar czf /backup/ollama-backup.tar.gz /data
docker compose start ollama
Compose prefixes named volumes with the project name, so the volume may actually be called something like ollama_ollama_data; run docker volume ls and adjust the -v flag accordingly. The /root/.ollama volume contains downloaded models and configuration. Models can be re-downloaded, so backing up is optional if bandwidth isn’t a concern.
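Restoring is the same pattern in reverse: recreate the volume (again using the actual volume name from docker volume ls) and unpack the archive into it, for example:
docker volume create ollama_data
docker run --rm -v ollama_data:/data -v $(pwd):/backup alpine \
  tar xzf /backup/ollama-backup.tar.gz -C /
docker compose up -d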
See Backup Strategy for a comprehensive approach.
Troubleshooting
GPU not detected
Symptom: Ollama falls back to CPU despite having an NVIDIA GPU.
Fix: Verify the NVIDIA Container Toolkit is installed and configured:
nvidia-smi # Should show your GPU
docker run --rm --gpus all ubuntu nvidia-smi # Should also work inside a container
If the second command fails, reconfigure the toolkit:
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker
Model download fails
Symptom: ollama pull hangs or fails.
Fix: Check disk space (df -h) — models are large. Verify network connectivity from inside the container:
docker exec ollama curl -I https://registry.ollama.ai
Out of memory (OOM)
Symptom: Container is killed during inference.
Fix: Use a smaller model or a quantized variant. Add swap space:
sudo fallocate -l 8G /swapfile
sudo chmod 600 /swapfile
sudo mkswap /swapfile
sudo swapon /swapfile
Or limit the model’s context length:
environment:
  - OLLAMA_CONTEXT_LENGTH=4096
Slow inference on CPU
Symptom: Responses take minutes.
Fix: CPU inference is inherently slow for large models. Use smaller models (phi3:mini, tinyllama), reduce context length, or add a GPU. Even a used NVIDIA GTX 1070 with 8 GB VRAM dramatically improves performance.
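To put a number on it for your hardware, the CLI can print timing statistics after each response; the eval rate line is tokens per second:
docker exec -it ollama ollama run phi3:mini --verbose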
Cannot connect from another container
Symptom: Other containers get “connection refused” when hitting Ollama’s API.
Fix: "Connection refused" is a networking problem, not a CORS problem. Ensure both containers are on the same Docker network and use http://ollama:11434 (the service name) as the URL, not localhost. OLLAMA_ORIGINS=* only matters for browser-based clients that send an Origin header.
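One way to test this without touching the client container is to start a throwaway container on the same Compose network and hit the service name. The network name below (ollama_default) just follows Compose's usual <project>_default pattern and may differ on your system; check docker network ls:
docker network ls
docker run --rm --network ollama_default curlimages/curl -s http://ollama:11434
# Expected response: "Ollama is running"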
Resource Requirements
- RAM (CPU mode): Model size + 2-4 GB overhead. A 7B model needs ~8 GB total.
- RAM (GPU mode): System RAM for overhead (~4 GB), VRAM for the model.
- VRAM: Roughly matches the quantized model file size. 8 GB VRAM handles 7B models; 24 GB handles 13B-34B class models; 70B quantized models need ~40-48 GB or will partially offload to CPU.
- CPU: More cores = faster CPU inference. Ollama uses all available cores.
- Disk: 20-100+ GB depending on model collection.
Verdict
Ollama is the best way to run LLMs locally. The Docker setup is dead simple, GPU passthrough works reliably, and the OpenAI-compatible API means it integrates with virtually every AI tool. Pair it with Open WebUI for a ChatGPT-like interface, or use the API directly from your applications.
For a complete self-hosted AI stack, Ollama is the runtime and Open WebUI is the interface. If you need an OpenAI API drop-in replacement with support for multiple model formats, LocalAI is an alternative — but Ollama is simpler and faster for most use cases.
Frequently Asked Questions
How much RAM do I need for Ollama?
It depends on the model size. The rule of thumb: you need roughly 1 GB of RAM per billion parameters when running quantized models (Q4). Llama 3.1 8B needs ~6 GB, Mistral 7B needs ~5 GB, and Llama 3.1 70B needs ~40 GB. These are for the quantized (Q4_K_M) versions — full-precision models need 2x more. If using a GPU, VRAM is what matters. If running on CPU only, system RAM is used. Always leave headroom for the OS and other applications.
Can I run Ollama without a GPU?
Yes. Ollama runs on CPU-only systems. Inference is slower — expect 5-15 tokens/second for a 7B model on a modern 8-core CPU vs 40-80 tokens/second on an NVIDIA RTX 3060. For small models (3B-7B), CPU-only performance is usable for personal chat. For larger models (13B+) or production API serving, a GPU is strongly recommended. CPUs with AVX2 or AVX-512 support perform noticeably better than older CPUs without these instruction sets.
Which GPU should I use with Ollama?
NVIDIA GPUs are the best supported. The RTX 3060 12GB is the community’s top value pick — 12 GB VRAM runs all 7B models and most 13B quantized models for around $250 used. The RTX 3090 (24 GB VRAM) comfortably runs models up to the ~30B class; 70B quantized models need roughly 40-48 GB, so they require two such cards or will partially offload to system RAM. AMD GPU support exists through ROCm but is less stable. Apple Silicon Macs with Metal are well-supported for the native (non-Docker) Ollama install but not inside Docker containers on macOS.
How do I use Ollama with Open WebUI?
Open WebUI provides a ChatGPT-like web interface for Ollama. Run it as a separate container and point it at your Ollama API:
# On Linux, host.docker.internal requires the --add-host mapping below
docker run -d --name open-webui \
  -p 3000:8080 \
  -e OLLAMA_BASE_URL=http://host.docker.internal:11434 \
  --add-host=host.docker.internal:host-gateway \
  --restart unless-stopped \
  ghcr.io/open-webui/open-webui:v0.8.10
Open WebUI supports model switching, conversation history, RAG (document upload), and multi-user accounts. It is the most popular frontend for self-hosted Ollama.
Can I use Ollama as an OpenAI API replacement?
Yes. Ollama exposes an OpenAI-compatible API at /v1/chat/completions. Any application that supports custom OpenAI API endpoints can use Ollama as a drop-in replacement — set the base URL to http://your-server:11434/v1 and use the model name as the model parameter. This works with LangChain, LlamaIndex, Continue (VS Code extension), and many other tools. No API key is needed for local access.
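A minimal request against the OpenAI-compatible endpoint looks like this (no real API key is required for local access):
curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama3.1",
    "messages": [{"role": "user", "content": "Say hello in one sentence."}]
  }'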
How do I download and manage models?
Use the ollama CLI inside the container:
docker exec -it ollama ollama pull llama3.1:8b # Download a model
docker exec -it ollama ollama list # List installed models
docker exec -it ollama ollama rm mistral:7b # Delete a model
docker exec -it ollama ollama show llama3.1:8b # Show model details
Models are stored in the /root/.ollama volume. A 7B model is typically 4-5 GB on disk (quantized). You can also pull models via the API: curl http://localhost:11434/api/pull -d '{"name": "llama3.1:8b"}'.
Can multiple users query Ollama simultaneously?
Yes, but requests are processed sequentially by default for a single model. If two users send requests at the same time, one waits for the other to finish. You can set OLLAMA_NUM_PARALLEL to allow concurrent requests to the same model (requires enough VRAM to hold multiple inference contexts). For true multi-user serving with load balancing and queueing, consider vLLM which is designed for production serving.