How to Self-Host LocalAI with Docker Compose

What Is LocalAI?

LocalAI is a self-hosted, OpenAI-compatible API server that runs AI models locally. Unlike Ollama (which focuses on LLMs), LocalAI handles text generation, image generation (Stable Diffusion), audio transcription (Whisper), text-to-speech, and embeddings — all from a single API endpoint. It’s a drop-in replacement for the OpenAI API, so existing applications work without code changes.

Prerequisites

  • A Linux server (Ubuntu 22.04+ recommended)
  • Docker and Docker Compose installed (guide)
  • 8 GB+ RAM (CPU mode) or NVIDIA GPU with 8+ GB VRAM
  • 20 GB+ free disk space (models are large)
  • NVIDIA Container Toolkit (for GPU mode)
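
If you plan to use GPU mode, it's worth confirming that Docker can see the GPU before continuing. A quick check (the CUDA image tag below is only an example; any recent nvidia/cuda tag works):

# Should print the same GPU table as running nvidia-smi on the host
docker run --rm --gpus all nvidia/cuda:12.4.1-base-ubuntu22.04 nvidia-smi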

Docker Compose Configuration

Create a docker-compose.yml file:

services:
  localai:
    image: localai/localai:v4.0.0
    # GPU variants (uncomment one):
    # image: localai/localai:v4.0.0-gpu-nvidia-cuda-12
    # image: localai/localai:v4.0.0-gpu-nvidia-cuda-13
    # image: localai/localai:v4.0.0-gpu-hipblas       # AMD ROCm
    # image: localai/localai:v4.0.0-gpu-intel          # Intel Arc
    # image: localai/localai:v4.0.0-gpu-vulkan         # Vulkan
    container_name: localai
    ports:
      - "8080:8080"
    volumes:
      - localai_models:/build/models
    environment:
      # Thread count for CPU inference
      - THREADS=4
      # Default context window size
      - CONTEXT_SIZE=2048
      # Enable debug logging (optional)
      # - DEBUG=true
    # Uncomment for NVIDIA GPU support:
    # deploy:
    #   resources:
    #     reservations:
    #       devices:
    #         - driver: nvidia
    #           count: all
    #           capabilities: [gpu]
    restart: unless-stopped

volumes:
  localai_models:

Upgrading from v3.x: LocalAI v4.0.0 removed the all-in-one (AIO) images and the HuggingFace backend. If you used AIO images, switch to the main image above and install models via the gallery API. If you used HuggingFace models directly, convert them to GGUF format first. The json_verbose parameter was renamed to verbose_json for OpenAI spec compliance — update any integrations that reference the old name.

Start the stack:

docker compose up -d
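
The first start can take a moment. Check the container status and follow the logs with:

# Container status and startup logs
docker compose ps
docker logs -f localai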

Initial Setup

Once running, verify the API is responding:

curl http://localhost:8080/v1/models

Loading a Model

Download a GGUF model and create a configuration:

# Download a model (example: Mistral 7B)
docker exec localai wget -P /build/models \
  https://huggingface.co/TheBloke/Mistral-7B-Instruct-v0.2-GGUF/resolve/main/mistral-7b-instruct-v0.2.Q4_K_M.gguf

# Create a model config
docker exec localai bash -c 'cat > /build/models/mistral.yaml << EOF
name: mistral
backend: llama-cpp
parameters:
  model: mistral-7b-instruct-v0.2.Q4_K_M.gguf
  temperature: 0.7
  top_p: 0.9
context_size: 4096
threads: 4
EOF'

Test it with the OpenAI-compatible API:

curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "mistral",
    "messages": [{"role": "user", "content": "What is self-hosting?"}]
  }'
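
Since the endpoint follows the OpenAI chat completions spec, the usual request options work as well. For example, extracting only the reply text with jq, or streaming tokens as they are generated (both assume the mistral config above):

# Print just the assistant reply (requires jq on the host)
curl -s http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "mistral", "messages": [{"role": "user", "content": "What is self-hosting?"}]}' \
  | jq -r '.choices[0].message.content'

# Stream the response as server-sent events
curl -N http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "mistral", "stream": true, "messages": [{"role": "user", "content": "Tell me a joke"}]}'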

Configuration

Key Environment Variables

  • THREADS (default: 4) - CPU threads for inference
  • CONTEXT_SIZE (default: 512) - Default context window size
  • MODELS_PATH (default: /build/models) - Directory for model files
  • DEBUG (default: false) - Enable debug logging
  • CORS (default: true) - Enable CORS headers
  • LOCALAI_DATA_PATH - Persistent data path for agents and skills (separate from models)
  • GALLERIES - JSON array of model gallery URLs

Model Gallery

LocalAI supports a model gallery for easy model installation:

# List available models
curl http://localhost:8080/models/available

# Install a model from the gallery
curl http://localhost:8080/models/apply -d '{"id": "huggingface@TheBloke/mistral-7b-instruct-v0.2-GGUF/mistral-7b-instruct-v0.2.Q4_K_M.gguf"}'
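
The apply call is asynchronous: it responds with a job UUID and a status URL that you can poll until the download completes (field names come from the gallery jobs API and may vary slightly between versions):

# Poll the job returned by /models/apply until "processed" is true
curl -s http://localhost:8080/models/jobs/<uuid-from-apply-response>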

Advanced Configuration

Image Generation (Stable Diffusion)

Add a Stable Diffusion model config:

# /build/models/stablediffusion.yaml
name: stablediffusion
backend: stablediffusion
parameters:
  model: stablediffusion_assets

Generate images via the OpenAI Images API:

curl http://localhost:8080/v1/images/generations \
  -H "Content-Type: application/json" \
  -d '{"prompt": "a cat on a spaceship", "size": "512x512"}'

Audio Transcription (Whisper)

LocalAI supports Whisper for speech-to-text:

curl http://localhost:8080/v1/audio/transcriptions \
  -F "[email protected]" \
  -F "model=whisper-1"

Text-to-Speech

curl http://localhost:8080/v1/audio/speech \
  -H "Content-Type: application/json" \
  -d '{"model": "tts-1", "input": "Hello from LocalAI", "voice": "alloy"}' \
  --output speech.mp3

Reverse Proxy

For HTTPS access, configure your reverse proxy to forward to port 8080. See Reverse Proxy Setup for details.

Nginx Proxy Manager: Create a proxy host pointing to localai:8080. Enable WebSocket support for streaming responses.

Backup

Back up the models volume:

docker run --rm -v localai_models:/data -v $(pwd):/backup alpine \
  tar czf /backup/localai-models-backup.tar.gz /data

The models volume contains downloaded models and YAML configurations. Models can be re-downloaded, but custom configs should be backed up. See Backup Strategy for a comprehensive approach.
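
To restore, reverse the operation into an empty models volume on the target host. Because the archive was created from /data, extracting at / puts the files back in place:

docker run --rm -v localai_models:/data -v $(pwd):/backup alpine \
  tar xzf /backup/localai-models-backup.tar.gz -C /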

Troubleshooting

Model Not Loading

Symptom: API returns empty model list or 404 on model name. Fix: Verify the YAML config filename matches the name field. Check that the GGUF file path in the YAML matches the actual file in /build/models/. Check logs: docker logs localai.
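
Two quick checks from the host:

# Confirm the GGUF file and its YAML config are where LocalAI expects them
docker exec localai ls -lh /build/models

# Look for load errors
docker logs localai 2>&1 | grep -iE "error|fail"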

Out of Memory

Symptom: Container killed, OOM errors in logs. Fix: Use a smaller quantized model (Q4_K_M instead of Q8). Reduce CONTEXT_SIZE. For GPU: choose a model that fits in your VRAM. For CPU: ensure enough system RAM (model size + 2 GB overhead).

Slow Inference on CPU

Symptom: Responses take 30+ seconds. Fix: Increase THREADS to match your CPU core count. Use a smaller model (7B Q4 instead of 13B). Consider a GPU variant for 5-10x speedup.
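
To size THREADS, check the CPU count and update the environment section of docker-compose.yml (physical cores are usually a better value than logical processors):

# Logical processors available to Docker
nproc

# Recreate the container after changing THREADS
docker compose up -d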

CORS Errors from Frontend

Symptom: Browser console shows CORS errors. Fix: Set CORS=true environment variable (default is already true). If using a reverse proxy, ensure it doesn’t strip CORS headers.

GPU Not Detected

Symptom: Running on CPU despite having GPU. Fix: Ensure you’re using the correct GPU image variant (e.g., v4.0.0-gpu-nvidia-cuda-12). Verify NVIDIA Container Toolkit is installed: nvidia-smi should work inside the container. Check deploy.resources.reservations.devices in your Compose file.
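
To verify that the container itself can reach the GPU:

# Should print the GPU table; an error means the toolkit or the deploy block is not set up correctly
docker exec localai nvidia-smi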

Resource Requirements

  • RAM (CPU mode): 4-8 GB for 7B models, 8-16 GB for 13B models
  • VRAM (GPU mode): 4-8 GB for 7B Q4, 8-16 GB for 13B Q4
  • CPU: Medium-high (benefits from more cores, set THREADS accordingly)
  • Disk: 5-50 GB depending on number and size of models

Verdict

LocalAI is the Swiss Army knife of self-hosted AI. If you need a single service that handles text generation, image generation, audio transcription, and text-to-speech — all behind an OpenAI-compatible API — LocalAI is the only option that does it all. The trade-off is more complex setup compared to Ollama, which only does LLM inference but does it with less friction.

Choose LocalAI if you’re migrating an application from the OpenAI API to self-hosted, or if you need multi-modal AI (text + images + audio) from one endpoint. Choose Ollama if you only need LLM inference and want the simplest setup.

FAQ

How does LocalAI compare to Ollama?

Ollama focuses exclusively on LLM inference with the simplest possible setup — one command to pull and run models. LocalAI covers LLMs plus image generation, audio transcription, text-to-speech, and embeddings from a single API. Choose Ollama for simplicity and LLM-only use. Choose LocalAI for multi-modal AI or OpenAI API compatibility. See Ollama vs LocalAI.

Can I use LocalAI as a drop-in OpenAI API replacement?

Yes. LocalAI implements the OpenAI API specification (/v1/chat/completions, /v1/images/generations, /v1/audio/transcriptions, etc.). Point any OpenAI-compatible client at http://your-server:8080 instead of api.openai.com. No code changes needed in most applications.
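
For example, the official OpenAI SDKs can usually be redirected with environment variables alone (newer releases read OPENAI_BASE_URL; some older versions use OPENAI_API_BASE instead). LocalAI ignores the key unless you have configured API keys:

# Point an OpenAI-compatible application at LocalAI instead of api.openai.com
export OPENAI_BASE_URL="http://your-server:8080/v1"
export OPENAI_API_KEY="anything-non-empty"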

Do I need a GPU to run LocalAI?

No. LocalAI works on CPU, though inference is significantly slower. A 7B parameter model (Q4 quantization) needs 4-8 GB of RAM on CPU. GPU variants (NVIDIA CUDA, AMD ROCm, Intel Arc, Vulkan) provide 5-10x speedup. For interactive use, a GPU is strongly recommended.

What model formats does LocalAI support?

GGUF is the primary format (via llama.cpp backend). LocalAI also supports diffusion models for image generation, Whisper models for audio, and ONNX models. HuggingFace model support was removed in v4.0.0 — convert HuggingFace models to GGUF format first.

Can I run multiple models simultaneously?

Yes. Define multiple YAML configuration files in the models directory, each pointing to a different model file. LocalAI loads models on demand when the API receives a request for that model name. Be mindful of RAM — each loaded model consumes memory proportional to its size.

What changed in LocalAI v4.0.0?

The all-in-one (AIO) images were removed — use the standard image and install models via the gallery API. The HuggingFace backend was dropped (convert models to GGUF). The json_verbose parameter was renamed to verbose_json for OpenAI spec compliance.
