How to Self-Host LocalAI with Docker Compose

What Is LocalAI?

LocalAI is a self-hosted, OpenAI-compatible API server that runs AI models locally. Unlike Ollama (which focuses on LLMs), LocalAI handles text generation, image generation (Stable Diffusion), audio transcription (Whisper), text-to-speech, and embeddings — all from a single API endpoint. It’s a drop-in replacement for the OpenAI API, so existing applications work without code changes.

Prerequisites

  • A Linux server (Ubuntu 22.04+ recommended)
  • Docker and Docker Compose installed (guide)
  • 8 GB+ RAM (CPU mode) or NVIDIA GPU with 8+ GB VRAM
  • 20 GB+ free disk space (models are large)
  • NVIDIA Container Toolkit (for GPU mode)
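
If you plan to use GPU mode, it's worth confirming that Docker can see the GPU before continuing. A quick check (the CUDA image tag below is only an example; any recent nvidia/cuda tag works):

# Should print the same GPU table as running nvidia-smi on the host
docker run --rm --gpus all nvidia/cuda:12.4.1-base-ubuntu22.04 nvidia-smi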

Docker Compose Configuration

Create a docker-compose.yml file:

services:
  localai:
    image: localai/localai:v4.0.0
    # GPU variants (uncomment one):
    # image: localai/localai:v4.0.0-gpu-nvidia-cuda-12
    # image: localai/localai:v4.0.0-gpu-nvidia-cuda-13
    # image: localai/localai:v4.0.0-gpu-hipblas       # AMD ROCm
    # image: localai/localai:v4.0.0-gpu-intel          # Intel Arc
    # image: localai/localai:v4.0.0-gpu-vulkan         # Vulkan
    container_name: localai
    ports:
      - "8080:8080"
    volumes:
      - localai_models:/build/models
    environment:
      # Thread count for CPU inference
      - THREADS=4
      # Default context window size
      - CONTEXT_SIZE=2048
      # Enable debug logging (optional)
      # - DEBUG=true
    # Uncomment for NVIDIA GPU support:
    # deploy:
    #   resources:
    #     reservations:
    #       devices:
    #         - driver: nvidia
    #           count: all
    #           capabilities: [gpu]
    restart: unless-stopped

volumes:
  localai_models:

Upgrading from v3.x: LocalAI v4.0.0 removed the all-in-one (AIO) images and the HuggingFace backend. If you used AIO images, switch to the main image above and install models via the gallery API. If you used HuggingFace models directly, convert them to GGUF format first. The json_verbose parameter was renamed to verbose_json for OpenAI spec compliance — update any integrations that reference the old name.

Start the stack:

docker compose up -d
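
The first start can take a moment. Check the container status and follow the logs with:

# Container status and startup logs
docker compose ps
docker logs -f localai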

Initial Setup

Once running, verify the API is responding:

curl http://localhost:8080/v1/models

Loading a Model

Download a GGUF model and create a configuration:

# Download a model (example: Mistral 7B)
docker exec localai wget -P /build/models \
  https://huggingface.co/TheBloke/Mistral-7B-Instruct-v0.2-GGUF/resolve/main/mistral-7b-instruct-v0.2.Q4_K_M.gguf

# Create a model config
docker exec localai bash -c 'cat > /build/models/mistral.yaml << EOF
name: mistral
backend: llama-cpp
parameters:
  model: mistral-7b-instruct-v0.2.Q4_K_M.gguf
  temperature: 0.7
  top_p: 0.9
context_size: 4096
threads: 4
EOF'

Test it with the OpenAI-compatible API:

curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "mistral",
    "messages": [{"role": "user", "content": "What is self-hosting?"}]
  }'
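
Since the endpoint follows the OpenAI chat completions spec, the usual request options work as well. For example, extracting only the reply text with jq, or streaming tokens as they are generated (both assume the mistral config above):

# Print just the assistant reply (requires jq on the host)
curl -s http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "mistral", "messages": [{"role": "user", "content": "What is self-hosting?"}]}' \
  | jq -r '.choices[0].message.content'

# Stream the response as server-sent events
curl -N http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "mistral", "stream": true, "messages": [{"role": "user", "content": "Tell me a joke"}]}'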

Configuration

Key Environment Variables

  • THREADS (default: 4) - CPU threads for inference
  • CONTEXT_SIZE (default: 512) - Default context window size
  • MODELS_PATH (default: /build/models) - Directory for model files
  • DEBUG (default: false) - Enable debug logging
  • CORS (default: true) - Enable CORS headers
  • LOCALAI_DATA_PATH - Persistent data path for agents and skills (separate from models)
  • GALLERIES - JSON array of model gallery URLs

Model Gallery

LocalAI supports a model gallery for easy model installation:

# List available models
curl http://localhost:8080/models/available

# Install a model from the gallery
curl http://localhost:8080/models/apply -d '{"id": "huggingface@TheBloke/mistral-7b-instruct-v0.2-GGUF/mistral-7b-instruct-v0.2.Q4_K_M.gguf"}'
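
The apply call is asynchronous: it responds with a job UUID and a status URL that you can poll until the download completes (field names come from the gallery jobs API and may vary slightly between versions):

# Poll the job returned by /models/apply until "processed" is true
curl -s http://localhost:8080/models/jobs/<uuid-from-apply-response>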

Advanced Configuration

Image Generation (Stable Diffusion)

Add a Stable Diffusion model config:

# /build/models/stablediffusion.yaml
name: stablediffusion
backend: stablediffusion
parameters:
  model: stablediffusion_assets

Generate images via the OpenAI Images API:

curl http://localhost:8080/v1/images/generations \
  -H "Content-Type: application/json" \
  -d '{"prompt": "a cat on a spaceship", "size": "512x512"}'

Audio Transcription (Whisper)

LocalAI supports Whisper for speech-to-text:

curl http://localhost:8080/v1/audio/transcriptions \
  -F "[email protected]" \
  -F "model=whisper-1"

Text-to-Speech

curl http://localhost:8080/v1/audio/speech \
  -H "Content-Type: application/json" \
  -d '{"model": "tts-1", "input": "Hello from LocalAI", "voice": "alloy"}' \
  --output speech.mp3

Reverse Proxy

For HTTPS access, configure your reverse proxy to forward to port 8080. See Reverse Proxy Setup for details.

Nginx Proxy Manager: Create a proxy host pointing to localai:8080. Enable WebSocket support for streaming responses.

Backup

Back up the models volume:

docker run --rm -v localai_models:/data -v $(pwd):/backup alpine \
  tar czf /backup/localai-models-backup.tar.gz /data

The models volume contains downloaded models and YAML configurations. Models can be re-downloaded, but custom configs should be backed up. See Backup Strategy for a comprehensive approach.
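
To restore, reverse the operation into an empty models volume on the target host. Because the archive was created from /data, extracting at / puts the files back in place:

docker run --rm -v localai_models:/data -v $(pwd):/backup alpine \
  tar xzf /backup/localai-models-backup.tar.gz -C /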

Troubleshooting

Model Not Loading

Symptom: API returns empty model list or 404 on model name. Fix: Verify the YAML config filename matches the name field. Check that the GGUF file path in the YAML matches the actual file in /build/models/. Check logs: docker logs localai.
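
Two quick checks from the host:

# Confirm the GGUF file and its YAML config are where LocalAI expects them
docker exec localai ls -lh /build/models

# Look for load errors
docker logs localai 2>&1 | grep -iE "error|fail"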

Out of Memory

Symptom: Container killed, OOM errors in logs. Fix: Use a smaller quantized model (Q4_K_M instead of Q8). Reduce CONTEXT_SIZE. For GPU: choose a model that fits in your VRAM. For CPU: ensure enough system RAM (model size + 2 GB overhead).

Slow Inference on CPU

Symptom: Responses take 30+ seconds. Fix: Increase THREADS to match your CPU core count. Use a smaller model (7B Q4 instead of 13B). Consider a GPU variant for 5-10x speedup.
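
To size THREADS, check the CPU count and update the environment section of docker-compose.yml (physical cores are usually a better value than logical processors):

# Logical processors available to Docker
nproc

# Recreate the container after changing THREADS
docker compose up -d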

CORS Errors from Frontend

Symptom: Browser console shows CORS errors. Fix: Set CORS=true environment variable (default is already true). If using a reverse proxy, ensure it doesn’t strip CORS headers.

GPU Not Detected

Symptom: Running on CPU despite having GPU. Fix: Ensure you’re using the correct GPU image variant (e.g., v4.0.0-gpu-nvidia-cuda-12). Verify NVIDIA Container Toolkit is installed: nvidia-smi should work inside the container. Check deploy.resources.reservations.devices in your Compose file.
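
To verify that the container itself can reach the GPU:

# Should print the GPU table; an error means the toolkit or the deploy block is not set up correctly
docker exec localai nvidia-smi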

Resource Requirements

  • RAM (CPU mode): 4-8 GB for 7B models, 8-16 GB for 13B models
  • VRAM (GPU mode): 4-8 GB for 7B Q4, 8-16 GB for 13B Q4
  • CPU: Medium-high (benefits from more cores, set THREADS accordingly)
  • Disk: 5-50 GB depending on number and size of models

Verdict

LocalAI is the Swiss Army knife of self-hosted AI. If you need a single service that handles text generation, image generation, audio transcription, and text-to-speech — all behind an OpenAI-compatible API — LocalAI is the only option that does it all. The trade-off is more complex setup compared to Ollama, which only does LLM inference but does it with less friction.

Choose LocalAI if you’re migrating an application from the OpenAI API to self-hosted, or if you need multi-modal AI (text + images + audio) from one endpoint. Choose Ollama if you only need LLM inference and want the simplest setup.

FAQ

How does LocalAI compare to Ollama?

Ollama focuses exclusively on LLM inference with the simplest possible setup — one command to pull and run models. LocalAI covers LLMs plus image generation, audio transcription, text-to-speech, and embeddings from a single API. Choose Ollama for simplicity and LLM-only use. Choose LocalAI for multi-modal AI or OpenAI API compatibility. See Ollama vs LocalAI.

Can I use LocalAI as a drop-in OpenAI API replacement?

Yes. LocalAI implements the OpenAI API specification (/v1/chat/completions, /v1/images/generations, /v1/audio/transcriptions, etc.). Point any OpenAI-compatible client at http://your-server:8080 instead of api.openai.com. No code changes needed in most applications.
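
For example, the official OpenAI SDKs can usually be redirected with environment variables alone (newer releases read OPENAI_BASE_URL; some older versions use OPENAI_API_BASE instead). LocalAI ignores the key unless you have configured API keys:

# Point an OpenAI-compatible application at LocalAI instead of api.openai.com
export OPENAI_BASE_URL="http://your-server:8080/v1"
export OPENAI_API_KEY="anything-non-empty"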

Do I need a GPU to run LocalAI?

No. LocalAI works on CPU, though inference is significantly slower. A 7B parameter model (Q4 quantization) needs 4-8 GB of RAM on CPU. GPU variants (NVIDIA CUDA, AMD ROCm, Intel Arc, Vulkan) provide 5-10x speedup. For interactive use, a GPU is strongly recommended.

What model formats does LocalAI support?

GGUF is the primary format (via llama.cpp backend). LocalAI also supports diffusion models for image generation, Whisper models for audio, and ONNX models. HuggingFace model support was removed in v4.0.0 — convert HuggingFace models to GGUF format first.

Can I run multiple models simultaneously?

Yes. Define multiple YAML configuration files in the models directory, each pointing to a different model file. LocalAI loads models on demand when the API receives a request for that model name. Be mindful of RAM — each loaded model consumes memory proportional to its size.

What changed in LocalAI v4.0.0?

The all-in-one (AIO) images were removed — use the standard image and install models via the gallery API. The HuggingFace backend was dropped (convert models to GGUF). The json_verbose parameter was renamed to verbose_json for OpenAI spec compliance.
