How to Self-Host Tabby with Docker Compose
What Is Tabby?
Tabby is a self-hosted AI code completion server. It runs a code-specific language model on your infrastructure and serves completions to IDE extensions (VS Code, JetBrains, Vim/Neovim). Tabby indexes your repositories for context-aware suggestions and includes an admin dashboard for managing users and monitoring usage. Think of it as a self-hosted GitHub Copilot alternative.
Prerequisites
- A Linux server (Ubuntu 22.04+ recommended)
- Docker and Docker Compose installed
- NVIDIA GPU with 4+ GB VRAM (recommended) or CPU mode (slower)
- 8 GB+ RAM
- 10 GB+ free disk space
- NVIDIA Container Toolkit (for GPU mode)
Docker Compose Configuration
Create a docker-compose.yml file:
```yaml
services:
  tabby:
    image: tabbyml/tabby:v0.32.0
    container_name: tabby
    ports:
      - "8080:8080"
    volumes:
      - tabby_data:/data
    command: serve --model StarCoder-1B --device cuda
    # For CPU-only mode, use:
    # command: serve --model StarCoder-1B --device cpu
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
    restart: unless-stopped

volumes:
  tabby_data:
```
Start the stack:
```shell
docker compose up -d
```
Tabby downloads the model on first start (StarCoder-1B is ~2 GB).
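To watch the download and confirm the server is ready, you can tail the container logs and then probe the HTTP port. The /v1/health path is an assumption based on Tabby's API documentation; if your version differs, any 200 response from the port is a good sign:

```shell
# Follow the container logs until the model download completes
docker compose logs -f tabby

# Once the server reports it is listening, check that it responds
# (assumes Tabby's /v1/health endpoint)
curl -fsS http://localhost:8080/v1/health && echo "Tabby is up"
```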
Initial Setup
- Open http://your-server:8080 in your browser
- Create your admin account on first visit
- Go to Settings → Repository to add your code repositories
- Install the Tabby extension in your IDE
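Once you have generated an auth token in the dashboard, you can sanity-check the completion API from the command line before touching any IDE. The request shape below (a language plus a segments.prefix) follows Tabby's published completion API, but treat it as a sketch and check your version's API docs; TABBY_TOKEN is a placeholder for your own token:

```shell
curl -s http://your-server:8080/v1/completions \
  -H "Authorization: Bearer $TABBY_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"language": "python", "segments": {"prefix": "def fib(n):\n    "}}'
```

A JSON response containing completion choices confirms the server and token are working end to end.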
IDE Extension Setup
VS Code:
- Install “Tabby” from the VS Code Marketplace
- Open Settings → search for “Tabby”
- Set Server Endpoint to http://your-server:8080
JetBrains:
- Install “Tabby” from the JetBrains Plugin Marketplace
- Go to Settings → Tools → Tabby
- Set Server Endpoint to http://your-server:8080
Vim/Neovim:
Install TabbyML/vim-tabby plugin and configure the endpoint.
Configuration
Model Selection
| Model | VRAM Required | Quality | Speed |
|---|---|---|---|
| StarCoder-1B | ~2 GB | Good | Very fast |
| StarCoder-3B | ~4 GB | Better | Fast |
| StarCoder-7B | ~8 GB | Best built-in | Moderate |
| CodeLlama-7B | ~8 GB | Best overall | Moderate |
| DeepSeek-Coder-1.3B | ~2 GB | Good | Very fast |
Change the model in the command field:
```yaml
command: serve --model CodeLlama-7B --device cuda
```
Repository Indexing
Add repositories through the admin dashboard for context-aware completions:
- Go to Settings → Repository
- Add Git repository URLs
- Tabby indexes the code and uses it as context for completions
Key CLI Arguments
| Argument | Description |
|---|---|
| --model MODEL | Code completion model to serve |
| --chat-model MODEL | Separate model for chat (optional) |
| --device cuda/cpu | Inference device |
| --port PORT | Server port (default: 8080) |
| --parallelism N | Max concurrent completion requests |
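Putting several of these flags together, a compose command line for a small team server might look like the following. The flag set matches the table above; the --parallelism value of 4 is just an example to tune against your GPU:

```yaml
command: serve --model StarCoder-1B --device cuda --port 8080 --parallelism 4
```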
Advanced Configuration
Separate Chat Model
Use a larger model for chat while keeping a fast small model for completions:
```yaml
command: >
  serve
  --model StarCoder-1B
  --chat-model CodeLlama-7B-Instruct
  --device cuda
```
Using an External LLM Backend
Tabby can connect to Ollama or other OpenAI-compatible APIs instead of serving models itself:
```yaml
command: serve --device cpu
environment:
  - TABBY_LLM_ENDPOINT=http://ollama:11434
```
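If you want Ollama in the same stack, a minimal sketch of an ollama service to run alongside tabby follows. The image and volume paths are the upstream Ollama defaults; this assumes Tabby reaches it over the compose network at http://ollama:11434:

```yaml
  ollama:
    image: ollama/ollama:latest
    container_name: ollama
    volumes:
      - ollama_data:/root/.ollama  # also add ollama_data under the top-level volumes: key
    restart: unless-stopped
```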
Reverse Proxy
Configure your reverse proxy to forward to port 8080. See Reverse Proxy Setup.
Backup
Back up the Tabby data volume:
```shell
docker run --rm -v tabby_data:/data -v $(pwd):/backup alpine \
  tar czf /backup/tabby-backup.tar.gz /data
```
This contains user accounts, repository indexes, and configuration. Models can be re-downloaded. See Backup Strategy.
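To restore, stop the stack and unpack the archive back into the volume. This is a mirror of the backup command above and assumes the tarball was created as shown, with paths rooted at /data (tar strips the leading slash, so extracting with -C / puts files back in place):

```shell
docker compose down

# Unpack the archive back into the tabby_data volume
docker run --rm -v tabby_data:/data -v $(pwd):/backup alpine \
  sh -c "tar xzf /backup/tabby-backup.tar.gz -C /"

docker compose up -d
```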
Troubleshooting
Completions Are Slow
Symptom: Code completions take 2+ seconds.
Fix: Use a smaller model (StarCoder-1B or DeepSeek-Coder-1.3B). Ensure GPU mode is active (--device cuda). Check that the NVIDIA Container Toolkit is installed.
GPU Not Detected
Symptom: Container falls back to CPU mode.
Fix: Verify nvidia-smi works on the host. Ensure deploy.resources.reservations.devices is set in docker-compose.yml. Install NVIDIA Container Toolkit if missing.
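These checks can be run in order, from host driver to container runtime to the Tabby container itself. The CUDA image tag is only an example; use any recent nvidia/cuda base tag compatible with your driver:

```shell
# 1. Does the host driver work?
nvidia-smi

# 2. Can Docker pass the GPU through? (example CUDA base image tag)
docker run --rm --gpus all nvidia/cuda:12.4.1-base-ubuntu22.04 nvidia-smi

# 3. Does the running Tabby container see the GPU?
docker exec tabby nvidia-smi
```

If step 1 fails, fix the host driver first; if step 2 fails, reinstall the NVIDIA Container Toolkit; if only step 3 fails, recheck the deploy section in docker-compose.yml.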
IDE Extension Not Connecting
Symptom: Extension shows “disconnected” status.
Fix: Verify the server URL is correct and accessible. Check firewall rules for port 8080. If using a reverse proxy, ensure it’s forwarding correctly.
No Context-Aware Completions
Symptom: Completions don’t reference your codebase.
Fix: Add your repositories in Settings → Repository. Wait for indexing to complete (check the admin dashboard for status). Ensure the repository URL is accessible from the Tabby container.
Resource Requirements
- VRAM: 2 GB (1B models), 4 GB (3B models), 8 GB (7B models)
- RAM: 4-8 GB system RAM
- CPU: Low-medium (GPU does the heavy lifting)
- Disk: 2-10 GB per model + repository index size
Verdict
Tabby is the best self-hosted code completion server for teams. The admin dashboard, user management, repository indexing, and usage analytics make it a proper enterprise-ready tool. The trade-off is that it requires a dedicated GPU for reasonable performance, and the model selection is more limited than using a general-purpose LLM backend like Ollama.
Choose Tabby for a centralized code AI server for your team. Choose Continue.dev + Ollama if you want more flexibility and don’t need centralized management.
Frequently Asked Questions
Does Tabby require a GPU?
A GPU is strongly recommended but not required. Tabby supports CPU-only mode (--device cpu), but inference is significantly slower — completions may take 2-5 seconds instead of milliseconds. For a usable experience, an NVIDIA GPU with 4+ GB VRAM is recommended. The smaller models (StarCoder-1B, DeepSeek-Coder-1.3B) work on 2 GB VRAM.
How does Tabby compare to GitHub Copilot?
GitHub Copilot uses cloud-hosted models and costs $10-19/month per user. Tabby runs entirely on your infrastructure — your code never leaves your servers. Copilot generally produces higher-quality completions due to larger models, but Tabby with repository indexing provides context-aware suggestions tailored to your codebase. For privacy-sensitive codebases, Tabby is the clear choice.
Can I use Tabby with Ollama?
Yes. Tabby can connect to Ollama or any OpenAI-compatible API backend instead of serving models itself. Set TABBY_LLM_ENDPOINT=http://ollama:11434 as an environment variable. This lets you use any model Ollama supports while keeping Tabby’s IDE integration and admin dashboard.
Does Tabby support multiple users?
Yes. The admin dashboard includes user management with API token generation. Each user gets their own token for IDE authentication. The admin can monitor per-user usage statistics, see completion acceptance rates, and manage repository access.
What programming languages does Tabby support?
Tabby supports all major programming languages — the underlying models (StarCoder, CodeLlama, DeepSeek-Coder) are trained on datasets covering Python, JavaScript, TypeScript, Java, Go, Rust, C/C++, and dozens more. Quality varies by language — popular languages get better completions than niche ones.
Can I fine-tune models with my own code?
Not directly through Tabby’s interface. However, Tabby supports repository indexing which provides context-aware completions using your codebase without fine-tuning. For actual model fine-tuning, you’d need to fine-tune the model separately and then serve it through Tabby.