# How to Self-Host vLLM with Docker Compose
## What Is vLLM?
vLLM is a high-throughput LLM inference engine designed for production serving. It introduced PagedAttention, which manages GPU memory the way an operating system manages virtual memory — dramatically improving throughput for concurrent requests. vLLM serves an OpenAI-compatible API, making it a drop-in backend for applications built against the OpenAI API.
## Prerequisites
- A Linux server (Ubuntu 22.04+ recommended)
- Docker and Docker Compose installed
- NVIDIA GPU with 16+ GB VRAM (required — no CPU mode)
- NVIDIA Container Toolkit installed
- 30 GB+ free disk space (for model downloads)
- HuggingFace account (for gated models like Llama)
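Before writing the Compose file, confirm Docker can actually see the GPU. A quick smoke test (the CUDA image tag here is illustrative; any CUDA base image works):

```bash
# Should print the same GPU table as running nvidia-smi directly on the host
docker run --rm --gpus all nvidia/cuda:12.4.1-base-ubuntu22.04 nvidia-smi
```

If this fails, fix the NVIDIA Container Toolkit installation before continuing.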
## Docker Compose Configuration
Create a `docker-compose.yml` file:
```yaml
services:
  vllm:
    image: vllm/vllm-openai:v0.17.1
    container_name: vllm
    ports:
      - "8000:8000"
    volumes:
      - huggingface_cache:/root/.cache/huggingface
    environment:
      # Required for gated models (Llama, Mistral, etc.); set in .env below
      - HUGGING_FACE_HUB_TOKEN=${HUGGING_FACE_HUB_TOKEN}
    command: >
      --model mistralai/Mistral-7B-Instruct-v0.3
      --max-model-len 4096
      --gpu-memory-utilization 0.9
      --host 0.0.0.0
      --port 8000
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
    ipc: host
    restart: unless-stopped

volumes:
  huggingface_cache:
```
Create a `.env` file:
```env
# Get your token from https://huggingface.co/settings/tokens
HUGGING_FACE_HUB_TOKEN=hf_your_token_here
```
Start the stack:
```bash
docker compose up -d
```
The first start downloads the model from HuggingFace (may take several minutes depending on model size and connection speed).
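You can follow the download and engine startup in the logs; the API only starts answering once the model is fully loaded:

```bash
docker logs -f vllm
```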
## Initial Setup
Verify the API is responding:
```bash
curl http://localhost:8000/v1/models
```
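The response should list the loaded model. It looks roughly like this (exact fields vary by vLLM version):

```json
{
  "object": "list",
  "data": [
    {
      "id": "mistralai/Mistral-7B-Instruct-v0.3",
      "object": "model",
      "owned_by": "vllm"
    }
  ]
}
```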
Test with a chat completion:
```bash
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "mistralai/Mistral-7B-Instruct-v0.3",
    "messages": [{"role": "user", "content": "What is self-hosting?"}],
    "max_tokens": 200
  }'
```
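The endpoint also supports OpenAI-style streaming. Add `"stream": true` and the server responds with server-sent events (`-N` stops curl from buffering the output):

```bash
curl -N http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "mistralai/Mistral-7B-Instruct-v0.3",
    "messages": [{"role": "user", "content": "What is self-hosting?"}],
    "max_tokens": 200,
    "stream": true
  }'
```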
## Configuration
### Key Command-Line Arguments
| Argument | Default | Description |
|---|---|---|
| `--model` | Required | HuggingFace model ID or local path |
| `--max-model-len` | Model's max | Maximum context window length |
| `--gpu-memory-utilization` | 0.9 | Fraction of GPU VRAM to use (0.0-1.0) |
| `--tensor-parallel-size` | 1 | Number of GPUs for tensor parallelism |
| `--dtype` | auto | Data type: `auto`, `float16`, `bfloat16` |
| `--quantization` | None | Quantization method: `awq`, `gptq`, `fp8` |
| `--max-num-seqs` | 256 | Max concurrent sequences |
| `--host` | 0.0.0.0 | Host to bind |
| `--port` | 8000 | Port to listen on |
| `--api-key` | None | API key for authentication |
### Popular Models
| Model | VRAM Required | Notes |
|---|---|---|
| `mistralai/Mistral-7B-Instruct-v0.3` | ~16 GB | Good balance of quality and speed |
| `meta-llama/Llama-3.1-8B-Instruct` | ~18 GB | Requires HF token (gated) |
| `Qwen/Qwen2.5-7B-Instruct` | ~16 GB | Strong multilingual support |
| `TheBloke/Mistral-7B-Instruct-v0.2-AWQ` | ~6 GB | AWQ quantized, less VRAM |
### Multi-GPU Setup
For models that don't fit on a single GPU:
```yaml
command: >
  --model meta-llama/Llama-3.1-70B-Instruct
  --tensor-parallel-size 2
  --max-model-len 4096
  --gpu-memory-utilization 0.9
```
Set `--tensor-parallel-size` to the number of GPUs available.
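After startup, verify the model is actually sharded; every GPU should report a similar memory footprint from the vLLM process:

```bash
# Each GPU index should show roughly equal memory usage once the model is loaded
nvidia-smi --query-gpu=index,memory.used,memory.total --format=csv
```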
## Advanced Configuration
### Quantized Models (Lower VRAM)
Run AWQ or GPTQ quantized models to reduce VRAM requirements:
```yaml
command: >
  --model TheBloke/Mistral-7B-Instruct-v0.2-AWQ
  --quantization awq
  --max-model-len 4096
```
### API Key Authentication
```yaml
command: >
  --model mistralai/Mistral-7B-Instruct-v0.3
  --api-key your-secret-api-key
```
Clients must include `Authorization: Bearer your-secret-api-key` in requests.
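For example, the earlier model-list check becomes:

```bash
curl http://localhost:8000/v1/models \
  -H "Authorization: Bearer your-secret-api-key"
```

Requests without the header are rejected.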
### Speculative Decoding
For faster generation with a draft model (note that speculative decoding flags have changed across vLLM releases; check the documentation for your pinned version):
```yaml
command: >
  --model meta-llama/Llama-3.1-8B-Instruct
  --speculative-model meta-llama/Llama-3.2-1B-Instruct
  --num-speculative-tokens 5
```
## Reverse Proxy
Configure your reverse proxy to forward to port 8000. Streaming responses are delivered as server-sent events, so disable response buffering on this route (e.g. `proxy_buffering off;` in nginx). See Reverse Proxy Setup.
## Backup
The HuggingFace cache volume stores downloaded models. Models can be re-downloaded from HuggingFace, so backups are optional but save time on re-deployment.
```bash
docker run --rm -v huggingface_cache:/data -v $(pwd):/backup alpine \
  tar czf /backup/vllm-models-backup.tar.gz /data
```
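To restore on another host, reverse the operation. Note that Compose usually prefixes named volumes with the project name (e.g. `myproject_huggingface_cache`); check `docker volume ls` and adjust both commands to match:

```bash
# The archive stores paths as data/..., so extracting at / refills the volume
docker run --rm -v huggingface_cache:/data -v $(pwd):/backup alpine \
  tar xzf /backup/vllm-models-backup.tar.gz -C /
```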
See Backup Strategy for a comprehensive approach.
## Troubleshooting
### CUDA Out of Memory
Symptom: `torch.cuda.OutOfMemoryError` on startup.
Fix: Reduce `--max-model-len` (try 2048). Lower `--gpu-memory-utilization` to 0.8. Use a quantized model (AWQ/GPTQ). Use `--tensor-parallel-size 2` if you have multiple GPUs.
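A reduced-footprint variant of the Compose command, combining the first two fixes:

```yaml
command: >
  --model mistralai/Mistral-7B-Instruct-v0.3
  --max-model-len 2048
  --gpu-memory-utilization 0.8
```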
### Model Download Fails
Symptom: 401 or 403 error when downloading gated models.
Fix: Set `HUGGING_FACE_HUB_TOKEN` to a valid HuggingFace token. Accept the model's license on the HuggingFace website first.
### Slow First Request
Symptom: First request takes 30+ seconds after startup.
Fix: This is normal. vLLM compiles CUDA kernels on first use, and subsequent requests are fast. Account for this warm-up period in your health checks, as in the sketch below.
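A Compose healthcheck that tolerates the warm-up might look like this sketch. It assumes vLLM's `/health` endpoint and uses Python (which the image ships, since it runs vLLM) rather than curl, which may not be present in the image:

```yaml
healthcheck:
  # /health returns 200 once the engine is ready; start_period covers model load
  test: ["CMD", "python3", "-c", "import urllib.request; urllib.request.urlopen('http://localhost:8000/health')"]
  interval: 30s
  timeout: 5s
  retries: 3
  start_period: 120s
```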
### Container Crashes Immediately
Symptom: Container exits with code 1.
Fix: Check `docker logs vllm`. Common causes: GPU not detected (install the NVIDIA Container Toolkit), model too large for available VRAM, missing `ipc: host` in the Compose file.
## Resource Requirements
- VRAM: 16-24 GB for 7B models (FP16), 6-8 GB for quantized
- RAM: 16 GB+ system RAM recommended
- CPU: Moderate (GPU does the heavy lifting)
- Disk: 10-50 GB per model
## Verdict
vLLM is the production-grade LLM serving engine. If you need to serve multiple concurrent users with consistent latency, vLLM’s PagedAttention and continuous batching make it 2-5x more efficient than sequential inference engines. The trade-off is a hard GPU requirement and more complex setup.
Choose vLLM if you’re building an application that serves LLM responses to multiple users. Choose Ollama if you want simpler setup for personal use or small teams.
## Frequently Asked Questions
### How does vLLM compare to Ollama?
vLLM is a production inference engine optimized for high-throughput concurrent serving with PagedAttention and continuous batching. Ollama is designed for ease of use — simple CLI, one-command model downloads, and lower setup complexity. Choose vLLM for multi-user production workloads; choose Ollama for personal use or small teams.
### Does vLLM require a GPU?
Yes. vLLM requires an NVIDIA GPU with CUDA support. There is no CPU inference mode. This is a hard requirement — vLLM’s entire value proposition is GPU-accelerated high-throughput serving. For CPU inference, use Ollama with llama.cpp or Text Generation WebUI.
### Is vLLM's API compatible with the OpenAI API?
Yes. vLLM provides an OpenAI-compatible API server out of the box. Applications that support custom OpenAI endpoints can connect directly. It supports the `/v1/completions`, `/v1/chat/completions`, and `/v1/embeddings` endpoints.
### What is PagedAttention and why does it matter?
PagedAttention is vLLM’s key innovation — it manages GPU memory for attention key-value caches the way operating systems manage virtual memory with paging. This eliminates memory waste from pre-allocation, allowing vLLM to serve 2-5x more concurrent requests than sequential inference engines with the same GPU.
### Can vLLM serve multiple models simultaneously?
A single vLLM instance serves one model at a time. To serve multiple models, run separate vLLM instances on different ports. With --tensor-parallel-size, you can split a single large model across multiple GPUs, but you can’t load different models on the same instance.
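A sketch of two instances in one Compose file, each pinned to its own GPU via `device_ids` (service names and models are illustrative; the volume, token, and other settings from the main example still apply):

```yaml
services:
  vllm-mistral:
    image: vllm/vllm-openai:v0.17.1
    ports:
      - "8000:8000"
    command: --model mistralai/Mistral-7B-Instruct-v0.3
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              device_ids: ["0"]
              capabilities: [gpu]
  vllm-qwen:
    image: vllm/vllm-openai:v0.17.1
    ports:
      - "8001:8000"
    command: --model Qwen/Qwen2.5-7B-Instruct
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              device_ids: ["1"]
              capabilities: [gpu]
```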
### What model formats does vLLM support?
vLLM supports HuggingFace Transformers format (safetensors), AWQ quantization, and GPTQ quantization. It does not support GGUF (llama.cpp format). For GGUF models, use Ollama or Text Generation WebUI instead.