How to Self-Host Text Generation WebUI
What Is Text Generation WebUI?
Text Generation WebUI (commonly called “Oobabooga”) is a Gradio-based web interface for running large language models locally. It supports the widest range of model formats of any LLM interface — GGUF, GPTQ, AWQ, EXL2, and HuggingFace Transformers. It also supports LoRA training and fine-tuning, making it the go-to tool for ML enthusiasts who want deep model control.
Prerequisites
- A Linux server (Ubuntu 22.04+ recommended)
- Docker and Docker Compose installed
- NVIDIA GPU with 8+ GB VRAM (recommended)
- 16 GB+ system RAM
- 30 GB+ free disk space
- NVIDIA Container Toolkit installed (for GPU mode; see the quick check below)
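Before building anything, it's worth confirming that Docker can actually see the GPU. A quick check (the CUDA image tag is just an example; any recent tag works):

```bash
# Should print the nvidia-smi GPU table; if it errors, revisit the Container Toolkit install
docker run --rm --gpus all nvidia/cuda:12.4.1-base-ubuntu22.04 nvidia-smi
```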
Docker Compose Configuration
Text Generation WebUI doesn’t have an official Docker image, but the community-maintained setup works well. Create a docker-compose.yml:
```yaml
services:
  text-gen-webui:
    build:
      context: .
      dockerfile: Dockerfile
    container_name: text-gen-webui
    ports:
      - "7860:7860"   # Web UI
      - "5000:5000"   # API server
    volumes:
      - ./models:/app/models
      - ./loras:/app/loras
      - ./characters:/app/characters
      - ./presets:/app/presets
      - ./extensions:/app/extensions
    environment:
      - CLI_ARGS=--listen --api --verbose
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
    restart: unless-stopped
```
Create a Dockerfile:
```dockerfile
FROM nvidia/cuda:12.4.1-devel-ubuntu22.04

RUN apt-get update && apt-get install -y \
    git python3 python3-pip python3-venv wget \
    && rm -rf /var/lib/apt/lists/*

WORKDIR /app

RUN git clone https://github.com/oobabooga/text-generation-webui.git . && \
    git checkout v1.10.1

RUN pip3 install --no-cache-dir -r requirements.txt && \
    pip3 install --no-cache-dir torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu124

EXPOSE 7860 5000

# Shell form so the CLI_ARGS variable from docker-compose.yml is expanded at runtime;
# falls back to sane defaults if it is unset
CMD python3 server.py ${CLI_ARGS:---listen --api}
```
Alternatively, use the official installation method, which is simpler:

```bash
git clone https://github.com/oobabooga/text-generation-webui
cd text-generation-webui
./start_linux.sh
```
The startup script creates a conda environment and handles all dependencies automatically.
Start the stack:

```bash
docker compose up -d --build
```
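The first build compiles several Python wheels and can take a while. To watch progress and confirm the UI is up:

```bash
# Follow startup logs; Gradio prints "Running on local URL" when ready
docker logs -f text-gen-webui

# Expect a 200 once the UI is listening
curl -s -o /dev/null -w '%{http_code}\n' http://localhost:7860
```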
Initial Setup
- Open http://your-server:7860 in your browser
- Go to the Model tab
- Enter a HuggingFace model name (e.g., TheBloke/Mistral-7B-Instruct-v0.2-GGUF)
- Click Download and wait for the model to download (or fetch it from the command line, as shown below)
- Select the model and click Load
- Switch to the Chat tab and start chatting
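If you prefer the command line, the repository ships a download-model.py helper that saves models into models/ (the model name here is only an example):

```bash
# Download a model from inside the running container
docker exec -it text-gen-webui python3 download-model.py TheBloke/Mistral-7B-Instruct-v0.2-GGUF
```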
Configuration
Model Loader Selection
| Loader | Model Formats | Best For |
|---|---|---|
| llama.cpp | GGUF | CPU/GPU hybrid inference, quantized models |
| ExLlamaV2 | EXL2, GPTQ | Fastest GPU inference, quantized models |
| Transformers | SafeTensors, HF format | Full precision, training, fine-tuning |
| AutoGPTQ | GPTQ | GPU inference, older GPTQ models |
| AutoAWQ | AWQ | GPU inference, AWQ quantized models |
CLI Arguments
| Argument | Description |
|---|---|
| --listen | Listen on 0.0.0.0 (required for Docker) |
| --api | Enable the OpenAI-compatible API on port 5000 |
| --verbose | Enable detailed logging |
| --cpu | Run on CPU only (slow) |
| --n-gpu-layers N | Number of layers to offload to the GPU (llama.cpp loader) |
| --gpu-memory X | GPU VRAM limit in GiB |
| --extensions E1 E2 | Load extensions on startup |
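For instance, to offload most of a 7B GGUF model to the GPU via the llama.cpp loader, the environment line in docker-compose.yml might read as follows (35 layers is a per-model guess, not a universal value):

```yaml
environment:
  # Lower the layer count if the model fails to load with an OOM error
  - CLI_ARGS=--listen --api --verbose --n-gpu-layers 35
```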
Advanced Configuration
LoRA Training
Text Generation WebUI includes a built-in LoRA training interface:
- Go to the Training tab
- Prepare training data in the expected format (JSON or raw text; see the sketch after this list)
- Select a base model (it must be loaded with the Transformers loader)
- Configure training parameters (learning rate, epochs, batch size)
- Start training; the LoRA adapter is saved to loras/
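As a rough sketch, an instruction-style JSON dataset could look like the following. The field names and the training/datasets/ path are assumptions based on the common alpaca-style template; match whatever format you select in the Training tab:

```bash
# Hypothetical example dataset; keys must match the chosen format template
cat > training/datasets/my-dataset.json <<'EOF'
[
  {
    "instruction": "Summarize the benefits of self-hosting.",
    "output": "You control your data, your costs, and your uptime."
  },
  {
    "instruction": "Name one trade-off of self-hosting.",
    "output": "You are responsible for updates, backups, and security."
  }
]
EOF
```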
Extensions
Extensions add functionality. Popular ones include:
- openai — OpenAI-compatible API server
- multimodal — Vision model support
- superboogav2 — RAG (retrieval augmented generation)
- whisper_stt — Speech-to-text input
- silero_tts — Text-to-speech output
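Extensions are enabled at startup with the --extensions flag, and many ship their own Python dependencies. A sketch of both steps (the requirements path is typical for extensions in this repo, but not guaranteed for every one):

```bash
# Enable extensions by adding them to CLI_ARGS in docker-compose.yml, e.g.:
#   CLI_ARGS=--listen --api --extensions whisper_stt silero_tts
# Then install an extension's dependencies inside the container:
docker exec -it text-gen-webui pip3 install -r extensions/silero_tts/requirements.txt
```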
API Usage
The OpenAI-compatible API runs on port 5000:
```bash
curl http://localhost:5000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Mistral-7B-Instruct-v0.2",
    "messages": [{"role": "user", "content": "What is self-hosting?"}]
  }'
```
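Since the server follows the usual OpenAI endpoint layout, listing the loaded model makes a quick smoke test:

```bash
# Should return the name of the currently loaded model
curl http://localhost:5000/v1/models
```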
Reverse Proxy
Configure your reverse proxy to forward to port 7860 (Web UI) or 5000 (API). WebSocket support is required for the Gradio UI. See Reverse Proxy Setup.
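As a minimal nginx sketch (TLS and server_name omitted; the Upgrade headers are what Gradio's WebSocket connection needs):

```nginx
location / {
    proxy_pass http://127.0.0.1:7860;
    # Without these headers the page loads but the UI never connects
    proxy_http_version 1.1;
    proxy_set_header Upgrade $http_upgrade;
    proxy_set_header Connection "upgrade";
    proxy_set_header Host $host;
}
```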
Backup
Back up these directories:
- models/ — Downloaded models (large, can be re-downloaded)
- loras/ — Trained LoRA adapters (cannot be re-created without retraining)
- characters/ — Custom character definitions
- presets/ — Generation parameter presets
Priority: loras/ and characters/ are irreplaceable. Models can be re-downloaded. See Backup Strategy.
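A simple approach is a dated tarball of the irreplaceable directories (paths assume the compose layout above):

```bash
# models/ is deliberately skipped; it can be re-downloaded
tar czf text-gen-webui-backup-$(date +%F).tar.gz loras/ characters/ presets/
```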
Troubleshooting
CUDA Out of Memory
Symptom: Model fails to load with OOM error.
Fix: Use a smaller quantized model. Reduce --n-gpu-layers so fewer layers are offloaded to the GPU. Use EXL2 or GGUF quantization for a smaller VRAM footprint.
Model Downloads Slowly
Symptom: Model download from HuggingFace is very slow.
Fix: Download models manually using huggingface-cli download and place them in the models/ directory.
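For example (repository and file names are illustrative; pick the quantization you actually want):

```bash
# Fetch a single GGUF file straight into the models directory
huggingface-cli download TheBloke/Mistral-7B-Instruct-v0.2-GGUF \
  mistral-7b-instruct-v0.2.Q4_K_M.gguf --local-dir models/
```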
Gradio UI Won’t Load
Symptom: Port 7860 connection refused.
Fix: Ensure --listen flag is set in CLI_ARGS. Check Docker port mapping. Verify the container started successfully: docker logs text-gen-webui.
Extension Not Working
Symptom: Extension doesn't appear or crashes.
Fix: Install the extension's dependencies inside the container. Some extensions require additional Python packages not included in the base installation.
Resource Requirements
- VRAM: 4-8 GB for 7B Q4, 8-16 GB for 13B Q4, 16-24 GB for 7B FP16
- RAM: 8-32 GB (depends on model size and loader)
- CPU: Medium-high (benefits from more cores for CPU inference)
- Disk: 5-100 GB per model
Verdict
Text Generation WebUI is the power user’s LLM interface. It supports more model formats and loading backends than any other tool, and the built-in LoRA training is unique. The trade-off is more complex setup and a less polished UI compared to Open WebUI.
Choose Text Generation WebUI if you want LoRA training, EXL2 model support, or deep control over inference parameters. Choose Open WebUI + Ollama for a polished ChatGPT-like experience with simpler setup.
Frequently Asked Questions
Do I need a GPU to run Text Generation WebUI?
A GPU is strongly recommended but not strictly required. You can run small quantized models (7B Q4) on CPU using llama.cpp, but inference will be very slow — expect 1-3 tokens per second versus 30-100+ on a decent GPU. An NVIDIA GPU with 8+ GB VRAM is the practical minimum for usable performance.
How does Text Generation WebUI compare to Ollama + Open WebUI?
Text Generation WebUI offers more model format support (GGUF, GPTQ, AWQ, EXL2), multiple loading backends, and built-in LoRA training. Ollama + Open WebUI provides a simpler setup, a more polished ChatGPT-like interface, and easier model management. Choose Text Generation WebUI for deep control and training; choose Ollama + Open WebUI for ease of use.
Can I use Text Generation WebUI as an OpenAI API drop-in replacement?
Yes. With the --api flag, it exposes an OpenAI-compatible API on port 5000. Applications that support custom OpenAI endpoints can connect to it directly. This includes tools like Continue, Tabby, and most LLM-powered applications that accept a configurable API URL.
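For tools built on the official OpenAI client libraries, pointing them at the local server is usually just two environment variables (names follow the openai-python convention; other clients may differ):

```bash
# Route OpenAI-compatible clients to the local API
export OPENAI_BASE_URL=http://localhost:5000/v1
# A placeholder: the local API doesn't validate keys unless started with --api-key
export OPENAI_API_KEY=dummy
```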
What’s the difference between GGUF and GPTQ models?
GGUF (llama.cpp format) allows flexible CPU/GPU splitting — you can offload some layers to GPU and keep the rest in system RAM. GPTQ models are GPU-only but generally faster on full GPU inference. For machines with limited VRAM, GGUF with partial GPU offload is the most practical option. EXL2 offers the fastest GPU inference of all formats.
Can I fine-tune models with Text Generation WebUI?
Yes. The built-in Training tab supports LoRA fine-tuning. Load a base model in Transformers format, prepare your training data as JSON or raw text, configure hyperparameters (learning rate, epochs, batch size), and train. The LoRA adapter saves to the loras/ directory and can be applied on top of the base model during inference.
How much disk space do models require?
It varies widely by model size and quantization. A 7B parameter model in Q4 quantization needs about 4-5 GB. A 13B Q4 model needs 8-10 GB. Full precision (FP16) models are roughly three to four times the size of a Q4 quant (about 14 GB for a 7B model). Plan for 30-100 GB of disk space if you want to keep multiple models available.