How to Self-Host Text Generation WebUI

What Is Text Generation WebUI?

Text Generation WebUI (commonly called “Oobabooga”) is a Gradio-based web interface for running large language models locally. It supports an unusually wide range of model formats — GGUF, GPTQ, AWQ, EXL2, and HuggingFace Transformers — and includes LoRA training and fine-tuning, making it the go-to tool for ML enthusiasts who want deep model control.

Prerequisites

  • A Linux server (Ubuntu 22.04+ recommended)
  • Docker and Docker Compose installed (guide)
  • NVIDIA GPU with 8+ GB VRAM (recommended)
  • 16 GB+ system RAM
  • 30 GB+ free disk space
  • NVIDIA Container Toolkit installed (for GPU mode)
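
Before building anything, it's worth confirming that Docker can actually see the GPU. A quick sanity check (the CUDA image tag here is just an example; any recent tag works):

```shell
# Should print the same device table as running nvidia-smi on the host.
docker run --rm --gpus all nvidia/cuda:12.4.1-base-ubuntu22.04 nvidia-smi
```

If this errors out, fix the NVIDIA Container Toolkit installation before continuing.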

Docker Compose Configuration

Text Generation WebUI doesn’t have an official Docker image, but the community-maintained setup works well. Create a docker-compose.yml:

services:
  text-gen-webui:
    build:
      context: .
      dockerfile: Dockerfile
    container_name: text-gen-webui
    ports:
      - "7860:7860"    # Web UI
      - "5000:5000"    # API server
    volumes:
      - ./models:/app/models
      - ./loras:/app/loras
      - ./characters:/app/characters
      - ./presets:/app/presets
      - ./extensions:/app/extensions
    environment:
      - CLI_ARGS=--listen --api --verbose
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
    restart: unless-stopped
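
For a CPU-only host, the deploy: block (the GPU reservation) can be dropped entirely and --cpu added to the CLI arguments. A sketch of the changed environment line:

```yaml
    environment:
      - CLI_ARGS=--listen --api --cpu --verbose
```

Expect much slower inference in this mode, as noted in the troubleshooting section below.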

Create a Dockerfile:

FROM nvidia/cuda:12.4.1-devel-ubuntu22.04

RUN apt-get update && apt-get install -y \
    git python3 python3-pip python3-venv wget \
    && rm -rf /var/lib/apt/lists/*

WORKDIR /app

RUN git clone https://github.com/oobabooga/text-generation-webui.git . && \
    git checkout v1.10.1

RUN pip3 install --no-cache-dir -r requirements.txt && \
    pip3 install --no-cache-dir torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu124

EXPOSE 7860 5000

CMD ["python3", "server.py", "--listen", "--api"]

Alternatively, skip Docker entirely and use the project's own installer, which is simpler:

git clone https://github.com/oobabooga/text-generation-webui
cd text-generation-webui
./start_linux.sh

The startup script creates a conda environment and handles all dependencies automatically.

If you went the Docker route, start the stack:

docker compose up -d --build
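
The first build takes a while (CUDA base image plus Python dependencies). Once it's up, you can confirm the server is answering:

```shell
# Follow the container logs until the server reports it is listening
docker logs -f text-gen-webui

# The web UI should return an HTTP status code once Gradio has started
curl -s -o /dev/null -w "%{http_code}\n" http://localhost:7860
```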

Initial Setup

  1. Open http://your-server:7860 in your browser
  2. Go to the Model tab
  3. Enter a HuggingFace model name (e.g., TheBloke/Mistral-7B-Instruct-v0.2-GGUF)
  4. Click Download and wait for the model to download
  5. Select the model and click Load
  6. Switch to the Chat tab and start chatting

Configuration

Model Loader Selection

| Loader | Model Formats | Best For |
|---|---|---|
| llama.cpp | GGUF | CPU/GPU hybrid inference, quantized models |
| ExLlamaV2 | EXL2, GPTQ | Fastest GPU inference, quantized models |
| Transformers | SafeTensors, HF format | Full precision, training, fine-tuning |
| AutoGPTQ | GPTQ | GPU inference, older GPTQ models |
| AutoAWQ | AWQ | GPU inference, AWQ quantized models |

CLI Arguments

| Argument | Description |
|---|---|
| --listen | Listen on 0.0.0.0 (required for Docker) |
| --api | Enable the OpenAI-compatible API on port 5000 |
| --verbose | Enable detailed logging |
| --cpu | Run on CPU only (slow) |
| --n-gpu-layers N | Number of GPU layers (for llama.cpp) |
| --gpu-memory X | Set GPU VRAM limit in GiB |
| --extensions E1 E2 | Load extensions on startup |
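
For example, to offload part of a GGUF model to the GPU while keeping the rest in system RAM, you might combine flags like this in the compose file (32 layers is only an illustration; the right number depends on the model and your available VRAM):

```yaml
    environment:
      - CLI_ARGS=--listen --api --n-gpu-layers 32 --verbose
```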

Advanced Configuration

LoRA Training

Text Generation WebUI includes a built-in LoRA training interface:

  1. Go to the Training tab
  2. Prepare training data in the expected format (JSON or raw text)
  3. Select a base model (must be loaded in Transformers format)
  4. Configure training parameters (learning rate, epochs, batch size)
  5. Start training — the LoRA adapter is saved to loras/
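
As a rough illustration of the JSON input, an instruction-style dataset might look like the sketch below (the exact keys depend on which format template you select in the Training tab, so treat these field names as an assumption):

```json
[
  {
    "instruction": "What is self-hosting?",
    "output": "Running services on hardware you control instead of relying on hosted platforms."
  },
  {
    "instruction": "Name one benefit of quantized models.",
    "output": "They need far less VRAM than full-precision weights."
  }
]
```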

Extensions

Extensions add functionality. Popular ones include:

  • openai — OpenAI-compatible API server
  • multimodal — Vision model support
  • superboogav2 — RAG (retrieval augmented generation)
  • whisper_stt — Speech-to-text input
  • silero_tts — Text-to-speech output

API Usage

The OpenAI-compatible API runs on port 5000:

curl http://localhost:5000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Mistral-7B-Instruct-v0.2",
    "messages": [{"role": "user", "content": "What is self-hosting?"}]
  }'
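
A quick way to confirm what the server has loaded is the models endpoint. In many setups the "model" field in the request body is informational, and the API simply answers with whichever model is currently loaded:

```shell
curl -s http://localhost:5000/v1/models
```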

Reverse Proxy

Configure your reverse proxy to forward to port 7860 (Web UI) or 5000 (API). WebSocket support is required for the Gradio UI. See Reverse Proxy Setup.

Backup

Back up these directories:

  • models/ — Downloaded models (large, can be re-downloaded)
  • loras/ — Trained LoRA adapters (cannot be re-created without retraining)
  • characters/ — Custom character definitions
  • presets/ — Generation parameter presets

Priority: loras/ and characters/ are irreplaceable. Models can be re-downloaded. See Backup Strategy.

Troubleshooting

CUDA Out of Memory

Symptom: Model fails to load with an OOM error. Fix: Use a smaller quantized model, or lower --n-gpu-layers so fewer layers are offloaded to the GPU. EXL2 or GGUF quantization gives a smaller VRAM footprint.

Model Downloads Slowly

Symptom: Model download from HuggingFace is very slow. Fix: Download models manually using huggingface-cli download and place them in the models/ directory.
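
A sketch of the manual route (the filename is an example; pick the quantization you want from the repository's file list):

```shell
pip install -U "huggingface_hub[cli]"
huggingface-cli download TheBloke/Mistral-7B-Instruct-v0.2-GGUF \
  mistral-7b-instruct-v0.2.Q4_K_M.gguf \
  --local-dir ./models
```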

Gradio UI Won’t Load

Symptom: Port 7860 connection refused. Fix: Ensure --listen flag is set in CLI_ARGS. Check Docker port mapping. Verify the container started successfully: docker logs text-gen-webui.

Extension Not Working

Symptom: Extension doesn’t appear or crashes. Fix: Install extension dependencies inside the container. Some extensions require additional Python packages not included in the base installation.

Resource Requirements

  • VRAM: 4-8 GB for 7B Q4, 8-16 GB for 13B Q4, 16-24 GB for 7B FP16
  • RAM: 8-32 GB (depends on model size and loader)
  • CPU: Medium-high (benefits from more cores for CPU inference)
  • Disk: 5-100 GB per model

Verdict

Text Generation WebUI is the power user’s LLM interface. It supports more model formats and loading backends than any other tool, and the built-in LoRA training is unique. The trade-off is more complex setup and a less polished UI compared to Open WebUI.

Choose Text Generation WebUI if you want LoRA training, EXL2 model support, or deep control over inference parameters. Choose Open WebUI + Ollama for a polished ChatGPT-like experience with simpler setup.

Frequently Asked Questions

Do I need a GPU to run Text Generation WebUI?

A GPU is strongly recommended but not strictly required. You can run small quantized models (7B Q4) on CPU using llama.cpp, but inference will be very slow — expect 1-3 tokens per second versus 30-100+ on a decent GPU. An NVIDIA GPU with 8+ GB VRAM is the practical minimum for usable performance.

How does Text Generation WebUI compare to Ollama + Open WebUI?

Text Generation WebUI offers more model format support (GGUF, GPTQ, AWQ, EXL2), multiple loading backends, and built-in LoRA training. Ollama + Open WebUI provides a simpler setup, a more polished ChatGPT-like interface, and easier model management. Choose Text Generation WebUI for deep control and training; choose Ollama + Open WebUI for ease of use.

Can I use Text Generation WebUI as an OpenAI API drop-in replacement?

Yes. With the --api flag, it exposes an OpenAI-compatible API on port 5000. Applications that support custom OpenAI endpoints can connect to it directly. This includes tools like Continue, Tabby, and most LLM-powered applications that accept a configurable API URL.

What’s the difference between GGUF and GPTQ models?

GGUF (llama.cpp format) allows flexible CPU/GPU splitting — you can offload some layers to GPU and keep the rest in system RAM. GPTQ models are GPU-only but generally faster on full GPU inference. For machines with limited VRAM, GGUF with partial GPU offload is the most practical option. EXL2 offers the fastest GPU inference of all formats.

Can I fine-tune models with Text Generation WebUI?

Yes. The built-in Training tab supports LoRA fine-tuning. Load a base model in Transformers format, prepare your training data as JSON or raw text, configure hyperparameters (learning rate, epochs, batch size), and train. The LoRA adapter saves to the loras/ directory and can be applied on top of the base model during inference.

How much disk space do models require?

It varies widely by model size and quantization. A 7B parameter model in Q4 quantization needs about 4-5 GB. A 13B Q4 model needs 8-10 GB. Full precision (FP16) models are roughly double the quantized size. Plan for 30-100 GB of disk space if you want to keep multiple models available.
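
The arithmetic behind those numbers is simple: file size is roughly parameter count times bits per weight, divided by 8. A back-of-the-envelope check, assuming about 4.5 effective bits per weight for Q4 variants (quantization formats carry some per-block overhead, so this is an approximation):

```shell
# 7B parameters at ~4.5 bits/weight, converted to GB on disk
awk 'BEGIN { printf "%.1f GB\n", 7e9 * 4.5 / 8 / 1e9 }'   # prints "3.9 GB"
```

That lands in the same ballpark as the 4-5 GB figure above; actual file sizes vary with the exact quantization mix.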
