Paperless-ngx: OCR Not Working — Fix
The Problem
Documents uploaded to Paperless-ngx process successfully (no errors in the task queue) but contain no searchable text. The document appears in the UI with its filename and tags, but search returns no results for words clearly visible in the scanned image. The content preview shows blank or garbled text.
Updated March 2026: Verified with latest Docker images and configurations.
Common error messages in logs (docker logs paperless-webserver):
[ERROR] [paperless.tasks] Error while consuming document: Tesseract OCR error
[WARNING] [paperless.consumer] No OCR performed on document
Error opening data file /usr/share/tesseract-ocr/5/tessdata/eng.traineddata
OCR skipped: document already contains text layer
The Cause
OCR failures in Paperless-ngx have four common root causes:
| Cause | Symptom | Frequency |
|---|---|---|
| Missing language packs | Error referencing tessdata files | Very common |
| Wrong OCR mode | Documents with existing text layer skipped | Common |
| Insufficient memory | Tesseract killed by OOM during processing | Common on low-RAM servers |
| File permissions | Container can’t read/write OCR temp files | Occasional |
The Fix
Method 1: Install Missing Language Packs
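Paperless-ngx uses Tesseract for OCR. The official Docker image bundles English, German, French, Italian, and Spanish; any other language has to be installed separately. If a document's language is not installed, or PAPERLESS_OCR_LANGUAGE does not name it, Tesseract either fails or produces garbage text.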
Check which languages are installed:
docker exec paperless-webserver tesseract --list-langs
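To compare that list against the languages you actually need, a small shell helper can do the diff. This is only a sketch: missing_langs is a hypothetical function name, and it assumes the output format of recent Tesseract versions (a "List of available languages" header followed by one code per line).

```shell
# missing_langs: read `tesseract --list-langs` output on stdin and print
# every requested language code (passed as arguments) that is not listed.
missing_langs() {
    # Drop the "List of available languages ..." header line.
    installed=$(sed '/^List of available/d' | tr -d '\r')
    for lang in "$@"; do
        # -x matches whole lines only, -F treats the code as a literal string.
        if ! printf '%s\n' "$installed" | grep -qxF "$lang"; then
            printf '%s\n' "$lang"
        fi
    done
}

# Usage (container name taken from the command above):
#   docker exec paperless-webserver tesseract --list-langs | missing_langs eng deu nld
```

Every code it prints is a candidate for PAPERLESS_OCR_LANGUAGES; no output means all requested languages are already installed.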
Add languages by setting PAPERLESS_OCR_LANGUAGES in your docker-compose.yml:
services:
  webserver:
    image: ghcr.io/paperless-ngx/paperless-ngx:2.20.11
    environment:
      # Space-separated Tesseract language codes to install at startup
      # Examples: nld (Dutch), pol (Polish), por (Portuguese), tur (Turkish)
      PAPERLESS_OCR_LANGUAGES: "nld pol"
      # Language(s) Tesseract uses for OCR; join several with "+"
      PAPERLESS_OCR_LANGUAGE: "eng+nld+pol"
Restart the container. Paperless-ngx downloads language packs on startup:
docker compose down && docker compose up -d
Re-process failed documents through the admin panel or by moving them back into the consumption folder.
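Paperless-ngx also ships a document_archiver management command that regenerates the archived (OCR'd) version of documents already in the library, which avoids re-uploading them. The sketch below only prints the commands for review instead of running them. archiver_commands is a hypothetical helper, the document IDs are placeholders, and the webserver service name is assumed to match the compose file above.

```shell
# archiver_commands: print (do not run) one re-OCR command per document ID.
# --overwrite recreates the archive file even if one already exists;
# --document restricts the run to a single document.
archiver_commands() {
    for id in "$@"; do
        printf 'docker compose exec webserver document_archiver --overwrite --document %s\n' "$id"
    done
}

# Usage (12 and 57 are placeholder document IDs):
#   archiver_commands 12 57          # review the output
#   archiver_commands 12 57 | sh     # then run it from the compose directory
```

Printing first keeps the step reversible: nothing is reprocessed until the reviewed output is piped to a shell.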
Method 2: Change the OCR Mode
Paperless-ngx defaults to skip mode, which leaves pages that already have a text layer (as in digitally created PDFs) untouched. If that text layer is corrupt or incomplete, OCR is still skipped even though the document needs it.
Force OCR on all documents by changing the mode:
environment:
  # Options: skip, redo, force
  # skip: OCR only pages without a text layer (default)
  # redo: OCR every page and replace existing OCR text layers
  # force: Rasterize every page, then OCR it, even for text-based PDFs
  PAPERLESS_OCR_MODE: "force"
Warning: force mode significantly increases processing time and disk usage because every page is rasterized and OCR’d, including text-based PDFs that don’t need it. Use redo if you only want to replace broken OCR text layers; it re-runs OCR without rasterizing the pages.
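Before switching modes, it can help to check whether a problem PDF actually carries a text layer. One rough approach, assuming pdftotext from poppler-utils is available on the host (it is not part of this guide's setup), is to count the words it extracts. classify_text_layer and its 20-word threshold are illustrative choices, not Paperless-ngx features.

```shell
# classify_text_layer: read extracted PDF text on stdin and print
#   none    - no words at all (skip mode would OCR this document)
#   sparse  - fewer words than the threshold (text layer may be broken)
#   present - at least the threshold number of words
# $1 is the threshold; 20 is an arbitrary default.
classify_text_layer() {
    threshold=${1:-20}
    words=$(wc -w | tr -d '[:space:]')
    if [ "$words" -eq 0 ]; then
        echo none
    elif [ "$words" -lt "$threshold" ]; then
        echo sparse
    else
        echo present
    fi
}

# Usage (scan.pdf is a placeholder file name):
#   pdftotext scan.pdf - | classify_text_layer
```

A result of sparse or present on a document that search cannot find suggests a bad existing text layer, which is the case redo or force is meant for.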
Method 3: Increase Memory for Tesseract
On servers with 2 GB RAM or less, Tesseract gets killed by the Linux OOM (Out of Memory) killer during processing of large or high-resolution scans. Check for OOM kills:
dmesg | grep -i "oom\|killed process"
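If the kernel log is noisy, it may be easier to summarize which processes were killed. The following is a sketch that assumes the usual "Killed process <pid> (<name>)" wording of Linux OOM messages; oom_victims is a hypothetical helper name.

```shell
# oom_victims: read kernel log lines on stdin and print "<process> <count>"
# for every process name that appears in a "Killed process" message.
oom_victims() {
    sed -n 's/.*Killed process [0-9][0-9]* (\([^)]*\)).*/\1/p' |
        sort | uniq -c | awk '{ print $2, $1 }'
}

# Usage (reading the kernel log may require root):
#   dmesg | oom_victims
```

Entries such as tesseract, ocrmypdf, or gs in the output point at OCR as the memory consumer rather than some other service on the host.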
Solutions, in order of preference:
A. Reduce parallel processing:
environment:
  # Process one document at a time (default: 1)
  PAPERLESS_TASK_WORKERS: "1"
  # Limit threads per worker
  PAPERLESS_THREADS_PER_WORKER: "1"
B. Set a memory limit and swap:
services:
  webserver:
    deploy:
      resources:
        limits:
          memory: 2G
Also ensure your host has swap space configured (at least 2 GB).
C. Add more RAM. 4 GB is the practical minimum for processing high-resolution scans. 2 GB works only with reduced parallelism and low-resolution documents.
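To confirm how much swap the host currently has (relevant for option B), the SwapTotal field of /proc/meminfo can be read directly. A minimal sketch for Linux hosts; swap_total_mib is a hypothetical helper name.

```shell
# swap_total_mib: read /proc/meminfo-style text on stdin and print the
# configured swap size in MiB (the file reports the value in kB).
swap_total_mib() {
    awk '$1 == "SwapTotal:" { printf "%d\n", $2 / 1024 }'
}

# Usage:
#   swap_total_mib < /proc/meminfo
```

A value below 2048 means the host has less than the 2 GB of swap suggested above.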
Method 4: Fix File Permissions
If the container runs as a non-root user, it needs read/write access to the consumption, media, and data directories. Permission errors appear in logs as:
PermissionError: [Errno 13] Permission denied: '/usr/src/paperless/media/documents/originals/...'
Fix with:
# Set ownership to the Paperless user (UID 1000 by default)
sudo chown -R 1000:1000 /path/to/paperless/media
sudo chown -R 1000:1000 /path/to/paperless/data
sudo chown -R 1000:1000 /path/to/paperless/consume
Or set the user ID explicitly in your compose file:
environment:
  USERMAP_UID: "1000"
  USERMAP_GID: "1000"
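To find out which files are affected before (or after) running chown, the directories can be scanned for entries owned by someone else. This is a sketch only: list_foreign_files is a hypothetical helper, and the paths are the same placeholders used above.

```shell
# list_foreign_files: print every file or directory under DIR that is NOT
# owned by the given user ID (default 1000, the Paperless default).
list_foreign_files() {
    dir=$1
    uid=${2:-1000}
    # find's -user test accepts a numeric user ID as well as a user name.
    find "$dir" ! -user "$uid" -print
}

# Usage:
#   list_foreign_files /path/to/paperless/media
#   list_foreign_files /path/to/paperless/consume 1000
```

An empty result means ownership already matches, so any remaining permission errors are more likely caused by file modes or a mismatched USERMAP_UID.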
Prevention
- Set OCR languages at deployment time. Include all languages you might need in PAPERLESS_OCR_LANGUAGES from the start.
- Allocate enough RAM. Plan for 4 GB minimum if processing scanned documents regularly.
- Monitor the task queue. Check http://your-server:8000/admin/ → Tasks for failed processing jobs. Catch OCR issues early.
- Use redo mode for mixed document sources. If you receive both scanned and digital PDFs, redo mode handles both correctly without the overhead of force.
- Keep Docker image updated. Newer Paperless-ngx releases include Tesseract improvements. Check release notes before upgrading.