Paperless-ngx: OCR Not Working — Fix

The Problem

Documents uploaded to Paperless-ngx process successfully (no errors in the task queue) but contain no searchable text. The document appears in the UI with its filename and tags, but search returns no results for words clearly visible in the scanned image. The content preview shows blank or garbled text.

Updated March 2026: Verified with latest Docker images and configurations.

Common error messages in logs (docker logs paperless-webserver):

[ERROR] [paperless.tasks] Error while consuming document: Tesseract OCR error

[WARNING] [paperless.consumer] No OCR performed on document

Error opening data file /usr/share/tesseract-ocr/5/tessdata/eng.traineddata

OCR skipped: document already contains text layer
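To pull only the OCR-related lines out of a noisy log, a small filter like the one below may help. It is a sketch: the function name `ocr_log_lines` is made up for illustration, and the pattern simply matches keywords from the messages listed above.

```shell
# Hypothetical helper: keep only log lines that mention OCR problems.
# Reads log text on stdin and prints the matching lines.
ocr_log_lines() {
  # "|| true" keeps the function from failing when nothing matches
  grep -iE 'tesseract|ocr|traineddata|tessdata' || true
}

# Usage (container name taken from the command above):
#   docker logs paperless-webserver 2>&1 | ocr_log_lines
```

If the filter prints nothing, the problem is probably not being logged by the web server container, and the task queue is the next place to look.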

The Cause

OCR failures in Paperless-ngx have four common root causes:

Cause	Symptom	Frequency
Missing language packs	Error referencing `tessdata` files	Very common
Wrong OCR mode	Documents with existing text layer skipped	Common
Insufficient memory	Tesseract killed by OOM during processing	Common on low-RAM servers
File permissions	Container can’t read/write OCR temp files	Occasional

The Fix

Method 1: Install Missing Language Packs

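Paperless-ngx uses Tesseract for OCR. The official Docker image ships with English, German, Italian, Spanish, and French; any other language needs its pack installed. If a document is in a language Tesseract has no data for, or PAPERLESS_OCR_LANGUAGE points at the wrong language, OCR fails or produces garbage text.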

Check which languages are installed:

docker exec paperless-webserver tesseract --list-langs
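To check for one specific language rather than reading the whole list, the output can be piped through a small test. This is a minimal sketch; `has_tesseract_lang` is a hypothetical helper name, not part of Paperless-ngx or Tesseract.

```shell
# Hypothetical helper: succeed if language code $1 appears as a whole line
# in the `tesseract --list-langs` output supplied on stdin.
has_tesseract_lang() {
  grep -qxF -- "$1"
}

# Usage:
#   docker exec paperless-webserver tesseract --list-langs 2>&1 \
#     | has_tesseract_lang deu && echo "deu installed" || echo "deu missing"
```

Matching whole lines avoids false positives, such as a code that only appears inside the header line with the tessdata path.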

Install additional languages with PAPERLESS_OCR_LANGUAGES, and choose which installed language(s) Tesseract actually uses with PAPERLESS_OCR_LANGUAGE, in your docker-compose.yml:

services:
  webserver:
    image: ghcr.io/paperless-ngx/paperless-ngx:2.20.11
    environment:
      # Space-separated Tesseract language codes to install
      # Common: deu (German), fra (French), spa (Spanish), ita (Italian)
      PAPERLESS_OCR_LANGUAGES: "eng deu fra"
      # Language used for OCR; join several with "+", e.g. "deu+eng"
      PAPERLESS_OCR_LANGUAGE: "eng"

Restart the container. Paperless-ngx downloads language packs on startup:

docker compose down && docker compose up -d

Re-process failed documents through the admin panel or by moving them back into the consumption folder.
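For documents that are already in the archive, Paperless-ngx also ships a `document_archiver` management command that should regenerate the archived (OCR'd) version. The sketch below only prints the commands so they can be reviewed before running anything. `reocr_commands` is a hypothetical helper, the container name is assumed to match the one used above, and the document IDs are whatever the web UI shows for the affected documents.

```shell
# Hypothetical helper: print (not run) one re-OCR command per document ID.
# Assumes the container is named paperless-webserver, as elsewhere in this guide.
reocr_commands() {
  for id in "$@"; do
    printf 'docker exec paperless-webserver document_archiver --overwrite --document %s\n' "$id"
  done
}

# Usage:
#   reocr_commands 12 15 31          # review the output
#   reocr_commands 12 15 31 | sh     # then run it
```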

Method 2: Change the OCR Mode

Paperless-ngx defaults to skip mode, which leaves pages that already have a text layer untouched (typical for digitally created PDFs and for scanners that run their own OCR). If that text layer is corrupt or incomplete, OCR is skipped even though the document needs it.

Force OCR on all documents by changing the mode:

environment:
  # Options: skip, redo, force
  # skip: OCR only pages that have no text layer (default)
  # redo: OCR every page and replace existing OCR text layers
  # force: Rasterize every page and OCR it, even text-based PDFs
  PAPERLESS_OCR_MODE: "force"

Warning: force mode significantly increases processing time and disk usage because every page is rasterized and OCR’d, including text-based PDFs that don’t need it. Use redo if the goal is to replace a poor OCR text layer (for example, one written by a scanner’s built-in OCR) without rasterizing the pages.
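Before switching modes, it can be useful to check whether a given PDF actually carries a text layer. One way, assuming `pdftotext` from poppler-utils is installed on the host (it is not part of Paperless-ngx), is to count the words it extracts. The helper name and the `PDFTOTEXT` override variable are illustrative.

```shell
# Hypothetical helper: print how many words a PDF's text layer contains.
# Assumes `pdftotext` (poppler-utils) is on PATH; set PDFTOTEXT to override.
# Prints 0 if the extractor fails or is missing, so check that it is installed.
pdf_word_count() {
  "${PDFTOTEXT:-pdftotext}" "$1" - 2>/dev/null | wc -w | tr -d ' '
}

# Usage:
#   pdf_word_count scan.pdf
#   # 0 or only a handful of words suggests there is no usable text layer
```

A document with a healthy word count that still cannot be searched points away from the OCR mode and toward indexing or language settings.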

Method 3: Increase Memory for Tesseract

On servers with 2 GB of RAM or less, the Linux OOM (out-of-memory) killer can terminate Tesseract while it processes large or high-resolution scans. Check the kernel log for OOM kills:

dmesg | grep -i "oom\|killed process"
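To confirm that Tesseract, rather than some other process, is the one being killed, the kernel messages can be reduced to a per-process count. This is a sketch that assumes the usual "Killed process PID (name)" wording of the kernel log; `oom_victims` is a hypothetical helper name.

```shell
# Hypothetical helper: list which processes the OOM killer terminated,
# with a count per process name. Reads kernel log text on stdin.
oom_victims() {
  sed -nE 's/.*Killed process [0-9]+ \(([^)]+)\).*/\1/p' | sort | uniq -c
}

# Usage (may need sudo on hosts that restrict dmesg):
#   dmesg | oom_victims
```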

Solutions, in order of preference:

A. Reduce parallel processing:

environment:
  # Process one document at a time (default: 1)
  PAPERLESS_TASK_WORKERS: "1"
  # Limit threads per worker
  PAPERLESS_THREADS_PER_WORKER: "1"

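B. Cap the container's memory and add swap:

services:
  webserver:
    deploy:
      resources:
        limits:
          memory: 2G

The limit does not give Tesseract more memory; it keeps a runaway OCR job from starving the rest of the host. Also ensure your host has swap space configured (at least 2 GB).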
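A quick way to see how much swap the host has is to read it from `/proc/meminfo`. The sketch below assumes a Linux host; `swap_total_mib` is a hypothetical helper, and the optional file argument exists only to make it easy to try against a saved copy.

```shell
# Hypothetical helper: print total swap in MiB, read from a meminfo-style file
# (defaults to /proc/meminfo on Linux).
swap_total_mib() {
  awk '/^SwapTotal:/ { print int($2 / 1024) }' "${1:-/proc/meminfo}"
}

# Usage (2000 rather than 2048 leaves room for rounding):
#   [ "$(swap_total_mib)" -ge 2000 ] || echo "less than about 2 GB of swap configured"
```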

C. Add more RAM. 4 GB is the practical minimum for processing high-resolution scans. 2 GB works only with reduced parallelism and low-resolution documents.

Method 4: Fix File Permissions

If the container runs as a non-root user, it needs read/write access to the consumption, media, and data directories. Permission errors appear in logs as:

PermissionError: [Errno 13] Permission denied: '/usr/src/paperless/media/documents/originals/...'

Fix with:

# Set ownership to the Paperless user (UID 1000 by default)
sudo chown -R 1000:1000 /path/to/paperless/media
sudo chown -R 1000:1000 /path/to/paperless/data
sudo chown -R 1000:1000 /path/to/paperless/consume

Or set the user ID explicitly in your compose file:

environment:
  USERMAP_UID: "1000"
  USERMAP_GID: "1000"
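To see which directories actually need the ownership fix, the current owner can be compared against the expected UID and GID first. This is a minimal sketch that relies on GNU `stat`, as found on typical Linux hosts; `check_owner` is a hypothetical helper and the paths are the same placeholders used above.

```shell
# Hypothetical helper: report directories not owned by the expected UID:GID.
# Usage: check_owner EXPECTED_UID:GID DIR...
# Returns non-zero if any directory is missing or has a different owner.
check_owner() {
  expected=$1; shift
  status=0
  for dir in "$@"; do
    actual=$(stat -c '%u:%g' "$dir") || { status=1; continue; }
    if [ "$actual" != "$expected" ]; then
      echo "$dir is owned by $actual, expected $expected"
      status=1
    fi
  done
  return "$status"
}

# Usage:
#   check_owner 1000:1000 /path/to/paperless/media /path/to/paperless/data /path/to/paperless/consume
```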

Prevention

Set OCR languages at deployment time. Include all languages you might need in PAPERLESS_OCR_LANGUAGES from the start.
Allocate enough RAM. Plan for 4 GB minimum if processing scanned documents regularly.
Monitor the task queue. Check http://your-server:8000/admin/ → Tasks for failed processing jobs. Catch OCR issues early.
Use redo mode for mixed document sources. If you receive both scanned and digital PDFs, redo mode handles both correctly without the overhead of force.
Keep the Docker image updated. Newer Paperless-ngx releases include Tesseract improvements. Check release notes before upgrading.

