Paperless-ngx: OCR Not Working — Fix

The Problem

Documents uploaded to Paperless-ngx process successfully (no errors in the task queue) but contain no searchable text. The document appears in the UI with its filename and tags, but search returns no results for words clearly visible in the scanned image. The content preview shows blank or garbled text.

Updated March 2026: Verified with latest Docker images and configurations.

Common error messages in logs (docker logs paperless-webserver):

[ERROR] [paperless.tasks] Error while consuming document: Tesseract OCR error
[WARNING] [paperless.consumer] No OCR performed on document
Error opening data file /usr/share/tesseract-ocr/5/tessdata/eng.traineddata
OCR skipped: document already contains text layer

The Cause

OCR failures in Paperless-ngx have four common root causes:

Cause | Symptom | Frequency
Missing language packs | Error referencing tessdata files | Very common
Wrong OCR mode | Documents with existing text layer skipped | Common
Insufficient memory | Tesseract killed by OOM during processing | Common on low-RAM servers
File permissions | Container can’t read/write OCR temp files | Occasional
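
As a rough triage aid, the log signatures quoted in this article can be mapped to the four causes with a small shell filter. This is a minimal sketch: the function name triage_ocr_log is hypothetical, the patterns are taken only from the messages shown in this article, and the OOM patterns assume typical kernel log wording, so treat a match as a hint rather than a diagnosis.

```shell
# Hypothetical helper: reads log lines on stdin and prints a likely cause
# in front of every line that matches a known signature.
triage_ocr_log() {
  while IFS= read -r line; do
    case "$line" in
      *traineddata*)                        echo "language-pack: $line" ;;
      *"already contains text layer"*)      echo "ocr-mode: $line" ;;
      *"Permission denied"*)                echo "permissions: $line" ;;
      *"Killed process"*|*"Out of memory"*) echo "memory: $line" ;;
    esac
  done
}
```

Typical use would be docker logs paperless-webserver 2>&1 | triage_ocr_log for container logs, and dmesg | triage_ocr_log on the host for OOM kills, since those appear in the kernel log rather than in the container output.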

The Fix

Method 1: Install Missing Language Packs

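Paperless-ngx uses Tesseract for OCR. The official Docker image bundles English, German, Italian, Spanish, and French; any other language must be installed through PAPERLESS_OCR_LANGUAGES. Installing a language is not enough on its own: OCR only uses the languages named in PAPERLESS_OCR_LANGUAGE (default: eng). If your documents are in a language that is missing or not selected, Tesseract either fails with a tessdata error or produces garbage text.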

Check which languages are installed:

docker exec paperless-webserver tesseract --list-langs
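
To compare that list against the languages you actually need, the output can be piped through a small filter. This is a sketch, not part of Paperless-ngx: missing_langs is a hypothetical helper, and it assumes tesseract --list-langs prints one language code per line after a header line.

```shell
# Hypothetical helper: reads `tesseract --list-langs` output on stdin and
# prints each required language code (given as arguments) that is absent.
missing_langs() {
  installed=$(cat)
  for lang in "$@"; do
    # -x matches whole lines only, so the header line never matches a code
    printf '%s\n' "$installed" | grep -qxF "$lang" || echo "$lang"
  done
}
```

Example: docker exec paperless-webserver tesseract --list-langs | missing_langs eng deu fra prints nothing when all three languages are present.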

Add languages by setting PAPERLESS_OCR_LANGUAGES in your docker-compose.yml:

services:
  webserver:
    image: ghcr.io/paperless-ngx/paperless-ngx:2.20.11
    environment:
      # Space-separated Tesseract language codes to install at startup
      # Examples: deu (German), fra (French), spa (Spanish), ita (Italian)
      PAPERLESS_OCR_LANGUAGES: "eng deu fra"
      # Languages used during OCR, joined with "+"
      PAPERLESS_OCR_LANGUAGE: "eng+deu+fra"

Restart the container. Paperless-ngx downloads language packs on startup:

docker compose down && docker compose up -d

Re-process failed documents through the admin panel or by moving them back into the consumption folder.

Method 2: Change the OCR Mode

Paperless-ngx defaults to skip mode, which leaves pages that already have a text layer untouched and runs OCR only on pages without text. If an existing text layer is corrupt or incomplete (for example, poor text embedded by a scanner), OCR is skipped even though the document needs it.

Force OCR on all documents by changing the mode:

environment:
  # Options: skip, redo, force
  # skip: OCR only pages that have no text layer (default)
  # redo: OCR every page and replace any existing text layer
  # force: Rasterize every page, then OCR it; works on any PDF but yields larger files
  PAPERLESS_OCR_MODE: "force"

Warning: force mode rasterizes every page before running OCR, which significantly increases processing time and file size and makes the text of digitally created PDFs less sharp. Prefer redo when you only need to replace broken text layers: it also re-OCRs every page, but without rasterizing. Keep skip if your existing text layers are reliable.
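
Before changing the mode, it can help to confirm that a problem document really has an empty or near-empty text layer. The sketch below assumes the pdftotext tool from poppler-utils is installed on the host (it is not part of Paperless-ngx), and text_layer_status is a hypothetical helper that only counts non-whitespace characters in whatever text is piped into it.

```shell
# Hypothetical helper: reads extracted text on stdin and reports whether
# it contains any non-whitespace characters.
text_layer_status() {
  # Strip all whitespace (including page-break form feeds), then count bytes
  chars=$(tr -d '[:space:]' | wc -c | tr -d ' ')
  if [ "$chars" -eq 0 ]; then
    echo "no-text"
  else
    echo "has-text ($chars chars)"
  fi
}
```

Example: pdftotext scan.pdf - | text_layer_status. A result of no-text means skip mode would already OCR the file; a small or garbled text layer points toward redo.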

Method 3: Increase Memory for Tesseract

On servers with 2 GB of RAM or less, the Linux OOM (out-of-memory) killer often terminates Tesseract while it processes large or high-resolution scans. Check the host's kernel log for OOM kills (this may require sudo):

dmesg | grep -i "oom\|killed process"

Solutions, in order of preference:

A. Reduce parallel processing:

environment:
  # Process one document at a time (default: 1)
  PAPERLESS_TASK_WORKERS: "1"
  # Limit threads per worker
  PAPERLESS_THREADS_PER_WORKER: "1"

B. Set a memory limit and swap:

services:
  webserver:
    deploy:
      resources:
        limits:
          memory: 2G

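The limit does not give Tesseract more memory; it keeps a runaway OCR job from starving the rest of the host. Also make sure the host has at least 2 GB of swap configured.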

C. Add more RAM. 4 GB is the practical minimum for processing high-resolution scans. 2 GB works only with reduced parallelism and low-resolution documents.
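
The RAM and swap guidelines above can be checked on a Linux host with a short script. This is a sketch under stated assumptions: check_memory is a hypothetical helper, it reads /proc/meminfo (Linux only), and the thresholds sit slightly below 4 GB and 2 GB because the kernel reports a little less than the nominal size.

```shell
# Hypothetical helper: warns when total RAM or swap is below the
# article's guidelines. Optional argument: path to a meminfo-style file.
check_memory() {
  meminfo=${1:-/proc/meminfo}
  min_mem_kb=3500000    # just under 4 GB; MemTotal excludes reserved memory
  min_swap_kb=1900000   # just under 2 GB to allow for rounding
  mem_kb=$(awk '/^MemTotal:/ {print $2}' "$meminfo")
  swap_kb=$(awk '/^SwapTotal:/ {print $2}' "$meminfo")
  if [ "$mem_kb" -lt "$min_mem_kb" ]; then
    echo "RAM: ${mem_kb} kB (below the 4 GB guideline)"
  fi
  if [ "$swap_kb" -lt "$min_swap_kb" ]; then
    echo "Swap: ${swap_kb} kB (below the 2 GB guideline)"
  fi
}
```

Running check_memory with no arguments inspects the current host; no output means both values meet the guidelines.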

Method 4: Fix File Permissions

If the container runs as a non-root user, it needs read/write access to the consumption, media, and data directories. Permission errors appear in logs as:

PermissionError: [Errno 13] Permission denied: '/usr/src/paperless/media/documents/originals/...'

Fix with:

# Set ownership to the Paperless user (UID 1000 by default)
sudo chown -R 1000:1000 /path/to/paperless/media
sudo chown -R 1000:1000 /path/to/paperless/data
sudo chown -R 1000:1000 /path/to/paperless/consume

Or set the user ID explicitly in your compose file:

environment:
  USERMAP_UID: "1000"
  USERMAP_GID: "1000"
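
To confirm that ownership matches the UID and GID the container uses, the directories can be checked with a short script. This is a sketch: check_owner is a hypothetical helper, it relies on stat -c from GNU coreutils, and it inspects only the top-level directories, not every file beneath them.

```shell
# Hypothetical helper: first argument is the expected "uid:gid"; remaining
# arguments are directories. Prints each directory whose owner differs.
check_owner() {
  expected=$1
  shift
  for dir in "$@"; do
    # Skip paths that stat cannot read (stat prints its own error)
    actual=$(stat -c '%u:%g' "$dir") || continue
    if [ "$actual" != "$expected" ]; then
      echo "$dir is owned by $actual, expected $expected"
    fi
  done
}
```

Example: check_owner 1000:1000 /path/to/paperless/media /path/to/paperless/data /path/to/paperless/consume. No output means all three directories already belong to the expected user and group.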

Prevention

  • Set OCR languages at deployment time. Include all languages you might need in PAPERLESS_OCR_LANGUAGES from the start.
  • Allocate enough RAM. Plan for 4 GB minimum if processing scanned documents regularly.
  • Monitor the task queue. Check http://your-server:8000/admin/ → Tasks for failed processing jobs. Catch OCR issues early.
  • Use redo mode for mixed document sources with unreliable text layers. If you receive both scanned and digital PDFs, redo re-OCRs both and replaces poor text layers without the rasterization overhead of force.
  • Keep Docker image updated. Newer Paperless-ngx releases include Tesseract improvements. Check release notes before upgrading.
