OCR

Optical Character Recognition — extracting text from an image or scanned page.

Optical Character Recognition (OCR) converts text that appears as pixels in images, photographs, or scanned documents into machine-readable text. Classical OCR engines such as Tesseract were built on statistical pattern matching (Tesseract 4 and later add an LSTM-based recognizer); modern vision-model OCR (Mistral Vision, GPT-4V, AWS Textract) uses deep learning to achieve higher accuracy and extract richer structure (tables, form fields, layout information).

What it handles

OCR extracts text from scanned PDFs, smartphone photos of documents, screenshots, and any image containing readable text. Modern OCR tools go beyond text extraction: they detect tables, identify form fields, extract equations, and preserve spatial relationships. Accuracy depends heavily on image quality, font size, and language.

The cost problem

OCR is expensive per call, and running it repeatedly on the same file wastes compute and money. If a document passes through several stages (indexing, searching, analyzing), each stage pays for OCR again even though the text has not changed.
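
The arithmetic behind this can be sketched in a few lines. The function name, page count, number of passes, and per-page price below are all hypothetical values chosen for illustration, not quotes from any OCR provider.

```python
def ocr_cost(pages: int, passes: int, price_per_page: float, cached: bool) -> float:
    """Estimate total OCR spend for a document processed several times.

    All inputs are illustrative assumptions. With a cache, only the first
    pass is billed; without one, every pass is billed.
    """
    billable_passes = 1 if cached else passes
    return pages * billable_passes * price_per_page


# Hypothetical workload: 1,000 pages, processed by 3 pipeline stages.
without_cache = ocr_cost(pages=1000, passes=3, price_per_page=0.002, cached=False)
with_cache = ocr_cost(pages=1000, passes=3, price_per_page=0.002, cached=True)
print(f"without cache: ${without_cache:.2f}, with cache: ${with_cache:.2f}")
```

Under these assumptions the uncached cost grows linearly with the number of passes, while the cached cost stays at the price of a single pass.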

LLMind's role

LLMind caches OCR output inside the file's XMP packet as signed semantic metadata. Run OCR once; downstream AI pipelines read the cached OCR text instead of re-calling the OCR provider. The cache is portable, verifiable, and saves money by eliminating redundant OCR calls.
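
The run-once, read-many idea can be illustrated with a minimal sketch. This is not LLMind's actual API or storage format: a plain dictionary stands in for the file's XMP packet, an HMAC stands in for whatever signing scheme LLMind uses, and the names `get_ocr_text`, `sign`, `run_ocr`, and `SECRET_KEY` are hypothetical.

```python
import hashlib
import hmac
import json

SECRET_KEY = b"demo-key"  # hypothetical signing key, for illustration only


def sign(payload: dict, key: bytes) -> str:
    """Return an HMAC-SHA256 signature over a canonical JSON encoding."""
    body = json.dumps(payload, sort_keys=True).encode("utf-8")
    return hmac.new(key, body, hashlib.sha256).hexdigest()


def get_ocr_text(image_bytes: bytes, metadata: dict, run_ocr, key: bytes = SECRET_KEY) -> str:
    """Return OCR text, calling run_ocr only if no valid cache entry exists.

    `metadata` is a dict standing in for embedded file metadata.
    `run_ocr` is any callable that maps image bytes to text.
    """
    content_hash = hashlib.sha256(image_bytes).hexdigest()
    cached = metadata.get("ocr_cache")
    if cached is not None:
        payload = cached["payload"]
        # Accept the cache only if the signature verifies and the
        # cached text belongs to exactly these image bytes.
        valid = hmac.compare_digest(cached["signature"], sign(payload, key))
        if valid and payload["content_hash"] == content_hash:
            return payload["text"]

    # Cache miss or failed verification: pay for OCR once and store the result.
    text = run_ocr(image_bytes)
    payload = {"content_hash": content_hash, "text": text}
    metadata["ocr_cache"] = {"payload": payload, "signature": sign(payload, key)}
    return text
```

In this sketch a second call with the same image bytes and metadata returns the stored text without invoking `run_ocr`, and a tampered cache entry fails signature verification and triggers a fresh OCR call.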
