OCR once, then reuse: cache OCR output in the file
OCR is a per-call tax. Every time a pipeline hits Amazon Textract, Mistral OCR, Azure Document Intelligence, or Google Document AI for the same PDF, you pay again. At roughly $1–2 per 1,000 pages for standard document OCR, a 50,000-page dataset costs $50–100 per pipeline run. Run it four times in a week and you've spent $200–400 OCR'ing content that hasn't changed. LLMind writes OCR output into the PDF's XMP packet. Run OCR once; every downstream pipeline reads the cached result.
OCR pricing is a per-call tax
OCR APIs charge by usage. Amazon Textract charges roughly $0.0015 per page for basic text detection and $0.015 per page or more for table and form analysis. Mistral OCR has free usage at some volume tiers and is paid beyond them. Google Document AI charges comparable per-page rates, and Azure Document Intelligence operates on a similar model. The results are repeatable: the same PDF, run through the same API version, returns essentially the same text and bounding boxes. But without caching, every pipeline pays the OCR tax again.
The economics are brutal at scale. Imagine a 50,000-page dataset. Your training team OCRs it. Your research team OCRs it. Your QA team OCRs it for validation. Your extraction team OCRs it to pull entities. Your summarization team OCRs it to feed a summarizer. That's five OCR passes on the same content. If each page costs $0.0015, that's $75 per pass, $375 total. The content hasn't changed, but you've paid for it five times. Worse, if one team runs a new training job with a different tokenizer next month, the cycle repeats.
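The arithmetic above can be checked with a few lines of Python. This is only a sketch: the page count and per-page rate are the illustrative figures from the paragraph, not a quote from any provider's price list, and the function name is made up for this example.

```python
from decimal import Decimal

def ocr_cost(pages: int, price_per_page: Decimal, passes: int) -> Decimal:
    """Total spend when every pass re-runs OCR on every page."""
    return pages * price_per_page * passes

PAGES = 50_000
PRICE = Decimal("0.0015")  # illustrative per-page rate taken from the text

per_pass = ocr_cost(PAGES, PRICE, passes=1)
five_passes = ocr_cost(PAGES, PRICE, passes=5)
# Everything beyond the first pass is money spent on unchanged content.
avoidable = five_passes - per_pass

print(f"per pass: ${per_pass:.2f}")
print(f"five passes: ${five_passes:.2f}")
print(f"avoidable with a cache: ${avoidable:.2f}")
```

Decimal is used instead of float so the dollar amounts come out exact. Swap in your own page count and rate to see what repeated passes cost you.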
Where the result should live (in the file, not your ops DB)
The traditional answer is to cache OCR results in a database. Postgres, Firestore, a DynamoDB table—somewhere you can query by file hash or path and retrieve the cached output. But ops databases decay. You migrate infrastructure. The table that tracked OCR results becomes orphaned when the project moves to a new cloud account. The file gets renamed or moved to a different storage tier, and the cache key breaks. Downstream teams don't trust the cache because they don't own it and don't know if it's stale. So they re-OCR out of caution.
LLMind's answer: the cache lives in the file's XMP packet. When the file moves from S3 to another object store, gets forked into a derivative dataset, or gets copied to another team's environment, the OCR output travels with it. The file is its own source of truth. When a downstream pipeline needs to ask "has this file been OCR'd?", the answer is in the file itself: no database lookup, no trust issues, no stale keys.
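To make "the answer is in the file itself" concrete, here is a rough sketch of how a tool could look for an XMP packet by scanning a file's raw bytes. XMP packets are wrapped in `<?xpacket begin=...?>` and `<?xpacket end=...?>` processing instructions for exactly this purpose. Two assumptions to flag: the marker `llmind:ocr` is hypothetical (the real property name LLMind uses is not documented here), and the scan only works when the metadata stream is stored uncompressed. A production reader should use a proper PDF or XMP library.

```python
import re
from typing import Optional

# Matches the packet wrapper and captures whatever sits between the
# begin and end processing instructions.
_XPACKET = re.compile(
    rb"<\?xpacket begin=.*?\?>(.*?)<\?xpacket end=.*?\?>", re.DOTALL
)

def find_xmp_packet(data: bytes) -> Optional[bytes]:
    """Return the body of the first XMP packet found in data, or None."""
    match = _XPACKET.search(data)
    return match.group(1) if match else None

def has_cached_ocr(data: bytes, marker: bytes = b"llmind:ocr") -> bool:
    """True if the packet mentions the (hypothetical) OCR layer marker."""
    packet = find_xmp_packet(data)
    return packet is not None and marker in packet

# Tiny stand-in for a PDF with an embedded, uncompressed XMP packet.
sample = (
    b"%PDF-1.7\n"
    b'<?xpacket begin="\xef\xbb\xbf" id="W5M0MpCehiHzreSzNTczkc9d"?>'
    b'<x:xmpmeta xmlns:x="adobe:ns:meta/">'
    b"<llmind:ocr>cached text</llmind:ocr>"
    b"</x:xmpmeta>"
    b'<?xpacket end="w"?>'
)

print(has_cached_ocr(sample))
print(has_cached_ocr(b"%PDF-1.7\nno metadata here"))
```

The point of the sketch is that the check needs only the file's bytes: no path, no hash lookup, and no external table.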
LLMind --cache-ocr pattern
The workflow is straightforward:
# First pass: run OCR and cache the result in-file
llmind enrich --cache-ocr --ocr-provider textract invoice.pdf
# Subsequent passes: read the cached layer instead of re-calling Textract
llmind inspect invoice.pdf --layer ocr
The first command runs your chosen OCR provider (Textract, Mistral, Azure Document Intelligence, Google Document AI, or an open-source tool like Marker) and writes the result into the PDF's XMP packet under the ocr layer. The cached layer includes raw OCR text, per-page confidence scores, bounding boxes (if the provider returned them), and provider metadata (name, version, timestamp). The entire layer is signed with HMAC-SHA256, so downstream tools that hold the signing key can verify it hasn't been tampered with.
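For readers unfamiliar with HMAC-SHA256, the sketch below shows the general sign-and-verify technique using Python's standard library. It is not LLMind's actual implementation: the layer fields, the JSON canonicalization, and the key handling are all assumptions made for illustration.

```python
import hashlib
import hmac
import json

def canonical_bytes(layer: dict) -> bytes:
    """Serialize deterministically so signer and verifier hash identical bytes."""
    return json.dumps(layer, sort_keys=True, separators=(",", ":")).encode("utf-8")

def sign_layer(layer: dict, key: bytes) -> str:
    """Return the hex HMAC-SHA256 tag over the canonical layer bytes."""
    return hmac.new(key, canonical_bytes(layer), hashlib.sha256).hexdigest()

def verify_layer(layer: dict, signature: str, key: bytes) -> bool:
    """Recompute the tag and compare in constant time."""
    return hmac.compare_digest(sign_layer(layer, key), signature)

# Hypothetical layer contents and key, for illustration only.
key = b"shared-secret-for-illustration-only"
layer = {
    "provider": {"name": "textract", "timestamp": "2024-01-01T00:00:00Z"},
    "pages": [{"text": "Invoice #1001", "confidence": 0.98}],
}
signature = sign_layer(layer, key)

tampered = {**layer, "pages": [{"text": "Invoice #9999", "confidence": 0.98}]}
print(verify_layer(layer, signature, key))
print(verify_layer(tampered, signature, key))
```

Because HMAC is a symmetric scheme, anyone who can verify a layer can also forge one, so the key has to be shared only among parties that trust each other.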
Every subsequent read, whether from the same team or a different one, uses llmind inspect to fetch the cached layer. No API call. No cost. The OCR tax is paid once and amortized across every downstream consumer. On a 50,000-page dataset with five pipeline runs, four of the five passes are avoided: $300 saved at $0.0015 per page, or $200–400 at $1–2 per 1,000 pages. On larger datasets or more runs, the savings grow in proportion.
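The pay-once behavior is the familiar read-through cache pattern. The sketch below shows the pattern in isolation, with in-memory stand-ins for the cache and the provider. All function names here are hypothetical, and nothing in it calls LLMind or a real OCR API.

```python
from typing import Callable, Optional

def get_ocr_text(
    path: str,
    read_cached: Callable[[str], Optional[str]],
    run_provider: Callable[[str], str],
    write_cache: Callable[[str, str], None],
) -> str:
    """Return cached OCR text if present; otherwise pay for OCR once and cache it."""
    cached = read_cached(path)
    if cached is not None:
        return cached
    text = run_provider(path)
    write_cache(path, text)
    return text

# In-memory stand-ins: a dict plays the in-file cache, a counter tracks paid calls.
store: dict = {}
provider_calls: list = []

def fake_provider(path: str) -> str:
    provider_calls.append(path)  # each entry represents one billable OCR call
    return f"text of {path}"

# Five "pipelines" ask for the same document.
results = [
    get_ocr_text("invoice.pdf", store.get, fake_provider, store.__setitem__)
    for _ in range(5)
]

print(len(provider_calls))
```

Five consumers get identical text while the provider is invoked a single time, which is where the savings come from.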
Compatible OCR providers
LLMind works alongside the major OCR providers and is not a replacement for any of them:
- Amazon Textract – best for complex document layouts, tables, and forms
- Mistral OCR – fast, open API, multilingual
- Azure Document Intelligence – good for structured extraction and table understanding
- Google Document AI – strong on handwriting and custom document types
- Marker – open-source, no API costs, good baseline
You pick the provider that best suits your content. LLMind runs after the provider and caches its output inside the file. When another pipeline needs the same file OCR'd, it reads the cache instead of re-calling the provider. The caching layer is agnostic to the provider—cache from Textract, read with any tool.
An honest note on "OCR alternatives": Some people land on this page searching for a Textract alternative. LLMind is not one. If you need a different OCR engine—cheaper, faster, or better at a specific document type—try Mistral OCR, Google Document AI, Marker, or another provider that suits your use case. LLMind is the caching layer that sits alongside your chosen provider, not a replacement for it.