Vision LLM vs extracted text

Published 2026-04-25 · 8 min read

Choosing between a vision LLM and extracted text isn’t really about which approach is “better”; it’s about where each approach loses information and what that costs you over time. This pillar explains the real mechanism behind vision LLM accuracy limits on text-dense documents, cites the published benchmarks, and shows when pre-extracting text into the file itself is the more reliable path.

What actually happens when you send an image to a vision LLM

A common misconception: “sending a file as base64 to a vision LLM loses information because base64 is a lossy encoding.” This is wrong. Base64 is lossless. It’s a transport encoding, not a compression scheme. Every byte of the original image arrives at the server intact.
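The lossless claim is easy to check locally. The sketch below uses random bytes as a stand-in for an image file (an assumption made only to keep the example self-contained) and round-trips them through Python's standard base64 module.

```python
import base64
import os

# Stand-in for the raw bytes of an image file; any byte string behaves the same.
original = os.urandom(4096)

encoded = base64.b64encode(original)  # what the client places in the API request
decoded = base64.b64decode(encoded)   # what the server recovers on its side

# The round trip returns exactly the bytes that went in.
round_trip_is_lossless = decoded == original

# Base64 makes the payload larger (about 4/3 of the original); it never discards data.
size_ratio = len(encoded) / len(original)

print(round_trip_is_lossless, round(size_ratio, 2))
```

The only cost of base64 is a larger request body, which is a bandwidth concern rather than an accuracy concern.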

The information loss happens at two later stages: downsampling and vision encoder tokenization.

Here’s the actual pipeline. Your client base64-encodes the image bytes and sends them in an API request. The server decodes the bytes back to the original image — losslessly. Then the model’s preprocessing layer resizes the image to fit within a vendor-defined pixel budget. If your image is larger than that budget, pixels are discarded. A 300 dpi scanned contract page, which might be 2480×3508 pixels, gets scaled down to fit within whatever the vision encoder accepts. Fine text at 8pt or below, dense table borders, and decimal points in figures all become harder to resolve at smaller pixel counts.
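To make the downsampling step concrete, here is a minimal sketch of the arithmetic. The 1600-pixel long-edge limit is a hypothetical placeholder, not any vendor's published figure, and the glyph height uses the nominal point size (1 pt = 1/72 inch), which overstates the height of lowercase letters.

```python
def fit_long_edge(width, height, max_long_edge):
    """Scale an image so its longer side fits the budget; never upscale."""
    scale = min(1.0, max_long_edge / max(width, height))
    return round(width * scale), round(height * scale), scale


def nominal_text_height_px(point_size, dpi, scale=1.0):
    """Nominal height of text in pixels: points -> inches -> pixels -> scaled."""
    return point_size / 72 * dpi * scale


# Hypothetical budget, chosen only for illustration. Check vendor docs for real limits.
ASSUMED_MAX_LONG_EDGE = 1600

new_w, new_h, scale = fit_long_edge(2480, 3508, ASSUMED_MAX_LONG_EDGE)
height_before = nominal_text_height_px(8, 300)         # 8pt text on the 300 dpi scan
height_after = nominal_text_height_px(8, 300, scale)   # same text after resizing
pixels_kept = scale ** 2                                # fraction of original pixels remaining

print(new_w, new_h, round(height_before, 1), round(height_after, 1), round(pixels_kept, 2))
```

Under this assumed budget, 8pt text drops from roughly 33 pixels tall to roughly 15, and about four fifths of the original pixels are gone before the vision encoder sees anything. A decimal point occupies only a small fraction of that height.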

After downsampling, the resized image is passed through a vision encoder that converts it to a fixed sequence of tokens. These tokens, not the pixels, are what the language model actually reads. The token budget for an image input is vendor-specific. Anthropic publishes their image input handling at docs.anthropic.com/en/docs/build-with-claude/vision. OpenAI documents per-image token costs at different detail levels. Google Gemini documents its own image processing limits. Each vendor’s docs are the authoritative source for their current numbers. These specs change as models are updated, so linking to vendor docs is more durable than printing the numbers on this page.

The result: vision LLMs are not doing OCR on a high-resolution image. They’re doing their best with a downsampled, token-limited representation. For photos, illustrations, and diagrams that don’t depend on fine-grained text fidelity, that’s usually fine. For text-dense documents — scanned contracts, research papers, multi-page financial reports — the gap between the original resolution and the processed representation matters.

Two real costs: information loss and repeated tokens

Running text-dense documents through a vision LLM carries two distinct costs that compound across a workflow.

Information loss

The failure modes are predictable. Small text — footnotes, margin annotations, sub-12pt body text in scanned PDFs — degrades after downsampling. Dense tables with thin borders or merged cells lose their structural integrity. Handwriting recognition worsens at lower resolutions. Fine numbers (decimal points, negative signs, superscripts) are the highest-risk category because they’re visually small and semantically critical. If a contract says “$1.5 million” and the vision encoder renders that as “$15 million,” the downstream consequence is significant.

This isn’t a criticism of any specific model. It’s a structural constraint of vision encoders operating on fixed pixel and token budgets. Specialist OCR systems are designed specifically to preserve fine text at high resolution — they run at native document resolution without a downstream token budget forcing further compression.

Repeated token cost

Each new AI conversation that needs to read a document has to re-upload the image and pay the full image-token cost again. Image token budgets are substantially larger than equivalent text-token representations for the same content. If a 10-page contract is queried by five different team members in five separate conversations, that image-token cost is paid five times.

Prompt caching helps for repeated text within a single extended conversation, but image caching has different characteristics. Anthropic’s prompt caching documentation (retrieved 2026-04-24) confirms that images can be cached, but adding or removing images anywhere in the prompt invalidates the cache block for that message. In practice, cross-conversation image caching requires consistent prompt construction, which is harder to guarantee in ad-hoc usage.

For comparison: Anthropic publishes that cache reads are priced at 0.1× the base input token rate (as of 2026-04-24 retrieval). If you could reliably cache the image-token representation, the repeat-query cost would drop sharply. But pre-extracting text from the document removes the image-token cost entirely for all subsequent queries — not just the ones that happen to hit cache.
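The effect of unreliable cache hits can be sketched as expected-value arithmetic. The 0.1× multiplier is the Anthropic figure cited above; the 15,000 image tokens per document and the hit rates are hypothetical, and the model deliberately ignores cache-write surcharges and cache expiry.

```python
CACHE_READ_MULTIPLIER = 0.1  # cache-read price relative to base input rate, as cited above


def billed_image_tokens(image_tokens, conversations, repeat_hit_rate):
    """Input-token equivalents billed for one document across several conversations.

    The first conversation always pays in full. Each later conversation pays the
    cache-read rate with probability `repeat_hit_rate`, otherwise the full rate.
    Simplification: cache-write premiums and cache lifetimes are not modeled.
    """
    if conversations <= 0:
        return 0.0
    per_repeat = image_tokens * (
        repeat_hit_rate * CACHE_READ_MULTIPLIER + (1 - repeat_hit_rate)
    )
    return image_tokens + (conversations - 1) * per_repeat


ASSUMED_IMAGE_TOKENS = 15_000  # hypothetical: 10 pages at 1,500 image tokens each

no_hits = billed_image_tokens(ASSUMED_IMAGE_TOKENS, 5, repeat_hit_rate=0.0)
half_hits = billed_image_tokens(ASSUMED_IMAGE_TOKENS, 5, repeat_hit_rate=0.5)
all_hits = billed_image_tokens(ASSUMED_IMAGE_TOKENS, 5, repeat_hit_rate=1.0)

print(round(no_hits), round(half_hits), round(all_hits))
```

With these assumed inputs, the bill ranges from 75,000 token equivalents with no cache hits to 21,000 with perfect hits. Ad-hoc usage tends to sit nearer the uncached end of that range.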

What the published benchmarks actually say

Benchmarks in this space are tricky to interpret. A vision LLM that scores well on a visual-reasoning dataset (charts, photos, diagrams) may perform differently on document-oriented tasks (dense text, tables, multi-page layouts). The relevant benchmarks are document-specific.

Mistral’s internal evaluation (published 2025-03-06). When Mistral released Mistral OCR, they published comparative scores on their own internal evaluation covering publication papers and PDFs from the web. Their published numbers for the Overall category: Mistral OCR 2503: 94.89, GPT-4o-2024-11-20: 89.77, Gemini-2.0-Flash-001: 88.69, Google Document AI: 83.42. These are vendor-published figures; Mistral is reporting on their own evaluation rather than an independent benchmark. Per Mistral’s own disclosure, the comparison is not fully apples-to-apples because some capabilities (such as embedded image extraction) were not available in all compared systems. The directional finding, that specialist OCR pipelines outperform general vision LLMs on document-heavy tasks, is consistent with the structural argument above.

DocVQA leaderboard. The DocVQA leaderboard hosted by the Robust Reading Competition is the canonical public benchmark for document visual question-answering. Vendors and researchers submit results on a standardized dataset of real document images with associated questions. It’s the right place to cross-check vendor claims about document-understanding accuracy.

The pattern across both sources is consistent: general vision LLMs trained primarily on natural images trail specialist document-processing pipelines on text-dense document tasks. That gap narrows for some models and some task types, but it’s real enough to matter for workflows that depend on high-fidelity text extraction.

The alternative: extract once, query many times

The extract-once pattern inverts the cost structure. Instead of paying image tokens on every query, you run a specialist extraction tool once per file and embed the result directly inside the file.

Vision LLM vs pre-extracted text: workflow comparison

| Workflow | Accuracy on text-dense pages | Token cost (5 conversations) | Cacheable across conversations | Best for |
| --- | --- | --- | --- | --- |
| Raw image → vision LLM | Vendor-dependent; degrades on small text, dense tables, fine numbers | Pays image tokens × N conversations | Vendor-dependent; cache invalidated if image position changes | Visual reasoning, one-shot queries, small images |
| Pre-extract → LRFS-enriched file → LLM | Specialist OCR accuracy; typically strong on printed text at native resolution | Pays text tokens × 1 extraction + cache reads for subsequent queries | Yes; text caching is well supported under standard prompt-caching rules | Repeat-query workflows, text-dense documents, archival |

The mechanics: a specialist OCR or vision-extraction pipeline runs once against the original file at native resolution. The extracted text, document structure, and a natural-language description are embedded directly into the file as metadata layers via the LRFS payload format: llmind:text, llmind:description, and llmind:structure. The enriched file becomes an LLM-ready file: any AI tool that reads the LRFS layers gets the pre-extracted text rather than doing its own extraction from scratch.
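From a consuming tool's point of view, the pattern might look like the sketch below. Only the three layer names come from the text above. How LRFS actually serializes and embeds them is not described here, so a plain dictionary stands in for the payload, and make_layers and prompt_context are hypothetical helper names rather than part of any LLMind API.

```python
from typing import Dict, Optional

# Layer names as given in the text; everything else in this sketch is illustrative.
TEXT_LAYER = "llmind:text"
DESCRIPTION_LAYER = "llmind:description"
STRUCTURE_LAYER = "llmind:structure"


def make_layers(text: str, description: str, structure: str) -> Dict[str, str]:
    """Bundle the output of a one-time extraction run into the three named layers."""
    return {
        TEXT_LAYER: text,
        DESCRIPTION_LAYER: description,
        STRUCTURE_LAYER: structure,
    }


def prompt_context(layers: Dict[str, str]) -> Optional[str]:
    """Return text-only context when a non-empty text layer exists, else None.

    None tells the caller to fall back to its own extraction, for example by
    sending the raw image to a vision LLM.
    """
    text = layers.get(TEXT_LAYER, "").strip()
    if not text:
        return None
    description = layers.get(DESCRIPTION_LAYER, "").strip()
    return f"{description}\n\n{text}" if description else text


enriched = make_layers(
    text="Clause 4.2: The total fee is $1.5 million.",
    description="Scanned services contract, 10 pages.",
    structure="sections: 1-12; tables: 2",
)
context = prompt_context(enriched)  # text tokens only, no image upload needed
fallback = prompt_context({})       # file without layers: caller must extract itself

print(context)
print(fallback)
```

The design point is the fallback: a tool that finds no text layer loses nothing, while one that finds it skips image tokens entirely.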

LLMind is a file enrichment engine that implements this pattern. It does not perform OCR itself. It embeds the output of whatever specialist extraction tool is appropriate for the file format — LlamaParse, Mistral OCR, Azure Document Intelligence, or any other pipeline that produces clean structured text. The extraction step runs once; the enriched file carries the result with it, wherever it goes. See the OCR cache use case for the workflow, and OCR once, read forever for the full treatment of why re-extraction compounds costs.

Subsequent AI conversations that need the document content read text tokens from the enriched file rather than image tokens. Text is denser per unit of meaning than image-token representations of the same content. And because the extracted text doesn’t change between conversations, it caches well under standard prompt-caching rules.

Concrete example: a 10-page contract

The following is illustrative arithmetic — not a benchmark. Real numbers depend heavily on document type, resolution, token pricing at time of use, and conversation structure.

Suppose your team needs to query a 10-page scanned contract in five separate conversations — different team members asking different questions at different times.

Workflow A: Raw image upload each time. Each conversation uploads all 10 page images. Each page incurs its image-token cost (which varies by model and vendor — check the vendor docs linked above for current rates). If the same conversation structure isn’t repeated exactly, caching provides no benefit. You pay the full image-token cost for each of the five conversations. In addition, each conversation is working from the downsampled-pixel representation of the contract, which means footnotes, table figures, and fine-print clauses may be less reliable than the original text.

Workflow B: Extract once, then query the enriched file. A specialist OCR pipeline runs once on the contract at full resolution. The extracted text (say, 4,000–6,000 tokens of clean prose for a dense 10-page contract) is embedded as llmind:text in the file. Each of the five conversations sends the same text-token payload. If a prompt-caching prefix is set for the document, conversations 2–5 pay the cache-read rate (0.1× the base input token rate, per Anthropic’s documentation retrieved 2026-04-24) rather than the full input rate. The accuracy is not bounded by a downsampling step — it’s bounded by the specialist OCR pipeline, which runs at native document resolution.

The break-even point — where workflow B pays less than workflow A — depends on how many times the document is queried and what the image-to-text token ratio is for that document type. For a contract that gets queried ten or more times, workflow B is almost always cheaper. For a document that will be queried once and discarded, the extraction cost may not be worth it (see the next section).
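The break-even can be worked through as a short calculation. Every input below is hypothetical except the 0.1× cache-read multiplier cited earlier: 15,000 image tokens per document, 5,000 text tokens (inside the 4,000–6,000 range above), and a one-time extraction cost expressed as 20,000 input-token equivalents so that both workflows share a unit. The sketch also assumes workflow A never hits cache and workflow B always does after the first query.

```python
def workflow_a_cost(queries, image_tokens):
    """Raw image upload each time: full image-token cost on every query."""
    return queries * image_tokens


def workflow_b_cost(queries, text_tokens, extraction_cost, cache_read_multiplier=0.1):
    """Extract once, then query: one-time extraction, full text cost on the
    first query, cache-read rate on every later query."""
    if queries <= 0:
        return extraction_cost
    return extraction_cost + text_tokens * (1 + (queries - 1) * cache_read_multiplier)


def break_even_queries(image_tokens, text_tokens, extraction_cost, max_queries=1000):
    """Smallest query count at which workflow B is strictly cheaper, or None."""
    for n in range(1, max_queries + 1):
        if workflow_b_cost(n, text_tokens, extraction_cost) < workflow_a_cost(n, image_tokens):
            return n
    return None


# Hypothetical inputs, in input-token equivalents.
IMAGE_TOKENS = 15_000
TEXT_TOKENS = 5_000
EXTRACTION_COST = 20_000

cost_a = workflow_a_cost(5, IMAGE_TOKENS)
cost_b = workflow_b_cost(5, TEXT_TOKENS, EXTRACTION_COST)
first_cheaper = break_even_queries(IMAGE_TOKENS, TEXT_TOKENS, EXTRACTION_COST)

print(cost_a, round(cost_b), first_cheaper)
```

With these assumed inputs, a single query favors workflow A (15,000 versus 25,000), workflow B is ahead from the second query, and five conversations cost 75,000 versus 27,000. A higher extraction cost or a smaller image-to-text ratio pushes the break-even later.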

When raw image upload IS the right choice

The case for pre-extraction is strongest for text-dense documents queried repeatedly. But vision LLMs are the right tool for a different set of tasks.

  • Visual reasoning tasks. “What color is the car?” “Describe the scene.” “What emotion does this image convey?” These are questions about the image itself, not about embedded text. No OCR pipeline produces better answers here — the visual content is the content.
  • Chart shape and trend interpretation. If you need a model to describe the trajectory of a line chart — not read the underlying numbers, but reason about the shape — vision LLMs are purpose-built for this. OCR extracts the axis labels and series values; it doesn’t capture the visual argument the chart is making.
  • One-shot OCR for small images. A single screenshot, a product label, a business card — if the file will only ever be read once and the image is small enough that downsampling doesn’t matter, the overhead of running a specialist pipeline and embedding results isn’t worth it. Just send the image directly.
  • Exploratory analysis on novel document types. When you don’t yet know what a document contains — a scan you’ve never seen before — a quick vision LLM pass gives you a fast first read. If that document turns out to be something you’ll query repeatedly, that’s the signal to enrich it properly.
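The four cases above, together with the repeat-query argument from earlier sections, can be condensed into a rough routing heuristic. This is a sketch of one possible reading, not a rule from any tool: the function name, its parameters, and the choice to treat every unmatched case as an exploratory vision pass are all assumptions.

```python
def choose_workflow(needs_visual_reasoning, text_dense, expected_queries, small_image):
    """Return "vision_llm" or "pre_extract" for a single file.

    needs_visual_reasoning: the question is about the image itself or a chart's shape
    text_dense:             the file is mostly fine text or tables (contracts, reports)
    expected_queries:       how many separate conversations are likely to read it
    small_image:            small enough that downsampling is unlikely to matter
    """
    # Visual content is the content: no OCR pipeline helps here.
    if needs_visual_reasoning:
        return "vision_llm"
    # One-shot reads of small or light files are not worth an extraction run.
    if expected_queries <= 1 and (small_image or not text_dense):
        return "vision_llm"
    # Text-dense and queried repeatedly: extract once, query many times.
    if text_dense and expected_queries > 1:
        return "pre_extract"
    # Everything else: take a quick exploratory pass first, enrich later if needed.
    return "vision_llm"


contract = choose_workflow(False, True, 5, False)
business_card = choose_workflow(False, False, 1, True)
line_chart = choose_workflow(True, False, 3, False)

print(contract, business_card, line_chart)
```

A document routed to a vision pass can still be enriched later once it turns out to be queried repeatedly.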

Vision LLMs as a category are advancing quickly. The gap between specialist OCR and general vision LLM accuracy on document tasks is narrowing as model training improves. The structural argument for pre-extraction — paying image tokens once instead of N times — holds regardless of model quality, but the accuracy argument is worth re-evaluating as models improve. Track the DocVQA leaderboard for the most current picture.