Preprocessed PDF cache — in the file itself

Published 2026-04-22 · 9 min read

Your dataset has 50,000 PDFs. You run LlamaParse over them. An embedding model. A structural-extraction model. A summarization model. Hours of GPU time, hundreds of dollars. Tomorrow a teammate needs to re-train with a slightly different tokenizer — and they re-run all four steps. LLMind makes preprocessing output a per-file cache. Run once. Write the result into the PDF's XMP packet. Every downstream pipeline reads the cached version and skips the re-run.

The re-preprocessing problem

Dataset preprocessing pipelines are expensive and, for a fixed parser version and configuration, repeatable: the same PDF run through the same parser gives essentially the same output every time. But teams re-run anyway because the output lives in a scratch directory that gets cleaned, or an ops database that nobody owns, or a notebook someone forgot to save. The file itself forgets what was done to it. When a teammate joins the project or a new training run is needed, nobody knows which PDFs have been processed and which haven't, so everything gets reprocessed. The cost multiplies: one dataset preprocessed six times costs six times as much.

The traditional answer is to build an ops database—a Postgres table or Firestore collection that tracks which files have been processed and stores their outputs. But ops databases decay. You deprecate the cluster. The data migrates. The path the file was keyed under changes. The file moves from your internal S3 to a shared archive to an externally-facing bucket. The database key breaks. The cache becomes stale and nobody trusts it, so teams re-run preprocessing out of caution anyway.

LLMind as the cache layer

LLMind writes preprocessing output into the file's XMP packet under the https://llmind.org/ns/1.0/ namespace. The payload can hold extracted text from LlamaParse, Docling, Reducto, or Textract; structural decomposition (tables, headings, lists, semantic regions); extracted entities and relationships; page-level summaries; and even embedding vectors if your workflow includes them. Every layer is signed with HMAC-SHA256, so any consumer that holds the signing key can detect tampering.
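The tamper check can be sketched with standard tools. The snippet below illustrates HMAC-SHA256 tagging in general, not LLMind's actual signing code: the function names sign_layer and verify_layer are hypothetical, key handling is simplified, and how LLMind serializes a layer before tagging is not described in this post. It assumes openssl is on the PATH.

```bash
# Sketch only: generic HMAC-SHA256 tagging, not LLMind's real signing code.
# sign_layer and verify_layer are hypothetical names.
# Note: passing the key as an argument exposes it in the process list;
# fine for a demo, not for production.

# sign_layer KEY PAYLOAD -> prints the hex HMAC-SHA256 tag of PAYLOAD
sign_layer() {
  local key="$1" payload="$2"
  # openssl prints "<label>= <hex>"; the hex digest is the last field.
  printf '%s' "$payload" \
    | openssl dgst -sha256 -hmac "$key" \
    | awk '{print $NF}'
}

# verify_layer KEY PAYLOAD TAG -> exit 0 if TAG matches, non-zero otherwise
# (plain string comparison; not constant-time)
verify_layer() {
  local key="$1" payload="$2" tag="$3"
  [ "$(sign_layer "$key" "$payload")" = "$tag" ]
}
```

Because HMAC is a shared-key scheme, only pipelines that hold the key can run this check, and anyone who holds it can also produce valid tags.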

When the file moves—from a preprocessing worker to object storage to a training run to another team—the cached representation moves with it. The metadata is baked into the PDF's XMP packet; it's not in an external database that could be orphaned or forgotten. The file carries its own preprocessing state. A downstream pipeline running six months later can ask: has this file been preprocessed? The answer is in the file itself.

This works across team boundaries and storage transitions. A researcher on the language team can download a file from the archive, inspect its preprocessing state with llmind inspect, and know immediately whether they need to reprocess. A training team can read the cached extraction without rebuilding a database or restoring an archived Firestore backup. The preprocessing work is amortized across every downstream consumer—one parse operation pays for itself many times over.

Works alongside LlamaParse, Docling, Reducto, Textract

To be clear about the framing: LLMind is not a replacement for any IDP tool. You still pick the best parser for your content. If your PDFs are scanned invoices, you might use Textract. If they're research papers with complex tables, you might choose Docling or LlamaParse. If you're processing a heterogeneous dataset, you can run your PDFs through multiple parsers and cache each result separately under its own layer name (parsed.llama, parsed.docling, parsed.textract).

LLMind runs after the parser and caches its output inside the file. When the file travels to another team, to another training run, or to an archival system, the parser output travels too. When another pipeline needs the same structured content, it reads the cache instead of re-running the parser. The cache is a per-file artifact, not a separate database to maintain.

Workflow: parse once, enrich, read forever

Here's the pattern:

# Parse with your chosen IDP tool
parsed=$(llama-parse paper.pdf)

# Cache the result in the PDF's XMP packet
llmind enrich --layer parsed --from-stdin paper.pdf <<< "$parsed"

# Later, any downstream tool reads the cached layer natively
llmind inspect paper.pdf --layer parsed

The first step runs your parser of choice and captures its output. The second step writes that output into the PDF's XMP metadata under the named layer parsed. The output is signed, so downstream tools can verify it hasn't been tampered with. The third step—and every subsequent read—fetches the cached layer without re-parsing.
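For a heterogeneous dataset, the same steps can be repeated once per parser, with each result stored under its own layer (parsed.llama, parsed.docling, and so on). The following is a rough sketch under stated assumptions, not a documented LLMind workflow: cache_parser_layers and its NAME=COMMAND argument format are hypothetical, each parser command is assumed to print its result to stdout when given the PDF path, and llmind enrich is assumed to accept dotted layer names.

```bash
# Sketch only. Assumptions not confirmed by this post:
#   - each parser command prints its output to stdout
#   - `llmind enrich` accepts dotted layer names such as parsed.llama

# cache_parser_layers PDF NAME=COMMAND...
# Example: cache_parser_layers paper.pdf llama=llama-parse
cache_parser_layers() {
  local pdf="$1"; shift
  local pair name cmd output
  for pair in "$@"; do
    name="${pair%%=*}"   # text before the first '='
    cmd="${pair#*=}"     # text after the first '='
    # If a parser fails, skip it rather than caching partial output.
    if ! output="$("$cmd" "$pdf")"; then
      echo "parser '$name' failed on $pdf; layer not written" >&2
      continue
    fi
    printf '%s\n' "$output" \
      | llmind enrich --layer "parsed.$name" --from-stdin "$pdf"
  done
}
```

Keeping one layer per parser means a downstream reader can pick the extraction that suits its content type without any parser being re-run.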

Scale this to 50,000 files. Your preprocessing worker runs LlamaParse once per file and caches the output. Your embedding team reads the cached layer for all 50,000 files without re-parsing. Your summarization team does the same. So does your retraining run. The first parsing pass is expensive; every subsequent read costs a metadata lookup instead of a parser call. The savings grow with each additional team or run that consumes the same content.
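A batch version of that first pass might look like the sketch below, which parses only the files that have no cached layer yet. It rests on assumptions this post does not confirm: that llmind inspect exits non-zero when the requested layer is absent, and that the parser command prints its output to stdout. The preprocess_dir function and the PARSER_CMD variable are hypothetical names.

```bash
# Sketch only. Assumptions not confirmed by this post:
#   - `llmind inspect FILE --layer NAME` exits non-zero when the layer is absent
#   - the command named in PARSER_CMD prints parsed output to stdout
PARSER_CMD="${PARSER_CMD:-llama-parse}"

# preprocess_dir DIR [LAYER]
# Parses every *.pdf directly inside DIR (non-recursive) that has no cached
# layer yet, then prints a one-line summary of the form "parsed=N skipped=M".
preprocess_dir() {
  local dir="$1" layer="${2:-parsed}"
  local parsed=0 skipped=0 pdf output
  for pdf in "$dir"/*.pdf; do
    [ -e "$pdf" ] || continue          # glob matched nothing
    if llmind inspect "$pdf" --layer "$layer" >/dev/null 2>&1; then
      skipped=$((skipped + 1))         # cache hit: no parser call
      continue
    fi
    if output="$("$PARSER_CMD" "$pdf")"; then
      printf '%s\n' "$output" \
        | llmind enrich --layer "$layer" --from-stdin "$pdf" >/dev/null
      parsed=$((parsed + 1))
    else
      echo "parse failed, left uncached: $pdf" >&2
    fi
  done
  echo "parsed=$parsed skipped=$skipped"
}
```

Running the function a second time over the same directory should report every file as skipped, which is the behavior downstream teams rely on.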
