What is file enrichment?
File enrichment is the practice of embedding structured, machine-readable metadata directly inside a file’s own metadata payload (typically XMP for documents and media) so any downstream tool can read the file’s meaning without re-parsing it or re-running OCR.
This is a new category. It rarely comes up in AI tooling conversations because it operates at a different layer from OCR, parsing, RAG, and vector search: the file itself. Instead of treating a file as raw bytes that must be reinterpreted every time an AI tool opens it, enrichment treats interpretation as a one-time operation whose result travels with the file.
The practical consequence: a 200-page PDF that takes eight seconds to OCR the first time becomes a few-millisecond read for every subsequent tool — because the extracted text, document structure, entities, and semantic summary are already inside the file. You do the expensive work once. Every downstream consumer — a Claude agent, an MCP server, a custom retrieval pipeline, a NotebookLM import — reads the cached result.
Where enrichment fits relative to OCR, parsing, and RAG
It’s easiest to understand enrichment by what it is not. Enrichment is not a replacement for document intelligence, RAG frameworks, or vector databases. It sits between the tools that extract meaning from a file and the tools that consume that meaning:
- Parsers and IDP tools — LlamaParse, Docling, Reducto, Textract, Unstructured.io, Mistral OCR, Azure and Google Document AI — extract structure from raw files. They run upstream of enrichment. A file enrichment engine runs after them, takes their output, and caches it inside the file.
- RAG frameworks — LlamaIndex, LangChain, Haystack — orchestrate retrieval and generation. They consume enriched files and get immediate access to text, structure, and summaries without re-parsing every input.
- Vector databases — Pinecone, Weaviate, Qdrant, Chroma, pgvector — store embeddings for semantic search. Enriched files can be indexed into any of them, or read directly with no vector DB at all. The metadata travels independently of the index.
- Enterprise search tools (Glean, Dropbox Dash, Notion AI) are downstream consumers. Enrichment makes their crawlers faster and more accurate because they read structured metadata instead of re-interpreting the file.
The self-describing file pattern
A self-describing file carries enough metadata about its own content that any reader (human, application, or AI tool) can understand what the file contains without separate processing. File enrichment is the technique; a self-describing file is the result.
The enrichment standard LLMind uses lives in XMP (Extensible Metadata Platform), the RDF-based metadata container standardized as ISO 16684-1 and supported by standard JPEG, PNG, PDF, MP3, WAV, and M4A files. XMP lets you define a custom namespace and write structured data into the file without breaking the file format or its compatibility with existing tools. The file stays a normal file; it just carries a richer self-description.
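To make the custom-namespace idea concrete, here is a minimal Python sketch that serializes a few string properties as an XMP packet. The namespace URI is the one LLMind publishes; the `llmind` prefix, the property names, and the `build_xmp_packet` helper are hypothetical and are not taken from the LRFS specification. Embedding the packet into a host file is format-specific and is not shown.

```python
import xml.etree.ElementTree as ET

# Namespaces that appear in every XMP packet.
NS_X = "adobe:ns:meta/"
NS_RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
# Namespace URI from this article; the "llmind" prefix and the property
# names passed in below are illustrative assumptions.
NS_LLMIND = "https://llmind.org/ns/1.0/"

ET.register_namespace("x", NS_X)
ET.register_namespace("rdf", NS_RDF)
ET.register_namespace("llmind", NS_LLMIND)


def build_xmp_packet(properties: dict[str, str]) -> str:
    """Serialize simple string properties as an XMP packet (hypothetical helper)."""
    xmpmeta = ET.Element(f"{{{NS_X}}}xmpmeta")
    rdf = ET.SubElement(xmpmeta, f"{{{NS_RDF}}}RDF")
    desc = ET.SubElement(
        rdf, f"{{{NS_RDF}}}Description", {f"{{{NS_RDF}}}about": ""}
    )
    for name, value in properties.items():
        # Each property becomes one element in the custom namespace.
        ET.SubElement(desc, f"{{{NS_LLMIND}}}{name}").text = value
    body = ET.tostring(xmpmeta, encoding="unicode")
    # The xpacket processing instructions delimit the packet in the host file.
    return (
        '<?xpacket begin="\ufeff" id="W5M0MpCehiHzreSzNTczkc9d"?>\n'
        f"{body}\n"
        '<?xpacket end="w"?>'
    )


if __name__ == "__main__":
    print(build_xmp_packet({"summary": "Q3 revenue report", "language": "en"}))
```

Because the properties live in their own namespace, readers that do not know about it can simply ignore them, which is why the host file stays compatible with existing tools.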
Why signed, tamper-evident enrichment matters
Enrichment is only useful if you can trust the metadata you’re reading. Any downstream tool reading a “summary” field inside a file needs to know the summary corresponds to this specific file, and that nobody has swapped the summary out.
LLMind’s enrichment layer is cryptographically signed with HMAC-SHA256 over the file’s SHA-256 checksum. Two guarantees follow:
- If the file content changes, the checksum no longer matches, and the layer is detected as stale. Downstream tools can either re-enrich or fall back to raw processing.
- If someone forges a new layer, the HMAC signature fails unless they hold the original signing key. Provenance is preserved across the file’s lifetime.
This is the same design philosophy as C2PA Content Credentials — signed, in-file metadata — but the payload is semantic meaning (text, description, structure, entities), not origin authenticity alone.
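The two guarantees can be illustrated with Python's standard `hashlib` and `hmac` modules. This is a sketch under stated assumptions, not LLMind's actual implementation: the field names (`checksum`, `signature`), the hex encoding, and the choice of which bytes count as "file content" are all hypothetical, and the LRFS specification is the authority on the real layout.

```python
import hashlib
import hmac


def file_checksum(content: bytes) -> str:
    """SHA-256 of the file content, hex-encoded (encoding is an assumption)."""
    return hashlib.sha256(content).hexdigest()


def sign_layer(content: bytes, key: bytes) -> dict[str, str]:
    """Produce a checksum plus an HMAC-SHA256 signature over that checksum."""
    checksum = file_checksum(content)
    signature = hmac.new(key, checksum.encode("ascii"), hashlib.sha256).hexdigest()
    return {"checksum": checksum, "signature": signature}


def verify_layer(content: bytes, layer: dict[str, str], key: bytes) -> str:
    """Return 'valid', 'stale', or 'forged' for a stored layer."""
    expected = hmac.new(
        key, layer["checksum"].encode("ascii"), hashlib.sha256
    ).hexdigest()
    # Constant-time comparison avoids leaking how much of the signature matched.
    if not hmac.compare_digest(expected, layer["signature"]):
        return "forged"  # signature was not made with our key
    if not hmac.compare_digest(file_checksum(content), layer["checksum"]):
        return "stale"  # file content changed after enrichment
    return "valid"


if __name__ == "__main__":
    key = b"demo-signing-key"
    original = b"example file content"
    layer = sign_layer(original, key)
    print(verify_layer(original, layer, key))
    print(verify_layer(original + b" edited", layer, key))
    print(verify_layer(original, sign_layer(original, b"wrong-key"), key))
```

A downstream tool would treat "stale" as a cue to re-enrich or fall back to raw processing, and "forged" as a reason to discard the layer. Note that this sketch signs only the checksum, as described above; whether the real scheme also covers the payload fields is something to confirm in the specification.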
How LLMind implements file enrichment
LLMind is a file enrichment engine: an open-source CLI and library that takes standard JPEG, PNG, PDF, MP3, WAV, or M4A files and embeds a signed semantic layer directly inside each file’s XMP payload. The namespace is stable at https://llmind.org/ns/1.0/. Any AI tool, from a Claude agent to an MCP server to a custom RAG pipeline, can read the layer and skip re-parsing.
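As a rough illustration of what "reading the layer" can look like on the consumer side, the Python sketch below scans a file's bytes for an XMP packet and collects any properties in the LLMind namespace. It is not the LLMind library's API: `read_enrichment` and the property name in the example are hypothetical, and the scan assumes the packet is stored uncompressed, unencrypted, and UTF-8 encoded, with properties written as child elements rather than attributes.

```python
import re
import xml.etree.ElementTree as ET

NS_LLMIND = "https://llmind.org/ns/1.0/"
# XMP packets are wrapped in an <x:xmpmeta> element; scan for it in raw bytes.
_XMP_RE = re.compile(rb"<x:xmpmeta\b.*?</x:xmpmeta>", re.DOTALL)


def read_enrichment(data: bytes) -> dict[str, str]:
    """Return element-form properties in the LLMind namespace (hypothetical helper)."""
    props: dict[str, str] = {}
    prefix = f"{{{NS_LLMIND}}}"
    for match in _XMP_RE.finditer(data):
        try:
            root = ET.fromstring(match.group(0))
        except ET.ParseError:
            continue  # skip byte ranges that only look like a packet
        for elem in root.iter():
            if elem.tag.startswith(prefix):
                props[elem.tag[len(prefix):]] = elem.text or ""
    return props


if __name__ == "__main__":
    import sys

    with open(sys.argv[1], "rb") as fh:
        layer = read_enrichment(fh.read())
    # An empty dict means no layer was found, so fall back to parsing the file.
    print(layer or "no enrichment layer found")
```

A consumer would normally verify the layer's signature before trusting anything it reads this way.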
The format and signing scheme are published as the LLM-Ready File Specification (LRFS). Third-party implementations are welcome.
Install the CLI
pipx install 'llmind-cli[all]'
llmind enrich myfile.pdf