XMP metadata in Python: libraries, patterns, and gotchas
Python has no single "official" XMP library. Depending on the file type, you'll reach for py3exiv2 (images), pikepdf (PDFs), pillow (images, limited), or piexif (EXIF, minimal XMP). This guide covers the common patterns, the gotchas, and how LLMind sits on top of this plumbing.
The Python XMP landscape
If you're working with XMP metadata in Python, you need to pick a library before you write a single line. The ecosystem is fragmented by file type and binding style:
- py3exiv2: Wraps the C++ library exiv2 and handles JPEG, PNG, TIFF, and the other image formats exiv2 supports. Mature and feature-complete, but it requires the exiv2 library to be installed on the system. Community-maintained and documented.
- pyexiv2: A separate binding with a different maintainer and a different API style; it also wraps exiv2. Check its maintenance status before choosing, because the Python XMP ecosystem is fragmented at the binding layer.
- pikepdf: The go-to for PDF-only workflows. Wraps QPDF (C++) and includes a high-level metadata API. Much simpler than exiv2 if you only need PDFs.
- pillow (PIL): Exposes basic image metadata (EXIF, some XMP) through its Image.info dictionary. Limited XMP handling; best for simple reads, not writes.
- piexif: Lightweight, pure-Python EXIF library. Minimal XMP support; the focus is EXIF. Good for simple photo metadata; not suitable for rich XMP.
None of these libraries provide a high-level "write structured semantic data" API. They're low-level primitives. You get a dictionary of XMP tags; you manage the RDF structure and namespace details yourself. That's where LLMind enters the story.
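As a rough illustration of the choice described above, the sketch below maps a file extension to the library this guide suggests for it. The function name suggest_library and the extension table are assumptions made for this example; they are not part of any of the libraries listed.

```python
from pathlib import Path
from typing import Optional

# Extension-to-library table based on the list above.
# This is a simplification: real projects may need finer rules.
_SUGGESTED = {
    ".pdf": "pikepdf",
    ".jpg": "py3exiv2",
    ".jpeg": "py3exiv2",
    ".png": "py3exiv2",
    ".tif": "py3exiv2",
    ".tiff": "py3exiv2",
}


def suggest_library(path: str) -> Optional[str]:
    """Return the suggested library name for a file, or None if unknown."""
    return _SUGGESTED.get(Path(path).suffix.lower())
```

A caller would typically treat a None result as "no XMP support planned for this format" rather than guessing a library.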
Read XMP from a JPEG
Here's the pattern for reading XMP from an image file using py3exiv2:
import pyexiv2  # py3exiv2 installs its module under the name "pyexiv2"

metadata = pyexiv2.ImageMetadata('photo.jpg')
metadata.read()

# metadata.xmp_keys lists every XMP property found in the file
for key in metadata.xmp_keys:
    print(key, metadata[key].value)

# Access a specific property
if 'Xmp.dc.title' in metadata.xmp_keys:
    print("Title found:", metadata['Xmp.dc.title'].value)
The keys follow exiv2's naming convention, Xmp.prefix.localName. dc is Dublin Core, a standard, pre-registered namespace. Simple properties usually come back as strings, array properties come back as lists, and language alternatives such as dc:title typically come back as a dict keyed by language code (for example x-default).
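Because decoded values can arrive in more than one shape, it often helps to flatten them before display or indexing. The helper below is a minimal sketch under two assumptions: arrays arrive as lists, and language alternatives arrive as a dict keyed by language code. The name xmp_value_to_text is hypothetical.

```python
def xmp_value_to_text(value, lang="x-default"):
    """Flatten a decoded XMP value into a single display string."""
    if isinstance(value, dict):
        # Language alternative: prefer the requested language,
        # otherwise fall back to the first entry available.
        if not value:
            return ""
        chosen = value.get(lang)
        if chosen is None:
            chosen = next(iter(value.values()))
        return str(chosen)
    if isinstance(value, (list, tuple)):
        # Array property (Bag or Seq): join the items in order.
        return "; ".join(str(item) for item in value)
    return str(value)
```

Joining array items with "; " is an arbitrary choice for display; keep the original list if order or item boundaries matter downstream.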
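One gotcha: py3exiv2 requires exiv2 to be installed as a system dependency. On macOS, run brew install exiv2; on Debian or Ubuntu, run apt-get install libexiv2-dev. py3exiv2 is built with Boost.Python, so you may need that library's development package as well. Windows is more complex; consider pikepdf if you're Windows-only and don't need images.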
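Since a missing native dependency usually shows up as a failed import, one defensive pattern is to check for the binding before using it. The sketch below uses only the standard library; the helper name has_module is an assumption for this example.

```python
import importlib.util


def has_module(name: str) -> bool:
    """Return True if a top-level module with this name can be found."""
    return importlib.util.find_spec(name) is not None


if not has_module("pyexiv2"):
    # The binding (or the exiv2 library it links against) is not installed.
    print("exiv2 bindings not available; image XMP support is disabled")
```

This only confirms that the module can be located. A broken native library can still fail at import time, so wrapping the real import in try/except remains worthwhile.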
Write XMP to a PDF
For PDFs, pikepdf's metadata API is cleaner than exiv2:
from pikepdf import Pdf

with Pdf.open('document.pdf', allow_overwriting_input=True) as pdf:
    with pdf.open_metadata() as meta:
        # Standard properties (Dublin Core)
        meta['dc:title'] = 'My Document Title'
        meta['dc:creator'] = ['Alice', 'Bob']  # Arrays work
        meta['dc:description'] = 'A summary of this PDF.'
        # Custom namespace: declare it first.
        # 'NAMESPACE_URI' is a placeholder for your own namespace URI.
        meta.register_xml_namespace('NAMESPACE_URI', 'myns')
        meta['myns:customField'] = 'custom value'
        meta['myns:version'] = '2.1'
    # No filename given: the input file is overwritten in place
    pdf.save()
print('PDF metadata written successfully')
The open_metadata() context manager handles the RDF/XML serialization; you just set key-value pairs, and the changes are applied when the inner with block exits. The critical gotcha: if you use a custom namespace (one that's not pre-registered in pikepdf's defaults), you must call register_xml_namespace() before assigning to it. Otherwise pikepdf cannot resolve the prefix to a namespace URI, and the assignment fails instead of writing the property.
Note allow_overwriting_input=True: it lets pdf.save() write back to the file that was opened, modifying the PDF in place. To create a new file instead, omit the flag and pass a different output path to pdf.save().
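To see why a custom prefix has to be registered before use, it helps to remember that XMP properties are identified by namespace URI plus local name; the prefix is only an alias. The toy registry below illustrates that lookup. It is not pikepdf's implementation, and the class name PrefixRegistry and the example.com URI are hypothetical.

```python
class PrefixRegistry:
    """Toy prefix-to-URI table, for illustration only."""

    def __init__(self):
        # Dublin Core is a well-known, standard namespace.
        self._uri_by_prefix = {"dc": "http://purl.org/dc/elements/1.1/"}

    def register(self, uri: str, prefix: str) -> None:
        """Make a prefix usable as an alias for a namespace URI."""
        self._uri_by_prefix[prefix] = uri

    def qualify(self, key: str) -> str:
        """Turn 'prefix:name' into the fully qualified '{uri}name' form."""
        prefix, sep, local = key.partition(":")
        if not sep or not local:
            raise ValueError(f"expected 'prefix:name', got {key!r}")
        # Raises KeyError when the prefix was never registered.
        return "{%s}%s" % (self._uri_by_prefix[prefix], local)


registry = PrefixRegistry()
registry.register("https://example.com/ns/myns/1.0/", "myns")  # hypothetical URI
qualified = registry.qualify("myns:customField")
```

Without the register call, qualify has no URI to attach to the name, which is the same kind of failure an unregistered prefix causes in a real library.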
The "design your own schema" problem
Once you can read and write XMP, you face a design problem: all these libraries let you write arbitrary tags. You still have to decide:
- Namespace URI: What URI identifies your namespace? Is it stable? Will it change in the next version?
- Property names: Use camelCase? snake_case? Abbreviations or full words? Document it or consumers will guess wrong.
- RDF structure: When is something a simple string vs. a structured property with nested fields? XMP supports both; you need to be explicit.
- Array semantics: Is a multi-valued property a Bag (unordered), Seq (ordered), or Alt (alternative language/form)? XMP has all three; pick carefully.
- Cross-format consistency: JPEG, PNG, PDF, and MP3 all support XMP, but round-trip behavior varies. One library might drop custom properties; another might serialize them differently. Test on every format you support.
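The array-semantics decision in the list above becomes concrete once you look at the RDF/XML that ends up in the packet. The sketch below builds a multi-valued property as either a Bag or a Seq using only the standard library. The namespace URI, the myns prefix, and the function name array_property are assumptions for illustration; Alt is left out because its items also need an xml:lang qualifier.

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
MY_NS = "https://example.com/ns/myschema/1.0/"  # hypothetical namespace URI

# Register prefixes so the serialized XML uses readable aliases.
ET.register_namespace("rdf", RDF_NS)
ET.register_namespace("myns", MY_NS)


def array_property(name, items, kind):
    """Build <myns:name><rdf:Bag|Seq><rdf:li>...</rdf:li></...></myns:name>."""
    if kind not in ("Bag", "Seq"):
        raise ValueError("kind must be 'Bag' (unordered) or 'Seq' (ordered)")
    prop = ET.Element(f"{{{MY_NS}}}{name}")
    container = ET.SubElement(prop, f"{{{RDF_NS}}}{kind}")
    for item in items:
        li = ET.SubElement(container, f"{{{RDF_NS}}}li")
        li.text = item
    return prop


keywords = array_property("keywords", ["xmp", "python"], "Bag")
xml_text = ET.tostring(keywords, encoding="unicode")
print(xml_text)
```

Switching "Bag" to "Seq" changes only the container element, but it changes what consumers are allowed to assume about item order, which is why the choice belongs in the schema documentation.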
This is the friction point. You're not just writing metadata; you're inventing a schema and hoping every consumer understands it.
How LLMind sits on top
LLMind solves the schema-design problem by publishing LRFS (LLM-Ready File Specification), a fully-specified schema under a stable namespace. Instead of hand-rolling XMP tags, you use LLMind:
from llmind import enrich
# One-liner. Writes structured, signed semantic metadata
# under the stable llmind.org namespace, with HMAC signature.
enrich('research_paper.pdf')
Internally, LLMind:
- Parses the file (using LlamaParse, Docling, or another IDP tool).
- Extracts semantic meaning: text, structure, entities, summaries.
- Writes it to XMP under Xmp.llmind.* properties.
- Signs the payload with HMAC-SHA256 over the file's SHA-256 checksum.
- Leaves other XMP namespaces untouched.
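The signing step can be sketched with the standard library alone. This is not LLMind's actual code: the function names, the key handling, and the choice to hash the file exactly as it sits on disk are all assumptions for illustration, since this guide does not specify which bytes are hashed or how the key is managed.

```python
import hashlib
import hmac


def file_sha256(path: str, chunk_size: int = 65536) -> str:
    """Return the hex SHA-256 checksum of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def sign_checksum(checksum_hex: str, key: bytes) -> str:
    """Compute an HMAC-SHA256 signature over a hex checksum string."""
    return hmac.new(key, checksum_hex.encode("ascii"), hashlib.sha256).hexdigest()


def verify_checksum(checksum_hex: str, key: bytes, signature_hex: str) -> bool:
    """Check a signature using a constant-time comparison."""
    expected = sign_checksum(checksum_hex, key)
    return hmac.compare_digest(expected, signature_hex)
```

A reader holding the same key recomputes the checksum and calls verify_checksum; a mismatch means either the hashed content or the signature has changed since signing.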
Every downstream tool (Claude agent, LangChain, custom RAG pipeline) reads the same layer, with the same namespace semantics, signed with the same key. No schema design work. No round-trip surprises.
FAQ
Which Python XMP library should I use?
It depends on your file type and binding preference. For images (JPEG, PNG, TIFF), py3exiv2 or pyexiv2 are mature and feature-complete but require a C++ binary dependency (exiv2). For PDFs, pikepdf is the de facto standard and wraps QPDF. For lightweight image metadata, pillow has built-in support but limited XMP handling. For EXIF-focused workflows with minimal XMP, piexif is lightweight and pure Python. If you're starting a new project, pikepdf (PDF) or py3exiv2 (images) are the safest bets.
Can I use LLMind alongside my existing XMP code?
Yes. LLMind writes to the https://llmind.org/ns/1.0/ namespace, which is independent of other namespaces (Dublin Core, IPTC, etc.). Your existing XMP code can coexist with LLMind's enrichment layer. The HMAC signature protects only the LLMind namespace, so you can add or modify other metadata without affecting the enrichment layer's integrity.