Training data provenance, signed into every file

Published 2026-04-22 · 9 min read

Training data provenance has always been a documentation problem. You ship a dataset with a manifest file that documents each file’s origin, transformations, and licensing. But manifests are easy to lose. The moment a dataset is mirrored, forked, subsampled, or re-uploaded, the manifest stays behind or gets silently replaced. LLMind makes provenance a per-file property. Every file in your dataset carries signed lineage directly inside its own XMP metadata. Origin claim, transformation log, license, creator—all cryptographically vouched for and portable with the file.

What training data provenance actually requires

For a training dataset to have real provenance, every file needs four things. First: an origin claim—where did this file come from? A URL, an upstream dataset ID, a capture timestamp. Second: a transformation log—what was done to this file before inclusion? Resizing, cropping, OCR, filtering, augmentation. Third: signatures—cryptographic proof that the origin and transformation claims are authentic and haven’t been altered. And fourth, often overlooked: a schema. A downstream tool—a training pipeline, a dataset audit tool, a governance scanner—needs to read the provenance claim without custom parsing logic. Unstructured text in a manifest file fails this test. Metadata that requires a specific parsing library fails it too.
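To make the fourth requirement concrete, here is a minimal sketch of what a typed provenance record could look like. The class and field names (`ProvenanceRecord`, `Transformation`, `captured_at`, and so on) are illustrative assumptions, not LLMind’s actual schema; the point is only that a fixed structure lets a downstream tool check a record without bespoke parsing.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Transformation:
    # One step applied to the file before inclusion, e.g. "resize" or "ocr".
    operation: str
    details: dict = field(default_factory=dict)


@dataclass
class ProvenanceRecord:
    # Hypothetical field names, chosen for illustration only.
    origin: str                      # URL or upstream dataset ID
    captured_at: str                 # ISO 8601 capture timestamp
    transformations: List[Transformation] = field(default_factory=list)
    signature: Optional[str] = None  # hex digest; None means unsigned


def missing_requirements(record: ProvenanceRecord) -> List[str]:
    """List the requirements a record does not yet satisfy.

    An empty transformation log is allowed (the file may be untouched),
    and the schema requirement is met by the fixed structure itself.
    """
    missing = []
    if not record.origin:
        missing.append("origin claim")
    if not record.signature:
        missing.append("signature")
    return missing
```

A training pipeline could call `missing_requirements` on every record and refuse files that return a non-empty list.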

Most existing dataset documentation approaches (YAML manifests, JSON metadata sidecars, dataset cards) can capture the first three requirements but stumble on the fourth. They produce human-readable documentation that doesn’t travel with the file, that is easy to lose or replace, and that downstream tools often can’t consume programmatically without custom code.

The sidecar-manifest problem

You build a dataset for some task. You include a manifest.json or dataset_card.md that documents each file’s provenance. The metadata is human-readable and well-organized. Someone forks your dataset. They extract 10% of the files for their use case, and they write a new manifest to reflect the subset. Now those files have dual documentation: the original manifest upstream and the new one in the fork. A downstream user might read the second manifest and miss the original origin claim entirely. The fork gets mirrored to another platform. The new platform preserves the files but loses the fork’s manifest—too many bytes, wrong format, not worth migrating. Now those files have no documentation at all. Someone subsamples the mirrored dataset for a research project. Same story. By the time your training data reaches a live training run, the provenance claim is ten steps removed, corrupted by each handoff, or completely absent.

The problem is that the manifest is a sidecar. It travels separately from the data. Every fork, mirror, and repackaging is a chance to lose it. What you need is provenance that can’t be separated from the file.

LLMind’s per-file lineage with XMP and HMAC signatures

LLMind writes lineage metadata directly into each file’s XMP packet, under the https://llmind.org/ns/1.0/ namespace. The lineage record contains everything a dataset engineer needs: the original source (a URL, an upstream dataset identifier, or a reference), a capture or creation timestamp, a list of transformations applied (with details like the OCR provider, image dimensions, filtering criteria), the content license (CC-BY, MIT, proprietary, etc.), and the creator or organization that generated the record.
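As a rough illustration of how such a record could be laid out, the sketch below serializes the fields listed above into an XMP/RDF fragment and reads them back with a generic XML parser. Only the namespace URI comes from this article; the property names (`source`, `created`, `license`, `creator`, `transformations`) and the helper functions are assumptions made for the example, and embedding the packet into an actual image file is out of scope here.

```python
import xml.etree.ElementTree as ET

X_NS = "adobe:ns:meta/"
RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
LLMIND_NS = "https://llmind.org/ns/1.0/"  # namespace named in the article

ET.register_namespace("x", X_NS)
ET.register_namespace("rdf", RDF_NS)
ET.register_namespace("llmind", LLMIND_NS)

SIMPLE_FIELDS = ("source", "created", "license", "creator")  # hypothetical names


def build_lineage_xmp(record: dict) -> str:
    """Serialize a lineage record as an XMP/RDF fragment."""
    xmpmeta = ET.Element(f"{{{X_NS}}}xmpmeta")
    rdf = ET.SubElement(xmpmeta, f"{{{RDF_NS}}}RDF")
    desc = ET.SubElement(rdf, f"{{{RDF_NS}}}Description", {f"{{{RDF_NS}}}about": ""})
    for name in SIMPLE_FIELDS:
        ET.SubElement(desc, f"{{{LLMIND_NS}}}{name}").text = record[name]
    # XMP expresses ordered lists as an rdf:Seq of rdf:li items.
    steps = ET.SubElement(desc, f"{{{LLMIND_NS}}}transformations")
    seq = ET.SubElement(steps, f"{{{RDF_NS}}}Seq")
    for step in record["transformations"]:
        ET.SubElement(seq, f"{{{RDF_NS}}}li").text = step
    return ET.tostring(xmpmeta, encoding="unicode")


def read_lineage_xmp(xml_text: str) -> dict:
    """Read the record back using only standard XML tooling."""
    desc = ET.fromstring(xml_text).find(f".//{{{RDF_NS}}}Description")
    record = {name: desc.findtext(f"{{{LLMIND_NS}}}{name}") for name in SIMPLE_FIELDS}
    li_path = f"{{{LLMIND_NS}}}transformations/{{{RDF_NS}}}Seq/{{{RDF_NS}}}li"
    record["transformations"] = [li.text for li in desc.iterfind(li_path)]
    return record
```

Because the transformation log is an ordered `rdf:Seq`, a reader gets the steps back in the order they were applied.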

Every lineage record is signed with HMAC-SHA256. The entire file gets a SHA-256 checksum embedded in the same metadata packet. If someone modifies the file or alters the metadata—changes a caption, strips an entity, resizes an image beyond what the transformation log says—verification will fail. A downstream tool can run llmind verify and get a cryptographic confirmation: “This file’s provenance claim is authentic and unaltered,” or “This file has been tampered with.”
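The general mechanism can be sketched with Python’s standard library. This is not LLMind’s implementation: the canonical JSON serialization, the field names, and the choice to checksum only the content bytes (rather than the file including its own metadata packet) are assumptions made for illustration. Note also that HMAC is a symmetric scheme, so whoever verifies a record needs the same secret key that signed it; how keys are distributed is not covered here.

```python
import hashlib
import hmac
import json


def _canonical_bytes(record: dict) -> bytes:
    # Sorted keys and fixed separators give the same bytes for the same record.
    return json.dumps(record, sort_keys=True, separators=(",", ":")).encode("utf-8")


def sign_record(record: dict, content: bytes, key: bytes) -> dict:
    """Return a copy of the record with a content checksum and an HMAC-SHA256 signature."""
    signed = dict(record)
    signed["checksum"] = hashlib.sha256(content).hexdigest()
    # The signature covers the lineage fields and the checksum together.
    signed["signature"] = hmac.new(key, _canonical_bytes(signed), hashlib.sha256).hexdigest()
    return signed


def verify_record(signed: dict, content: bytes, key: bytes) -> bool:
    """Check that neither the content nor the lineage fields have changed."""
    claimed = dict(signed)
    signature = claimed.pop("signature", "")
    expected = hmac.new(key, _canonical_bytes(claimed), hashlib.sha256).hexdigest()
    actual_checksum = hashlib.sha256(content).hexdigest()
    # compare_digest avoids leaking information through comparison timing.
    checksum_ok = hmac.compare_digest(claimed.get("checksum", ""), actual_checksum)
    return checksum_ok and hmac.compare_digest(signature, expected)
```

Changing a single lineage field, a single content byte, or the key makes `verify_record` return `False`, which is the behavior the paragraph above describes.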

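The metadata is structured, not prose. It uses standard XMP property types so parsers don’t need custom logic. And because it lives inside the file, it survives the lifecycle of the dataset: uploads, mirrors, forks, and subsampling. It can also survive lossy re-compression of images, provided the re-encoding tool preserves the XMP packet; some tools and platforms strip metadata unless told to keep it. Re-encoding does change the file’s bytes, so the lineage record stays readable but the embedded checksum no longer matches, and verification will flag the change.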

Example: preparing a HuggingFace dataset with LLMind lineage

Prepare your dataset directory with lineage records:

# Enrich each file with lineage and signing
for file in dataset/*.jpg; do
  llmind enrich --sign \
    --lineage-source "https://source-dataset.org/images/2025-q1" \
    --lineage-license "CC-BY-4.0" \
    --lineage-creator "my-org/data-team" \
    "$file"
done

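# Upload the dataset/ directory to a HuggingFace dataset repo
# (my-org/my-dataset is a placeholder repo ID)
# Provenance travels with each file automatically
huggingface-cli upload my-org/my-dataset ./dataset . --repo-type dataset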

Every file now carries its own provenance. When someone forks your HuggingFace dataset, the files retain their lineage records. When their fork gets mirrored to another platform, the metadata stays embedded. A downstream researcher who pulls files from any of those mirrors can verify the original origin claim with llmind verify. The provenance doesn’t degrade or get lost. It’s not a separate document that someone might forget to copy—it’s part of the file itself.

Why per-file provenance matters for training data governance

Training data governance is increasingly critical. Regulators want to know where your training data came from. Your model’s users want to verify that they’re using licensed content. Your team needs to audit whether a file went through the filters and transformations you expected. With sidecar manifests, you can’t verify any of this at the file level. With per-file signatures, you can. Grab any file from your dataset and verify its provenance standalone. No manifest lookup, no database query. The proof is embedded. This is especially important when your dataset gets resampled, merged with other sources, or used in ways you didn’t anticipate.
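A file-level audit along these lines could be a short script. The sketch below walks a set of files and sorts them into verified and failed buckets. It shells out to `llmind verify`, and it assumes that the command takes a file path and exits with status 0 on success; that exit-code convention is a guess rather than documented behavior, so the verifier is passed in as a parameter and can be swapped out.

```python
import subprocess
from pathlib import Path
from typing import Callable, Dict, Iterable, List


def llmind_verify(path: Path) -> bool:
    # Assumption: `llmind verify <file>` exits with status 0 when the
    # provenance claim checks out, and non-zero otherwise.
    result = subprocess.run(["llmind", "verify", str(path)], capture_output=True)
    return result.returncode == 0


def audit_files(
    paths: Iterable[Path],
    verify: Callable[[Path], bool] = llmind_verify,
) -> Dict[str, List[Path]]:
    """Verify each file on its own; no manifest or database is consulted."""
    report: Dict[str, List[Path]] = {"verified": [], "failed": []}
    for path in sorted(paths):
        bucket = "verified" if verify(path) else "failed"
        report[bucket].append(path)
    return report
```

Calling `audit_files(Path("dataset").glob("*.jpg"))` on a merged or resampled dataset would then show which files still carry a valid claim.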
