LLMind benchmarks

Reference measurements for LLMind file enrichment: file-size overhead, signing throughput, and read-time comparison against re-parsing with LlamaParse, Docling, Textract, and Mistral OCR.

Measurements pending. The tables below describe what Sprint 3 measures against the reference corpus at huggingface.co/datasets/llmind/reference-enriched-pdfs-v1. Cells show "—" until the measurement protocol runs. The methodology is committed at docs/benchmarks/sprint-3-methodology.md and the raw CSV at docs/benchmarks/sprint-3-data.csv. Third parties can rerun and publish alternative numbers against the same corpus.

Test corpus

100 public-domain PDFs at huggingface.co/datasets/llmind/reference-enriched-pdfs-v1: 40 small (<500KB), 40 medium (500KB–5MB), 20 large (5–50MB). Content mix: technical papers, scanned facsimiles, government reports, synthetic research PDFs. Licensed for redistribution (CC0, public domain, or US government works).
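The size tiers above can be expressed as a small classifier, which may help when checking that a local copy of the corpus has the expected 40/40/20 split. This is a minimal sketch, not the project's tooling: the function name size_bucket is hypothetical, decimal units (1 KB = 1000 bytes) are an assumption because the source does not say KB versus KiB, and the choice to put a file of exactly 5MB in the large tier is also an assumption.

```python
from collections import Counter

# Assumption: decimal units. The source does not specify KB vs. KiB.
KB = 1000
MB = 1000 * KB


def size_bucket(num_bytes: int) -> str:
    """Classify a file size into the corpus tiers (hypothetical helper)."""
    if num_bytes < 0:
        raise ValueError("file size cannot be negative")
    if num_bytes < 500 * KB:
        return "small"
    if num_bytes < 5 * MB:  # assumption: exactly 5MB counts as large
        return "medium"
    if num_bytes <= 50 * MB:
        return "large"
    return "out-of-range"


def tier_counts(sizes_in_bytes):
    """Tally how many files fall into each tier."""
    return Counter(size_bucket(n) for n in sizes_in_bytes)
```

Calling tier_counts on the byte sizes of all corpus files should give 40 small, 40 medium, and 20 large if the local copy matches the description and the boundary assumptions hold.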

File-size overhead

The XMP semantic layer LLMind writes adds bytes to the file. This table measures how many — as a percentage of the original, and in absolute bytes.

Measurement	Value	Notes
avg-overhead-percent	—	Average XMP payload size as % of original file bytes across the 100-PDF corpus.
median-overhead-percent	—	Median (less sensitive to huge files).
p95-overhead-percent	—	95th percentile — worst case for small PDFs.
avg-absolute-bytes	—	Average bytes added per file (for intuition — expect single-digit KB).
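As an illustration of how the four rows above could be derived, the sketch below computes them from pairs of (original size, enriched size). It is not the committed measurement protocol. It assumes that the bytes added equal the enriched size minus the original size, that the average is the mean of per-file percentages rather than a ratio of totals, and that the 95th percentile uses the nearest-rank method. The function name overhead_stats is hypothetical.

```python
import statistics


def overhead_stats(sizes):
    """Summarise overhead from (original_bytes, enriched_bytes) pairs.

    Assumptions: added bytes = enriched - original; percentages are
    averaged per file; p95 uses the nearest-rank definition.
    """
    if not sizes:
        raise ValueError("need at least one file")
    added = [enriched - original for original, enriched in sizes]
    percents = [100.0 * a / original for a, (original, _) in zip(added, sizes)]

    ranked = sorted(percents)
    # Nearest-rank p95: smallest rank r with r >= 0.95 * n, in integer math.
    rank = (95 * len(ranked) + 99) // 100
    return {
        "avg-overhead-percent": statistics.fmean(percents),
        "median-overhead-percent": statistics.median(percents),
        "p95-overhead-percent": ranked[rank - 1],
        "avg-absolute-bytes": statistics.fmean(added),
    }
```

With a fixed-size payload, small files dominate the upper percentiles, which is why the p95 row is described as the worst case for small PDFs.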

Signing throughput

How fast LLMind can sign the semantic layer on commodity hardware. Measured in isolation (pure crypto; no OCR / parse in the hot path).

Algorithm	Throughput	Notes
hmac-sha256	—	HMAC-SHA256 signing of the semantic layer (default algorithm).
ed25519	—	ed25519 signing (optional; for public-key verification).
file-checksum-sha256	—	SHA-256 file-content checksum (excludes XMP packet).
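A rough way to reproduce the first and third rows on local hardware is sketched below, using only the Python standard library. It is an assumption-laden sketch: the source does not state the throughput unit, so signatures per second and MB per second are illustrative choices, the function names are hypothetical, and the payload is random bytes rather than a real semantic layer. ed25519 is left out because it needs a third-party library, and the checksum here hashes the whole buffer rather than excluding the XMP packet.

```python
import hashlib
import hmac
import os
import time


def sign_hmac_sha256(key: bytes, payload: bytes) -> bytes:
    """Return the 32-byte HMAC-SHA256 tag for payload."""
    return hmac.new(key, payload, hashlib.sha256).digest()


def hmac_sha256_ops_per_second(payload: bytes, key: bytes, iterations: int = 10_000) -> float:
    """Time repeated signing in isolation and return signatures per second."""
    start = time.perf_counter()
    for _ in range(iterations):
        sign_hmac_sha256(key, payload)
    # Guard against a zero reading on very coarse clocks.
    elapsed = max(time.perf_counter() - start, 1e-9)
    return iterations / elapsed


def sha256_mb_per_second(data: bytes, iterations: int = 50) -> float:
    """Time repeated SHA-256 hashing and return decimal megabytes per second."""
    start = time.perf_counter()
    for _ in range(iterations):
        hashlib.sha256(data).digest()
    elapsed = max(time.perf_counter() - start, 1e-9)
    return len(data) * iterations / elapsed / 1e6


if __name__ == "__main__":
    key = os.urandom(32)
    layer = os.urandom(4 * 1024)  # assumption: a payload of a few KB
    print(f"hmac-sha256: {hmac_sha256_ops_per_second(layer, key):,.0f} signatures/s")
    print(f"sha256: {sha256_mb_per_second(os.urandom(5_000_000)):,.1f} MB/s")
```

Numbers from a loop like this depend heavily on the CPU and on whether the hash implementation uses hardware SHA extensions, so they are only comparable across runs on the same machine.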

Read-time comparison

The core value proposition under test: reading the cached LRFS semantic layer from XMP should be orders of magnitude faster than re-parsing the same PDF. Times are in milliseconds; lower is better.

Operation	Time per file	Notes
llmind-cached-read	—	Time for a consumer to parse the XMP packet and extract the layer.
llamaparse-reparse	—	Time to re-parse the PDF with LlamaParse (cloud API call; includes network).
docling-reparse	—	Time to re-parse with Docling (local).
textract-reparse	—	Time to re-parse with AWS Textract (cloud API call; includes network).
mistral-ocr-reparse	—	Time to re-parse with Mistral OCR (cloud API call).
speedup-factor-vs-llamaparse	—	Speedup from reading cached LRFS vs. re-running LlamaParse.
speedup-factor-vs-textract	—	Speedup from reading cached LRFS vs. re-running AWS Textract.
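The speedup rows are ratios of two per-file timings. The sketch below shows one way such a ratio could be computed. It assumes each operation is wrapped in a zero-argument callable, that the median of a few repeats is used as the per-file time, and that the speedup is the baseline time divided by the cached-read time. The names median_ms and speedup_factor are hypothetical and do not come from the methodology document.

```python
import statistics
import time
from typing import Callable


def median_ms(operation: Callable[[], object], repeats: int = 5) -> float:
    """Run operation several times and return the median wall time in ms."""
    if repeats < 1:
        raise ValueError("repeats must be at least 1")
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        operation()
        samples.append((time.perf_counter() - start) * 1000.0)
    return statistics.median(samples)


def speedup_factor(baseline_ms: float, cached_ms: float) -> float:
    """Return how many times faster the cached read is than the baseline."""
    if cached_ms <= 0:
        raise ValueError("cached read time must be positive")
    return baseline_ms / cached_ms
```

For example, a baseline of 2000 ms against a cached read of 4 ms would give a factor of 500. For the cloud baselines the timing includes network latency, so the factor reflects the conditions of the run as much as the parser itself.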

What's not measured

Some adjacent questions are deliberately out of scope. The methodology document explains each exclusion in detail. In short:

End-to-end RAG answer quality. Too many variables (LLM, chunker, prompt) to attribute signal to LLMind alone.
Vector-DB retrieval latency. LLMind isn't a vector DB, so a head-to-head comparison would be a category error. See comparisons.
Concurrent-workload performance. Sprint 3 is a single-threaded reference measurement; multi-process performance is a Sprint 4+ investigation.

Reproducibility

The corpus, the methodology, and the raw CSV are all committed to git. Third-party reviewers can rerun each step and publish alternative numbers. Results vary with hardware generation (especially for HMAC-SHA256 throughput), network conditions (for cloud-API baselines), and file selection.

The Sprint 3 CSV stays frozen as the reference point. Future re-runs land at docs/benchmarks/sprint-4-data.csv, etc.