LLMind benchmarks
Reference measurements for LLMind file enrichment: file-size overhead, signing throughput, and read-time comparison against re-parsing with LlamaParse, Docling, Textract, and Mistral OCR.
Measurements pending.
The tables below describe what Sprint 3 measures against the reference corpus at huggingface.co/datasets/llmind/reference-enriched-pdfs-v1.
Cells show — until the measurement protocol runs.
The methodology is committed at docs/benchmarks/sprint-3-methodology.md and the raw CSV at docs/benchmarks/sprint-3-data.csv.
Third parties can rerun and publish alternative numbers against the same corpus.
Test corpus
100 public-domain PDFs at huggingface.co/datasets/llmind/reference-enriched-pdfs-v1: 40 small (<500KB), 40 medium (500KB–5MB), 20 large (5–50MB). Content mix: technical papers, scanned facsimiles, government reports, synthetic research PDFs. Licensed for redistribution (CC0, public domain, or US government works).
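
The size buckets can be checked mechanically when rerunning against the corpus. The sketch below is illustrative only: `size_bucket` and `bucket_counts` are hypothetical helper names, and it assumes decimal units (1 KB = 1000 bytes) and that a file of exactly 5 MB counts as medium, neither of which the text above specifies.

```python
from collections import Counter

KB = 1000          # assumption: decimal kilobytes
MB = 1000 * KB     # assumption: decimal megabytes


def size_bucket(num_bytes: int) -> str:
    """Map a file size in bytes to the corpus bucket names used above."""
    if num_bytes < 500 * KB:
        return "small"
    if num_bytes <= 5 * MB:      # assumption: exactly 5 MB is "medium"
        return "medium"
    if num_bytes <= 50 * MB:
        return "large"
    return "out-of-range"        # larger than anything the corpus describes


def bucket_counts(sizes):
    """Count how many files fall into each bucket."""
    return Counter(size_bucket(n) for n in sizes)
```

For the full corpus, `bucket_counts` should report 40 small, 40 medium, and 20 large files if the bucket boundaries match the assumptions above.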
File-size overhead
The XMP semantic layer LLMind writes adds bytes to the file. This table measures how many — as a percentage of the original, and in absolute bytes.
| Measurement | Value | Notes |
|---|---|---|
| avg-overhead-percent | — | Average XMP payload size as % of original file bytes across the 100-PDF corpus. |
| median-overhead-percent | — | Median (less sensitive to huge files). |
| p95-overhead-percent | — | 95th percentile — worst case for small PDFs. |
| avg-absolute-bytes | — | Average bytes added per file (for intuition — expect single-digit KB). |
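
As a rough illustration of how the four rows could be derived, the sketch below takes pairs of (original size, enriched size) in bytes. It assumes the bytes added equal the size difference between the two files and uses a nearest-rank 95th percentile; the committed methodology may define either differently, and `overhead_summary` is a hypothetical name.

```python
import math
import statistics


def overhead_summary(sizes):
    """Summarise overhead for a list of (original_bytes, enriched_bytes) pairs."""
    # Assumption: bytes added by the XMP layer = enriched size - original size.
    added = [enriched - original for original, enriched in sizes]
    percents = [100.0 * a / original for a, (original, _) in zip(added, sizes)]

    # Nearest-rank percentile: the smallest sample with at least 95% of
    # all samples at or below it.
    ordered = sorted(percents)
    rank = max(1, math.ceil(0.95 * len(ordered)))

    return {
        "avg-overhead-percent": statistics.mean(percents),
        "median-overhead-percent": statistics.median(percents),
        "p95-overhead-percent": ordered[rank - 1],
        "avg-absolute-bytes": statistics.mean(added),
    }
```

Because the percentage is relative to the original size, the same absolute payload weighs more heavily on small files, which is why the p95 row is described as the worst case for small PDFs.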
Signing throughput
How fast LLMind can sign the semantic layer on commodity hardware. Measured in isolation (pure crypto; no OCR / parse in the hot path).
| Algorithm | Throughput | Notes |
|---|---|---|
| hmac-sha256 | — | HMAC-SHA256 signing of the semantic layer (default algorithm). |
| ed25519 | — | ed25519 signing (optional; for public-key verification). |
| file-checksum-sha256 | — | SHA-256 file-content checksum (excludes XMP packet). |
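
A minimal sketch of an isolated HMAC-SHA256 throughput measurement, using only the Python standard library. The payload, key, and iteration count are placeholders, and `hmac_sha256_throughput` is a hypothetical name; this is not the Sprint 3 protocol itself. ed25519 is left out because it needs a third-party library.

```python
import hashlib
import hmac
import time


def hmac_sha256_throughput(payload: bytes, key: bytes, iterations: int = 10_000):
    """Time repeated HMAC-SHA256 signing of one in-memory payload."""
    tag = b""
    start = time.perf_counter()
    for _ in range(iterations):
        # Pure crypto in the hot path: no file I/O, OCR, or parsing.
        tag = hmac.new(key, payload, hashlib.sha256).digest()
    elapsed = max(time.perf_counter() - start, 1e-9)  # guard against a zero interval

    return {
        "signatures_per_second": iterations / elapsed,
        "megabytes_per_second": iterations * len(payload) / elapsed / 1_000_000,
        "tag": tag,
    }
```

Calling it with a payload of a few kilobytes, for example `hmac_sha256_throughput(b"x" * 4096, b"demo-key")`, approximates signing a small semantic layer. Absolute numbers depend heavily on the CPU and the Python build.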
Read-time comparison
The core value proposition under test: reading the cached LRFS semantic layer from XMP should be orders of magnitude faster than re-parsing the same PDF. Lower times are better.
| Operation | Time per file | Notes |
|---|---|---|
| llmind-cached-read | — | Time for a consumer to parse the XMP packet and extract the layer. |
| llamaparse-reparse | — | Time to re-parse the PDF with LlamaParse (cloud API call; includes network). |
| docling-reparse | — | Time to re-parse with Docling (local). |
| textract-reparse | — | Time to re-parse with AWS Textract (cloud API call; includes network). |
| mistral-ocr-reparse | — | Time to re-parse with Mistral OCR (cloud API call). |
| speedup-factor-vs-llamaparse | — | Speedup from reading cached LRFS vs. re-running LlamaParse. |
| speedup-factor-vs-textract | — | Speedup from reading cached LRFS vs. re-running AWS Textract. |
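
The per-file timings and the two speedup rows could be produced by a harness along these lines. This is a sketch under assumptions: `median_ms` and `speedup_factor` are hypothetical names, the operation passed in stands for whichever reader or parser is being timed, and taking the median over a few repeats is one reasonable choice rather than the documented protocol.

```python
import statistics
import time


def median_ms(operation, path, repeats: int = 5) -> float:
    """Run operation(path) several times and return the median time in ms."""
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        operation(path)
        samples.append((time.perf_counter() - start) * 1000.0)
    return statistics.median(samples)


def speedup_factor(reparse_ms: float, cached_read_ms: float) -> float:
    """How many times faster the cached read is than the re-parse."""
    if cached_read_ms <= 0:
        raise ValueError("cached_read_ms must be positive")
    return reparse_ms / cached_read_ms
```

As a purely illustrative input, a re-parse of 1200 ms against a cached read of 4 ms gives a speedup factor of 300. For the cloud-API baselines, the timed operation includes network latency, so repeats taken at different times of day can differ noticeably.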
What's not measured
Some adjacent questions are deliberately out of scope. The methodology document explains each exclusion in detail. In short:
- End-to-end RAG answer quality. Too many variables (LLM, chunker, prompt) to attribute signal to LLMind alone.
- Vector-DB retrieval latency. LLMind isn't a vector DB; a head-to-head is category-confused. See comparisons.
- Concurrent-workload performance. Sprint 3 is single-threaded reference measurement; multi-process is a Sprint 4+ investigation.
Reproducibility
The corpus, the methodology, and the raw CSV are all committed to git. Third-party reviewers can rerun each step and publish alternative numbers. Results vary with hardware generation (especially for HMAC-SHA256 throughput), network conditions (for cloud-API baselines), and file selection.
The Sprint 3 CSV stays frozen as the reference point.
Future re-runs land at docs/benchmarks/sprint-4-data.csv, etc.