RAG

Retrieval-Augmented Generation: a pattern in which relevant chunks are retrieved from a corpus and added to an LLM's prompt before it generates a response.

Retrieval-Augmented Generation is a standard architecture for giving LLMs access to a large corpus. The pattern is straightforward: chunk a document corpus, embed each chunk as a dense vector, store these embeddings in a vector database, retrieve the top-k most relevant chunks at query time, then feed the retrieved chunks into the LLM's prompt alongside the user's question. The LLM synthesizes a response grounded in the retrieved content.
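The final step, placing retrieved chunks in the prompt alongside the question, can be sketched as below. This is a minimal illustration, not a prescribed format: the function name `build_prompt`, the numbered `[i]` labels, and the instruction wording are all assumptions made for the example.

```python
def build_prompt(question, retrieved_chunks):
    """Place retrieved chunks ahead of the user's question in one prompt string.

    Hypothetical helper: the layout and wording are illustrative only.
    """
    # Number each chunk so the model (and a reader) can refer back to it.
    context = "\n\n".join(
        f"[{i}] {chunk}" for i, chunk in enumerate(retrieved_chunks, start=1)
    )
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )
```

The returned string would then be sent to whichever LLM the application uses; that call is left out because it depends on the provider.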

The pattern

RAG has four steps ahead of generation: chunking (split the corpus into coherent passages), embedding (convert each passage to a vector using an embedding model), storage (index the vectors in a vector DB for efficient similarity search), and retrieval (embed the query and find the top-k most similar chunks). Because only the retrieved chunks enter the prompt, the LLM can answer questions over a corpus far larger than its context window, and inference cost tracks prompt size rather than corpus size.
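The four steps can be sketched end to end with the standard library alone. This is a toy under stated assumptions: `embed` is a normalized bag-of-words stand-in for a real dense embedding model, `InMemoryIndex` is a brute-force list standing in for a vector DB, and every name here is hypothetical.

```python
import math
import re
from collections import Counter


def chunk(text, max_words=40):
    """Chunking: split text into passages of at most max_words words."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]


def embed(text):
    """Embedding (toy stand-in): sparse bag-of-words vector, L2-normalized.

    A real system would call a dense embedding model here.
    """
    counts = Counter(re.findall(r"[a-z0-9]+", text.lower()))
    norm = math.sqrt(sum(c * c for c in counts.values())) or 1.0
    return {tok: c / norm for tok, c in counts.items()}


def cosine(a, b):
    """Dot product of two normalized sparse vectors, i.e. cosine similarity."""
    if len(a) > len(b):
        a, b = b, a  # iterate over the smaller vector
    return sum(w * b.get(tok, 0.0) for tok, w in a.items())


class InMemoryIndex:
    """Storage: a list of (vector, chunk) pairs, searched by brute force."""

    def __init__(self):
        self.items = []

    def add(self, chunks):
        for c in chunks:
            self.items.append((embed(c), c))

    def top_k(self, query, k=3):
        """Retrieval: return the k chunks most similar to the query."""
        q = embed(query)
        ranked = sorted(self.items, key=lambda item: cosine(q, item[0]), reverse=True)
        return [c for _, c in ranked[:k]]


index = InMemoryIndex()
index.add([
    "The cat sat on the mat.",
    "Vector databases index embeddings for similarity search.",
    "Bread needs flour, water, and yeast.",
])
best = index.top_k("how do vector databases search embeddings", k=1)
```

Brute-force search is linear in the number of chunks, so it only suits small corpora. A production vector DB would typically replace it with an approximate nearest-neighbor index.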

Why it's everywhere

RAG is the default answer to "give an LLM access to a large corpus." It works well for millions of passages, because similarity search over an index stays fast as the corpus grows. Vector DBs are mature, embedding models are fast and cheap, and the pattern is language-agnostic. Most LLM-powered search and chatbot applications use RAG in some form.

Where LLMind sits

LLMind is not a RAG framework. Instead, it caches signed semantic metadata inside files. That gives the retrieval step richer content to work with, or lets an agent skip retrieval entirely by reading the enriched files through an MCP server. Rather than chunking for similarity search, LLMind enriches files with structured summaries that the LLM can read without a vector DB.
