Retrieval Augmented Generation Explained: How RAG Gives LLMs a Memory
You have a customer support chatbot built on a large language model. Your product updated three weeks ago. Pricing changed. Two features were deprecated. A new integration launched.
The model knows none of it. It was trained six months ago. It will confidently answer questions about the old pricing until you retrain or replace it. And retraining costs real money and takes real time.
This is not a niche problem. It’s the central practical limitation of every large language model in production. The model’s knowledge is frozen at the moment training ends. Ask it about anything that happened after that cutoff, or ask it about anything that lives in your internal documents rather than the public internet, and it will either say it doesn’t know or, worse, it will hallucinate an answer that sounds plausible and is wrong.
Retrieval augmented generation (RAG) is the architectural solution to that problem. It doesn’t retrain the model. It gives the model a separate, updatable external memory it can search at inference time, then uses what it finds to generate a grounded answer.
Here’s how it actually works, what the research behind it shows, and where it breaks in ways the tutorials don’t mention.
The Memory Problem RAG Was Designed to Solve
To understand why RAG exists, you need to understand what a language model actually stores.
When a language model is trained, knowledge gets compressed into its weights. Those billions of numerical parameters encode patterns, facts, relationships, and reasoning shortcuts learned from the training corpus. Researchers call this parametric memory because the knowledge lives inside the model’s parameters.
The original RAG paper by Lewis et al. (2020), published at NeurIPS, identified the limits of parametric memory as the core problem. Their framing was direct: models that rely on parametric memory alone “cannot easily expand or revise” their knowledge, “can’t straightforwardly provide insight into their predictions, and may produce hallucinations.” The word hallucination appears in the paper as a stated motivation for building RAG, not as a criticism added later by practitioners.
The solution they proposed: pair the parametric memory with what they called non-parametric memory, meaning an external document index that stores knowledge separately from the model’s weights. Non-parametric because the knowledge doesn’t live in parameters. It lives in documents you can add to, remove from, and update independently of the model.
That distinction, parametric memory versus non-parametric memory, is the conceptual core of RAG. Everything else in the architecture is implementation detail flowing from that idea.
What RAG Actually Does Step by Step
When a user submits a query to a RAG system, three things happen in sequence before any text is generated.
Step 1: The query is embedded. The user’s question gets converted into a vector embedding — a list of numbers that represents the semantic meaning of the question. The same embedding model used during indexing is used here, so the query and the documents live in the same mathematical space.
Step 2: Semantic search runs against the document index. The system searches a vector database containing embeddings of all your document chunks, looking for the entries that are semantically closest to the query embedding. This is semantic search: it finds documents based on meaning, not exact keyword matching. Under the hood, this is a Maximum Inner Product Search (MIPS) problem: finding the vectors with the highest dot product with the query vector. Vector databases like Pinecone, Weaviate, and Chroma are engineered to run this search efficiently at scale, typically with approximate nearest-neighbor indexes rather than by comparing the query against every stored vector.
Step 3: Retrieved documents are passed to the model as context. The top-K retrieved documents get concatenated with the original query and handed to the language model. The model generates its response conditioned on both the query and the retrieved documents. This conditioning on external documents is what researchers mean by grounding — the model’s output is anchored to specific, retrievable source material rather than floating free in parametric space.
The result: the model generates answers based on documents you control, updated as recently as your last indexing run, without any retraining.
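The three steps above can be sketched end to end in a few lines. This is a minimal illustration, not a production pipeline: the `embed` function is a toy bag-of-words stand-in so the example runs without a model (a real system would call an actual embedding model here), the search is brute force rather than a vector database, and the documents, function names, and prompt wording are all hypothetical.

```python
import math
import re
from collections import Counter


def embed(text: str) -> dict[str, float]:
    """Toy stand-in for an embedding model: a unit-length bag-of-words vector."""
    counts = Counter(re.findall(r"[a-z0-9]+", text.lower()))
    norm = math.sqrt(sum(c * c for c in counts.values())) or 1.0
    return {word: c / norm for word, c in counts.items()}


def inner_product(a: dict[str, float], b: dict[str, float]) -> float:
    """Dot product of two sparse vectors stored as dicts."""
    return sum(weight * b.get(term, 0.0) for term, weight in a.items())


def retrieve(query: str, index: list[tuple[str, dict[str, float]]], k: int = 2) -> list[str]:
    """Steps 1 and 2: embed the query, then brute-force search for the top-k entries."""
    query_vec = embed(query)
    ranked = sorted(index, key=lambda entry: inner_product(query_vec, entry[1]), reverse=True)
    return [text for text, _ in ranked[:k]]


def build_prompt(query: str, passages: list[str]) -> str:
    """Step 3: concatenate the retrieved passages with the original query."""
    context = "\n\n".join(f"[{i + 1}] {p}" for i, p in enumerate(passages))
    return f"Answer using only the context below.\n\nContext:\n{context}\n\nQuestion: {query}"


# Hypothetical documents; the index stores each text alongside its vector.
documents = [
    "The Pro plan costs 49 dollars per month as of the latest pricing update.",
    "The legacy export feature was deprecated in the last release.",
    "Our office is closed on public holidays.",
]
index = [(doc, embed(doc)) for doc in documents]

query = "How much does the Pro plan cost per month?"
top = retrieve(query, index, k=2)
prompt = build_prompt(query, top)  # this string is what gets sent to the language model
```

Updating the system's knowledge means changing `documents` and rebuilding `index`; nothing about the model itself changes.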
How Documents Actually Get Into the Index
The retrieval step only works if the documents have been prepared correctly. This is the part most introductions skip and most RAG implementations get wrong in practice.
Raw documents — PDFs, knowledge base articles, internal wikis, product documentation — can’t be stored directly. They need to be split into smaller pieces first. This process is called chunking, and the chunk size matters more than most tutorials admit.
In the original Lewis et al. research, Wikipedia was split into disjoint 100-word chunks, producing 21 million documents from a single Wikipedia dump. That number gives you a sense of the granularity. 100 words is not a paragraph. It’s not a section. It’s a tight passage. The reasoning behind small chunks is that embedding a 3,000-word document into a single vector loses most of its specificity. A query about pricing won’t retrieve a document that mentions pricing in paragraph 14 if the embedding averaged over 40 paragraphs. Smaller chunks produce more precise embeddings.
The tradeoff: too-small chunks lose context. A chunk that contains a single row from a table may not make sense without the header and the surrounding rows. In practice, most teams experiment with chunk sizes between 256 and 1,024 tokens, with 512 being a common starting point. Some pipelines use overlapping chunks, where each chunk shares tokens with the previous one, to preserve continuity across boundaries.
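Overlapping chunks are easy to get subtly wrong at the boundaries, so a small sketch may help. It assumes the text has already been split into tokens; the function name and the default overlap of 64 are illustrative choices, not values from the research, and a real pipeline would count tokens with the tokenizer that matches its embedding model.

```python
def chunk_tokens(tokens: list[str], chunk_size: int = 512, overlap: int = 64) -> list[list[str]]:
    """Split a token list into fixed-size chunks that share `overlap` tokens with their predecessor."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")

    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the window already reached the end; avoid a redundant tail chunk
    return chunks


# Whitespace splitting is a rough stand-in for a real tokenizer.
tokens = "one two three four five six seven eight nine ten".split()
chunks = chunk_tokens(tokens, chunk_size=4, overlap=1)
```

With these toy settings, each chunk starts on the last token of the previous one, which is the continuity the overlap is meant to preserve.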
Once chunked, each piece gets converted to a vector embedding and stored in the vector database alongside its original text. This whole preparation process is called indexing. The quality of your retrieval is a direct function of the quality of your indexing.
Sparse Retrieval vs. Dense Retrieval: The Distinction Most Blogs Miss
Vector-based semantic search is not the only retrieval method, and treating it as such causes real problems in production.
The Zhao et al. survey on RAG published in Data Science and Engineering (2026) makes a clean distinction between two retrieval approaches that most practitioners collapse into one.
Sparse retrieval uses keyword-based similarity. Classic examples are TF-IDF and BM25. Sparse methods work on exact or near-exact term matching. They are fast, interpretable, and they work extremely well when the user’s query uses the same terminology as the documents. If your internal documentation consistently calls something “account provisioning” and users also ask about “account provisioning,” sparse retrieval finds it immediately.
Dense retrieval uses vector embeddings and semantic similarity. It works when users ask about something in different words than the documents use. “How do I set up a new user?” can retrieve documents about “account provisioning” because the semantic meanings overlap, even though no keywords match.
The failure mode of each is the opposite of the other. Sparse retrieval misses semantically related documents that use different terms. Dense retrieval can miss documents where exact phrasing matters, and it struggles on highly technical queries where terminology precision is the whole point.
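The sparse failure mode is easy to reproduce. The sketch below is a compact BM25 scorer written from the standard formula, using the common defaults k1 = 1.5 and b = 0.75 and a non-negative IDF variant; the documents and queries are made up for illustration, and a production system would use a search engine or library rather than this hand-rolled version.

```python
import math
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    return re.findall(r"[a-z0-9]+", text.lower())


def bm25_scores(query: str, docs: list[str], k1: float = 1.5, b: float = 0.75) -> list[float]:
    """Return one BM25 score per document for the given query."""
    if not docs:
        return []
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(toks) for toks in tokenized) / n_docs or 1.0

    # Document frequency: in how many documents does each term appear?
    doc_freq = Counter()
    for toks in tokenized:
        doc_freq.update(set(toks))

    query_terms = set(tokenize(query))
    scores = []
    for toks in tokenized:
        term_freq = Counter(toks)
        score = 0.0
        for term in query_terms:
            if term not in term_freq:
                continue  # terms absent from the document contribute nothing
            idf = math.log((n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5) + 1.0)
            length_norm = 1 - b + b * len(toks) / avg_len
            score += idf * term_freq[term] * (k1 + 1) / (term_freq[term] + k1 * length_norm)
        scores.append(score)
    return scores


docs = [
    "Account provisioning is handled in the admin console.",
    "Invoices are emailed on the first day of each month.",
    "Reset a password from the login page.",
]

exact = bm25_scores("account provisioning", docs)
paraphrase = bm25_scores("How do I set up a new user?", docs)
```

The exact-terminology query scores only the provisioning document. The paraphrased query shares no terms with it, so that document scores zero and keyword search cannot surface it, which is the gap dense retrieval is meant to cover.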
Production RAG systems increasingly use hybrid retrieval, combining both methods and merging their results. Neither alone is sufficient for a serious deployment.
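The article does not prescribe a merging method, so the sketch below uses reciprocal rank fusion (RRF), one widely used option, as an assumption. RRF needs only the rank positions from each retriever, which sidesteps the problem that BM25 scores and embedding similarities are on different scales. The document IDs are hypothetical, and k = 60 is the constant conventionally used with RRF.

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge several ranked lists of document IDs into one list, best first."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            # Each list contributes more for documents it ranks near the top.
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)


# Hypothetical outputs of a sparse and a dense retriever for the same query.
sparse_ranking = ["doc_a", "doc_b", "doc_c"]
dense_ranking = ["doc_c", "doc_a", "doc_d"]

fused = reciprocal_rank_fusion([sparse_ranking, dense_ranking])
```

Documents that both retrievers rank highly rise to the top of the fused list, while a document found by only one retriever still stays in the candidate set.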
What Gets Grounded and What Doesn’t
Hallucination in language models happens because the model interpolates from parametric memory when it doesn’t know something. It generates text that fits the statistical pattern of plausible answers rather than drawing on verified facts.
RAG reduces hallucination by making the model generate conditioned on retrieved documents. When the retrieved context contains the relevant fact, the model’s output is anchored to that fact rather than interpolated. The research from Lewis et al. confirmed this: their RAG models generated “more specific, diverse and factual language” than a parametric-only baseline on knowledge-intensive tasks.
The crucial qualifier is: when the retrieved context contains the relevant fact. Retrieval can fail in several ways: the relevant document isn’t in the index, the chunking cut the key sentence away from its context, or the query is phrased so differently from the document that semantic search doesn’t surface the right chunk. When that happens, the model falls back to parametric memory. It can still hallucinate, and the answer may look more credible because it was produced alongside retrieved context that only partially fits.
This is the important nuance: RAG reduces hallucination, it does not eliminate it. The quality of the generated answer is bounded by the quality of the retrieval. Bad retrieval, bad grounding, bad answer. The model does not compensate for a poor retrieval pipeline.
RAG-Sequence vs. RAG-Token: The Detail Nobody Mentions
The Lewis et al. paper introduced two specific formulations of RAG that have different behaviors and are worth understanding even if you’re not implementing from scratch.
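In RAG-Sequence, the model retrieves a set of documents and uses the same document to generate the entire response. It produces a candidate answer conditioned on each retrieved document in turn, then marginalizes over the documents at the sequence level, weighting each candidate by how relevant the retriever judged its document to be. Within any one candidate: one document, one answer.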
In RAG-Token, the model can draw from different retrieved documents for different parts of its response. Each token of the output can be generated from a different source document. The model marginalizes over the retrieved documents at each step, allowing it to synthesize information from multiple sources within a single answer.
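For reference, the two formulations can be written side by side in the notation of Lewis et al., where x is the query, y is the output sequence of N tokens, z is a retrieved document, p_η is the retriever, and p_θ is the generator. The only structural difference is whether the sum over documents sits outside or inside the product over tokens.

```latex
% RAG-Sequence: one document z conditions the whole sequence; marginalize once, at the sequence level
p_{\text{RAG-Sequence}}(y \mid x) \approx
  \sum_{z \in \text{top-}k(p_\eta(\cdot \mid x))} p_\eta(z \mid x)
  \prod_{i=1}^{N} p_\theta(y_i \mid x, z, y_{1:i-1})

% RAG-Token: marginalize over documents separately at every token position
p_{\text{RAG-Token}}(y \mid x) \approx
  \prod_{i=1}^{N} \sum_{z \in \text{top-}k(p_\eta(\cdot \mid x))} p_\eta(z \mid x)\,
  p_\theta(y_i \mid x, z, y_{1:i-1})
```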
RAG-Token is more powerful but harder to attribute. If you need to tell a user exactly which source document an answer came from, RAG-Sequence makes that straightforward. RAG-Token mixes sources at the token level, which makes citation harder.
Most production RAG implementations are closer to RAG-Sequence in practice, largely for explainability reasons. Regulated industries in particular need traceable answers.
Why RAG Is Usually the Right Call Before Fine-Tuning
There’s a tendency to reach for fine-tuning when a model doesn’t know your domain. Fine-tuning means updating the model’s weights on your specific data, which is expensive, slow, and produces a model whose knowledge is again static the moment training ends.
RAG and fine-tuning answer different problems. Fine-tuning teaches the model a new style, format, or domain-specific reasoning pattern. RAG gives the model access to specific, current, updatable facts. The distinction matters in practice: a customer support model needs to know what your product does right now, which changes. That’s a RAG problem. It needs to generate responses in your brand voice, which doesn’t change. That’s a fine-tuning problem.
For a clear picture of what actually lives in a language model’s parametric memory before any of this augmentation happens, the post on how large language models work covers the transformer architecture and pre-training process that RAG is built on top of.
Where RAG Breaks
RAG has a clean conceptual story. Production deployments have a messier one.
Retrieval quality degrades with document scale. At a few hundred documents, almost any chunking and embedding approach works. At hundreds of thousands of documents, embedding quality, chunk size, and metadata filtering become critical. Retrieval precision drops as the index grows unless you invest in the infrastructure.
Chunking decisions propagate through the entire pipeline. The most common production failure is chunks that are too large, too small, or split at semantically wrong boundaries. A table of contents chunk is nearly useless. A chunk that cuts mid-sentence loses coherence. There’s no universally correct chunk size. It depends on your documents, your queries, and your embedding model.
The context window is a hard ceiling. Even with perfect retrieval, you can only pass so many documents to the model as context. If the right answer requires synthesizing across ten documents but your context window fits three, the model won’t see all the relevant information. Context window management is a real architectural constraint.
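One simple way to respect that ceiling is to pack retrieved chunks into a fixed token budget in rank order. The sketch below is an illustration under stated assumptions: the function name is hypothetical, the whitespace-based token counter is a rough stand-in for the model’s real tokenizer, and stopping at the first chunk that does not fit is only one possible policy.

```python
from typing import Callable


def pack_context(
    ranked_chunks: list[str],
    max_tokens: int,
    count_tokens: Callable[[str], int] = lambda text: len(text.split()),
) -> list[str]:
    """Keep the highest-ranked chunks that fit inside the token budget."""
    selected: list[str] = []
    used = 0
    for chunk in ranked_chunks:
        cost = count_tokens(chunk)
        if used + cost > max_tokens:
            break  # stop here so the selection stays a prefix of the ranking
        selected.append(chunk)
        used += cost
    return selected


# Chunks are assumed to arrive best-first from the retriever.
ranked = ["a b c", "d e f g", "h i"]
fits_two = pack_context(ranked, max_tokens=7)
fits_one = pack_context(ranked, max_tokens=5)
```

Whatever the policy, anything ranked below the cutoff never reaches the model, so retrieval ranking and budget size jointly decide what the model can see.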
Embedding models go stale. The embedding model you used to index your documents and the one you use to embed queries need to be the same, or at least compatible. Switching embedding models after the fact means re-embedding and re-indexing everything.
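A cheap safeguard is to record which embedding model built the index and refuse queries embedded with anything else. The sketch below assumes a hypothetical in-memory `VectorIndex` class and made-up model names; hosted vector databases have their own metadata mechanisms, so treat this as the shape of the check rather than any particular product’s API.

```python
from dataclasses import dataclass, field


@dataclass
class VectorIndex:
    """Hypothetical in-memory index that remembers how its vectors were produced."""
    embedding_model: str
    dimension: int
    vectors: list[list[float]] = field(default_factory=list)

    def add(self, vector: list[float]) -> None:
        if len(vector) != self.dimension:
            raise ValueError(f"expected {self.dimension} dimensions, got {len(vector)}")
        self.vectors.append(vector)


def check_query_compatible(index: VectorIndex, query_model: str, query_vector: list[float]) -> None:
    """Raise ValueError if the query was not embedded the same way as the index."""
    if query_model != index.embedding_model:
        raise ValueError(
            f"index was built with {index.embedding_model!r} but the query used {query_model!r}; "
            "re-embed and re-index before switching models"
        )
    if len(query_vector) != index.dimension:
        raise ValueError(f"expected {index.dimension} dimensions, got {len(query_vector)}")


# "embed-v1" and "embed-v2" are placeholder model names.
index = VectorIndex(embedding_model="embed-v1", dimension=3)
index.add([0.1, 0.2, 0.3])
check_query_compatible(index, "embed-v1", [0.3, 0.2, 0.1])  # passes silently
```

Matching dimensions alone are not enough: two models can emit vectors of the same size that live in unrelated spaces, which is why the model name is checked first.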
FAQ
What is the difference between retrieval augmented generation and fine-tuning?
Fine-tuning updates the weights of a language model on a specific dataset, permanently changing what the model knows. The resulting knowledge is static. Retrieval augmented generation adds an external document index at inference time without touching the model’s weights. The knowledge can be updated by updating the index. RAG is better suited for current, frequently changing information. Fine-tuning is better suited for teaching a model a consistent style, format, or domain-specific reasoning pattern. Many production systems use both.
Does RAG eliminate hallucinations?
No. RAG significantly reduces hallucination by grounding model outputs in retrieved documents. But if retrieval fails to surface the relevant documents, either because the document isn’t indexed, the chunk was poorly constructed, or the query doesn’t semantically match the source language, the model falls back to its parametric memory and can still hallucinate. The quality of the answer is bounded by the quality of the retrieval pipeline, not just the quality of the model.
What is a vector database and why does RAG need one?
A vector database stores documents as numerical vector embeddings and enables fast similarity search across them. RAG needs one because semantic search, finding documents by meaning rather than exact keywords, requires comparing the query embedding against potentially millions of document embeddings. Standard databases aren’t built for this kind of similarity computation at scale. Vector databases like Pinecone, Weaviate, Milvus, and Chroma are designed specifically for high-speed similarity search over large embedding sets.
The cleanest mental model for RAG is this: the language model supplies the reasoning and language generation capability; the document index supplies the facts. Neither does the other’s job. When the two are wired together correctly, you get a system that can answer questions about things the model never saw during training, based on documents that were updated this morning. When the retrieval pipeline is poorly built, you get a system with an extra failure mode nobody warned you about.
The original research that introduced this architecture is worth reading if you want the formal treatment. Lewis et al. (2020) remains the foundational paper and is openly available at arxiv.org/abs/2005.11401.

