RAG Architecture: Complete Guide
Retrieval-Augmented Generation (RAG) is how you make LLMs answer questions about your data. This guide covers everything from basic concepts to production patterns.
What Is RAG?
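The Problem: LLMs have a training cutoff, and they have never seen your private data. A model whose training data ends in April 2024, for example, knows nothing about events in May 2026 or about your internal documents.
The Solution: RAG gives the LLM access to your data by searching for relevant documents first, then adding them to the prompt as context. The model itself is not retrained; it reads what you retrieved.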
The Flow:

    User Question
      ↓
    Search Your Knowledge Base
      ↓
    Retrieve Relevant Documents
      ↓
    Add Documents to Prompt
      ↓
    Send to LLM with Full Context
      ↓
    LLM Generates Answer (now informed by your data)

Why It Works: LLMs are excellent at reasoning over provided context. You just need to provide the right context.
The Three Stages of RAG
Stage 1: Indexing (offline; rerun when documents change)
Take your documents and prepare them for search:
- Chunking - Break documents into manageable pieces (chunks)
  - Why: LLMs have context limits; you can’t send a 1000-page document
  - How: Split by paragraph, sentence, or fixed size (e.g., 512 tokens)
  - Tradeoff: Smaller chunks = more precise retrieval, but harder for LLM to understand context
- Embedding - Convert text into numerical vectors
  - Why: Numbers are what vector databases understand
  - How: Pass each chunk through an embedding model (e.g., text-embedding-3-small)
  - Result: Each chunk becomes a 1536-dimensional vector
- Storage - Store vectors in a vector database
  - Why: Fast similarity search
  - How: Use Pinecone, Weaviate, Qdrant, Chroma, or pgvector
  - Result: Searchable knowledge base
Stage 2: Retrieval (At query time)
When a user asks a question:
- Convert question to vector - Use same embedding model
- Find similar vectors - Vector database does similarity search (cosine, L2, etc.)
- Return top-K chunks - Usually top 3-5 most similar chunks
- Rank if needed - Re-rank results if you have a better ranker
Example:
User asks: “What’s our return policy?”
- Query vector: [0.34, -0.12, 0.89, … 1536 dimensions total]
- Database search: Finds chunks about “returns”, “refunds”, “exchange policy”
- Return: Top 3 chunks about returns
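The retrieval step above can be sketched in a few lines. This is a minimal illustration, assuming an in-memory list of (text, vector) pairs and toy 3-dimensional vectors invented for the example; the function names are hypothetical, and a real vector database replaces the linear scan with an approximate nearest-neighbour index.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat a zero vector as similar to nothing
    return dot / (norm_a * norm_b)


def top_k(query_vector, index, k=3):
    """Return the k (chunk_text, score) pairs most similar to the query."""
    scored = [(text, cosine_similarity(query_vector, vec)) for text, vec in index]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy index: (chunk text, embedding) pairs with made-up vectors.
index = [
    ("Returns are accepted within 30 days.", [0.9, 0.1, 0.0]),
    ("Our office is open 9 to 5.", [0.0, 0.2, 0.9]),
    ("Refunds go back to the original payment method.", [0.8, 0.3, 0.1]),
]
results = top_k([1.0, 0.2, 0.0], index, k=2)
for text, score in results:
    print(f"{score:.3f}  {text}")
```

The two chunks about returns and refunds rank above the office-hours chunk because their vectors point in nearly the same direction as the query vector.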
Stage 3: Generation (At query time)
Send context + question to LLM:

    # retrieved_chunks, user_question and llm are placeholders for your own values and client.
    prompt = f"""Here is context about our company:
    {retrieved_chunks}

    User question: {user_question}

    Answer the question based on the context above."""

    answer = llm(prompt)

The LLM now has context and can answer accurately.
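One way to assemble the generation prompt is sketched below. It is an illustration rather than a fixed recipe: the function name `build_prompt`, the numbering of chunks, and the exact instruction wording are assumptions. The instruction to admit when the context lacks the answer is what keeps the model grounded.

```python
def build_prompt(chunks, question):
    """Assemble a grounded prompt from retrieved chunks and a user question."""
    # Number each chunk so the answer can be traced back to its source.
    context = "\n\n".join(f"[{i}] {chunk}" for i, chunk in enumerate(chunks, start=1))
    return (
        "Answer the question using only the context below. "
        "If the context does not contain the answer, say you don't know.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


prompt = build_prompt(
    ["Returns are accepted within 30 days.", "Refunds take 5 business days."],
    "What's our return policy?",
)
print(prompt)
```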
Chunking Strategies
Your chunking strategy dramatically affects RAG quality.
Strategy 1: Fixed-Size Chunks
Example: Split every 512 tokens

    Chunk 1: tokens 0-512
    Chunk 2: tokens 512-1024
    Chunk 3: tokens 1024-1536

Pros: Simple, predictable
Cons: May split sentences, loses context at boundaries
Use when: You have unstructured text (PDFs, web scrapes)
Strategy 2: Semantic Chunks
Example: Split when topic changes

    Chunk 1: "Introduction and Background"
    Chunk 2: "Methods and Approach"
    Chunk 3: "Results"

Pros: Preserves meaning, better context
Cons: Harder to implement, requires analysis
Use when: You control the source (your documentation)
Strategy 3: Overlapping Chunks
Example: Chunks with 50-token overlap

    Chunk 1: tokens 0-512
    Chunk 2: tokens 462-974 (first 50 tokens overlap with chunk 1)
    Chunk 3: tokens 924-1436 (first 50 tokens overlap with chunk 2)

Pros: Preserves context across boundaries
Cons: Requires more storage (about 11% more at this setting, approaching 2x at 50% overlap) and a larger index to search
Use when: Context at boundaries matters (legal docs, technical specs)
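The fixed-size and overlapping strategies differ only in how far each window advances. The sketch below assumes the text is already tokenized into a list; the function name `chunk_tokens` is hypothetical, and the generated `t0, t1, ...` strings stand in for real tokenizer output. Setting `overlap=0` gives plain fixed-size chunks.

```python
def chunk_tokens(tokens, chunk_size=512, overlap=50):
    """Split a token list into windows of chunk_size that share `overlap` tokens."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")
    stride = chunk_size - overlap  # how far each window advances
    chunks = []
    for start in range(0, len(tokens), stride):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the last window already reaches the end of the text
    return chunks


# Placeholder tokens; a real pipeline would use the embedding model's tokenizer.
tokens = [f"t{i}" for i in range(1200)]
chunks = chunk_tokens(tokens, chunk_size=512, overlap=50)
print([len(c) for c in chunks])
```

With a stride of 462 tokens, the windows start at tokens 0, 462, and 924, and the last 50 tokens of each chunk reappear at the start of the next.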
Retrieval Strategies
How you search matters.
Strategy 1: Dense Retrieval (Most Common)
How: Convert question to vector, find similar vectors

    query_vector = embedding_model.embed("What's your return policy?")
    results = vector_db.search(query_vector, top_k=5)

Pros: Fast, good for semantic search
Cons: Fails on keyword-specific queries
When to use: Most RAG systems
Strategy 2: BM25 (Keyword Search)
How: Traditional text search (like Elasticsearch)

    results = bm25_index.search("return policy refund", top_k=5)

Pros: Excellent for keywords, fast
Cons: Fails on semantic meaning
When to use: When keywords are important (product searches)
Strategy 3: Hybrid (Best)
How: Combine dense + BM25, re-rank results

    dense_results = vector_db.search(query_vector, top_k=10)
    bm25_results = bm25_index.search(query_text, top_k=10)
    combined = reciprocal_rank_fusion(dense_results, bm25_results)
    final = rerank_with_llm(combined, query_text, top_k=5)

Pros: Best of both worlds
Cons: More complex, slower
When to use: Production systems where accuracy matters
Vector Databases Compared
| DB | Best For | Index Type | Scalability | Cost | Complexity |
|---|---|---|---|---|---|
| Chroma | Prototyping, local | HNSW | Single node | Free | Easiest |
| Pinecone | Production, managed | HNSW | Auto-scaling | $0.04/1K vectors | Medium |
| Weaviate | Self-hosted, scale | HNSW + custom | Multi-node | Free or paid | Medium |
| Qdrant | High performance, filtering | HNSW + payload index | Multi-node, sharding | Free or cloud | Medium |
| pgvector | SQL integration | IVFFlat, HNSW | Postgres scale | DB cost | Hard |
| Milvus | Billion-scale | IVF, HNSW, DiskANN | Distributed | Free or cloud | Hardest |
Common RAG Patterns
Pattern 1: Simple Q&A (Naive RAG)

    User Question
      ↓ (embed)
    Vector Search
      ↓
    Top 3 Chunks
      ↓ (add to prompt)
    Send to LLM with Context
      ↓
    Answer

Pros: Simple, fast
Cons: Fails on complex questions needing multiple documents
Use: Customer support, simple FAQ
Pattern 2: Multi-Document (Fusion)

    User Question
      ↓
    Retrieve from Multiple Sources
      ↓
    Combine & Re-rank
      ↓
    Generate with Full Context
      ↓
    Answer Synthesized from Multiple Docs

Pros: Handles complex questions
Cons: More expensive, longer context
Use: Research, analysis tasks
Pattern 3: Iterative RAG (with Questions)

    User Question
      ↓
    Initial Retrieval
      ↓
    LLM Generates Follow-up Questions
      ↓
    Retrieve Again (for follow-ups)
      ↓
    Generate Final Answer with All Context

Pros: Handles multi-step reasoning
Cons: Multiple LLM calls, expensive
Use: Complex research, troubleshooting
Retrieval Technology Deep Dive
The retrieval layer is the most important determinant of RAG quality. A bad retriever means the LLM gets bad context. This section covers everything that happens between “user asks a question” and “chunks go into the prompt.”
Embedding Models
Embedding models convert text to vectors. Not all embedding models are equal — the choice dramatically affects retrieval quality.
How embedding models work:
- Text is tokenized (same as LLMs)
- Passed through a transformer (usually an encoder-only model; some newer models, such as E5-Mistral, adapt a decoder-only LLM instead)
- The final hidden state is pooled into a single vector
- That vector represents the semantic meaning of the input
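The pooling step is the least obvious of the four, so here is a small sketch of mean pooling, one common choice (other models take the first or last token's hidden state instead). The hidden states are hand-written numbers standing in for encoder output, and `mean_pool` is a hypothetical helper name.

```python
def mean_pool(token_vectors, attention_mask):
    """Average the hidden states of real tokens, ignoring padding positions."""
    dim = len(token_vectors[0])
    total = [0.0] * dim
    count = 0
    for vector, keep in zip(token_vectors, attention_mask):
        if keep:  # mask is 1 for real tokens, 0 for padding
            count += 1
            for i, value in enumerate(vector):
                total[i] += value
    if count == 0:
        raise ValueError("attention_mask selects no tokens")
    return [value / count for value in total]


# Three token vectors from a hypothetical encoder; the last one is padding.
hidden_states = [[1.0, 2.0], [3.0, 4.0], [9.0, 9.0]]
embedding = mean_pool(hidden_states, attention_mask=[1, 1, 0])
print(embedding)
```

The padding vector is excluded, so the result is the average of the first two token vectors only.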
Comparison of major embedding models (May 2026):
| Model | Dimensions | Max Tokens | Best For | Cost |
|---|---|---|---|---|
| text-embedding-3-small | 512-1536 | 8K | General purpose, cheap | $0.02/1M tokens |
| text-embedding-3-large | 256-3072 | 8K | High accuracy | $0.13/1M tokens |
| Cohere Embed v4 | 1024-4096 | 512 | Multilingual, classification | $0.10/1M tokens |
| BGE-M3 (BAAI) | 1024 | 8K | Multilingual, open-source | Free (self-host) |
| E5-Mistral (Microsoft) | 4096 | 8K | High accuracy, open-source | Free (self-host) |
| Jina Embeddings v3 | 1024 | 8K | Task-specific routing | Free (self-host) |
Key considerations:
- Dimensionality: Higher = more information per vector, but slower search. 768-1536 is the sweet spot for most use cases.
- Max tokens: Embedding models have token limits too. Longer documents must be chunked first.
- Open vs API: Open-source models (BGE, E5, Jina) can be self-hosted for privacy and zero API costs. API models (OpenAI, Cohere) are simpler but cost money at scale.
- Multilingual: If your data has multiple languages, use a multilingual embedding model (Cohere, BGE-M3).
Rule of thumb: Start with text-embedding-3-small (cheap, good quality). Switch to E5-Mistral or BGE-M3 if you need better accuracy at higher scale.
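To make the dimensionality trade-off concrete, the arithmetic below estimates raw vector storage. It assumes float32 values (4 bytes each) and ignores index overhead and metadata; the helper name and the one-million-chunk corpus are illustrative.

```python
def index_size_bytes(num_vectors, dimensions, bytes_per_value=4):
    """Raw size of the vectors alone, assuming float32 values by default."""
    return num_vectors * dimensions * bytes_per_value


one_million_chunks = 1_000_000
size_1536 = index_size_bytes(one_million_chunks, 1536)
size_3072 = index_size_bytes(one_million_chunks, 3072)
print(f"1536 dims: {size_1536 / 1e9:.2f} GB")  # 6.14 GB
print(f"3072 dims: {size_3072 / 1e9:.2f} GB")  # 12.29 GB
```

Doubling the dimensions doubles both the storage and the work per similarity comparison, which is why the larger models are worth it only when the accuracy gain is measurable.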
Dense vs Sparse vs Hybrid Retrieval
Dense retrieval (vector search):
Embed both query and documents into dense vectors. Search by cosine similarity or dot product.

    Query: "return policy" → [0.3, -0.1, 0.8, ...] (dense vector)
    Document: "we accept returns within 30 days" → [0.35, -0.05, 0.75, ...] (dense vector)
    Similarity: 0.92 (very similar) ✅

Pros: Understands semantics (“how to get a refund” finds return policy)
Cons: Keyword-specific queries fail (“policy document 4042” needs exact match)
Sparse retrieval (BM25 / keyword search):
Traditional TF-IDF style. Each term gets a weight based on how often it appears in the document and how rare it is across the corpus.

    Query: "return policy"
      "return" → weight 0.45
      "policy" → weight 0.55
    Document: "we accept returns within 30 days"
      "returns" → weight 0.3, "within" → weight 0.1, "30" → weight 0.15, "days" → weight 0.1
    Score: 0.27 (decent; matches on "return/returns")

Pros: Excellent for exact terms, IDs, proper nouns
Cons: No semantic understanding: “how to get my money back” won’t match “return policy”
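For readers who want to see where such term weights come from, here is a compact sketch of Okapi BM25 scoring. It assumes whitespace tokenization with no stemming, the commonly used defaults `k1=1.5` and `b=0.75`, and the smoothed IDF variant that stays non-negative; the function names and sample documents are invented for illustration.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace; real systems also stem and drop stop words."""
    return text.lower().split()


def bm25_scores(query, documents, k1=1.5, b=0.75):
    """Score every document against the query with the Okapi BM25 formula."""
    if not documents:
        return []
    docs = [tokenize(d) for d in documents]
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: in how many documents does each term appear?
    doc_freq = Counter(term for d in docs for term in set(d))

    scores = []
    for doc in docs:
        term_freq = Counter(doc)
        score = 0.0
        for term in tokenize(query):
            if term not in term_freq:
                continue  # no exact match, no contribution
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            tf = term_freq[term]
            length_norm = 1 - b + b * len(doc) / avg_len
            score += idf * tf * (k1 + 1) / (tf + k1 * length_norm)
        scores.append(score)
    return scores


documents = [
    "our return policy allows returns within 30 days",
    "shipping takes three to five business days",
    "get your money back with no questions asked",
]
scores = bm25_scores("return policy", documents)
print(scores)
```

The third document is about refunds but shares no words with the query, so it scores zero. That is the lack of semantic understanding listed under Cons.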
Hybrid retrieval:
Run both searches and merge the results. The standard technique is Reciprocal Rank Fusion (RRF), which combines ranks rather than raw scores:

    def reciprocal_rank_fusion(dense_results, sparse_results, k=60):
        scores = {}
        # Rank each list separately so position 1 in either list counts the same.
        for results in (dense_results, sparse_results):
            for rank, doc in enumerate(results, start=1):
                scores[doc.id] = scores.get(doc.id, 0.0) + 1 / (k + rank)
        return sorted(scores.items(), key=lambda x: x[1], reverse=True)

RRF formula: score(d) = sum over both rankings of 1 / (k + rank of d in that ranking), with ranks starting at 1. k=60 is the commonly used smoothing constant.
Hybrid is usually better than either alone. A 10-20% improvement in retrieval recall is a reasonable expectation, but the gain depends on the corpus and query mix, so measure it on your own data.
Reranking
After initial retrieval (top 10-100), a reranker re-evaluates candidates with a more expensive but more accurate model.
Why rerank?
- Vector DB search is fast but imperfect
- A reranker can take into account more nuanced signals
- Reranking is applied to a small set (top N), so it adds minimal latency
Cross-encoder reranking:
The reranker takes a (query, document) pair and outputs a relevance score:

    Input: ("What is the return policy?", "we accept returns within 30 days...")
      ↓
    Cross-encoder transformer (BERT-style, processes both together)
      ↓
    Score: 0.95 (highly relevant)

Cross-encoders are too slow to run on the full corpus (they process each pair fully) but fast enough for 10-100 candidates.
Reranker comparison:
| Model | Speed (docs/sec) | Quality | Cost |
|---|---|---|---|
| Cohere Rerank v3 | ~100 | Excellent | $1/1K docs |
| BGE-Reranker-v2 | ~50 | Very good | Free (self-host) |
| Cross-encoder/ms-marco | ~200 | Good | Free (self-host) |
| LLM-as-judge | ~5 | Best | $0.01/query |
When to use which:
- Start without reranking — vector search alone is often sufficient for simple Q&A
- Add cross-encoder reranking when you need higher accuracy (production systems)
- Add LLM reranking only for the hardest cases (multi-document, multi-hop)
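Whatever scorer you choose, the reranking step has the same shape: score each (query, candidate) pair, sort, and keep the best few. The sketch below shows that shape with a pluggable `score_fn`. The word-overlap scorer is a toy stand-in for a cross-encoder, not a real relevance model, and both function names are hypothetical.

```python
def rerank(query, candidates, score_fn, top_k=3):
    """Re-order candidates by score_fn(query, text) and keep the best top_k."""
    scored = [(text, score_fn(query, text)) for text in candidates]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return [text for text, _ in scored[:top_k]]


def word_overlap_score(query, text):
    """Toy stand-in for a cross-encoder: fraction of query words found in text."""
    query_words = set(query.lower().split())
    text_words = set(text.lower().split())
    return len(query_words & text_words) / len(query_words) if query_words else 0.0


candidates = [
    "store opening hours and holiday schedule",
    "our return policy allows returns within 30 days",
    "shipping policy for international orders",
]
best = rerank("return policy", candidates, word_overlap_score, top_k=2)
print(best)
```

Swapping in a real cross-encoder or an LLM judge only changes `score_fn`; the surrounding pipeline stays the same.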
Late Interaction Models (ColBERT)
ColBERT introduces a middle ground between dense retrieval and cross-encoder reranking. It uses late interaction: query and document are encoded separately, then compared token-by-token.

    Query: "return policy" → [q1, q2] (query token vectors)
    Document: "we accept returns within..." → [d1, d2, d3, ..., dn] (doc token vectors)
    Match: For each query token q_i, find max similarity with any document token d_j
      q1("return") matches d3("returns") → 0.9
      q2("policy") best match is d1("we") → 0.3 (no good match)
    Score: sum of max similarities = 0.9 + 0.3 = 1.2
    (Averaging instead gives 0.6 and the same ranking for a fixed query.)

Pros:
- More accurate than standard dense retrieval (token-level matching)
- Can be pre-computed (document embeddings are static)
- Efficient at query time (only compare query tokens to pre-computed doc embeddings)
Cons:
- More storage (store per-token embeddings, not a single vector)
- Slower than standard dense retrieval (more comparisons)
- Fewer deployment options (main implementation is ColBERTv2)
Best for: High-accuracy retrieval where standard vector search isn’t enough but full cross-encoder reranking is too expensive.
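The late-interaction score itself is short enough to write out. ColBERT's MaxSim operator sums, over query tokens, the best similarity against any document token (averaging instead would give the same ranking for a fixed query). The 2-dimensional unit vectors below are invented for illustration, and the helper names are hypothetical.

```python
def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def maxsim_score(query_vectors, doc_vectors):
    """For each query token take its best-matching document token, then sum."""
    return sum(max(dot(q, d) for d in doc_vectors) for q in query_vectors)


# Toy unit-length token vectors, one per token.
query = [[1.0, 0.0], [0.0, 1.0]]   # e.g. "return", "policy"
doc_a = [[1.0, 0.0], [0.6, 0.8]]   # has a close match for both query tokens
doc_b = [[0.6, 0.8], [0.6, 0.8]]   # weaker match for the first query token
score_a = maxsim_score(query, doc_a)
score_b = maxsim_score(query, doc_b)
print(score_a, score_b)
```

Because the document token vectors do not depend on the query, they can be computed once at indexing time; only the max-and-sum runs per query.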
Advanced Retrieval Patterns
Query rewriting: Transform the user’s raw query into a better search query before retrieval.

    User question: "How do I cancel?"
      ↓ (LLM rewrites)
    Search query: "cancellation policy subscription termination refund"

Multi-query retrieval: Generate multiple queries for a single user question.

    User question: "Compare our products"
      ↓ (LLM generates variations)
    Queries:
    1. "product A features pricing"
    2. "product B features pricing"
    3. "A vs B comparison"

Each query is searched independently. Results are merged and deduplicated.
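The merge-and-deduplicate step can be sketched as follows, assuming each search returns (doc_id, score) pairs whose scores are comparable across queries. Keeping each document's best score is one simple policy; rank fusion is another. The function name and sample scores are made up for the example.

```python
def merge_results(result_lists):
    """Merge per-query result lists, keeping each document's best score once."""
    best = {}
    for results in result_lists:
        for doc_id, score in results:
            if doc_id not in best or score > best[doc_id]:
                best[doc_id] = score
    # Highest score first; ties broken by doc_id for a stable order.
    return sorted(best.items(), key=lambda item: (-item[1], item[0]))


# (doc_id, similarity) pairs for three query variations.
results_per_query = [
    [("doc_a", 0.82), ("doc_c", 0.64)],
    [("doc_b", 0.79), ("doc_c", 0.71)],
    [("doc_a", 0.77), ("doc_b", 0.75)],
]
merged = merge_results(results_per_query)
print(merged)
```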
HyDE (Hypothetical Document Embeddings):
Generate a hypothetical answer first, then use that to search:

    User question: "What's the return policy for electronics?"
      ↓
    LLM generates hypothetical answer: "Electronics can be returned within 30 days if unopened..."
      ↓
    Embed the hypothetical answer (not the question)
      ↓
    Search with this embedding (more likely to match relevant documents)

HyDE works because the hypothetical answer is semantically closer to the actual relevant documents than the original question.
Step-back prompting for retrieval:
Retrieve at a higher level of abstraction first, then narrow down.

    User question: "Can I return a laptop after 2 weeks?"
      ↓ Step back
    Concept question: "Electronics return policy"
      ↓ Retrieve
    Retrieved: "Electronics: 30-day return window, must include all accessories"
      ↓ Narrow
    Specific answer: "Yes, a laptop can be returned within 2 weeks."

Production Considerations
1. Chunking Size
Too small (100 tokens):
- Pro: Precise retrieval
- Con: LLM loses context
Too large (2000 tokens):
- Pro: Full context
- Con: Retrieves irrelevant stuff
Goldilocks (512-1024 tokens): Usually best
2. Overlap (if using)
No overlap: Fast search, but boundaries lose context
50-token overlap: Extra storage, better results
3. Reranking
After retrieving top-10 from vector DB, rerank with:
- Cross-encoder: Slow but accurate
- Query likelihood: Fast, decent
- LLM-based: Expensive but smart
4. Context Window
Always leave room for the question + response:

    buffer = 1000  # tokens reserved for the question and the answer
    max_context_size = model_context_window - buffer
    retrieved_chunks = retrieve_up_to(max_context_size)

5. Error Handling
What if no documents match?
- Return “No information found”
- Fall back to general LLM knowledge
- Ask user for clarification
What if too many documents match?
- Take top-K (usually 3-5)
- Re-rank and keep best
- Use filtering if available
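Both cases can be handled in one small selection step, sketched below. It assumes retrieval returns (chunk, score) pairs with similarity scores between 0 and 1. The `min_score=0.5` threshold, the function names, and the `generate` callable (a placeholder for your LLM call) are all assumptions to tune or replace.

```python
NO_ANSWER = "No information found."


def select_context(results, min_score=0.5, top_k=3):
    """Drop weak matches, then keep at most top_k chunks, best first."""
    relevant = [(chunk, score) for chunk, score in results if score >= min_score]
    relevant.sort(key=lambda pair: pair[1], reverse=True)
    return [chunk for chunk, _ in relevant[:top_k]]


def answer_or_fallback(results, generate):
    """Call generate(chunks) only when retrieval found something usable."""
    chunks = select_context(results)
    if not chunks:
        return NO_ANSWER  # alternatively, ask the user to rephrase
    return generate(chunks)


def echo(chunks):
    """Placeholder for the LLM call: just returns the context it was given."""
    return " ".join(chunks)


weak_results = [("Office hours are 9 to 5.", 0.21)]
strong_results = [
    ("Returns accepted within 30 days.", 0.88),
    ("Gift cards are final sale.", 0.35),
]
fallback = answer_or_fallback(weak_results, echo)
answer = answer_or_fallback(strong_results, echo)
print(fallback)
print(answer)
```

Returning an explicit fallback instead of calling the model with empty context keeps the system from answering out of general knowledge when you did not intend it to.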
Common Mistakes
❌ No overlap between chunks → Context lost at boundaries
✅ Use 50-token overlap
❌ Chunks too large (>1500 tokens) → Includes irrelevant content
✅ Use 512-1024 tokens
❌ Only using vector search → Fails on keywords
✅ Use hybrid (vector + BM25)
❌ Not re-ranking results → Suboptimal retrieval
✅ Re-rank top-10 to top-3
❌ Stale embeddings → Miss new documents
✅ Re-embed regularly or use live embeddings
Implementation Checklist
- Collect your documents
- Parse documents (PDFs, web, etc.)
- Split into chunks (512-1024 tokens, 50-token overlap)
- Generate embeddings (use text-embedding-3-small)
- Store in vector DB (start with Chroma for prototyping)
- Build retrieval function (dense + BM25 if possible)
- Add re-ranking (cross-encoder or LLM)
- Test on known questions
- Monitor retrieval quality
- Iterate on chunk size / retrieval strategy
Example: From Zero to RAG

    # Simplified pseudocode: load_pdfs, split_into_chunks, embed_model, Chroma and llm
    # stand for your own loader, splitter, embedding client, vector store wrapper and LLM client.

    # 1. Load documents
    documents = load_pdfs("./docs/")

    # 2. Split into chunks
    chunks = split_into_chunks(documents, chunk_size=512, overlap=50)

    # 3. Embed
    embeddings = [embed_model.embed(chunk) for chunk in chunks]

    # 4. Store in vector DB
    vector_db = Chroma()
    for chunk, embedding in zip(chunks, embeddings):
        vector_db.add(text=chunk, embedding=embedding)

    # 5. Build retrieval
    def retrieve(query, top_k=5):
        query_vector = embed_model.embed(query)
        return vector_db.search(query_vector, top_k=top_k)

    # 6. Build RAG chain
    def answer_question(question):
        # Assumes retrieve() returns a list of chunk strings.
        context = "\n\n".join(retrieve(question))
        prompt = f"Context:\n{context}\n\nQuestion: {question}"
        return llm.generate(prompt)

    # 7. Use it
    print(answer_question("What's our return policy?"))

Measuring RAG Quality
Retrieval Metrics:
- Precision: % of retrieved docs relevant
- Recall: % of relevant docs retrieved
- MRR (Mean Reciprocal Rank): How high is first relevant doc?
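These three retrieval metrics are straightforward to compute once you have a labeled set of relevant documents per question. The sketch below uses the standard definitions; the document IDs and relevance labels are invented for the example.

```python
def precision_at_k(retrieved, relevant, k):
    """Share of the top-k retrieved documents that are relevant."""
    top = retrieved[:k]
    return sum(1 for doc in top if doc in relevant) / len(top) if top else 0.0


def recall_at_k(retrieved, relevant, k):
    """Share of all relevant documents that appear in the top k."""
    if not relevant:
        return 0.0
    return sum(1 for doc in retrieved[:k] if doc in relevant) / len(relevant)


def mean_reciprocal_rank(runs):
    """Average of 1/rank of the first relevant document over (retrieved, relevant) runs."""
    total = 0.0
    for retrieved, relevant in runs:
        for rank, doc in enumerate(retrieved, start=1):
            if doc in relevant:
                total += 1 / rank
                break  # only the first relevant hit counts
    return total / len(runs) if runs else 0.0


retrieved = ["d3", "d1", "d7", "d2"]
relevant = {"d1", "d2", "d5"}
p = precision_at_k(retrieved, relevant, k=3)  # 1 of the top 3 is relevant
r = recall_at_k(retrieved, relevant, k=3)     # 1 of the 3 relevant docs is in the top 3
mrr = mean_reciprocal_rank([(retrieved, relevant), (["d5"], relevant)])
print(p, r, mrr)
```

In the first run the first relevant document sits at rank 2 (reciprocal rank 0.5) and in the second at rank 1 (1.0), so the MRR over both runs is 0.75.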
Answer Metrics:
- Relevance: Does answer address the question?
- Accuracy: Is answer correct?
- Groundedness: Is answer based on provided context?
User Metrics:
- Helpfulness: Did user get what they needed?
- Satisfaction: Would they use this again?
- Time to resolution: How quickly did they get answer?
When RAG Isn’t Enough
- You need reasoning: Add agents/chains (tool use for follow-ups)
- You need multi-hop questions: Iterative RAG or graph-based retrieval
- You need structured data: Add SQL / structured query capability
- You need real-time data: Stream updates or use live APIs
See Also:
- Builder Path - Hands-on RAG implementation
- Frameworks Guide - LangChain for RAG
- How LLMs Work - Understanding embeddings