RAG Architecture: Complete Guide


Retrieval-Augmented Generation: building knowledge systems that ground LLMs in your data, with a deep dive into retrieval technology

Key Takeaways

RAG has 3 stages: indexing (chunk, embed, store), retrieval (find relevant chunks), generation (LLM answers with context)
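Hybrid search (dense + sparse/BM25) typically improves retrieval recall over either method alone; 10-20% is this guide’s working estimate, and the actual gain depends on your corpus and queries
Embedding model choice strongly affects quality; E5-Mistral and BGE-M3 are among the strongest open options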
Cross-encoder reranking of top-10 candidates is the highest-ROI retrieval improvement

Retrieval-Augmented Generation (RAG) is how you make LLMs answer questions about your data. This guide covers everything from basic concepts to production patterns.

What Is RAG?

The Problem: LLMs have a training cutoff. A model trained on data up to April 2024, for example, won’t know about events in May 2026, and no model has seen your private documents.

The Solution: RAG teaches the LLM your data by searching for relevant documents first, then adding them to the prompt as context.

The Flow:

User Question
    ↓
Search Your Knowledge Base
    ↓
Retrieve Relevant Documents
    ↓
Add Documents to Prompt
    ↓
Send to LLM with Full Context
    ↓
LLM Generates Answer (now informed by your data)

Why It Works: LLMs are excellent at reasoning over provided context. You just need to provide the right context.

The Three Stages of RAG

Stage 1: Indexing (offline; rerun when documents change)

Take your documents and prepare them for search:

Chunking - Break documents into manageable pieces (chunks)
- Why: LLMs have context limits; you can’t send a 1000-page document
- How: Split by paragraph, sentence, or fixed size (e.g., 512 tokens)
- Tradeoff: Smaller chunks = more precise retrieval, but harder for LLM to understand context
Embedding - Convert text into numerical vectors
- Why: Numbers are what vector databases understand
- How: Pass each chunk through an embedding model (e.g., text-embedding-3-small)
- Result: Each chunk becomes a 1536-dimensional vector
Storage - Store vectors in a vector database
- Why: Fast similarity search
- How: Use Pinecone, Weaviate, Qdrant, Chroma, or pgvector
- Result: Searchable knowledge base

Stage 2: Retrieval (At query time)

When a user asks a question:

Convert question to vector - Use same embedding model
Find similar vectors - Vector database does similarity search (cosine, L2, etc.)
Return top-K chunks - Usually top 3-5 most similar chunks
Rank if needed - Re-rank results if you have a better ranker

Example:

User asks: “What’s our return policy?”

Query vector: [0.34, -0.12, 0.89, … 1536 dimensions total]
Database search: Finds chunks about “returns”, “refunds”, “exchange policy”
Return: Top 3 chunks about returns
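
The similarity search in this example can be sketched in a few lines. This is a minimal illustration, assuming toy 3-dimensional vectors with made-up values in place of real 1536-dimensional embeddings, and a brute-force scan in place of the approximate index a vector database would use. The names cosine_similarity and top_k_chunks are hypothetical helpers, not part of any library.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def top_k_chunks(query_vector, index, k=3):
    """Return the k (chunk_text, score) pairs most similar to the query.

    `index` is a list of (chunk_text, vector) pairs. The full scan here
    stands in for the approximate search a vector database would run.
    """
    scored = [(text, cosine_similarity(query_vector, vec)) for text, vec in index]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]

# Toy vectors with made-up values (not real embeddings).
index = [
    ("Returns are accepted within 30 days.", [0.9, 0.1, 0.0]),
    ("Our office is closed on public holidays.", [0.0, 0.2, 0.9]),
    ("Refunds are issued to the original payment method.", [0.8, 0.3, 0.1]),
]
query_vector = [1.0, 0.2, 0.0]  # pretend embedding of "What's our return policy?"
results = top_k_chunks(query_vector, index, k=2)
```

With these toy values the two return-related chunks rank above the unrelated one, which is the behavior the example above describes.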

Stage 3: Generation (At query time)

Send context + question to LLM:

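PROMPT_TEMPLATE = """
Here is context about our company:
{retrieved_chunks}

User question: {user_question}

Answer the question using only the context above.
If the context does not contain the answer, say that you don't know.
"""

# retrieved_chunks: list of chunk strings returned by Stage 2.
# llm: placeholder for whatever function calls your model.
prompt = PROMPT_TEMPLATE.format(
    retrieved_chunks="\n\n".join(retrieved_chunks),
    user_question=user_question,
)
answer = llm(prompt)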

The LLM now has context and can answer accurately.

Chunking Strategies

Your chunking strategy dramatically affects RAG quality.

Strategy 1: Fixed-Size Chunks

Example: Split every 512 tokens

Chunk 1: tokens 0-512
Chunk 2: tokens 512-1024
Chunk 3: tokens 1024-1536

Pros: Simple, predictable
Cons: May split sentences, loses context at boundaries
Use when: You have unstructured text (PDFs, web scrapes)

Strategy 2: Semantic Chunks

Example: Split when topic changes

Chunk 1: "Introduction and Background"
Chunk 2: "Methods and Approach"
Chunk 3: "Results"

Pros: Preserves meaning, better context
Cons: Harder to implement, requires analysis
Use when: You control the source (your documentation)
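
One simple proxy for "split when topic changes" is to split at headings, which works when you control the source format. The sketch below assumes headings are lines that start with "#" (a Markdown-style convention, not something every document follows); split_by_headings is a hypothetical helper. Detecting topic shifts in unstructured text needs more analysis than this.

```python
def split_by_headings(text, heading_prefix="#"):
    """Split text into (heading, body) chunks at lines starting with heading_prefix.

    Text before the first heading is returned with heading None.
    """
    chunks = []
    heading, body = None, []

    def flush():
        # Emit the current section if it has a heading or any non-blank body line.
        if heading is not None or any(line.strip() for line in body):
            chunks.append((heading, "\n".join(body).strip()))

    for line in text.splitlines():
        if line.startswith(heading_prefix):
            flush()
            heading, body = line.lstrip(heading_prefix).strip(), []
        else:
            body.append(line)
    flush()
    return chunks

doc = """# Introduction and Background
RAG grounds answers in your documents.

# Methods and Approach
We index, retrieve, then generate.

# Results
Hybrid search improved recall.
"""
chunks = split_by_headings(doc)
```

Each chunk keeps its heading, so you can store the heading as metadata or prepend it to the chunk text before embedding.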

Strategy 3: Overlapping Chunks

Example: Chunks with 50-token overlap

Chunk 1: tokens 0-512
Chunk 2: tokens 462-974 (overlaps with chunk 1 by 50 tokens)
Chunk 3: tokens 924-1436 (overlaps with chunk 2 by 50 tokens)

Pros: Preserves context across boundaries
Cons: Requires more storage (roughly 10% more chunks at a 50-token overlap), slightly slower search
Use when: Context at boundaries matters (legal docs, technical specs)
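
Fixed-size and overlapping chunking differ only in the step between chunk starts. A minimal sketch, assuming the text has already been tokenized into a list (any tokenizer works; the integers below are stand-ins for tokens) and using the hypothetical helper name chunk_tokens:

```python
def chunk_tokens(tokens, chunk_size=512, overlap=50):
    """Split a token list into chunks of up to chunk_size tokens.

    Consecutive chunks share `overlap` tokens; overlap=0 gives plain
    fixed-size chunking.
    """
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # this chunk already reaches the end of the document
    return chunks

tokens = list(range(1200))  # stand-in for 1200 real tokens
chunks = chunk_tokens(tokens, chunk_size=512, overlap=50)
```

For a 1200-token document this produces three chunks starting at tokens 0, 462, and 924; the last 50 tokens of each chunk repeat as the first 50 of the next.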

Retrieval Strategies

How you search matters.

Strategy 1: Dense Retrieval (Most Common)

How: Convert question to vector, find similar vectors

query_vector = embedding_model.embed("What's your return policy?")
results = vector_db.search(query_vector, top_k=5)

Pros: Fast, good for semantic search
Cons: Fails on keyword-specific queries
When to use: Most RAG systems

Strategy 2: BM25 (Keyword Search)

How: Traditional text search (like Elasticsearch)

results = bm25_index.search("return policy refund", top_k=5)

Pros: Excellent for keywords, fast
Cons: Fails on semantic meaning
When to use: When keywords are important (product searches)

Strategy 3: Hybrid (Best)

How: Combine dense + BM25, re-rank results

dense_results = vector_db.search(query_vector, top_k=10)
bm25_results = bm25_index.search(query_text, top_k=10)
combined = reciprocal_rank_fusion(dense_results, bm25_results)
final = rerank_with_llm(combined, query, top_k=5)

Pros: Best of both worlds
Cons: More complex, slower
When to use: Production systems where accuracy matters

Vector Databases Compared

DB	Best For	Index Type	Scalability	Cost	Complexity
Chroma	Prototyping, local	HNSW	Single node	Free	Easiest
Pinecone	Production, managed	Proprietary (managed)	Auto-scaling	Usage-based	Medium
Weaviate	Self-hosted, scale	HNSW + custom	Multi-node	Free or paid	Medium
Qdrant	High performance, filtering	HNSW + payload index	Multi-node, sharding	Free or cloud	Medium
pgvector	SQL integration	IVFFlat, HNSW	Postgres scale	DB cost	Hard
Milvus	Billion-scale	IVF, HNSW, DiskANN	Distributed	Free or cloud	Hardest

Common RAG Patterns

Pattern 1: Simple Q&A (Naive RAG)

User Question
    ↓ (embed)
Vector Search
    ↓
Top 3 Chunks
    ↓ (add to prompt)
Send to LLM with Context
    ↓
Answer

Pros: Simple, fast
Cons: Fails on complex questions needing multiple documents
Use: Customer support, simple FAQ

Pattern 2: Multi-Document (Fusion)

User Question
    ↓
Retrieve from Multiple Sources
    ↓
Combine & Re-rank
    ↓
Generate with Full Context
    ↓
Answer Synthesized from Multiple Docs

Pros: Handles complex questions
Cons: More expensive, longer context
Use: Research, analysis tasks

Pattern 3: Iterative RAG (with Questions)

User Question
    ↓
Initial Retrieval
    ↓
LLM Generates Follow-up Questions
    ↓
Retrieve Again (for follow-ups)
    ↓
Generate Final Answer with All Context

Pros: Handles multi-step reasoning
Cons: Multiple LLM calls, expensive
Use: Complex research, troubleshooting

Retrieval Technology Deep Dive

The retrieval layer is the most important determinant of RAG quality. A bad retriever means the LLM gets bad context. This section covers everything that happens between “user asks a question” and “chunks go into the prompt.”

Embedding Models

Embedding models convert text to vectors. Not all embedding models are equal — the choice dramatically affects retrieval quality.

How embedding models work:

Text is tokenized (same as LLMs)
Tokens are passed through a transformer, usually an encoder-only model; some newer models, such as E5-Mistral, adapt a decoder-only LLM instead
The final hidden states are pooled into a single vector (for example, by averaging them)
That vector represents the semantic meaning of the input
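
The pooling step can be illustrated without a real model. The sketch below assumes mean pooling followed by L2 normalization, which is one common choice; some models instead take the vector of a special token or of the last token. The 2-dimensional token vectors are made-up values, and mean_pool and l2_normalize are hypothetical helper names.

```python
import math

def mean_pool(token_vectors):
    """Average per-token vectors into a single vector of the same width."""
    if not token_vectors:
        raise ValueError("need at least one token vector")
    dim = len(token_vectors[0])
    count = len(token_vectors)
    return [sum(vec[i] for vec in token_vectors) / count for i in range(dim)]

def l2_normalize(vector):
    """Scale a vector to unit length so dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vector))
    return list(vector) if norm == 0 else [x / norm for x in vector]

# Made-up hidden states for a 3-token input (real models use hundreds of dimensions).
token_vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
pooled = mean_pool(token_vectors)
embedding = l2_normalize(pooled)
```

Normalizing is optional, but it lets the vector database use a plain dot product in place of cosine similarity.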

Comparison of major embedding models (May 2026):

Model	Dimensions	Max Tokens	Best For	Cost
text-embedding-3-small	512-1536	8K	General purpose, cheap	$0.02/1M tokens
text-embedding-3-large	256-3072	8K	High accuracy	$0.13/1M tokens
Cohere Embed v4	256-1536	128K	Multilingual, classification	$0.12/1M tokens
BGE-M3 (BAAI)	1024	8K	Multilingual, open-source	Free (self-host)
E5-Mistral (Microsoft)	4096	8K	High accuracy, open-source	Free (self-host)
Jina Embeddings v3	1024	8K	Task-specific routing	Free (self-host)

Key considerations:

Dimensionality: Higher = more information per vector, but slower search. 768-1536 is the sweet spot for most use cases.
Max tokens: Embedding models have token limits too. Longer documents must be chunked first.
Open vs API: Open-source models (BGE, E5, Jina) can be self-hosted for privacy and zero API costs. API models (OpenAI, Cohere) are simpler but cost money at scale.
Multilingual: If your data has multiple languages, use a multilingual embedding model (Cohere, BGE-M3).

Rule of thumb: Start with text-embedding-3-small (cheap, good quality). Switch to E5-Mistral or BGE-M3 if you need better accuracy at higher scale.

Dense vs Sparse vs Hybrid Retrieval

Dense retrieval (vector search):

Embed both query and documents into dense vectors. Search by cosine similarity or dot product.

Query: "return policy" → [0.3, -0.1, 0.8, ...]  (dense vector)
Document: "we accept returns within 30 days" → [0.35, -0.05, 0.75, ...]  (dense vector)
Similarity: 0.92 (very similar) ✅

Pros: Understands semantics (“how to get a refund” finds return policy)
Cons: Keyword-specific queries fail (“policy document 4042” needs exact match)

Sparse retrieval (BM25 / keyword search):

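Traditional keyword scoring in the TF-IDF family. Each term is weighted by how often it appears in the document (term frequency), how rare it is across the corpus (inverse document frequency), and how long the document is.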

Query: "return policy"
  "return" → weight 0.45
  "policy" → weight 0.55
Document: "we accept returns within 30 days"
  "returns" → weight 0.3, "within" → weight 0.1, "30" → weight 0.15, "days" → weight 0.1
Score: 0.27 (decent: matches on "return/returns", assuming the index stems terms)

Pros: Excellent for exact terms, IDs, proper nouns
Cons: No semantic understanding; “how to get my money back” won’t match “return policy”
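
To make the weighting concrete, here is a compact sketch of Okapi BM25 scoring. It assumes a lowercase whitespace tokenizer with no stemming, the commonly used defaults k1=1.5 and b=0.75, and the "+1 inside the log" IDF variant that keeps weights non-negative. The BM25 class here is written for illustration and is not the API of any particular search library.

```python
import math
from collections import Counter

def tokenize(text):
    """Lowercase whitespace tokenizer; no stemming (an assumption of this sketch)."""
    return text.lower().split()

class BM25:
    def __init__(self, documents, k1=1.5, b=0.75):
        self.k1, self.b = k1, b
        self.docs = [tokenize(d) for d in documents]
        self.avg_len = sum(len(d) for d in self.docs) / len(self.docs)
        # Document frequency: how many documents contain each term.
        self.df = Counter(term for d in self.docs for term in set(d))

    def idf(self, term):
        """Rarer terms get a higher weight."""
        n = self.df.get(term, 0)
        return math.log((len(self.docs) - n + 0.5) / (n + 0.5) + 1)

    def score(self, query, doc_index):
        doc = self.docs[doc_index]
        tf = Counter(doc)
        # Longer-than-average documents are penalized, controlled by b.
        length_norm = 1 - self.b + self.b * len(doc) / self.avg_len
        total = 0.0
        for term in tokenize(query):
            f = tf.get(term, 0)
            total += self.idf(term) * f * (self.k1 + 1) / (f + self.k1 * length_norm)
        return total

documents = [
    "we accept returns within 30 days",
    "shipping takes 3 to 5 business days",
    "policy document 4042 covers returns and refunds",
]
bm25 = BM25(documents)
scores = [bm25.score("policy document 4042", i) for i in range(len(documents))]
```

Only the third document scores above zero for the ID-style query. Because this sketch does not stem, the query "return policy" scores zero against the first document ("returns" is a different token), which is why production keyword indexes usually apply stemming or similar normalization.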

Hybrid retrieval:

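Run both retrievers, then merge their ranked result lists. The standard technique is Reciprocal Rank Fusion (RRF), which combines ranks rather than raw scores, so the two score scales never need to be calibrated against each other: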

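def reciprocal_rank_fusion(dense_results, sparse_results, k=60):
    """Fuse two ranked lists; each result is assumed to expose a unique .id."""
    scores = {}
    # Rank each list separately so positions restart at 1 for each retriever.
    for results in (dense_results, sparse_results):
        for rank, doc in enumerate(results, start=1):
            scores[doc.id] = scores.get(doc.id, 0.0) + 1.0 / (rank + k)
    # Highest fused score first, as a list of (doc_id, score) pairs.
    return sorted(scores.items(), key=lambda x: x[1], reverse=True)

RRF formula: score = sum(1 / (rank + k)) for each document across both rankings, where rank is the document’s position (starting at 1) within each ranking. k=60 is the standard smoothing constant.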

Hybrid is almost always better than either alone. As a working estimate, expect a 10-20% improvement in retrieval recall, and measure on your own queries because the gain varies by corpus.
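
As a worked illustration of the RRF formula, assuming ranks start at 1 within each list and k = 60, compare a document that appears in both rankings with one that appears in only one. The documents d_1 and d_2 are hypothetical.

```latex
\mathrm{RRF}(d) = \sum_{r \in \{\text{dense},\ \text{sparse}\}} \frac{1}{k + \operatorname{rank}_r(d)}, \qquad k = 60

% d_1 is ranked 1st by dense search and 3rd by BM25:
\mathrm{RRF}(d_1) = \frac{1}{60 + 1} + \frac{1}{60 + 3} \approx 0.0164 + 0.0159 = 0.0323

% d_2 is ranked 2nd by dense search and is absent from the BM25 list:
\mathrm{RRF}(d_2) = \frac{1}{60 + 2} \approx 0.0161
```

A document that both retrievers agree on ends up with roughly twice the score of one found by a single retriever, even if that single rank is high.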

Reranking

After initial retrieval (top 10-100), a reranker re-evaluates candidates with a more expensive but more accurate model.

Why rerank?

Vector DB search is fast but imperfect
A reranker can take into account more nuanced signals
Reranking is applied to a small set (top N), so it adds minimal latency

Cross-encoder reranking:

The reranker takes a (query, document) pair and outputs a relevance score:

Input: ("What is the return policy?", "we accept returns within 30 days...")
  ↓
Cross-encoder transformer (BERT-style, processes both together)
  ↓
Score: 0.95 (highly relevant)

Cross-encoders are too slow to run on the full corpus (they process each pair fully) but fast enough for 10-100 candidates.
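
The rerank step itself is a small loop: score every (query, document) pair, sort, keep the top few. The sketch below assumes the scoring model is passed in as a function. token_overlap_score is a deliberately crude stand-in so the example runs without downloading a model; in a real system that function would wrap a cross-encoder. Both function names are hypothetical.

```python
def rerank(query, candidates, score_fn, top_k=3):
    """Score each (query, document) pair and return the top_k (document, score) pairs."""
    scored = [(doc, score_fn(query, doc)) for doc in candidates]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]

def token_overlap_score(query, document):
    """Toy stand-in for a cross-encoder: fraction of query tokens found in the document."""
    query_tokens = set(query.lower().split())
    doc_tokens = set(document.lower().split())
    if not query_tokens:
        return 0.0
    return len(query_tokens & doc_tokens) / len(query_tokens)

# Candidates as they might come back from the first-stage retriever.
candidates = [
    "Shipping takes 3 to 5 business days.",
    "Our return policy allows returns within 30 days.",
    "The privacy policy explains data retention.",
]
top = rerank("return policy", candidates, token_overlap_score, top_k=2)
```

Because the scorer sees the query and document together, cost grows with the number of candidates, which is why the candidate set is kept to the top 10-100.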

Reranker comparison:

Model	Speed (docs/sec)	Quality	Cost
Cohere Rerank v3	~100	Excellent	$1/1K docs
BGE-Reranker-v2	~50	Very good	Free (self-host)
Cross-encoder/ms-marco	~200	Good	Free (self-host)
LLM-as-judge	~5	Best	$0.01/query

When to use which:

Start without reranking — vector search alone is often sufficient for simple Q&A
Add cross-encoder reranking when you need higher accuracy (production systems)
Add LLM reranking only for the hardest cases (multi-document, multi-hop)

Late Interaction Models (ColBERT)

ColBERT introduces a middle ground between dense retrieval and cross-encoder reranking. It uses late interaction — query and document are encoded separately, then compared token-by-token.

Query: "return policy" → [q1, q2]  (query token vectors)
Document: "we accept returns within..." → [d1, d2, d3, ..., dn]  (doc token vectors)

Match: For each query token q_i, find max similarity with any document token d_j
       q1("return") matches d3("returns") → 0.9
       q2("policy") best match is d1("we") → 0.3  (no good match)
Score: sum of max similarities = 0.9 + 0.3 = 1.2

Pros:

More accurate than standard dense retrieval (token-level matching)
Can be pre-computed (document embeddings are static)
Efficient at query time (only compare query tokens to pre-computed doc embeddings)

Cons:

More storage (store per-token embeddings, not a single vector)
Slower than standard dense retrieval (more comparisons)
Fewer deployment options (main implementation is ColBERTv2)

Best for: High-accuracy retrieval where standard vector search isn’t enough but full cross-encoder reranking is too expensive.
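
The token-by-token comparison (often called MaxSim) is short to write down. This sketch assumes dot-product similarity over made-up 2-dimensional token vectors, chosen so the per-token maxima are the 0.9 and 0.3 from the example above. maxsim_score is a hypothetical helper, not the ColBERTv2 API.

```python
def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def maxsim_score(query_vectors, doc_vectors):
    """For each query token take its best-matching document token, then sum the maxima."""
    return sum(max(dot(q, d) for d in doc_vectors) for q in query_vectors)

# Made-up token vectors: q1 ~ "return", q2 ~ "policy".
query_vectors = [[1.0, 0.0], [0.0, 1.0]]
# Document token vectors; these can be computed once at indexing time.
doc_vectors = [[0.3, 0.3], [0.9, 0.1], [0.2, 0.0]]

score = maxsim_score(query_vectors, doc_vectors)
mean_score = score / len(query_vectors)
```

ColBERT's scoring sums the per-token maxima (1.2 here). Dividing by the number of query tokens gives the averaged form (0.6), which produces the same ranking for a given query.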

Advanced Retrieval Patterns

Query rewriting: Transform the user’s raw query into a better search query before retrieval.

User question: "How do I cancel?"
  ↓ (LLM rewrites)
Search query: "cancellation policy subscription termination refund"

Multi-query retrieval: Generate multiple search queries for a single user question.

User question: "Compare our products"
  ↓ (LLM generates variations)
Queries:
1. "product A features pricing"
2. "product B features pricing"
3. "A vs B comparison"

Each query is searched independently. Results are merged and deduplicated.
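
The merge-and-deduplicate step can be sketched as follows, assuming each search returns (doc_id, score) pairs and that keeping a document's best score across queries is an acceptable merge rule (rank fusion is another option). FAKE_INDEX and fake_search are canned stand-ins for a real retriever, and multi_query_retrieve is a hypothetical helper name.

```python
def multi_query_retrieve(queries, search_fn, top_k=5):
    """Run every query, merge the results, and keep each document's best score."""
    best = {}
    for query in queries:
        for doc_id, score in search_fn(query):
            if doc_id not in best or score > best[doc_id]:
                best[doc_id] = score  # deduplicate by document id
    ranked = sorted(best.items(), key=lambda item: item[1], reverse=True)
    return ranked[:top_k]

# Canned results standing in for a real search backend (hypothetical values).
FAKE_INDEX = {
    "product A features pricing": [("doc_a", 0.91), ("doc_cmp", 0.60)],
    "product B features pricing": [("doc_b", 0.88), ("doc_cmp", 0.62)],
    "A vs B comparison": [("doc_cmp", 0.95), ("doc_a", 0.55)],
}

def fake_search(query):
    return FAKE_INDEX.get(query, [])

merged = multi_query_retrieve(list(FAKE_INDEX), fake_search, top_k=3)
```

The comparison document shows up in all three result lists but appears once in the merged output, with its highest score.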

HyDE (Hypothetical Document Embeddings):

Generate a hypothetical answer first, then use that to search:

User question: "What's the return policy for electronics?"
  ↓
LLM generates hypothetical answer: "Electronics can be returned within 30 days if unopened..."
  ↓
Embed the hypothetical answer (not the question)
  ↓
Search with this embedding (more likely to match relevant documents)

HyDE works because the hypothetical answer is semantically closer to the actual relevant documents than the original question.
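
The control flow of HyDE is small enough to sketch with the model calls passed in as functions. Everything prefixed with toy_ below is a stand-in invented for this example: a canned "LLM" answer, a word-count "embedding" over a six-word vocabulary, and a brute-force search over two documents. Only hyde_search shows the technique itself.

```python
def hyde_search(question, generate_fn, embed_fn, search_fn, top_k=3):
    """Generate a hypothetical answer, embed it, and search with that embedding."""
    hypothetical_answer = generate_fn(question)
    query_vector = embed_fn(hypothetical_answer)  # embed the answer, not the question
    return search_fn(query_vector, top_k)

VOCAB = ["return", "returned", "refund", "days", "shipping", "electronics"]

def toy_embed(text):
    """Stand-in embedding: counts of a few vocabulary words."""
    cleaned = text.lower().replace(".", " ").replace(",", " ").replace("?", " ")
    words = cleaned.split()
    return [words.count(term) for term in VOCAB]

def toy_generate(question):
    """Stand-in for an LLM call; returns a canned hypothetical answer."""
    return "Electronics can be returned within 30 days if unopened."

DOCS = [
    "Electronics may be returned within 30 days of delivery.",
    "Shipping is free on orders over 50 dollars.",
]

def toy_search(query_vector, top_k):
    """Brute-force dot-product search over DOCS."""
    scored = [
        (doc, sum(q * d for q, d in zip(query_vector, toy_embed(doc))))
        for doc in DOCS
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]

results = hyde_search(
    "What's the return policy for electronics?",
    toy_generate, toy_embed, toy_search, top_k=1,
)
```

The hypothetical answer does not need to be correct; it only needs to use the vocabulary and phrasing of the documents you want to find. The cost is one extra LLM call per query.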

Step-back prompting for retrieval:

Retrieve at a higher level of abstraction first, then narrow down.

User question: "Can I return a laptop after 2 weeks?"
  ↓ Step back
Concept question: "Electronics return policy"
  ↓ Retrieve
Retrieved: "Electronics: 30-day return window, must include all accessories"
  ↓ Narrow
Specific answer: "Yes. Two weeks is within the 30-day window, as long as all accessories are included."

Production Considerations

1. Chunking Size

Too small (100 tokens):

Pro: Precise retrieval
Con: LLM loses context

Too large (2000 tokens):

Pro: Full context
Con: Retrieves irrelevant stuff

Goldilocks (512-1024 tokens): Usually best

2. Overlap (if using)

No overlap: Fast search, but boundaries lose context
50-token overlap: Extra storage, better results

3. Reranking

After retrieving top-10 from vector DB, rerank with:

Cross-encoder: Slow but accurate
Query likelihood: Fast, decent
LLM-based: Expensive but smart

4. Context Window

Always leave room for the question + response:

buffer = 1000  # tokens reserved for the question + answer
max_context_size = model_context_window - buffer

# The buffer is already subtracted above; do not subtract it a second time.
retrieved_chunks = retrieve_up_to(max_context_size)
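
One way to implement the budget is to add chunks in rank order until the next one would not fit. This is a sketch under two assumptions: chunks arrive sorted best-first, and a whitespace word count is a good enough token estimate for illustration (a real tokenizer will give different numbers). pack_chunks and count_tokens are hypothetical helpers.

```python
def count_tokens(text):
    """Rough whitespace token count; swap in your model's tokenizer for real use."""
    return len(text.split())

def pack_chunks(chunks, model_context_window, buffer=1000):
    """Keep best-first chunks until the context budget is used up.

    Returns (selected_chunks, tokens_used). The buffer is subtracted once,
    leaving room for the question and the model's answer.
    """
    budget = model_context_window - buffer
    selected, used = [], 0
    for chunk in chunks:
        cost = count_tokens(chunk)
        if used + cost > budget:
            break  # stop at the first chunk that does not fit, preserving rank order
        selected.append(chunk)
        used += cost
    return selected, used

# Three 400-token chunks against a 2000-token window with a 1000-token buffer.
chunks = ["a " * 400, "b " * 400, "c " * 400]
selected, used = pack_chunks(chunks, model_context_window=2000, buffer=1000)
```

Here two chunks fit (800 of the 1000 available tokens) and the third is dropped. Stopping at the first chunk that does not fit keeps the selection in rank order; skipping it and trying smaller, lower-ranked chunks is a reasonable alternative.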

5. Error Handling

What if no documents match?

Return “No information found”
Fall back to general LLM knowledge
Ask user for clarification

What if too many documents match?

Take top-K (usually 3-5)
Re-rank and keep best
Use filtering if available
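
Both cases can be handled in one place before the prompt is built. The sketch below assumes retrieval returns (chunk_text, similarity) pairs and uses a similarity cutoff of 0.5, which is an arbitrary illustrative value to tune on your own data. select_context and answer_or_fallback are hypothetical names, and generate_fn stands in for your LLM call.

```python
NO_ANSWER = "No information found."

def select_context(results, min_score=0.5, top_k=3):
    """Drop weak matches, then keep at most top_k of the strongest."""
    relevant = [(text, score) for text, score in results if score >= min_score]
    relevant.sort(key=lambda pair: pair[1], reverse=True)
    return relevant[:top_k]

def answer_or_fallback(question, results, generate_fn, min_score=0.5, top_k=3):
    """Answer from context when there is any; otherwise return a fixed fallback."""
    context = select_context(results, min_score, top_k)
    if not context:
        return NO_ANSWER  # nothing matched well enough to ground an answer
    context_text = "\n\n".join(text for text, _ in context)
    return generate_fn(f"Context:\n{context_text}\n\nQuestion: {question}")

def echo(prompt):
    """Stand-in for an LLM call: returns the prompt so the flow can be inspected."""
    return prompt

results = [
    ("Returns accepted within 30 days.", 0.82),
    ("Office hours are 9 to 5.", 0.21),
]
answer = answer_or_fallback("What's our return policy?", results, echo)
```

Returning a fixed message is the most conservative of the three fallbacks listed above. Falling back to general LLM knowledge or asking for clarification would replace the NO_ANSWER branch.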

Common Mistakes

❌ No overlap between chunks → Context lost at boundaries
✅ Use 50-token overlap

❌ Chunks too large (>1500 tokens) → Includes irrelevant content
✅ Use 512-1024 tokens

❌ Only using vector search → Fails on keywords
✅ Use hybrid (vector + BM25)

❌ Not re-ranking results → Suboptimal retrieval
✅ Re-rank top-10 to top-3

❌ Stale embeddings → Miss new documents
✅ Re-embed on a schedule, or embed documents as they are added or changed

Implementation Checklist

Example: From Zero to RAG

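import chromadb

# NOTE: load_pdfs, split_into_chunks, embed_model, and llm are placeholders
# for your own loader, chunker, embedding client, and LLM client.

# 1. Load documents
documents = load_pdfs("./docs/")

# 2. Split into chunks (a list of strings)
chunks = split_into_chunks(documents, chunk_size=512, overlap=50)

# 3. Embed (each embedding is a list of floats)
embeddings = [embed_model.embed(chunk) for chunk in chunks]

# 4. Store in vector DB (in-memory Chroma collection)
client = chromadb.Client()
collection = client.create_collection(name="docs")
collection.add(
    ids=[f"chunk-{i}" for i in range(len(chunks))],
    documents=chunks,
    embeddings=embeddings,
)

# 5. Build retrieval
def retrieve(query, top_k=5):
    query_vector = embed_model.embed(query)
    results = collection.query(query_embeddings=[query_vector], n_results=top_k)
    return results["documents"][0]  # chunk texts for the single query

# 6. Build RAG chain
def answer_question(question):
    # Join chunk texts so the prompt contains readable context, not a Python list.
    context = "\n\n".join(retrieve(question))
    prompt = (
        "Answer using only the context below. "
        "If the answer is not in the context, say you don't know.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )
    return llm.generate(prompt)

# 7. Use it
print(answer_question("What's our return policy?"))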

Measuring RAG Quality

Retrieval Metrics:

Precision: % of retrieved docs relevant
Recall: % of relevant docs retrieved
MRR (Mean Reciprocal Rank): How high is first relevant doc?
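
These three metrics are straightforward to compute once you have a small labeled set of queries with known relevant document ids. A minimal sketch, assuming each retrieved list contains unique ids ordered best-first; the ids below are made up.

```python
def precision_at_k(retrieved, relevant, k):
    """Fraction of the top-k retrieved documents that are relevant."""
    top = retrieved[:k]
    if not top:
        return 0.0
    return sum(1 for doc_id in top if doc_id in relevant) / len(top)

def recall_at_k(retrieved, relevant, k):
    """Fraction of all relevant documents that appear in the top k."""
    if not relevant:
        return 0.0
    return sum(1 for doc_id in retrieved[:k] if doc_id in relevant) / len(relevant)

def reciprocal_rank(retrieved, relevant):
    """1 / rank of the first relevant document, or 0 if none was retrieved."""
    for rank, doc_id in enumerate(retrieved, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0

def mean_reciprocal_rank(runs):
    """Average reciprocal rank over (retrieved, relevant) pairs, one per query."""
    if not runs:
        return 0.0
    return sum(reciprocal_rank(ret, rel) for ret, rel in runs) / len(runs)

retrieved = ["d3", "d1", "d7", "d2"]
relevant = {"d1", "d2", "d9"}

p_at_3 = precision_at_k(retrieved, relevant, 3)
r_at_3 = recall_at_k(retrieved, relevant, 3)
mrr = mean_reciprocal_rank([(retrieved, relevant), (["d9"], relevant)])
```

In this example one of the top three results is relevant (precision 1/3), one of the three relevant documents was found in the top three (recall 1/3), and the first relevant hit is at rank 2 for the first query and rank 1 for the second (MRR 0.75).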

Answer Metrics:

Relevance: Does answer address the question?
Accuracy: Is answer correct?
Groundedness: Is answer based on provided context?

User Metrics:

Helpfulness: Did user get what they needed?
Satisfaction: Would they use this again?
Time to resolution: How quickly did they get answer?

When RAG Isn’t Enough

You need reasoning: Add agents/chains (tool use for follow-ups)
Your questions are multi-hop: Use iterative RAG or graph-based retrieval
You need structured data: Add SQL / structured query capability
You need real-time data: Stream updates or use live APIs