
RAG Architecture: Complete Guide

📖 15 min read · Tags: deep-dive, rag, architecture, retrieval, embeddings
Retrieval-Augmented Generation - building knowledge systems that teach LLMs your data, including retrieval technology deep dive
Key Takeaways
  • RAG has 3 stages: indexing (chunk, embed, store), retrieval (find relevant chunks), generation (LLM answers with context)
  • Hybrid search (dense + sparse/BM25) gives 10-20% better retrieval recall than either alone
  • Embedding model choice strongly affects retrieval quality; E5-Mistral and BGE-M3 are strong open options
  • Cross-encoder reranking of top-10 candidates is the highest-ROI retrieval improvement

Retrieval-Augmented Generation (RAG) is how you make LLMs answer questions about your data. This guide covers everything from basic concepts to production patterns.




What Is RAG?

The Problem: LLMs have a training cutoff, and they have never seen your private data. A model whose training data ends in April 2024 knows nothing about events in May 2026, or about your internal documents.

The Solution: RAG teaches the LLM your data by searching for relevant documents first, then adding them to the prompt as context.

The Flow:

User Question
Search Your Knowledge Base
Retrieve Relevant Documents
Add Documents to Prompt
Send to LLM with Full Context
LLM Generates Answer (now informed by your data)

Why It Works: LLMs are excellent at reasoning over provided context. You just need to provide the right context.


The Three Stages of RAG

Stage 1: Indexing (Offline, happens once)

Take your documents and prepare them for search:

  1. Chunking - Break documents into manageable pieces (chunks)

    • Why: LLMs have context limits; you can’t send a 1000-page document
    • How: Split by paragraph, sentence, or fixed size (e.g., 512 tokens)
    • Tradeoff: Smaller chunks = more precise retrieval, but harder for LLM to understand context
  2. Embedding - Convert text into numerical vectors

    • Why: Numbers are what vector databases understand
    • How: Pass each chunk through an embedding model (e.g., text-embedding-3-small)
    • Result: Each chunk becomes a 1536-dimensional vector
  3. Storage - Store vectors in a vector database

    • Why: Fast similarity search
    • How: Use Pinecone, Weaviate, Qdrant, Chroma, or pgvector
    • Result: Searchable knowledge base

Stage 2: Retrieval (At query time)

When a user asks a question:

  1. Convert question to vector - Use same embedding model
  2. Find similar vectors - Vector database does similarity search (cosine, L2, etc.)
  3. Return top-K chunks - Usually top 3-5 most similar chunks
  4. Rerank if needed - Re-score the results when a stronger ranker (e.g., a cross-encoder) is available

Example:

User asks: “What’s our return policy?”

  • Query vector: [0.34, -0.12, 0.89, … 1536 dimensions total]
  • Database search: Finds chunks about “returns”, “refunds”, “exchange policy”
  • Return: Top 3 chunks about returns
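The retrieval steps above can be sketched as a brute-force cosine search. This is a minimal illustration, not how a vector database works internally (those use approximate indexes such as HNSW). The names `cosine_similarity` and `top_k_chunks` and the 3-dimensional vectors are made up for the example.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def top_k_chunks(query_vector, index, k=3):
    """Brute-force search: score every stored chunk, return the k best.

    `index` is a list of (chunk_text, vector) pairs.
    """
    scored = [(cosine_similarity(query_vector, vec), text) for text, vec in index]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]

# Toy 3-dimensional vectors; real embeddings have hundreds of dimensions.
index = [
    ("Returns are accepted within 30 days.", [0.9, 0.1, 0.0]),
    ("Our office is closed on public holidays.", [0.0, 0.2, 0.9]),
    ("Refunds are issued to the original payment method.", [0.8, 0.3, 0.1]),
]
query_vector = [1.0, 0.2, 0.0]  # pretend embedding of "What's our return policy?"
results = top_k_chunks(query_vector, index, k=2)
for score, text in results:
    print(f"{score:.3f}  {text}")
```

The two chunks about returns and refunds rank above the unrelated one because their vectors point in nearly the same direction as the query vector.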

Stage 3: Generation (At query time)

Send context + question to LLM:

prompt = """
Here is context about our company:
{retrieved_chunks}
User question: {user_question}
Answer the question based on the context above.
"""
answer = llm(prompt)

The LLM now has context and can answer accurately.
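One way to assemble such a prompt is sketched below. The function name `build_grounded_prompt`, the chunk numbering, and the exact instruction wording are assumptions for illustration; the idea is only that each retrieved chunk is clearly delimited and the model is told to stay within the context.

```python
def build_grounded_prompt(chunks, question):
    """Assemble a prompt that numbers each chunk and asks for grounded answers."""
    context = "\n\n".join(f"[{i}] {chunk}" for i, chunk in enumerate(chunks, start=1))
    return (
        "Answer the question using only the context below. "
        "If the context does not contain the answer, say you don't know.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )

prompt = build_grounded_prompt(
    ["Returns are accepted within 30 days.", "Refunds go to the original payment method."],
    "What's our return policy?",
)
print(prompt)
```

Numbering the chunks also makes it possible to ask the model to cite which chunk supports its answer.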


Chunking Strategies

Your chunking strategy dramatically affects RAG quality.

Strategy 1: Fixed-Size Chunks

Example: Split every 512 tokens

Chunk 1: tokens 0-512
Chunk 2: tokens 512-1024
Chunk 3: tokens 1024-1536

Pros: Simple, predictable
Cons: May split sentences, loses context at boundaries
Use when: You have unstructured text (PDFs, web scrapes)

Strategy 2: Semantic Chunks

Example: Split when topic changes

Chunk 1: "Introduction and Background"
Chunk 2: "Methods and Approach"
Chunk 3: "Results"

Pros: Preserves meaning, better context
Cons: Harder to implement, requires analysis
Use when: You control the source (your documentation)

Strategy 3: Overlapping Chunks

Example: Chunks with 50-token overlap

Chunk 1: tokens 0-512
Chunk 2: tokens 462-974 (overlaps with chunk 1)
Chunk 3: tokens 924-1436 (overlaps with chunk 2)

Pros: Preserves context across boundaries
Cons: Requires more storage (about 10% more at a 50-token overlap; a 50% overlap doubles it), slower search
Use when: Context at boundaries matters (legal docs, technical specs)
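Fixed-size chunking with overlap can be sketched in a few lines. This is an illustration under simple assumptions: `chunk_tokens` is a made-up helper, and the "tokens" are placeholder strings rather than output from a real tokenizer.

```python
def chunk_tokens(tokens, chunk_size=512, overlap=50):
    """Split a token list into fixed-size chunks that share `overlap` tokens."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in [0, chunk_size)")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the last chunk already reaches the end of the document
    return chunks

# Placeholder "tokens" keep the sketch dependency-free; a real pipeline
# would count tokens with the embedding model's tokenizer.
tokens = [f"t{i}" for i in range(1200)]
chunks = chunk_tokens(tokens, chunk_size=512, overlap=50)
print([len(c) for c in chunks])
```

Setting `overlap=0` gives plain fixed-size chunks, so the same helper covers both strategies.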


Retrieval Strategies

How you search matters.

Strategy 1: Dense Retrieval (Most Common)

How: Convert question to vector, find similar vectors

query_vector = embedding_model.embed("What's your return policy?")
results = vector_db.search(query_vector, top_k=5)

Pros: Fast, good for semantic search
Cons: Fails on keyword-specific queries
When to use: Most RAG systems

Strategy 2: Sparse Retrieval (BM25)

How: Traditional keyword search (BM25, as used by Elasticsearch)

results = bm25_index.search("return policy refund", top_k=5)

Pros: Excellent for keywords, fast
Cons: Fails on semantic meaning
When to use: When keywords are important (product searches)

Strategy 3: Hybrid (Best)

How: Combine dense + BM25, re-rank results

dense_results = vector_db.search(query_vector, top_k=10)
bm25_results = bm25_index.search(query_text, top_k=10)
combined = reciprocal_rank_fusion(dense_results, bm25_results)
final = rerank_with_llm(combined, query, top_k=5)

Pros: Best of both worlds
Cons: More complex, slower
When to use: Production systems where accuracy matters


Vector Databases Compared

| DB | Best For | Index Type | Scalability | Cost | Complexity |
|---|---|---|---|---|---|
| Chroma | Prototyping, local | Brute force (HNSW optional) | Single node | Free | Easiest |
| Pinecone | Production, managed | HNSW | Auto-scaling | $0.04/1K vectors | Medium |
| Weaviate | Self-hosted, scale | HNSW + custom | Multi-node | Free or paid | Medium |
| Qdrant | High performance, filtering | HNSW + payload index | Multi-node, sharding | Free or cloud | Medium |
| pgvector | SQL integration | IVFFlat, HNSW | Postgres scale | DB cost | Hard |
| Milvus | Billion-scale | IVF, HNSW, DiskANN | Distributed | Free or cloud | Hardest |

Common RAG Patterns

Pattern 1: Simple Q&A (Naive RAG)

User Question
↓ (embed)
Vector Search
Top 3 Chunks
↓ (add to prompt)
Send to LLM with Context
Answer

Pros: Simple, fast
Cons: Fails on complex questions needing multiple documents
Use: Customer support, simple FAQ

Pattern 2: Multi-Document (Fusion)

User Question
Retrieve from Multiple Sources
Combine & Re-rank
Generate with Full Context
Answer Synthesized from Multiple Docs

Pros: Handles complex questions
Cons: More expensive, longer context
Use: Research, analysis tasks

Pattern 3: Iterative RAG (with Questions)

User Question
Initial Retrieval
LLM Generates Follow-up Questions
Retrieve Again (for follow-ups)
Generate Final Answer with All Context

Pros: Handles multi-step reasoning
Cons: Multiple LLM calls, expensive
Use: Complex research, troubleshooting


Retrieval Technology Deep Dive

The retrieval layer is the most important determinant of RAG quality. A bad retriever means the LLM gets bad context. This section covers everything that happens between “user asks a question” and “chunks go into the prompt.”

Embedding Models

Embedding models convert text to vectors. Not all embedding models are equal — the choice dramatically affects retrieval quality.

How embedding models work:

  1. Text is tokenized (same as LLMs)
  2. Passed through a transformer (usually an encoder-only model; some newer models such as E5-Mistral are built on a decoder-only LLM)
  3. The final hidden state is pooled into a single vector
  4. That vector represents the semantic meaning of the input
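The pooling step can be illustrated with mean pooling, one common choice (other models use the first token or the last token instead). The hidden states below are made-up numbers, and `mean_pool` and `l2_normalize` are illustrative helper names, not part of any library.

```python
import math

def mean_pool(token_vectors):
    """Average per-token vectors into one fixed-size vector."""
    dim = len(token_vectors[0])
    count = len(token_vectors)
    return [sum(vec[i] for vec in token_vectors) / count for i in range(dim)]

def l2_normalize(vector):
    """Scale a vector to unit length so dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector] if norm else vector

# Made-up 4-dimensional hidden states for a 3-token input.
hidden_states = [
    [1.0, 0.0, 2.0, 0.0],
    [3.0, 0.0, 0.0, 0.0],
    [2.0, 0.0, 1.0, 0.0],
]
pooled = mean_pool(hidden_states)
embedding = l2_normalize(pooled)
print(pooled, embedding)
```

Whatever the input length, the output has the same dimensionality, which is what lets chunks and queries be compared in one vector space.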

Comparison of major embedding models (May 2026):

| Model | Dimensions | Max Tokens | Best For | Cost |
|---|---|---|---|---|
| text-embedding-3-small | 512-1536 | 8K | General purpose, cheap | $0.02/1M tokens |
| text-embedding-3-large | 256-3072 | 8K | High accuracy | $0.13/1M tokens |
| Cohere Embed v4 | 256-1536 | 128K | Multilingual, classification | $0.12/1M tokens |
| BGE-M3 (BAAI) | 1024 | 8K | Multilingual, open-source | Free (self-host) |
| E5-Mistral (Microsoft) | 4096 | 8K | High accuracy, open-source | Free (self-host) |
| Jina Embeddings v3 | 1024 | 8K | Task-specific routing | Free (self-host) |

Key considerations:

  • Dimensionality: Higher = more information per vector, but slower search. 768-1536 is the sweet spot for most use cases.
  • Max tokens: Embedding models have token limits too. Longer documents must be chunked first.
  • Open vs API: Open-source models (BGE, E5, Jina) can be self-hosted for privacy and zero API costs. API models (OpenAI, Cohere) are simpler but cost money at scale.
  • Multilingual: If your data has multiple languages, use a multilingual embedding model (Cohere, BGE-M3).

Rule of thumb: Start with text-embedding-3-small (cheap, good quality). Switch to E5-Mistral or BGE-M3 if you need better accuracy at higher scale.

Dense vs Sparse vs Hybrid Retrieval

Dense retrieval (vector search):

Embed both query and documents into dense vectors. Search by cosine similarity or dot product.

Query: "return policy" → [0.3, -0.1, 0.8, ...] (dense vector)
Document: "we accept returns within 30 days" → [0.35, -0.05, 0.75, ...] (dense vector)
Similarity: 0.92 (very similar) ✅

Pros: Understands semantics (“how to get a refund” finds return policy)
Cons: Keyword-specific queries fail (“policy document 4042” needs exact match)

Sparse retrieval (BM25 / keyword search):

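Traditional keyword scoring in the TF-IDF family. Each term is weighted by how often it appears in the document and how rare it is across the corpus; BM25 also normalizes for document length.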

Query: "return policy"
"return" → weight 0.45
"policy" → weight 0.55
Document: "we accept returns within 30 days"
"returns" → weight 0.3, "within" → weight 0.1, "30" → weight 0.15, "days" → weight 0.1
Score: 0.27 (decent — matches on "return/returns")

Pros: Excellent for exact terms, IDs, proper nouns
Cons: No semantic understanding: “how to get my money back” won’t match “return policy”
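For readers who want to see where the keyword weights come from, here is a compact BM25 sketch. It uses the common defaults `k1=1.5` and `b=0.75` and the smoothed IDF variant used by Lucene; the function name `bm25_scores` and the toy documents are assumptions for the example. It matches tokens exactly, so "return" and "returns" only match if you add stemming.

```python
import math
from collections import Counter

def bm25_scores(query_terms, docs, k1=1.5, b=0.75):
    """Score each tokenized document in `docs` against `query_terms` with BM25."""
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(term for d in docs for term in set(d))
    scores = []
    for doc in docs:
        term_freq = Counter(doc)
        score = 0.0
        for term in query_terms:
            if term not in term_freq:
                continue
            # Rare terms get a higher inverse document frequency.
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            tf = term_freq[term]
            # Term frequency saturates (k1) and is normalized by length (b).
            norm = tf + k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * tf * (k1 + 1) / norm
        scores.append(score)
    return scores

docs = [
    "we accept returns within 30 days".split(),
    "our office is closed on public holidays".split(),
    "policy document 4042 covers returns".split(),
]
scores = bm25_scores("policy 4042".split(), docs)
best = max(range(len(docs)), key=lambda i: scores[i])
print(best, scores)
```

Only the third document contains "policy" and "4042", so it is the only one with a non-zero score. This is the exact-match behavior that dense retrieval tends to miss.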

Hybrid retrieval:

Combine both scores and merge results. The standard technique is Reciprocal Rank Fusion (RRF):

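def reciprocal_rank_fusion(dense_results, sparse_results, k=60):
    # Each ranked list contributes 1 / (k + rank) to every document it contains.
    scores = {}
    for results in (dense_results, sparse_results):
        # Ranks are 1-based and restart for each list, so a document's
        # position in the sparse list is not shifted by the dense list's length.
        for rank, doc in enumerate(results, start=1):
            scores[doc.id] = scores.get(doc.id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first: a list of (doc_id, score) pairs.
    return sorted(scores.items(), key=lambda x: x[1], reverse=True)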

RRF formula: score = sum(1 / (rank + k)) for each document across both rankings. k=60 is the standard smoothing constant.

Hybrid usually beats either method alone. A 10-20% improvement in retrieval recall is a reasonable expectation, though the gain depends on the corpus and the query mix.

Reranking

After initial retrieval (top 10-100), a reranker re-evaluates candidates with a more expensive but more accurate model.

Why rerank?

  • Vector DB search is fast but imperfect
  • A reranker can take into account more nuanced signals
  • Reranking is applied to a small set (top N), so it adds minimal latency

Cross-encoder reranking:

The reranker takes a (query, document) pair and outputs a relevance score:

Input: ("What is the return policy?", "we accept returns within 30 days...")
Cross-encoder transformer (BERT-style, processes both together)
Score: 0.95 (highly relevant)

Cross-encoders are too slow to run on the full corpus (they process each pair fully) but fast enough for 10-100 candidates.
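The retrieve-then-rerank shape can be sketched with a pluggable scoring function. The `toy_overlap_score` below is only a stand-in so the example runs without a model; in a real system `score_fn` would call a cross-encoder on each (query, document) pair. Both function names are hypothetical.

```python
def rerank(query, candidates, score_fn, top_k=3):
    """Re-order `candidates` by score_fn(query, doc), highest first."""
    scored = [(score_fn(query, doc), doc) for doc in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in scored[:top_k]]

def toy_overlap_score(query, doc):
    """Stand-in scorer (NOT a cross-encoder): fraction of query words in doc."""
    query_words = set(query.lower().split())
    doc_words = set(doc.lower().split())
    return len(query_words & doc_words) / len(query_words)

candidates = [
    "Shipping takes 3-5 business days.",
    "Our return policy allows returns within 30 days.",
    "The privacy policy explains how data is stored.",
]
top = rerank("return policy", candidates, toy_overlap_score, top_k=2)
print(top)
```

Because the scorer runs once per candidate, the cost grows linearly with the candidate count, which is why reranking is applied to 10-100 candidates rather than the whole corpus.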

Reranker comparison:

| Model | Speed (docs/sec) | Quality | Cost |
|---|---|---|---|
| Cohere Rerank v3 | ~100 | Excellent | $1/1K docs |
| BGE-Reranker-v2 | ~50 | Very good | Free (self-host) |
| Cross-encoder/ms-marco | ~200 | Good | Free (self-host) |
| LLM-as-judge | ~5 | Best | $0.01/query |

When to use which:

  • Start without reranking — vector search alone is often sufficient for simple Q&A
  • Add cross-encoder reranking when you need higher accuracy (production systems)
  • Add LLM reranking only for the hardest cases (multi-document, multi-hop)

Late Interaction Models (ColBERT)

ColBERT introduces a middle ground between dense retrieval and cross-encoder reranking. It uses late interaction — query and document are encoded separately, then compared token-by-token.

Query: "return policy" → [q1, q2] (query token vectors)
Document: "we accept returns within..." → [d1, d2, d3, ..., dn] (doc token vectors)
Match: For each query token q_i, find max similarity with any document token d_j
q1("return") matches d3("returns") → 0.9
q2("policy") best matches d1("we") → 0.3 (weak match)
Score: sum of max similarities = 0.9 + 0.3 = 1.2
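The late-interaction score (often called MaxSim) is short enough to write out. ColBERT as published sums the per-query-token maxima; dividing that sum by the number of query tokens gives an averaged variant with the same ranking for a fixed query. The 2-dimensional vectors here are invented for the example.

```python
def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def maxsim_score(query_vectors, doc_vectors):
    """Late-interaction score: for each query token take its best-matching
    document token, then sum those maxima over all query tokens."""
    return sum(max(dot(q, d) for d in doc_vectors) for q in query_vectors)

# Toy unit vectors in 2 dimensions; real token embeddings come from the model.
query_vectors = [[1.0, 0.0], [0.0, 1.0]]
doc_vectors = [[0.6, 0.8], [1.0, 0.0], [0.8, 0.6]]
score = maxsim_score(query_vectors, doc_vectors)
print(score)
```

The document token vectors can be computed and stored ahead of time; only the query vectors and the max/sum step run at query time.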

Pros:

  • More accurate than standard dense retrieval (token-level matching)
  • Can be pre-computed (document embeddings are static)
  • Efficient at query time (only compare query tokens to pre-computed doc embeddings)

Cons:

  • More storage (store per-token embeddings, not a single vector)
  • Slower than standard dense retrieval (more comparisons)
  • Fewer deployment options (main implementation is ColBERTv2)

Best for: High-accuracy retrieval where standard vector search isn’t enough but full cross-encoder reranking is too expensive.

Advanced Retrieval Patterns

Query rewriting: Transform the user’s raw query into a better search query before retrieval.

User question: "How do I cancel?"
↓ (LLM rewrites)
Search query: "cancellation policy subscription termination refund"

Multi-query retrieval: Generate multiple search queries for a single user question.

User question: "Compare our products"
↓ (LLM generates variations)
Queries:
1. "product A features pricing"
2. "product B features pricing"
3. "A vs B comparison"

Each query is searched independently. Results are merged and deduplicated.
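The merge-and-deduplicate step might look like the sketch below. It keeps the best score seen for each document id, which is one simple policy (rank fusion is another). `multi_query_search`, `fake_search`, and the canned results are hypothetical stand-ins for a real retriever.

```python
def multi_query_search(queries, search_fn, top_k=5):
    """Run each query, merge hits, and keep the best score per document id."""
    best = {}
    for query in queries:
        for doc_id, score in search_fn(query):
            if doc_id not in best or score > best[doc_id]:
                best[doc_id] = score
    ranked = sorted(best.items(), key=lambda item: item[1], reverse=True)
    return ranked[:top_k]

# Hypothetical canned results standing in for a real retriever.
FAKE_RESULTS = {
    "product A features pricing": [("doc-a", 0.91), ("doc-cmp", 0.60)],
    "product B features pricing": [("doc-b", 0.88), ("doc-cmp", 0.62)],
    "A vs B comparison": [("doc-cmp", 0.95), ("doc-a", 0.40)],
}

def fake_search(query):
    """Return canned (doc_id, score) hits for a known query."""
    return FAKE_RESULTS.get(query, [])

merged = multi_query_search(list(FAKE_RESULTS), fake_search, top_k=3)
print(merged)
```

Here `doc-cmp` appears in all three result lists but is returned once, with its highest score.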

HyDE (Hypothetical Document Embeddings):

Generate a hypothetical answer first, then use that to search:

User question: "What's the return policy for electronics?"
LLM generates hypothetical answer: "Electronics can be returned within 30 days if unopened..."
Embed the hypothetical answer (not the question)
Search with this embedding (more likely to match relevant documents)

HyDE works because the hypothetical answer is semantically closer to the actual relevant documents than the original question.
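As a rough sketch, HyDE is a three-step wrapper around an existing retriever. The generation, embedding, and search functions are passed in as parameters because they depend on your stack; the `fake_*` versions below are toy stand-ins so the example runs on its own, and the instruction text is an assumption.

```python
def hyde_search(question, generate_fn, embed_fn, search_fn, top_k=5):
    """HyDE: embed a generated hypothetical answer instead of the question."""
    hypothetical_answer = generate_fn(
        f"Write a short passage that answers the question: {question}"
    )
    query_vector = embed_fn(hypothetical_answer)
    return search_fn(query_vector, top_k)

# Stand-ins so the sketch runs without any external service.
def fake_generate(prompt):
    return "Electronics can be returned within 30 days if unopened."

def fake_embed(text):
    # Toy "embedding": counts of two keywords.
    words = text.lower().split()
    return [words.count("returned"), words.count("days")]

def fake_search(vector, top_k):
    return [("electronics-returns", vector)][:top_k]

hits = hyde_search(
    "What's the return policy for electronics?",
    fake_generate, fake_embed, fake_search,
)
print(hits)
```

The hypothetical answer may contain wrong details; that is acceptable because it is used only as a search key and never shown to the user.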

Step-back prompting for retrieval:

Retrieve at a higher level of abstraction first, then narrow down.

User question: "Can I return a laptop after 2 weeks?"
↓ Step back
Concept question: "Electronics return policy"
↓ Retrieve
Retrieved: "Electronics: 30-day return window, must include all accessories"
↓ Narrow
Specific answer: "Yes. Two weeks is inside the 30-day return window, as long as all accessories are included."

Production Considerations

1. Chunking Size

Too small (100 tokens):

  • Pro: Precise retrieval
  • Con: LLM loses context

Too large (2000 tokens):

  • Pro: Full context
  • Con: Retrieves irrelevant stuff

Goldilocks (512-1024 tokens): Usually best

2. Overlap (if using)

No overlap: Fast search, but boundaries lose context
50-token overlap: Extra storage, better results

3. Reranking

After retrieving top-10 from vector DB, rerank with:

  • Cross-encoder: Slow but accurate
  • Query likelihood: Fast, decent
  • LLM-based: Expensive but smart

4. Context Window

Always leave room for the question + response:

buffer = 1000  # tokens reserved for the question + answer
max_context_size = model_context_window - buffer
retrieved_chunks = retrieve_up_to(max_context_size)
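A possible implementation of the budgeting idea is sketched here: walk the chunks in ranked order and stop when the next one would not fit. `pack_chunks` and `approx_token_count` are illustrative names, and the word-count tokenizer is a rough stand-in for the model's real tokenizer.

```python
def approx_token_count(text):
    """Rough stand-in for a tokenizer: counts whitespace-separated words."""
    return len(text.split())

def pack_chunks(ranked_chunks, context_window, buffer=1000,
                count_tokens=approx_token_count):
    """Keep chunks in ranked order until the token budget is used up."""
    budget = context_window - buffer
    selected, used = [], 0
    for chunk in ranked_chunks:
        cost = count_tokens(chunk)
        if used + cost > budget:
            break  # stop at the first chunk that does not fit
        selected.append(chunk)
        used += cost
    return selected, used

chunks = ["alpha " * 400, "beta " * 400, "gamma " * 400]
selected, used = pack_chunks(chunks, context_window=2000, buffer=1000)
print(len(selected), used)
```

Stopping at the first chunk that does not fit keeps the highest-ranked chunks together; skipping it and trying smaller, lower-ranked chunks is an alternative policy.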

5. Error Handling

What if no documents match?

  • Return “No information found”
  • Fall back to general LLM knowledge
  • Ask user for clarification

What if too many documents match?

  • Take top-K (usually 3-5)
  • Re-rank and keep best
  • Use filtering if available
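Both cases can be handled in one small gate in front of the LLM call, sketched below. The similarity threshold of 0.5 is an arbitrary assumption that needs tuning for your embedding model, and `select_context`, `answer_or_fallback`, and `fake_generate` are hypothetical names.

```python
NO_ANSWER = "No information found."

def select_context(hits, min_score=0.5, top_k=3):
    """Filter (score, chunk) hits by a similarity threshold and cap at top_k."""
    relevant = [hit for hit in hits if hit[0] >= min_score]
    relevant.sort(key=lambda hit: hit[0], reverse=True)
    return [chunk for _, chunk in relevant[:top_k]]

def answer_or_fallback(question, hits, generate_fn):
    """Return a fixed message when nothing relevant was retrieved."""
    context = select_context(hits)
    if not context:
        return NO_ANSWER
    return generate_fn(question, context)

def fake_generate(question, context):
    """Stand-in for the LLM call."""
    return f"Answer based on {len(context)} chunk(s)."

weak_hits = [(0.21, "Office hours are 9 to 5.")]
strong_hits = [(0.83, "Returns are accepted within 30 days."), (0.30, "Unrelated.")]
print(answer_or_fallback("What's our return policy?", weak_hits, fake_generate))
print(answer_or_fallback("What's our return policy?", strong_hits, fake_generate))
```

Returning a fixed message skips the LLM call entirely when retrieval comes back empty, which avoids answers that are not grounded in your data.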

Common Mistakes

Mistake: No overlap between chunks → context lost at boundaries
Fix: Use a 50-token overlap

Mistake: Chunks too large (>1500 tokens) → includes irrelevant content
Fix: Use 512-1024 tokens

Mistake: Only using vector search → fails on keywords
Fix: Use hybrid (vector + BM25)

Mistake: Not re-ranking results → suboptimal retrieval
Fix: Re-rank the top 10 down to the top 3

Mistake: Stale embeddings → new documents are missed
Fix: Re-embed when documents change, or index new documents as they arrive


Implementation Checklist

  • Collect your documents
  • Parse documents (PDFs, web, etc.)
  • Split into chunks (512-1024 tokens, 50-token overlap)
  • Generate embeddings (use text-embedding-3-small)
  • Store in vector DB (start with Chroma for prototyping)
  • Build retrieval function (dense + BM25 if possible)
  • Add re-ranking (cross-encoder or LLM)
  • Test on known questions
  • Monitor retrieval quality
  • Iterate on chunk size / retrieval strategy

Example: From Zero to RAG

import chromadb

# Placeholders (not real library calls): load_pdfs, split_into_chunks,
# embed_model, and llm stand in for your own loader, splitter,
# embedding client, and LLM client.

# 1. Load documents
documents = load_pdfs("./docs/")
# 2. Split into chunks
chunks = split_into_chunks(documents, chunk_size=512, overlap=50)
# 3. Embed
embeddings = [embed_model.embed(chunk) for chunk in chunks]
# 4. Store in vector DB (in-memory Chroma collection)
collection = chromadb.Client().create_collection("docs")
collection.add(
    ids=[str(i) for i in range(len(chunks))],
    documents=chunks,
    embeddings=embeddings,
)
# 5. Build retrieval
def retrieve(query, top_k=5):
    query_vector = embed_model.embed(query)
    results = collection.query(query_embeddings=[query_vector], n_results=top_k)
    return results["documents"][0]  # chunk texts for the single query
# 6. Build RAG chain
def answer_question(question):
    context = "\n\n".join(retrieve(question))
    prompt = f"Context:\n{context}\n\nQuestion: {question}"
    return llm.generate(prompt)
# 7. Use it
print(answer_question("What's our return policy?"))

Measuring RAG Quality

Retrieval Metrics:

  • Precision: % of retrieved docs relevant
  • Recall: % of relevant docs retrieved
  • MRR (Mean Reciprocal Rank): How high is first relevant doc?
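The three retrieval metrics are simple to compute once you have a labeled set of questions with known relevant document ids. The sketch below assumes that setup; the function names and the two example runs are made up for illustration.

```python
def precision_at_k(retrieved, relevant, k):
    """Fraction of the top-k retrieved ids that are relevant."""
    top = retrieved[:k]
    if not top:
        return 0.0
    return sum(1 for doc_id in top if doc_id in relevant) / len(top)

def recall_at_k(retrieved, relevant, k):
    """Fraction of all relevant ids that appear in the top-k."""
    if not relevant:
        return 0.0
    return sum(1 for doc_id in retrieved[:k] if doc_id in relevant) / len(relevant)

def mean_reciprocal_rank(runs):
    """Average of 1/rank of the first relevant id per query (0 if none)."""
    total = 0.0
    for retrieved, relevant in runs:
        for rank, doc_id in enumerate(retrieved, start=1):
            if doc_id in relevant:
                total += 1.0 / rank
                break
    return total / len(runs) if runs else 0.0

# Each run: (ranked retrieved ids, set of relevant ids) for one question.
runs = [
    (["d3", "d1", "d7"], {"d1", "d2"}),  # first relevant hit at rank 2
    (["d5", "d9", "d4"], {"d5"}),        # first relevant hit at rank 1
]
p = precision_at_k(*runs[0], k=3)
r = recall_at_k(*runs[0], k=3)
mrr = mean_reciprocal_rank(runs)
print(p, r, mrr)
```

For the first question, one of three retrieved documents is relevant (precision 1/3) and one of two relevant documents was found (recall 1/2); across both questions the MRR is (1/2 + 1) / 2 = 0.75.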

Answer Metrics:

  • Relevance: Does answer address the question?
  • Accuracy: Is answer correct?
  • Groundedness: Is answer based on provided context?

User Metrics:

  • Helpfulness: Did user get what they needed?
  • Satisfaction: Would they use this again?
  • Time to resolution: How quickly did they get answer?

When RAG Isn’t Enough

  • You need reasoning: Add agents/chains (tool use for follow-ups)
  • You need multi-hop questions: Iterative RAG or graph-based retrieval
  • You need structured data: Add SQL / structured query capability
  • You need real-time data: Stream updates or use live APIs
