RAG Architecture

A minimal RAG system has four moving parts: ingest, index, retrieve, and generate.

flowchart LR
A[Source docs<br/>PDFs, web, notes] --> B[Chunk + clean]
B --> C[Embed]
C --> D[(Vector DB)]
U[User query] --> E[Embed query]
E --> D
D --> F[Top-k chunks]
F --> G[Prompt builder]
U --> G
G --> H[LLM]
H --> I[Answer]
style D fill:#1e3a8a,stroke:#60a5fa,color:#fff
style H fill:#166534,stroke:#4ade80,color:#fff
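
The retrieval path in the diagram (embed query → vector DB → top-k chunks) reduces to a nearest-neighbour search. A minimal sketch over toy vectors — in a real system the vectors come from an embedding model and the search happens inside the vector DB, not in Python:

```python
import math

def cosine(a, b):
    # Cosine similarity between two vectors: dot product over norms.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def top_k(query_vec, index, k=2):
    # index: list of (chunk_text, vector) pairs, as a stand-in
    # for whatever the vector DB stores.
    scored = sorted(index, key=lambda item: cosine(query_vec, item[1]), reverse=True)
    return [text for text, _ in scored[:k]]
```

A vector DB does the same ranking with an approximate index (HNSW, IVF) so it scales past a linear scan, but the contract is identical: query vector in, k nearest chunks out.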

What each stage is really doing

  1. Chunk + clean — the unglamorous 80% of the work. Bad chunks = bad answers.
  2. Embed — turns text into vectors. Query and document vectors must come from the same model, and the index dimension must match its output size.
  3. Retrieve — top-k by cosine similarity, optionally re-ranked.
  4. Prompt builder — packs chunks into the context with a system prompt.
  5. LLM — generates the final answer. Cite chunks back to the user.
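
Stage 4 is mostly string assembly: number the chunks so the model (and the user) can trace claims back to their source. A minimal sketch, with the citation instruction as an assumed convention rather than a fixed recipe:

```python
def build_prompt(system, question, chunks):
    # Pack retrieved chunks into the context, numbered [1], [2], ...
    # so the answer can cite them back to the user.
    context = "\n\n".join(f"[{i + 1}] {c}" for i, c in enumerate(chunks))
    return (
        f"{system}\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\n"
        "Answer using only the context above; cite sources as [n]."
    )
```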

Common failure modes

flowchart TB
X[Bad answer] --> A1[Chunks too big?<br/>Retrieval diluted]
X --> A2[Chunks too small?<br/>Lost context]
X --> A3[Wrong embedding model?<br/>Semantic mismatch]
X --> A4[No re-ranker?<br/>Top-k is noisy]
X --> A5[System prompt leaks?<br/>Model ignores retrieved text]
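
The first two failure modes are two ends of one dial: chunk size. Overlap between neighbouring chunks softens the trade-off, because a sentence that straddles a boundary survives intact in at least one chunk. A minimal fixed-size character chunker, a sketch only — production splitters usually respect sentence or section boundaries too:

```python
def chunk(text, size=200, overlap=40):
    # Fixed-size chunks; each one starts `overlap` characters
    # before the previous one ended.
    step = size - overlap
    return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]
```

Tuning `size` and `overlap` against your own eval set is usually the cheapest fix on this whole list.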

Next level

  • Add a re-ranker (e.g. Cohere Rerank, bge-reranker) between retrieval and prompt building.
  • Add query rewriting — use a small model to expand the user’s query before retrieval.
  • Evaluate with Ragas so you can improve it with numbers, not vibes.
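
The plumbing around a re-ranker is just a second sort: score each (query, chunk) pair jointly, keep the best few. A sketch with a stand-in `score_fn` — in practice that callable wraps a cross-encoder such as bge-reranker or a Cohere Rerank API call:

```python
def rerank(query, chunks, score_fn, keep=3):
    # score_fn(query, chunk) -> float; here any callable stands in
    # for a real cross-encoder re-ranker.
    ranked = sorted(chunks, key=lambda c: score_fn(query, c), reverse=True)
    return ranked[:keep]
```

The usual pattern is to over-retrieve (say, top-20 from the vector DB) and let the slower, more accurate re-ranker pick the final handful that reach the prompt.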