
Retrieval-Augmented Generation

Embeddings, vector search math, chunking, retrievers, rerankers, and an end-to-end RAG pipeline.

LLMs have two well-known problems: their knowledge is frozen at training time, and they hallucinate. RAG mitigates both by retrieving relevant documents at query time and placing them in the prompt, so the model answers from evidence rather than from memory alone.


6.1 The core idea

code
Question ──► [Retriever] ──► top-k relevant chunks ──┐
                          [Prompt: context + question] ──► LLM ──► grounded answer

Instead of asking the LLM to recall, you ask it to read and answer. The LLM becomes a reasoning engine over freshly retrieved, authoritative text.

Why it works: retrieval injects up-to-date, domain-specific, or private knowledge the model never saw; citing sources curbs hallucination; you can update the knowledge base without retraining.


6.2 Embeddings — turning text into geometry

A text embedding is a vector $\mathbf{e}\in\mathbb{R}^{d}$ (e.g. $d = 384, 768, 1536$) such that semantically similar texts have nearby vectors. It is produced by an encoder model (from [[05_architectures]]), e.g. sentence-transformers, OpenAI text-embedding-3, BGE, E5.

How a sentence embedding is formed from a Transformer encoder:

  • Mean pooling: average all token vectors of the last layer.
$$\mathbf{e} = \frac{1}{n}\sum_{i=1}^{n}\mathbf{h}_i$$
  • or [CLS] pooling: take the [CLS] token vector.
  • Models are fine-tuned with contrastive learning so paraphrases land close and unrelated texts land far apart (InfoNCE loss):
$$L = -\log \frac{\exp(\text{sim}(\mathbf{q},\mathbf{p}^+)/\tau)}{\sum_{j}\exp(\text{sim}(\mathbf{q},\mathbf{p}_j)/\tau)}$$

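where $\mathbf{p}^+$ is a true match for the query $\mathbf{q}$, the sum over $j$ runs over the positive and the negatives, and $\tau$ is a temperature that controls how sharply the scores are separated.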
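The two formulas above can be checked on toy numbers. The following is a minimal sketch in plain Python, not how any particular library trains its encoder: it assumes `sim` is cosine similarity, uses a made-up temperature of 0.05, and the function names (`mean_pool`, `info_nce`) and vectors are hypothetical.

```python
import math

def mean_pool(token_vectors):
    """Average the last-layer token vectors h_1..h_n into one sentence vector."""
    n = len(token_vectors)
    return [sum(column) / n for column in zip(*token_vectors)]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def info_nce(q, candidates, pos_index, tau=0.05):
    """InfoNCE loss for one query; candidates holds the positive and the negatives."""
    logits = [cosine(q, p) / tau for p in candidates]
    m = max(logits)  # subtract the max before exponentiating for numerical stability
    log_denominator = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_denominator - logits[pos_index]

# Toy token vectors and candidates (made up for illustration).
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
q = mean_pool(tokens)
candidates = [[1.0, 1.0], [1.0, -1.0], [-1.0, 0.0]]

loss_good = info_nce(q, candidates, pos_index=0)  # positive points the same way as q
loss_bad = info_nce(q, candidates, pos_index=1)   # "positive" is orthogonal to q
print(q, loss_good, loss_bad)
```

The loss is close to zero when the positive is the candidate most similar to the query and large when it is not, which is the pressure that pulls paraphrases together during fine-tuning.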


6.3 Similarity & vector search math

Given a query embedding $\mathbf{q}$ and document embeddings $\{\mathbf{d}_i\}$, find the closest.

Cosine similarity (most common)

$$\cos(\mathbf{q},\mathbf{d}) = \frac{\mathbf{q}\cdot\mathbf{d}}{\|\mathbf{q}\|\,\|\mathbf{d}\|} \in [-1, 1]$$

Measures angle, ignores magnitude.

Worked example

$\mathbf{q}=[1,1]$, $\mathbf{d}_1=[2,2]$, $\mathbf{d}_2=[1,-1]$.

  • $\cos(\mathbf{q},\mathbf{d}_1)=\frac{1\cdot2+1\cdot2}{\sqrt2\cdot\sqrt8}=\frac{4}{4}=1.0$ (identical direction → most similar).
  • $\cos(\mathbf{q},\mathbf{d}_2)=\frac{1-1}{\sqrt2\cdot\sqrt2}=0$ (orthogonal → unrelated).

So $\mathbf{d}_1$ is retrieved.
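The worked example can be reproduced in a few lines. This is a small sketch using only the standard library; the helper name `cosine_similarity` is hypothetical.

```python
import math

def cosine_similarity(a, b):
    """Dot product divided by the product of the two vector lengths."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

q = [1, 1]
docs = {"d1": [2, 2], "d2": [1, -1]}

scores = {name: cosine_similarity(q, d) for name, d in docs.items()}
best = max(scores, key=scores.get)  # name of the document with the highest score
print(scores, best)
```

The score for `d1` comes out at 1.0 up to floating-point rounding and the score for `d2` is 0, so `d1` is the one retrieved.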

Approximate Nearest Neighbor (ANN)

Exact search is $O(N\cdot d)$ per query, too slow for millions of vectors. ANN indexes trade a little accuracy for huge speed:

  • HNSW (Hierarchical Navigable Small World): a multi-layer proximity graph; a greedy walk reaches approximate nearest neighbors in roughly $O(\log N)$ steps. Used by many vector DBs.
  • IVF (Inverted File): cluster vectors (k-means) into cells; search only the nearest few cells.
  • PQ (Product Quantization): compress vectors into compact codes for memory savings.

Vector databases and libraries: FAISS (a similarity-search library), Chroma, Pinecone, Weaviate, Qdrant, Milvus, pgvector (a PostgreSQL extension).
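To make the IVF idea concrete, here is a toy sketch in NumPy. It is not how FAISS or any vector DB implements it: it runs a few plain k-means iterations, uses squared Euclidean distance (which ranks the same as cosine only when vectors are normalized), and the names `build_ivf`, `ivf_search`, and `nprobe` as used here are illustrative assumptions. The data is random.

```python
import numpy as np

def build_ivf(vectors, n_cells, n_iters=10, seed=0):
    """Cluster vectors with a few k-means iterations; return centroids and cell member lists."""
    rng = np.random.default_rng(seed)
    centroids = vectors[rng.choice(len(vectors), size=n_cells, replace=False)].copy()
    for _ in range(n_iters):
        # Assign every vector to its nearest centroid.
        dists = ((vectors[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        assign = dists.argmin(axis=1)
        for c in range(n_cells):
            members = vectors[assign == c]
            if len(members):  # keep the old centroid if a cell ends up empty
                centroids[c] = members.mean(axis=0)
    # Final assignment against the final centroids.
    dists = ((vectors[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    assign = dists.argmin(axis=1)
    cells = [np.flatnonzero(assign == c) for c in range(n_cells)]
    return centroids, cells

def ivf_search(query, vectors, centroids, cells, k=3, nprobe=2):
    """Scan only the nprobe cells whose centroids are closest to the query."""
    cell_order = ((centroids - query) ** 2).sum(axis=1).argsort()[:nprobe]
    candidates = np.concatenate([cells[c] for c in cell_order])
    dists = ((vectors[candidates] - query) ** 2).sum(axis=1)
    return candidates[dists.argsort()[:k]]

rng = np.random.default_rng(42)
data = rng.normal(size=(1000, 16)).astype(np.float32)
query = rng.normal(size=16).astype(np.float32)
centroids, cells = build_ivf(data, n_cells=8)

exact = ((data - query) ** 2).sum(axis=1).argsort()[:3]                   # brute force
approx = ivf_search(query, data, centroids, cells, k=3, nprobe=2)  # scans 2 of 8 cells
full = ivf_search(query, data, centroids, cells, k=3, nprobe=8)    # scans every cell
print(exact, approx, full)
```

Probing every cell recovers the exact result; probing fewer cells scans fewer vectors but may miss a true neighbor that sits in an unprobed cell. That is the accuracy-for-speed trade.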

6.4 The full RAG pipeline (indexing + querying)

Phase A — Indexing (offline, once)

  1. Load documents (PDF, HTML, DB, etc.).
  2. Chunk into passages (see §6.5).
  3. Embed each chunk with the encoder model.
  4. Store vectors + original text + metadata in a vector DB.

Phase B — Querying (online, per question)

  1. Embed the query with the same model.
  2. Retrieve top-k nearest chunks (ANN search).
  3. (optional) Rerank candidates (§6.6).
  4. Assemble prompt: system instructions + retrieved context + question.
  5. Generate with the LLM, instructed to answer only from context and cite sources.
code
DOCS → split → embed → ┌──────────────┐
                       │  Vector DB   │
QUERY → embed ───────► │  (ANN search)│ → top-k chunks → LLM prompt → answer
                       └──────────────┘

6.5 Chunking — the most underrated step

Bad chunking ruins RAG. Considerations:

  • Size: too big → diluted relevance & wasted context tokens; too small → fragments lose meaning. Typical 200–500 tokens.
  • Overlap: 10–20% overlap between consecutive chunks so ideas spanning a boundary aren't split.
  • Semantic/structure-aware splitting: split on headings, paragraphs, sentences, or code blocks rather than blind character counts.
  • Metadata: attach source, title, page, section, timestamp → enables filtering and citations.
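To see what size, overlap, and metadata mean mechanically, here is a sketch of a fixed-size sliding-window splitter. It counts whitespace-separated words as a rough stand-in for tokens (an assumption; real token counts depend on the tokenizer), and the function name and metadata fields are hypothetical.

```python
def chunk_words(text, chunk_size=200, overlap=30, source="unknown"):
    """Split text into overlapping word windows and attach simple metadata."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    words = text.split()
    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(words), step):
        window = words[start:start + chunk_size]
        chunks.append({
            "text": " ".join(window),
            "source": source,        # lets the generator cite where the chunk came from
            "start_word": start,     # position in the original document
        })
        if start + chunk_size >= len(words):  # this window already reached the end
            break
    return chunks

sample = " ".join(f"w{i}" for i in range(10))
chunks = chunk_words(sample, chunk_size=4, overlap=1, source="demo.txt")
for c in chunks:
    print(c["start_word"], c["text"])
```

Each chunk repeats the last word of the previous one, so a sentence that straddles a boundary appears whole in at least one chunk when the overlap is large enough. A structure-aware splitter would replace the fixed window with breaks at headings or paragraphs.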

Example (recursive character splitting):

python
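from langchain_text_splitters import RecursiveCharacterTextSplitter

# document_text is assumed to hold the raw document as one string.
splitter = RecursiveCharacterTextSplitter(
    chunk_size=500, chunk_overlap=80,           # counted in characters by default, not tokens
    separators=["\n\n", "\n", ". ", " ", ""])   # try paragraph, then line, then sentence...
chunks = splitter.split_text(document_text)     # list of chunk strings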

6.6 Improving retrieval quality

  • Hybrid search: combine dense (embedding) search with sparse keyword search (BM25) — dense captures meaning, sparse captures exact terms (names, codes, acronyms). Fuse with Reciprocal Rank Fusion:
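$$\text{RRF}(d) = \sum_{r\in\text{retrievers}} \frac{1}{k + \text{rank}_r(d)}$$
    Here $k$ is a smoothing constant (commonly 60), not the top-k cutoff, and $\text{rank}_r(d)$ is the 1-based position of $d$ in retriever $r$'s list.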
  • Reranking: retrieve ~50 candidates with the cheap bi-encoder, then rescore with a cross-encoder (the query+doc pair fed jointly through a Transformer → far more accurate relevance), keep top 5. Cross-encoders are slow → only used on the shortlist.
  • Query transformation:
    • Multi-query: have the LLM rephrase the question several ways, retrieve for each, union results.
    • HyDE (Hypothetical Document Embeddings): ask the LLM to draft a hypothetical answer, embed that, and search with it; the draft is often closer in embedding space to real docs than the bare question.
    • Step-back prompting: ask a more general question first.
  • Contextual / parent-document retrieval: embed small chunks for precision but return their larger parent chunk for context.
  • Metadata filtering: restrict search (e.g. date > 2024, dept = finance) before/with vector search.
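Reciprocal Rank Fusion from the hybrid-search bullet is short enough to write out. This is a minimal sketch assuming each retriever returns an ordered list of document ids and that ranks start at 1; the value 60 for the constant is a common default, and the ids below are made up.

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of doc ids into one list sorted by RRF score."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)  # documents missing from a list add nothing
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

dense = ["doc_a", "doc_b", "doc_c"]    # hypothetical embedding-search results
sparse = ["doc_c", "doc_a", "doc_d"]   # hypothetical BM25 results

fused = reciprocal_rank_fusion([dense, sparse])
for doc_id, score in fused:
    print(doc_id, round(score, 5))
```

Because only ranks are used, the dense and sparse scores never need to be put on a common scale. Documents that both retrievers rank highly (`doc_a`, `doc_c`) rise above those found by only one.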

6.7 The generation prompt (grounding & citations)

A good RAG prompt is explicit about using only the context:

code
You are a helpful assistant. Answer the QUESTION using ONLY the CONTEXT below.
If the answer is not in the context, say "I don't have enough information."
Cite sources as [1], [2] matching the context blocks.

CONTEXT:
[1] {chunk_1}  (source: {meta_1})
[2] {chunk_2}  (source: {meta_2})

QUESTION: {question}
ANSWER:

This reduces hallucination and produces traceable answers.
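Filling that template is plain string formatting. The sketch below assumes the retriever hands back (text, source) pairs already ordered by relevance; the function name `build_prompt` and the sample chunks are hypothetical.

```python
PROMPT_TEMPLATE = """You are a helpful assistant. Answer the QUESTION using ONLY the CONTEXT below.
If the answer is not in the context, say "I don't have enough information."
Cite sources as [1], [2] matching the context blocks.

CONTEXT:
{context}

QUESTION: {question}
ANSWER:"""

def build_prompt(question, chunks):
    """chunks: list of (text, source) pairs, already ordered by relevance."""
    # Number the blocks from 1 so the model's [n] citations map back to sources.
    context = "\n".join(
        f"[{i}] {text}  (source: {source})"
        for i, (text, source) in enumerate(chunks, start=1)
    )
    return PROMPT_TEMPLATE.format(context=context, question=question)

prompt = build_prompt(
    "What is the refund policy?",
    [("Refunds are issued within 30 days.", "policy.pdf, p. 2"),   # made-up chunk
     ("Shipping fees are non-refundable.", "faq.html")],           # made-up chunk
)
print(prompt)
```

Keeping the block numbers and source metadata together in the prompt is what makes a citation like [1] traceable to a document after generation.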


6.8 End-to-end RAG code (LangChain-style, minimal)

python
from langchain_community.vectorstores import FAISS
from langchain_openai import OpenAIEmbeddings, ChatOpenAI
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.runnables import RunnablePassthrough
from langchain_core.output_parsers import StrOutputParser

# ---- Phase A: index ----
# raw_text is assumed to hold the source document as one string.
splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=80)  # sizes in characters
docs = splitter.create_documents([raw_text])
vectordb = FAISS.from_documents(docs, OpenAIEmbeddings(model="text-embedding-3-small"))
retriever = vectordb.as_retriever(search_kwargs={"k": 4})

# ---- Phase B: query ----
prompt = ChatPromptTemplate.from_template(
    "Answer using ONLY the context. If unknown, say so.\n\n"
    "Context:\n{context}\n\nQuestion: {question}")
llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)

def format_docs(docs):
    # Join the retrieved chunks into one context string for the prompt.
    return "\n\n".join(d.page_content for d in docs)

rag_chain = (
    {"context": retriever | format_docs, "question": RunnablePassthrough()}
    | prompt | llm | StrOutputParser()
)
print(rag_chain.invoke("What is the refund policy?"))

(The | pipe operator is LCEL, the LangChain Expression Language, explained in [[07_langchain]].)


6.9 Pure-Python retrieval (no framework, to see the mechanics)

python
import numpy as np

def embed(texts):
    """Placeholder: call your embedding model and return an (N, d) float array."""
    ...

corpus = [...]                              # list of chunk strings

# Normalize once at index time so a dot product equals cosine similarity.
E = embed(corpus)
E = E / np.linalg.norm(E, axis=1, keepdims=True)

def retrieve(query, k=4):
    q = embed([query])[0]
    q = q / np.linalg.norm(q)
    sims = E @ q                            # cosine sims (since normalized), shape (N,)
    top = np.argsort(sims)[::-1][:k]        # indices of the k highest scores
    return [(corpus[i], float(sims[i])) for i in top]

6.10 Evaluation & pitfalls

Evaluate retrieval and generation separately:

  • Retrieval: recall@k, precision@k, MRR (Mean Reciprocal Rank), NDCG.
  • Generation: faithfulness (is every claim supported by context?), answer relevance, context precision/recall. Tools: RAGAS, TruLens, LLM-as-judge.
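The retrieval metrics are simple to compute once you have, for each test query, the retrieved ids and a hand-labeled set of relevant ids. A minimal sketch with made-up ids (the function names are hypothetical, and precision@k here divides by k):

```python
def recall_at_k(retrieved, relevant, k):
    """Fraction of the relevant ids that appear in the top-k results."""
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & set(relevant)) / len(relevant)

def precision_at_k(retrieved, relevant, k):
    """Fraction of the top-k results that are relevant."""
    return sum(1 for doc_id in retrieved[:k] if doc_id in relevant) / k

def reciprocal_rank(retrieved, relevant):
    """1 / rank of the first relevant result, or 0 if none was retrieved."""
    for rank, doc_id in enumerate(retrieved, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0

def mean_reciprocal_rank(runs):
    """runs: list of (retrieved, relevant) pairs, one per query."""
    return sum(reciprocal_rank(r, rel) for r, rel in runs) / len(runs)

runs = [
    (["c1", "c7", "c3"], {"c3", "c9"}),   # first relevant hit at rank 3
    (["c2", "c4", "c5"], {"c2"}),         # first relevant hit at rank 1
]
r_at_3 = recall_at_k(*runs[0], k=3)
p_at_3 = precision_at_k(*runs[0], k=3)
mrr = mean_reciprocal_rank(runs)
print(r_at_3, p_at_3, mrr)
```

Measuring these before looking at generated answers tells you whether a bad answer comes from the retriever (the right chunk never arrived) or from the generator (it arrived and was ignored).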

Common failure modes:

  • Embedding/query model mismatch — query and docs must use the same embedding model.
  • Chunk too big/small: tune chunk size and overlap against your own evaluation set; it is often one of the biggest quality levers.
  • "Lost in the middle" — LLMs attend less to the middle of long contexts; put the most relevant chunks first/last.
  • No relevant docs retrieved → garbage in, garbage out. Add hybrid search + reranking.
  • Stale index — re-embed when documents or the embedding model change.
  • Context overflow — too many/long chunks exceed the context window; cap k and chunk size.
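Two of these pitfalls, "lost in the middle" and context overflow, can be handled at prompt-assembly time. The sketch below is one simple way to do it, not a standard algorithm: it caps the context by a word budget (a rough proxy for tokens, which is an assumption) and then places the strongest chunks at the two ends. Function names and sample data are hypothetical.

```python
def cap_context(chunks_by_relevance, max_words):
    """Keep the most relevant chunks until the word budget would be exceeded."""
    kept, used = [], 0
    for chunk in chunks_by_relevance:
        n_words = len(chunk.split())
        if used + n_words > max_words:
            break
        kept.append(chunk)
        used += n_words
    return kept

def reorder_for_long_context(chunks_by_relevance):
    """Place the strongest chunks at the start and end, the weakest in the middle."""
    front, back = [], []
    for i, chunk in enumerate(chunks_by_relevance):
        (front if i % 2 == 0 else back).append(chunk)
    return front + back[::-1]

ranked = ["r1", "r2", "r3", "r4", "r5"]          # r1 is the most relevant
ordered = reorder_for_long_context(ranked)
capped = cap_context(["a b c", "d e", "f g h i"], max_words=5)
print(ordered, capped)
```

With five ranked chunks the order becomes r1, r3, r5, r4, r2: the top two sit at the edges and the least relevant one lands in the middle, where attention tends to be weakest.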

RAG is the simplest "agentic" capability: the model uses a tool (search) to augment itself. Generalizing tool use → [[09_agentic_ai]]. The orchestration frameworks come first: [[07_langchain]] and [[08_langgraph]].

Next: [[07_langchain]].