LLMs have two problems: their knowledge is frozen at training time, and they hallucinate. Retrieval-Augmented Generation (RAG) mitigates both by retrieving relevant documents at query time and feeding them into the prompt, so the model answers from evidence rather than memory.
6.1 The core idea
```
Question ──► [Retriever] ──► top-k relevant chunks ──┐
                                                      ▼
              [Prompt: context + question] ──► LLM ──► grounded answer
```
Instead of asking the LLM to recall, you ask it to read and answer. The LLM becomes a reasoning engine over freshly retrieved, authoritative text.
Why it works: retrieval injects up-to-date, domain-specific, or private knowledge the model never saw; citing sources curbs hallucination; you can update the knowledge base without retraining.
6.2 Embeddings — turning text into geometry
A text embedding is a vector $\mathbf{v} \in \mathbb{R}^d$ (e.g. $d = 768$) such that semantically similar texts have nearby vectors. It is produced by an encoder model (from [[05_architectures]]), e.g. sentence-transformers, OpenAI text-embedding-3, BGE, E5.
How a sentence embedding is formed from a Transformer encoder:
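- Mean pooling: average all token vectors of the last layer,
- or `[CLS]` pooling: take the `[CLS]` token vector.
- Models are fine-tuned with contrastive learning so paraphrases land close and unrelated texts land far apart (InfoNCE loss):

$$\mathcal{L} = -\log \frac{\exp\left(\mathrm{sim}(\mathbf{q}, \mathbf{d}^{+}) / \tau\right)}{\sum_{j} \exp\left(\mathrm{sim}(\mathbf{q}, \mathbf{d}_j) / \tau\right)}$$

where $(\mathbf{q}, \mathbf{d}^{+})$ is a true match, $\tau$ is a temperature, and the denominator sums over $\mathbf{d}^{+}$ and the negatives.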
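The pooling and contrastive-loss steps above can be sketched in NumPy. This is a minimal illustration, not any particular library's implementation: the function names `mean_pool` and `info_nce`, the padding mask argument, the use of in-batch negatives, and the temperature value 0.05 are all assumptions made for the example.

```python
import numpy as np

def mean_pool(token_vecs, attention_mask):
    """Average last-layer token vectors, ignoring padding positions.

    token_vecs: (B, T, d) array; attention_mask: (B, T) array of 0/1.
    """
    mask = attention_mask[:, :, None].astype(float)   # (B, T, 1)
    summed = (token_vecs * mask).sum(axis=1)          # (B, d)
    counts = np.clip(mask.sum(axis=1), 1e-9, None)    # (B, 1), avoids divide-by-zero
    return summed / counts

def info_nce(query_vecs, doc_vecs, temperature=0.05):
    """In-batch InfoNCE: row i of doc_vecs is the true match for query i;
    every other row in the batch acts as a negative (an assumed setup)."""
    q = query_vecs / np.linalg.norm(query_vecs, axis=1, keepdims=True)
    d = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
    logits = (q @ d.T) / temperature                  # (B, B) cosine sims / tau
    logits -= logits.max(axis=1, keepdims=True)       # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_probs)))        # true matches lie on the diagonal
```

The loss is small when each query is closest to its own document and grows when the matches are misaligned, which is the pressure that pulls paraphrases together during fine-tuning.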
6.3 Similarity & vector search math
Given a query embedding $\mathbf{q}$ and document embeddings $\mathbf{d}_1, \dots, \mathbf{d}_N$, find the closest.
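Cosine similarity (most common)

$$\cos(\mathbf{q}, \mathbf{d}) = \frac{\mathbf{q} \cdot \mathbf{d}}{\lVert \mathbf{q} \rVert \, \lVert \mathbf{d} \rVert}$$

Measures angle, ignores magnitude.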
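Worked example
Take, for illustration, $\mathbf{q} = (1, 0)$, $\mathbf{d}_1 = (2, 0)$, $\mathbf{d}_2 = (0, 3)$.
- $\cos(\mathbf{q}, \mathbf{d}_1) = \frac{2}{1 \cdot 2} = 1$ (identical direction → most similar).
- $\cos(\mathbf{q}, \mathbf{d}_2) = \frac{0}{1 \cdot 3} = 0$ (orthogonal → unrelated).

So $\mathbf{d}_1$ is retrieved.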
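The same kind of calculation can be checked in a few lines of NumPy. The vectors below are illustrative values (an assumption of this sketch), chosen so that one document is parallel to the query and the other is orthogonal to it.

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity: dot product divided by the product of the norms."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

q = np.array([1.0, 0.0])    # query embedding (illustrative)
d1 = np.array([2.0, 0.0])   # same direction as q, different length
d2 = np.array([0.0, 3.0])   # orthogonal to q

scores = {"d1": cosine(q, d1), "d2": cosine(q, d2)}
best = max(scores, key=scores.get)   # name of the highest-scoring document
```

Note that `d1` is twice as long as `q` yet still scores as a perfect match, which is what "ignores magnitude" means in practice.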
Approximate Nearest Neighbor (ANN)
Exact search is $O(N \cdot d)$ per query for $N$ vectors of dimension $d$, which is too slow for millions of vectors. ANN indexes trade a little accuracy for huge speed:
- HNSW (Hierarchical Navigable Small World): a multi-layer proximity graph; greedy-walk to nearest neighbors in roughly $O(\log N)$. Used by most vector DBs.
- IVF (Inverted File): cluster vectors (k-means) into cells; search only the nearest few cells.
- PQ (Product Quantization): compress vectors into short codes for memory savings.

Vector databases and libraries: FAISS, Chroma, Pinecone, Weaviate, Qdrant, Milvus, pgvector.
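To make the IVF idea concrete, here is a toy sketch in NumPy. It is not how FAISS or any vector database implements IVF; the names `build_ivf` and `ivf_search`, the few rounds of plain k-means, and the parameter `n_probe` (how many cells to scan) are assumptions for illustration. It ranks by squared Euclidean distance, which gives the same ordering as cosine similarity when all vectors are unit-normalized.

```python
import numpy as np

def _nearest_centroid(vectors, centroids):
    """Index of the closest centroid for every vector (squared L2 distance)."""
    dists = ((vectors[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)
    return np.argmin(dists, axis=1)

def build_ivf(vectors, n_cells, n_iters=10, seed=0):
    """Cluster vectors with a few k-means rounds; return centroids and cell lists."""
    rng = np.random.default_rng(seed)
    centroids = vectors[rng.choice(len(vectors), n_cells, replace=False)].copy()
    for _ in range(n_iters):
        assign = _nearest_centroid(vectors, centroids)
        for c in range(n_cells):
            members = vectors[assign == c]
            if len(members):                      # leave empty cells unchanged
                centroids[c] = members.mean(axis=0)
    assign = _nearest_centroid(vectors, centroids)
    cells = [np.flatnonzero(assign == c) for c in range(n_cells)]
    return centroids, cells

def ivf_search(query, vectors, centroids, cells, k=3, n_probe=2):
    """Scan only the n_probe cells whose centroids are nearest to the query."""
    nearest_cells = np.argsort(((centroids - query) ** 2).sum(-1))[:n_probe]
    candidates = np.concatenate([cells[c] for c in nearest_cells])
    dists = ((vectors[candidates] - query) ** 2).sum(-1)
    return candidates[np.argsort(dists)[:k]]      # indices into `vectors`
```

Raising `n_probe` scans more cells, so recall improves at the cost of speed; probing every cell reduces to exact search, which is the accuracy/speed trade-off described above.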
6.4 The full RAG pipeline (indexing + querying)
Phase A — Indexing (offline, once)
- Load documents (PDF, HTML, DB, etc.).
- Chunk into passages (see §6.5).
- Embed each chunk with the encoder model.
- Store vectors + original text + metadata in a vector DB.
Phase B — Querying (online, per question)
- Embed the query with the same model.
- Retrieve top-k nearest chunks (ANN search).
- (optional) Rerank candidates (§6.6).
- Assemble prompt: system instructions + retrieved context + question.
- Generate with the LLM, instructed to answer only from context and cite sources.
```
DOCS  → split → embed → ┌──────────────┐
                        │  Vector DB   │
QUERY → embed ────────► │ (ANN search) │ → top-k chunks → LLM prompt → answer
                        └──────────────┘
```
6.5 Chunking — the most underrated step
Bad chunking ruins RAG. Considerations:
- Size: too big → diluted relevance & wasted context tokens; too small → fragments lose meaning. Typical 200–500 tokens.
- Overlap: 10–20% overlap between consecutive chunks so ideas spanning a boundary aren't split.
- Semantic/structure-aware splitting: split on headings, paragraphs, sentences, or code blocks rather than blind character counts.
- Metadata: attach source, title, page, section, timestamp → enables filtering and citations.
Example (recursive character splitting):
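```python
from langchain_text_splitters import RecursiveCharacterTextSplitter

# Note: chunk_size and chunk_overlap are measured in characters by default, not tokens.
splitter = RecursiveCharacterTextSplitter(
    chunk_size=500,
    chunk_overlap=80,
    separators=["\n\n", "\n", ". ", " ", ""],  # try paragraph, then line, then sentence...
)
# document_text: the raw document as a single string (assumed to be loaded already)
chunks = splitter.split_text(document_text)
```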
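To see size, overlap, and metadata working together without a framework, here is a minimal sliding-window sketch. The function name `chunk_words`, the metadata keys, and the use of whitespace-separated words as a rough stand-in for tokens are all assumptions; a real pipeline would count tokens with the model's tokenizer and prefer structure-aware boundaries.

```python
def chunk_words(text, chunk_size=200, overlap=30, source="unknown"):
    """Split text into overlapping word windows and attach simple metadata."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    words = text.split()
    step = chunk_size - overlap            # how far the window advances each time
    chunks = []
    for start in range(0, len(words), step):
        window = words[start:start + chunk_size]
        chunks.append({
            "text": " ".join(window),
            "source": source,              # enables filtering and citations later
            "start_word": start,           # position of the chunk in the document
        })
        if start + chunk_size >= len(words):
            break                          # the last window already reached the end
    return chunks
```

With `chunk_size=4` and `overlap=1`, consecutive chunks share exactly one word, so a phrase that straddles a boundary still appears whole in at least one chunk when it is shorter than the overlap.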
6.6 Improving retrieval quality
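- Hybrid search: combine dense (embedding) search with sparse keyword search (BM25). Dense captures meaning; sparse captures exact terms (names, codes, acronyms). Fuse with Reciprocal Rank Fusion:

  $$\mathrm{RRF}(d) = \sum_{r \in R} \frac{1}{k + \mathrm{rank}_r(d)}$$

  where $R$ is the set of rankings being fused, $\mathrm{rank}_r(d)$ is the 1-based position of document $d$ in ranking $r$, and $k$ is a smoothing constant (commonly 60).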
- Reranking: retrieve ~50 candidates with the cheap bi-encoder, then rescore with a cross-encoder (the query+doc pair fed jointly through a Transformer → far more accurate relevance), keep top 5. Cross-encoders are slow → only used on the shortlist.
- Query transformation:
  - Multi-query: have the LLM rephrase the question several ways, retrieve for each, union results.
  - HyDE (Hypothetical Document Embeddings): ask the LLM to draft a fake answer, embed that, and search; the draft is often closer to real docs than the bare question.
  - Step-back prompting: ask a more general question first.
- Contextual / parent-document retrieval: embed small chunks for precision but return their larger parent chunk for context.
- Metadata filtering: restrict search (e.g. `date > 2024`, `dept = finance`) before/with vector search.
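Reciprocal Rank Fusion from the hybrid-search item is simple enough to sketch directly: each document scores the sum of 1 / (k + rank) over every ranking it appears in. The function name, the document IDs, and the two example rankings below are hypothetical; `k=60` is a commonly used default rather than a required value.

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of doc IDs into one list, best first."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):   # ranks are 1-based
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)

dense = ["doc_a", "doc_b", "doc_c"]    # hypothetical embedding-search results
sparse = ["doc_b", "doc_d", "doc_a"]   # hypothetical BM25 results
fused = reciprocal_rank_fusion([dense, sparse])
```

Because only ranks are used, the dense and sparse scores never need to be put on a common scale; a document that appears near the top of both lists (here `doc_b`) rises above one that tops only a single list.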
6.7 The generation prompt (grounding & citations)
A good RAG prompt is explicit about using only the context:
```
You are a helpful assistant. Answer the QUESTION using ONLY the CONTEXT below.
If the answer is not in the context, say "I don't have enough information."
Cite sources as [1], [2] matching the context blocks.

CONTEXT:
[1] {chunk_1} (source: {meta_1})
[2] {chunk_2} (source: {meta_2})

QUESTION: {question}
ANSWER:
```
This reduces hallucination and produces traceable answers.
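Filling that template programmatically might look like the sketch below. The helper name `build_rag_prompt` and the `(text, source)` pair format for retrieved chunks are assumptions for illustration; the wording mirrors the prompt in this section.

```python
def build_rag_prompt(question, chunks):
    """Assemble a grounded prompt.

    chunks: list of (text, source) pairs, already ordered by relevance.
    """
    context = "\n".join(
        f"[{i}] {text} (source: {source})"          # numbered so the model can cite [i]
        for i, (text, source) in enumerate(chunks, start=1)
    )
    return (
        "You are a helpful assistant. Answer the QUESTION using ONLY the CONTEXT below.\n"
        'If the answer is not in the context, say "I don\'t have enough information."\n'
        "Cite sources as [1], [2] matching the context blocks.\n\n"
        f"CONTEXT:\n{context}\n\n"
        f"QUESTION: {question}\n"
        "ANSWER:"
    )
```

Numbering the context blocks in code keeps the citation markers in the answer aligned with the stored metadata, so each `[i]` can be mapped back to its source.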
6.8 End-to-end RAG code (LangChain-style, minimal)
```python
from langchain_community.vectorstores import FAISS
from langchain_openai import OpenAIEmbeddings, ChatOpenAI
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.runnables import RunnablePassthrough
from langchain_core.output_parsers import StrOutputParser

# ---- Phase A: index ----
splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=80)
docs = splitter.create_documents([raw_text])
vectordb = FAISS.from_documents(docs, OpenAIEmbeddings(model="text-embedding-3-small"))
retriever = vectordb.as_retriever(search_kwargs={"k": 4})

# ---- Phase B: query ----
prompt = ChatPromptTemplate.from_template(
    "Answer using ONLY the context. If unknown, say so.\n\n"
    "Context:\n{context}\n\nQuestion: {question}")
llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)

def format_docs(docs):
    return "\n\n".join(d.page_content for d in docs)

rag_chain = (
    {"context": retriever | format_docs, "question": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)
print(rag_chain.invoke("What is the refund policy?"))
```
(The `|` pipe operator is LCEL, the LangChain Expression Language, explained in [[07_langchain]].)
6.9 Pure-Python retrieval (no framework, to see the mechanics)
```python
import numpy as np

def embed(texts):
    ...  # call your embedding model → (N, d) array

corpus = [...]  # list of chunk strings
E = embed(corpus)
E /= np.linalg.norm(E, axis=1, keepdims=True)  # normalize rows to unit length

def retrieve(query, k=4):
    q = embed([query])[0]
    q /= np.linalg.norm(q)
    sims = E @ q                     # cosine sims (since normalized), shape (N,)
    top = sims.argsort()[::-1][:k]   # indices of k highest
    return [(corpus[i], float(sims[i])) for i in top]
```
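To run the mechanics end to end without an embedding service, a toy stand-in can replace the model. The hashed bag-of-words `toy_embed` below is purely an assumption for demonstration (it captures word overlap, not meaning), and the three-sentence corpus is invented for the example.

```python
import zlib
import numpy as np

def toy_embed(texts, dim=256):
    """Hashed bag-of-words: each word increments one of `dim` buckets.
    A stand-in for a real embedding model, for demonstration only."""
    out = np.zeros((len(texts), dim))
    for row, text in enumerate(texts):
        for word in text.lower().split():
            out[row, zlib.crc32(word.encode()) % dim] += 1.0   # stable across runs
    return out

corpus = [
    "refunds are issued within 30 days of purchase",
    "shipping takes five business days",
    "our office is closed on public holidays",
]
E = toy_embed(corpus)
E /= np.linalg.norm(E, axis=1, keepdims=True)   # unit-normalize each row

def retrieve(query, k=2):
    q = toy_embed([query])[0]
    q /= np.linalg.norm(q)
    sims = E @ q                                # cosine sims, shape (N,)
    top = np.argsort(sims)[::-1][:k]            # indices of the k highest
    return [(corpus[i], float(sims[i])) for i in top]

results = retrieve("when are refunds issued")
```

Swapping `toy_embed` for a real encoder leaves `retrieve` unchanged, which is the point: retrieval is a normalized matrix-vector product followed by a top-k selection.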
6.10 Evaluation & pitfalls
Evaluate retrieval and generation separately:
- Retrieval: recall@k, precision@k, MRR (Mean Reciprocal Rank), NDCG.
- Generation: faithfulness (is every claim supported by context?), answer relevance, context precision/recall. Tools: RAGAS, TruLens, LLM-as-judge.
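The retrieval metrics are easy to compute by hand once you have, per query, the ranked IDs returned and the set of IDs judged relevant. The sketch below uses hypothetical function names and chunk IDs; it covers recall@k, precision@k, and MRR, and leaves NDCG and the generation metrics to dedicated tools.

```python
def recall_at_k(retrieved, relevant, k):
    """Fraction of the relevant items that appear in the top k."""
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & set(relevant)) / len(relevant)

def precision_at_k(retrieved, relevant, k):
    """Fraction of the top k positions that hold a relevant item."""
    return sum(doc_id in relevant for doc_id in retrieved[:k]) / k

def reciprocal_rank(retrieved, relevant):
    """1 / rank of the first relevant item, or 0.0 if none was retrieved."""
    for rank, doc_id in enumerate(retrieved, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0

def mean_reciprocal_rank(runs):
    """runs: list of (retrieved, relevant) pairs, one per query."""
    return sum(reciprocal_rank(r, rel) for r, rel in runs) / len(runs)

retrieved = ["c1", "c7", "c3", "c9"]   # hypothetical ranked results for one query
relevant = {"c3", "c5"}                # hypothetical ground-truth labels
```

For this query the first relevant chunk sits at rank 3 and only one of the two relevant chunks was found, so a reranker or a larger candidate pool would be the natural next experiment.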
Common failure modes:
- Embedding/query model mismatch — query and docs must use the same embedding model.
- Chunk too big/small — tune it; it's the #1 quality lever.
- "Lost in the middle" — LLMs attend less to the middle of long contexts; put the most relevant chunks first/last.
- No relevant docs retrieved → garbage in, garbage out. Add hybrid search + reranking.
- Stale index — re-embed when documents or the embedding model change.
- Context overflow — too many/long chunks exceed the context window; cap k and chunk size.
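Two of these mitigations, placing the strongest chunks at the edges of the context and capping the total context size, can be sketched as small helpers. The function names are hypothetical, and counting whitespace-separated words is an assumed stand-in for a real token count.

```python
def reorder_for_long_context(chunks_by_relevance):
    """Put the strongest chunks at the start and end, the weakest in the middle.

    Input is ordered best-first; even positions fill the front, odd positions
    fill the back in reverse, so rank 1 is first and rank 2 is last.
    """
    front, back = [], []
    for i, chunk in enumerate(chunks_by_relevance):
        (front if i % 2 == 0 else back).append(chunk)
    return front + back[::-1]

def cap_to_budget(chunks_by_relevance, max_words):
    """Keep the best-ranked chunks until the word budget would be exceeded."""
    kept, used = [], 0
    for chunk in chunks_by_relevance:
        n_words = len(chunk.split())
        if used + n_words > max_words:
            break
        kept.append(chunk)
        used += n_words
    return kept
```

Applying `cap_to_budget` first and `reorder_for_long_context` second keeps the prompt within the window while leaving the least relevant survivors in the middle, where they are least likely to be attended to.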
RAG is the simplest "agentic" capability: the model uses a tool (search) to augment itself. Generalizing tool use → [[09_agentic_ai]]. The orchestration frameworks come first: [[07_langchain]] and [[08_langgraph]].
Next: [[07_langchain]].