$ cat ./blog/rag-from-scratch.md  ·  March 2026  ·  ~15 min read

Building RAG From Scratch
— A Complete Technical Deep-Dive


Large language models are trained on static snapshots of the internet. They cannot know about documents you wrote last week, the internal policy PDF sitting on your server, or a research paper published after their training cutoff. They also hallucinate — they fill knowledge gaps with plausible-sounding but fabricated text.

Retrieval-Augmented Generation (RAG) solves both problems. Instead of asking the model to recall facts from its weights, you store your documents in a searchable index, retrieve the most relevant passages when a question arrives, stuff those passages into the prompt as context, and ask the model to answer only from that context. The model becomes a reading comprehension engine over a knowledge base you control.

I wanted to understand every layer of this — not by calling langchain.RetrievalQA(), but by deriving each algorithm from scratch and implementing it by hand. No frameworks, no abstractions. Nine components, each in its own file, each with its own math.

ingestion · chunker · embeddings · vectorstore · query embeddings · retriever (hybrid) · reranker · generator · bm25

// what is RAG and why?

The pipeline has two phases. Indexing happens once: load documents, split into chunks, embed into vectors, store in the vector database. Querying happens on every user request: embed the query, retrieve candidates from both dense (vector) and sparse (BM25) indexes, fuse rankings, rerank with a cross-encoder, and generate an answer grounded in the retrieved context.

Every component is a separate Python file. Every file maps to one section in this writeup.
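At a high level, the two phases compose like this. The function and parameter names below are illustrative stand-ins for the components covered in the rest of this post, not the repository's actual API:

```python
def index(documents, chunk, embed, store, bm25):
    """Indexing phase: run once per corpus."""
    chunks = [c for doc in documents for c in chunk(doc)]
    store.add(embed(chunks), chunks)   # dense (vector) index
    bm25.fit(chunks)                   # sparse (BM25) index
    return chunks

def answer(query, embed, store, bm25, fuse, rerank, generate, top_k=5):
    """Query phase: run on every user request."""
    dense = store.search(embed([query])[0], k=3 * top_k)   # over-fetch
    sparse = bm25.search(query, k=3 * top_k)
    candidates = fuse([dense, sparse])                     # reciprocal rank fusion
    best = rerank(query, candidates)[:top_k]               # cross-encoder
    return generate(query, best)                           # grounded answer
```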

// recursive text chunking

Embedding models have a fixed maximum input length. all-MiniLM-L6-v2 handles at most 256 tokens. You cannot embed a 50-page PDF as a single vector — even if it fit, the resulting vector would be a diffuse average of everything, matching nothing well.

Chunking splits documents into passages that fit within the model's window, are semantically coherent, and overlap slightly so information at boundaries isn't lost.

Why "recursive"?

A naive approach splits on \n\n. But what if a paragraph is still 2,000 characters? You split on \n. Still too long? Split on ". ". Still? On " ". Still? Hard-slice by character. This hierarchy of separators is the "recursive" part — try the most semantically meaningful separator first, then fall back progressively.
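A minimal sketch of that fallback hierarchy (the separator list and chunk_size are illustrative defaults, not necessarily the repo's):

```python
SEPARATORS = ["\n\n", "\n", ". ", " ", ""]  # most to least semantically meaningful

def recursive_split(text, chunk_size=500, seps=SEPARATORS):
    """Split text so every piece fits in chunk_size characters,
    preferring the most meaningful separator available."""
    if len(text) <= chunk_size:
        return [text]
    sep, *rest = seps
    if sep == "":  # last resort: hard-slice by character
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    pieces = []
    for part in text.split(sep):
        if len(part) > chunk_size:
            pieces.extend(recursive_split(part, chunk_size, rest))  # fall back
        else:
            pieces.append(part)
    return [p for p in pieces if p]
```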

The overlap window

After splitting, short pieces are merged back into full-sized chunks using a sliding deque. When the window fills, emit the current chunk, then pop from the left until only chunk_overlap characters remain. That retained tail becomes the start of the next chunk, so a concept that straddles a chunk boundary still appears intact in at least one chunk.
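A sketch of that sliding-window merge (sizes and names are illustrative):

```python
from collections import deque

def merge_with_overlap(pieces, chunk_size=500, chunk_overlap=50):
    """Merge short pieces into chunks of at most chunk_size characters,
    keeping a roughly chunk_overlap-character tail to seed the next chunk."""
    chunks = []
    window, length = deque(), 0
    for piece in pieces:
        if window and length + len(piece) > chunk_size:
            chunks.append("".join(window))        # window is full: emit it
            while window and length > chunk_overlap:
                length -= len(window.popleft())   # shrink down to the overlap tail
        window.append(piece)
        length += len(piece)
    if window:
        chunks.append("".join(window))            # flush the final chunk
    return chunks
```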

Position reconstruction

Each chunk needs start_char and end_char positions in the original text for source attribution. A forward cursor tracks position: text.find(chunk, cursor) always starts from roughly the right place, avoiding backtracking.
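A sketch of the cursor idea, assuming chunks occur verbatim and in order in the original text:

```python
def attach_positions(text, chunks):
    """Attach (start_char, end_char) to each chunk via a forward cursor."""
    cursor, positioned = 0, []
    for chunk in chunks:
        start = text.find(chunk, cursor)  # search forward only: no backtracking
        end = start + len(chunk)
        positioned.append({"text": chunk, "start_char": start, "end_char": end})
        cursor = start + 1  # overlap means the next chunk may begin before `end`
    return positioned
```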

// sentence embeddings — mean pooling

A sentence embedding maps a variable-length string to a fixed-length vector in ℝ^d such that semantically similar sentences are geometrically close. all-MiniLM-L6-v2 outputs 384-dimensional vectors.

Why not the [CLS] token?

BERT-family models prepend a [CLS] token whose final hidden state is used for classification. For sentence similarity, mean pooling — averaging the embeddings of the real (non-padding) tokens, as indicated by the attention mask — produces significantly better representations. It is also the pooling strategy the sentence-transformers models were fine-tuned with.

The math

The model outputs last_hidden_state of shape (B, T, D). Not all T tokens are real — padding fills shorter sequences. The attention mask is a (B, T) binary tensor.

mask_expanded = attention_mask.unsqueeze(-1).expand_as(token_embeddings)
# shape: (B, T, D)

sum_embeddings = torch.sum(token_embeddings * mask_expanded, dim=1)
# shape: (B, D) — sum of real token embeddings only

sum_mask = torch.clamp(mask_expanded.sum(dim=1), min=1e-9)
# shape: (B, D) — count of real tokens

pooled = sum_embeddings / sum_mask
# shape: (B, D) — mean of real token embeddings

The .unsqueeze(-1) adds a trailing dimension so the mask broadcasts across D. Forgetting .expand_as() is the most common implementation bug — you get a shape mismatch, or a silently incorrect broadcast.

L2 normalization

After pooling, normalize each vector to unit length, clipping the norms with np.clip(norms, 1e-12, None) to prevent division by zero. This converts cosine similarity into a simple dot product.
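The normalization step in full:

```python
import numpy as np

def l2_normalize(vectors):
    """Normalize each row to unit length; clip tiny norms to avoid divide-by-zero."""
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    return vectors / np.clip(norms, 1e-12, None)
```

A zero vector stays zero instead of producing NaNs, since its clipped norm is 1e-12.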

// vector store — cosine similarity as dot product

For two vectors a and b:

cosine_similarity(a, b) = (a · b) / (‖a‖ · ‖b‖)

If all stored vectors are L2-normalized (‖v‖ = 1), then cosine similarity = dot product. A dot product against the entire matrix is a single BLAS call:

scores = self._embeddings @ query_embedding   # (N, D) @ (D,) → (N,)

This is O(N × D) and runs in microseconds for tens of thousands of chunks.

Top-k without full sort

np.argpartition uses introselect, a quickselect variant — it finds the k largest values in O(N) average time without sorting everything. We then sort only those k candidates: O(k log k) instead of O(N log N).
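The two-step top-k looks like this:

```python
import numpy as np

def top_k(scores, k):
    """Indices of the k largest scores, highest first, without a full sort."""
    k = min(k, len(scores))
    idx = np.argpartition(scores, -k)[-k:]       # O(N): the k best, unordered
    return idx[np.argsort(scores[idx])[::-1]]    # O(k log k): order just those k
```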

// BM25 — sparse retrieval from first principles

Classic TF-IDF has two problems: term frequency is unbounded (mentioning "cat" 100 times scores 10× higher than 10 times, but both clearly discuss cats), and longer documents naturally have higher TF even for the same density.

BM25 fixes both

TF saturation (k1 = 1.5): As tf → ∞, a term's contribution saturates toward IDF(t) × (k1 + 1). Mentioning a term 100 times is only slightly better than 10 times.

Length normalization (b = 0.75): A document twice the average length is penalized but not eliminated. The formula divides by 1 - b + b × |d|/avgdl.

IDF — Robertson–Spärck Jones formula (smoothed)

IDF(t) = log( (N - df(t) + 0.5) / (df(t) + 0.5)  +  1 )

The outer +1 is critical. Without it, any term appearing in more than half the documents (df(t) > N/2) gets a negative IDF, actively penalizing documents that contain it. With it, the argument of the log is always at least 1, so IDF stays non-negative — a term in every document contributes approximately zero, never a negative score.

Full BM25 formula

score(d, q) = Σ_{t ∈ q} IDF(t) × tf(t,d) × (k1 + 1)
              ÷ (tf(t,d) + k1 × (1 - b + b × |d| / avgdl))

This implementation uses text.lower().translate(...punctuation...).split() for tokenization — no NLTK, no spaCy, no stopword lists. BM25's IDF naturally down-weights common words since they appear in nearly every document.
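Putting the pieces together, a compact BM25 index might look like this (class and method names are mine, not necessarily the repo's):

```python
import math
from collections import Counter
from string import punctuation

def tokenize(text):
    """Lowercase, strip punctuation, split on whitespace. No NLTK, no spaCy."""
    return text.lower().translate(str.maketrans("", "", punctuation)).split()

class BM25:
    def __init__(self, docs, k1=1.5, b=0.75):
        self.k1, self.b = k1, b
        self.docs = [tokenize(d) for d in docs]
        self.N = len(self.docs)
        self.avgdl = sum(len(d) for d in self.docs) / self.N
        df = Counter(t for d in self.docs for t in set(d))  # document frequency
        # smoothed IDF: the +1 keeps every term's IDF non-negative
        self.idf = {t: math.log((self.N - n + 0.5) / (n + 0.5) + 1)
                    for t, n in df.items()}

    def score(self, query, i):
        tf, dl = Counter(self.docs[i]), len(self.docs[i])
        norm = 1 - self.b + self.b * dl / self.avgdl  # length normalization
        return sum(
            self.idf.get(t, 0.0) * tf[t] * (self.k1 + 1)
            / (tf[t] + self.k1 * norm)                # TF saturation
            for t in tokenize(query) if t in tf
        )

    def search(self, query, k=5):
        ranked = sorted(range(self.N), key=lambda i: self.score(query, i),
                        reverse=True)
        return ranked[:k]
```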

// hybrid retrieval — reciprocal rank fusion

Dense retrieval excels at semantic matching ("furry household companion" → "The cat sat on the mat"). Sparse retrieval excels at exact keywords ("PyTorch 2.1 changelog"). Neither dominates. A hybrid approach gets the best of both.

Why not average scores?

Dense scores are cosine similarities in [-1, 1]. BM25 scores are unbounded positive floats. These distributions are completely different — combining them requires calibration that varies by corpus.

RRF operates on rankings, not scores

RRF_score(d) = Σ_{L ∈ lists} 1 / (k + rank_L(d))

Ranks are always integers starting at 1. No calibration needed. With k = 60 (empirically determined by Cormack et al., 2009), the difference between rank 1 and rank 10 is only ~15%, making RRF robust to noise in individual rankers.

Before fusing, we over-fetch 3× top_k from each retriever (minimum 20) to give RRF enough candidates to find the true best results.
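RRF itself is only a few lines; this sketch assumes each ranking is a list of document ids, best first:

```python
def rrf_fuse(rankings, k=60):
    """Fuse ranked lists of doc ids via reciprocal rank fusion."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):  # ranks start at 1
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```

Note that a document appearing in only one list still gets a score — it is simply unlikely to outrank documents both retrievers agree on.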

// cross-encoder reranking — joint attention

A bi-encoder embeds query and document independently, then computes similarity in vector space. A cross-encoder concatenates them as [CLS] query [SEP] document [SEP] and runs them through the transformer together. Every query token attends to every document token.

                  Bi-encoder              Cross-encoder
Accuracy          Moderate                High
Precompute docs   Yes                     No
Query latency     O(1) once indexed       O(candidates)
Use case          First-stage retrieval   Re-ranking

We use raw logits from the MS-MARCO model, not softmax. Softmax would normalize across the batch, making one document's score dependent on which others happen to be present — wrong for ranking where we want absolute scores.

// generation — prompt engineering & source attribution

The system prompt establishes a binding contract: answer only from context, cite with [Source N] notation, and refuse gracefully when context is insufficient. This constrains the model, allows graceful refusal, and enables source attribution in the UI.

Source attribution metadata

Each chunk carries its origin through the entire pipeline: metadata["source"] is set during ingestion, preserved through chunking and storage, and read by the generator to build citations. The user sees which file and chunk each answer draws from.
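A sketch of how that metadata could surface in the prompt; the system text and chunk dict layout here are illustrative, not the repository's exact format:

```python
SYSTEM = (
    "Answer ONLY from the provided context. Cite passages as [Source N]. "
    "If the context is insufficient to answer, say you don't know."
)

def build_prompt(question, chunks):
    """Number each retrieved chunk and label it with its origin file,
    so the model can cite [Source N] and the UI can resolve the citation."""
    context = "\n\n".join(
        f"[Source {i}] ({c['metadata']['source']})\n{c['text']}"
        for i, c in enumerate(chunks, start=1)
    )
    return [
        {"role": "system", "content": SYSTEM},
        {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"},
    ]
```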

Raw HTTP, no SDK

A single requests.post to /chat/completions works identically with OpenAI, Together AI, Groq, Ollama, or any OpenAI-compatible endpoint. The URL and key change; the code does not. Temperature is set to 0.2 — low enough for determinism, high enough to avoid repetition.
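A sketch of that call; the environment-variable names, default model, and base URL here are illustrative assumptions:

```python
import os
import requests

# Point this at OpenAI, Together, Groq, Ollama, or any compatible server
BASE_URL = os.environ.get("LLM_BASE_URL", "https://api.openai.com/v1")

def build_payload(messages, model="gpt-4o-mini", temperature=0.2):
    """OpenAI-compatible /chat/completions request body."""
    return {"model": model, "messages": messages, "temperature": temperature}

def chat(messages, api_key=None, **kw):
    """One raw POST; the endpoint changes, the code does not."""
    resp = requests.post(
        f"{BASE_URL}/chat/completions",
        headers={"Authorization": f"Bearer {api_key or os.environ.get('LLM_API_KEY', '')}"},
        json=build_payload(messages, **kw),
        timeout=60,
    )
    resp.raise_for_status()
    return resp.json()["choices"][0]["message"]["content"]
```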

// evaluation — measuring what matters

RAG quality has two distinct components. Retrieval quality: are the right chunks being found? Measured without an LLM. Generation quality: does the answer faithfully reflect the context? Requires language understanding to measure.

Metric         Formula                               What it measures
Precision@k    |retrieved ∩ relevant| / k            Fraction of retrieved chunks that are relevant
Recall@k       |retrieved ∩ relevant| / |relevant|   Fraction of all relevant chunks that were retrieved
MRR            1 / rank(first relevant)              How early the first relevant chunk appears
Faithfulness   LLM-as-judge (0.0 – 1.0)              Whether the answer is grounded in context

MRR matters most for RAG: if the best chunk is at rank 5 but we only pass rank 1 to the LLM, the question goes unanswered. Faithfulness uses an LLM-as-judge approach — expensive but necessary since no formula can measure semantic grounding.
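The three formula-based metrics are a few lines each; retrieved is an ordered list of ids, relevant a set:

```python
def precision_at_k(retrieved, relevant, k):
    """Fraction of the top-k retrieved chunks that are relevant."""
    return len(set(retrieved[:k]) & set(relevant)) / k

def recall_at_k(retrieved, relevant, k):
    """Fraction of all relevant chunks found in the top k."""
    return len(set(retrieved[:k]) & set(relevant)) / len(relevant)

def mrr(retrieved, relevant):
    """Reciprocal rank of the first relevant result; 0 if none retrieved."""
    for rank, doc_id in enumerate(retrieved, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0
```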

// implementation decisions

Decision        Chose              Over                        Why
PDF parsing     PyPDF2             pdfminer.six                Pure Python, zero compiled deps, simpler environment
Vector search   NumPy exact        FAISS (ANN)                 <10 ms for 100K vectors, no ANN errors, trivially inspectable
Tokenization    Whitespace         NLTK/spaCy                  BM25's IDF handles stopwords; simpler is more transparent
Embeddings      Raw transformers   sentence-transformers lib   Shows what the library does internally; generalizes to any model
LLM calls       requests.post      OpenAI SDK                  Works with any compatible endpoint; zero vendor lock-in

Every decision optimizes for transparency over convenience. The goal isn't the shortest code — it's the clearest mapping from math to implementation.


Repository ↗  ·  Full math derivations (LEARNING.md) ↗