Large language models are trained on static snapshots of the internet. They cannot know about documents you wrote last week, the internal policy PDF sitting on your server, or a research paper published after their training cutoff. They also hallucinate — they fill knowledge gaps with plausible-sounding but fabricated text.
Retrieval-Augmented Generation (RAG) solves both problems. Instead of asking the model to recall facts from its weights, you store your documents in a searchable index, retrieve the most relevant passages when a question arrives, stuff those passages into the prompt as context, and ask the model to answer only from that context. The model becomes a reading comprehension engine over a knowledge base you control.
I wanted to understand every layer of this — not by calling langchain.RetrievalQA(), but by
deriving each algorithm from scratch and implementing it by hand. No frameworks, no abstractions. Nine components,
each in its own file, each with its own math.
// what is RAG and why?
The pipeline has two phases. Indexing happens once: load documents, split into chunks, embed into vectors, store in the vector database. Querying happens on every user request: embed the query, retrieve candidates from both dense (vector) and sparse (BM25) indexes, fuse rankings, rerank with a cross-encoder, and generate an answer grounded in the retrieved context.
Every component is a separate Python file. Every file maps to one section in this writeup.
// recursive text chunking
Embedding models have a fixed maximum input length. all-MiniLM-L6-v2 handles at most 256 tokens.
You cannot embed a 50-page PDF as a single vector — even if it fit, the resulting vector would be a diffuse
average of everything, matching nothing well.
Chunking splits documents into passages that fit within the model's window, are semantically coherent, and overlap slightly so information at boundaries isn't lost.
Why "recursive"?
A naive approach splits on \n\n. But what if a paragraph is still 2,000 characters? You split on
\n. Still too long? Split on ". ". Still? On " ". Still? Hard-slice by
character. This hierarchy of separators is the "recursive" part — try the most semantically meaningful
separator first, then fall back progressively.
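The separator hierarchy can be sketched in a few lines. This is a simplified version that drops the separators themselves when splitting (a production splitter would usually reattach them):

```python
def recursive_split(text: str, max_len: int,
                    separators=("\n\n", "\n", ". ", " ")) -> list[str]:
    """Split text so every piece fits in max_len, trying the most
    semantically meaningful separator first."""
    if len(text) <= max_len:
        return [text]
    if not separators:
        # No separators left: hard-slice by character.
        return [text[i:i + max_len] for i in range(0, len(text), max_len)]
    sep, rest = separators[0], separators[1:]
    pieces = []
    for part in text.split(sep):
        if len(part) <= max_len:
            pieces.append(part)
        else:
            # This part is still too long: fall back to the next separator.
            pieces.extend(recursive_split(part, max_len, rest))
    return [p for p in pieces if p]
```

Each recursion level only handles the parts the previous separator could not break down, which is what keeps the splits as semantically coherent as possible.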
The overlap window
After splitting, short pieces are merged back into full-sized chunks using a sliding deque. When the window fills,
emit the current chunk, then pop from the left until only chunk_overlap characters remain. That
retained tail becomes the start of the next chunk, preventing a question about a concept that straddles two
chunks from missing context entirely.
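A minimal sketch of that sliding-window merge, counting characters only (the real chunker also has to account for separator lengths):

```python
from collections import deque

def merge_with_overlap(pieces: list[str], chunk_size: int,
                       chunk_overlap: int) -> list[str]:
    """Greedily pack small pieces into chunks of at most chunk_size
    characters, keeping a chunk_overlap-sized tail as the start of
    the next chunk."""
    window: deque = deque()
    length, chunks = 0, []
    for piece in pieces:
        if window and length + len(piece) > chunk_size:
            chunks.append("".join(window))       # emit the full window
            while window and length > chunk_overlap:
                length -= len(window.popleft())  # keep only the overlap tail
        window.append(piece)
        length += len(piece)
    if window:
        chunks.append("".join(window))           # flush the final chunk
    return chunks
```

Because the tail is never cleared, every chunk after the first begins with the end of its predecessor, so boundary-straddling content appears whole in at least one chunk.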
Position reconstruction
Each chunk needs start_char and end_char positions in the original text for source
attribution. A forward cursor tracks position: text.find(chunk, cursor) always starts from roughly
the right place, avoiding backtracking.
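A sketch of the forward-cursor idea, assuming chunks are verbatim substrings of the original text:

```python
def attach_positions(text: str, chunks: list[str]) -> list[dict]:
    """Recover each chunk's (start_char, end_char) span in the original
    text, advancing a cursor so find() never rescans from zero."""
    cursor, out = 0, []
    for chunk in chunks:
        start = text.find(chunk, cursor)
        if start == -1:        # overlap tail may begin before the cursor
            start = text.find(chunk)
        out.append({"text": chunk,
                    "start_char": start,
                    "end_char": start + len(chunk)})
        cursor = start + 1     # advance, but allow overlapping chunks
    return out
```

Advancing the cursor by one (not to the chunk's end) is what lets overlapping chunks, whose spans intersect by design, still be located correctly.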
// sentence embeddings — mean pooling
A sentence embedding maps a variable-length string to a fixed-length vector in ℝ^d such that semantically similar sentences are geometrically close. all-MiniLM-L6-v2 outputs 384-dimensional vectors.
Why not the [CLS] token?
BERT-family models prepend a [CLS] token whose final hidden state is used for classification.
For sentence similarity, mean pooling (averaging the token embeddings, with padding positions excluded via the attention mask) produces significantly better representations. It is the pooling strategy that sentence-transformers models were fine-tuned with.
The math
The model outputs last_hidden_state of shape (B, T, D). Not all T tokens are real —
padding fills shorter sequences. The attention mask is a (B, T) binary tensor.
mask_expanded = attention_mask.unsqueeze(-1).expand_as(token_embeddings).float()
# shape: (B, T, D) — the integer mask cast to float so the sums below stay float
sum_embeddings = torch.sum(token_embeddings * mask_expanded, dim=1)
# shape: (B, D) — sum of real token embeddings only
sum_mask = torch.clamp(mask_expanded.sum(dim=1), min=1e-9)
# shape: (B, D) — count of real tokens, repeated across the D dimension
pooled = sum_embeddings / sum_mask
# shape: (B, D) — mean of real token embeddings
.unsqueeze(-1) adds a trailing dimension so the mask broadcasts across D.
Forgetting .expand_as() (or the .float() cast) is the most common implementation bug: you get either an immediate shape mismatch or a silently incorrect broadcast.
L2 normalization
After pooling, normalize each vector to unit length with np.clip(norms, 1e-12, None) to prevent
division by zero. This converts cosine similarity into a simple dot product.
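In NumPy terms, a minimal sketch of that normalization:

```python
import numpy as np

def l2_normalize(vectors: np.ndarray) -> np.ndarray:
    """Scale each row to unit length; the clip keeps an all-zero vector
    from triggering a divide-by-zero (it divides by 1e-12 instead)."""
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    return vectors / np.clip(norms, 1e-12, None)

# Two parallel vectors of different magnitudes...
a = l2_normalize(np.array([[3.0, 4.0]]))
b = l2_normalize(np.array([[6.0, 8.0]]))
# ...after normalization, their dot product equals their cosine similarity.
```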
// vector store — cosine similarity as dot product
For two vectors a and b:
cosine_similarity(a, b) = (a · b) / (‖a‖ · ‖b‖)
If all stored vectors are L2-normalized (‖v‖ = 1), then cosine similarity = dot product. A dot product against the entire matrix is a single BLAS call:
scores = self._embeddings @ query_embedding # (N, D) @ (D,) → (N,)
This is O(N × D) and runs in microseconds for tens of thousands of chunks.
Top-k without full sort
np.argpartition is a quickselect variant — it finds the k largest values in O(N) without
sorting everything. We then sort only those k candidates: O(k log k) instead of O(N log N).
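A sketch of that two-step top-k:

```python
import numpy as np

def top_k(scores: np.ndarray, k: int) -> np.ndarray:
    """Indices of the k largest scores, highest first, without a full sort."""
    k = min(k, len(scores))
    idx = np.argpartition(scores, -k)[-k:]      # O(N): unordered top-k
    return idx[np.argsort(scores[idx])[::-1]]   # O(k log k): order just those

scores = np.array([0.1, 0.9, 0.4, 0.7, 0.2])
```

argpartition only guarantees the k largest values end up in the last k slots, in arbitrary order, which is why the small follow-up sort is needed.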
// BM25 — sparse retrieval from first principles
Classic TF-IDF has two problems: term frequency is unbounded (mentioning "cat" 100 times scores 10× higher than 10 times, but both clearly discuss cats), and longer documents naturally have higher TF even for the same density.
BM25 fixes both
TF saturation (k1 = 1.5): As tf → ∞, the score approaches (k1 + 1). Mentioning a term 100 times is only slightly better than 10 times.
Length normalization (b = 0.75): A document twice the average length is penalized but not
eliminated. The formula divides by 1 - b + b × |d|/avgdl.
IDF — Robertson–Spärck Jones formula (Lucene variant)
IDF(t) = log( (N - df(t) + 0.5) / (df(t) + 0.5) + 1 )
The +1 is critical. Without it, the ratio (N - df(t) + 0.5) / (df(t) + 0.5) drops below 1 for any term appearing in more than half the documents, so its logarithm goes negative and common terms actively subtract from the score. With it, the argument of the log is always greater than 1, so IDF stays non-negative: a term in every document (df(t) = N) scores log(0.5 / (N + 0.5) + 1) ≈ 0 and contributes essentially nothing.
Full BM25 formula
score(d, q) = Σ_{t ∈ q} IDF(t) × tf(t,d) × (k1 + 1) / (tf(t,d) + k1 × (1 - b + b × |d| / avgdl))
This implementation uses text.lower().translate(...punctuation...).split() for tokenization —
no NLTK, no spaCy, no stopword lists. BM25's IDF naturally down-weights common words since they appear in
nearly every document.
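The whole scorer fits in a short class. This is a from-scratch sketch following the formulas above with the same k1 and b defaults, not the repository's exact code:

```python
import math
import string
from collections import Counter

_PUNCT = str.maketrans("", "", string.punctuation)

def tokenize(text: str) -> list[str]:
    return text.lower().translate(_PUNCT).split()

class BM25:
    def __init__(self, docs: list[str], k1: float = 1.5, b: float = 0.75):
        self.k1, self.b = k1, b
        self.docs = [tokenize(d) for d in docs]
        self.N = len(self.docs)
        self.avgdl = sum(len(d) for d in self.docs) / self.N
        # df counts each term once per document it appears in
        self.df = Counter(t for d in self.docs for t in set(d))

    def idf(self, term: str) -> float:
        df = self.df[term]
        return math.log((self.N - df + 0.5) / (df + 0.5) + 1)

    def score(self, query: str, index: int) -> float:
        doc = self.docs[index]
        tf = Counter(doc)
        norm = 1 - self.b + self.b * len(doc) / self.avgdl
        return sum(
            self.idf(t) * tf[t] * (self.k1 + 1) / (tf[t] + self.k1 * norm)
            for t in tokenize(query) if t in tf
        )
```

Note there is no stemming: "cats" does not match "cat". That is a real limitation of bare whitespace tokenization, and one the dense retriever compensates for.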
// hybrid retrieval — reciprocal rank fusion
Dense retrieval excels at semantic matching ("furry household companion" → "The cat sat on the mat"). Sparse retrieval excels at exact keywords ("PyTorch 2.1 changelog"). Neither dominates. A hybrid approach gets the best of both.
Why not average scores?
Dense scores are cosine similarities in [-1, 1]. BM25 scores are unbounded positive floats. These distributions are completely different — combining them requires calibration that varies by corpus.
RRF operates on rankings, not scores
RRF_score(d) = Σ_{L ∈ lists} 1 / (k + rank_L(d))
Ranks are always integers starting at 1. No calibration needed. With k = 60 (empirically determined by Cormack et al., 2009), the difference between rank 1 and rank 10 is only ~15%, making RRF robust to noise in individual rankers.
Before fusing, we over-fetch 3× top_k from each retriever (minimum 20) to give RRF enough candidates to find the true best results.
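The fusion itself is only a few lines; a sketch operating on lists of document ids, best-first:

```python
def rrf_fuse(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse ranked lists of doc ids; each list contributes 1/(k + rank)
    for every document it contains, rank starting at 1."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```

A document that appears in both lists accumulates two contributions, which is why agreement between the dense and sparse retrievers pushes a result toward the top.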
// cross-encoder reranking — joint attention
A bi-encoder embeds query and document independently, then computes similarity in vector space.
A cross-encoder concatenates them as [CLS] query [SEP] document [SEP] and runs them
through the transformer together. Every query token attends to every document token.
| | Bi-encoder | Cross-encoder |
|---|---|---|
| Accuracy | Moderate | High |
| Precompute docs | Yes | No |
| Query latency | O(1) once indexed | O(candidates) |
| Use case | First-stage retrieval | Re-ranking |
We use raw logits from the MS-MARCO model, not softmax. Softmax would normalize across the batch, making one document's score dependent on which others happen to be present — wrong for ranking where we want absolute scores.
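A toy demonstration of that batch dependence, NumPy only, with made-up logits standing in for cross-encoder outputs:

```python
import numpy as np

def softmax(x: np.ndarray) -> np.ndarray:
    e = np.exp(x - x.max())
    return e / e.sum()

# Pretend these are cross-encoder logits for three candidate documents.
logits = np.array([2.0, 1.0, -1.0])

# The raw logit for document 0 is 2.0 regardless of the other candidates.
# Its softmax score, by contrast, shifts when one candidate is removed:
p_with_three = softmax(logits)[0]
p_with_two = softmax(logits[:2])[0]
```

The same document gets a different softmax score depending on its neighbors, while its raw logit is stable, which is exactly the property ranking needs.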
// generation — prompt engineering & source attribution
The system prompt establishes a binding contract: answer only from context, cite with [Source N]
notation, and refuse gracefully when context is insufficient. This constrains the model, allows graceful
refusal, and enables source attribution in the UI.
Source attribution metadata
Each chunk carries its origin through the entire pipeline: metadata["source"] is set during
ingestion, preserved through chunking and storage, and read by the generator to build citations. The user
sees which file and chunk each answer draws from.
Raw HTTP, no SDK
A single requests.post to /chat/completions works identically with OpenAI, Together AI,
Groq, Ollama, or any OpenAI-compatible endpoint. The URL and key change; the code does not.
Temperature is set to 0.2: low enough to be nearly deterministic, high enough to avoid the degenerate repetition that temperature 0 can produce.
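A sketch of the call. The writeup's code uses requests.post; this version uses only the standard library to stay dependency-free, and the payload is identical. The model name and URL in any real call are placeholders supplied by configuration:

```python
import json
import urllib.request

def build_payload(model: str, system: str, user: str,
                  temperature: float = 0.2) -> dict:
    """The /chat/completions request body shared by every
    OpenAI-compatible provider."""
    return {
        "model": model,
        "temperature": temperature,
        "messages": [
            {"role": "system", "content": system},
            {"role": "user", "content": user},
        ],
    }

def chat(base_url: str, api_key: str, model: str,
         system: str, user: str) -> str:
    req = urllib.request.Request(
        base_url.rstrip("/") + "/chat/completions",
        data=json.dumps(build_payload(model, system, user)).encode(),
        headers={"Authorization": f"Bearer {api_key}",
                 "Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]
```

Switching providers means changing base_url and api_key; nothing in the request body or response parsing moves.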
// evaluation — measuring what matters
RAG quality has two distinct components. Retrieval quality: are the right chunks being found? Measured without an LLM. Generation quality: does the answer faithfully reflect the context? Requires language understanding to measure.
| Metric | Formula | What it measures |
|---|---|---|
| Precision@k | \|retrieved ∩ relevant\| / k | Fraction of retrieved chunks that are relevant |
| Recall@k | \|retrieved ∩ relevant\| / \|relevant\| | Fraction of all relevant chunks that were retrieved |
| MRR | 1 / rank(first relevant) | How early the first relevant chunk appears |
| Faithfulness | LLM-as-judge (0.0 – 1.0) | Whether the answer is grounded in context |
MRR matters most for RAG: if the best chunk is at rank 5 but we only pass rank 1 to the LLM, the question goes unanswered. Faithfulness uses an LLM-as-judge approach — expensive but necessary since no formula can measure semantic grounding.
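The three retrieval metrics are a few lines each; a sketch over chunk ids:

```python
def precision_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the top-k retrieved chunks that are relevant."""
    return sum(1 for d in retrieved[:k] if d in relevant) / k

def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of all relevant chunks that made it into the top k."""
    return sum(1 for d in retrieved[:k] if d in relevant) / len(relevant)

def mrr(retrieved: list[str], relevant: set[str]) -> float:
    """Reciprocal rank of the first relevant chunk (0.0 if none found)."""
    for rank, d in enumerate(retrieved, start=1):
        if d in relevant:
            return 1.0 / rank
    return 0.0
```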
// implementation decisions
| Decision | Chose | Over | Why |
|---|---|---|---|
| PDF parsing | PyPDF2 | pdfminer.six | Pure Python, zero compiled deps, simpler environment |
| Vector search | NumPy exact | FAISS (ANN) | <10ms for 100K vectors, no ANN errors, trivially inspectable |
| Tokenization | Whitespace | NLTK/spaCy | BM25's IDF handles stopwords; simpler is more transparent |
| Embeddings | Raw transformers | sentence-transformers lib | Shows what the library does internally; generalizes to any model |
| LLM calls | requests.post | OpenAI SDK | Works with any compatible endpoint; zero vendor lock-in |
Every decision optimizes for transparency over convenience. The goal isn't the shortest code — it's the clearest mapping from math to implementation.
Repository ↗ · Full math derivations (LEARNING.md) ↗