Large language models are trained on static snapshots of the internet. They cannot know about documents you wrote last week, the internal policy PDF sitting on your server, or a research paper published after their training cutoff. They also hallucinate — they fill knowledge gaps with plausible-sounding but fabricated text.
Retrieval-Augmented Generation (RAG) solves both problems. Instead of asking the model to recall facts from its weights, you store your documents in a searchable index, retrieve the most relevant passages when a question arrives, stuff those passages into the prompt as context, and ask the model to answer only from that context. The model becomes a reading comprehension engine over a knowledge base you control.
I wanted to understand every layer of this — not by calling langchain.RetrievalQA(), but by
deriving each algorithm from scratch and implementing it by hand. No frameworks, no abstractions. Nine components,
each in its own file, each with its own math.
// what is RAG and why?
The pipeline has two phases. Indexing happens once: load documents, split into chunks, embed into vectors, store in the vector database. Querying happens on every user request: embed the query, retrieve candidates from both dense (vector) and sparse (BM25) indexes, fuse rankings, rerank with a cross-encoder, and generate an answer grounded in the retrieved context.
Every component is a separate Python file. Every file maps to one section in this writeup.
// recursive text chunking
Embedding models have a fixed maximum input length. all-MiniLM-L6-v2 handles at most 256 tokens.
You cannot embed a 50-page PDF as a single vector — even if it fit, the resulting vector would be a diffuse
average of everything, matching nothing well.
Chunking splits documents into passages that fit within the model's window, are semantically coherent, and overlap slightly so information at boundaries isn't lost.
Why "recursive"?
A naive approach splits on \n\n. But what if a paragraph is still 2,000 characters? You split on
\n. Still too long? Split on ". ". Still? On " ". Still? Hard-slice by
character. This hierarchy of separators is the "recursive" part — try the most semantically meaningful
separator first, then fall back progressively.
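The separator hierarchy can be sketched in a few lines. This is a simplified version that drops the separators themselves when splitting (a production splitter would usually reattach them):

```python
def recursive_split(text: str, max_len: int,
                    separators=("\n\n", "\n", ". ", " ")) -> list[str]:
    """Split text so every piece fits in max_len, trying the most
    semantically meaningful separator first."""
    if len(text) <= max_len:
        return [text]
    if not separators:
        # No separators left: hard-slice by character.
        return [text[i:i + max_len] for i in range(0, len(text), max_len)]
    sep, rest = separators[0], separators[1:]
    pieces = []
    for part in text.split(sep):
        if len(part) <= max_len:
            pieces.append(part)
        else:
            # This part is still too long: fall back to the next separator.
            pieces.extend(recursive_split(part, max_len, rest))
    return [p for p in pieces if p]
```

Each recursion level only handles the parts the previous separator could not break down, which is what keeps the splits as semantically coherent as possible.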
The overlap window
After splitting, short pieces are merged back into full-sized chunks using a sliding deque. When the window fills,
emit the current chunk, then pop from the left until only chunk_overlap characters remain. That
retained tail becomes the start of the next chunk, preventing a question about a concept that straddles two
chunks from missing context entirely.
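A minimal sketch of that sliding-window merge, counting characters only (the real chunker also has to account for separator lengths):

```python
from collections import deque

def merge_with_overlap(pieces: list[str], chunk_size: int,
                       chunk_overlap: int) -> list[str]:
    """Greedily pack small pieces into chunks of at most chunk_size
    characters, keeping a chunk_overlap-sized tail as the start of
    the next chunk."""
    window: deque = deque()
    length, chunks = 0, []
    for piece in pieces:
        if window and length + len(piece) > chunk_size:
            chunks.append("".join(window))       # emit the full window
            while window and length > chunk_overlap:
                length -= len(window.popleft())  # keep only the overlap tail
        window.append(piece)
        length += len(piece)
    if window:
        chunks.append("".join(window))           # flush the final chunk
    return chunks
```

Because the tail is never cleared, every chunk after the first begins with the end of its predecessor, so boundary-straddling content appears whole in at least one chunk.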
Position reconstruction
Each chunk needs start_char and end_char positions in the original text for source
attribution. A forward cursor tracks position: text.find(chunk, cursor) always starts from roughly
the right place, avoiding backtracking.
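A sketch of the forward-cursor idea, assuming chunks are verbatim substrings of the original text:

```python
def attach_positions(text: str, chunks: list[str]) -> list[dict]:
    """Recover each chunk's (start_char, end_char) span in the original
    text, advancing a cursor so find() never rescans from zero."""
    cursor, out = 0, []
    for chunk in chunks:
        start = text.find(chunk, cursor)
        if start == -1:        # overlap tail may begin before the cursor
            start = text.find(chunk)
        out.append({"text": chunk,
                    "start_char": start,
                    "end_char": start + len(chunk)})
        cursor = start + 1     # advance, but allow overlapping chunks
    return out
```

Advancing the cursor by one (not to the chunk's end) is what lets overlapping chunks, whose spans intersect by design, still be located correctly.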
// sentence embeddings — mean pooling
A sentence embedding maps a variable-length string to a fixed-length vector in ℝ^d such that semantically similar sentences are geometrically close. all-MiniLM-L6-v2 outputs 384-dimensional vectors.
Why not the [CLS] token?
BERT-family models prepend a [CLS] token whose final hidden state is used for classification.
For sentence similarity, mean pooling (averaging the token embeddings, with padding positions excluded via the attention mask) produces significantly better representations. It is the pooling strategy that sentence-transformers models were fine-tuned with.
The math
The model outputs last_hidden_state of shape (B, T, D). Not all T tokens are real —
padding fills shorter sequences. The attention mask is a (B, T) binary tensor.
mask_expanded = attention_mask.unsqueeze(-1).expand_as(token_embeddings).float()
# shape: (B, T, D) — the integer mask cast to float so the sums below stay float
sum_embeddings = torch.sum(token_embeddings * mask_expanded, dim=1)
# shape: (B, D) — sum of real token embeddings only
sum_mask = torch.clamp(mask_expanded.sum(dim=1), min=1e-9)
# shape: (B, D) — count of real tokens, repeated across the D dimension
pooled = sum_embeddings / sum_mask
# shape: (B, D) — mean of real token embeddings
.unsqueeze(-1) adds a trailing dimension so the mask broadcasts across D.
Forgetting .expand_as() (or the .float() cast) is the most common implementation bug: you get either an immediate shape mismatch or a silently incorrect broadcast.
L2 normalization
After pooling, normalize each vector to unit length with np.clip(norms, 1e-12, None) to prevent
division by zero. This converts cosine similarity into a simple dot product.
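In NumPy terms, a minimal sketch of that normalization:

```python
import numpy as np

def l2_normalize(vectors: np.ndarray) -> np.ndarray:
    """Scale each row to unit length; the clip keeps an all-zero vector
    from triggering a divide-by-zero (it divides by 1e-12 instead)."""
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    return vectors / np.clip(norms, 1e-12, None)

# Two parallel vectors of different magnitudes...
a = l2_normalize(np.array([[3.0, 4.0]]))
b = l2_normalize(np.array([[6.0, 8.0]]))
# ...after normalization, their dot product equals their cosine similarity.
```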
// vector store — cosine similarity as dot product
For two vectors a and b:
cosine_similarity(a, b) = (a · b) / (‖a‖ · ‖b‖)
If all stored vectors are L2-normalized (‖v‖ = 1), then cosine similarity = dot product. A dot product against the entire matrix is a single BLAS call:
scores = self._embeddings @ query_embedding # (N, D) @ (D,) → (N,)
This is O(N × D) and runs in microseconds for tens of thousands of chunks.
Top-k without full sort
np.argpartition is a quickselect variant — it finds the k largest values in O(N) without
sorting everything. We then sort only those k candidates: O(k log k) instead of O(N log N).
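A sketch of that two-step top-k:

```python
import numpy as np

def top_k(scores: np.ndarray, k: int) -> np.ndarray:
    """Indices of the k largest scores, highest first, without a full sort."""
    k = min(k, len(scores))
    idx = np.argpartition(scores, -k)[-k:]      # O(N): unordered top-k
    return idx[np.argsort(scores[idx])[::-1]]   # O(k log k): order just those

scores = np.array([0.1, 0.9, 0.4, 0.7, 0.2])
```

argpartition only guarantees the k largest values end up in the last k slots, in arbitrary order, which is why the small follow-up sort is needed.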
// BM25 — sparse retrieval from first principles
Classic TF-IDF has two problems: term frequency is unbounded (mentioning "cat" 100 times scores 10× higher than 10 times, but both clearly discuss cats), and longer documents naturally have higher TF even for the same density.
BM25 fixes both
TF saturation (k1 = 1.5): As tf → ∞, the score approaches (k1 + 1). Mentioning a term 100 times is only slightly better than 10 times.
Length normalization (b = 0.75): A document twice the average length is penalized but not
eliminated. The formula divides by 1 - b + b × |d|/avgdl.
IDF — Robertson–Spärck Jones formula (Lucene variant)
IDF(t) = log( (N - df(t) + 0.5) / (df(t) + 0.5) + 1 )
The +1 is critical. Without it, the ratio (N - df(t) + 0.5) / (df(t) + 0.5) drops below 1 for any term appearing in more than half the documents, so its logarithm goes negative and common terms actively subtract from the score. With it, the argument of the log is always greater than 1, so IDF stays non-negative: a term in every document (df(t) = N) scores log(0.5 / (N + 0.5) + 1) ≈ 0 and contributes essentially nothing.
Full BM25 formula
score(d, q) = Σ_{t ∈ q} IDF(t) × tf(t,d) × (k1 + 1) / (tf(t,d) + k1 × (1 - b + b × |d| / avgdl))
This implementation uses text.lower().translate(...punctuation...).split() for tokenization —
no NLTK, no spaCy, no stopword lists. BM25's IDF naturally down-weights common words since they appear in
nearly every document.
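The whole scorer fits in a short class. This is a from-scratch sketch following the formulas above with the same k1 and b defaults, not the repository's exact code:

```python
import math
import string
from collections import Counter

_PUNCT = str.maketrans("", "", string.punctuation)

def tokenize(text: str) -> list[str]:
    return text.lower().translate(_PUNCT).split()

class BM25:
    def __init__(self, docs: list[str], k1: float = 1.5, b: float = 0.75):
        self.k1, self.b = k1, b
        self.docs = [tokenize(d) for d in docs]
        self.N = len(self.docs)
        self.avgdl = sum(len(d) for d in self.docs) / self.N
        # df counts each term once per document it appears in
        self.df = Counter(t for d in self.docs for t in set(d))

    def idf(self, term: str) -> float:
        df = self.df[term]
        return math.log((self.N - df + 0.5) / (df + 0.5) + 1)

    def score(self, query: str, index: int) -> float:
        doc = self.docs[index]
        tf = Counter(doc)
        norm = 1 - self.b + self.b * len(doc) / self.avgdl
        return sum(
            self.idf(t) * tf[t] * (self.k1 + 1) / (tf[t] + self.k1 * norm)
            for t in tokenize(query) if t in tf
        )
```

Note there is no stemming: "cats" does not match "cat". That is a real limitation of bare whitespace tokenization, and one the dense retriever compensates for.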
// hybrid retrieval — reciprocal rank fusion
Dense retrieval excels at semantic matching ("furry household companion" → "The cat sat on the mat"). Sparse retrieval excels at exact keywords ("PyTorch 2.1 changelog"). Neither dominates. A hybrid approach gets the best of both.
Why not average scores?
Dense scores are cosine similarities in [-1, 1]. BM25 scores are unbounded positive floats. These distributions are completely different — combining them requires calibration that varies by corpus.
RRF operates on rankings, not scores
RRF_score(d) = Σ_{L ∈ lists} 1 / (k + rank_L(d))
Ranks are always integers starting at 1. No calibration needed. With k = 60 (empirically determined by Cormack et al., 2009), the difference between rank 1 and rank 10 is only ~15%, making RRF robust to noise in individual rankers.
Before fusing, we over-fetch 3× top_k from each retriever (minimum 20) to give RRF enough candidates to find the true best results.
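The fusion itself is only a few lines; a sketch operating on lists of document ids, best-first:

```python
def rrf_fuse(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse ranked lists of doc ids; each list contributes 1/(k + rank)
    for every document it contains, rank starting at 1."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```

A document that appears in both lists accumulates two contributions, which is why agreement between the dense and sparse retrievers pushes a result toward the top.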
// cross-encoder reranking — joint attention
A bi-encoder embeds query and document independently, then computes similarity in vector space.
A cross-encoder concatenates them as [CLS] query [SEP] document [SEP] and runs them
through the transformer together. Every query token attends to every document token.
| | Bi-encoder | Cross-encoder |
|---|---|---|
| Accuracy | Moderate | High |
| Precompute docs | Yes | No |
| Query latency | O(1) once indexed | O(candidates) |
| Use case | First-stage retrieval | Re-ranking |
We use raw logits from the MS-MARCO model, not softmax. Softmax would normalize across the batch, making one document's score dependent on which others happen to be present — wrong for ranking where we want absolute scores.
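A toy demonstration of that batch dependence, NumPy only, with made-up logits standing in for cross-encoder outputs:

```python
import numpy as np

def softmax(x: np.ndarray) -> np.ndarray:
    e = np.exp(x - x.max())
    return e / e.sum()

# Pretend these are cross-encoder logits for three candidate documents.
logits = np.array([2.0, 1.0, -1.0])

# The raw logit for document 0 is 2.0 regardless of the other candidates.
# Its softmax score, by contrast, shifts when one candidate is removed:
p_with_three = softmax(logits)[0]
p_with_two = softmax(logits[:2])[0]
```

The same document gets a different softmax score depending on its neighbors, while its raw logit is stable, which is exactly the property ranking needs.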
// generation — prompt engineering & source attribution
The system prompt establishes a binding contract: answer only from context, cite with [Source N]
notation, and refuse gracefully when context is insufficient. This constrains the model, allows graceful
refusal, and enables source attribution in the UI.
Source attribution metadata
Each chunk carries its origin through the entire pipeline: metadata["source"] is set during
ingestion, preserved through chunking and storage, and read by the generator to build citations. The user
sees which file and chunk each answer draws from.
Raw HTTP, no SDK
A single requests.post to /chat/completions works identically with OpenAI, Together AI,
Groq, Ollama, or any OpenAI-compatible endpoint. The URL and key change; the code does not.
Temperature is set to 0.2: low enough to be nearly deterministic, high enough to avoid the degenerate repetition that temperature 0 can produce.
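A sketch of the call. The writeup's code uses requests.post; this version uses only the standard library to stay dependency-free, and the payload is identical. The model name and URL in any real call are placeholders supplied by configuration:

```python
import json
import urllib.request

def build_payload(model: str, system: str, user: str,
                  temperature: float = 0.2) -> dict:
    """The /chat/completions request body shared by every
    OpenAI-compatible provider."""
    return {
        "model": model,
        "temperature": temperature,
        "messages": [
            {"role": "system", "content": system},
            {"role": "user", "content": user},
        ],
    }

def chat(base_url: str, api_key: str, model: str,
         system: str, user: str) -> str:
    req = urllib.request.Request(
        base_url.rstrip("/") + "/chat/completions",
        data=json.dumps(build_payload(model, system, user)).encode(),
        headers={"Authorization": f"Bearer {api_key}",
                 "Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]
```

Switching providers means changing base_url and api_key; nothing in the request body or response parsing moves.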
// evaluation — measuring what matters
RAG quality has two distinct components. Retrieval quality: are the right chunks being found? Measured without an LLM. Generation quality: does the answer faithfully reflect the context? Requires language understanding to measure.
| Metric | Formula | What it measures |
|---|---|---|
| Precision@k | \|retrieved ∩ relevant\| / k | Fraction of retrieved chunks that are relevant |
| Recall@k | \|retrieved ∩ relevant\| / \|relevant\| | Fraction of all relevant chunks that were retrieved |
| MRR | 1 / rank(first relevant) | How early the first relevant chunk appears |
| Faithfulness | LLM-as-judge (0.0 – 1.0) | Whether the answer is grounded in context |
MRR matters most for RAG: if the best chunk is at rank 5 but we only pass rank 1 to the LLM, the question goes unanswered. Faithfulness uses an LLM-as-judge approach — expensive but necessary since no formula can measure semantic grounding.
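The three retrieval metrics are a few lines each; a sketch over chunk ids:

```python
def precision_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the top-k retrieved chunks that are relevant."""
    return sum(1 for d in retrieved[:k] if d in relevant) / k

def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of all relevant chunks that made it into the top k."""
    return sum(1 for d in retrieved[:k] if d in relevant) / len(relevant)

def mrr(retrieved: list[str], relevant: set[str]) -> float:
    """Reciprocal rank of the first relevant chunk (0.0 if none found)."""
    for rank, d in enumerate(retrieved, start=1):
        if d in relevant:
            return 1.0 / rank
    return 0.0
```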
// implementation decisions
| Decision | Chose | Over | Why |
|---|---|---|---|
| PDF parsing | PyPDF2 | pdfminer.six | Pure Python, zero compiled deps, simpler environment |
| Vector search | NumPy exact | FAISS (ANN) | <10ms for 100K vectors, no ANN errors, trivially inspectable |
| Tokenization | Whitespace | NLTK/spaCy | BM25's IDF handles stopwords; simpler is more transparent |
| Embeddings | Raw transformers | sentence-transformers lib | Shows what the library does internally; generalizes to any model |
| LLM calls | requests.post | OpenAI SDK | Works with any compatible endpoint; zero vendor lock-in |
Every decision optimizes for transparency over convenience. The goal isn't the shortest code — it's the clearest mapping from math to implementation.
Repository ↗ · Full math derivations (LEARNING.md) ↗