// why build this
After building RAG from scratch — deriving every algorithm by hand — I wanted to answer a different question: what does it take to ship a RAG pipeline as a real API?
The from-scratch project proved I understood the algorithms. RAGOps proves I can productionise them. The gap between a Jupyter notebook and a deployed API is where most ML projects die. This project bridges that gap: persistent vector storage, containerised deployment, automated testing, and CI regression gating.
// architecture
Layered design: Routers handle HTTP, services contain business logic, rag_core holds the ML components, and PostgreSQL with pgvector stores vectors persistently. No business logic in routers — they only validate input and call services.
This is the standard separation you'd see in a production FastAPI codebase. Each layer is independently testable. The router tests mock services, the service tests mock the database, and the rag_core tests run against real models.
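The split described above can be sketched in a few lines. Names here are illustrative, not the actual RAGOps modules — the point is that the router validates and delegates, while the service owns the logic:

```python
# Minimal sketch of the router/service split. The router layer never
# touches the database or the models; it validates input, calls the
# service, and shapes the response.
from dataclasses import dataclass, field


@dataclass
class QueryResult:
    answer: str
    sources: list[str] = field(default_factory=list)


class QueryService:
    """Business logic lives here; easy to mock in router tests."""

    def answer(self, question: str, top_k: int) -> QueryResult:
        # retrieve -> rerank -> generate (omitted in this sketch)
        return QueryResult(answer="stub")


def query_route(service: QueryService, question: str, top_k: int = 5) -> dict:
    """Router layer: validate, delegate, serialise."""
    if not question.strip():
        raise ValueError("question must be non-empty")
    result = service.answer(question, top_k)
    return {"answer": result.answer, "sources": result.sources}
```

Because the route function takes the service as a parameter (in FastAPI this would be a `Depends` injection), router tests can pass a stub service and never load a model.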
// tech stack
| Component | Technology |
|---|---|
| API | FastAPI with auto OpenAPI docs |
| Database | PostgreSQL 15 + pgvector (HNSW index) |
| Embeddings | sentence-transformers (all-MiniLM-L6-v2, 384-dim) |
| Reranker | cross-encoder (ms-marco-MiniLM-L-6-v2) |
| LLM | Raw HTTP to any OpenAI-compatible endpoint |
| Containers | Docker Compose (api + postgres) |
| CI | GitHub Actions — pytest + ruff on push |
| Config | pydantic-settings via .env |
// API design
Three endpoints. That's it. Simplicity is a feature.
POST /v1/ingest — upload a document (.txt, .md, .pdf). The service chunks it, embeds each chunk, and stores vectors in pgvector. Returns a document ID and chunk count.
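The chunking step can be as simple as a fixed-size window with overlap — a common default for RAG ingestion. The size and overlap values below are illustrative, not the actual RAGOps parameters:

```python
# Fixed-size character chunker with overlap, so that a sentence split
# across a boundary still appears whole in at least one chunk.
def chunk_text(text: str, size: int = 500, overlap: int = 50) -> list[str]:
    """Split text into overlapping windows of at most `size` characters."""
    if size <= overlap:
        raise ValueError("size must exceed overlap")
    step = size - overlap
    chunks = []
    for start in range(0, len(text), step):
        piece = text[start:start + size]
        if piece:
            chunks.append(piece)
    return chunks
```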
POST /v1/query — ask a question with a top_k parameter. The service embeds the query, retrieves candidates from pgvector, reranks with the cross-encoder, and generates an answer with source attribution. Returns the answer, source chunks with scores, and token usage.
GET /health — liveness probe for container orchestration.
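The retrieve-then-rerank step in /v1/query follows a standard shape: fetch a wide candidate set cheaply, then re-score the survivors with a more expensive model. A sketch, with the scoring function as a stand-in for the cross-encoder:

```python
# Generic rerank step: score every candidate against the query and keep
# the top_k. In RAGOps the candidates come from pgvector and score_fn is
# a cross-encoder; here score_fn is any callable, so the logic is testable.
from typing import Callable


def rerank(query: str,
           candidates: list[str],
           score_fn: Callable[[str, str], float],
           top_k: int = 3) -> list[tuple[float, str]]:
    scored = [(score_fn(query, c), c) for c in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_k]
```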
// offline evaluation
A RAG API without evaluation metrics is just a demo. RAGOps includes an offline evaluation harness that measures Precision@k, Recall@k, and MRR (Mean Reciprocal Rank) against a test corpus.
The evaluation runs as part of CI. If retrieval quality regresses below the baseline, the pipeline fails. This is the "ops" in RAGOps — you don't just build it, you continuously verify it works.
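The three metrics named above are each a few lines. Here `retrieved` is the ranked list of chunk IDs the pipeline returned and `relevant` is the gold set for that query:

```python
# Standard retrieval metrics: Precision@k, Recall@k, and MRR.
def precision_at_k(retrieved: list, relevant: set, k: int) -> float:
    """Fraction of the top-k results that are relevant."""
    return sum(1 for doc in retrieved[:k] if doc in relevant) / k


def recall_at_k(retrieved: list, relevant: set, k: int) -> float:
    """Fraction of all relevant docs that appear in the top-k."""
    if not relevant:
        return 0.0
    return sum(1 for doc in retrieved[:k] if doc in relevant) / len(relevant)


def reciprocal_rank(retrieved: list, relevant: set) -> float:
    """1/rank of the first relevant result, 0 if none found."""
    for rank, doc in enumerate(retrieved, start=1):
        if doc in relevant:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(runs: list[tuple[list, set]]) -> float:
    """Average reciprocal rank over (retrieved, relevant) pairs, one per query."""
    return sum(reciprocal_rank(r, g) for r, g in runs) / len(runs)
```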
// key decisions
pgvector over Pinecone/Weaviate — PostgreSQL is battle-tested infrastructure. pgvector adds vector similarity search to a database most teams already run. No new service to manage, no vendor lock-in, and HNSW indexing gives sub-linear search time.
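For context, the whole pgvector setup fits in a few statements. This is a hedged sketch — table and column names are illustrative, not the actual RAGOps schema:

```sql
-- Enable the extension, store 384-dim embeddings, index with HNSW.
CREATE EXTENSION IF NOT EXISTS vector;

CREATE TABLE chunks (
    id        bigserial PRIMARY KEY,
    doc_id    bigint NOT NULL,
    content   text NOT NULL,
    embedding vector(384) NOT NULL
);

CREATE INDEX ON chunks USING hnsw (embedding vector_cosine_ops);

-- Retrieval: <=> is cosine distance, so ORDER BY ascending = most similar.
SELECT id, content
FROM chunks
ORDER BY embedding <=> $1
LIMIT 10;
```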
Raw HTTP over LangChain — the LLM integration is a single HTTP call to an OpenAI-compatible endpoint. LangChain adds 50+ transitive dependencies for something that's 20 lines of code. When the abstraction costs more than the thing it abstracts, skip the abstraction.
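To make "20 lines of code" concrete, here is what that call looks like with only the standard library. The endpoint path and message shape follow the OpenAI chat-completions convention; the model name and prompt wording are placeholders, not RAGOps's actual config:

```python
# Raw HTTP to an OpenAI-compatible chat endpoint, no SDK required.
import json
import urllib.request


def build_payload(model: str, question: str, context: str) -> dict:
    """Assemble the chat-completions request body."""
    return {
        "model": model,
        "messages": [
            {"role": "system", "content": "Answer using only the provided context."},
            {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"},
        ],
    }


def chat(base_url: str, api_key: str, payload: dict) -> str:
    """POST the payload and return the first choice's message text."""
    req = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode(),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]
```

Swapping providers means changing `base_url` and `model` — no adapter classes, no dependency tree.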
Docker Compose for local dev — `make up` starts the API and PostgreSQL with pgvector pre-configured. No "install PostgreSQL, enable pgvector extension, run migrations" dance. One command, everything works.
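A compose file for this shape of stack is short. The sketch below is illustrative — service names, image tag, and credentials are assumptions, not copied from the repo:

```yaml
# Two services: the API and PostgreSQL with pgvector baked in.
services:
  postgres:
    image: pgvector/pgvector:pg15
    environment:
      POSTGRES_USER: rag
      POSTGRES_PASSWORD: rag
      POSTGRES_DB: rag
    ports:
      - "5432:5432"
  api:
    build: .
    depends_on:
      - postgres
    environment:
      DATABASE_URL: postgresql://rag:rag@postgres:5432/rag
    ports:
      - "8000:8000"
```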
CI regression gate — every push runs tests and linting. The evaluation harness ensures retrieval quality doesn't regress silently. This is the difference between a project and a product.
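The gate itself reduces to a comparison against a checked-in baseline. A sketch — the baseline value, tolerance, and harness API are illustrative:

```python
# CI regression gate: fail the pipeline if retrieval quality (here MRR)
# drops below the stored baseline by more than a small noise tolerance.
BASELINE_MRR = 0.80  # recorded from a known-good run, committed to the repo


def passes_gate(current_mrr: float,
                baseline: float = BASELINE_MRR,
                tolerance: float = 0.02) -> bool:
    """Allow run-to-run noise; fail on a real regression."""
    return current_mrr >= baseline - tolerance
```

In CI this would be a pytest assertion (`assert passes_gate(run_eval())`), so a silent quality drop becomes a red build.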