$ cat ./blog/ragops-api.md  ·  March 2026  ·  ~8 min read

RAGOps API
— Production RAG with FastAPI and pgvector


// why build this

After building RAG from scratch — deriving every algorithm by hand — I wanted to answer a different question: what does it take to ship a RAG pipeline as a real API?

The from-scratch project proved I understood the algorithms. RAGOps proves I can productionise them. The gap between a Jupyter notebook and a deployed API is where most ML projects die. This project bridges that gap: persistent vector storage, containerised deployment, automated testing, and CI regression gating.

// architecture

Client
  → FastAPI (routers)
  → services (ingest_service, rag_service)
  → rag_core (chunking, embedding, rerank)
  → PostgreSQL + pgvector  |  OpenAI-compatible LLM

Layered design: Routers handle HTTP, services contain business logic, rag_core holds the ML components, and PostgreSQL with pgvector stores vectors persistently. No business logic in routers — they only validate input and call services.

This is the standard separation you'd see in a production FastAPI codebase. Each layer is independently testable. The router tests mock services, the service tests mock the database, and the rag_core tests run against real models.

// tech stack

Component     Technology
API           FastAPI with auto OpenAPI docs
Database      PostgreSQL 15 + pgvector (HNSW index)
Embeddings    sentence-transformers (all-MiniLM-L6-v2, 384-dim)
Reranker      cross-encoder (ms-marco-MiniLM-L-6-v2)
LLM           Raw HTTP to any OpenAI-compatible endpoint
Containers    Docker Compose (api + postgres)
CI            GitHub Actions — pytest + ruff on push
Config        pydantic-settings via .env

// API design

Three endpoints. That's it. Simplicity is a feature.

POST /v1/ingest — upload a document (.txt, .md, .pdf). The service chunks it, embeds each chunk, and stores vectors in pgvector. Returns a document ID and chunk count.

POST /v1/query — ask a question with a top_k parameter. The service embeds the query, retrieves candidates from pgvector, reranks with the cross-encoder, and generates an answer with source attribution. Returns the answer, source chunks with scores, and token usage.

GET /health — liveness probe for container orchestration.
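Calling the query endpoint looks roughly like this. A stdlib-only sketch, assuming the request shape described above and a placeholder base URL:

```python
# Sketch: build a POST /v1/query request (payload shape assumed from the
# endpoint description; localhost:8000 is a placeholder).
import json
import urllib.request

BASE = "http://localhost:8000"

def post_query(question: str, top_k: int = 5) -> urllib.request.Request:
    payload = json.dumps({"question": question, "top_k": top_k}).encode()
    return urllib.request.Request(
        f"{BASE}/v1/query",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

req = post_query("What index does pgvector use?", top_k=3)
# urllib.request.urlopen(req) would return the answer, sources, and token usage.
```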

// offline evaluation

A RAG API without evaluation metrics is just a demo. RAGOps includes an offline evaluation harness that measures Precision@k, Recall@k, and MRR (Mean Reciprocal Rank) against a test corpus.

The evaluation runs as part of CI. If retrieval quality regresses below the baseline, the pipeline fails. This is the "ops" in RAGOps — you don't just build it, you continuously verify it works.
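The metrics themselves are standard and fit in a few lines. A sketch with my own function names (the harness's internals may differ):

```python
# Retrieval metrics over ranked document IDs.
def precision_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    # Fraction of the top-k results that are relevant.
    return sum(1 for doc in retrieved[:k] if doc in relevant) / k

def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    # Fraction of all relevant documents found in the top k.
    return sum(1 for doc in retrieved[:k] if doc in relevant) / len(relevant)

def mrr(queries: list[tuple[list[str], set[str]]]) -> float:
    # Mean over queries of 1/rank of the first relevant hit (0 if none).
    total = 0.0
    for retrieved, relevant in queries:
        for rank, doc in enumerate(retrieved, start=1):
            if doc in relevant:
                total += 1.0 / rank
                break
    return total / len(queries)
```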

// key decisions

pgvector over Pinecone/Weaviate — PostgreSQL is battle-tested infrastructure. pgvector adds vector similarity search to a database most teams already run. No new service to manage, no vendor lock-in, and HNSW indexing gives sub-linear search time.
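The access pattern is plain SQL. A sketch, assuming a `chunks` table with an `embedding` column and cosine distance (`<=>`) as the metric:

```python
# Assumed schema: chunks(id, content, embedding vector(384)).
HNSW_INDEX = """
CREATE INDEX ON chunks
USING hnsw (embedding vector_cosine_ops);
"""

# <=> is pgvector's cosine distance operator; HNSW accelerates the ORDER BY.
NEAREST = """
SELECT id, content, embedding <=> %(query)s::vector AS distance
FROM chunks
ORDER BY embedding <=> %(query)s::vector
LIMIT %(top_k)s;
"""
```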

Raw HTTP over LangChain — the LLM integration is a single HTTP call to an OpenAI-compatible endpoint. LangChain adds 50+ transitive dependencies for something that's 20 lines of code. When the abstraction costs more than the thing it abstracts, skip the abstraction.
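For scale, here's roughly what that integration amounts to, using only the stdlib. The wire format is the OpenAI chat-completions schema; base URL, key, and model are placeholders:

```python
# The whole LLM integration: one HTTP POST to an OpenAI-compatible endpoint.
import json
import urllib.request

def build_request(base_url: str, api_key: str, model: str, prompt: str) -> urllib.request.Request:
    body = json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }).encode()
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=body,
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
        },
        method="POST",
    )

def chat(base_url: str, api_key: str, model: str, prompt: str) -> str:
    with urllib.request.urlopen(build_request(base_url, api_key, model, prompt)) as resp:
        data = json.load(resp)
    return data["choices"][0]["message"]["content"]
```

Swapping providers means changing `base_url` and `model` in config, nothing else.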

Docker Compose for local dev — make up starts the API and PostgreSQL with pgvector pre-configured. No "install PostgreSQL, enable pgvector extension, run migrations" dance. One command, everything works.

CI regression gate — every push runs tests and linting. The evaluation harness ensures retrieval quality doesn't regress silently. This is the difference between a project and a product.
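The gate itself can be as simple as a pytest test over the eval results. A hypothetical sketch with made-up baseline numbers and a stubbed harness:

```python
# Regression gate: CI fails if any retrieval metric drops below baseline.
BASELINE = {"precision_at_5": 0.80, "recall_at_5": 0.70, "mrr": 0.75}

def run_eval() -> dict:
    # In the real harness this runs retrieval over the test corpus;
    # stubbed here so the sketch stays self-contained.
    return {"precision_at_5": 0.82, "recall_at_5": 0.71, "mrr": 0.77}

def test_retrieval_does_not_regress():
    metrics = run_eval()
    for name, floor in BASELINE.items():
        assert metrics[name] >= floor, f"{name} regressed: {metrics[name]:.2f} < {floor:.2f}"
```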