$ cat ./blog/ragops-api.md  ·  March 2026  ·  ~8 min read

RAGOps API
— Production RAG with FastAPI and pgvector


// why build this

After building RAG from scratch — deriving every algorithm by hand — I wanted to answer a different question: what does it take to ship a RAG pipeline as a real API?

The from-scratch project proved I understood the algorithms. RAGOps proves I can productionise them. The gap between a Jupyter notebook and a deployed API is where most ML projects die. This project bridges that gap: persistent vector storage, containerised deployment, automated testing, and CI regression gating.

// architecture

Client
  → FastAPI (routers)
  → services (ingest_service, rag_service)
  → rag_core (chunking, embedding, rerank)
  → PostgreSQL + pgvector  |  OpenAI-compatible LLM

Layered design: Routers handle HTTP, services contain business logic, rag_core holds the ML components, and PostgreSQL with pgvector stores vectors persistently. No business logic in routers — they only validate input and call services.

This is the standard separation you'd see in a production FastAPI codebase. Each layer is independently testable. The router tests mock services, the service tests mock the database, and the rag_core tests run against real models.

// tech stack

Component     Technology
API           FastAPI with auto OpenAPI docs
Database      PostgreSQL 15 + pgvector (HNSW index)
Embeddings    sentence-transformers (all-MiniLM-L6-v2, 384-dim)
Reranker      cross-encoder (ms-marco-MiniLM-L-6-v2)
LLM           Raw HTTP to any OpenAI-compatible endpoint
Containers    Docker Compose (api + postgres)
CI            GitHub Actions — pytest + ruff on push
Config        pydantic-settings via .env

// API design

Three endpoints. That's it. Simplicity is a feature.

POST /v1/ingest — upload a document (.txt, .md, .pdf). The service chunks it, embeds each chunk, and stores vectors in pgvector. Returns a document ID and chunk count.

POST /v1/query — ask a question with a top_k parameter. The service embeds the query, retrieves candidates from pgvector, reranks with the cross-encoder, and generates an answer with source attribution. Returns the answer, source chunks with scores, and token usage.

GET /health — liveness probe for container orchestration.
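Calling the query endpoint looks roughly like this. A stdlib-only sketch, assuming the request shape described above and a placeholder base URL:

```python
# Sketch: build a POST /v1/query request (payload shape assumed from the
# endpoint description; localhost:8000 is a placeholder).
import json
import urllib.request

BASE = "http://localhost:8000"

def post_query(question: str, top_k: int = 5) -> urllib.request.Request:
    payload = json.dumps({"question": question, "top_k": top_k}).encode()
    return urllib.request.Request(
        f"{BASE}/v1/query",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

req = post_query("What index does pgvector use?", top_k=3)
# urllib.request.urlopen(req) would return the answer, sources, and token usage.
```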

// offline evaluation

A RAG API without evaluation metrics is just a demo. RAGOps includes an offline evaluation harness that measures Precision@k, Recall@k, and MRR (Mean Reciprocal Rank) against a test corpus.

The evaluation runs as part of CI. If retrieval quality regresses below the baseline, the pipeline fails. This is the "ops" in RAGOps — you don't just build it, you continuously verify it works.
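The metrics themselves are standard and fit in a few lines. A sketch with my own function names (the harness's internals may differ):

```python
# Retrieval metrics over ranked document IDs.
def precision_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    # Fraction of the top-k results that are relevant.
    return sum(1 for doc in retrieved[:k] if doc in relevant) / k

def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    # Fraction of all relevant documents found in the top k.
    return sum(1 for doc in retrieved[:k] if doc in relevant) / len(relevant)

def mrr(queries: list[tuple[list[str], set[str]]]) -> float:
    # Mean over queries of 1/rank of the first relevant hit (0 if none).
    total = 0.0
    for retrieved, relevant in queries:
        for rank, doc in enumerate(retrieved, start=1):
            if doc in relevant:
                total += 1.0 / rank
                break
    return total / len(queries)
```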

// key decisions

pgvector over Pinecone/Weaviate — PostgreSQL is battle-tested infrastructure. pgvector adds vector similarity search to a database most teams already run. No new service to manage, no vendor lock-in, and HNSW indexing gives sub-linear search time.
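The access pattern is plain SQL. A sketch, assuming a `chunks` table with an `embedding` column and cosine distance (`<=>`) as the metric:

```python
# Assumed schema: chunks(id, content, embedding vector(384)).
HNSW_INDEX = """
CREATE INDEX ON chunks
USING hnsw (embedding vector_cosine_ops);
"""

# <=> is pgvector's cosine distance operator; HNSW accelerates the ORDER BY.
NEAREST = """
SELECT id, content, embedding <=> %(query)s::vector AS distance
FROM chunks
ORDER BY embedding <=> %(query)s::vector
LIMIT %(top_k)s;
"""
```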

Raw HTTP over LangChain — the LLM integration is a single HTTP call to an OpenAI-compatible endpoint. LangChain adds 50+ transitive dependencies for something that's 20 lines of code. When the abstraction costs more than the thing it abstracts, skip the abstraction.
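For scale, here's roughly what that integration amounts to, using only the stdlib. The wire format is the OpenAI chat-completions schema; base URL, key, and model are placeholders:

```python
# The whole LLM integration: one HTTP POST to an OpenAI-compatible endpoint.
import json
import urllib.request

def build_request(base_url: str, api_key: str, model: str, prompt: str) -> urllib.request.Request:
    body = json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }).encode()
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=body,
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
        },
        method="POST",
    )

def chat(base_url: str, api_key: str, model: str, prompt: str) -> str:
    with urllib.request.urlopen(build_request(base_url, api_key, model, prompt)) as resp:
        data = json.load(resp)
    return data["choices"][0]["message"]["content"]
```

Swapping providers means changing `base_url` and `model` in config, nothing else.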

Docker Compose for local dev — make up starts the API and PostgreSQL with pgvector pre-configured. No "install PostgreSQL, enable pgvector extension, run migrations" dance. One command, everything works.

CI regression gate — every push runs tests and linting. The evaluation harness ensures retrieval quality doesn't regress silently. This is the difference between a project and a product.
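The gate itself can be as simple as a pytest test over the eval results. A hypothetical sketch with made-up baseline numbers and a stubbed harness:

```python
# Regression gate: CI fails if any retrieval metric drops below baseline.
BASELINE = {"precision_at_5": 0.80, "recall_at_5": 0.70, "mrr": 0.75}

def run_eval() -> dict:
    # In the real harness this runs retrieval over the test corpus;
    # stubbed here so the sketch stays self-contained.
    return {"precision_at_5": 0.82, "recall_at_5": 0.71, "mrr": 0.77}

def test_retrieval_does_not_regress():
    metrics = run_eval()
    for name, floor in BASELINE.items():
        assert metrics[name] >= floor, f"{name} regressed: {metrics[name]:.2f} < {floor:.2f}"
```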