How to Design a Hybrid RAG Stack with pgvector + Elasticsearch
A production retrieval blueprint covering index design, parallel query planning, rank fusion, reranking, and offline evaluation before prompt tuning.
How to Optimize a Hybrid RAG Stack
In production RAG, the objective is not "best embedding similarity." The objective is: maximize answer quality under token budget and latency constraints. Hybrid retrieval works because semantic and lexical systems fail differently.
Vector retrieval captures intent-level similarity. BM25 captures exact terms and rare tokens. A robust stack queries both, fuses candidates, reranks top-k, and only then builds prompt context.
Retrieval pipeline stages
- Chunk and embed documents with stable IDs and version markers (see the sketch after this list).
- Issue vector and lexical searches in parallel.
- Fuse candidates with reciprocal-rank fusion (RRF).
- Apply reranker to top fused set.
- Pack context with citation IDs and source metadata.
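The first stage is where silent quality loss usually starts: re-ingesting a document without stable IDs and version markers orphans old chunks and duplicates new ones. A minimal sketch of deterministic, content-addressed chunk IDs, assuming a hypothetical embed client for whatever embedding provider you use (the hash IDs here can be mapped to the UUIDs in the schema below, e.g. via UUIDv5):

import { createHash } from "node:crypto";

type ChunkRecord = {
  chunkId: string;        // deterministic: unchanged text keeps its ID across re-ingestion
  documentId: string;
  documentVersion: number;
  body: string;
  embedding: number[];
  metadata: Record<string, string>;
};

// Hypothetical embedding client; swap in your provider's SDK.
declare function embed(texts: string[]): Promise<number[][]>;

export async function chunkAndEmbed(
  documentId: string,
  documentVersion: number,
  body: string,
  chunkSize = 1200,
  overlap = 200
): Promise<ChunkRecord[]> {
  // Naive fixed-window chunking; production splitters usually respect document structure first.
  const pieces: string[] = [];
  for (let start = 0; start < body.length; start += chunkSize - overlap) {
    pieces.push(body.slice(start, start + chunkSize));
  }
  const embeddings = await embed(pieces);
  return pieces.map((text, i) => ({
    // Content-addressed ID: hash of document ID + chunk text.
    chunkId: createHash("sha256").update(`${documentId}:${text}`).digest("hex"),
    documentId,
    documentVersion,
    body: text,
    embedding: embeddings[i],
    metadata: { documentVersion: String(documentVersion) },
  }));
}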
type Candidate = {
  chunkId: string;
  source: "vector" | "bm25";
  rank: number;   // position within its own result list (lower is better)
  score: number;  // raw engine score; not used by RRF, kept for tracing
};

function reciprocalRankFusion(groups: Candidate[][], k = 60) {
  const scoreMap = new Map<string, number>();
  for (const group of groups) {
    for (const item of group) {
      // Each list contributes 1 / (k + rank); chunks found by both systems accumulate score.
      const prev = scoreMap.get(item.chunkId) ?? 0;
      scoreMap.set(item.chunkId, prev + 1 / (k + item.rank));
    }
  }
  return [...scoreMap.entries()]
    .map(([chunkId, fusedScore]) => ({ chunkId, fusedScore }))
    .sort((a, b) => b.fusedScore - a.fusedScore);
}
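The default k = 60 is the constant from the original reciprocal-rank fusion formulation; raising it flattens the gap between top- and mid-ranked candidates, while lowering it makes fusion closer to winner-take-all.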
export async function hybridRetrieve(query: string) {
  // vectorStore, elastic, and rerank are assumed client adapters defined elsewhere;
  // both searches are expected to return Candidate[] with rank positions.
  const [vectorHits, bm25Hits] = await Promise.all([
    vectorStore.search(query, { topK: 40 }),
    elastic.search(query, { size: 40 }),
  ]);
  const fused = reciprocalRankFusion([vectorHits, bm25Hits]);
  // Rerank only the top fused candidates to keep reranker latency bounded.
  return rerank(query, fused.slice(0, 25));
}
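The reranked list still has to become prompt context under the token budget mentioned above. A minimal packing sketch with citation IDs, assuming a hypothetical fetchChunkBodies lookup that returns rows in the requested order, and a rough four-characters-per-token estimate (swap in a real tokenizer when the budget is tight):

// Hypothetical lookup from chunk IDs to stored bodies and source metadata.
declare function fetchChunkBodies(
  chunkIds: string[]
): Promise<Array<{ chunkId: string; body: string; source: string }>>;

export async function packContext(
  ranked: Array<{ chunkId: string }>,
  tokenBudget = 3000
) {
  const rows = await fetchChunkBodies(ranked.map((r) => r.chunkId));
  const blocks: string[] = [];
  let usedTokens = 0;
  for (const row of rows) {
    // Rough estimate: ~4 characters per token for English text.
    const estimatedTokens = Math.ceil(row.body.length / 4);
    if (usedTokens + estimatedTokens > tokenBudget) break;
    usedTokens += estimatedTokens;
    // Citation marker lets the answer be traced back to a specific chunk and source.
    blocks.push(`[${row.chunkId} | ${row.source}]\n${row.body}`);
  }
  return blocks.join("\n\n");
}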
Data Model and Indexing Choices

Index design decisions dominate recall and latency. For pgvector, choose the index type by scale and update profile: IVFFlat builds faster and uses less memory but assumes a mostly stable corpus (its list centroids are fixed at build time), while HNSW handles ongoing inserts and delivers better query-time recall at the cost of higher memory pressure and slower builds.
CREATE EXTENSION IF NOT EXISTS vector;

CREATE TABLE rag_chunks (
  chunk_id    UUID PRIMARY KEY,
  document_id UUID NOT NULL,
  tenant_id   UUID NOT NULL,
  body        TEXT NOT NULL,
  embedding   VECTOR(1536) NOT NULL,
  lexical_tsv TSVECTOR GENERATED ALWAYS AS (to_tsvector('english', body)) STORED,
  metadata    JSONB NOT NULL,
  created_at  TIMESTAMPTZ NOT NULL DEFAULT now()
);

CREATE INDEX rag_chunks_embedding_hnsw
  ON rag_chunks
  USING hnsw (embedding vector_cosine_ops)
  WITH (m = 16, ef_construction = 200);

CREATE INDEX rag_chunks_lexical_gin
  ON rag_chunks
  USING gin (lexical_tsv);

CREATE INDEX rag_chunks_tenant_idx
  ON rag_chunks (tenant_id, document_id);
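At query time the HNSW recall/latency tradeoff is controlled by hnsw.ef_search (ivfflat.probes is the IVFFlat equivalent). A quick way to sanity-check the vector arm against this schema, with illustrative rather than tuned parameter values:

-- Higher ef_search = better recall, more latency; tune per workload.
SET hnsw.ef_search = 100;

-- Vector arm of the retriever: cosine distance via <=>, scoped to one tenant.
SELECT chunk_id,
       1 - (embedding <=> $1) AS cosine_similarity
FROM rag_chunks
WHERE tenant_id = $2
ORDER BY embedding <=> $1
LIMIT 40;

With an approximate index, a selective filter like tenant_id can yield fewer rows than the LIMIT because filtering happens after the index scan; recent pgvector versions add iterative scan settings to compensate.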
Evaluation Before Prompt Tuning

Most teams tune prompts before they measure retrieval. This is backwards: if the right chunks never reach the context window, no prompt can recover them. Run offline retrieval evaluation first: MRR, nDCG, and hit@k on labeled query-document pairs.
interface LabeledQuery {
  query: string;
  relevantChunkIds: string[];
}

export async function evaluateRetrieval(dataset: LabeledQuery[]) {
  let hitAt5 = 0;
  let reciprocalRankSum = 0;
  for (const item of dataset) {
    const hits = await hybridRetrieve(item.query);
    const top5 = hits.slice(0, 5).map((h) => h.chunkId);
    if (top5.some((id) => item.relevantChunkIds.includes(id))) {
      hitAt5 += 1;
    }
    // Reciprocal rank of the first relevant hit; contributes 0 if nothing relevant was retrieved.
    const rrIndex = hits.findIndex((h) => item.relevantChunkIds.includes(h.chunkId));
    reciprocalRankSum += rrIndex === -1 ? 0 : 1 / (rrIndex + 1);
  }
  return {
    hitAt5: hitAt5 / dataset.length,
    mrr: reciprocalRankSum / dataset.length,
  };
}
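The harness above covers hit@5 and MRR; nDCG, the third metric named earlier, is easy to add. A binary-gain sketch that works with the same labels (graded relevance, if you have it, slots into the gain term):

// nDCG@k with binary gains: discounted gain of retrieved hits, normalized by the ideal ordering.
function ndcgAtK(retrievedChunkIds: string[], relevantChunkIds: string[], k = 10): number {
  const relevant = new Set(relevantChunkIds);
  let dcg = 0;
  retrievedChunkIds.slice(0, k).forEach((chunkId, i) => {
    // Rank i + 1 is discounted by log2(rank + 1).
    if (relevant.has(chunkId)) dcg += 1 / Math.log2(i + 2);
  });
  // Ideal DCG: all relevant chunks stacked at the top of the list.
  let idcg = 0;
  for (let i = 0; i < Math.min(relevant.size, k); i++) {
    idcg += 1 / Math.log2(i + 2);
  }
  return idcg === 0 ? 0 : dcg / idcg;
}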
Deployment rule

Ship retriever changes behind feature flags and log retrieval traces. Without per-query traces, you cannot debug why a good prompt produced a bad answer.
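A per-query trace only needs to capture what the pipeline already computes: both candidate lists, the fused order, what the reranker kept, and which retriever variant served the request. A minimal shape, assuming a hypothetical structured logger (substitute pino, winston, or your platform's equivalent):

type RetrievalTrace = {
  traceId: string;
  query: string;
  retrieverVersion: string;   // ties the trace to the feature-flagged variant that served it
  vectorChunkIds: string[];   // vector arm, in rank order
  bm25ChunkIds: string[];     // lexical arm, in rank order
  fusedChunkIds: string[];    // after reciprocal-rank fusion
  rerankedChunkIds: string[]; // what actually reached the prompt
  latencyMs: { vector: number; bm25: number; rerank: number };
};

// Hypothetical structured logger interface.
declare const logger: { info: (event: string, payload: unknown) => void };

export function logRetrievalTrace(trace: RetrievalTrace) {
  // One event per query is what makes "good prompt, bad answer" debuggable after the fact.
  logger.info("retrieval_trace", trace);
}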