How to Design a Hybrid RAG Stack with pgvector + Elasticsearch
A production retrieval blueprint covering index design, parallel query planning, rank fusion, reranking, and offline evaluation before prompt tuning.
How to Optimize a Hybrid RAG Stack
In production RAG, the objective is not "best embedding similarity." The objective is: maximize answer quality under token budget and latency constraints. Hybrid retrieval works because semantic and lexical systems fail differently.
Vector retrieval captures intent-level similarity. BM25 captures exact terms and rare tokens. A robust stack queries both, fuses candidates, reranks top-k, and only then builds prompt context.
Retrieval pipeline stages
- Chunk and embed documents with stable IDs and version markers (see the sketch after this list).
- Issue vector and lexical searches in parallel.
- Fuse candidates with reciprocal-rank fusion (RRF).
- Apply reranker to top fused set.
- Pack context with citation IDs and source metadata.
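The first stage is where silent quality loss usually starts: re-ingesting a document without stable IDs and version markers orphans old chunks and duplicates new ones. A minimal sketch of deterministic, content-addressed chunk IDs, assuming a hypothetical embed client for whatever embedding provider you use (the hash IDs here can be mapped to the UUIDs in the schema below, e.g. via UUIDv5):

import { createHash } from "node:crypto";

type ChunkRecord = {
  chunkId: string;        // deterministic: unchanged text keeps its ID across re-ingestion
  documentId: string;
  documentVersion: number;
  body: string;
  embedding: number[];
  metadata: Record<string, string>;
};

// Hypothetical embedding client; swap in your provider's SDK.
declare function embed(texts: string[]): Promise<number[][]>;

export async function chunkAndEmbed(
  documentId: string,
  documentVersion: number,
  body: string,
  chunkSize = 1200,
  overlap = 200
): Promise<ChunkRecord[]> {
  // Naive fixed-window chunking; production splitters usually respect document structure first.
  const pieces: string[] = [];
  for (let start = 0; start < body.length; start += chunkSize - overlap) {
    pieces.push(body.slice(start, start + chunkSize));
  }
  const embeddings = await embed(pieces);
  return pieces.map((text, i) => ({
    // Content-addressed ID: hash of document ID + chunk text.
    chunkId: createHash("sha256").update(`${documentId}:${text}`).digest("hex"),
    documentId,
    documentVersion,
    body: text,
    embedding: embeddings[i],
    metadata: { documentVersion: String(documentVersion) },
  }));
}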
type Candidate = {
  chunkId: string;
  source: "vector" | "bm25";
  rank: number;   // position within its own result list (lower is better)
  score: number;  // raw engine score; not used by RRF, kept for tracing
};

function reciprocalRankFusion(groups: Candidate[][], k = 60) {
  const scoreMap = new Map<string, number>();
  for (const group of groups) {
    for (const item of group) {
      // Each list contributes 1 / (k + rank); chunks found by both systems accumulate score.
      const prev = scoreMap.get(item.chunkId) ?? 0;
      scoreMap.set(item.chunkId, prev + 1 / (k + item.rank));
    }
  }
  return [...scoreMap.entries()]
    .map(([chunkId, fusedScore]) => ({ chunkId, fusedScore }))
    .sort((a, b) => b.fusedScore - a.fusedScore);
}
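The default k = 60 is the constant from the original reciprocal-rank fusion formulation; raising it flattens the gap between top- and mid-ranked candidates, while lowering it makes fusion closer to winner-take-all.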
export async function hybridRetrieve(query: string) {
  // vectorStore, elastic, and rerank are assumed client adapters defined elsewhere;
  // both searches are expected to return Candidate[] with rank positions.
  const [vectorHits, bm25Hits] = await Promise.all([
    vectorStore.search(query, { topK: 40 }),
    elastic.search(query, { size: 40 }),
  ]);
  const fused = reciprocalRankFusion([vectorHits, bm25Hits]);
  // Rerank only the top fused candidates to keep reranker latency bounded.
  return rerank(query, fused.slice(0, 25));
}
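The reranked list still has to become prompt context under the token budget mentioned above. A minimal packing sketch with citation IDs, assuming a hypothetical fetchChunkBodies lookup that returns rows in the requested order, and a rough four-characters-per-token estimate (swap in a real tokenizer when the budget is tight):

// Hypothetical lookup from chunk IDs to stored bodies and source metadata.
declare function fetchChunkBodies(
  chunkIds: string[]
): Promise<Array<{ chunkId: string; body: string; source: string }>>;

export async function packContext(
  ranked: Array<{ chunkId: string }>,
  tokenBudget = 3000
) {
  const rows = await fetchChunkBodies(ranked.map((r) => r.chunkId));
  const blocks: string[] = [];
  let usedTokens = 0;
  for (const row of rows) {
    // Rough estimate: ~4 characters per token for English text.
    const estimatedTokens = Math.ceil(row.body.length / 4);
    if (usedTokens + estimatedTokens > tokenBudget) break;
    usedTokens += estimatedTokens;
    // Citation marker lets the answer be traced back to a specific chunk and source.
    blocks.push(`[${row.chunkId} | ${row.source}]\n${row.body}`);
  }
  return blocks.join("\n\n");
}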
Data Model and Indexing Choices

Index design decisions dominate recall and latency. For pgvector, choose the index type by scale and update profile: IVFFlat builds faster and uses less memory but assumes a mostly stable corpus (its list centroids are fixed at build time), while HNSW handles ongoing inserts and delivers better query-time recall at the cost of higher memory pressure and slower builds.
CREATE EXTENSION IF NOT EXISTS vector;

CREATE TABLE rag_chunks (
  chunk_id    UUID PRIMARY KEY,
  document_id UUID NOT NULL,
  tenant_id   UUID NOT NULL,
  body        TEXT NOT NULL,
  embedding   VECTOR(1536) NOT NULL,
  lexical_tsv TSVECTOR GENERATED ALWAYS AS (to_tsvector('english', body)) STORED,
  metadata    JSONB NOT NULL,
  created_at  TIMESTAMPTZ NOT NULL DEFAULT now()
);

CREATE INDEX rag_chunks_embedding_hnsw
  ON rag_chunks
  USING hnsw (embedding vector_cosine_ops)
  WITH (m = 16, ef_construction = 200);

CREATE INDEX rag_chunks_lexical_gin
  ON rag_chunks
  USING gin (lexical_tsv);

CREATE INDEX rag_chunks_tenant_idx
  ON rag_chunks (tenant_id, document_id);
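At query time the HNSW recall/latency tradeoff is controlled by hnsw.ef_search (ivfflat.probes is the IVFFlat equivalent). A quick way to sanity-check the vector arm against this schema, with illustrative rather than tuned parameter values:

-- Higher ef_search = better recall, more latency; tune per workload.
SET hnsw.ef_search = 100;

-- Vector arm of the retriever: cosine distance via <=>, scoped to one tenant.
SELECT chunk_id,
       1 - (embedding <=> $1) AS cosine_similarity
FROM rag_chunks
WHERE tenant_id = $2
ORDER BY embedding <=> $1
LIMIT 40;

With an approximate index, a selective filter like tenant_id can yield fewer rows than the LIMIT because filtering happens after the index scan; recent pgvector versions add iterative scan settings to compensate.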
Evaluation Before Prompt Tuning

Most teams tune prompts before they measure retrieval. This is backwards: if the right chunks never reach the context window, no prompt can recover them. Run offline retrieval evaluation first: MRR, nDCG, and hit@k on labeled query-document pairs.
interface LabeledQuery {
  query: string;
  relevantChunkIds: string[];
}

export async function evaluateRetrieval(dataset: LabeledQuery[]) {
  let hitAt5 = 0;
  let reciprocalRankSum = 0;
  for (const item of dataset) {
    const hits = await hybridRetrieve(item.query);
    const top5 = hits.slice(0, 5).map((h) => h.chunkId);
    if (top5.some((id) => item.relevantChunkIds.includes(id))) {
      hitAt5 += 1;
    }
    // Reciprocal rank of the first relevant hit; contributes 0 if nothing relevant was retrieved.
    const rrIndex = hits.findIndex((h) => item.relevantChunkIds.includes(h.chunkId));
    reciprocalRankSum += rrIndex === -1 ? 0 : 1 / (rrIndex + 1);
  }
  return {
    hitAt5: hitAt5 / dataset.length,
    mrr: reciprocalRankSum / dataset.length,
  };
}
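The harness above covers hit@5 and MRR; nDCG, the third metric named earlier, is easy to add. A binary-gain sketch that works with the same labels (graded relevance, if you have it, slots into the gain term):

// nDCG@k with binary gains: discounted gain of retrieved hits, normalized by the ideal ordering.
function ndcgAtK(retrievedChunkIds: string[], relevantChunkIds: string[], k = 10): number {
  const relevant = new Set(relevantChunkIds);
  let dcg = 0;
  retrievedChunkIds.slice(0, k).forEach((chunkId, i) => {
    // Rank i + 1 is discounted by log2(rank + 1).
    if (relevant.has(chunkId)) dcg += 1 / Math.log2(i + 2);
  });
  // Ideal DCG: all relevant chunks stacked at the top of the list.
  let idcg = 0;
  for (let i = 0; i < Math.min(relevant.size, k); i++) {
    idcg += 1 / Math.log2(i + 2);
  }
  return idcg === 0 ? 0 : dcg / idcg;
}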
Deployment rule

Ship retriever changes behind feature flags and log retrieval traces. Without per-query traces, you cannot debug why a good prompt produced a bad answer.
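A per-query trace only needs to capture what the pipeline already computes: both candidate lists, the fused order, what the reranker kept, and which retriever variant served the request. A minimal shape, assuming a hypothetical structured logger (substitute pino, winston, or your platform's equivalent):

type RetrievalTrace = {
  traceId: string;
  query: string;
  retrieverVersion: string;   // ties the trace to the feature-flagged variant that served it
  vectorChunkIds: string[];   // vector arm, in rank order
  bm25ChunkIds: string[];     // lexical arm, in rank order
  fusedChunkIds: string[];    // after reciprocal-rank fusion
  rerankedChunkIds: string[]; // what actually reached the prompt
  latencyMs: { vector: number; bm25: number; rerank: number };
};

// Hypothetical structured logger interface.
declare const logger: { info: (event: string, payload: unknown) => void };

export function logRetrievalTrace(trace: RetrievalTrace) {
  // One event per query is what makes "good prompt, bad answer" debuggable after the fact.
  logger.info("retrieval_trace", trace);
}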