RAG Done Right: Architecture Patterns, Pitfalls, and Evaluation
- Chenwei Zhang
- Data, Architecture
- 20 Oct, 2025
TL;DR
RAG grounds LLM answers in your documents for accuracy and auditability. Start with strong ingestion and ACLs, add hybrid retrieval + reranking, measure groundedness and cost with a golden set, and treat RAG as a product with owners, SLAs, and audit logs.
Retrieval-Augmented Generation (RAG) combines a language model with your enterprise knowledge so answers are grounded in the documents you trust. Done well, it elevates accuracy, auditability, and time‑to‑value. Done poorly, it collapses under access control, data drift, and measurement gaps. This article reframes RAG as a product with owners, SLAs, and evidence — not a demo.
Why RAG, Really
The chief benefit is grounding: the model cites and reasons over your approved sources at answer time, reducing hallucinations and enabling explainability. That same grounding creates a defensible audit trail — essential in finance and legal. RAG also tends to reach value faster than fine‑tuning because you don’t change model weights; you expose better context. The limit is clear: RAG won’t fix missing or low‑quality knowledge. If your content is chaotic, address that as part of the rollout.
The Core Pattern, In Practice
At heart, RAG is a short pipeline. Keep the steps explicit and you’ll keep quality controllable.
1) Ingest → 2) Index → 3) Retrieve → 4) Compose → 5) Generate → 6) Post‑process
- Ingest: Convert PDFs, DOCX, HTML, and emails into clean text. Preserve headings, lists, tables, and figure captions.
- Index: Split into structure‑aware chunks, embed, and store vectors + metadata (source, version, ACLs, timestamps).
- Retrieve: For a user query, pull the top‑k relevant chunks using vectors (and often keywords), filtered by the user’s permissions.
- Compose: Build a constrained prompt that includes instructions, the user’s question, and the retrieved snippets with clear citations.
- Generate: Call the LLM with a size‑aware context. Prefer small and relevant over large and noisy.
- Post‑process: Add citations, perform lightweight validations, and cache results where safe.
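To make the loop concrete, here is a minimal sketch in Python; embed, vector_search, and llm_complete are hypothetical stand-ins for your embedding model, permission-aware vector store, and LLM client, and the ACL filter is applied at retrieval time:
# Minimal RAG loop; embed(), vector_search(), and llm_complete() are placeholders
# for your embedding model, permission-aware vector store, and LLM client.
def answer(question: str, user: dict, top_k: int = 6) -> dict:
    query_vec = embed(question)
    # Retrieve only chunks this user may see; ACLs are enforced at retrieval time.
    chunks = vector_search(
        query_vec,
        top_k=top_k,
        filters={"acl": user["groups"], "version": "current"},
    )
    # Compose a constrained prompt: numbered sources first, then the question.
    sources = "\n\n".join(f"[{i + 1}] {c['text']}" for i, c in enumerate(chunks))
    prompt = (
        "Answer strictly from the sources below and cite them as [n]. "
        'If the evidence is insufficient, say "I don\'t know."\n\n'
        f"Sources:\n{sources}\n\nQuestion: {question}"
    )
    draft = llm_complete(prompt)
    # Post-process: return citations alongside the answer for auditability.
    return {"answer": draft, "citations": [c["source_id"] for c in chunks]}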
Baseline defaults that work well to start:
- Chunk size: 300–800 tokens with 10–20% overlap; use heading‑aware splitting for manuals/policies.
- Top‑k: 4–8 items for most Q&A; increase slightly if snippets are short or diverse.
- Embeddings: Use a reliable general model first; consider domain‑specific embeddings only if recall is lacking.
- Filters: Always filter by ACLs, document version, and jurisdiction where relevant.
- Prompt: Instruct the model to answer strictly from sources and to say “I don’t know” when evidence is insufficient.
Latency budgeting (rule of thumb): 40–70% retrieval, 20–50% generation, and the rest for the ingestion cache and glue. Set an explicit p95 target (e.g., 1.5s internal, 2.5s external) and measure each stage.
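A lightweight way to see where the budget goes is to time every stage and track p95 per stage. A sketch, with a generic wrapper you can drop around your own retrieve/rerank/generate calls:
import time
from collections import defaultdict

stage_latencies = defaultdict(list)  # stage name -> list of observed seconds

def timed(stage: str, fn, *args, **kwargs):
    # Wrap any pipeline stage (retrieve, rerank, generate) and record how long it took.
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    stage_latencies[stage].append(time.perf_counter() - start)
    return result

def p95(stage: str) -> float:
    # p95 of the recorded samples for one stage; 0.0 if nothing has been recorded yet.
    samples = sorted(stage_latencies[stage])
    return samples[int(0.95 * (len(samples) - 1))] if samples else 0.0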
Baseline config (cheat‑sheet):
chunk_tokens: 600
chunk_overlap: 80
retrieval:
  top_k: 6
  hybrid: true    # bm25 + vectors
  rerank: false   # enable later for precision
filters:
  enforce_acls: true
  version: current
  jurisdiction: user_region
prompt:
  grounded_only: true
  require_citations: true
  refusal_on_insufficient_evidence: true
When You Need More Than “Basic RAG”
Reranking for Precision
How it works: Do a fast vector search to get 20–50 candidates, then pass them through a cross‑encoder reranker that scores each candidate in the context of the query. Keep the top 4–8 for the prompt.
Use when: Small mistakes are costly (e.g., clause selection, control procedures).
Defaults: candidates=32, keep=6.
Risks: +60–200 ms latency depending on model size.
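A sketch of the two-stage pattern, assuming a public sentence-transformers cross-encoder checkpoint and a hypothetical vector_search helper:
from sentence_transformers import CrossEncoder

# Any cross-encoder checkpoint works; this one is a common public baseline.
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def retrieve_and_rerank(query: str, candidates: int = 32, keep: int = 6):
    # Stage 1: cheap, recall-oriented vector search (vector_search is your own retriever).
    hits = vector_search(query, top_k=candidates)
    # Stage 2: score each (query, passage) pair with the cross-encoder and keep the best.
    scores = reranker.predict([(query, h["text"]) for h in hits])
    ranked = sorted(zip(hits, scores), key=lambda pair: pair[1], reverse=True)
    return [hit for hit, _ in ranked[:keep]]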
Hybrid Retrieval for Jargon and Edge Cases
How it works: Run BM25 (keyword) and vector queries in parallel, then merge with a weighted score. This recovers exact phrases, codes, and IDs that vectors sometimes miss.
Use when: Content includes acronyms, ticket IDs, SKU codes, or legal citations.
Defaults: weight vectors 0.6–0.7 and BM25 0.3–0.4; tune per corpus.
Risks: Slight complexity; usually <50 ms added latency with good caching.
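A sketch of the weighted fusion, assuming hypothetical bm25_search and vector_search helpers that each return a {doc_id: score} mapping; scores are min-max normalized so the weights mean what you think they mean:
def hybrid_search(query: str, top_k: int = 6, w_vec: float = 0.65, w_bm25: float = 0.35):
    vec_scores = vector_search(query)   # {doc_id: vector similarity}
    kw_scores = bm25_search(query)      # {doc_id: BM25 score}

    def normalize(scores: dict) -> dict:
        # Min-max normalize so vector and BM25 scores are comparable before weighting.
        if not scores:
            return {}
        lo, hi = min(scores.values()), max(scores.values())
        span = (hi - lo) or 1.0
        return {doc: (s - lo) / span for doc, s in scores.items()}

    vec_n, kw_n = normalize(vec_scores), normalize(kw_scores)
    merged = {
        doc: w_vec * vec_n.get(doc, 0.0) + w_bm25 * kw_n.get(doc, 0.0)
        for doc in set(vec_n) | set(kw_n)
    }
    return sorted(merged, key=merged.get, reverse=True)[:top_k]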
Structured Augmentation for Numbers and Facts
How it works: During ingestion, extract entities, amounts, dates, and tables, and store them alongside the text. At query time, fetch both the structured fields and the passages, and render numbers explicitly in the answer.
Use when: Financial statements, SLAs, policy numbers, or any numeric reporting.
Defaults: prefer table‑aware splitters; add unit/numeric sanity checks in post‑processing.
Risks: More ingestion work; worth it when numbers matter.
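One cheap post-processing check is to flag any figure in the answer that does not appear in the retrieved sources or extracted table cells; a sketch:
import re

NUMBER = re.compile(r"\d[\d,]*\.?\d*")

def numbers_in(text: str) -> set:
    # Strip thousands separators so "1,250" and "1250" compare equal.
    return {m.group().replace(",", "") for m in NUMBER.finditer(text)}

def unsupported_numbers(answer: str, source_texts: list) -> set:
    # Return figures in the answer that no retrieved source or extracted table cell supports.
    supported = set().union(*(numbers_in(t) for t in source_texts)) if source_texts else set()
    return numbers_in(answer) - supported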
Multi‑hop and Graph RAG for Complex Reasoning
How it works: Build a simple entity‑relation graph (people, matters, accounts, controls). Retrieve along 1–2 hops, collect the supporting snippets, and then ask the LLM to synthesize.
Use when: “If X then Y unless Z” style reasoning across multiple sources.
Defaults: limit hops to ≤2; cap context by entity type.
Risks: Extra infra; reserve for truly complex queries.
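A sketch of the bounded hop expansion, assuming a hypothetical adjacency map of {entity: [related entities]} and a passages_for lookup for collecting snippets:
from collections import deque

def expand_entities(graph: dict, seeds: list, max_hops: int = 2) -> set:
    # Breadth-first expansion, capped at max_hops so the context stays bounded.
    seen, frontier = set(seeds), deque((s, 0) for s in seeds)
    while frontier:
        entity, depth = frontier.popleft()
        if depth == max_hops:
            continue
        for neighbor in graph.get(entity, []):
            if neighbor not in seen:
                seen.add(neighbor)
                frontier.append((neighbor, depth + 1))
    return seen

# Usage: gather supporting snippets for every entity within two hops, then ask the LLM to synthesize.
# snippets = [p for e in expand_entities(graph, ["Account-123"]) for p in passages_for(e)]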
Ingestion That Survives Production
Think of ingestion as a supply chain. Normalization cleans and standardizes content. Chunking respects document structure (headings, lists, tables) and uses small overlaps so answers aren’t cut in the middle of a thought. Rich metadata — source, version, confidentiality, jurisdiction, timestamps — enables filtering and better relevance. Most importantly, permissions (ACLs) travel with each chunk, so retrieval only returns what the user is allowed to see. Plan for incremental updates and deletions so answers don’t quote stale policies. Add quality gates that reject unreadable OCR or low‑text density files and route them for cleanup.
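A minimal quality gate can be a few lines; the thresholds below are illustrative and should be tuned per corpus:
def passes_quality_gate(text: str, page_count: int,
                        min_chars_per_page: int = 200,
                        min_alpha_ratio: float = 0.6) -> bool:
    # Reject files that are mostly empty (failed OCR) or mostly non-letter noise.
    if page_count and len(text) / page_count < min_chars_per_page:
        return False
    alpha = sum(ch.isalpha() for ch in text)
    return bool(text) and alpha / len(text) >= min_alpha_ratio

# Usage: gate each document before chunking; route failures for cleanup instead of indexing them.
# if not passes_quality_gate(doc_text, doc_pages): route_for_cleanup(doc_id)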
Prompts, Context, and Cost
Prompts should be short, specific, and repetitive (in the good way). Tell the model to answer from sources only, cite them, and refuse when unsure. Larger context windows are not automatic wins — more tokens often mean more cost and weaker signal. Use a few realistic, permission‑safe examples to shape tone. Where needed, enforce guarded generation so the model explicitly answers “I don’t know” when evidence conflicts or is missing.
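A sketch of a grounded prompt template along these lines; the exact wording is an assumption and should be tuned against your golden set:
GROUNDED_PROMPT = """You are an assistant for internal policy questions.
Rules:
- Answer only from the numbered sources below; cite them as [n] after each claim.
- If the sources are missing, conflicting, or insufficient, reply exactly: "I don't know."
- Do not use outside knowledge or speculate.

Sources:
{sources}

Question: {question}
Answer:"""

def build_prompt(question: str, chunks: list) -> str:
    # Number each snippet and carry its source ID and version so citations are checkable.
    sources = "\n".join(
        f"[{i + 1}] ({c['source_id']}, v{c['version']}) {c['text']}"
        for i, c in enumerate(chunks)
    )
    return GROUNDED_PROMPT.format(sources=sources, question=question)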
The Six Classic Failure Modes — And Fixes
1. The Prototype Trap.
Symptom: a dazzling demo and a disappointing pilot.
Why it happens: curated corpus and hand‑picked prompts hide retrieval gaps.
What to do: evaluate on real queries at scale with diverse permissions and stale data scenarios.
2. Chunking Gone Wrong.
Symptom: either missing context or noisy, redundant context.
Why it happens: arbitrary fixed sizes ignore document structure.
What to do: start with 300–800 tokens and 10–20% overlap; tune by format; prefer structure‑aware chunking.
3. Irrelevant Retrieval.
Symptom: fluent answers, weak citations.
Why it happens: embedding mismatch or lack of lexical grounding.
What to do: add hybrid retrieval plus reranking; improve metadata filters; consider domain‑specific embeddings.
4. Broken Access Control.
Symptom: users see content they shouldn’t.
Why it happens: ACLs applied at the app, not during retrieval.
What to do: enforce ACLs on retrieval results before prompting; log permission checks and test them.
5. Data Drift and Staleness.
Symptom: answers contradict updated policies.
Why it happens: slow re‑ingestion and no source‑of‑truth tracking.
What to do: schedule incremental re‑index, propagate source versions, and invalidate outdated chunks.
6. No Measurement Culture.
Symptom: quality debates by anecdote.
Why it happens: missing metrics and test sets.
What to do: define task‑specific metrics, build a golden set, and review weekly like uptime.
Measuring What Matters
Begin with retrieval: measure Recall@k (did the right sources appear?) and rank quality (MRR/nDCG). Include permission accuracy — the percentage of retrieved items the user is allowed to see. Then grade answers: are they grounded in the retrieved sources; are facts correct; does the response address the whole question; is the style helpful? Layer compliance checks: PII redaction where required, citation coverage, and refusal correctness for out‑of‑scope requests. Finally, watch cost and latency: median and p95 by workflow, with budgets and alerts. Keep a living golden set of real prompts and expected judgments, and run it on every change (model, prompt, retrieval, or index refresh).
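Recall@k and MRR are cheap to compute once the golden set records, for each query, which documents are relevant and which the retriever returned in rank order; a sketch:
def recall_at_k(retrieved_ids: list, relevant_ids: set, k: int) -> float:
    # Fraction of relevant documents that appear in the top-k results.
    if not relevant_ids:
        return 0.0
    return len(set(retrieved_ids[:k]) & relevant_ids) / len(relevant_ids)

def mrr(retrieved_ids: list, relevant_ids: set) -> float:
    # Reciprocal rank of the first relevant hit; 0 if none was retrieved.
    for rank, doc_id in enumerate(retrieved_ids, start=1):
        if doc_id in relevant_ids:
            return 1.0 / rank
    return 0.0

# Golden-set run: average both metrics over every (query, relevant_ids, retrieved_ids) triple
# and track them per release alongside groundedness, latency p95, and cost.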
Security and Auditability by Design
Design for data minimization (retrieve only what you need) and, when required, redact PII before prompting. Maintain tenant isolation by business unit or client. Keep secrets out of prompts and logs. Store prompt/context hashes, retrieval IDs, and citations so auditors can trace how an answer was produced. Route high‑risk outputs (regulatory or legal advice) through review queues.
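A sketch of the audit record this implies, with hashes standing in for raw text so the logs themselves stay low-risk; the field names are assumptions:
import hashlib, json, time, uuid

def audit_record(user_id: str, prompt: str, context: str, answer: str, retrieval_ids: list) -> dict:
    # Store hashes of prompt/context/answer rather than raw text; keep IDs and citations for traceability.
    sha = lambda s: hashlib.sha256(s.encode("utf-8")).hexdigest()
    return {
        "request_id": str(uuid.uuid4()),
        "timestamp": time.time(),
        "user_id": user_id,
        "prompt_hash": sha(prompt),
        "context_hash": sha(context),
        "answer_hash": sha(answer),
        "retrieval_ids": retrieval_ids,
    }

# Usage: append json.dumps(audit_record(...)) to an append-only log or audit table.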
Operating Model and Ownership
Treat RAG as a product with clear roles: a product owner for use cases and ROI; a knowledge owner for content quality; a platform team for ingestion, retrieval, and observability; and risk/compliance for controls. Define SLAs for latency, cost, and quality by tier (self‑serve vs. advisor‑facing). Publish change notes and run enablement sessions with examples so adoption sticks.
Finance and Legal Patterns
In finance, RAG accelerates policy and procedure answers, supports model risk management by retrieving validation evidence, and grounds narrative reporting in approved data dictionaries and footnotes. In legal, it powers clause comparison and precedent search, enforces jurisdictional filters, and helps capture knowledge from past matters with citations to filings and memos.
Build vs. Buy — Making It Boring
For indexing, Postgres + pgvector offers simplicity and governance; managed vector DBs offer scale and features; hybrid search engines provide robust lexical + vector retrieval. Add a simple reranker in precision‑critical domains. Choose observability that records retrieval results, prompts, tokens, and quality scores in one place. Your total cost is storage + embeddings + inference + ops; model selection and caching usually drive the biggest savings.
A Practical 90‑Day Plan
Days 0–30: Baseline
Identify the top 50–100 real queries and define quality, cost, and latency metrics. Clean and ingest a representative corpus with ACLs and metadata. Ship a minimal RAG with citations and stand up logging and dashboards.
Days 31–60: Harden
Introduce hybrid retrieval and reranking if groundedness is weak. Expand the golden set and add red‑teaming. Enforce p95 latency and cost budgets.
Days 61–90: Scale
Add SSO and permission hardening, review queues for high‑risk outputs, and publish weekly quality reports to stakeholders. Begin simple chargeback/showback for transparency.
Executive Checklist
- Do we have a golden set with measurable quality gates and owners?
- Are ACLs enforced during retrieval and tested regularly?
- Are groundedness, refusal rates, latency p95, and cost monitored with alerts?
- Are citations and audit logs stored, searchable, and reviewable?
- Can we update content and re‑index without downtime?
- Who signs off on compliance and risk controls, and on what cadence?
RAG becomes a durable advantage when it is treated as a product: strong ingestion, permission‑aware retrieval, disciplined evaluation, and the ownership to keep it improving. Start small, measure relentlessly, and scale what proves its worth.