Defense in depth against hallucination

We are building a chat over a corpus of spiritual lectures, and the whole game is one sentence: every claim must trace to a real source, or the assistant refuses. Left to itself a language model breaks that rule cheerfully — it hallucinates: inventing lectures that do not exist, fabricating quotes, conjuring citations that look exactly like the real ones. Cheaper models, in longer conversations, do it most of all — and the gap between models is measurable, not anecdotal, which matters because we cannot afford a frontier model on every turn.

There is a comforting myth that the fix is one good prompt. It is not. Honesty here is a property of the architecture, and the architecture has three movements:

flowchart TD
  Q["Question"] --> S1["1 · Get the right data"]
  S1 --> S2["2 · Throw out the wrong data"]
  S2 --> S3["3 · Synthesize a grounded answer"]
  S3 --> A["Answer with playable citations"]

Fetch the right material. Discard everything that does not clear the bar. Only then let the model write — and validate every reference it produces. It is defense in depth borrowed from security engineering and applied to retrieval-augmented generation (RAG): no single layer is trusted to be enough, and most of them never call an LLM at all. This is a tour of all three.

Stage one — get the right data

You cannot ground an answer on material you failed to retrieve — and the surest way to make a model hallucinate is to hand it a plausible-but-wrong passage because the right one never surfaced. No single search finds everything: meaning-based search blurs exact names, exact-match search is deaf to paraphrase. So our retrieval is a deliberate hybrid — four lanes run in parallel and each covers the others’ blind spots. Recall failures are where invention begins, so this is where we spend the most effort.

flowchart TD
  Q["Question plus planned rephrasings"] --> D["Dense vector lanes"]
  Q --> L["Lexical lane"]
  Q --> ADR["Address lookup, e.g. BG 2.13"]
  Q --> MEM["Curated-memory lookup"]
  D --> POOL["Candidate pool"]
  L -->|"forces its hits in"| POOL
  ADR -->|"forced past the floor"| POOL
  POOL --> RR["Cross-encoder reranker"]
  RR -->|"top-k plus per-kind reserves"| N["Grounded notes"]
  MEM -->|"pinned refs"| N

Dense retrieval is the lane that finds passages by meaning. Each fragment is stored as a 1536-dimension embedding in the bi-encoder / DPR tradition, so a question about “the social orders” still matches a lecture that only ever says varṇāśrama. Comparing a query against hundreds of thousands of vectors exactly would be far too slow for a live turn, so we use approximate nearest-neighbour (ANN) search over an HNSW graph, which gives near-exact recall in milliseconds. This is what rescues a reader who does not know the corpus’s vocabulary.
Lexical search exists because embeddings blur. Ask for a specific name, a transliterated term, or a verse number and a BM25-family sparse match (Postgres full-text plus trigram) finds the exact string a fuzzy vector might rank below its near-synonyms. So this lane does not compete on score — it forces its hits into the candidate pool, so an exact term the user typed is never silently lost.
Structural address lookup handles references such as “BG 2.13”, which is the address of a verse, not a phrase. A deterministic SQL query resolves it and pushes it into the pool past the noise floor, so the reranker judges it on its text.
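The pooling rule behind these three lanes can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the production code: the `Hit` dataclass, `build_pool`, and the sample ids are hypothetical. Only the idea that lexical and address hits are forced in past the cosine floor comes from the description above.

```python
from dataclasses import dataclass

@dataclass
class Hit:
    doc_id: str
    score: float          # cosine from the dense lane (may be low for forced hits)
    forced: bool = False  # True for lexical and address hits

def build_pool(dense, lexical, address, floor=0.18):
    """Merge lane results into one candidate pool, deduplicated by doc_id."""
    pool = {}
    # Dense hits compete on score: anything under the floor is dropped.
    for hit in dense:
        if hit.score >= floor:
            pool[hit.doc_id] = hit
    # Lexical and address hits do not compete: they are forced in.
    for hit in (*lexical, *address):
        existing = pool.get(hit.doc_id)
        score = max(hit.score, existing.score) if existing else hit.score
        pool[hit.doc_id] = Hit(hit.doc_id, score, forced=True)
    return list(pool.values())

dense = [Hit("lecture-1", 0.62), Hit("lecture-2", 0.11)]
lexical = [Hit("verse-7", 0.09)]
address = [Hit("bg-2-13", 0.0)]
pool = build_pool(dense, lexical, address)
pool_ids = sorted(h.doc_id for h in pool)
```

In this toy run the weak dense hit is discarded, while the exact-match verse and the addressed verse survive despite their low cosine, so the reranker still gets to read them.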

All of this fires not on the raw question but on several rewrites of it. A user asks in their own words, which rarely match how a 1970s lecture phrased the same idea, so the question is first expanded — query rewriting — into a handful of typed sub-questions and paraphrases. Firing every lane for each closes that vocabulary gap, and a paraphrase often trips a curated-memory trigger the literal question missed. Recall first, precision later.
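A rough sketch of that fan-out, assuming a single lane exposed as an async function: `fake_search`, its tiny in-memory index, and `fan_out` are all hypothetical stand-ins, and keeping the best score per document is our assumption about how duplicates are merged.

```python
import asyncio

async def fake_search(query):
    """Stand-in for one retrieval lane; returns (doc_id, score) pairs."""
    index = {
        "what are the social orders": [("lecture-12", 0.41)],
        "varnasrama system explained": [("lecture-12", 0.58), ("verse-4-13", 0.52)],
    }
    return index.get(query, [])

async def fan_out(queries, search):
    """Run the lane once per rewrite, concurrently, and keep each document's best score."""
    batches = await asyncio.gather(*(search(q) for q in queries))
    best = {}
    for batch in batches:
        for doc_id, score in batch:
            best[doc_id] = max(score, best.get(doc_id, 0.0))
    return best

rewrites = ["what are the social orders", "varnasrama system explained"]
merged = asyncio.run(fan_out(rewrites, fake_search))
```

The paraphrase surfaces a verse the literal question never reached, and lifts the shared lecture from 0.41 to 0.58.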

The reranker does the real choosing

A bi-encoder cosine is a cheap guess: it embeds the question and the passage separately and measures the angle between them. It never reads the two together. So after recall, in the classic retrieve-and-rerank pattern, the pooled candidates go to a cross-encoder (Voyage rerank-2) that scores each (question, passage) pair jointly, and that score, not cosine, decides the final order. Recall is cheap and greedy; reranking is where we buy precision back, and precision here is an anti-hallucination measure: the fewer off-topic near-misses reach the model, the less plausible-but-wrong material it can build a confident answer on top of.

pool = by_cos[:RERANK_POOL_CAP]                 # top 60 by cosine
scored = await reranker.rerank(rerank_query, texts)
for idx, rs in scored:
    pool[idx].rerank_score = rs                 # cosine is left untouched
ranked = sorted(pool, key=lambda r: _rank_key(r, boost_kinds), reverse=True)
kept = ranked[:RERANK_TOP_K]                    # keep 16; reserves added next

Two design choices matter. The reranker is the primary selector — there is no absolute rerank-score cutoff, because cross-encoder scores are not calibrated across different questions; the cut is by rank (top 16 of 60). And because a cross-encoder trained on prose quietly favours chatty lecture transcripts over terse verses, the cut is followed by per-kind reserves that guarantee at least two verses and two library passages survive, so the answer is never starved of scripture.
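The per-kind reserves might look something like the sketch below. It is an illustration only: `Cand`, `apply_reserves`, and the sample data are hypothetical, and we assume reserves are appended after the rank cut rather than displacing items inside it.

```python
from dataclasses import dataclass

@dataclass
class Cand:
    doc_id: str
    kind: str            # "lecture", "verse", or "library"
    rerank_score: float

def apply_reserves(ranked, top_k, reserves):
    """Keep the top_k by rank, then top up each reserved kind to its minimum."""
    kept = list(ranked[:top_k])
    for kind, minimum in reserves.items():
        have = sum(1 for c in kept if c.kind == kind)
        # Best remaining candidates of this kind, still in rank order.
        extras = [c for c in ranked[top_k:] if c.kind == kind]
        kept.extend(extras[: max(0, minimum - have)])
    return kept

ranked = [
    Cand("l1", "lecture", 0.9), Cand("l2", "lecture", 0.8),
    Cand("l3", "lecture", 0.7), Cand("v1", "verse", 0.6),
    Cand("b1", "library", 0.5), Cand("v2", "verse", 0.4),
    Cand("b2", "library", 0.3), Cand("v3", "verse", 0.2),
]
kept = apply_reserves(ranked, top_k=3, reserves={"verse": 2, "library": 2})
kept_ids = [c.doc_id for c in kept]
```

With a cut of three, the lectures would have taken every slot; the reserves pull in the two best verses and the two best library passages behind them.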

When the question explicitly asks for a kind — “show me the verse” — that kind gets a small ordering nudge and a guaranteed minimum:

def _rank_key(r, boost_kinds):
    base = r.rerank_score if r.rerank_score is not None else -1.0
    if boost_kinds and r.kind in boost_kinds:
        base += KIND_BOOST_DELTA                # 0.15: an ordering nudge, not a score
    return (base, r.score)

Curated memory: a human hand on the retrieval

Pure similarity is not always enough. Some questions deserve a specific answer an editor has already prepared — a canonical set of verses, or a piece of framing that connects them. That is curated memory, and it is the most interesting source in the system.

flowchart TD
  CUR["Curator"] -->|"seed skill, MCP tools"| LIB["library.db"]
  LIB -->|"indexer, hourly"| PG["Postgres: notes plus embeddings"]
  QQ["Question plus rephrasings"] --> LK["Attribution lookup"]
  PG --> LK
  LK -->|"pinned: 0.85, or border re-judged"| REF["Citable curated refs"]
  LK -->|"memory note: 0.60"| NB["Non-citable background"]
  REF --> ANS["Answer"]
  NB -->|"shapes framing, never cited"| ANS

How it is collected. A curator authors entries through a seeding skill that drives an internal toolset: pin an exact question to a hand-picked set of verses, or write a background note with a few trigger phrases. Texts are auto-translated to every locale and published; on its next hourly run an indexer mirrors them into Postgres and embeds them. Crucially, it embeds the trigger phrases and the chunks of the note together, so a question that echoes the note’s body, not just a trigger, still finds it.

How it is found. At query time the user’s question and each planned rephrasing are embedded and matched against those curated vectors. The thresholds are deliberately asymmetric by kind, and that asymmetry is the whole safety argument:

A pinned source is citable and authoritative, so the bar is high: cosine ≥ 0.85 in-language. A “maybe” in the 0.70–0.85 border zone is not trusted on cosine alone — it is re-judged by the same cross-encoder, scoring (question × the curator’s phrasing), and accepted only at ≥ 0.50. No judge available means reject. A hand-curated citation can never be conjured from a fuzzy match.
A memory note is only advisory framing, so its bar is lower (cosine ≥ 0.60) and needs no judge — a loose match is harmless because the note is never quoted. That 0.60 is calibrated, not guessed: genuine paraphrases of a trigger land around 0.62–0.69, while a false match from merely the same book sits near 0.49.
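The two acceptance rules can be written down as one small decision function. This is a sketch using the thresholds quoted above; `accept_curated` and the `judge` callable (standing in for the cross-encoder call) are hypothetical names, not the real interface.

```python
from typing import Callable, Optional

PIN_ACCEPT = 0.85     # pinned source: accept on cosine alone
PIN_BORDER = 0.70     # pinned source: below this, reject outright
JUDGE_ACCEPT = 0.50   # cross-encoder score required in the border zone
NOTE_ACCEPT = 0.60    # memory note: advisory only, no judge needed

def accept_curated(kind: str, cosine: float,
                   judge: Optional[Callable[[], float]] = None) -> bool:
    """Decide whether a curated match is accepted, by kind."""
    if kind == "note":
        return cosine >= NOTE_ACCEPT
    if cosine >= PIN_ACCEPT:
        return True
    if cosine < PIN_BORDER:
        return False
    if judge is None:
        return False                      # no judge available means reject
    return judge() >= JUDGE_ACCEPT        # border zone: re-judged, never trusted on cosine
```

For example, a pinned match at cosine 0.78 is rejected when no judge is available, accepted when the judge returns 0.61, and rejected again when it returns 0.30. A memory note at 0.62 passes on cosine alone.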

A strong note — matched at ≥ 0.70 with at least three resolvable references — short-circuits the expensive corpus sweep entirely:

def assess_sufficiency(question_matches, memory):
    if question_matches:                 # a curated pinned question
        return CORRECT                   # → trust the curated refs, skip the sweep
    if memory_is_sufficient(memory):     # strong note: >= 3 refs and score >= 0.70
        return CORRECT
    return INCORRECT                     # → fall through to the full corpus sweep
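The `memory_is_sufficient` helper is not shown in the excerpt. A plausible reading of its comment, offered as an assumption rather than the real implementation, is the following; the `MemoryMatch` shape is hypothetical, with `envelopes` standing for the note's resolvable references.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class MemoryMatch:
    score: float                                     # cosine of the matched note
    envelopes: list = field(default_factory=list)    # resolvable references

def memory_is_sufficient(memory: Optional[MemoryMatch],
                         min_score: float = 0.70,
                         min_refs: int = 3) -> bool:
    """A note is 'strong' only if it matched well AND carries enough real refs."""
    if memory is None:
        return False
    return memory.score >= min_score and len(memory.envelopes) >= min_refs
```

Both conditions have to hold: a well-matched note with only two references, or a three-reference note matched at 0.65, still falls through to the full sweep.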

How it is used. Here is the elegant part, and the part I got wrong the first time I wrote this: the note and its references are treated differently. The note’s hand-picked refs are real sources — they enter the citable pool and are cited normally. The note itself is injected as background that shapes the framing but carries no citation marker and is never quoted:

def _attach_memory(result, mem):
    result.memory_note = mem.note                  # non-citable framing
    if mem.envelopes:                              # the curator's chosen refs…
        result.authoritative_refs += mem.envelopes # …DO cite, at score 0.75
    return result

So a human can steer which real sources surface and how they connect, without that guidance ever becoming a hallucinated citation.

Stage two — throw out the wrong data

Good recall is only half the job. A pool full of near-misses is how a model talks itself into a confident wrong answer, so everything below the bar is discarded before the model sees it.

flowchart TD
  P["Retrieved candidates"] --> F{"cosine prefloor 0.18"}
  F -->|"below"| X["discard as noise"]
  F -->|"above"| C{"coverage gate"}
  C -->|"max >= 0.65, or max >= 0.55 with 2+ lecture chunks"| G["ground it"]
  C -->|"max < 0.40"| B["bail to out-of-corpus fallback"]
  G --> PL{"planner finds a usable thesis?"}
  PL -->|"all notes weak"| R["return empty → refuse"]
  PL -->|"yes"| S["hand off to synthesis"]

The floor is lower than you would guess — on purpose. It is tempting to set a high cosine floor and call anything below it noise. We did, once, at 0.45 — and it silently killed relevant verses that a cross-encoder would have rescued. So on the live path the pre-floor drops to 0.18, just enough to discard pure garbage, and the reranker decides the rest:

# 0.18 on the live rerank path; the old flat 0.45 killed real ~0.30 verses
floor = RERANK_NOISE_PREFLOOR if rerank_active else _RELEVANCE_FLOOR
for batch in per_query:
    for r in batch:
        if r.score < floor and not r.forced:
            continue                     # too far off to be worth reranking

The real refusal is the coverage gate. Discarding individual weak hits is not the same as deciding the pool as a whole is too thin to answer — and that decision is a handful of plain booleans, no LLM involved:

def is_coverage_sufficient(result, min_max_score=0.55, min_lectures=2):
    if result.max_score < min_max_score:
        return False
    return len(result.by_kind.get("lecture", [])) >= min_lectures

def is_coverage_good_enough(result):
    if is_coverage_sufficient(result):
        return True
    return result.max_score >= 0.65      # one confident hit is enough on its own

def should_bail_out(result):
    return result.max_score < 0.40       # nothing close — stop digging

Plain RAG has a blind spot: it assumes retrieval worked. When the corpus simply does not cover a question, naïve RAG feeds the model weak passages anyway, an engraved invitation to confabulate. Corrective RAG (CRAG) closes that hole by grading the pool before trusting it, and ours is CRAG in miniature. A strong pool grounds the answer, and a single confident hit is enough on its own. A pool whose best match is below 0.40 triggers a bail-out: the system stops searching and hands off to an out-of-corpus fallback, which runs on a stronger model, answers from general knowledge, carries a mandatory disclaimer, and is forbidden from attaching any scripture citation. This is also the only place a stronger model is summoned, and the trigger is coverage failing, never a guess that a question “looks hard.”
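Read together, the three checks amount to a single grading step. The sketch below folds them into one function using the thresholds from the gate; `Pool` and `grade_pool` are hypothetical names, and the `"undecided"` label for pools between 0.40 and the grounding bar is our placeholder, since the text does not say what happens in that band.

```python
from dataclasses import dataclass, field

@dataclass
class Pool:
    max_score: float
    by_kind: dict = field(default_factory=dict)

def grade_pool(pool: Pool) -> str:
    """Grade retrieval before trusting it: 'ground', 'bail', or 'undecided'."""
    lectures = len(pool.by_kind.get("lecture", []))
    if pool.max_score >= 0.65:
        return "ground"                  # one confident hit is enough
    if pool.max_score >= 0.55 and lectures >= 2:
        return "ground"                  # decent score plus corroborating lectures
    if pool.max_score < 0.40:
        return "bail"                    # nothing close: out-of-corpus fallback
    return "undecided"                   # assumption: behaviour here is not specified

grades = [
    grade_pool(Pool(0.70)),
    grade_pool(Pool(0.58, {"lecture": ["a", "b"]})),
    grade_pool(Pool(0.58, {"lecture": ["a"]})),
    grade_pool(Pool(0.31)),
]
```

No model is consulted at any point; the grade is a function of two numbers.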

The last gate before generation is the synthesis planner: a cheap model ranks the surviving notes into an outline, and if none of them actually support a thesis, it returns an empty outline that forces a clean refusal — before the expensive answer model ever runs.

Stage three — synthesize a grounded answer

Only now does the model write, and even here it is boxed in.

flowchart TD
  N["Notes shown as integer footnotes"] --> O["Outline planner"]
  O --> CO["Compact to cited notes only"]
  CO --> ST["Synthesizer streams the answer"]
  ST --> E["Marker-expander gauntlet"]
  E -->|"resolves"| W["playable citation"]
  E -->|"invented"| DR["dropped silently"]

Shrink the surface. The model never sees a real track ID it could imitate. Each note is shown to it as a bare integer footnote; the real identifier is held server-side and substituted only on output:

header = f"[^{idx}]"   # 1, 2, 3 … minted per turn; the real track_id is
                       # held server-side and substituted only on output

There is no track_… or BG_… token shape anywhere in its context to copy. We told the longer version of this story in “A citation a model cannot fake”.
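One way to picture the per-turn alias table is the sketch below. `AliasMap` and the sample identifiers are hypothetical; the only parts taken from the text are the `[^n]` header shape and the fact that the real id never leaves the server.

```python
from typing import Optional

class AliasMap:
    """Per-turn mapping between integer footnotes and real identifiers."""

    def __init__(self) -> None:
        self._aliases: list = []         # position 0 holds footnote 1

    def mint(self, real_id: str) -> str:
        """Register a source and return the header the model will see."""
        self._aliases.append(real_id)
        return f"[^{len(self._aliases)}]"

    def resolve(self, n: int) -> Optional[str]:
        """Map a footnote back to its real id, or None if it was never minted."""
        if 1 <= n <= len(self._aliases):
            return self._aliases[n - 1]
        return None

aliases = AliasMap()
headers = [aliases.mint(real_id) for real_id in ("id-alpha", "id-beta")]
```

The model only ever sees `[^1]` and `[^2]`. If it emits `[^3]`, `resolve(3)` returns `None` and there is nothing to expand.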

The model ladder. Different sub-tasks run on different models; the live model for each is set in Langfuse, so any can be swapped for an A/B test without a deploy.

Job	Model
Intent routing	`gemini-3.1-flash-lite`
Query & topic planning	`gemini-3.1-flash-lite`
Note-attribution planner	`gemini-2.5-flash`
Titles, suggestions	`gemini-2.5-flash-lite`
Failure fallback	`claude-3-haiku`
Out-of-corpus knowledge	`claude-sonnet-4.6`

Two findings are baked into that table. Flash-Lite was tried and rejected for the note-attribution planner: on the bench it produced broken note references about 5% of the time, so that step runs on full Flash. And the cheap models need babysitting — on a single-tool menu, Gemini Flash-Lite ignores tool_choice="required" and answers from training data instead, so we pass the literal tool name to force the call.
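As an illustration of naming the tool explicitly, here is a sketch that assumes an OpenAI-style Chat Completions request shape; the text does not say which API the system actually calls, and `build_tool_choice` and the `search_corpus` tool are hypothetical.

```python
def build_tool_choice(tools: list):
    """Name the tool outright when the menu has a single entry."""
    if len(tools) == 1:
        name = tools[0]["function"]["name"]
        # Naming the function leaves the model no way to answer without calling it.
        return {"type": "function", "function": {"name": name}}
    return "required"

search_tool = {
    "type": "function",
    "function": {
        "name": "search_corpus",          # hypothetical tool name
        "description": "Search the lecture corpus.",
        "parameters": {
            "type": "object",
            "properties": {"query": {"type": "string"}},
            "required": ["query"],
        },
    },
}
choice = build_tool_choice([search_tool])
```

The returned value would be passed as `tool_choice` alongside `tools` in the request body.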

The gauntlet. As the model streams, every citation-shaped marker it emits is validated and either expanded into a playable widget or dropped. A missing citation is a safe failure; a confident wrong one is not:

if ref is None:                       # an alias the model invented
    log.info("chat_marker_alias_miss", ref=n, known_max=len(self._aliases))
    return ""                         # drop it; a missing cite is a safe failure

The same gauntlet drops a footnote whose integer was never minted, a verse or chapter card not in this turn’s source map, an action reference whose action never fired, a malformed bracket, a repeated citation, and a leaked internal token. Anything it cannot resolve to a real source, it deletes.
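A heavily simplified version of that expander, covering only two of the cases (invented footnotes and repeated citations), might look like this. It works on a complete string rather than a stream, and `expand_markers` and the `<cite …/>` output format are hypothetical.

```python
import re

MARKER = re.compile(r"\[\^(\d+)\]")

def expand_markers(text: str, aliases: list) -> str:
    """Expand minted footnotes into citation widgets; drop everything else."""
    seen = set()

    def replace(match):
        n = int(match.group(1))
        if not 1 <= n <= len(aliases):
            return ""                     # never minted: drop silently
        if n in seen:
            return ""                     # repeated citation: drop
        seen.add(n)
        return f"<cite ref='{aliases[n - 1]}'/>"

    return MARKER.sub(replace, text)

out = expand_markers(
    "The soul is eternal[^1][^1], unlike the body[^7].",
    ["id-alpha", "id-beta"],
)
```

Here the duplicate `[^1]` and the invented `[^7]` both vanish, and only the one resolvable footnote becomes a widget. The reader loses a citation at worst, never gains a false one.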

The prosecutor

We also want to measure faithfulness at scale, without a human reading every turn — which is exactly what an LLM-as-a-judge is for. But two failure modes make it dangerous as a live guard. A model scoring its own output shows self-preference bias — it flatters itself — so we never let the answer model grade its own facts on the live path. The online auditor is therefore heuristic: it counts dropped and invented markers as numeric scores, and it batch-checks every emitted identifier against the real catalogue:

existing = set(await catalog_repo.filter_existing_track_ids(all_track_ids))
for tid in (*cite_track_ids, *card_track_ids, *outline_track_ids):
    if tid not in existing:
        broken += 1

The real faithfulness judge runs offline: a blind, cross-model A/B judge on claude-opus-4.8 at temperature 0, with sides assigned deterministically from the question’s id so it cannot favour a position — pairwise LLM judges carry a documented position bias — scoring faithfulness, completeness, structure, and balance. A slow, expensive, trustworthy judge belongs exactly where it cannot slow down or bias a live answer.
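Deterministic side assignment can be as simple as hashing the question id. The sketch below is one way to do it, not necessarily how the judge harness does; `assign_sides` is a hypothetical name, and SHA-256 is our choice of hash.

```python
import hashlib

def assign_sides(question_id: str, answer_a: str, answer_b: str):
    """Return (left, right) for the judge; the swap depends only on the id."""
    digest = hashlib.sha256(question_id.encode("utf-8")).digest()
    swap = digest[0] % 2 == 1            # stable across runs, roughly 50/50 across ids
    return (answer_b, answer_a) if swap else (answer_a, answer_b)

left, right = assign_sides("q-17", "answer from model A", "answer from model B")
```

Because the swap is a pure function of the id, a rerun of the benchmark shows each question to the judge in the same order, while across many questions neither model systematically sits on the left.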

What did not work

For completeness, the things we tried and dropped. Tool-calling for citations: cheap models ignored it, and the strong one duplicated the reference inline. Constrained decoding to a schema, which vendors sell as structured outputs: we hoped it would make an invalid citation literally unsamplable, but in practice real token-blocking exists only for a few models, and for the rest it degrades to a hint they ignore. The marker gauntlet ended up doing the job neither could. Forceful prompting with MANDATORY/FORBIDDEN: helps a little, never enough on its own. And we stopped quietly patching the model’s mistakes; now we log every one, because you cannot improve a number you do not measure.

In closing

A trustworthy assistant is not a big, clever prompt. A prompt is a request, and a model is free to decline it. Ask a language model nicely not to lie and, often enough to matter, it will lie anyway — fluently, confidently, in the same voice as the truth.

What we built instead is a system that structurally cannot pass an invention off as a fact. The right sources are fetched and cross-encoder-ranked before anything is written. Everything below the bar is discarded, and a thin pool forces an honest refusal. The model never sees an id it could fake, and any reference it does emit is checked against a live map and deleted if it does not resolve. The honesty is not a behaviour we hope for — it is enforced by the rules and the code, whether the model cooperates or not.

That is also why the system keeps improving. Every layer is explicit — a threshold, a gate, a validation, a logged score — so each can be measured, tuned, or deleted on evidence. A single giant prompt is a black box you can only pray to; a stack of small, legible rules is something you can actually engineer. Accuracy here is not the size of the model. It is the architecture around it.