A citation a model cannot fake
A cheap model invented citation IDs and timecodes out of thin air. The fix was not a smarter model — it was to never show it an ID at all.
In the chat, every quote the assistant gives is a chip — tap it and the original audio plays from exactly the right moment. For that to work, each chip has to carry a real lecture ID and real timecodes: which millisecond to start, which to stop.
What was breaking
The model made things up. The first two or three citations were real; after
that it started generating IDs of a similar shape (track_iL7KjU5Lp4pZ) that
existed nowhere in the catalogue. Sometimes it slid into BG_1972_03.05 — a
format soaked up from religious texts during pre-training. The timecodes were
invented too: it would stamp @5000000-5060000 onto a lecture thirty minutes
long. The user tapped, and the app said “could not load audio.”
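To make the failure concrete: a thirty-minute lecture is 1,800,000 ms long, so a chip stamped @5000000-5060000 asks the player to start roughly 83 minutes in. The sketch below shows the kind of check that rejects such a marker. It is an illustration only: the CATALOGUE dictionary, the track ID in it, and the check_citation function are hypothetical names, and the marker grammar is inferred from the examples in this article.

```python
import re

# Hypothetical catalogue: track ID -> lecture duration in milliseconds.
CATALOGUE = {"track_example0001": 30 * 60 * 1000}  # thirty minutes = 1_800_000 ms

# Marker shape inferred from the article: [cite:<track>@<start>-<end>|<label>]
CITE_RE = re.compile(
    r"\[cite:(?P<track>[^@\]|]+)@(?P<start>\d+)-(?P<end>\d+)\|(?P<label>[^\]]*)\]"
)

def check_citation(marker: str) -> str:
    """Classify a citation marker as ok, malformed, unknown, or out of range."""
    m = CITE_RE.fullmatch(marker)
    if m is None:
        return "malformed"
    duration = CATALOGUE.get(m["track"])
    if duration is None:
        return "unknown track"  # the ID has the right shape but exists nowhere
    start, end = int(m["start"]), int(m["end"])
    if not (0 <= start < end <= duration):
        return "timecodes out of range"  # e.g. 5_000_000 ms on a 1_800_000 ms lecture
    return "ok"
```

A check like this only detects a bad chip after the fact; it does nothing to stop the model from producing one.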
Why? The model sees a pattern in its search results (track_ + twelve random
characters + @ + two big numbers), and a cheap model (Gemini Flash Lite, at
$0.10/M tokens) simply keeps writing tokens of the same shape, by analogy.
There is no link to the catalogue at all. The IDs and the numbers alike are
statistical imitation.
We tried tool calling, constrained generation, a hard MANDATORY/FORBIDDEN
prompt. On cheap models, none of it really worked.
The move: shrink the surface
The key was to reduce the surface for hallucination: not to teach the model not to lie, but to take away its ability to.
Before, the model saw the full catalogue record in its search results:
{ "track_id": "track_OkPV…", "start_ms": 630560, "end_ms": 684400, "text": "…" }
— and wrote [cite:track_OkPV…@630560-684400|the perfection of life], half of
which it invented.
Now the model sees only the text, numbered:
[1] "Bhakti is the path of service…"
[2] "Devotional service purifies…"
[3] …
No IDs, no milliseconds. The server keeps the map to itself — each
(track_id, start_ms, end_ms) triple gets a small sequential integer alias
starting from 1. The model cites a note by writing the bare integer back as a
footnote, [^1], and the server expands it into the real
[cite:track_OkPV…@630560-684400|the perfection of life] on the way out, just
before the marker reaches the client. The prompt is blunt about it:
To cite a note, write the exact same [^N] — copy the integer verbatim,
do not invent or increment.
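A minimal sketch of the server-side step described above, assuming search hits arrive as dictionaries shaped like the catalogue record shown earlier. The names Span and build_notes are hypothetical, not taken from the actual codebase.

```python
from typing import NamedTuple

class Span(NamedTuple):
    """The triple the server keeps to itself."""
    track_id: str
    start_ms: int
    end_ms: int

def build_notes(hits: list[dict]) -> tuple[str, dict[int, Span]]:
    """Return the numbered text shown to the model and the private alias map."""
    aliases: dict[int, Span] = {}
    lines: list[str] = []
    for n, hit in enumerate(hits, start=1):  # aliases are sequential, starting from 1
        aliases[n] = Span(hit["track_id"], hit["start_ms"], hit["end_ms"])
        lines.append(f'[{n}] "{hit["text"]}"')  # text only: no ID, no milliseconds
    return "\n".join(lines), aliases
```

Only the first return value goes into the prompt; the alias map stays on the server until the reply comes back.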
The change is categorical. The model physically cannot invent an ID or a time,
because it is working with a single short integer that already exists in its own
context. If it writes [^7] when there were only five notes, the server discards
that footnote on the spot. Sliding into BG_1972_03.05 is impossible too: there is no
track_… anywhere in its window for the pattern to copy.
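The expansion on the way out could look roughly like the sketch below. It assumes the alias map also stores the chip label next to each triple; the article does not say where the label comes from, so that part, along with the name expand_citations, is an assumption made for illustration.

```python
import re

# Matches the footnote form the model is asked to write: [^N]
FOOTNOTE_RE = re.compile(r"\[\^(\d+)\]")

def expand_citations(reply: str, aliases: dict[int, tuple[str, int, int, str]]) -> str:
    """Replace each [^N] with the full cite marker; drop aliases that do not exist."""
    def replace(match: re.Match) -> str:
        entry = aliases.get(int(match.group(1)))
        if entry is None:
            return ""  # e.g. [^7] when only five notes were shown: discarded
        track_id, start_ms, end_ms, label = entry  # label storage is assumed
        return f"[cite:{track_id}@{start_ms}-{end_ms}|{label}]"
    return FOOTNOTE_RE.sub(replace, reply)
```

Because the lookup is a plain dictionary access, an invented integer can never resolve to anything that was not already in the search results.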
A bonus: follow-ups still work
We keep this alias map on the front end as well, attached to each message. On the
next turn the client sends it back, the server folds the earlier chips back into
[^N] form, and the model sees its own past citations in the same numbered
format — so it understands “tell me more about that first quote.” Without the
map, those earlier references would expand into an unreadable mush of full IDs,
and the model would lose the thread between the question now and what was said
before.
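Folding earlier chips back into footnotes is the same lookup run in reverse. The sketch below assumes the client returns the alias map as integer keys pointing to (track_id, start_ms, end_ms) triples; fold_citations is a hypothetical name, and dropping a chip that is missing from the map is a choice made here so that no raw ID reaches the model, not something the article specifies.

```python
import re

# Marker shape inferred from the article: [cite:<track>@<start>-<end>|<label>]
CITE_RE = re.compile(r"\[cite:([^@\]|]+)@(\d+)-(\d+)\|[^\]]*\]")

def fold_citations(message: str, aliases: dict[int, tuple[str, int, int]]) -> str:
    """Rewrite full cite markers in an earlier message back into [^N] form."""
    reverse = {span: n for n, span in aliases.items()}  # triple -> alias

    def replace(match: re.Match) -> str:
        key = (match.group(1), int(match.group(2)), int(match.group(3)))
        n = reverse.get(key)
        # Unknown chips are removed rather than shown to the model with a raw ID.
        return f"[^{n}]" if n is not None else ""

    return CITE_RE.sub(replace, message)
```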
The lesson
You can solve this head-on, by buying a model thirty times the price. Or you can change not the model but what it sees. The second path makes the system cheaper to run — which means more affordable for the people using it, and easier to scale. Quality does not have to be expensive: it can be the result of an architectural decision, not the size of the model.
Further reading: Structured model outputs (OpenAI docs) · Vectara Hallucination Leaderboard
Part of Lectorium