A citation a model cannot fake

In the chat, every quote the assistant gives is a chip — tap it and the original audio plays from exactly the right moment. For that to work, each chip has to carry a real lecture ID and real timecodes: which millisecond to start, which to stop.

What was breaking

The model made everything up. The first two or three citations were real; after that it started generating IDs of a similar shape (track_iL7KjU5Lp4pZ) that existed nowhere in the catalogue. Sometimes it slid into BG_1972_03.05 — a format soaked up from religious texts during pre-training. The timecodes were invented too: it would stamp @5000000-5060000 onto a lecture thirty minutes long. The user tapped, and the app said “could not load audio.”

Why? The model sees a pattern in its search results — track_ + fourteen random characters + @ + two big numbers — and a cheap model (Gemini Flash Lite, at $0.10/M tokens) simply keeps writing tokens of the same shape, by analogy. There is no link to the catalogue at all. The IDs and the numbers alike are statistical imitation.

We tried tool calling, constrained generation, a hard MANDATORY/FORBIDDEN prompt. On cheap models, none of it really worked.

The move: shrink the surface

The main finding was to reduce the surface for hallucination. Not to teach the model not to lie — to take away its ability to.

Before, the model saw the full catalogue record in its search results:

{ "track_id": "track_OkPV…", "start_ms": 630560, "end_ms": 684400, "text": "…" }

— and wrote [cite:track_OkPV…@630560-684400|the perfection of life], half of which it invented.

Now the model sees only the text, numbered:

[1] "Bhakti is the path of service…"
[2] "Devotional service purifies…"
[3] …

No IDs, no milliseconds. The server keeps the map to itself — each (track_id, start_ms, end_ms) triple gets a small sequential integer alias starting from 1. The model cites a note by writing the bare integer back as a footnote, [^1], and the server expands it into the real [cite:track_OkPV…@630560-684400|the perfection of life] on the way out, just before the marker reaches the client. The prompt is blunt about it:

To cite a note, write the exact same [^N] — copy the integer verbatim,
do not invent or increment.

What changes is categorical. The model physically cannot invent an ID or a time, because it is working with a single short integer that already exists in its own context. Write [^7] where there were only five notes and it is discarded instantly. Sliding into BG_1972_03.05 is impossible too — there is no track_… anywhere in its window for the pattern to copy.

A bonus: follow-ups still work

We keep this alias map on the front end as well, attached to each message. On the next turn the client sends it back, the server folds the earlier chips back into [^N] form, and the model sees its own past citations in the same numbered format — so it understands “tell me more about that first quote.” Without the map, those earlier references would expand into an unreadable mush of full IDs, and the model would lose the thread between the question now and what was said before.

The lesson

You can solve this head-on, by buying a model thirty times the price. Or you can change not the model but what it sees. The second path makes the system cheaper to run — which means more affordable for the people using it, and easier to scale. Quality does not have to be expensive: it can be the result of an architectural decision, not the size of the model.

Further reading: Structured model outputs (OpenAI docs) · Vectara Hallucination Leaderboard