Where the thirty seconds went

Our chat assistant answered slowly — thirty to forty seconds per question. That is unacceptable: the user has time to get bored. So we sat down to look into it, and the story turned out to be instructive — the problem was not where we expected, and the fixes were all old ideas wearing new clothes.

First, measure

The first rule of optimization: do not optimize blind. We tag every chat turn with a trace id from the moment it leaves the phone, so each turn shows up as a single timeline in our tracing tool (Langfuse), with the stages inside it broken out one by one.
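
To make the idea concrete, here is a minimal stdlib-only sketch of per-stage timing for one turn. It is not our Langfuse setup: `TurnTrace` and the stage names are hypothetical, and `time.sleep` stands in for real work.

```python
import time
import uuid
from contextlib import contextmanager


class TurnTrace:
    """Collects per-stage wall-clock durations for a single chat turn."""

    def __init__(self, trace_id=None):
        # Hypothetical: in a real system the id would arrive with the request.
        self.trace_id = trace_id or uuid.uuid4().hex
        self.stages = {}  # stage name -> seconds

    @contextmanager
    def stage(self, name):
        start = time.perf_counter()
        try:
            yield
        finally:
            # Accumulate, so a stage entered twice reports its total.
            elapsed = time.perf_counter() - start
            self.stages[name] = self.stages.get(name, 0.0) + elapsed

    def slowest(self):
        """Return the (name, seconds) pair of the most expensive stage."""
        return max(self.stages.items(), key=lambda item: item[1])


trace = TurnTrace()
with trace.stage("retrieval"):
    time.sleep(0.05)  # stand-in for the search
with trace.stage("generation"):
    time.sleep(0.01)  # stand-in for the model call
```

Calling `trace.slowest()` then names the stage to look at first, which is the only question worth asking before touching any code.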

The first surprise: generating the text was not the problem. The model itself produced its answer in about four seconds. The other thirty-plus seconds were spent before the model ever spoke — in retrieval, the search for the relevant lecture fragments by meaning. Before answering, the system finds the passages that bear on the question, and that search alone was taking twelve to sixteen seconds, sometimes more.

So the model was the cheap part. Everything that fed it was the expensive part. That reframing is the whole article.

Idea 1 — filter before the search, not after

We expected the vector search itself to be slow. It wasn’t: the index returned its nearest neighbours in tens of milliseconds. The cost was in how we narrowed the results down.

Every passage in the corpus has a type and a language — a lecture transcript, a verse, a commentary; English, Russian. A given question only ever wants a slice of those: a Russian question about a verse wants Russian verses. The trouble was the order of operations. The search found the nearest neighbours by meaning across everything, and only then threw away the wrong type and language. When the passages you actually want are a minority of the corpus, that is brutal — to keep a couple of dozen survivors the search had to wade through hundreds of candidates of the wrong kind, and on production that wading spiked to five, ten, twelve seconds. It was the single largest contributor to the whole slow turn.

The fix is an old database instinct: don’t filter after the work, filter before it. Instead of one big index over the entire corpus, we build a separate partial index over each slice a query will ever ask for — this type, this language — so the search starts inside the right subset and never looks at anything it would only have to discard. (pgvector’s own docs flag this exact pitfall and point to partial indexes as the cure.)

-- a dedicated index per (type, language) slice, so the filter
-- is the index itself rather than a step after it
CREATE INDEX … ON … USING hnsw (embedding vector_cosine_ops)
  WHERE kind = 'track_transcript' AND lang = 'en';

The same worst-case search dropped from a couple of seconds to about 84 milliseconds.
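
A back-of-the-envelope sketch of why the order matters. The corpus and slice sizes below are illustrative, not our real numbers, and the model assumes the wanted slice is spread evenly among the nearest neighbours, which real data only approximates.

```python
def candidates_to_examine(survivors_wanted, corpus_size, slice_size, filter_first):
    """Rough count of candidates a search must touch to keep `survivors_wanted` rows."""
    if not 0 < slice_size <= corpus_size:
        raise ValueError("slice_size must be between 1 and corpus_size")
    if filter_first:
        # Partial index: every candidate already belongs to the slice.
        return survivors_wanted
    # Filter-after: only slice_size / corpus_size of the candidates survive,
    # so scale up by the inverse (integer ceiling division).
    return -(-survivors_wanted * corpus_size // slice_size)


# Illustrative: the wanted slice is 5% of a 100,000-passage corpus.
filter_after = candidates_to_examine(24, 100_000, 5_000, filter_first=False)  # 480
filter_before = candidates_to_examine(24, 100_000, 5_000, filter_first=True)  # 24
```

Two dozen survivors cost hundreds of candidates when the filter runs last, and exactly two dozen when the index is the filter.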

Two less glamorous knobs mattered just as much. First, keep the index in RAM. On the default settings Postgres let the index spill to disk and our typical search wandered between 300 ms and a full second; sizing the database’s memory so the index lives entirely in the page cache pulled that down to 30–150 ms. Second, filtered vector search has a famous foot-gun — with the filter applied it can return zero rows unless you tell the engine to keep scanning, and the default candidate pool is too shallow once a filter prunes the top hits. Two one-line settings per query fixed both. None of these are clever; they are just the kind of thing you only find once you’ve measured.
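
The two settings are not named above. A plausible reading, assuming pgvector 0.8 or later with HNSW indexes, is `hnsw.iterative_scan` (keep scanning until the filter has enough rows) and `hnsw.ef_search` (the size of the candidate pool, default 40). The sketch below only builds the statements to run at the start of a query's transaction; the helper name and the value 100 are illustrative.

```python
ITERATIVE_SCAN_MODES = {"off", "strict_order", "relaxed_order"}


def per_query_settings(ef_search=100, iterative_scan="strict_order"):
    """Build the SET LOCAL statements to issue before a filtered vector query.

    Assumption: these are pgvector's HNSW settings; the article does not
    say which two settings were actually used.
    """
    if iterative_scan not in ITERATIVE_SCAN_MODES:
        raise ValueError(f"unknown iterative_scan mode: {iterative_scan!r}")
    if int(ef_search) < 1:
        raise ValueError("ef_search must be a positive integer")
    return [
        # Keep walking the index until enough rows pass the WHERE clause.
        f"SET LOCAL hnsw.iterative_scan = {iterative_scan}",
        # Deepen the candidate pool beyond the default.
        f"SET LOCAL hnsw.ef_search = {int(ef_search)}",
    ]


statements = per_query_settings()
```

`SET LOCAL` only lasts until the end of the current transaction, so the settings apply to this query and do not leak into other work on the same connection.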

End to end, lecture searches went from 5.8 s to 0.5 s, more than ten times faster.

Idea 2 — do the waiting in parallel

A chat turn isn’t one query. The planner expands a question into several sub-questions, and each of those searches a few lanes at once — lecture transcripts, verses, the rest of the library. Run one after another, that’s a long chain of round-trips stacked end to end.

So we stopped waiting in series. Every lane of every sub-question fires at the same time, and the turn waits only for the slowest of them — max(), not sum(). We push the same idea further upstream by working speculatively: the moment a question arrives we start embedding it and extracting its topics before we know whether we’ll even need them, hiding a second or two of latency under work we were going to do anyway, and throwing the result away on the cheap path. The speculative embed alone shaves 150–300 ms off every turn.
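
The fan-out can be sketched with `asyncio.gather`. This is an illustration, not our pipeline: `search_lane` is a hypothetical stand-in whose `delay` simulates one round-trip, and the lane latencies are made up.

```python
import asyncio
import time


async def search_lane(sub_question, lane, delay):
    # Stand-in for one retrieval round-trip; `delay` simulates its latency.
    await asyncio.sleep(delay)
    return f"{lane}:{sub_question}"


async def fan_out(sub_questions, lane_delays):
    # One coroutine per (sub-question, lane) pair, all started together.
    calls = [
        search_lane(question, lane, delay)
        for question in sub_questions
        for lane, delay in lane_delays.items()
    ]
    # gather returns results in the order the calls were listed.
    return await asyncio.gather(*calls)


LANE_DELAYS = {"transcripts": 0.05, "verses": 0.03, "library": 0.02}  # illustrative

start = time.perf_counter()
results = asyncio.run(fan_out(["q1", "q2"], LANE_DELAYS))
elapsed = time.perf_counter() - start
```

Run in series, the six calls would take about 0.2 s in total. Run together, the turn takes roughly as long as the slowest lane, about 0.05 s.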

The discipline that makes aggressive parallelism safe is a deadline on every external call. A slow embedder or a stalled lane can’t hold the whole turn hostage — when a stage blows its budget, the pipeline carries on with the partial results it has rather than freezing. A slightly thinner answer beats a spinner. (The one thing we deliberately don’t paper over is the model provider being down — degrading that would produce a confident-sounding answer with nothing behind it, which is worse than an honest error.)
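
A minimal sketch of the deadline rule using `asyncio.wait_for`. The lane names, delays and the 0.1 s budget are illustrative, and `lane` is a hypothetical stand-in for an external call.

```python
import asyncio


async def with_deadline(coro, seconds):
    """Return the coroutine's result, or None if it misses its budget."""
    try:
        return await asyncio.wait_for(coro, timeout=seconds)
    except asyncio.TimeoutError:
        # wait_for has already cancelled the slow call; report "no result".
        return None


async def lane(name, delay):
    # Stand-in for an external call that takes `delay` seconds.
    await asyncio.sleep(delay)
    return name


async def gather_partial(lanes, budget):
    results = await asyncio.gather(
        *(with_deadline(lane(name, delay), budget) for name, delay in lanes.items())
    )
    # Carry on with whatever arrived in time.
    return [result for result in results if result is not None]


partial = asyncio.run(gather_partial({"fast": 0.01, "stalled": 5.0}, budget=0.1))
```

Here the stalled lane is dropped after 0.1 s and the turn continues with the fast lane's result. The final model call would not be wrapped this way: its failure should surface as an error rather than be swallowed.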

Idea 3 — only work hard on the hard questions

The biggest realization wasn’t about making the expensive search faster. It was that most questions don’t need it.

This is the Corrective-RAG instinct — judge what you already have before you go retrieve more. A great many questions are ones we already have a vetted, curated answer for, or that closely match something we’ve answered before. For those, doing the full wide sweep — expand into topics, search every lane over multiple rounds, gather a hundred-plus passages, re-rank them all — is pure waste. So the pipeline now makes a cheap judgment up front, from evidence already in hand and without an extra model call: is there already a strong answer for this? If yes, it takes a lean path — fetch the curated material, do a small bounded search, done. Only genuinely open questions pay for the full ~20-second sweep.
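
The up-front judgment can be sketched as a plain function over scores that are already in hand. Everything here is hypothetical: the `Evidence` fields, the 0.85 threshold and the path names are illustrations, not our production values.

```python
from dataclasses import dataclass


@dataclass
class Evidence:
    curated_score: float       # similarity to the best curated answer, 0..1
    prior_answer_score: float  # similarity to the best earlier question, 0..1


def choose_path(evidence, threshold=0.85):
    """Pick the lean path when a strong answer already exists, else the full sweep.

    No model call is involved: the decision uses only scores computed earlier.
    """
    strongest = max(evidence.curated_score, evidence.prior_answer_score)
    return "lean" if strongest >= threshold else "full"


known = choose_path(Evidence(curated_score=0.92, prior_answer_score=0.20))   # "lean"
open_ended = choose_path(Evidence(curated_score=0.40, prior_answer_score=0.55))  # "full"
```

The threshold is the knob to tune against traces: too low and weak matches skip the sweep they needed, too high and nearly everything pays for it.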

The same “stop when you’re done” instinct runs inside the expensive path too. After each round of searching we ask whether the results are already good enough, and bail out early if they are — a second round costs another planning call plus another fanout, roughly nine seconds of wall-clock, and almost never improves an answer that already has a confident hit. We bail in the other direction as well: if the first round comes back with nothing relevant, a second round is just rolling the dice for the same money, so we don’t.
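
Both exits fit in one small loop. In this sketch `search_round` is a hypothetical callable that returns relevance scores for one round, and the thresholds are illustrative.

```python
def retrieve(search_round, max_rounds=2, confident=0.8, relevant=0.3):
    """Run search rounds until one of the two early exits fires.

    Returns (all scores collected, reason the loop stopped).
    """
    collected = []
    for round_no in range(1, max_rounds + 1):
        scores = search_round(round_no)
        collected.extend(scores)
        best = max(scores, default=0.0)
        if best >= confident:
            # Already have a confident hit: another round rarely helps.
            return collected, "confident_hit"
        if best < relevant:
            # Nothing relevant came back: another round is a coin toss.
            return collected, "nothing_relevant"
    return collected, "max_rounds"


rounds_run = []


def fake_round(round_no):
    rounds_run.append(round_no)
    return [0.9, 0.4]  # illustrative scores with one confident hit


scores, reason = retrieve(fake_round)
```

With a confident hit in the first round, the second round and its planning call never run.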

Idea 4 — cheap by default, expensive only where it shows

A retrieval pipeline is a stack of model calls, not one — planning, topic extraction, the final answer. Most of them are small structured-data steps that a fast, cheap model does perfectly well; only the final synthesis needs a strong model. So we tier them, and the dominant cost falls away. The tiering is empirical, not dogmatic: one mid-pipeline step that ranks twenty-odd candidate passages kept hallucinating broken references on the cheapest model, so that one we bumped back up. You earn the savings everywhere and pay for quality only where the measurement tells you to.
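
One way to keep the tiering honest is to make it a single table that measurements can change. The stage and tier names below are hypothetical, and placing the ranking step on a middle tier is an assumption: we only know it was bumped up from the cheapest one.

```python
# Hypothetical stage -> tier table; edit it when a measurement says so.
MODEL_TIERS = {
    "planning": "fast",
    "topic_extraction": "fast",
    "passage_ranking": "mid",   # assumed tier; moved up after broken references
    "final_answer": "strong",
}


def tier_for(stage):
    """Look up the model tier for a pipeline stage.

    Unknown stages raise, so a new step gets an explicit, measured choice
    instead of silently inheriting a default.
    """
    try:
        return MODEL_TIERS[stage]
    except KeyError:
        raise KeyError(f"no tier configured for stage {stage!r}") from None


planner_tier = tier_for("planning")
answer_tier = tier_for("final_answer")
```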

The same goes for caching: the same handful of canonical questions and phrasings recur across users, so we cache their embeddings and topic extractions — but deliberately don’t cache the things that rarely repeat. Caching is only a win where reuse is real.
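
A minimal sketch of caching embeddings for recurring phrasings with `functools.lru_cache`. `fake_embed` is a stand-in for a paid embedding call, the cache size is illustrative, and normalising by lower-casing and collapsing whitespace is an assumption that only holds if the embedder treats those variants alike.

```python
from functools import lru_cache


def normalize(question):
    # Collapse case and whitespace so trivially different phrasings share a key.
    return " ".join(question.lower().split())


def fake_embed(text):
    # Stand-in for a paid embedding call; returns a tiny deterministic "vector".
    return (len(text), sum(map(ord, text)) % 997)


@lru_cache(maxsize=1024)
def _embed_normalized(text):
    return fake_embed(text)


def cached_embedding(question):
    return _embed_normalized(normalize(question))


first = cached_embedding("What is this verse about?")
second = cached_embedding("  what is THIS verse   about? ")
stats = _embed_normalized.cache_info()
```

The second call is served from the cache, so `stats` records one miss and one hit. A cache like this earns its memory only on inputs that recur; for one-off questions it adds nothing.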

What did not work

A newer, “faster” planner model. The latest one turned out to be three to four times slower than the one it replaced, so we rolled it back.
Embedding throttling. We suspected we were hitting the provider’s rate limits during search. The traces showed zero retries on the query path — the throttling was real, but only during bulk indexing, never during a live answer. A false trail that cost us a day.

Both teach the same lesson as the first section: the trace told the truth and our hunch didn’t.

Where it stands

End to end, 35–50 s → ~20–25 s, and the heaviest stage — retrieval — is now almost instant. From here the levers get finer: smarter adaptive logic so the expensive path triggers even more selectively, and moving the database onto its own machine so a heavy turn can’t disturb everything else sharing the box.