Stories you can watch — Jiva Studio

Our chat assistant in Lectorium — the one that answers questions about the lectures and the scriptures — now answers with more than text. Ask about some episode from Śrīla Prabhupāda’s life and what comes back can be short video clips: the disciples themselves telling how it was.

Before, finding a particular story meant knowing exactly where to look. Now it does not. Write it in your own words — “I think someone told a story about how Prabhupāda cared for his disciples” — and the chat finds the fragment.

Under the hood

The source is the film Following Śrīla Prabhupāda, cut into individual stories:

~2,600 video fragments
~56 hours of disciples’ recollections in total
in both languages — ~1,390 Russian clips (27 h) and ~1,240 English clips (29 h)
350+ disciples of Śrīla Prabhupāda telling their own stories — Śyāmasundara, Brahmānanda, Śrutakīrti, Yamunā, Girirāja Swami, and many more
11 DVDs, events from 1966–1977: Māyāpur, Los Angeles, Vṛndāvana, Bombay, London…

From a film to ~2,600 stories

None of those fragments existed as files. The source is eleven DVDs of continuous footage — one recollection runs into the next, a scene cuts, a new disciple begins. Turning that into a searchable library is a pipeline, and each stage earns its place.

flowchart TD
  F["11 DVDs · 56 h of footage"] --> T["1 · Transcribe (Parakeet on Neural Engine, word-level timing)"]
  T --> SEG["2 · Find the stories (diarize, then LLM-judge the seams)"]
  SEG --> REV["3 · Proofread the ASR (names, Sanskrit)"]
  REV --> META["4 · Identify the speaker plus write a context line"]
  META --> CUT["5 · Cut the video (ffmpeg, re-encode to 720p H.264)"]
  CUT --> EMB["6 · Embed into the shared corpus + store on the CDN"]
  EMB --> CHAT["~2,600 clips, found by meaning"]

Transcribe the footage

For each DVD we pull the audio with yt-dlp, downmix it to mono 16 kHz, and run it through speech recognition. The model is NVIDIA’s Parakeet-TDT-0.6b-v3: one multilingual model handles both the Russian and the English material, with the language passed as a hint instead of a separate model per language. It runs locally, on the Apple Neural Engine of an M4 Mac. The CoreML weights stay hot in memory behind a small HTTP job queue, so 56 hours transcribe at a few hundred times real-time, with no per-minute cloud bill. We drive it straight from an agent through an MCP tool: hand it a file, get back a job_id, wait on the job.

What comes back is not just text but word-level timing and a per-word confidence, which is exactly what later lets us cut on a sentence rather than mid-breath:

{ "word": "Prabhupada", "startTime": 4.24, "endTime": 4.83, "confidence": 0.997 }

The service runs with Russian as its default language; low-confidence stretches — chanting, off-language audio — fall below 0.5 and flag themselves for a human to look at.
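As a rough illustration of how such stretches could be surfaced for review, the sketch below groups consecutive words whose confidence falls under 0.5 into time spans. The field names follow the JSON sample above; the `min_words` cut-off and the function name are assumptions made for the example, not part of the described service.

```python
from typing import Dict, List, Tuple


def low_confidence_spans(words: List[Dict], threshold: float = 0.5,
                         min_words: int = 3) -> List[Tuple[float, float]]:
    """Return (start, end) times for runs of at least `min_words`
    consecutive words whose confidence is below `threshold`."""
    spans: List[Tuple[float, float]] = []
    run: List[Dict] = []
    for w in list(words) + [None]:  # the None sentinel flushes the last run
        if w is not None and w["confidence"] < threshold:
            run.append(w)
            continue
        if len(run) >= min_words:
            spans.append((run[0]["startTime"], run[-1]["endTime"]))
        run = []
    return spans
```

Requiring a few words in a row (an assumed heuristic) keeps a single misheard name from flagging a whole passage, while chanting or off-language audio still shows up as a span.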

Find where one story ends

A transcript is a wall of words; a story has a beginning and an end — almost always one disciple’s whole turn, from the moment they begin until a different person starts. This is the hardest stage, and the two editions of the film force two opposite strategies.

English: let the voices draw the lines. Each devotee is filmed and named on screen, and the edit cuts cleanly between them. So speaker diarization (Deepgram’s nova-2 with diarization enabled) marks every point where the voice changes, and one continuous run by a single voice becomes one story. A Prabhupāda lecture or interview is one story; passages of pure music and credits, with no words, are dropped.

Then a few rules clean the edges: nothing under ~18 seconds stands on its own (it is folded into a neighbour), nothing over ~4 minutes stays whole (it is re-split at its largest internal pause), and a final check flags any clip that starts lower-case or ends without punctuation — the tell-tale of a cut made mid-sentence.
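The three edge rules can be sketched roughly as follows, treating a story as a list of word records like the one shown earlier. The thresholds come from the text; everything else is an assumption for illustration: the function names, the choice to split recursively, and the set of characters counted as closing punctuation.

```python
from typing import Dict, List

Story = List[Dict]  # word records: {"word", "startTime", "endTime"}

MIN_SECONDS = 18.0   # "nothing under ~18 seconds stands on its own"
MAX_SECONDS = 240.0  # "nothing over ~4 minutes stays whole"


def duration(story: Story) -> float:
    return story[-1]["endTime"] - story[0]["startTime"]


def fold_short(stories: List[Story]) -> List[Story]:
    """Fold any story shorter than MIN_SECONDS into a neighbour."""
    out: List[Story] = []
    for story in stories:
        if out and (duration(story) < MIN_SECONDS
                    or duration(out[-1]) < MIN_SECONDS):
            out[-1] = out[-1] + story
        else:
            out.append(story)
    return out


def split_long(story: Story) -> List[Story]:
    """Split an over-long story at its largest internal pause,
    repeating on each half until every piece fits (assumed behaviour)."""
    if duration(story) <= MAX_SECONDS or len(story) < 2:
        return [story]
    gaps = [story[i + 1]["startTime"] - story[i]["endTime"]
            for i in range(len(story) - 1)]
    cut = gaps.index(max(gaps)) + 1
    return split_long(story[:cut]) + split_long(story[cut:])


def looks_cut_mid_sentence(text: str) -> bool:
    """Flag text that starts lower-case or ends without punctuation."""
    text = text.strip()
    return bool(text) and (text[0].islower() or text[-1] not in ".!?…")
```

The flag is only a prompt for a human look: a clip ending in a closing quotation mark, for instance, would be flagged by this simple check even though the cut is fine.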

The Russian problem: one voice, many narrators

The Russian edition breaks that approach completely. It is not dubbed per person — a single translator voices every devotee. The voice never changes, so diarization is blind: to it, the whole five-hour DVD is one speaker.

So we invert the method: over-segment, then judge. First we cut the transcript far too finely — at every pause long enough to be a seam, each one snapped to the nearest real gap between words. That yields many more fragments than there are stories. Then, for every adjacent pair of fragments, a reasoning model (Claude) answers a single question — judging by who is speaking and what is told, not by the voice, since the voice is identical throughout:

Two consecutive fragments from the Russian voiceover of a documentary where many devotees recount memories of Śrīla Prabhupāda (one translator voices them all — the voice is no clue to who is speaking). MERGE only if B is the same person continuing the same remembrance as A… Answer SPLIT if B is a different person’s account or a different story. Reply with one word: MERGE or SPLIT.

Thousands of seams are weighed in parallel, and what comes back out is whole, per-narrator stories, reassembled from the pieces. Two editions of the same film, two opposite strategies — both forced by one decision in how each was dubbed.
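A minimal sketch of the over-segment-then-judge idea, under stated assumptions: the 0.6-second pause threshold is invented for the example, and `judge` is a placeholder callable standing in for the model call, which is not reproduced here. It receives the text of two adjacent fragments and returns MERGE or SPLIT.

```python
from typing import Callable, Dict, List

Fragment = List[Dict]  # word records: {"word", "startTime", "endTime"}


def over_segment(words: List[Dict], min_pause: float = 0.6) -> List[Fragment]:
    """Cut wherever the gap between two words is at least `min_pause` seconds."""
    if not words:
        return []
    fragments: List[Fragment] = [[words[0]]]
    for prev, cur in zip(words, words[1:]):
        if cur["startTime"] - prev["endTime"] >= min_pause:
            fragments.append([cur])
        else:
            fragments[-1].append(cur)
    return fragments


def text_of(fragment: Fragment) -> str:
    return " ".join(w["word"] for w in fragment)


def reassemble(fragments: List[Fragment],
               judge: Callable[[str, str], str]) -> List[Fragment]:
    """Ask `judge` about every adjacent pair, then glue MERGE seams together."""
    if not fragments:
        return []
    # Each verdict depends only on the two original fragments,
    # so these calls are independent and could run in parallel.
    verdicts = [judge(text_of(a), text_of(b))
                for a, b in zip(fragments, fragments[1:])]
    stories: List[Fragment] = [list(fragments[0])]
    for fragment, verdict in zip(fragments[1:], verdicts):
        if verdict.strip().upper() == "MERGE":
            stories[-1].extend(fragment)
        else:
            stories.append(list(fragment))
    return stories
```

Because every seam is judged against the original fragments and not against partially merged stories, the order in which verdicts arrive does not matter, which is what makes weighing the seams in parallel safe.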

Proofread the recognition

Speech recognition mishears Sanskrit and proper names. So a language model passes over every story and corrects only recognition errors — Tripurāri Mahārāja, Māyāpur, Ratha-yātrā, a consistent spelling of Prabhupāda — under strict instruction not to paraphrase, summarize, or add or drop a single line. (If it ever runs away and returns more than twice the input length, we discard its output and keep the raw transcript.)
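The runaway guard in the parenthesis amounts to a single length comparison. The sketch below is one way to write it; the function name is hypothetical, and the extra rule that an empty reply also falls back to the raw transcript is an added assumption.

```python
def accept_proofread(raw: str, corrected: str, max_ratio: float = 2.0) -> str:
    """Keep the corrected text unless it is empty (assumed rule) or
    longer than `max_ratio` times the raw transcript."""
    if not corrected.strip() or len(corrected) > max_ratio * len(raw):
        return raw
    return corrected
```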

Identify and situate each story

Who is speaking. A clip is worth far more when it is signed. In the English edition the name is on the screen — a lower-third caption, “Śyāmasundara dāsa remembers.” So ffmpeg grabs a single frame from each story, crops the lower third, and a vision model reads the caption and returns just the name. The Russian clips have no captions — only the translator’s voice — so they borrow: each Russian story is matched to its English twin by content (the editions differ in order and length, so the match is on what is told, not on timecode) and inherits the name from it.
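For the frame grab, a command along these lines would do the job. This is a sketch, not the project's actual invocation: the function name is hypothetical, and which moment of the story to sample is left to the caller. The `crop` expression keeps the full width and the bottom third of the frame.

```python
from typing import List


def caption_frame_command(source: str, at_seconds: float,
                          out_path: str) -> List[str]:
    """Build an ffmpeg command that saves the lower third of one frame."""
    return [
        "ffmpeg", "-y",
        "-ss", f"{at_seconds:.3f}",       # seek to the chosen moment
        "-i", source,
        "-frames:v", "1",                 # write a single video frame
        "-vf", "crop=iw:ih/3:0:ih*2/3",   # width, height, x, y of the crop
        out_path,
    ]
```

The list can be passed to `subprocess.run(cmd, check=True)`; the resulting image is what the vision model would be asked to read.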

A line of context. A standalone clip is easy to misread: “and then he just smiled at me” means nothing without knowing who he is, who is talking, and when. So in the same pass a model writes a one- or two-sentence situating line for each story — who is speaking, what it is about, when and where — primed with a hand-written era note for each DVD (DVD 7, for instance, is “the 1975 US tour and the Māyāpur festival”), and extracts the structured metadata alongside it: year, places, topics, people. That line is prepended to the transcript before embedding, so semantic search has something to grip even when the clip itself never names its subject. It is the idea behind Anthropic’s contextual retrieval — a little context bound to each fragment so the search is not fooled by a clip that, alone, says almost nothing. The viewer still sees only the clean transcript; the context works behind the glass.
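The split between what is searched and what is shown can be sketched as two views of the same record. The `Story` class and its field names here are illustrative assumptions, not the project's schema.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Story:
    transcript: str               # proofread text, shown to the viewer
    context: str                  # one or two situating sentences
    speaker: str = ""
    year: Optional[int] = None
    places: List[str] = field(default_factory=list)


def embedding_text(story: Story) -> str:
    """Text sent to the embedding model: context first, then the transcript."""
    return f"{story.context.strip()}\n\n{story.transcript.strip()}"


def display_text(story: Story) -> str:
    """Text shown beside the clip: the transcript alone."""
    return story.transcript.strip()
```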

Cut the video

Only now is the video itself touched. ffmpeg cuts each story out of the full-resolution download, re-encoding rather than stream-copying so that every clip starts on a keyframe exactly at the cut point. The output is 720p H.264 with AAC audio, web fast-start, and a quarter-second of padding on each side so no one is clipped mid-word:

ffmpeg -ss {start-0.25} -i source.mp4 -t {duration} \
  -map 0:v:0 -map 0:a:0 \
  -c:v libx264 -preset veryfast -crf 23 -vf scale=-2:720 \
  -c:a aac -b:a 128k -movflags +faststart  clip.mp4

A poster frame is grabbed for each clip in the same step.
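To make the padding arithmetic explicit, here is a sketch that builds the same command from a story's start and end times. The function names are hypothetical, and clamping the padded start at zero is an added assumption for clips at the very beginning of a file; the encoding flags mirror the command above.

```python
from typing import List

PAD = 0.25  # quarter-second of padding on each side


def clip_command(source: str, start: float, end: float,
                 out_path: str) -> List[str]:
    """Build the ffmpeg command that cuts one padded, re-encoded clip."""
    padded_start = max(0.0, start - PAD)
    padded_duration = (end + PAD) - padded_start
    return [
        "ffmpeg", "-y",
        "-ss", f"{padded_start:.3f}", "-i", source,
        "-t", f"{padded_duration:.3f}",
        "-map", "0:v:0", "-map", "0:a:0",
        "-c:v", "libx264", "-preset", "veryfast", "-crf", "23",
        "-vf", "scale=-2:720",
        "-c:a", "aac", "-b:a", "128k",
        "-movflags", "+faststart",
        out_path,
    ]


def poster_command(source: str, start: float, out_path: str) -> List[str]:
    """Build the ffmpeg command that saves one frame as a poster image."""
    return ["ffmpeg", "-y", "-ss", f"{start:.3f}", "-i", source,
            "-frames:v", "1", out_path]
```

For a story running from 10.0 s to 40.0 s this seeks to 9.750 and cuts 30.500 seconds, that is, the 30-second story plus a quarter-second on each side.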

Embed, index, and serve

Each story — its context line prepended to the proofed transcript — is turned into an embedding by OpenAI’s text-embedding-3-small (1,536 dimensions) and written into the same chunks corpus as the lectures, verses, and letters: a clip is just kind = 'media', one of six kinds. Re-running the indexer re-embeds only what changed, since each row carries a content hash. The same retrieval that already serves audio fragments — the partial-HNSW search in pgvector — then reaches across the clips too, with no separate video index.
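The incremental re-indexing can be pictured as a hash comparison per row. The sketch below assumes SHA-256 over the embedding text and plain dictionaries in place of database rows; the real indexer's hash function and schema are not specified in this post.

```python
import hashlib
from typing import Dict, List


def content_hash(text: str) -> str:
    """Stable fingerprint of the text that would be embedded."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def rows_to_embed(current: Dict[str, str],
                  stored_hashes: Dict[str, str]) -> List[str]:
    """Return ids whose text is new or differs from the stored hash."""
    return [row_id for row_id, text in current.items()
            if stored_hashes.get(row_id) != content_hash(text)]
```

Only the ids returned need a fresh call to the embedding model; unchanged rows keep their existing vectors.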

The cut files themselves go to object storage (an S3 bucket, with Yandex and Bunny CDN mirrors); the database keeps only a relative path, and at startup the app resolves it against whichever CDN edge is nearest. So when a clip wins the search, the chat streams that exact moment to you, its transcript beside it.
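Keeping only a relative path in the database means the final URL is assembled on the client. A rough sketch follows, assuming that "nearest" is decided by a latency measured once at startup; the base URLs in the usage note are placeholders, not the project's real hosts.

```python
from typing import Dict


def pick_base(latencies_ms: Dict[str, float]) -> str:
    """Choose the CDN base URL with the lowest measured latency."""
    return min(latencies_ms, key=latencies_ms.get)


def clip_url(cdn_base: str, relative_path: str) -> str:
    """Join a CDN base URL and the relative path stored in the database."""
    return cdn_base.rstrip("/") + "/" + relative_path.lstrip("/")
```

With placeholder measurements such as `{"https://cdn-a.example.com": 40.0, "https://cdn-b.example.com": 120.0}`, `pick_base` returns the first base, and `clip_url` turns a stored path like `media/clip.mp4` into a full address under it.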

The result is that video is not a feature bolted on the side. A clip is just another kind of entry in a corpus the rest of the system already understands — which is exactly why a question written in your own words can return one.

Ask about places, years, or events — or simply “tell me a story about…” — and watch how it was.