Where audio RAG pipelines fall apart
Two failure modes show up immediately. First, fixed-size chunking on flat transcript text routinely splits a single speaker's turn across two chunks while gluing the end of one speaker's turn to the start of another's. Embeddings then encode noise: half a question plus half an answer reads as nothing in particular. Second, retrieval over those chunks surfaces fragments the LLM can't synthesise an answer from, because the speaker context is gone.
Markdown with ## Speaker [HH:MM:SS] headings solves both. Header-aware chunking respects turn boundaries. Each chunk is one speaker saying one thing. Embeddings encode that thing cleanly. Retrieval surfaces complete exchanges instead of orphan fragments.
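For illustration, this is the heading shape in question (speaker names, timestamps, and dialogue are invented):

```
## Interviewer [00:00:12]
So walk me through how the outage started.

## Maria [00:00:19]
We saw latency spike on the ingest service around two in the morning,
and by the time paging fired, the queue had already backed up.
```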
The pipeline
Convert each audio file with Audio to Markdown, save the .md, then chunk and embed locally. Building a multi-source pipeline? Convert your PDFs (PDF for RAG) and web pages (URL for RAG) as well, so every modality reaches the vector DB through the same structured-Markdown path.
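A minimal sketch of the local half of that pipeline, assuming sentence-transformers for embeddings and ChromaDB as the vector store (both are assumptions, not requirements); chunk_transcript is the header-aware splitter sketched in the next section:

```python
# Indexing sketch: read converted .md transcripts, chunk, embed locally,
# store with per-chunk metadata. Library choices are illustrative.
from pathlib import Path

import chromadb
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")          # local embedding model
client = chromadb.PersistentClient(path="./rag_index")   # on-disk vector DB
collection = client.get_or_create_collection("transcripts")

for md_file in Path("transcripts").glob("*.md"):
    # chunk_transcript() is defined in the chunking sketch below
    chunks = chunk_transcript(md_file.read_text(encoding="utf-8"))
    texts = [c["text"] for c in chunks]
    vectors = model.encode(texts)
    collection.add(
        ids=[f"{md_file.stem}-{i}" for i in range(len(chunks))],
        documents=texts,
        embeddings=vectors.tolist(),
        # speaker + timestamp metadata enables filtered retrieval later
        metadatas=[c["meta"] for c in chunks],
    )
```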
Recommended chunking
Split first on ## headings (one per speaker turn), then sub-split anything still over your token budget. Target 600-1000 tokens per chunk with 50-100 tokens of overlap. Keep the speaker name and timestamp as chunk metadata so retrieval can filter by speaker, by time range, or both.
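A sketch of that recipe, assuming the ## Speaker [HH:MM:SS] format above and tiktoken for token counting; swap in whatever tokenizer matches your embedding model:

```python
# Header-aware chunking sketch: one chunk per speaker turn, with long turns
# sub-split into overlapping token windows. Budget values sit inside the
# 600-1000 / 50-100 ranges recommended above.
import re

import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
TURN_RE = re.compile(r"^## (?P<speaker>.+?) \[(?P<ts>\d{2}:\d{2}:\d{2})\]\s*$", re.MULTILINE)

def chunk_transcript(md: str, max_tokens: int = 800, overlap: int = 75) -> list[dict]:
    chunks = []
    turns = list(TURN_RE.finditer(md))
    for i, m in enumerate(turns):
        # A turn's body runs from the end of its heading to the next heading.
        start = m.end()
        end = turns[i + 1].start() if i + 1 < len(turns) else len(md)
        body = md[start:end].strip()
        meta = {"speaker": m.group("speaker"), "timestamp": m.group("ts")}
        tokens = enc.encode(body)
        if len(tokens) <= max_tokens:
            chunks.append({"text": body, "meta": meta})
            continue
        # Sub-split long turns into overlapping windows within the token budget.
        step = max_tokens - overlap
        for j in range(0, len(tokens), step):
            window = tokens[j : j + max_tokens]
            chunks.append({"text": enc.decode(window), "meta": dict(meta)})
            if j + max_tokens >= len(tokens):
                break
    return chunks
```

Because each chunk carries its speaker and timestamp, the vector store can filter at query time; with the ChromaDB setup assumed earlier, that is a where={"speaker": ...} argument to collection.query.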