Where video RAG pipelines fall apart without Markdown
Two failure modes show up immediately. First, naive chunking on auto-generated captions slices through topic boundaries: embeddings encode "the end of chapter 3 plus the beginning of chapter 4", a mixture that clusters as noise. Second, retrieval over those chunks surfaces 30-second fragments without the chapter or speaker context the LLM needs to synthesise an answer.
Structured Markdown with chapter headings and (for multi-speaker formats) speaker headings solves both. Header-aware chunking respects topic boundaries. Each chunk is one coherent unit — a chapter, a speaker turn, a section of an explanation. Embeddings encode that unit cleanly. Retrieval surfaces complete arguments.
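As an illustration, a converted transcript in this shape might look like the following (the chapter titles, timestamps, and body text are invented for the example):

```markdown
## Chapter 3: Chunking strategies [12:40]

Everything said during this chapter sits together under one heading,
so a header-aware splitter keeps the whole argument in one chunk…

## Chapter 4: Retrieval [18:05]

The next topic starts at the next heading, never mid-sentence…
```

Multi-speaker formats use the same ##-level headings for speaker turns (e.g. `## Guest [18:05]`), so one splitter handles both layouts.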
The pipeline
Convert each video on Video to Markdown (paste a YouTube URL or upload an MP4), save the .md, then chunk and embed locally. Building a multi-source pipeline? Convert PDFs (PDF for RAG), web pages (URL for RAG), and audio (Audio for RAG) the same way.
Recommended chunking
Split first by ## (chapter or speaker boundary), then sub-split anything over 800 tokens with a recursive character splitter. Keep chapter title, speaker name, and timestamp as chunk metadata — your retrieval can filter by speaker, by time range, or by topic, and your synthesis prompts get free structural context.
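The heading-first split followed by a recursive character sub-split can be sketched with the standard library alone. This is a minimal sketch, not the converter's API: the chunk dict shape, the metadata keys, and the rough 4-characters-per-token estimate (800 tokens ≈ 3200 characters) are all assumptions.

```python
import re

def sub_split(text, max_chars):
    """Recursively split on paragraphs, then lines, then spaces, then hard cuts."""
    if len(text) <= max_chars:
        return [text]
    for sep in ("\n\n", "\n", " "):
        parts = text.split(sep)
        if len(parts) > 1:
            out, buf = [], ""
            for part in parts:
                candidate = (buf + sep + part) if buf else part
                if len(candidate) > max_chars and buf:
                    out.extend(sub_split(buf, max_chars))
                    buf = part
                else:
                    buf = candidate
            out.extend(sub_split(buf, max_chars))
            return out
    # No separator left: hard character split.
    return [text[i:i + max_chars] for i in range(0, len(text), max_chars)]

def chunk_markdown(md_text, max_chars=3200):
    """Header-aware chunking: split on ## headings, then sub-split long bodies.

    max_chars ~= 800 tokens at a rough 4-characters-per-token estimate.
    """
    chunks = []
    for section in re.split(r"(?m)^(?=## )", md_text):
        section = section.strip()
        if not section:
            continue
        if section.startswith("## "):
            heading_line, _, body = section.partition("\n")
            heading = heading_line.lstrip("# ").strip()
        else:
            heading, body = None, section  # preamble before the first heading
        # Pull an optional trailing [hh:mm(:ss)] timestamp out of the heading.
        ts = re.search(r"\[(\d{1,2}:\d{2}(?::\d{2})?)\]", heading or "")
        meta = {
            # "heading" is a chapter title or, in speaker-turn layouts, a speaker name.
            "heading": re.sub(r"\s*\[[^\]]*\]\s*$", "", heading) if heading else None,
            "timestamp": ts.group(1) if ts else None,
        }
        for piece in sub_split(body.strip(), max_chars):
            if piece:
                chunks.append({"text": piece, "metadata": meta})
    return chunks
```

In a real pipeline you would likely swap the character estimate for your embedding model's tokenizer, or use a library splitter pair such as LangChain's `MarkdownHeaderTextSplitter` plus `RecursiveCharacterTextSplitter`; the structure of the result is the same, and the metadata dict is what enables filtering by speaker, time range, or topic at retrieval time.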