What semantic search over video actually requires
Three things, in order:

1. A transcript that preserves chapter structure and (when applicable) speaker attribution. Flat captions lose too much for production retrieval.
2. Chunking that respects topic boundaries. Character-count chunking on captions produces incoherent embeddings.
3. Metadata that survives ingestion (chapter title, speaker name, timestamp, source video), so retrieval can filter and synthesis can cite specific moments.
Markdown with chapter and speaker headings provides all three by construction. The conversion (paste a YouTube URL or upload an MP4 on Video to Markdown) gives you structured text. Header-aware chunking gives you coherent units. Heading metadata survives any ingestion pipeline. Pinecone, Chroma, Weaviate, and Qdrant all handle the resulting vectors equally well.
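Here is a minimal sketch of that chunking step using LangChain's MarkdownHeaderTextSplitter. The heading levels ("#" for chapter, "##" for speaker) and the sample transcript are assumptions for illustration, not the converter's guaranteed output.

```python
# Header-aware chunking sketch. Assumes the exported markdown uses
# "#" for chapter titles and "##" for speaker turns; adjust the
# mapping to match the actual converter output.
from langchain_text_splitters import MarkdownHeaderTextSplitter

transcript = """\
# Introduction
## Alice
[00:00:05] Welcome to the talk. Today we cover retrieval over video.

# Architecture
## Bob
[00:12:40] The ingestion pipeline starts with the transcript.
"""

splitter = MarkdownHeaderTextSplitter(
    headers_to_split_on=[("#", "chapter"), ("##", "speaker")]
)

for doc in splitter.split_text(transcript):
    # Each chunk carries its heading path as metadata, e.g.
    # {'chapter': 'Architecture', 'speaker': 'Bob'}
    print(doc.metadata, doc.page_content[:60])
```

Each resulting chunk carries its full heading path as metadata, which is exactly what the schema below stores.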
Recommended schema
Per-chunk metadata to store:

- chapter (string, indexed)
- speaker (string, indexed when applicable)
- timestamp_start (HH:MM:SS, indexed)
- timestamp_seconds (numeric, for range filters)
- source_video (filename or URL)
- source_date (when published or recorded)

Speaker and chapter indexing lets you scope retrieval; numeric timestamps let you run range filters ("everything between 00:30:00 and 00:45:00 of talk X").
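A minimal sketch of building that payload per chunk. The field names mirror the list above; hms_to_seconds is a hypothetical helper, not part of any library.

```python
# Sketch of per-chunk metadata following the schema above.
def hms_to_seconds(hms: str) -> int:
    """Convert 'HH:MM:SS' to total seconds for numeric range filters."""
    h, m, s = (int(part) for part in hms.split(":"))
    return h * 3600 + m * 60 + s

def chunk_metadata(chapter, speaker, timestamp_start, source_video, source_date):
    return {
        "chapter": chapter,                  # string, indexed
        "speaker": speaker,                  # string, indexed when applicable
        "timestamp_start": timestamp_start,  # "HH:MM:SS", for display and citation
        "timestamp_seconds": hms_to_seconds(timestamp_start),  # numeric range filters
        "source_video": source_video,        # filename or URL
        "source_date": source_date,          # published or recorded
    }

meta = chunk_metadata("Architecture", "Bob", "00:12:40",
                      "talk-x.mp4", "2024-05-01")
# {'chapter': 'Architecture', ..., 'timestamp_seconds': 760, ...}
```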
Vector DB choice
For video corpora specifically: Pinecone if you want a managed service and don't want to think about ops; Chroma for local development and small archives; Weaviate when hybrid (keyword plus vector) retrieval matters, since exact phrase matches are common in technical talks; Qdrant when filter-heavy queries (per-speaker, per-conference, per-time-range) dominate. Pair with PDF (PDF for Vector DBs), URL (URL for Vector DBs), and audio (Audio for Vector DBs) sources for unified retrieval.
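To make the filter-heavy case concrete, here is a sketch of a scoped Qdrant query. The collection name video_chunks, the placeholder query vector, and the payload values are assumptions for illustration; the payload keys follow the schema above.

```python
# Filtered search sketch for Qdrant: scope retrieval to one speaker
# and a time range within one talk.
from qdrant_client import QdrantClient
from qdrant_client.models import FieldCondition, Filter, MatchValue, Range

client = QdrantClient(url="http://localhost:6333")

# Placeholder: in practice, embed the user's question with the same
# model used at ingestion time.
query_vector = [0.0] * 768

hits = client.query_points(
    collection_name="video_chunks",
    query=query_vector,
    query_filter=Filter(
        must=[
            FieldCondition(key="speaker", match=MatchValue(value="Bob")),
            FieldCondition(key="source_video", match=MatchValue(value="talk-x.mp4")),
            # "everything between 00:30:00 and 00:45:00 of talk X"
            FieldCondition(key="timestamp_seconds", range=Range(gte=1800, lte=2700)),
        ]
    ),
    limit=5,
)

for point in hits.points:
    print(point.payload["chapter"], point.payload["timestamp_start"])
```

The same shape of query works in the other stores; Qdrant is singled out here only because payload filtering is its strong suit.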