What "semantic search over audio" actually requires
Three things, in order:

1. A transcript that preserves who said what. Flat text loses too much for production retrieval.
2. Chunking that respects conversation structure. Character-count chunking on transcripts produces incoherent embeddings.
3. Metadata that survives ingestion (speaker name, timestamp, source recording) so retrieval can filter and synthesis can attribute.
Markdown with speaker headings provides all three by construction. The conversion gives you structured text. Header-aware chunking gives you coherent units. Heading metadata survives any ingestion pipeline. Pinecone, Chroma, Weaviate, and Qdrant all handle the resulting vectors equally well.
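As a concrete sketch, here is header-aware chunking with LangChain's MarkdownHeaderTextSplitter. The heading levels are assumptions: ## for the speaker and ### for the optional topic, matching the schema below; adjust the tuples to whatever your converter actually emits.

```python
# A minimal sketch of header-aware chunking, assuming the transcript uses
# "## Speaker Name" headings and optional "### topic" subheadings.
from langchain_text_splitters import MarkdownHeaderTextSplitter

transcript_md = """\
## Dana Suarez
### roadmap
We should ship the Q2 launch before the offsite.

## Priya Patel
Agreed, but only if the migration lands first.
"""

splitter = MarkdownHeaderTextSplitter(
    headers_to_split_on=[("##", "speaker"), ("###", "topic")]
)

for chunk in splitter.split_text(transcript_md):
    # Each chunk carries its heading metadata, so speaker and topic
    # survive straight into the vector DB record.
    print(chunk.metadata, "->", chunk.page_content[:40])
```

Note that the second chunk carries only a speaker key, which is exactly how the optional topic field behaves in the schema below.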
Recommended schema
Per-chunk metadata to store:

- speaker (string, indexed)
- timestamp (HH:MM:SS, indexed)
- source_file (the original audio filename)
- source_date (when the recording was made)
- topic (optional, from ### subheadings)

Indexing speaker and timestamp lets you scope retrieval to specific people or time ranges; indexing source_date lets you query "what did anyone say about X in Q1?"
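For illustration, here is what one ingested record might look like, shown with the Pinecone client since it is named below. The index name, ID scheme, vector, and field values are assumptions, and the same metadata dict works in any of the four stores.

```python
# Illustrative upsert of one transcript chunk with the schema above.
# Index name, ID, and vector are placeholders, not a real deployment.
from pinecone import Pinecone

pc = Pinecone(api_key="YOUR_API_KEY")
index = pc.Index("meeting-transcripts")  # hypothetical index

index.upsert(vectors=[{
    "id": "standup-2024-03-11#0042",  # source file + chunk number
    "values": [0.0] * 1536,           # replace with the chunk's embedding
    "metadata": {
        "speaker": "Dana Suarez",                 # indexed
        "timestamp": "00:14:32",                  # indexed
        "source_file": "standup-2024-03-11.mp3",
        "source_date": "2024-03-11",              # indexed, enables Q1-style queries
        "topic": "roadmap",                       # optional, from ### subheading
    },
}])
```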
Vector DB choice
For audio corpora specifically:

- Pinecone if you want managed and don't want to think about ops.
- Chroma for local development and small archives.
- Weaviate when hybrid retrieval (keyword + vector) matters, because exact phrase matches happen often in transcripts.
- Qdrant when filter-heavy queries (per-speaker, per-time-range) dominate your access patterns; see the query sketch after this list.

Pair with PDF and web sources via PDF for Vector DBs and URL for Vector DBs for unified retrieval.
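A hedged sketch of that filter-heavy pattern with qdrant-client. The collection name, payload values, and query vector are assumptions, and DatetimeRange requires Qdrant 1.8 or later.

```python
# Scope a search to one speaker and a Q1 date range, assuming the
# collection stores the metadata schema above as its payload.
from qdrant_client import QdrantClient
from qdrant_client.models import Filter, FieldCondition, MatchValue, DatetimeRange

client = QdrantClient(url="http://localhost:6333")

hits = client.search(
    collection_name="meeting-transcripts",  # hypothetical collection
    query_vector=[0.0] * 1536,              # replace with the query embedding
    query_filter=Filter(
        must=[
            # Only chunks spoken by one person...
            FieldCondition(key="speaker", match=MatchValue(value="Dana Suarez")),
            # ...from recordings made during Q1.
            FieldCondition(
                key="source_date",
                range=DatetimeRange(gte="2024-01-01", lte="2024-03-31"),
            ),
        ]
    ),
    limit=5,
)
for hit in hits:
    print(hit.payload["timestamp"], hit.payload["speaker"])
```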