Using Video Content in RAG Pipelines: Architecture Guide
Most production RAG (retrieval-augmented generation) systems index documents — PDFs, web pages, internal wiki articles, support tickets, code documentation. A meaningful fraction of every organization's actual knowledge lives in video instead: recorded meetings, training sessions, conference talks, internal all-hands, recorded customer interviews, demo videos, recorded onboarding sessions. Most of this content is invisible to the RAG pipeline because the retrieval architecture is built for text and the video is, well, video. Adding video to a RAG system isn't conceptually hard — convert to transcript, chunk, embed, retrieve — but the practical details of doing it well (chunking strategy for multi-hour content, parent-document linking, handling the timestamp metadata, balancing chunk size against retrieval precision) matter enough to be worth covering carefully. This article walks through the architecture, with a working Python example using open-source components running entirely locally.
Why video is the under-tapped half of most knowledge bases
For an organization with the typical mix of knowledge artifacts:
- Wiki and documentation: indexed in the RAG system, accessible to retrieval
- Support tickets and historical chat logs: indexed
- PDF documents (contracts, reports, vendor docs): indexed (well or poorly depending on the PDF-extraction quality)
- Recorded meetings, all-hands, training, demos: not indexed; they exist as opaque .mp4 files in Drive or Vimeo
The fourth category contains a significant amount of the org's actual operational knowledge — what the founder said about the product strategy in last quarter's all-hands, how the senior engineer explained the architecture in a recorded brown-bag, what the customer said in their recorded discovery call. None of it is searchable; none of it is retrievable; the AI assistant the team built can't see any of it.
Adding video to the RAG pipeline closes this gap. The architecture: video → transcript → Markdown → chunk → embed → store in vector DB → retrieve at query time → feed retrieved chunks to the generation model alongside the user's query. Same pattern as text-RAG; the front-end transcription step is the only addition.
The end-to-end pipeline
- Convert each video file to a structured Markdown transcript via video-to-markdown for a cloud workflow, or via local yt-dlp + Whisper for batch/private content (a minimal transcription sketch follows this list)
- Chunk the transcript into retrieval-sized segments (typically 200-800 tokens each, with overlap) using the H2/H3 section structure as natural boundaries when possible
- Embed each chunk using a sentence-embedding model (sentence-transformers, OpenAI's text-embedding-3, Voyage's voyage-3, or any production embedding model)
- Store the embedded chunks in a vector database (ChromaDB, pgvector, Qdrant, Weaviate, Pinecone) along with metadata (source video URL, timestamp range, speaker if known, chunk position in the original document)
- Retrieve at query time by embedding the user's query, finding the nearest-neighbor chunks in the vector store, and returning the top-K matches
- Generate by feeding the retrieved chunks (with their source metadata) to the generation model alongside the user's query
Each step has design choices that affect retrieval quality. The next sections cover the ones that matter most for video content specifically.
Chunking strategy for multi-hour content
For short documents (a help-center article, a single-page contract), chunking is straightforward — split on paragraphs or sentences, embed each chunk, done. For multi-hour video transcripts (a 90-minute all-hands, a 3-hour deposition, a 6-hour day of recorded meetings), naive chunking produces poor retrieval quality. The two main considerations:
Topic-aware chunking. The H2 section structure in a structured Markdown transcript marks the natural topic boundaries — the speaker shifted from one topic to another, the conversation pivoted, the meeting moved to the next agenda item. Chunks that respect these boundaries retrieve better than chunks that arbitrarily split mid-topic, because the embedding model can capture the topic signal coherently within each chunk.
Practical approach: for each H2 section, generate one or more chunks depending on length. Sections shorter than ~500 tokens become a single chunk; longer sections get split with sentence-aware splitting and overlap.
Parent-document linking. For long videos, the chunk that best matches a query is usually a short, specific passage: the user's query lines up with one moment in the video, but the answer they need is grounded in the broader context of the surrounding minutes. Parent-document linking solves this by storing each chunk alongside a pointer to the larger section it came from, then retrieving both the small chunk (for embedding-based matching) and the larger surrounding context (for generation-time grounding).
from dataclasses import dataclass
from typing import List

@dataclass
class TranscriptChunk:
    text: str               # the chunk for embedding (small, ~300 tokens)
    parent_section: str     # the surrounding H2 section (larger context)
    source_video_url: str
    timestamp_start: str    # "[00:14:32]"
    timestamp_end: str      # "[00:17:45]"
    chunk_index: int
    speaker: str            # if known

At retrieval time, you embed and search on the small chunks for precision, then return the parent sections for generation-time context. The model gets both the precise match and the broader context.
Embedding model choice
The embedding model converts text chunks into fixed-length vectors that capture semantic meaning. Production options:
- sentence-transformers/all-MiniLM-L6-v2 (open-source, runs locally) — 384-dimensional embeddings, fast, good baseline quality, no API cost. Right starting point for most local pipelines.
- sentence-transformers/all-mpnet-base-v2 (open-source, runs locally) — 768-dim, slower than MiniLM but higher quality. Good middle-ground.
- BAAI/bge-large-en-v1.5 (open-source, runs locally) — strong retrieval performance on benchmarks, larger model.
- OpenAI text-embedding-3-small or text-embedding-3-large (API, paid) — production-grade quality, easy integration, costs ~$0.02-0.13 per million tokens.
- Voyage voyage-3 / voyage-3-large (API, paid) — competitive with or above OpenAI on retrieval benchmarks, similar pricing.
For a self-hosted local pipeline (privacy-sensitive content, no external dependencies), all-mpnet-base-v2 or bge-large is the typical choice. For production cloud pipelines where the embedding cost is acceptable, OpenAI's text-embedding-3 or Voyage's models are competitive choices.
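The indexing code later in this article only needs a function that maps a list of texts to a list of vectors, so swapping the embedding backend is a one-function change. A minimal sketch of a local and a hosted variant, assuming the sentence-transformers and openai packages respectively (the helper names are illustrative, and the OpenAI call expects OPENAI_API_KEY in the environment):

```python
from typing import List

def embed_local(texts: List[str]) -> List[List[float]]:
    # Open-source model, runs locally, no API cost; 768-dimensional vectors.
    from sentence_transformers import SentenceTransformer
    model = SentenceTransformer("all-mpnet-base-v2")
    return model.encode(texts).tolist()

def embed_openai(texts: List[str]) -> List[List[float]]:
    # Hosted API; assumes OPENAI_API_KEY is set in the environment.
    from openai import OpenAI
    client = OpenAI()
    resp = client.embeddings.create(model="text-embedding-3-small", input=texts)
    return [item.embedding for item in resp.data]
```

Whichever backend embeds the chunks at indexing time must also embed the queries; vectors from different models live in different spaces and are not comparable.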
Vector database choice
For local pipelines and small-to-medium scale (up to a few million chunks), ChromaDB is the easy choice — pure Python, embeddable in your application, no separate server required. Pinecone, Qdrant, Weaviate are the production-managed options at higher scale. pgvector on Postgres is the right choice when your existing infrastructure is already Postgres-centric.
For this article's example we use ChromaDB because it's the simplest path to a working local pipeline.
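For the Postgres-centric case, the same store-and-retrieve pattern looks roughly like the sketch below, assuming the psycopg and pgvector packages and a database where the vector extension can be installed (the connection string, table, and column names are illustrative):

```python
import numpy as np
import psycopg
from pgvector.psycopg import register_vector

# vector(768) matches all-mpnet-base-v2; adjust to your embedding model.
conn = psycopg.connect("dbname=knowledge", autocommit=True)
conn.execute("CREATE EXTENSION IF NOT EXISTS vector")
register_vector(conn)
conn.execute("""
    CREATE TABLE IF NOT EXISTS video_chunks (
        id bigserial PRIMARY KEY,
        text text,
        source text,
        timestamp_start text,
        parent_section text,
        embedding vector(768)
    )
""")

def add_chunk(text, source, ts, parent, embedding):
    conn.execute(
        "INSERT INTO video_chunks (text, source, timestamp_start, parent_section, embedding) "
        "VALUES (%s, %s, %s, %s, %s)",
        (text, source, ts, parent, np.array(embedding)),
    )

def nearest_chunks(query_embedding, k=5):
    # `<=>` is pgvector's cosine-distance operator.
    return conn.execute(
        "SELECT text, source, timestamp_start, parent_section "
        "FROM video_chunks ORDER BY embedding <=> %s LIMIT %s",
        (np.array(query_embedding), k),
    ).fetchall()
```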
The working Python example
Below is a complete pipeline that takes a folder of video transcripts (in structured Markdown), chunks them by H2 section, embeds with sentence-transformers, stores in ChromaDB, and answers queries by retrieving the top-K relevant chunks.
import re
from pathlib import Path
from dataclasses import dataclass
from typing import List

import chromadb
from sentence_transformers import SentenceTransformer

# ---------- Chunking ----------

@dataclass
class Chunk:
    text: str
    parent_section: str
    source: str
    timestamp_start: str
    chunk_index: int

def parse_markdown_transcript(md_path: Path) -> List[dict]:
    """Split a structured Markdown transcript by H2 sections."""
    text = md_path.read_text(encoding="utf-8")
    sections = []
    current = {"heading": "Intro", "body": []}
    for line in text.splitlines():
        if line.startswith("## "):
            if current["body"]:
                sections.append({
                    "heading": current["heading"],
                    "body": "\n".join(current["body"]).strip(),
                })
            current = {"heading": line[3:].strip(), "body": []}
        else:
            current["body"].append(line)
    if current["body"]:
        sections.append({
            "heading": current["heading"],
            "body": "\n".join(current["body"]).strip(),
        })
    return sections

def chunk_section(section: dict, source: str, max_tokens: int = 400) -> List[Chunk]:
    """Split a section into ~max_tokens chunks with sentence-aware splitting."""
    body = section["body"]
    # Find first timestamp anchor for the section
    ts_match = re.search(r"\[(\d{2}:\d{2}(?::\d{2})?)\]", body)
    section_ts = ts_match.group(0) if ts_match else "[unknown]"
    # Approximate token count by word count * 1.3
    words = body.split()
    if len(words) * 1.3 <= max_tokens:
        return [Chunk(
            text=body,
            parent_section=f"## {section['heading']}\n\n{body}",
            source=source,
            timestamp_start=section_ts,
            chunk_index=0,
        )]
    # Otherwise split into ~max_tokens chunks at sentence boundaries
    sentences = re.split(r"(?<=[.!?])\s+", body)
    chunks = []
    current = []
    current_words = 0
    idx = 0
    for sent in sentences:
        sent_words = len(sent.split())
        if current_words + sent_words > max_tokens / 1.3 and current:
            chunks.append(Chunk(
                text=" ".join(current),
                parent_section=f"## {section['heading']}\n\n{body}",
                source=source,
                timestamp_start=section_ts,
                chunk_index=idx,
            ))
            idx += 1
            current = [sent]
            current_words = sent_words
        else:
            current.append(sent)
            current_words += sent_words
    if current:
        chunks.append(Chunk(
            text=" ".join(current),
            parent_section=f"## {section['heading']}\n\n{body}",
            source=source,
            timestamp_start=section_ts,
            chunk_index=idx,
        ))
    return chunks

# ---------- Indexing ----------

embedder = SentenceTransformer("all-mpnet-base-v2")
client = chromadb.PersistentClient(path="./video_rag_db")
collection = client.get_or_create_collection("video_transcripts")

def index_transcript_folder(folder: str):
    transcripts_dir = Path(folder)
    all_chunks = []
    for md_file in transcripts_dir.glob("*.md"):
        sections = parse_markdown_transcript(md_file)
        for section in sections:
            chunks = chunk_section(section, source=md_file.stem)
            all_chunks.extend(chunks)
    if not all_chunks:
        return
    texts = [c.text for c in all_chunks]
    embeddings = embedder.encode(texts, show_progress_bar=True).tolist()
    ids = [f"{c.source}-{c.chunk_index}-{i}" for i, c in enumerate(all_chunks)]
    metadatas = [{
        "source": c.source,
        "timestamp_start": c.timestamp_start,
        "parent_section": c.parent_section,
    } for c in all_chunks]
    collection.add(
        ids=ids,
        documents=texts,
        embeddings=embeddings,
        metadatas=metadatas,
    )
    print(f"Indexed {len(all_chunks)} chunks from {folder}")

# ---------- Retrieval ----------

def retrieve(query: str, k: int = 5):
    query_embedding = embedder.encode([query]).tolist()
    results = collection.query(
        query_embeddings=query_embedding,
        n_results=k,
    )
    return [
        {
            "chunk_text": doc,
            "source": meta["source"],
            "timestamp": meta["timestamp_start"],
            "parent_section": meta["parent_section"],
            "distance": dist,
        }
        for doc, meta, dist in zip(
            results["documents"][0],
            results["metadatas"][0],
            results["distances"][0],
        )
    ]

# ---------- Usage ----------

if __name__ == "__main__":
    # Index a folder of Markdown transcripts
    index_transcript_folder("./transcripts")

    # Query
    matches = retrieve("what did the founder say about pricing strategy?", k=3)
    for m in matches:
        print(f"\n--- {m['source']} at {m['timestamp']} (distance={m['distance']:.3f}) ---")
        print(m['chunk_text'][:500])

This is a complete working pipeline. Drop your Markdown transcripts into the ./transcripts folder, run the indexing step once, and queries return the top-K relevant passages with their source video and timestamp. For integration with a generation model, feed the retrieved parent_section values plus the user's query to your LLM of choice.
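A minimal sketch of that last step, using the retrieve() function above and the OpenAI chat API (the model name and prompt wording are illustrative; any chat-capable LLM slots in the same way):

```python
from openai import OpenAI

def answer(query: str, k: int = 3) -> str:
    matches = retrieve(query, k=k)   # from the pipeline above
    context = "\n\n".join(
        f"Source: {m['source']} {m['timestamp']}\n{m['parent_section']}"
        for m in matches
    )
    client = OpenAI()
    resp = client.chat.completions.create(
        model="gpt-4o-mini",   # illustrative; use your LLM of choice
        messages=[
            {"role": "system", "content": (
                "Answer using only the provided transcript excerpts. "
                "Cite the source video and timestamp for every claim."
            )},
            {"role": "user", "content": f"Excerpts:\n{context}\n\nQuestion: {query}"},
        ],
    )
    return resp.choices[0].message.content
```

Note that the prompt is built from the parent sections, not the small chunks, so the model sees the surrounding minutes of context rather than an isolated sentence.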
Cross-modal retrieval: video plus documents
The same vector store can hold chunks from multiple content types — video transcripts, PDF documents, web pages, internal wiki entries. The embedding model treats them all as text; the metadata distinguishes the source type. At retrieval time, the most relevant chunk wins regardless of where it came from.
For organizations building a unified knowledge base, this is the end-state architecture:
- Video transcripts via this article's pipeline
- Web content via building a web knowledge base for AI
- PDF documents via the standard PDF-to-Markdown pipeline
- Internal wiki content (Confluence, Notion, SharePoint) via their respective export tools
All converted to Markdown, all chunked into the same vector store, all retrievable by the same query. The internal AI assistant gains visibility across the full knowledge surface.
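In implementation terms, cross-modal indexing is one extra metadata field per chunk plus an optional filter at query time. A sketch against the ChromaDB collection and embedder from the example above (the source_type values and sample text are illustrative):

```python
doc = "We decided to move to usage-based pricing next quarter."

collection.add(
    ids=["q3-all-hands-0"],
    documents=[doc],
    embeddings=embedder.encode([doc]).tolist(),
    metadatas=[{"source_type": "video", "source": "q3-all-hands",
                "timestamp_start": "[00:23:14]"}],
)

# Query across every content type (the default), or restrict to one type:
results = collection.query(
    query_embeddings=embedder.encode(["what changed about pricing?"]).tolist(),
    n_results=5,
    where={"source_type": "video"},   # drop the filter for unified retrieval
)
```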
Timestamp-precise citation in generated answers
One under-used feature of video-RAG specifically: the timestamp metadata in each retrieved chunk lets the generation model produce answers with timestamp-precise citations back to the source video. Instead of "the founder mentioned this in the Q3 all-hands" (which leaves the user to find it), the answer can be "the founder mentioned this at [00:23:14] in the Q3 all-hands" with a link the user can click to jump to the exact moment.
For internal-tool AI assistants, this changes the user experience meaningfully — answers become verifiable, the source is traceable, and the user can confirm the AI's interpretation by watching the actual passage. The grounding is concrete in a way that's harder to achieve with text-source-only RAG.
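Producing those clickable citations takes only a few lines: parse the stored timestamp into seconds and append the player's time fragment. A sketch, assuming YouTube- and Vimeo-style URL formats (other players have their own conventions):

```python
import re

def timestamp_to_seconds(ts: str) -> int:
    # "[00:23:14]" or "[23:14]" -> total seconds
    parts = [int(p) for p in re.findall(r"\d+", ts)]
    while len(parts) < 3:
        parts.insert(0, 0)
    hours, minutes, seconds = parts
    return hours * 3600 + minutes * 60 + seconds

def deep_link(video_url: str, ts: str) -> str:
    secs = timestamp_to_seconds(ts)
    if "youtube.com" in video_url or "youtu.be" in video_url:
        sep = "&" if "?" in video_url else "?"
        return f"{video_url}{sep}t={secs}s"
    if "vimeo.com" in video_url:
        return f"{video_url}#t={secs}s"
    return f"{video_url}#t={secs}"   # generic fragment; adjust per player

# deep_link("https://vimeo.com/123456789", "[00:23:14]")
# -> "https://vimeo.com/123456789#t=1394s"
```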
Performance characteristics at scale
Realistic numbers for a small-to-medium organizational deployment:
| Corpus size | Chunks | Embedding time (one-shot) | Storage | Query latency |
|---|---|---|---|---|
| 50 hours of video (~50 transcripts) | ~5,000-10,000 | ~5-15 minutes (CPU) | ~50 MB | < 100ms |
| 500 hours (~500 transcripts) | ~50,000-100,000 | ~1-2 hours (CPU), 10-20 min (GPU) | ~500 MB | < 200ms |
| 5,000 hours (large org corpus) | ~500,000-1,000,000 | ~10-20 hours (CPU), 1-2 hours (GPU) | ~5 GB | < 500ms (move to Qdrant/Weaviate) |
For most teams, ChromaDB handles the small-to-medium tier comfortably. At larger scale, migrating to a managed vector database (Pinecone, Qdrant, Weaviate) is straightforward — the chunking, embedding, and metadata patterns transfer directly.
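The migration itself is mechanical. A sketch of re-pointing the indexing step at Qdrant, assuming the qdrant-client package and the Chunk objects and embeddings produced by the example above (the collection name and local URL are illustrative):

```python
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, PointStruct, VectorParams

qdrant = QdrantClient(url="http://localhost:6333")
qdrant.recreate_collection(
    collection_name="video_transcripts",
    vectors_config=VectorParams(size=768, distance=Distance.COSINE),  # 768 = all-mpnet-base-v2
)

def index_in_qdrant(chunks, embeddings):
    # Same chunks, embeddings, and metadata as the ChromaDB version.
    qdrant.upsert(
        collection_name="video_transcripts",
        points=[
            PointStruct(
                id=i,
                vector=emb,
                payload={
                    "text": c.text,
                    "source": c.source,
                    "timestamp_start": c.timestamp_start,
                    "parent_section": c.parent_section,
                },
            )
            for i, (c, emb) in enumerate(zip(chunks, embeddings))
        ],
    )
```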
The pipeline summary
Convert videos to structured Markdown transcripts via video-to-markdown → chunk by H2 sections with parent-document linking → embed with sentence-transformers → store in ChromaDB (or production vector DB at scale) → retrieve at query time → feed to generation model with timestamp-precise citation. For the broader knowledge-base context that integrates web content alongside video, see building a web knowledge base for AI. For the workflow of building a complete searchable video library, see building a searchable video library. For the speaker-identification details that affect chunk metadata, see speaker identification in video transcription.