MDisBetter · 12 min read

Using Video Content in RAG Pipelines: Architecture Guide

Most production RAG (retrieval-augmented generation) systems index documents — PDFs, web pages, internal wiki articles, support tickets, code documentation. A meaningful fraction of every organization's actual knowledge lives in video instead: recorded meetings, training sessions, conference talks, internal all-hands, recorded customer interviews, demo videos, recorded onboarding sessions. Most of this content is invisible to the RAG pipeline because the retrieval architecture is built for text and the video is, well, video. Adding video to a RAG system isn't conceptually hard — convert to transcript, chunk, embed, retrieve — but the practical details of doing it well (chunking strategy for multi-hour content, parent-document linking, handling the timestamp metadata, balancing chunk size against retrieval precision) matter enough to be worth covering carefully. This article walks through the architecture, with a working Python example using open-source components running entirely locally.

Why video is the under-tapped half of most knowledge bases

For an organization with the typical mix of knowledge artifacts — documents, wiki pages, support tickets, and recorded video — the video category contains a significant amount of the org's actual operational knowledge: what the founder said about the product strategy in last quarter's all-hands, how the senior engineer explained the architecture in a recorded brown-bag, what the customer said in their recorded discovery call. None of it is searchable; none of it is retrievable; the AI assistant the team built can't see any of it.

Adding video to the RAG pipeline closes this gap. The architecture: video → transcript → Markdown → chunk → embed → store in vector DB → retrieve at query time → feed retrieved chunks to the generation model alongside the user's query. Same pattern as text-RAG; the front-end transcription step is the only addition.

The end-to-end pipeline

  1. Convert each video file to a structured Markdown transcript — via video-to-markdown for a cloud workflow, or via local yt-dlp + Whisper for batch or privacy-sensitive content
  2. Chunk the transcript into retrieval-sized segments (typically 200-800 tokens each, with overlap) using the H2/H3 section structure as natural boundaries when possible
  3. Embed each chunk using a sentence-embedding model (sentence-transformers, OpenAI's text-embedding-3, Voyage's voyage-3, or any production embedding model)
  4. Store the embedded chunks in a vector database (ChromaDB, pgvector, Qdrant, Weaviate, Pinecone) along with metadata (source video URL, timestamp range, speaker if known, chunk position in the original document)
  5. Retrieve at query time by embedding the user's query, finding the nearest-neighbor chunks in the vector store, and returning the top-K matches
  6. Generate by feeding the retrieved chunks (with their source metadata) to the generation model alongside the user's query

Each step has design choices that affect retrieval quality. The next sections cover the ones that matter most for video content specifically.

Chunking strategy for multi-hour content

For short documents (a help-center article, a single-page contract), chunking is straightforward — split on paragraphs or sentences, embed each chunk, done. For multi-hour video transcripts (a 90-minute all-hands, a 3-hour deposition, a 6-hour day of recorded meetings), naive chunking produces poor retrieval quality. The two main considerations:

Topic-aware chunking. The H2 section structure in a structured Markdown transcript marks the natural topic boundaries — the speaker shifted from one topic to another, the conversation pivoted, the meeting moved to the next agenda item. Chunks that respect these boundaries retrieve better than chunks that arbitrarily split mid-topic, because the embedding model can capture the topic signal coherently within each chunk.

Practical approach: for each H2 section, generate one or more chunks depending on length. Sections shorter than the chunk budget (roughly 400-500 tokens) become a single chunk; longer sections get split with sentence-aware splitting and overlap.

Parent-document linking. For long videos, the most useful retrieval result is often a short passage together with its surrounding context — the user's query matches a specific moment in the video, but the answer they need is grounded in the broader context of the surrounding minutes. Parent-document linking solves this by storing each chunk alongside a pointer to the larger section it came from, and retrieving both the small chunk (for embedding-based matching) and the larger surrounding context (for generation-time grounding).

from dataclasses import dataclass

@dataclass
class TranscriptChunk:
    text: str  # the chunk for embedding (small, ~300 tokens)
    parent_section: str  # the surrounding H2 section (larger context)
    source_video_url: str
    timestamp_start: str  # "[00:14:32]"
    timestamp_end: str  # "[00:17:45]"
    chunk_index: int
    speaker: str  # if known

At retrieval time, you embed and search on the small chunks for precision, then return the parent sections for generation-time context. The model gets both the precise match and the broader context.
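When several retrieved chunks come from the same H2 section, the generation context should include that parent section only once. A minimal, stdlib-only sketch of that deduplication step (the match-dict shape follows this article's `retrieve` example):

```python
def assemble_context(matches: list[dict]) -> str:
    """Join the parent sections of retrieved matches, deduplicated in order."""
    seen = set()
    sections = []
    for m in matches:
        parent = m["parent_section"]
        if parent not in seen:
            seen.add(parent)
            sections.append(parent)
    return "\n\n---\n\n".join(sections)
```

Order-preserving dedup matters here: the first occurrence of a parent section is usually the closest match, so it should lead the context.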

Embedding model choice

The embedding model converts text chunks into fixed-length vectors that capture semantic meaning, so that semantically similar chunks land near each other in vector space.

For a self-hosted local pipeline (privacy-sensitive content, no external dependencies), all-mpnet-base-v2 or bge-large is the typical choice. For production cloud pipelines where the embedding cost is acceptable, OpenAI's text-embedding-3 or Voyage's models are competitive choices.
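Whichever model you pick, "nearest neighbor" for sentence embeddings usually means cosine similarity. A stdlib-only illustration of the comparison the vector store performs under the hood (toy 3-dimensional vectors stand in for real embeddings):

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors: 1.0 = same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

print(cosine_similarity([1.0, 0.0, 0.0], [2.0, 0.0, 0.0]))  # 1.0 (parallel)
print(cosine_similarity([1.0, 0.0, 0.0], [0.0, 1.0, 0.0]))  # 0.0 (orthogonal)
```

In practice the vector database computes this (or an equivalent distance) for you; the point is that embedding quality, not the distance metric, is where model choice pays off.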

Vector database choice

For local pipelines and small-to-medium scale (up to a few million chunks), ChromaDB is the easy choice — pure Python, embeddable in your application, no separate server required. Pinecone, Qdrant, and Weaviate are the managed production options at higher scale. pgvector on Postgres is the right choice when your existing infrastructure is already Postgres-centric.

For this article's example we use ChromaDB because it's the simplest path to a working local pipeline.

The working Python example

Below is a complete pipeline that takes a folder of video transcripts (in structured Markdown), chunks them by H2 section, embeds with sentence-transformers, stores in ChromaDB, and answers queries by retrieving the top-K relevant chunks.

import re
from pathlib import Path
from dataclasses import dataclass
from typing import List
import chromadb
from sentence_transformers import SentenceTransformer

# ---------- Chunking ----------

@dataclass
class Chunk:
    text: str
    parent_section: str
    source: str
    timestamp_start: str
    chunk_index: int

def parse_markdown_transcript(md_path: Path) -> List[dict]:
    """Split a structured Markdown transcript by H2 sections."""
    text = md_path.read_text(encoding="utf-8")
    sections = []
    current = {"heading": "Intro", "body": []}
    for line in text.splitlines():
        if line.startswith("## "):
            if current["body"]:
                sections.append({
                    "heading": current["heading"],
                    "body": "\n".join(current["body"]).strip(),
                })
            current = {"heading": line[3:].strip(), "body": []}
        else:
            current["body"].append(line)
    if current["body"]:
        sections.append({
            "heading": current["heading"],
            "body": "\n".join(current["body"]).strip(),
        })
    return sections

def chunk_section(section: dict, source: str, max_tokens: int = 400) -> List[Chunk]:
    """Split a section into ~max_tokens chunks with sentence-aware splitting."""
    body = section["body"]
    # Find first timestamp anchor for the section
    ts_match = re.search(r"\[(\d{2}:\d{2}(?::\d{2})?)\]", body)
    section_ts = ts_match.group(0) if ts_match else "[unknown]"
    
    # Approximate token count by word count * 1.3
    words = body.split()
    if len(words) * 1.3 <= max_tokens:
        return [Chunk(
            text=body,
            parent_section=f"## {section['heading']}\n\n{body}",
            source=source,
            timestamp_start=section_ts,
            chunk_index=0,
        )]
    
    # Otherwise split into ~max_tokens chunks at sentence boundaries
    sentences = re.split(r"(?<=[.!?])\s+", body)
    chunks = []
    current = []
    current_words = 0
    idx = 0
    for sent in sentences:
        sent_words = len(sent.split())
        if current_words + sent_words > max_tokens / 1.3 and current:
            chunks.append(Chunk(
                text=" ".join(current),
                parent_section=f"## {section['heading']}\n\n{body}",
                source=source,
                timestamp_start=section_ts,
                chunk_index=idx,
            ))
            idx += 1
            current = [sent]
            current_words = sent_words
        else:
            current.append(sent)
            current_words += sent_words
    if current:
        chunks.append(Chunk(
            text=" ".join(current),
            parent_section=f"## {section['heading']}\n\n{body}",
            source=source,
            timestamp_start=section_ts,
            chunk_index=idx,
        ))
    return chunks

# ---------- Indexing ----------

embedder = SentenceTransformer("all-mpnet-base-v2")
client = chromadb.PersistentClient(path="./video_rag_db")
collection = client.get_or_create_collection("video_transcripts")

def index_transcript_folder(folder: str):
    transcripts_dir = Path(folder)
    all_chunks = []
    for md_file in transcripts_dir.glob("*.md"):
        sections = parse_markdown_transcript(md_file)
        for section in sections:
            chunks = chunk_section(section, source=md_file.stem)
            all_chunks.extend(chunks)
    
    if not all_chunks:
        return
    
    texts = [c.text for c in all_chunks]
    embeddings = embedder.encode(texts, show_progress_bar=True).tolist()
    
    ids = [f"{c.source}-{c.chunk_index}-{i}" for i, c in enumerate(all_chunks)]
    metadatas = [{
        "source": c.source,
        "timestamp_start": c.timestamp_start,
        "parent_section": c.parent_section,
    } for c in all_chunks]
    
    collection.add(
        ids=ids,
        documents=texts,
        embeddings=embeddings,
        metadatas=metadatas,
    )
    print(f"Indexed {len(all_chunks)} chunks from {folder}")

# ---------- Retrieval ----------

def retrieve(query: str, k: int = 5):
    query_embedding = embedder.encode([query]).tolist()
    results = collection.query(
        query_embeddings=query_embedding,
        n_results=k,
    )
    return [
        {
            "chunk_text": doc,
            "source": meta["source"],
            "timestamp": meta["timestamp_start"],
            "parent_section": meta["parent_section"],
            "distance": dist,
        }
        for doc, meta, dist in zip(
            results["documents"][0],
            results["metadatas"][0],
            results["distances"][0],
        )
    ]

# ---------- Usage ----------

if __name__ == "__main__":
    # Index a folder of Markdown transcripts
    index_transcript_folder("./transcripts")
    
    # Query
    matches = retrieve("what did the founder say about pricing strategy?", k=3)
    for m in matches:
        print(f"\n--- {m['source']} at {m['timestamp']} (distance={m['distance']:.3f}) ---")
        print(m['chunk_text'][:500])

This is a complete working pipeline. Drop your Markdown transcripts into the ./transcripts folder, run the indexing step once, and queries return the top-K relevant passages with their source video and timestamp. For integration with a generation model, feed the retrieved parent_section values plus the user's query to your LLM of choice.
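The generation step can be as simple as flattening the retrieved matches into a grounded prompt. A sketch of that assembly — the instruction wording is illustrative, and the actual model call is omitted because it depends on your LLM client:

```python
def build_prompt(query: str, matches: list[dict]) -> str:
    """Flatten retrieved matches into a grounded prompt for the LLM."""
    context_blocks = []
    for m in matches:
        context_blocks.append(
            f"Source: {m['source']} at {m['timestamp']}\n{m['parent_section']}"
        )
    context = "\n\n---\n\n".join(context_blocks)
    return (
        "Answer the question using only the context below. "
        "Cite the source video and timestamp for each claim.\n\n"
        f"Context:\n{context}\n\nQuestion: {query}"
    )
```

Note that the prompt uses the larger `parent_section` values, not the small chunks — the chunks did their job at matching time; the sections give the model enough surrounding context to answer accurately.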

Cross-modal retrieval: video plus documents

The same vector store can hold chunks from multiple content types — video transcripts, PDF documents, web pages, internal wiki entries. The embedding model treats them all as text; the metadata distinguishes the source type. At retrieval time, the most relevant chunk wins regardless of where it came from.
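One way to carry that distinction is a `source_type` field in each chunk's metadata — an assumption here, since the indexing code above stores only `source`, `timestamp_start`, and `parent_section`. A stdlib sketch of post-retrieval filtering on such a field:

```python
def filter_by_source_type(matches: list[dict], source_type: str) -> list[dict]:
    """Keep only matches whose metadata tags the given source type."""
    return [m for m in matches if m.get("source_type") == source_type]

mixed = [
    {"source": "q3-all-hands", "source_type": "video"},
    {"source": "pricing-faq", "source_type": "pdf"},
]
print(filter_by_source_type(mixed, "video"))  # only the video chunk survives
```

ChromaDB can also apply this filter server-side via the `where` argument to `collection.query`, which avoids retrieving chunks you will discard anyway.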

For organizations building a unified knowledge base, this is the end-state architecture: every source converted to Markdown, chunked into the same vector store, and retrievable by the same query. The internal AI assistant gains visibility across the full knowledge surface.

Timestamp-precise citation in generated answers

One under-used feature of video-RAG specifically: the timestamp metadata in each retrieved chunk lets the generation model produce answers with timestamp-precise citations back to the source video. Instead of "the founder mentioned this in the Q3 all-hands" (which leaves the user to find it), the answer can be "the founder mentioned this at [00:23:14] in the Q3 all-hands" with a link the user can click to jump to the exact moment.

For internal-tool AI assistants, this changes the user experience meaningfully — answers become verifiable, the source is traceable, and the user can confirm the AI's interpretation by watching the actual passage. The grounding is concrete in a way that's harder to achieve with text-source-only RAG.
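Turning a retrieved timestamp like `[00:23:14]` into a clickable deep link is a small parsing step. A sketch using a YouTube-style `t=` seconds parameter as the example target — other video players use their own URL schemes:

```python
import re

def timestamp_to_seconds(ts: str) -> int:
    """Parse '[HH:MM:SS]' or '[MM:SS]' into total seconds."""
    match = re.search(r"\[(?:(\d{2}):)?(\d{2}):(\d{2})\]", ts)
    if not match:
        raise ValueError(f"unrecognized timestamp: {ts}")
    hours, minutes, seconds = (int(g) if g else 0 for g in match.groups())
    return hours * 3600 + minutes * 60 + seconds

def deep_link(video_url: str, ts: str) -> str:
    """Append a seconds-offset query parameter to the source video URL."""
    return f"{video_url}?t={timestamp_to_seconds(ts)}s"

print(deep_link("https://youtu.be/abc123", "[00:23:14]"))
# https://youtu.be/abc123?t=1394s
```

The `[MM:SS]` fallback matters because shorter videos often carry two-part timestamps, as the chunking code in this article already anticipates with its optional-hours regex.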

Performance characteristics at scale

Realistic numbers for a small-to-medium organizational deployment:

| Corpus size | Chunks | Embedding time (one-shot) | Storage | Query latency |
| --- | --- | --- | --- | --- |
| 50 hours of video (~50 transcripts) | ~5,000-10,000 | ~5-15 minutes (CPU) | ~50 MB | < 100 ms |
| 500 hours (~500 transcripts) | ~50,000-100,000 | ~1-2 hours (CPU), 10-20 min (GPU) | ~500 MB | < 200 ms |
| 5,000 hours (large org corpus) | ~500k-1M chunks | ~10-20 hours (CPU), 1-2 hours (GPU) | ~5 GB | < 500 ms (move to Qdrant/Weaviate) |

For most teams, ChromaDB handles the small-to-medium tier comfortably. At larger scale, migrating to a managed vector database (Pinecone, Qdrant, Weaviate) is straightforward — the chunking, embedding, and metadata patterns transfer directly.

The pipeline summary

Convert videos to structured Markdown transcripts via video-to-markdown → chunk by H2 sections with parent-document linking → embed with sentence-transformers → store in ChromaDB (or production vector DB at scale) → retrieve at query time → feed to generation model with timestamp-precise citation. For the broader knowledge-base context that integrates web content alongside video, see building a web knowledge base for AI. For the workflow of building a complete searchable video library, see building a searchable video library. For the speaker-identification details that affect chunk metadata, see speaker identification in video transcription.

Frequently asked questions

Should I chunk by fixed token count or by semantic boundaries like H2 sections?
Both, in a hybrid approach. Use the H2 section boundaries as the primary chunking signal — sections that are short enough become single chunks; sections that are too long get split internally with sentence-aware splitting and slight overlap. The reason for this hybrid: H2-only chunking can produce chunks too large for efficient embedding (the embedding model has its own context limit), and pure fixed-token chunking can split mid-topic in ways that hurt retrieval quality. The hybrid approach gives you semantic coherence (matching the natural topic structure) plus size control (chunks fit the embedding model's context window). The example code in this article implements this pattern.
Can I run this whole pipeline without any cloud APIs?
Yes — the example uses open-source components throughout (sentence-transformers for embedding, ChromaDB for the vector store) and runs entirely on your own machine. Combined with local Whisper for the transcription step (yt-dlp + Whisper for downloading and transcribing videos), the entire pipeline from video file to AI-answerable knowledge base runs on consumer hardware with no cloud dependencies. For organizations with strict data-residency requirements or air-gapped environments, this is the right architecture. The tradeoff is upfront engineering effort vs. paid cloud APIs that abstract away the infrastructure — both are legitimate choices depending on your context.
How do I keep the index up to date as new videos are added?
Two patterns work depending on volume. For low-volume settings (a few new videos per week), a manual cron-style script that runs the indexing pipeline against the video folder on a schedule (e.g., every night at 2 AM) is simple and reliable. The script identifies new videos by checking which transcript files don't yet have entries in the vector store. For high-volume settings (continuous flow of new recorded meetings, training, etc.), wire the indexing into the upload/transcription pipeline directly — when a new video is uploaded and transcribed, the resulting Markdown is automatically chunked, embedded, and added to the vector store as the final step. Both patterns work; choose based on your operational cadence.
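The new-file check in the cron pattern can be as simple as comparing the transcript folder against the sources already indexed. A stdlib sketch — in the ChromaDB example above, `indexed_sources` would come from the stored `source` metadata:

```python
from pathlib import Path

def find_new_transcripts(folder: str, indexed_sources: set[str]) -> list[Path]:
    """Return transcript files whose stem is not yet in the vector store."""
    return sorted(
        p for p in Path(folder).glob("*.md") if p.stem not in indexed_sources
    )
```

Pass only the returned files to the indexing step, and record their stems afterwards; because the check keys on file stems, renaming a transcript will cause it to be re-indexed under the new name.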