# Markdown vs SRT vs VTT: Which Transcript Format for What?
Most transcription tools offer a choice of output formats: SRT, VTT, plain text, JSON, sometimes Markdown. The choice is not arbitrary — each format optimizes for a different downstream use case, and using the wrong one for the wrong task creates real friction. SRT and VTT are subtitle formats designed for playing alongside video in a player; plain text is the lowest-common-denominator readable output; JSON is the format you want when programmatically processing the transcript in code; Markdown is what you actually want when the transcript itself is a deliverable artifact intended to be read, edited, repurposed, or fed into an AI assistant. This article walks through the four formats with the same 30-second clip in each, identifies when each is the right choice, and explains why structured Markdown is the right default for most modern transcription workflows.
## Same 30 seconds, four different formats
Consider a 30-second clip from an interview where the host introduces a topic and the guest responds. Here's the same content rendered in each of the four formats.
SRT (SubRip Subtitle):

```
1
00:00:14,200 --> 00:00:17,800
So today I want to talk about distributed
consensus, specifically the Raft protocol.

2
00:00:17,800 --> 00:00:21,400
Why Raft over Paxos for most production
use cases.

3
00:00:21,400 --> 00:00:25,600
Great question. The honest answer is that
Raft is genuinely easier to reason about.

4
00:00:25,600 --> 00:00:29,000
The academic Paxos paper is famously
difficult to implement correctly.
```

VTT (Web Video Text Tracks):

```
WEBVTT

00:00:14.200 --> 00:00:17.800
So today I want to talk about distributed
consensus, specifically the Raft protocol.

00:00:17.800 --> 00:00:21.400
Why Raft over Paxos for most production
use cases.

00:00:21.400 --> 00:00:25.600
Great question. The honest answer is that
Raft is genuinely easier to reason about.

00:00:25.600 --> 00:00:29.000
The academic Paxos paper is famously
difficult to implement correctly.
```

Plain text:

```
So today I want to talk about distributed consensus, specifically the Raft protocol. Why Raft over Paxos for most production use cases. Great question. The honest answer is that Raft is genuinely easier to reason about. The academic Paxos paper is famously difficult to implement correctly.
```

Structured Markdown:

```
## Why Raft over Paxos

**Host:** [00:00:14] So today I want to talk about distributed consensus, specifically the Raft protocol. Why Raft over Paxos for most production use cases?

**Guest:** [00:00:21] Great question. The honest answer is that Raft is genuinely easier to reason about. The academic Paxos paper is famously difficult to implement correctly.
```

Same 30 seconds. Four meaningfully different artifacts. The choice between them depends on what happens next.
## SRT and VTT: subtitle formats for video players
SRT and VTT exist to solve one specific problem: displaying captions in a video player synchronized with playback. Both are time-coded text formats where each subtitle entry has a precise start and end time, and the player reads the file alongside the video to display the right text at the right moment.
The two formats are nearly identical in structure with minor syntax differences:
- SRT uses comma decimal separators in timestamps (`00:00:14,200`) and numbers each entry sequentially. It is the older format, originating with the SubRip subtitle ripper, and has near-universal compatibility across video players.
- VTT uses period decimal separators (`00:00:14.200`), starts the file with a `WEBVTT` header, and supports additional features like styling cues, regions, and notes. It was designed specifically for HTML5 video and is supported by all modern web players via the `<track>` element. Because the differences are this small, converting SRT to VTT is mechanical (see the sketch after this list).
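Since the formats differ only in the decimal separator and the header, an SRT file can be turned into valid VTT with a couple of string operations. A minimal sketch; the function name and file paths are illustrative:

```python
import re

def srt_to_vtt(srt_path: str, vtt_path: str) -> None:
    """Convert an SRT caption file to WebVTT by fixing the two syntax differences."""
    with open(srt_path, encoding="utf-8") as f:
        srt = f.read()
    # 1. VTT timestamps use a period before the milliseconds instead of a comma.
    vtt_body = re.sub(r"(\d{2}:\d{2}:\d{2}),(\d{3})", r"\1.\2", srt)
    # 2. VTT files must begin with a WEBVTT header followed by a blank line.
    #    The numeric counters SRT uses can stay; VTT accepts them as cue identifiers.
    with open(vtt_path, "w", encoding="utf-8") as f:
        f.write("WEBVTT\n\n" + vtt_body)

srt_to_vtt("interview.srt", "interview.vtt")
```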
For their intended use — captions displayed in a video player — both formats work well and are the right choice. If you're producing captions for YouTube uploads, Vimeo, your own custom HTML5 player, or any video-distribution workflow where the captions render alongside the video, SRT or VTT is what you want.
For any other use case, both formats become friction. The line-broken structure (each subtitle entry is artificially short to fit the screen) makes the text awkward to read as prose. The timestamps are everywhere, breaking the reading flow. There are no semantic structures (no headings, no speaker labels, no topic boundaries) — the subtitles are just sequential timed text fragments. Pasting an SRT or VTT file into a blog post, a doc, or an AI assistant produces an artifact that needs significant cleanup before it's usable.
## Plain text: readable but unstructured
Plain text is the simplest possible output: the words spoken in the video, one continuous stream, no metadata. For human reading at low cognitive load, this is genuinely useful — you can paste it anywhere, it renders identically across all environments, and there's no markup syntax to learn or strip.
The cost of the simplicity: no structure at all. No way to know where one topic ends and the next begins. No way to know who said what in multi-speaker content. No way to navigate to a specific point in the video corresponding to a specific passage. The whole transcript is a single undifferentiated block, and once it gets longer than a paragraph or two, it becomes impractical to work with.
For very short clips (a one-minute monologue, a brief soundbite), plain text is fine. For anything longer, the lack of structure becomes the bottleneck.
## Why Markdown wins for the actual deliverable use case
Structured Markdown gives you the readability of plain text, the timing metadata of SRT/VTT, and additional structural elements (headings, speaker labels) that neither format provides. The result is a single artifact that works well across the full range of downstream uses where the transcript itself is what matters.
The three layers of structure that make the difference:
- H2 section headings derived from the natural topic pivots in the conversation. The reader can scan the table of contents, jump to the relevant section, and read in topic-sized chunks rather than time-sized chunks.
- Speaker labels as bold annotations (`**Host:**`, `**Guest:**`). Multi-speaker content becomes readable without ambiguity about attribution.
- Timestamp anchors as inline markers (`[00:14:32]`) that map back to the source video. Every passage in the transcript is verifiable against the video at the corresponding timestamp, and the reader can jump to the video at any point of interest. (A minimal parsing sketch follows this list.)
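Those three layers are also what make the transcript machine-readable. A minimal parsing sketch, assuming the exact layout shown in the example above (one `**Speaker:** [timestamp]` turn per line, under `##` headings); the function name is illustrative:

```python
import re

# Assumes the layout from the example: "## Heading" lines for sections and
# "**Speaker:** [HH:MM:SS] text" lines for each turn, one turn per line.
heading_re = re.compile(r"^##\s+(.*)$")
turn_re = re.compile(r"^\*\*(.+?):\*\*\s*\[(\d{2}:\d{2}(?::\d{2})?)\]\s*(.*)$")

def parse_transcript(md_text: str):
    """Yield (section, speaker, timestamp, text) tuples from a structured transcript."""
    section = None
    for line in md_text.splitlines():
        m = heading_re.match(line)
        if m:
            section = m.group(1).strip()
            continue
        m = turn_re.match(line)
        if m:
            speaker, timestamp, text = m.groups()
            yield section, speaker, timestamp, text.strip()

md = open("interview.md", encoding="utf-8").read()
for section, speaker, ts, text in parse_transcript(md):
    print(f"[{ts}] {speaker} ({section}): {text[:60]}")
```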
This structural choice is exactly what most downstream uses want:
- Publishing the transcript on a website — Markdown converts cleanly to HTML in WordPress, Ghost, Substack, every static-site generator, and most modern CMSes. The H2 headings render as section breaks, the bold speaker labels render as bold inline, the timestamps appear as readable anchors (a minimal conversion sketch follows this list)
- Pasting into a doc for editing or collaboration — Notion, Google Docs, Microsoft Word all accept Markdown directly or via paste-as-markdown
- Feeding into an AI assistant — the structural elements help the model parse the conversation more efficiently than flat text
- Storing for search and retrieval — full-text search tools (ripgrep, Obsidian, Logseq, any wiki) handle Markdown natively; the structural elements aid both readability and search precision
- Repurposing into derivative content (blog posts, social cuts, summaries) — the AI assistant doing the derivation works from a structured input that produces structured output
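As a concrete illustration of the publishing path, one option among many is the Python-Markdown package, which converts the structured transcript to HTML in a few lines; the file names are illustrative:

```python
import markdown  # the Python-Markdown package: pip install markdown

with open("interview.md", encoding="utf-8") as f:
    md_text = f.read()

# H2 headings become <h2> elements, speaker labels become <strong>,
# and the bracketed timestamps pass through as plain text anchors.
html = markdown.markdown(md_text)

with open("interview.html", "w", encoding="utf-8") as f:
    f.write(html)
```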
## The format-affects-quality argument for AI use
The strongest empirical case for structured Markdown over plain text comes from downstream LLM use. The format of the input directly affects the quality of the AI output, and the difference is large enough to matter for any production workflow.
The mechanism: large language models allocate their attention budget across the input. A flat text dump forces the model to spend attention reconstructing structural information (who said what, when does the topic shift, what are the natural boundaries) before it can engage with the content. A pre-structured input lets the model spend its full attention on the content itself.
Empirical observation across many practical workflows: a 99%-accurate plain-text transcript can produce worse summaries, worse extraction, worse derivative content than a 95%-accurate structured Markdown transcript. The structural advantage outweighs the small accuracy difference for most downstream uses. The full theoretical discussion is in best format for LLM input; the principle applies to transcripts the same way it applies to any other text content.
## When SRT or VTT is still the right choice
Despite the case for Markdown above, there are specific workflows where SRT or VTT is genuinely the right output format and Markdown would be the wrong choice:
- Uploading captions to a video platform — YouTube, Vimeo, Wistia, JW Player, and other video hosts all accept SRT or VTT for caption tracks. If your end goal is captions playing alongside your video in a player, this is the format you need
- HTML5 video with custom captioning — the `<track>` element in an HTML5 video player loads VTT directly. For custom video implementations, VTT is the standard
- Localization and translation workflows for video content — many translation tools and services work in SRT/VTT format because they're aligned to the timing of the video
- Subtitle editing software — Aegisub, Subtitle Edit, and similar tools work natively in SRT/VTT
The practical workflow for many creators: produce both formats from the same source — Markdown for the transcript published on the website and used for AI workflows, SRT/VTT for the captions uploaded to the video player. The structured Markdown can be converted to SRT/VTT programmatically when needed (the timestamps and text content are present in both formats; the conversion is straightforward).
## Converting between formats
Going from structured Markdown to SRT for caption-track upload is a few lines of Python — extract the timestamp anchors and the surrounding text, format them as numbered SRT entries with start and end times:
```python
import re

def markdown_to_srt(md_path, srt_path, segment_duration=4.0):
    text = open(md_path, encoding="utf-8").read()
    # Find all timestamp anchors and the text following them
    pattern = re.compile(
        r"\[(\d{2}:\d{2}(?::\d{2})?)\]\s*([^\[]+?)(?=\n\n\*\*|\[|\Z)", re.DOTALL
    )
    entries = []
    for i, m in enumerate(pattern.finditer(text), start=1):
        ts = m.group(1)
        body = m.group(2).strip()
        # Parse the MM:SS or HH:MM:SS timestamp into seconds
        parts = ts.split(":")
        if len(parts) == 2:
            mins, secs = int(parts[0]), int(parts[1])
            hours = 0
        else:
            hours, mins, secs = int(parts[0]), int(parts[1]), int(parts[2])
        start_sec = hours * 3600 + mins * 60 + secs
        end_sec = start_sec + segment_duration
        start = f"{int(start_sec // 3600):02d}:{int((start_sec % 3600) // 60):02d}:{int(start_sec % 60):02d},000"
        end = f"{int(end_sec // 3600):02d}:{int((end_sec % 3600) // 60):02d}:{int(end_sec % 60):02d},000"
        # Strip the bold speaker labels so only the spoken text reaches the captions
        clean = re.sub(r"\*\*[^*]+:\*\*\s*", "", body)
        entries.append(f"{i}\n{start} --> {end}\n{clean}\n")
    with open(srt_path, "w", encoding="utf-8") as f:
        f.write("\n".join(entries))

markdown_to_srt("interview.md", "interview.srt")
```

The reverse direction (SRT to structured Markdown) requires more processing because SRT has less structural information than Markdown — the conversion has to infer where to break sections, which speakers are in the conversation (typically by adding a separate diarization pass), and how to merge subtitle fragments into readable paragraphs. The Markdown-as-source-of-truth pattern is easier than the reverse.
## The decision summary
| Use case | Right format |
|---|---|
| Captions in a video player (YouTube, HTML5, custom player) | SRT or VTT |
| Video localization, subtitle translation | SRT or VTT |
| Subtitle-editing tools (Aegisub, Subtitle Edit) | SRT or VTT |
| Publishing transcript on a website | Markdown |
| Pasting into a document for editing | Markdown |
| Feeding into an AI assistant for summarization, extraction, repurposing | Markdown |
| Storing for full-text search across an archive | Markdown |
| Programmatic processing in code | JSON |
| Quick paste into a chat without markup | Plain text |
| Very short single-speaker clip | Plain text |
## The pipeline summary
For most modern transcription workflows where the transcript is itself a deliverable artifact (publishing on a website, feeding into AI, building a searchable archive), structured Markdown is the right default output format. SRT and VTT remain the right choice for captions displayed in video players. Plain text is fine for very short or single-speaker content. Where multiple downstream uses coexist, generate the Markdown as the source of truth and convert to SRT/VTT for the player-display use case as needed. For the practical workflow of converting video to structured Markdown, see video-to-markdown. For the broader format discussion across content types, see best format for LLM input. For the technical details of how diarization produces the speaker labels in the Markdown structure, see speaker identification in video transcription.