Subtitles vs full transcript
Subtitles are timestamped chunks of text aligned to specific audio segments, typically 2-5 seconds each, displayed on screen as captions. A full transcript is the prose-style writeup of the entire video's spoken content with structure (paragraphs, sections, speakers). Both come from the same underlying speech recognition; only the format differs, based on use case. For caption display in video players, you want SRT/VTT subtitles. For reading, repurposing, or AI summarisation, you want the structured transcript.
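For concreteness, a single SRT cue looks like this (index, time range, text; content here is illustrative):

```
1
00:00:02,000 --> 00:00:05,500
Welcome back to the channel, everyone.
```

A transcript would instead fold that line into a paragraph of prose, with at most an occasional timestamp marker.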
What mdisbetter outputs
The Markdown variant outputs structured transcripts with inline timestamps ([12:34] markers). This is dramatically more useful than raw subtitles for most workflows (search, content repurposing, AI input) but it's not SRT format directly. For SRT specifically, see the conversion workflow below; for the structured-transcript use case, use the Markdown output as-is.
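The exact layout of the Markdown output may vary; a section looks roughly like this (sketch, headings and text illustrative):

```
## Getting started

[12:34] First you'll want to install the CLI and authenticate.
[12:51] Once that's done, the rest is a single command.
```

The [MM:SS] markers are what the conversion workflow below keys on.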
From mdisbetter Markdown to SRT
The conversion is mechanical: each timestamp + following text chunk becomes one SRT cue. A simple Python script (10-15 lines using a Markdown parser to extract timestamp/text pairs and emitting SRT format) handles it cleanly. Or paste the Markdown into ChatGPT/Claude with "convert this timestamped Markdown transcript to SRT subtitle format with 3-second cues" — works for most files in one pass.
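A minimal sketch of that script, assuming the transcript uses inline [MM:SS] markers each followed by their text (the function name and regex are illustrative, not part of mdisbetter):

```python
import re

def md_to_srt(md_text, default_dur=3):
    """Convert [MM:SS]-stamped Markdown text to SRT cues.

    Each cue runs until the next timestamp, or default_dur seconds
    for the final one. Assumes timestamps are [minutes:seconds].
    """
    pairs = re.findall(r"\[(\d+):(\d{2})\]\s*([^\[]+)", md_text)
    starts = [int(m) * 60 + int(s) for m, s, _ in pairs]
    cues = []
    for i, (_, _, text) in enumerate(pairs):
        start = starts[i]
        end = starts[i + 1] if i + 1 < len(pairs) else start + default_dur
        fmt = lambda t: f"{t // 3600:02}:{t % 3600 // 60:02}:{t % 60:02},000"
        cues.append(f"{i + 1}\n{fmt(start)} --> {fmt(end)}\n{text.strip()}\n")
    return "\n".join(cues)
```

Cue boundaries snap to the transcript's own timestamps rather than fixed 3-second windows, which keeps captions in sync even when chunks are uneven.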
For direct SRT, use OSS Whisper
OpenAI's open-source whisper command-line tool generates SRT directly from any audio/video file: whisper input.mp4 --output_format srt writes the .srt as part of normal operation. Combine with yt-dlp for YouTube URLs: yt-dlp -x --audio-format mp3 -o "audio.%(ext)s" URL && whisper audio.mp3 --output_format srt (note that yt-dlp names the extracted file after the video title unless you pin it with -o). MIT-licensed, runs locally, free.