Batch Transcribe an Entire YouTube Playlist to Markdown
For one or two videos, paste-and-convert is fine. For 50, 100, or an entire YouTube channel's back catalog, you need a script. The MDisBetter web tool processes one video at a time — that's the right surface for ad-hoc use, not for batch. For batch, the right answer is open-source: yt-dlp to fetch, faster-whisper to transcribe, a Python loop to glue them together. Total cost: $0 plus your hardware time. Here is the working pipeline, the GPU vs CPU performance numbers, and the output structure that matches what the web tool ships.
Be honest about the web tool's limits
MDisBetter's video-to-Markdown tool is built for one-at-a-time conversions. It doesn't expose batch upload, has no API or CLI, and doesn't accept playlist URLs. For a couple of videos, it's the fastest path. For 50+, it's the wrong tool, and pretending otherwise wastes your time.
The OSS path scales to thousands of videos with no per-video cost. It does require Python, basic command-line comfort, and ideally a GPU for reasonable speed. If those are dealbreakers, the right move is to spread the work across the web tool over multiple days (a few videos a day) rather than try to force batch into a non-batch tool.
The stack
- yt-dlp — the modern, actively-maintained successor to youtube-dl. Downloads any YouTube video or playlist, extracts audio.
- faster-whisper — CTranslate2 reimplementation of OpenAI's Whisper. 3-4x faster than the reference whisper package with the same accuracy.
- Python 3.10+ — for the loop and Markdown formatting
All permissively licensed (yt-dlp is Unlicense, faster-whisper is MIT). All free. All run locally.
Setup
# Install
pip install yt-dlp faster-whisper
# Verify yt-dlp
yt-dlp --version
# Test on one video
yt-dlp -x --audio-format mp3 -o 'test.%(ext)s' \
  'https://www.youtube.com/watch?v=dQw4w9WgXcQ'

For GPU acceleration on NVIDIA cards, also install the CUDA toolkit and the matching cuDNN version (the faster-whisper docs have the compatibility matrix). On Apple Silicon, faster-whisper runs on the CPU (CTranslate2 has no Metal backend), but its CPU path is well optimized there. CPU-only works on any platform without extra setup.
The full pipeline
#!/usr/bin/env python3
"""Batch transcribe a YouTube playlist to structured Markdown files."""
import argparse
import json
import subprocess
import sys
from pathlib import Path
from faster_whisper import WhisperModel
def get_playlist_videos(playlist_url):
    """Return list of (video_id, title) tuples for the playlist."""
    cmd = ['yt-dlp', '--flat-playlist', '-J', playlist_url]
    result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    data = json.loads(result.stdout)
    return [(e['id'], e['title']) for e in data.get('entries', [])]


def download_audio(video_id, audio_dir):
    """Download just the audio for a single video. Returns the file path."""
    output = audio_dir / f'{video_id}.mp3'
    if output.exists():
        return output
    cmd = [
        'yt-dlp', '-x', '--audio-format', 'mp3',
        '-o', str(output.with_suffix('.%(ext)s')),
        f'https://www.youtube.com/watch?v={video_id}',
    ]
    subprocess.run(cmd, check=True)
    return output


def transcribe(audio_path, model):
    """Transcribe an audio file. Returns list of (start, end, text) segments."""
    segments, info = model.transcribe(
        str(audio_path),
        beam_size=5,
        language=None,  # auto-detect
    )
    return [(s.start, s.end, s.text.strip()) for s in segments]


def segments_to_markdown(title, video_id, segments):
    """Format segments as structured Markdown with H2 sections every ~5 minutes."""
    lines = [f'# {title}', '']
    lines.append(f'**Source:** https://www.youtube.com/watch?v={video_id}')
    if segments:
        duration = segments[-1][1]
        m, s = divmod(int(duration), 60)
        h, m = divmod(m, 60)
        if h:
            lines.append(f'**Duration:** {h}:{m:02d}:{s:02d}')
        else:
            lines.append(f'**Duration:** {m}:{s:02d}')
    lines.append('')
    # Group segments into sections every 5 minutes
    section_seconds = 300
    current_section = -1
    for start, end, text in segments:
        section = int(start // section_seconds)
        if section != current_section:
            current_section = section
            mm = int(start // 60)
            ss = int(start % 60)
            lines.append('')
            lines.append(f'## [{mm:02d}:{ss:02d}] Section {section + 1}')
            lines.append('')
        lines.append(text)
    return '\n'.join(lines)


def main():
    parser = argparse.ArgumentParser()
    parser.add_argument('playlist_url')
    parser.add_argument('--out', default='transcripts')
    parser.add_argument('--model', default='large-v3',
                        help='tiny | base | small | medium | large-v3')
    parser.add_argument('--device', default='auto',
                        help='auto | cpu | cuda')
    args = parser.parse_args()

    out_dir = Path(args.out)
    out_dir.mkdir(exist_ok=True)
    audio_dir = out_dir / '_audio'
    audio_dir.mkdir(exist_ok=True)
    md_dir = out_dir / 'markdown'
    md_dir.mkdir(exist_ok=True)

    print(f'Loading model: {args.model} on {args.device}...')
    compute = 'float16' if args.device == 'cuda' else 'int8'
    model = WhisperModel(args.model, device=args.device, compute_type=compute)

    videos = get_playlist_videos(args.playlist_url)
    print(f'Playlist has {len(videos)} videos')
    for i, (vid, title) in enumerate(videos, 1):
        out_md = md_dir / f'{vid}.md'
        if out_md.exists():
            print(f'[{i}/{len(videos)}] SKIP {title} (already done)')
            continue
        print(f'[{i}/{len(videos)}] {title}')
        try:
            audio = download_audio(vid, audio_dir)
            segments = transcribe(audio, model)
            md = segments_to_markdown(title, vid, segments)
            out_md.write_text(md, encoding='utf-8')
            print(f'  -> {out_md}')
        except Exception as e:
            print(f'  FAIL: {e}', file=sys.stderr)
if __name__ == '__main__':
    main()

Save as batch_transcribe.py. Run:
python batch_transcribe.py 'https://www.youtube.com/playlist?list=PLxxxxx' \
  --out my-channel-transcripts --model large-v3 --device cuda

What this produces
For a 50-video playlist, you get a folder structure like:
my-channel-transcripts/
├── _audio/              # cached MP3s (delete afterwards if you want)
└── markdown/
    ├── dQw4w9WgXcQ.md
    ├── jNQXAC9IVRw.md
    └── ...
Each .md file follows the same structure as the MDisBetter web tool output: H1 title, source link, duration, H2 sections every 5 minutes with timestamps, transcript prose. Drop the folder into Obsidian, Notion, or any Markdown system.
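One small addition many people make at this point is an index file linking every transcript, so the folder has an entry point in Obsidian or Notion. A sketch, assuming the `markdown/` layout above (`write_index` is my own helper, not part of the script):

```python
from pathlib import Path

def write_index(md_dir: Path) -> Path:
    """Build an index.md linking every transcript, titled from each file's H1."""
    lines = ['# Transcript index', '']
    for md in sorted(md_dir.glob('*.md')):
        if md.name == 'index.md':  # don't index the index on re-runs
            continue
        text = md.read_text(encoding='utf-8')
        first = text.splitlines()[0] if text else ''
        title = first.lstrip('# ').strip() or md.stem
        lines.append(f'- [{title}]({md.name})')
    index = md_dir / 'index.md'
    index.write_text('\n'.join(lines) + '\n', encoding='utf-8')
    return index
```

Run it once after the batch finishes: `write_index(Path('my-channel-transcripts/markdown'))`.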
Performance numbers
Per minute of audio, transcription speed varies wildly with hardware and model size:
| Hardware | Model | Speed (audio min / wall-clock min) |
|---|---|---|
| NVIDIA RTX 4090 | large-v3 | ~15x real-time (4 min wall-clock per hour of audio) |
| NVIDIA RTX 4070 | large-v3 | ~8x real-time (~7-8 min per hour) |
| NVIDIA RTX 3060 | large-v3 | ~5x real-time (~12 min per hour) |
| Apple M3 Max | large-v3 | ~3-4x real-time (~16 min per hour) |
| Apple M2 / M1 Pro | medium | ~2x real-time (~30 min per hour) |
| Modern CPU only (i7/Ryzen 7) | medium | ~0.5-1x real-time (~1-2 hours per hour of audio) |
| Modern CPU only | large-v3 | ~0.2-0.3x real-time — slow, use medium instead |
For a 50-video playlist averaging 30 minutes per video (25 hours of audio total): GPU finishes in 2-5 hours, CPU finishes overnight. Both are workable; GPU is just much more comfortable.
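That estimate is plain arithmetic you can redo for your own playlist; a throwaway helper (`eta_minutes` is my own naming, not part of the script):

```python
def eta_minutes(audio_hours: float, speed_x: float) -> float:
    """Wall-clock minutes to transcribe, given a real-time speed multiplier."""
    return audio_hours * 60 / speed_x

# 25 hours of audio at ~15x real-time (RTX 4090) vs ~5x (RTX 3060):
print(eta_minutes(25, 15))  # 100.0 minutes, under 2 hours
print(eta_minutes(25, 5))   # 300.0 minutes, the 5-hour end of the range
```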
Speed/accuracy tradeoffs by model
| Model | Size | Speed | Accuracy on clean audio |
|---|---|---|---|
| tiny | 39M | 10-30x real-time | ~85% |
| base | 74M | 5-15x real-time | ~90% |
| small | 244M | 3-10x real-time | ~92% |
| medium | 769M | 1-5x real-time | ~95% |
| large-v3 | 1550M | 0.3-15x real-time (HW-dependent) | ~96-98% |
For batch, our recommendation: large-v3 if you have a GPU (the accuracy gain is worth it for permanent archives), medium if you're CPU-only (good enough for most uses, finishes in reasonable time).
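If you want the script to apply that policy automatically instead of taking `--model` at face value, a one-line helper does it (`pick_model` is my own naming; adjust the policy to taste):

```python
def pick_model(device: str) -> str:
    """large-v3 when a CUDA GPU is in play, medium otherwise."""
    return 'large-v3' if device == 'cuda' else 'medium'
```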
Adding speaker labels
The faster-whisper output above is text-only — no speaker diarization. To add speaker labels, layer on WhisperX:
pip install whisperx pyannote.audio

import whisperx
model = whisperx.load_model('large-v3', device='cuda', compute_type='float16')
audio = whisperx.load_audio('audio.mp3')
result = model.transcribe(audio, batch_size=16)
# Align
model_a, metadata = whisperx.load_align_model(
    language_code=result['language'], device='cuda')
result = whisperx.align(result['segments'], model_a, metadata, audio, 'cuda')

# Diarize
diarize_model = whisperx.DiarizationPipeline(
    use_auth_token='YOUR_HF_TOKEN', device='cuda')
diarize_segments = diarize_model(audio)
result = whisperx.assign_word_speakers(diarize_segments, result)

WhisperX needs a free HuggingFace token for the pyannote diarization models. Setup is fiddlier, but the result is multi-speaker transcripts at quality matching the best paid services.
Batch tips
Resumability is built in
The script above checks for existing .md files and skips them, which gives you resumability for free. If transcription crashes after video 37 of 50, re-running the script picks up at video 38.
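If you also want a record of which videos failed (deleted or private videos, network errors), a small JSON-lines manifest complements the skip check. A sketch; `record_failure` and the file format are my own convention, hooked into the script's `except` block:

```python
import json
from pathlib import Path

def record_failure(manifest: Path, video_id: str, error: str) -> None:
    """Append one failed video per line so a later run can retry just those."""
    with manifest.open('a', encoding='utf-8') as f:
        f.write(json.dumps({'id': video_id, 'error': error}) + '\n')
```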
Delete audio cache after transcription
The _audio/ folder gets large (MP3 ~1MB per minute of audio). Once transcripts are done and you've verified them, rm -rf _audio/ reclaims the space.
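If you'd rather not delete the cache by hand, a guarded cleanup removes `_audio/` only once every MP3 has a matching transcript. A sketch mirroring the script's folder layout (`cleanup_audio` is my own helper):

```python
import shutil
from pathlib import Path

def cleanup_audio(out_dir: Path) -> bool:
    """Delete the _audio/ cache only if every MP3 has a finished transcript."""
    audio_dir, md_dir = out_dir / '_audio', out_dir / 'markdown'
    done = all((md_dir / f'{mp3.stem}.md').exists()
               for mp3 in audio_dir.glob('*.mp3'))
    if done:
        shutil.rmtree(audio_dir)
    return done
```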
Parallel processing with care
You can speed up the loop by transcribing N files in parallel — but you'll quickly saturate the GPU. For most setups, single-threaded transcription is already GPU-bound; parallel just thrashes. Where parallel helps: download (yt-dlp) running in parallel with transcription on the previous video. Easiest implementation: producer-consumer with Python's concurrent.futures.
Respect YouTube's rate limits
yt-dlp can hit rate limits if you hammer the API. Default delays are usually fine for casual playlist downloads. For large channels (500+ videos), spread across multiple days or use yt-dlp's --sleep-interval flag.
What about the language?
The script auto-detects language per video via Whisper's built-in detector. For multi-language playlists this works correctly. If you know all videos are in one language, force it with language='en' in the transcribe call — saves 1-2 seconds per video.
Comparison: web tool vs OSS batch
| Aspect | MDisBetter web tool | OSS batch |
|---|---|---|
| Setup time | 0 — paste URL | 30-60 min first time |
| Per-video time (manual) | ~1 min click + 1-3 min wait | 0 (script runs unattended) |
| Quality | ~94% (clean audio) | ~96-98% with large-v3 |
| Speaker labels | Yes, default | Optional via WhisperX |
| Markdown structure | Default | You write the formatter |
| Cost | Free tier or paid | Free (your hardware) |
| Privacy | Server-side processing | Fully local |
| Best for | Few videos, no setup | 50+ videos, automation |
Recommendation
Under 10 videos: use the web tool, manual paste-and-convert. 10-50 videos: split into multiple sessions with the web tool, or spend a couple of hours setting up the OSS pipeline (the script is reusable forever after). 50+ videos or recurring batch needs: the OSS pipeline pays for itself on the first run. For knowledge-system integration after batch transcription, see the Obsidian video vault setup; for individual workflows, see the 5-methods comparison and the tool benchmark. And for non-YouTube batch jobs (uploaded MP4s, Zoom recordings, podcast audio), the same script works: skip the yt-dlp step and feed local files straight to the transcriber.