Batch Transcribe an Entire YouTube Playlist to Markdown
For one or two videos, paste-and-convert is fine. For 50, 100, or an entire YouTube channel's back catalog, you need a script. The MDisBetter web tool processes one video at a time — that's the right surface for ad-hoc use, not for batch. For batch, the right answer is open-source: yt-dlp to fetch, faster-whisper to transcribe, a Python loop to glue them together. Total cost: $0 plus your hardware time. Here is the working pipeline, the GPU vs CPU performance numbers, and the output structure that matches what the web tool ships.
Be honest about the web tool's limits
MDisBetter's video-to-Markdown tool is built for one-at-a-time conversions. It doesn't expose batch upload, has no API or CLI, and doesn't accept playlist URLs. For a couple of videos, it's the fastest path. For 50+, it's the wrong tool, and pretending otherwise wastes your time.
The OSS path scales to thousands of videos with no per-video cost. It does require Python, basic command-line comfort, and ideally a GPU for reasonable speed. If those are dealbreakers, the right move is to spread the work across the web tool over multiple days (a few videos a day) rather than try to force batch into a non-batch tool.
The stack
- yt-dlp — the modern, actively-maintained successor to youtube-dl. Downloads any YouTube video or playlist, extracts audio.
- faster-whisper — CTranslate2 reimplementation of OpenAI's Whisper. 3-4x faster than the reference whisper package with the same accuracy.
- Python 3.10+ — for the loop and Markdown formatting
All permissively licensed (yt-dlp is Unlicense, faster-whisper is MIT). All free. All run locally.
Setup
# Install
pip install yt-dlp faster-whisper
# Verify yt-dlp
yt-dlp --version
# Test on one video
yt-dlp -x --audio-format mp3 -o 'test.%(ext)s' \
  'https://www.youtube.com/watch?v=dQw4w9WgXcQ'

For GPU acceleration on NVIDIA cards, also install the CUDA toolkit and the matching cuDNN version (the faster-whisper docs have the compatibility matrix). On Apple Silicon, faster-whisper runs on the CPU (CTranslate2 has no Metal backend), but its CPU path is well optimized there. CPU-only works on any platform without extra setup.
The full pipeline
#!/usr/bin/env python3
"""Batch transcribe a YouTube playlist to structured Markdown files."""
import argparse
import json
import subprocess
import sys
from pathlib import Path
from faster_whisper import WhisperModel
def get_playlist_videos(playlist_url):
    """Return list of (video_id, title) tuples for the playlist."""
    cmd = ['yt-dlp', '--flat-playlist', '-J', playlist_url]
    result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    data = json.loads(result.stdout)
    return [(e['id'], e['title']) for e in data.get('entries', [])]


def download_audio(video_id, audio_dir):
    """Download just the audio for a single video. Returns the file path."""
    output = audio_dir / f'{video_id}.mp3'
    if output.exists():
        return output
    cmd = [
        'yt-dlp', '-x', '--audio-format', 'mp3',
        '-o', str(output.with_suffix('.%(ext)s')),
        f'https://www.youtube.com/watch?v={video_id}',
    ]
    subprocess.run(cmd, check=True)
    return output


def transcribe(audio_path, model):
    """Transcribe an audio file. Returns list of (start, end, text) segments."""
    segments, info = model.transcribe(
        str(audio_path),
        beam_size=5,
        language=None,  # auto-detect
    )
    return [(s.start, s.end, s.text.strip()) for s in segments]


def segments_to_markdown(title, video_id, segments):
    """Format segments as structured Markdown with H2 sections every ~5 minutes."""
    lines = [f'# {title}', '']
    lines.append(f'**Source:** https://www.youtube.com/watch?v={video_id}')
    if segments:
        duration = segments[-1][1]
        m, s = divmod(int(duration), 60)
        h, m = divmod(m, 60)
        if h:
            lines.append(f'**Duration:** {h}:{m:02d}:{s:02d}')
        else:
            lines.append(f'**Duration:** {m}:{s:02d}')
    lines.append('')
    # Group segments into sections every 5 minutes
    section_seconds = 300
    current_section = -1
    for start, end, text in segments:
        section = int(start // section_seconds)
        if section != current_section:
            current_section = section
            mm = int(start // 60)
            ss = int(start % 60)
            lines.append('')
            lines.append(f'## [{mm:02d}:{ss:02d}] Section {section + 1}')
            lines.append('')
        lines.append(text)
    return '\n'.join(lines)


def main():
    parser = argparse.ArgumentParser()
    parser.add_argument('playlist_url')
    parser.add_argument('--out', default='transcripts')
    parser.add_argument('--model', default='large-v3',
                        help='tiny | base | small | medium | large-v3')
    parser.add_argument('--device', default='auto',
                        help='auto | cpu | cuda')
    args = parser.parse_args()

    out_dir = Path(args.out)
    out_dir.mkdir(exist_ok=True)
    audio_dir = out_dir / '_audio'
    audio_dir.mkdir(exist_ok=True)
    md_dir = out_dir / 'markdown'
    md_dir.mkdir(exist_ok=True)

    print(f'Loading model: {args.model} on {args.device}...')
    compute = 'float16' if args.device == 'cuda' else 'int8'
    model = WhisperModel(args.model, device=args.device, compute_type=compute)

    videos = get_playlist_videos(args.playlist_url)
    print(f'Playlist has {len(videos)} videos')
    for i, (vid, title) in enumerate(videos, 1):
        out_md = md_dir / f'{vid}.md'
        if out_md.exists():
            print(f'[{i}/{len(videos)}] SKIP {title} (already done)')
            continue
        print(f'[{i}/{len(videos)}] {title}')
        try:
            audio = download_audio(vid, audio_dir)
            segments = transcribe(audio, model)
            md = segments_to_markdown(title, vid, segments)
            out_md.write_text(md, encoding='utf-8')
            print(f'  -> {out_md}')
        except Exception as e:
            print(f'  FAIL: {e}', file=sys.stderr)
if __name__ == '__main__':
    main()

Save as batch_transcribe.py. Run:
python batch_transcribe.py 'https://www.youtube.com/playlist?list=PLxxxxx' \
  --out my-channel-transcripts --model large-v3 --device cuda

What this produces
For a 50-video playlist, you get a folder structure like:
my-channel-transcripts/
├── _audio/              # cached MP3s (delete afterwards if you want)
└── markdown/
    ├── dQw4w9WgXcQ.md
    ├── jNQXAC9IVRw.md
    └── ...
Each .md file follows the same structure as the MDisBetter web tool output: H1 title, source link, duration, H2 sections every 5 minutes with timestamps, transcript prose. Drop the folder into Obsidian, Notion, or any Markdown system.
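One small addition many people make at this point is an index file linking every transcript, so the folder has an entry point in Obsidian or Notion. A sketch, assuming the `markdown/` layout above (`write_index` is my own helper, not part of the script):

```python
from pathlib import Path

def write_index(md_dir: Path) -> Path:
    """Build an index.md linking every transcript, titled from each file's H1."""
    lines = ['# Transcript index', '']
    for md in sorted(md_dir.glob('*.md')):
        if md.name == 'index.md':  # don't index the index on re-runs
            continue
        text = md.read_text(encoding='utf-8')
        first = text.splitlines()[0] if text else ''
        title = first.lstrip('# ').strip() or md.stem
        lines.append(f'- [{title}]({md.name})')
    index = md_dir / 'index.md'
    index.write_text('\n'.join(lines) + '\n', encoding='utf-8')
    return index
```

Run it once after the batch finishes: `write_index(Path('my-channel-transcripts/markdown'))`.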
Performance numbers
Per minute of audio, transcription speed varies wildly with hardware and model size:
| Hardware | Model | Speed (audio min / wall-clock min) |
|---|---|---|
| NVIDIA RTX 4090 | large-v3 | ~15x real-time (4 min wall-clock per hour of audio) |
| NVIDIA RTX 4070 | large-v3 | ~8x real-time (~7-8 min per hour) |
| NVIDIA RTX 3060 | large-v3 | ~5x real-time (~12 min per hour) |
| Apple M3 Max | large-v3 | ~3-4x real-time (~16 min per hour) |
| Apple M2 / M1 Pro | medium | ~2x real-time (~30 min per hour) |
| Modern CPU only (i7/Ryzen 7) | medium | ~0.5-1x real-time (~1-2 hours per hour of audio) |
| Modern CPU only | large-v3 | ~0.2-0.3x real-time — slow, use medium instead |
For a 50-video playlist averaging 30 minutes per video (25 hours of audio total): GPU finishes in 2-5 hours, CPU finishes overnight. Both are workable; GPU is just much more comfortable.
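That estimate is plain arithmetic you can redo for your own playlist; a throwaway helper (`eta_minutes` is my own naming, not part of the script):

```python
def eta_minutes(audio_hours: float, speed_x: float) -> float:
    """Wall-clock minutes to transcribe, given a real-time speed multiplier."""
    return audio_hours * 60 / speed_x

# 25 hours of audio at ~15x real-time (RTX 4090) vs ~5x (RTX 3060):
print(eta_minutes(25, 15))  # 100.0 minutes, under 2 hours
print(eta_minutes(25, 5))   # 300.0 minutes, the 5-hour end of the range
```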
Speed/accuracy tradeoffs by model
| Model | Size | Speed | Accuracy on clean audio |
|---|---|---|---|
| tiny | 39M | 10-30x real-time | ~85% |
| base | 74M | 5-15x real-time | ~90% |
| small | 244M | 3-10x real-time | ~92% |
| medium | 769M | 1-5x real-time | ~95% |
| large-v3 | 1550M | 0.3-15x real-time (HW-dependent) | ~96-98% |
For batch, our recommendation: large-v3 if you have a GPU (the accuracy gain is worth it for permanent archives), medium if you're CPU-only (good enough for most uses, finishes in reasonable time).
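If you want the script to apply that policy automatically instead of taking `--model` at face value, a one-line helper does it (`pick_model` is my own naming; adjust the policy to taste):

```python
def pick_model(device: str) -> str:
    """large-v3 when a CUDA GPU is in play, medium otherwise."""
    return 'large-v3' if device == 'cuda' else 'medium'
```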
Adding speaker labels
The faster-whisper output above is text-only — no speaker diarization. To add speaker labels, layer on WhisperX:
pip install whisperx pyannote.audio

import whisperx
model = whisperx.load_model('large-v3', device='cuda', compute_type='float16')
audio = whisperx.load_audio('audio.mp3')
result = model.transcribe(audio, batch_size=16)
# Align
model_a, metadata = whisperx.load_align_model(
    language_code=result['language'], device='cuda')
result = whisperx.align(result['segments'], model_a, metadata, audio, 'cuda')

# Diarize
diarize_model = whisperx.DiarizationPipeline(
    use_auth_token='YOUR_HF_TOKEN', device='cuda')
diarize_segments = diarize_model(audio)
result = whisperx.assign_word_speakers(diarize_segments, result)

WhisperX needs a free HuggingFace token for the pyannote diarization models. Setup is fiddlier, but the result is multi-speaker transcripts at quality matching the best paid services.
Batch tips
Resumability is built in
The script above checks for existing .md files and skips them, which gives you resumability for free. If transcription crashes after video 37 of 50, re-running the script picks up at video 38.
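If you also want a record of which videos failed (deleted or private videos, network errors), a small JSON-lines manifest complements the skip check. A sketch; `record_failure` and the file format are my own convention, hooked into the script's `except` block:

```python
import json
from pathlib import Path

def record_failure(manifest: Path, video_id: str, error: str) -> None:
    """Append one failed video per line so a later run can retry just those."""
    with manifest.open('a', encoding='utf-8') as f:
        f.write(json.dumps({'id': video_id, 'error': error}) + '\n')
```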
Delete audio cache after transcription
The _audio/ folder gets large (MP3 ~1MB per minute of audio). Once transcripts are done and you've verified them, rm -rf _audio/ reclaims the space.
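If you'd rather not delete the cache by hand, a guarded cleanup removes `_audio/` only once every MP3 has a matching transcript. A sketch mirroring the script's folder layout (`cleanup_audio` is my own helper):

```python
import shutil
from pathlib import Path

def cleanup_audio(out_dir: Path) -> bool:
    """Delete the _audio/ cache only if every MP3 has a finished transcript."""
    audio_dir, md_dir = out_dir / '_audio', out_dir / 'markdown'
    done = all((md_dir / f'{mp3.stem}.md').exists()
               for mp3 in audio_dir.glob('*.mp3'))
    if done:
        shutil.rmtree(audio_dir)
    return done
```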
Parallel processing with care
You can speed up the loop by transcribing N files in parallel — but you'll quickly saturate the GPU. For most setups, single-threaded transcription is already GPU-bound; parallel just thrashes. Where parallel helps: download (yt-dlp) running in parallel with transcription on the previous video. Easiest implementation: producer-consumer with Python's concurrent.futures.
Respect YouTube's rate limits
yt-dlp can hit rate limits if you hammer the API. Default delays are usually fine for casual playlist downloads. For large channels (500+ videos), spread across multiple days or use yt-dlp's --sleep-interval flag.
What about the language?
The script auto-detects language per video via Whisper's built-in detector. For multi-language playlists this works correctly. If you know all videos are in one language, force it with language='en' in the transcribe call — saves 1-2 seconds per video.
Comparison: web tool vs OSS batch
| Aspect | MDisBetter web tool | OSS batch |
|---|---|---|
| Setup time | 0 — paste URL | 30-60 min first time |
| Per-video time (manual) | ~1 min click + 1-3 min wait | 0 (script runs unattended) |
| Quality | ~94% (clean audio) | ~96-98% with large-v3 |
| Speaker labels | Yes, default | Optional via WhisperX |
| Markdown structure | Default | You write the formatter |
| Cost | Free tier or paid | Free (your hardware) |
| Privacy | Server-side processing | Fully local |
| Best for | Few videos, no setup | 50+ videos, automation |
Recommendation
Under 10 videos: use the web tool, manual paste-and-convert. 10-50 videos: split into multiple sessions with the web tool, or spend a couple of hours setting up the OSS pipeline (the script is reusable forever after). 50+ videos or recurring batch needs: the OSS pipeline pays for itself on the first run. For knowledge-system integration after batch transcription, see the Obsidian video vault setup; for individual workflows, see the 5-methods comparison and the tool benchmark. And for non-YouTube batch jobs (uploaded MP4s, Zoom recordings, podcast audio), the same script works: skip the yt-dlp step and feed local files straight to the transcriber.