MDisBetter · 10 min read

Batch Transcribe an Entire YouTube Playlist to Markdown

For one or two videos, paste-and-convert is fine. For 50, 100, or an entire YouTube channel's back catalog, you need a script. The MDisBetter web tool processes one video at a time — that's the right surface for ad-hoc use, not for batch. For batch, the right answer is open-source: yt-dlp to fetch, faster-whisper to transcribe, a Python loop to glue them together. Total cost: $0 plus your hardware time. Here is the working pipeline, the GPU vs CPU performance numbers, and the output structure that matches what the web tool ships.

Be honest about the web tool's limits

MDisBetter's video-to-Markdown converter is built for one-at-a-time conversions. It doesn't expose a batch upload, doesn't have an API or CLI, and doesn't accept playlist URLs. For a couple of videos it's the fastest path; for 50+ it's the wrong tool, and pretending otherwise wastes your time.

The OSS path scales to thousands of videos with no per-video cost. It does require Python, basic command-line comfort, and ideally a GPU for reasonable speed. If those are dealbreakers, the right move is to spread the work across the web tool over multiple days (a few videos a day) rather than try to force batch into a non-batch tool.

The stack

The whole stack is open-source and permissively licensed: yt-dlp to fetch audio, faster-whisper to transcribe, and a short Python script to glue them together. All free. All run locally.

Setup

# Install (yt-dlp's audio extraction also needs ffmpeg on your PATH)
pip install yt-dlp faster-whisper

# Verify yt-dlp
yt-dlp --version

# Test on one video
yt-dlp -x --audio-format mp3 -o 'test.%(ext)s' \
  'https://www.youtube.com/watch?v=dQw4w9WgXcQ'

For GPU acceleration with NVIDIA cards, also install the CUDA toolkit and the matching cuDNN version (the faster-whisper docs have the compatibility matrix). On Apple Silicon, faster-whisper's CTranslate2 backend runs on the CPU (it has no Metal GPU backend) but is well optimized for ARM; device="auto" picks the best available option. CPU-only works on any platform without extra setup.
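If you'd rather resolve the device explicitly instead of passing auto, faster-whisper's CTranslate2 backend can report GPU availability. A small sketch (assumes the ctranslate2 package, which faster-whisper installs as a dependency):

```python
def pick_device():
    """Return 'cuda' when CTranslate2 can see an NVIDIA GPU, else 'cpu'."""
    try:
        import ctranslate2
        if ctranslate2.get_cuda_device_count() > 0:
            return 'cuda'
    except Exception:
        pass  # ctranslate2 missing or CUDA probe failed: fall back to CPU
    return 'cpu'

print(pick_device())
```

Pass the result as the --device argument to the script below.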

The full pipeline

#!/usr/bin/env python3
"""Batch transcribe a YouTube playlist to structured Markdown files."""

import argparse
import json
import subprocess
import sys
from pathlib import Path
from faster_whisper import WhisperModel

def get_playlist_videos(playlist_url):
    """Return list of (video_id, title) tuples for the playlist."""
    cmd = ['yt-dlp', '--flat-playlist', '-J', playlist_url]
    result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    data = json.loads(result.stdout)
    entries = data.get('entries') or []
    # Deleted/private videos can appear as null entries; skip them
    return [(e['id'], e['title']) for e in entries if e]

def download_audio(video_id, audio_dir):
    """Download just the audio for a single video. Returns the file path."""
    output = audio_dir / f'{video_id}.mp3'
    if output.exists():
        return output
    cmd = [
        'yt-dlp', '-x', '--audio-format', 'mp3',
        '-o', str(output.with_suffix('.%(ext)s')),
        f'https://www.youtube.com/watch?v={video_id}',
    ]
    subprocess.run(cmd, check=True)
    return output

def transcribe(audio_path, model):
    """Transcribe an audio file. Returns list of (start, end, text) segments."""
    segments, info = model.transcribe(
        str(audio_path),
        beam_size=5,
        language=None,  # auto-detect
    )
    return [(s.start, s.end, s.text.strip()) for s in segments]

def segments_to_markdown(title, video_id, segments):
    """Format segments as structured Markdown with H2 sections every ~5 minutes."""
    lines = [f'# {title}', '']
    lines.append(f'**Source:** https://www.youtube.com/watch?v={video_id}')
    if segments:
        duration = segments[-1][1]
        m, s = divmod(int(duration), 60)
        h, m = divmod(m, 60)
        if h:
            lines.append(f'**Duration:** {h}:{m:02d}:{s:02d}')
        else:
            lines.append(f'**Duration:** {m}:{s:02d}')
    lines.append('')

    # Group segments into sections every 5 minutes
    section_seconds = 300
    current_section = -1
    for start, end, text in segments:
        section = int(start // section_seconds)
        if section != current_section:
            current_section = section
            mm = int(start // 60)
            ss = int(start % 60)
            lines.append('')
            lines.append(f'## [{mm:02d}:{ss:02d}] Section {section + 1}')
            lines.append('')
        lines.append(text)
    return '\n'.join(lines)

def main():
    parser = argparse.ArgumentParser()
    parser.add_argument('playlist_url')
    parser.add_argument('--out', default='transcripts')
    parser.add_argument('--model', default='large-v3',
                        help='tiny | base | small | medium | large-v3')
    parser.add_argument('--device', default='auto',
                        help='auto | cpu | cuda')
    args = parser.parse_args()

    out_dir = Path(args.out)
    out_dir.mkdir(exist_ok=True)
    audio_dir = out_dir / '_audio'
    audio_dir.mkdir(exist_ok=True)
    md_dir = out_dir / 'markdown'
    md_dir.mkdir(exist_ok=True)

    print(f'Loading model: {args.model} on {args.device}...')
    # int8 quantization on CPU; float16 on CUDA GPUs
    compute = 'float16' if args.device == 'cuda' else 'int8'
    model = WhisperModel(args.model, device=args.device, compute_type=compute)

    videos = get_playlist_videos(args.playlist_url)
    print(f'Playlist has {len(videos)} videos')

    for i, (vid, title) in enumerate(videos, 1):
        out_md = md_dir / f'{vid}.md'
        if out_md.exists():
            print(f'[{i}/{len(videos)}] SKIP {title} (already done)')
            continue
        print(f'[{i}/{len(videos)}] {title}')
        try:
            audio = download_audio(vid, audio_dir)
            segments = transcribe(audio, model)
            md = segments_to_markdown(title, vid, segments)
            out_md.write_text(md, encoding='utf-8')
            print(f'  -> {out_md}')
        except Exception as e:
            print(f'  FAIL: {e}', file=sys.stderr)

if __name__ == '__main__':
    main()

Save as batch_transcribe.py. Run:

python batch_transcribe.py 'https://www.youtube.com/playlist?list=PLxxxxx' \
  --out my-channel-transcripts --model large-v3 --device cuda

What this produces

For a 50-video playlist, you get a folder structure like:

my-channel-transcripts/
├── _audio/                  # cached MP3s (delete after if you want)
├── markdown/
│   ├── dQw4w9WgXcQ.md
│   ├── jNQXAC9IVRw.md
│   ├── ...

Each .md file follows the same structure as the MDisBetter web tool output: H1 title, source link, duration, H2 sections every 5 minutes with timestamps, transcript prose. Drop the folder into Obsidian, Notion, or any Markdown system.
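For a concrete sense of the format, a generated file looks roughly like this (title and transcript text are invented placeholders):

```markdown
# How Compilers Work, Part 1

**Source:** https://www.youtube.com/watch?v=dQw4w9WgXcQ
**Duration:** 14:32

## [00:00] Section 1

Welcome back to the series. Today we're looking at lexing...

## [05:00] Section 2

Now that we have a token stream, parsing turns it into a tree...
```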

Performance numbers

Transcription speed varies wildly with hardware and model size:

| Hardware | Model | Speed (audio time / wall-clock time) |
|---|---|---|
| NVIDIA RTX 4090 | large-v3 | ~15x real-time (~4 min wall-clock per hour of audio) |
| NVIDIA RTX 4070 | large-v3 | ~8x real-time (~7-8 min per hour) |
| NVIDIA RTX 3060 | large-v3 | ~5x real-time (~12 min per hour) |
| Apple M3 Max | large-v3 | ~3-4x real-time (~16 min per hour) |
| Apple M2 / M1 Pro | medium | ~2x real-time (~30 min per hour) |
| Modern CPU only (i7/Ryzen 7) | medium | ~0.5-1x real-time (~1-2 hours per hour of audio) |
| Modern CPU only (i7/Ryzen 7) | large-v3 | ~0.2-0.3x real-time (slow; use medium instead) |

For a 50-video playlist averaging 30 minutes per video (25 hours of audio total): GPU finishes in 2-5 hours, CPU finishes overnight. Both are workable; GPU is just much more comfortable.
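To estimate your own run, the arithmetic is just total audio time divided by the real-time multiplier from the table above. A throwaway helper (names are mine, not part of the script):

```python
def wall_clock_hours(audio_hours, realtime_multiplier):
    """Estimated transcription wall-clock time, in hours."""
    return audio_hours / realtime_multiplier

# 50 videos x 30 min each = 25 hours of audio
for label, speed in [('RTX 4090, large-v3', 15), ('RTX 3060, large-v3', 5),
                     ('CPU, medium', 0.75)]:
    print(f'{label}: ~{wall_clock_hours(25, speed):.1f} h')
```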

Speed/accuracy tradeoffs by model

| Model | Size | Speed | Accuracy on clean audio |
|---|---|---|---|
| tiny | 39M | 10-30x real-time | ~85% |
| base | 74M | 5-15x real-time | ~90% |
| small | 244M | 3-10x real-time | ~92% |
| medium | 769M | 1-5x real-time | ~95% |
| large-v3 | 1550M | 0.3-15x real-time (hardware-dependent) | ~96-98% |

For batch, our recommendation: large-v3 if you have a GPU (the accuracy gain is worth it for permanent archives), medium if you're CPU-only (good enough for most uses, finishes in reasonable time).

Adding speaker labels

The faster-whisper output above is text-only — no speaker diarization. To add speaker labels, layer on WhisperX:

pip install whisperx pyannote.audio

import whisperx

model = whisperx.load_model('large-v3', device='cuda', compute_type='float16')
audio = whisperx.load_audio('audio.mp3')
result = model.transcribe(audio, batch_size=16)

# Align
model_a, metadata = whisperx.load_align_model(
    language_code=result['language'], device='cuda')
result = whisperx.align(result['segments'], model_a, metadata, audio, 'cuda')

# Diarize
diarize_model = whisperx.DiarizationPipeline(
    use_auth_token='YOUR_HF_TOKEN', device='cuda')
diarize_segments = diarize_model(audio)
result = whisperx.assign_word_speakers(diarize_segments, result)

WhisperX needs a free HuggingFace token for the pyannote diarization models. Setup is fiddlier but the result is multi-speaker transcripts at quality matching the best paid services.
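WhisperX's assign_word_speakers tags each segment with a speaker key (SPEAKER_00, SPEAKER_01, ...). Assuming that shape, a sketch of a formatter that turns the diarized result into speaker-labeled Markdown:

```python
def diarized_to_markdown(title, segments):
    """Render diarized segments as Markdown, starting a new
    paragraph whenever the speaker changes.

    Each segment is assumed to be a dict with 'speaker' and 'text'
    keys, as in whisperx.assign_word_speakers output; segments with
    no detected speaker fall back to 'UNKNOWN'.
    """
    lines = [f'# {title}']
    last_speaker = None
    for seg in segments:
        speaker = seg.get('speaker', 'UNKNOWN')
        if speaker != last_speaker:
            lines += ['', f'**{speaker}:**']
            last_speaker = speaker
        lines.append(seg['text'].strip())
    return '\n'.join(lines)
```

Feed it result['segments'] from the WhisperX snippet above in place of the plain formatter in the batch script.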

Batch tips

Resumability comes free

The script above checks for existing .md files and skips them, which gives you resumability with no extra bookkeeping. If transcription crashes after video 37 of 50, re-running the script picks up at video 38.

Delete audio cache after transcription

The _audio/ folder gets large (MP3 ~1MB per minute of audio). Once transcripts are done and you've verified them, rm -rf _audio/ reclaims the space.

Parallel processing with care

You can speed up the loop by transcribing N files in parallel — but you'll quickly saturate the GPU. For most setups, single-threaded transcription is already GPU-bound; parallel just thrashes. Where parallel helps: download (yt-dlp) running in parallel with transcription on the previous video. Easiest implementation: producer-consumer with Python's concurrent.futures.
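That producer-consumer overlap can be sketched in a dozen lines; the download/transcribe stubs below stand in for download_audio and the model call from the script above:

```python
from concurrent.futures import ThreadPoolExecutor

def download(video_id):
    """Stub standing in for download_audio() from the script."""
    return f'{video_id}.mp3'

def transcribe(path):
    """Stub standing in for the GPU-bound transcription step."""
    return f'transcript of {path}'

def pipeline(video_ids):
    """Download video N+1 in the background while transcribing video N."""
    results = []
    with ThreadPoolExecutor(max_workers=1) as pool:
        future = pool.submit(download, video_ids[0])
        for nxt in list(video_ids[1:]) + [None]:
            audio = future.result()                  # wait for current download
            if nxt is not None:
                future = pool.submit(download, nxt)  # prefetch the next one
            results.append(transcribe(audio))        # GPU works meanwhile
    return results
```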

Respect YouTube's rate limits

yt-dlp can trigger YouTube's rate limiting if you hammer it. The default delays are usually fine for casual playlist downloads. For large channels (500+ videos), spread runs across multiple days or use yt-dlp's --sleep-interval flag.

What about the language?

The script auto-detects language per video via Whisper's built-in detector. For multi-language playlists this works correctly. If you know all videos are in one language, force it with language='en' in the transcribe call — saves 1-2 seconds per video.

Comparison: web tool vs OSS batch

| Aspect | MDisBetter web tool | OSS batch |
|---|---|---|
| Setup time | None (paste a URL) | 30-60 min first time |
| Per-video time (manual) | ~1 min of clicks + 1-3 min wait | None (script runs unattended) |
| Quality | ~94% (clean audio) | ~96-98% with large-v3 |
| Speaker labels | Yes, by default | Optional via WhisperX |
| Markdown structure | Default | You write the formatter |
| Cost | Free tier or paid | Free (your hardware) |
| Privacy | Server-side processing | Fully local |
| Best for | A few videos, no setup | 50+ videos, automation |

Recommendation

Under 10 videos: use the web tool, manual paste-and-convert. 10-50 videos: split the work into multiple web-tool sessions, or spend a couple of hours setting up the OSS pipeline (the script is reusable forever after). 50+ videos or recurring batch needs: the OSS pipeline pays for itself on the first run. For knowledge-system integration after batch transcription, see Obsidian video vault setup; for individual workflows, see 5 methods compared and the tool benchmark. The same script also covers non-YouTube batches (uploaded MP4s, Zoom recordings, standalone audio): skip the yt-dlp download step and point the transcriber at your local files.

Frequently asked questions

Can I run this on a Mac without a dedicated GPU?
Yes. faster-whisper has no Metal GPU backend, but its CTranslate2 engine is well optimized for Apple Silicon CPUs: an M3 Max gets ~3-4x real-time on large-v3, M2/M1 closer to 1-2x real-time on medium. Set device='auto' in faster-whisper and it picks the best available backend. On older Intel Macs, fall back to the medium model: slower, but still usable for an overnight batch.
What if a video is age-restricted or region-locked?
yt-dlp handles many of these cases with the --cookies-from-browser flag, which passes your logged-in YouTube cookies to the download. For age-gated videos this typically works. Region-locked videos require a VPN at the network level; yt-dlp can't bypass geographic restrictions on its own. Unlisted videos work with the direct URL. Private videos need an account that has access (again via cookies); without access there is no workaround.
How do I keep this updated as I add new videos to the playlist?
Re-run the script periodically. Because it skips videos with existing .md files, only new videos get transcribed. For automation, schedule it via cron (Linux/Mac) or Task Scheduler (Windows) to run weekly. faster-whisper rarely needs updates for routine running, but update yt-dlp monthly (pip install -U yt-dlp): YouTube changes its backend occasionally, and yt-dlp ships fixes quickly.
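For the cron route, a weekly entry might look like this (paths and the playlist ID are placeholders; adjust to your setup):

```shell
# Run every Sunday at 03:00; append output to a log for later inspection
0 3 * * 0  cd /home/you && /usr/bin/python3 batch_transcribe.py 'https://www.youtube.com/playlist?list=PLxxxxx' --out my-channel-transcripts >> transcribe.log 2>&1
```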