9 min read · MDisBetter

Batch Transcribe Multiple Audio Files at Once

You have 50 voice memos, 30 podcast back catalog episodes, or 200 customer interviews you want transcribed. The mdisbetter web tool processes one file at a time — perfect for ad-hoc use, painful for batches. Here's the honest answer: for batch volume, run OSS Whisper locally with a Python script. Free, fast on a decent machine, and the recipe is short. Full working code inside.

The honest disclosure

The mdisbetter audio-to-markdown web tool is a one-file-at-a-time interface. It's optimized for ad-hoc conversions where you upload, click, download. There is no batch endpoint, no API, no CLI, no SDK. That's deliberate, for two reasons: most users only ever need one file at a time, and high-volume use is genuinely better served by running open-source speech-to-text models locally.

The right tool for batch is OpenAI Whisper (or its faster reimplementation, faster-whisper). Both are open-source, MIT-licensed, and produce equivalent or better quality on most content compared to the web tool. The setup cost is small; the per-file cost is essentially zero after that.

Practical thresholds: for a one-off batch under roughly 50 files, uploading one at a time to the web tool is tolerable. Above that, or for any recurring batch, the local script wins decisively.

Setting up Whisper locally

Prerequisites

You need Python 3.9 or newer with pip, and FFmpeg installed and on your PATH (Whisper shells out to FFmpeg to decode audio). On macOS that's brew install ffmpeg; on Debian/Ubuntu, sudo apt install ffmpeg.

Install Whisper

pip install openai-whisper

That's the entire installation. The model weights download automatically on first use.

Test on a single file

whisper test.m4a --model large-v3 --output_format txt --output_dir transcripts/

The first run downloads the model (a few GB for large-v3). Subsequent runs use the cached weights. Output lands in the transcripts/ directory as test.txt. Note that the whisper CLI's output formats are txt, vtt, srt, tsv, and json; it doesn't emit Markdown directly, which is one more reason to use the Python script below, where we write the .md files ourselves.

The basic batch script

For straightforward batch processing of all audio files in a folder:

import whisper
from pathlib import Path

model = whisper.load_model("large-v3")
audio_dir = Path("./audio-inbox")
output_dir = Path("./transcripts")
output_dir.mkdir(exist_ok=True)

extensions = {".mp3", ".m4a", ".wav", ".ogg", ".flac", ".mp4"}

for audio_file in audio_dir.iterdir():
    if audio_file.suffix.lower() not in extensions:
        continue
    md_path = output_dir / f"{audio_file.stem}.md"
    if md_path.exists():
        print(f"SKIP {audio_file.name} (already transcribed)")
        continue
    print(f"PROCESSING {audio_file.name}...")
    result = model.transcribe(str(audio_file), verbose=False)
    md_path.write_text(
        f"# {audio_file.stem}\n\n{result['text']}\n",
        encoding="utf-8"
    )
    print(f"OK {md_path}")

print("Done.")

This is the simplest version. It produces a single Markdown file per audio with the full transcript as flowing text — no speaker labels, no section headings.
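If you'd rather have per-line timestamps than flowing text, the transcribe() result also carries a segments list with start/end times and text. A minimal sketch (the helper name is ours, not part of Whisper):

```python
def segments_to_markdown(title: str, segments: list) -> str:
    """Render Whisper segment dicts as Markdown lines prefixed with [MM:SS]."""
    lines = [f"# {title}\n"]
    for seg in segments:
        minutes, seconds = divmod(int(seg["start"]), 60)
        lines.append(f"[{minutes:02d}:{seconds:02d}] {seg['text'].strip()}")
    return "\n".join(lines) + "\n"
```

Swap it in for the write_text call in the loop: md_path.write_text(segments_to_markdown(audio_file.stem, result["segments"]), encoding="utf-8").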

Adding speaker diarization with WhisperX

For interviews, meetings, or any multi-speaker content, you want speaker labels. Plain Whisper doesn't include diarization; WhisperX adds it.

Install WhisperX

pip install whisperx

WhisperX requires a Hugging Face token (free) for the diarization models. Sign up at huggingface.co, generate a read token in your account settings, and accept the terms for the pyannote/speaker-diarization-3.1 model.

Batch script with diarization

import whisperx
import torch
import os
from pathlib import Path

DEVICE = "cuda" if torch.cuda.is_available() else "cpu"
COMPUTE_TYPE = "float16" if DEVICE == "cuda" else "int8"
HF_TOKEN = os.environ["HF_TOKEN"]  # set this in your shell before running

model = whisperx.load_model("large-v3", DEVICE, compute_type=COMPUTE_TYPE)
diarize_model = whisperx.DiarizationPipeline(use_auth_token=HF_TOKEN, device=DEVICE)

audio_dir = Path("./audio-inbox")
output_dir = Path("./transcripts")
output_dir.mkdir(exist_ok=True)

extensions = {".mp3", ".m4a", ".wav", ".ogg", ".flac", ".mp4"}

for audio_file in audio_dir.iterdir():
    if audio_file.suffix.lower() not in extensions:
        continue
    md_path = output_dir / f"{audio_file.stem}.md"
    if md_path.exists():
        continue
    print(f"PROCESSING {audio_file.name}...")

    audio = whisperx.load_audio(str(audio_file))
    result = model.transcribe(audio, batch_size=16)

    # alignment
    align_model, metadata = whisperx.load_align_model(
        language_code=result["language"], device=DEVICE
    )
    result = whisperx.align(
        result["segments"], align_model, metadata, audio, DEVICE
    )

    # diarization
    diarize_segments = diarize_model(audio)
    result = whisperx.assign_word_speakers(diarize_segments, result)

    # write Markdown with speaker labels
    lines = [f"# {audio_file.stem}\n"]
    current_speaker = None
    for segment in result["segments"]:
        speaker = segment.get("speaker", "UNKNOWN")
        if speaker != current_speaker:
            lines.append(f"\n**{speaker}:** {segment['text'].strip()}")
            current_speaker = speaker
        else:
            lines.append(segment["text"].strip())

    md_path.write_text("\n".join(lines), encoding="utf-8")
    print(f"OK {md_path}")

This produces Markdown with bold speaker labels (**SPEAKER_00:**, **SPEAKER_01:**) similar to the web tool's output. Run a find-and-replace pass after batch completion to rename speakers per file.
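That find-and-replace pass is easy to script as well. A sketch, assuming you build a per-file mapping after skimming each transcript; the names in the example mapping are placeholders:

```python
from pathlib import Path

def rename_speakers(md_path: str, names: dict) -> None:
    """Replace diarization labels like **SPEAKER_00:** with real names, in place."""
    path = Path(md_path)
    text = path.read_text(encoding="utf-8")
    for label, name in names.items():
        text = text.replace(f"**{label}:**", f"**{name}:**")
    path.write_text(text, encoding="utf-8")
```

Usage: rename_speakers("transcripts/interview-01.md", {"SPEAKER_00": "Interviewer", "SPEAKER_01": "Guest"}).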

Adding section headings with an LLM pass

The web tool's H2 section headings come from an LLM analysis of the transcript content. To replicate this in your batch pipeline, add a second pass after transcription that uses an LLM to insert headings at topic transitions.

import openai  # or anthropic, or any other LLM client

def add_section_headings(transcript_md: str) -> str:
    prompt = f"""Read this transcript. Insert H2 Markdown headings (## Topic) at natural topic shifts.

Rules:
- Insert headings only where the conversation clearly shifts to a new topic
- Headings should be substantive noun phrases (not 'Section 1' or 'New Topic')
- Preserve all existing content verbatim
- Do not add commentary or summary; just insert headings

Transcript:
{transcript_md}"""
    response = openai.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

Call this on each transcribed file after the diarization pass. The headings make the long transcripts navigable.
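A hedged sketch of that second pass as a driver loop. The heading function is passed in (use add_section_headings from above), and the skip check assumes already-processed files contain at least one H2:

```python
from pathlib import Path

def add_headings_pass(transcripts_dir: str, heading_fn) -> int:
    """Apply a heading-insertion function to every transcript, skipping done files."""
    done = 0
    for md_path in sorted(Path(transcripts_dir).glob("*.md")):
        text = md_path.read_text(encoding="utf-8")
        if "\n## " in text:  # crude idempotency check: file already has H2 headings
            continue
        md_path.write_text(heading_fn(text), encoding="utf-8")
        done += 1
    return done
```

Usage: add_headings_pass("transcripts", add_section_headings). The idempotency check keeps re-runs from paying for LLM calls twice.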

Note: this step costs LLM API tokens (small but non-zero). For batches of hundreds of files, the math is still favorable — typically pennies per file.

Performance and hardware

Realistic transcription speed by hardware, rough figures for large-v3: a modern NVIDIA GPU runs around 10x real-time; Apple Silicon Macs land in the 5-15x real-time range; CPU-only Intel machines manage roughly 3-5x real-time.

For batches under 50 files, even CPU-only is acceptable — leave it running while you do other work. For larger batches or production pipelines, GPU is worth the investment.

Memory considerations

The large-v3 model needs roughly 10GB of RAM (CPU) or 10GB of VRAM (GPU). For machines with less, fall back to medium (5GB) or small (2GB) models — accuracy is lower but still usable for many use cases.

Trade-off comparison on clean English audio: large-v3 (~10GB) gives the best accuracy; medium (~5GB) loses a little on difficult audio but stays close on clean speech; small (~2GB) is the fastest and still serviceable when the recording is clear.

Error handling for production batches

For batches of hundreds of files, robustness matters. Three patterns to add:

Skip-if-exists

Already in the basic script — check for the output .md before processing. Lets you re-run the script after partial failures without redoing work.

Try/except per file

failed_log = open("failed.log", "a", encoding="utf-8")  # open once, before the loop

# inside the per-file loop:
try:
    result = model.transcribe(str(audio_file))
    # ... write the Markdown output as before ...
except Exception as e:
    print(f"FAIL {audio_file.name}: {e}")
    failed_log.write(f"{audio_file.name}\t{e}\n")
    continue

Don't let one corrupt file kill the whole batch. Log failures, continue, retry the failed list at the end.
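Retrying is then a matter of parsing the log back into a worklist. A small helper, assuming the tab-separated failed.log format written above:

```python
from pathlib import Path

def read_failed(log_path: str = "failed.log") -> list:
    """Parse the tab-separated failure log into a de-duplicated filename list."""
    path = Path(log_path)
    if not path.exists():
        return []
    names = []
    for line in path.read_text(encoding="utf-8").splitlines():
        if line.strip():
            name = line.split("\t")[0]
            if name not in names:
                names.append(name)
    return names
```

Feed the returned names back through the transcription loop once the main batch finishes.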

Progress checkpointing

For very long batches (1000+ files), write progress to a state file so you can resume after interruption. A simple JSON file mapping audio filename → status (pending/done/failed) is sufficient.
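A minimal sketch of that state file (the function and file names here are ours, not from any library):

```python
import json
from pathlib import Path

STATE_FILE = Path("batch_state.json")

def load_state() -> dict:
    """Map of audio filename -> 'done' | 'failed'; empty on first run."""
    if STATE_FILE.exists():
        return json.loads(STATE_FILE.read_text(encoding="utf-8"))
    return {}

def mark(state: dict, name: str, status: str) -> None:
    """Record one file's status and flush immediately, so a crash loses nothing."""
    state[name] = status
    STATE_FILE.write_text(json.dumps(state, indent=2), encoding="utf-8")
```

In the batch loop: skip a file when state.get(name) == "done", and call mark(state, name, "done") after each successful write.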

Cross-feature: combining with PDF batch

If you have a corpus that includes both audio and PDFs (e.g., a research interview project with consent forms and prior transcripts as PDFs), the parallel batch pipeline for PDFs uses similar tooling. See batch convert 100 PDFs to Markdown for that pipeline. Combined, you can produce a unified Markdown corpus across audio and document inputs.

Privacy: local-only processing

One of the biggest reasons to use the local Whisper path is privacy. The audio never leaves your machine. For sensitive content (legal, medical, HR, deeply personal), this is the right answer regardless of volume — even for single files. The recipe above runs offline once the model weights are cached.

When the web tool is still the right choice

Despite all of the above, the web tool wins in three cases:

  1. Ad-hoc one-off conversions. The setup cost of the local pipeline isn't worth it for a single file.
  2. Non-technical users. Setting up Python, FFmpeg, and Hugging Face tokens is a real friction barrier.
  3. Casual use that doesn't justify maintenance. A local pipeline needs occasional model updates, dependency updates, and machine maintenance.

For everyone else with batch needs, the local OSS path is the right answer.

Recommendation

For most users with batch transcription needs, the path is: install Whisper or WhisperX once, write the script once (start from the recipes above), and reuse it for every future batch. The setup is a few hours; the payback is essentially every batch you process from then on. The output Markdown integrates with the same downstream workflows (Obsidian, Notion, AI Q&A) as the web tool's output. The format is identical; only the production pipeline differs.

For occasional users who hit a one-time batch (clearing out a backlog, transcribing a research project), the choice between the web tool and a local script is genuinely close. If the batch is under ~50 files and the friction of installing tooling is high, running the web tool one file at a time is fine. Above that volume, the local script wins decisively.

One additional pattern: scheduled overnight processing

For teams accumulating recordings continuously (daily standups, weekly all-hands, ongoing customer interviews), a scheduled overnight batch is the lowest-effort production pattern. Drop new audio files into an inbox folder during the day; a scheduled job (cron on Linux/Mac, Task Scheduler on Windows) runs the batch script at 2am; transcripts appear in the output folder by morning.

The scheduling configuration is straightforward. On Linux/Mac, a cron entry like 0 2 * * * cd /path/to/project && /path/to/python script.py runs the script every night at 2am. On Windows, Task Scheduler does the equivalent through a GUI. Add a small notification step at the end of the script (push to a Slack webhook, email, or just a desktop notification) so you know overnight work completed and which files succeeded.
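The notification step can stay tiny. A sketch for the Slack webhook option, stdlib only; the webhook URL is yours to supply, and the function names are placeholders:

```python
import json
import urllib.request

def build_summary(done: int, failed: list) -> str:
    """One-line summary plus the failed filenames, if any."""
    msg = f"Overnight transcription finished: {done} done, {len(failed)} failed."
    if failed:
        msg += " Failed: " + ", ".join(failed)
    return msg

def notify_slack(webhook_url: str, done: int, failed: list) -> None:
    """POST the summary to a Slack incoming webhook."""
    payload = json.dumps({"text": build_summary(done, failed)}).encode("utf-8")
    req = urllib.request.Request(
        webhook_url, data=payload,
        headers={"Content-Type": "application/json"},
    )
    urllib.request.urlopen(req)
```

Call notify_slack(url, done_count, read_failed()) as the last line of the batch script so the 2am run reports itself by morning.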

The pattern works because the batch script's skip-if-exists check makes it idempotent. The same script can run nightly without re-processing already-done files; it only handles whatever is new in the inbox folder. Combined with a simple intake convention (drop new audio in ./inbox/; transcripts appear in ./transcripts/ by morning) this becomes the lowest-friction production transcription pipeline for any team that produces recordings regularly.

One last note on accuracy and quality

The local Whisper path produces output that's at least as accurate as the web tool, and often better when you can use the largest model on a capable machine. The downside is the lack of the LLM-driven post-processing step (intelligent H2 section breaks at topic transitions). The recipe above shows how to add that as a second pass with an LLM API call. For teams that want the same structured output the web tool produces, the two-stage pipeline (local Whisper for transcription + diarization, LLM API for section headings) is the production answer.

Frequently asked questions

Does mdisbetter offer a batch endpoint, API, or CLI for audio transcription?
No. The supported surface is the web tool at /convert/audio-to-markdown — designed for ad-hoc one-off conversions. For automation and batch processing, the right path is open-source Whisper or WhisperX running locally on your machine. Both are MIT-licensed and a few pip-installs away. The recipes above are working starting points.
Can I run Whisper on a Mac without a GPU?
Yes. Apple Silicon Macs (M1/M2/M3) handle Whisper reasonably well even without an external GPU — the unified memory architecture and the Apple Neural Engine help. Expect 5-15x real-time on the large-v3 model. Intel Macs without dedicated GPU run on CPU only, which is slower (~3-5x real-time) but still usable for overnight batches.
What's the file size limit on local Whisper?
No hard limit — the model processes audio in 30-second windows internally, so file length doesn't bottleneck on memory. Practical limits are disk space and patience. For multi-hour files, the script handles them; just expect proportionally longer runtime. A 4-hour audio file at 10x real-time on GPU takes about 24 minutes.