# Batch Transcribe Multiple Audio Files at Once
You have 50 voice memos, 30 podcast back catalog episodes, or 200 customer interviews you want transcribed. The mdisbetter web tool processes one file at a time — perfect for ad-hoc use, painful for batches. Here's the honest answer: for batch volume, run OSS Whisper locally with a Python script. Free, fast on a decent machine, and the recipe is short. Full working code inside.
## The honest disclosure
The mdisbetter audio-to-markdown web tool is a one-file-at-a-time interface. It's optimized for ad-hoc conversions: upload, click, download. There is no batch endpoint, no API, no CLI, and no SDK. That's deliberate, for two reasons: most users only ever need one file at a time, and high-volume use is genuinely better served by running open-source speech-to-text models locally.
The right tool for batch is OpenAI Whisper (or its faster reimplementation, faster-whisper). Both are open-source, MIT-licensed, and produce equivalent or better quality on most content compared to the web tool. The setup cost is small; the per-file cost is essentially zero after that.
Practical thresholds:
- 1-10 files: web tool, one at a time. Total time is dominated by upload and small enough to be tolerable.
- 10-50 files: web tool gets tedious. Local Whisper is worth setting up.
- 50+ files: local Whisper with a script is the only sane option.
## Setting up Whisper locally

### Prerequisites
- Python 3.9 or newer
- FFmpeg installed and on your PATH for audio decoding (a quick check for both prerequisites is sketched after this list)
- A reasonably modern machine. A GPU helps but isn't required for moderate batches; large batches benefit significantly from CUDA-capable hardware.
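A quick way to verify the first two prerequisites, as a minimal sketch using only the Python standard library:

```python
import shutil
import sys

# Minimal prerequisite check (a sketch, not part of Whisper itself).
assert sys.version_info >= (3, 9), "Python 3.9 or newer required"
assert shutil.which("ffmpeg") is not None, "FFmpeg not found on PATH"
print("Prerequisites look good")
```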
### Install Whisper

```bash
pip install openai-whisper
```

That's the entire installation. The model weights download automatically on first use.
### Test on a single file

```bash
whisper test.m4a --model large-v3 --output_format txt --output_dir transcripts/
```

The first run downloads the model (a few GB for large-v3). Subsequent runs use the cached weights. Output is a test.txt file in the transcripts/ directory. (The whisper CLI's output formats are txt, vtt, srt, tsv, and json; it has no native Markdown output. The batch script below is what wraps transcripts in Markdown.)
## The basic batch script
For straightforward batch processing of all audio files in a folder:
```python
import whisper
from pathlib import Path

# Loads once; the weights are cached after the first download.
model = whisper.load_model("large-v3")

audio_dir = Path("./audio-inbox")
output_dir = Path("./transcripts")
output_dir.mkdir(exist_ok=True)

extensions = {".mp3", ".m4a", ".wav", ".ogg", ".flac", ".mp4"}

for audio_file in audio_dir.iterdir():
    if audio_file.suffix.lower() not in extensions:
        continue
    md_path = output_dir / f"{audio_file.stem}.md"
    if md_path.exists():
        print(f"SKIP {audio_file.name} (already transcribed)")
        continue
    print(f"PROCESSING {audio_file.name}...")
    result = model.transcribe(str(audio_file), verbose=False)
    md_path.write_text(
        f"# {audio_file.stem}\n\n{result['text']}\n",
        encoding="utf-8",
    )
    print(f"OK {md_path}")

print("Done.")
```

This is the simplest version. It produces one Markdown file per audio file, with the full transcript as flowing text: no speaker labels, no section headings.
## Adding speaker diarization with WhisperX
For interviews, meetings, or any multi-speaker content, you want speaker labels. Plain Whisper doesn't include diarization; WhisperX adds it.
### Install WhisperX

```bash
pip install whisperx
```

WhisperX requires a (free) Hugging Face token for the diarization models. Sign up at huggingface.co, generate a read token in your account settings, and accept the terms for the pyannote/speaker-diarization-3.1 model.
### Batch script with diarization

```python
import os
from pathlib import Path

import torch
import whisperx

DEVICE = "cuda" if torch.cuda.is_available() else "cpu"
COMPUTE_TYPE = "float16" if DEVICE == "cuda" else "int8"
HF_TOKEN = os.environ["HF_TOKEN"]  # set this in your shell

model = whisperx.load_model("large-v3", DEVICE, compute_type=COMPUTE_TYPE)
diarize_model = whisperx.DiarizationPipeline(use_auth_token=HF_TOKEN, device=DEVICE)

audio_dir = Path("./audio-inbox")
output_dir = Path("./transcripts")
output_dir.mkdir(exist_ok=True)

extensions = {".mp3", ".m4a", ".wav", ".ogg", ".flac", ".mp4"}

for audio_file in audio_dir.iterdir():
    if audio_file.suffix.lower() not in extensions:
        continue
    md_path = output_dir / f"{audio_file.stem}.md"
    if md_path.exists():
        continue
    print(f"PROCESSING {audio_file.name}...")
    audio = whisperx.load_audio(str(audio_file))
    result = model.transcribe(audio, batch_size=16)

    # Alignment: word-level timestamps, required before speakers can be assigned.
    align_model, metadata = whisperx.load_align_model(
        language_code=result["language"], device=DEVICE
    )
    result = whisperx.align(
        result["segments"], align_model, metadata, audio, DEVICE
    )

    # Diarization: who spoke when.
    diarize_segments = diarize_model(audio)
    result = whisperx.assign_word_speakers(diarize_segments, result)

    # Write Markdown with speaker labels.
    lines = [f"# {audio_file.stem}\n"]
    current_speaker = None
    for segment in result["segments"]:
        speaker = segment.get("speaker", "UNKNOWN")
        if speaker != current_speaker:
            lines.append(f"\n**{speaker}:** {segment['text'].strip()}")
            current_speaker = speaker
        else:
            lines.append(segment["text"].strip())
    md_path.write_text("\n".join(lines), encoding="utf-8")
    print(f"OK {md_path}")
```

This produces Markdown with bold speaker labels (**SPEAKER_00:**, **SPEAKER_01:**), similar to the web tool's output. Run a find-and-replace pass after batch completion to rename speakers per file, as shown in the sketch below.
## Adding section headings with an LLM pass
The web tool's H2 section headings come from an LLM analysis of the transcript content. To replicate this in your batch pipeline, add a second pass after transcription that uses an LLM to insert headings at topic transitions.
```python
import openai  # or anthropic, or any other LLM client

def add_section_headings(transcript_md: str) -> str:
    prompt = f"""Read this transcript. Insert H2 Markdown headings (## Topic) at natural topic shifts.

Rules:
- Insert headings only where the conversation clearly shifts to a new topic
- Headings should be substantive noun phrases (not 'Section 1' or 'New Topic')
- Preserve all existing content verbatim
- Do not add commentary or summary; just insert headings

Transcript:
{transcript_md}"""
    response = openai.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content
```

Call this on each transcribed file after the diarization pass. The headings make long transcripts navigable.
Note: this step costs LLM API tokens (small but non-zero). For batches of hundreds of files, the math is still favorable — typically pennies per file.
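A minimal driver for this pass, assuming the transcripts/ layout from the batch scripts above (very long transcripts may exceed the model's context window; chunking is omitted here):

```python
from pathlib import Path

for md_file in Path("./transcripts").glob("*.md"):
    text = md_file.read_text(encoding="utf-8")
    if "\n## " in text:
        continue  # already has H2 headings; keeps re-runs idempotent
    md_file.write_text(add_section_headings(text), encoding="utf-8")
    print(f"HEADINGS {md_file.name}")
```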
## Performance and hardware
Realistic transcription speed by hardware:
- CPU only (modern laptop): ~3-5x real-time. A 60-minute audio takes 12-20 minutes. Fine for overnight batches.
- Mid-range GPU (e.g., RTX 3060): ~10-20x real-time. 60-minute audio in 3-6 minutes.
- High-end GPU (e.g., RTX 4090): 30-50x real-time. 60-minute audio in 1-2 minutes.
- Apple Silicon Mac (M1/M2/M3): 5-15x real-time depending on model and size.
For batches under 50 files, even CPU-only is acceptable — leave it running while you do other work. For larger batches or production pipelines, GPU is worth the investment.
### Memory considerations
The large-v3 model needs roughly 10GB of RAM (CPU) or 10GB of VRAM (GPU). For machines with less, fall back to medium (5GB) or small (2GB) models — accuracy is lower but still usable for many use cases.
Trade-off comparison on clean English audio:

- large-v3: 95-99% word accuracy, slowest
- medium: 92-97% word accuracy, ~2x faster
- small: 88-94% word accuracy, ~4x faster
- tiny: 80-88% word accuracy, ~10x faster (mostly useful for triage, not final output)
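If you want the script to pick a model size automatically, one possible heuristic is sketched below; the thresholds are assumptions derived from the memory figures above, not tested limits:

```python
import torch

def pick_model() -> str:
    """Choose a Whisper model size from available hardware (heuristic sketch)."""
    if not torch.cuda.is_available():
        return "medium"  # CPU-only: trade some accuracy for speed
    vram_gb = torch.cuda.get_device_properties(0).total_memory / 1e9
    if vram_gb >= 10:
        return "large-v3"
    if vram_gb >= 5:
        return "medium"
    return "small"
```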
## Error handling for production batches
For batches of hundreds of files, robustness matters. Three patterns to add:
### Skip-if-exists
Already in the basic script — check for the output .md before processing. Lets you re-run the script after partial failures without redoing work.
### Try/except per file

```python
try:
    result = model.transcribe(str(audio_file))
    # ... write the Markdown output as in the main loop
except Exception as e:
    print(f"FAIL {audio_file.name}: {e}")
    failed_log.write(f"{audio_file.name}\t{e}\n")  # failed_log: an open file, e.g. open("failed.tsv", "a")
    continue
```

Don't let one corrupt file kill the whole batch. Log failures, continue, and retry the failed list at the end.
### Progress checkpointing
For very long batches (1000+ files), write progress to a state file so you can resume after interruption. A simple JSON file mapping audio filename → status (pending/done/failed) is sufficient.
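A minimal version of that state file, with hypothetical names (checkpoint.json, mark):

```python
import json
from pathlib import Path

STATE_FILE = Path("checkpoint.json")
state = json.loads(STATE_FILE.read_text()) if STATE_FILE.exists() else {}

def mark(name: str, status: str) -> None:
    """Record pending/done/failed for one file and flush to disk immediately."""
    state[name] = status
    STATE_FILE.write_text(json.dumps(state, indent=2))

# In the batch loop:
#   if state.get(audio_file.name) == "done": continue
#   mark(audio_file.name, "done") after a successful transcription
```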
## Cross-feature: combining with PDF batch
If you have a corpus that includes both audio and PDFs (e.g., research interview project with consent forms and transcripts as PDFs), the parallel batch pipeline for PDFs uses similar tooling. See batch convert 100 PDFs to Markdown for that pipeline. Combined, you can produce a unified Markdown corpus across audio and document inputs.
## Privacy: local-only processing
One of the biggest reasons to use the local Whisper path is privacy. The audio never leaves your machine. For sensitive content (legal, medical, HR, deeply personal), this is the right answer regardless of volume — even for single files. The recipe above runs offline once the model weights are cached.
## When the web tool is still the right choice
Despite all of the above, the web tool wins in three cases:
- Ad-hoc one-off conversions. The setup cost of the local pipeline isn't worth it for a single file.
- Non-technical users. Setting up Python, FFmpeg, and Hugging Face tokens is a real friction barrier.
- Casual use that doesn't justify maintenance. A local pipeline needs occasional model updates, dependency updates, and machine maintenance.
For everyone else with batch needs, the local OSS path is the right answer.
## Recommendation
For most users with batch transcription needs, the path is: install Whisper or WhisperX once, write the script once (start from the recipes above), and reuse it for every future batch. The setup is a few hours; the payback is essentially every batch you process from then on. The output Markdown integrates with the same downstream workflows (Obsidian, Notion, AI Q&A) as the web tool's output. The format is identical; only the production pipeline differs.
For occasional users who hit a one-time batch (clean out a backlog, transcribe a research project), the choice between web tool and local script is genuinely close. If the batch is under ~50 files and the friction of installing tooling is high, the web tool one file at a time is fine. Above that volume, the local script wins decisively.
## One additional pattern: scheduled overnight processing
For teams accumulating recordings continuously (daily standups, weekly all-hands, ongoing customer interviews), a scheduled overnight batch is the lowest-effort production pattern. Drop new audio files into an inbox folder during the day; a scheduled job (cron on Linux/Mac, Task Scheduler on Windows) runs the batch script at 2am; transcripts appear in the output folder by morning.
The scheduling configuration is straightforward. On Linux/Mac, a cron entry like `0 2 * * * cd /path/to/project && /path/to/python script.py` runs the script every night at 2am. On Windows, Task Scheduler does the equivalent through a GUI. Add a small notification step at the end of the script (push to a Slack webhook, email, or just a desktop notification) so you know the overnight work completed and which files succeeded; a sketch follows.
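The notification step can be as small as this sketch; it assumes the requests library and a Slack incoming-webhook URL exported as SLACK_WEBHOOK_URL (both are assumptions, not part of the transcription tooling):

```python
import os
import requests

def notify(message: str) -> None:
    """Post a completion summary to a Slack incoming webhook, if one is configured."""
    url = os.environ.get("SLACK_WEBHOOK_URL")
    if url:
        requests.post(url, json={"text": message}, timeout=10)

# At the end of the batch script:
#   notify(f"Overnight transcription done: {ok_count} succeeded, {fail_count} failed")
```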
The pattern works because the batch script's skip-if-exists check makes it idempotent. The same script can run nightly without re-processing already-done files; it only handles whatever is new in the inbox folder. Combined with a simple intake convention (drop new audio in ./audio-inbox/; transcripts appear in ./transcripts/ by morning), this becomes the lowest-friction production transcription pipeline for any team that produces recordings regularly.
## One last note on accuracy and quality
The local Whisper path produces output that's at least as accurate as the web tool, and often better when you can use the largest model on a capable machine. The downside is the lack of the LLM-driven post-processing step (intelligent H2 section breaks at topic transitions). The recipe above shows how to add that as a second pass with an LLM API call. For teams that want the same structured output the web tool produces, the two-stage pipeline (local Whisper for transcription + diarization, LLM API for section headings) is the production answer.