MDisBetter · 10 min read

How AI Transcription Actually Works (Whisper, ASR, and Beyond)

Automatic speech recognition was a research curiosity in 1995, an unreliable consumer toy in 2005, a niche enterprise tool in 2015, and a near-human commodity in 2025. The trajectory tracks one of the steeper improvement curves in any field of applied machine learning. Most of the gains since 2022 trace to a single architectural shift — encoder-decoder transformers trained on hundreds of thousands of hours of paired audio-and-text — and most of that body of work flows from one paper: OpenAI's Whisper. Here's what's actually happening when you upload an audio file and get a transcript back, why modern accuracy is what it is, and why the format of the output (plain text vs structured Markdown) matters more than people typically expect.

The four-decade arc of speech recognition

Speech recognition has gone through four distinct technological eras. Each transition produced a step-function improvement in usable accuracy.

1980s-2000s: HMM-GMM era. Hidden Markov Models with Gaussian Mixture Model emissions modeled speech as a sequence of phonemes (the smallest units of sound). The system tried to figure out which sequence of phonemes was most likely given the audio, then mapped that sequence to words via a pronunciation dictionary. Accuracy plateaued in the 70-85% range on clean studio audio, much lower on real-world recordings. Required hand-engineered acoustic features (MFCCs), per-language dictionaries, and language-specific tuning. This is what dictation software like Dragon NaturallySpeaking ran on for two decades.

2010-2017: hybrid HMM-DNN era. Deep neural networks replaced the GMM emission model while keeping the HMM backbone. Accuracy jumped 10-20 percentage points. This was the era of Google Voice Search becoming actually usable and of Siri/Alexa launching. Still required separate language models, pronunciation dictionaries, and acoustic models that had to be trained per language with substantial labeled data.

2017-2022: end-to-end deep learning. Architectures like DeepSpeech (Mozilla), Listen-Attend-Spell, and CTC-based models replaced the entire pipeline with a single neural network that mapped audio waveforms directly to text. Trainable on raw paired audio-text without needing phoneme dictionaries. Accuracy continued climbing on benchmarks. Still typically required language-specific models trained on language-specific data.

2022-present: large multilingual encoder-decoder transformers. Whisper (OpenAI, 2022), Conformer-based models, and their successors. Single model handles dozens of languages. Trained on hundreds of thousands of hours of weakly supervised data scraped from the internet. Robustness to background noise, accent variation, and audio quality variation that previous eras never approached. This is the architecture every modern transcription product runs on, with proprietary tweaks.

How Whisper specifically works

Whisper is the model whose architecture and training methodology defined the current era. The technical details, simplified:

Architecture. Encoder-decoder transformer. Input is 30-second chunks of audio converted to log-Mel spectrograms (a frequency-vs-time representation that the model processes as a 2D image). The encoder processes the spectrogram into a sequence of latent representations. The decoder is an autoregressive transformer that generates the transcribed text one token at a time, attending to the encoder's outputs.
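To make the input representation concrete, here is a minimal sketch using the open-source openai-whisper package; the file path is a placeholder, and the shape shown is for the 80-mel models (large-v3 uses 128 mel bins).

```python
# Whisper's input preprocessing, via the reference openai-whisper package.
# "interview.wav" is a placeholder path.
import whisper

audio = whisper.load_audio("interview.wav")   # 16 kHz mono float32 waveform
audio = whisper.pad_or_trim(audio)            # pad or crop to one 30-second chunk
mel = whisper.log_mel_spectrogram(audio)      # the frequency-vs-time "image"
print(mel.shape)                              # torch.Size([80, 3000])
```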

Training data. 680,000 hours of multilingual audio paired with text transcripts, scraped from the public internet. Roughly two-thirds was English, one-third other languages. The transcripts were of variable quality (some were professional captions, some were auto-generated, some were translations) — Whisper was trained to be robust to this variability rather than requiring clean labels.

Multitask training. The model was trained simultaneously on multiple tasks: transcription (audio → same-language text), translation (audio → English text from any language), language identification, and voice activity detection. A special token at the start of each output specifies which task. This multitask training is part of why Whisper generalizes so well — the model learned the structure of speech across many tasks rather than overfitting to one.
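In the reference implementation, those task and language tokens are surfaced directly as arguments to transcribe(); a sketch, with placeholder file names:

```python
# The multitask interface: transcribe() sets the task/language special
# tokens that steer the decoder. File names are placeholders.
import whisper

model = whisper.load_model("small")

# Transcription: German audio -> German text
german = model.transcribe("talk_de.wav", task="transcribe", language="de")

# Translation: same audio, same model, different task token -> English text
english = model.transcribe("talk_de.wav", task="translate", language="de")

# Language identification runs automatically when no language is given
detected = model.transcribe("unknown.wav")
print(detected["language"])  # e.g. "de"
```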

Output. Plain text by default, with optional segment-level timestamps. The model outputs natural punctuation and capitalization (learned from the training data, where most transcripts had both), which earlier ASR systems generally did not. Speaker diarization is not part of base Whisper — it requires a separate model run in a complementary pipeline (see the dedicated article on speaker identification).
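Concretely, the reference implementation returns a dict holding the flat text, the detected language, and segment-level timestamps; note there is no speaker field anywhere. A sketch, with a placeholder path:

```python
# Shape of Whisper's default output: flat punctuated text plus
# segment-level timestamps. No speaker labels; diarization is separate.
import whisper

model = whisper.load_model("base")
result = model.transcribe("interview.wav")  # placeholder path

print(result["text"][:100])        # plain text, punctuated and capitalized
for seg in result["segments"][:3]:
    print(seg["start"], seg["end"], seg["text"])
```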

Whisper variants and the model-size trade-off

OpenAI released five Whisper sizes, plus subsequent v2 and v3 updates and various fine-tuned community variants:

| Model    | Parameters | VRAM   | Relative speed | Relative accuracy |
|----------|------------|--------|----------------|-------------------|
| tiny     | 39M        | ~1 GB  | ~32x           | Lowest            |
| base     | 74M        | ~1 GB  | ~16x           | Low-medium        |
| small    | 244M       | ~2 GB  | ~6x            | Medium            |
| medium   | 769M       | ~5 GB  | ~2x            | High              |
| large-v3 | 1550M      | ~10 GB | 1x             | Highest           |

For most production use, large-v3 is the right starting point — accuracy headroom matters more than throughput for one-off transcription. For batch processing of large archives where compute cost dominates, medium offers most of the accuracy at meaningfully lower cost. tiny and base are useful for real-time scenarios on edge devices where latency requirements force sub-second processing.
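If you want to feel the trade-off on your own audio rather than trust the table, a quick timing loop works; absolute timings vary widely with hardware, and the path is a placeholder.

```python
# Rough size-vs-speed comparison on one file. Absolute timings depend
# heavily on hardware (GPU vs CPU); "sample.wav" is a placeholder.
import time
import whisper

for size in ("tiny", "small", "large-v3"):
    model = whisper.load_model(size)
    t0 = time.perf_counter()
    result = model.transcribe("sample.wav")
    elapsed = time.perf_counter() - t0
    print(f"{size:>9}: {elapsed:6.1f}s  {result['text'][:60]!r}")
```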

Beyond the official OpenAI sizes, the community has produced several useful variants:

- faster-whisper: a reimplementation of Whisper inference on the CTranslate2 engine that runs the same checkpoints several times faster with lower memory use.
- WhisperX: layers word-level timestamps (via forced alignment) and speaker diarization on top of Whisper output.
- Distil-Whisper: knowledge-distilled checkpoints that keep most of the large model's accuracy at roughly half the parameters and a several-fold speedup.

Accuracy improvements 2023-2026

Whisper large-v3 was released in late 2023 with a noticeable accuracy improvement on rarer languages and improved robustness on noisy audio. Throughout 2024-2025, frontier AI labs released speech models (gpt-4o-transcribe among them) that closed the remaining gaps on hard cases.

Practical word-error-rate (WER) numbers on clean English speech are now in the 2-5% range for the best models — close enough to human transcription that the residual errors cluster on cases humans also find hard (proper nouns, acronyms, overlapping speech). For practical accuracy expectations by audio quality, see the dedicated article on audio quality vs transcription accuracy.
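If those numbers are unfamiliar: WER counts word substitutions, deletions, and insertions against a reference transcript, divided by the reference length. A minimal check using the open-source jiwer library (the sample strings are invented):

```python
# WER = (substitutions + deletions + insertions) / reference word count.
# Computed here with the open-source jiwer library; strings are invented.
import jiwer

reference  = "the quick brown fox jumps over the lazy dog"
hypothesis = "the quick brown fox jumped over a lazy dog"

print(f"WER: {jiwer.wer(reference, hypothesis):.1%}")  # 2 errors / 9 words = 22.2%
```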

Why output format matters: plain text vs Markdown

Whisper's default output is plain text. For a long-form audio file, this means a single block of text with optional sentence-level segment markers. For human reading, this is barely usable. For downstream LLM use, it is meaningfully suboptimal.

The key insight: large language models perform better on structured input than on flat input. A plain-text transcript of a 60-minute conversation is harder for a downstream LLM to summarize, extract from, or analyze than a structured version of the same content. The model spends more of its attention on parsing the conversational structure (who said what, when does the topic shift) and less on the content itself.

Structured Markdown transcript output addresses this with three layers of structure: speaker labels (who said what), timestamps (when it was said), and topic-level headings (where the discussion shifts).

For a downstream prompt like "Summarize the main points discussed in this transcript," the structured version produces noticeably better summaries than the plain version — measurable on extraction-quality benchmarks. The differential matters most for long inputs (1-hour-plus transcripts) where attention budget is most constrained.
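A minimal sketch of the conversion step follows; the segment schema (speaker/start/text fields) is an assumption, modeled on what a diarization-aware pipeline typically emits.

```python
# Converting diarized segments into structured Markdown: speaker-turn
# headings with timestamps, then the spoken text. The segment field
# names are assumptions, not a fixed standard.

def fmt_ts(seconds: float) -> str:
    m, s = divmod(int(seconds), 60)
    return f"{m:02d}:{s:02d}"

def segments_to_markdown(segments: list[dict]) -> str:
    lines: list[str] = []
    speaker = None
    for seg in segments:
        if seg["speaker"] != speaker:        # new speaker turn
            speaker = seg["speaker"]
            lines.append(f"\n**{speaker}** [{fmt_ts(seg['start'])}]")
        lines.append(seg["text"].strip())
    return "\n".join(lines).strip()

segments = [
    {"speaker": "Speaker 1", "start": 0.0, "text": "Welcome back to the show."},
    {"speaker": "Speaker 2", "start": 3.2, "text": "Thanks for having me."},
]
print(segments_to_markdown(segments))
```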

The pre- and post-processing pipeline

Production transcription systems do more than just call the model. The actual pipeline:

  1. Audio normalization: resample to 16kHz mono, normalize loudness, optionally apply noise reduction
  2. Voice activity detection (VAD): identify segments containing speech and skip silent segments (saves compute, improves accuracy by not asking the model to hallucinate during silence)
  3. Chunking: split long audio into ~30-second windows with overlap (Whisper has a fixed 30-second context window; longer audio requires chunking)
  4. Inference: run each chunk through the model
  5. Stitching: merge chunk-level outputs into a single transcript, deduplicate at chunk boundaries (the overlap regions can produce duplicate words if not handled)
  6. Diarization (optional): run a separate pyannote-style model to identify speakers, align with transcript
  7. Forced alignment (optional): use a phoneme-level alignment model to produce word-level timestamps (Whisper's default segment timestamps are coarser)
  8. Format conversion: convert from internal segment representation to output format (plain text, SRT, VTT, structured Markdown, JSON)

Each stage has failure modes. Naive implementations skip several stages and produce noticeably worse output. Production systems do all of them.
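Here is a condensed sketch of stages 1 and 3-5, using ffmpeg for normalization and the reference openai-whisper package, which slides its 30-second window and stitches segments internally (its built-in no-speech check approximates, but is not, a true VAD stage). Paths are placeholders.

```python
# Stages 1 and 3-5 of the pipeline in miniature. openai-whisper handles
# chunking and stitching internally; diarization and forced alignment
# (stages 6-7) need separate models. Paths are placeholders.
import subprocess
import whisper

# Stage 1: resample to 16 kHz mono (add loudness normalization or
# denoising here if the source audio needs it)
subprocess.run(
    ["ffmpeg", "-y", "-i", "meeting.mp3", "-ar", "16000", "-ac", "1", "norm.wav"],
    check=True,
)

# Stages 3-5: sliding 30-second windows, inference, stitching
model = whisper.load_model("large-v3")
result = model.transcribe("norm.wav")

# Stage 8: the segments are ready for conversion to SRT/VTT/Markdown
for seg in result["segments"]:
    print(f"[{seg['start']:7.2f}-{seg['end']:7.2f}] {seg['text']}")
```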

The hallucination problem and how it's mitigated

One known failure mode of Whisper specifically: it can hallucinate text during silent segments or noisy non-speech audio. The model was trained on data where audio was always paired with text, so when faced with audio that has no clear speech, it sometimes generates plausible-sounding but fabricated text. This typically surfaces as repeated stock phrases: "thank you for watching," ubiquitous in the YouTube-derived portion of the training data, can show up dozens of times in a row during silent stretches.

Mitigations:

- Run voice activity detection before inference and skip segments with no detected speech (stage 2 of the pipeline above), so the model is never asked to transcribe silence.
- Tune the decoder's thresholds: the reference implementation exposes a no-speech probability threshold, an average log-probability threshold, and a compression-ratio threshold that catches degenerate repetition loops.
- Disable conditioning on previous text, so a hallucination in one 30-second window cannot seed the next one.
- Post-process: flag segments whose text repeats verbatim many times in a row, the characteristic hallucination signature.
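The decoder-level knobs are real parameters of openai-whisper's transcribe(); the values shown below are the library defaults, except condition_on_previous_text, which defaults to True.

```python
# Decoder-level hallucination mitigations, as exposed by openai-whisper.
# Values are library defaults except condition_on_previous_text.
import whisper

model = whisper.load_model("large-v3")
result = model.transcribe(
    "noisy_recording.wav",             # placeholder path
    no_speech_threshold=0.6,           # drop windows judged to be silence
    logprob_threshold=-1.0,            # retry windows with low confidence
    compression_ratio_threshold=2.4,   # reject degenerate repetition loops
    condition_on_previous_text=False,  # stop errors propagating across windows
    temperature=(0.0, 0.2, 0.4, 0.6, 0.8, 1.0),  # fallback schedule
)
print(result["text"])
```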

Beyond transcription: structure matters for downstream LLM use

The argument for Markdown over plain text generalizes beyond transcription. The same logic explains why structured Markdown beats raw HTML for LLM input on web content (see Markdown vs HTML for LLM token comparison) and why structured PDF-to-Markdown beats raw PDF text for any AI-assisted document workflow.

The pattern: any time you're feeding extracted content to an LLM, the format of the extraction determines downstream quality at least as much as the accuracy of the extraction itself. A 99%-accurate plain-text transcript can produce worse summaries than a 95%-accurate structured-Markdown transcript, because the LLM's attention is more efficiently spent.

What's next in transcription

Three trends to watch through the rest of 2026: continued accuracy gains on the hard cases (noise, heavy accents, overlapping speech), faster and cheaper inference pushing high-quality transcription onto edge devices and into real time, and tighter integration between transcription output and downstream LLM workflows.

For practical use today, the structured Markdown pipeline is at audio-to-markdown; for the format debate that drives the structural choice, see Markdown vs plain text for transcripts; for archival workflows that build on transcription, see building a searchable audio archive.

Frequently asked questions

Why is Whisper still considered a reference model when newer ones beat it on benchmarks?
Two reasons. First, Whisper's open weights make it the substrate everyone builds on — faster-whisper, WhisperX, Distil-Whisper, and dozens of fine-tuned domain variants all start from Whisper checkpoints. The ecosystem effect is large. Second, on noisy real-world audio (the kind that doesn't appear in clean academic benchmarks), Whisper's robustness is still competitive with proprietary alternatives. Newer models like gpt-4o-transcribe beat it on specific axes but have less of a developer ecosystem and aren't open-weights, so for any pipeline you want to run yourself, Whisper remains the default starting point.
How does Whisper handle code-switching (speakers mixing languages mid-sentence)?
Whisper handles code-switching at segment granularity reasonably well — it can detect and transcribe both languages within the same audio file. Within a single sentence that mixes languages, performance is more variable; the model tends to pick one language and force-fit the other words, which can produce errors. For multilingual content where code-switching is frequent (bilingual interviews, multilingual meetings), running the model with the dominant language explicitly set typically produces cleaner output than letting it auto-detect, with the secondary-language passages cleaned up in post-processing.
What's the realistic word error rate (WER) I should expect on my own audio?
On clean, near-mic audio in a major language, expect 2-5% WER for large-v3 — close to human transcription. On phone-quality audio with some background noise, 5-12% WER. On heavily degraded audio (noisy room, distant mic, overlapping speech), 15-30%+ WER, sometimes worse. The single biggest factor is signal-to-noise ratio at the mic — improving recording conditions has more impact on WER than choosing a different model. The detailed treatment is in our audio-quality-vs-accuracy article.