MDisBetter · 10 min read

How AI Transcription Actually Works (Whisper, ASR, and Beyond)

Automatic speech recognition was a research curiosity in 1995, an unreliable consumer toy in 2005, a niche enterprise tool in 2015, and a near-human commodity in 2025. The trajectory tracks one of the steeper improvement curves in any field of applied machine learning. Most of the gains since 2022 trace to a single architectural shift — encoder-decoder transformers trained on hundreds of thousands of hours of paired audio-and-text — and most of that body of work flows from one paper: OpenAI's Whisper. Here's what's actually happening when you upload an audio file and get a transcript back, why modern accuracy is what it is, and why the format of the output (plain text vs structured Markdown) matters more than people typically expect.

The four-decade arc of speech recognition

Speech recognition has gone through four distinct technological eras. Each transition produced a step-function improvement in usable accuracy.

1980s-2000s: HMM-GMM era. Hidden Markov Models with Gaussian Mixture Model emissions modeled speech as a sequence of phonemes (the smallest units of sound). The system tried to figure out which sequence of phonemes was most likely given the audio, then mapped that sequence to words via a pronunciation dictionary. Accuracy plateaued in the 70-85% range on clean studio audio, much lower on real-world recordings. Required hand-engineered acoustic features (MFCCs), per-language dictionaries, and language-specific tuning. This is what dictation software like Dragon NaturallySpeaking ran on for two decades.

2010-2017: hybrid HMM-DNN era. Deep neural networks replaced the GMM emission model while keeping the HMM backbone. Accuracy jumped 10-20 percentage points. This was the era of Google Voice Search becoming actually usable and of Siri/Alexa launching. Still required separate language models, pronunciation dictionaries, and acoustic models that had to be trained per language with substantial labeled data.

2017-2022: end-to-end deep learning. Architectures like DeepSpeech (Mozilla), Listen-Attend-Spell, and CTC-based models replaced the entire pipeline with a single neural network that mapped audio waveforms directly to text. Trainable on raw paired audio-text without needing phoneme dictionaries. Accuracy continued climbing on benchmarks. Still typically required language-specific models trained on language-specific data.

2022-present: large multilingual encoder-decoder transformers. Whisper (OpenAI, 2022), Conformer-based models, and their successors. Single model handles dozens of languages. Trained on hundreds of thousands of hours of weakly supervised data scraped from the internet. Robustness to background noise, accent variation, and audio quality variation that previous eras never approached. This is the architecture every modern transcription product runs on, with proprietary tweaks.

How Whisper specifically works

Whisper is the model whose architecture and training methodology defined the current era. The technical details, simplified:

Architecture. Encoder-decoder transformer. Input is 30-second chunks of audio converted to log-Mel spectrograms (a frequency-vs-time representation that the model processes as a 2D image). The encoder processes the spectrogram into a sequence of latent representations. The decoder is an autoregressive transformer that generates the transcribed text one token at a time, attending to the encoder's outputs.
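To make the input representation concrete, here is a minimal sketch using the open-source openai-whisper package; the file path is a placeholder, and the shape shown is for the 80-mel models (large-v3 uses 128 mel bins).

```python
# Whisper's input preprocessing, via the reference openai-whisper package.
# "interview.wav" is a placeholder path.
import whisper

audio = whisper.load_audio("interview.wav")   # 16 kHz mono float32 waveform
audio = whisper.pad_or_trim(audio)            # pad or crop to one 30-second chunk
mel = whisper.log_mel_spectrogram(audio)      # the frequency-vs-time "image"
print(mel.shape)                              # torch.Size([80, 3000])
```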

Training data. 680,000 hours of multilingual audio paired with text transcripts, scraped from the public internet. Roughly two-thirds was English, one-third other languages. The transcripts were of variable quality (some were professional captions, some were auto-generated, some were translations) — Whisper was trained to be robust to this variability rather than requiring clean labels.

Multitask training. The model was trained simultaneously on multiple tasks: transcription (audio → same-language text), translation (audio → English text from any language), language identification, and voice activity detection. A special token at the start of each output specifies which task. This multitask training is part of why Whisper generalizes so well — the model learned the structure of speech across many tasks rather than overfitting to one.
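In the reference implementation, those task and language tokens are surfaced directly as arguments to transcribe(); a sketch, with placeholder file names:

```python
# The multitask interface: transcribe() sets the task/language special
# tokens that steer the decoder. File names are placeholders.
import whisper

model = whisper.load_model("small")

# Transcription: German audio -> German text
german = model.transcribe("talk_de.wav", task="transcribe", language="de")

# Translation: same audio, same model, different task token -> English text
english = model.transcribe("talk_de.wav", task="translate", language="de")

# Language identification runs automatically when no language is given
detected = model.transcribe("unknown.wav")
print(detected["language"])  # e.g. "de"
```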

Output. Plain text by default, with optional segment-level timestamps. The model outputs natural punctuation and capitalization (learned from the training data, where most transcripts had both), which earlier ASR systems generally did not. Speaker diarization is not part of base Whisper — it requires a separate model run in a complementary pipeline (see the dedicated article on speaker identification).
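Concretely, the reference implementation returns a dict holding the flat text, the detected language, and segment-level timestamps; note there is no speaker field anywhere. A sketch, with a placeholder path:

```python
# Shape of Whisper's default output: flat punctuated text plus
# segment-level timestamps. No speaker labels; diarization is separate.
import whisper

model = whisper.load_model("base")
result = model.transcribe("interview.wav")  # placeholder path

print(result["text"][:100])        # plain text, punctuated and capitalized
for seg in result["segments"][:3]:
    print(seg["start"], seg["end"], seg["text"])
```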

Whisper variants and the model-size trade-off

OpenAI released five Whisper sizes, plus subsequent v2 and v3 updates and various fine-tuned community variants:

| Model    | Parameters | VRAM   | Relative speed | Relative accuracy |
|----------|------------|--------|----------------|-------------------|
| tiny     | 39M        | ~1 GB  | ~32x           | Lowest            |
| base     | 74M        | ~1 GB  | ~16x           | Low-medium        |
| small    | 244M       | ~2 GB  | ~6x            | Medium            |
| medium   | 769M       | ~5 GB  | ~2x            | High              |
| large-v3 | 1550M      | ~10 GB | 1x             | Highest           |

For most production use, large-v3 is the right starting point — accuracy headroom matters more than throughput for one-off transcription. For batch processing of large archives where compute cost dominates, medium offers most of the accuracy at meaningfully lower cost. tiny and base are useful for real-time scenarios on edge devices where latency requirements force sub-second processing.
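If you want to feel the trade-off on your own audio rather than trust the table, a quick timing loop works; absolute timings vary widely with hardware, and the path is a placeholder.

```python
# Rough size-vs-speed comparison on one file. Absolute timings depend
# heavily on hardware (GPU vs CPU); "sample.wav" is a placeholder.
import time
import whisper

for size in ("tiny", "small", "large-v3"):
    model = whisper.load_model(size)
    t0 = time.perf_counter()
    result = model.transcribe("sample.wav")
    elapsed = time.perf_counter() - t0
    print(f"{size:>9}: {elapsed:6.1f}s  {result['text'][:60]!r}")
```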

Beyond the official OpenAI sizes, the community has produced several useful variants:

- faster-whisper: a reimplementation of Whisper inference on the CTranslate2 engine that runs the same checkpoints several times faster with lower memory use.
- WhisperX: layers word-level timestamps (via forced alignment) and speaker diarization on top of Whisper output.
- Distil-Whisper: knowledge-distilled checkpoints that keep most of the large model's accuracy at roughly half the parameters and a several-fold speedup.

Accuracy improvements 2023-2026

Whisper large-v3 was released in late 2023 with a noticeable accuracy improvement on rarer languages and improved robustness on noisy audio. Throughout 2024-2025, frontier AI labs released speech models (gpt-4o-transcribe among them) that closed the remaining gaps on hard cases.

Practical word-error-rate (WER) numbers on clean English speech are now in the 2-5% range for the best models — close enough to human transcription that the residual errors cluster on cases humans also find hard (proper nouns, acronyms, overlapping speech). For practical accuracy expectations by audio quality, see the dedicated article on audio quality vs transcription accuracy.
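If those numbers are unfamiliar: WER counts word substitutions, deletions, and insertions against a reference transcript, divided by the reference length. A minimal check using the open-source jiwer library (the sample strings are invented):

```python
# WER = (substitutions + deletions + insertions) / reference word count.
# Computed here with the open-source jiwer library; strings are invented.
import jiwer

reference  = "the quick brown fox jumps over the lazy dog"
hypothesis = "the quick brown fox jumped over a lazy dog"

print(f"WER: {jiwer.wer(reference, hypothesis):.1%}")  # 2 errors / 9 words = 22.2%
```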

Why output format matters: plain text vs Markdown

Whisper's default output is plain text. For a long-form audio file, this means a single block of text with optional sentence-level segment markers. For human reading, this is barely usable. For downstream LLM use, it is meaningfully suboptimal.

The key insight: large language models perform better on structured input than on flat input. A plain-text transcript of a 60-minute conversation is harder for a downstream LLM to summarize, extract from, or analyze than a structured version of the same content. The model spends more of its attention on parsing the conversational structure (who said what, when does the topic shift) and less on the content itself.

Structured Markdown transcript output addresses this with three layers of structure: speaker labels (who said what), timestamps (when it was said), and topic-level headings (where the discussion shifts).

For a downstream prompt like "Summarize the main points discussed in this transcript," the structured version produces noticeably better summaries than the plain version — measurable on extraction-quality benchmarks. The differential matters most for long inputs (1-hour-plus transcripts) where attention budget is most constrained.
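A minimal sketch of the conversion step follows; the segment schema (speaker/start/text fields) is an assumption, modeled on what a diarization-aware pipeline typically emits.

```python
# Converting diarized segments into structured Markdown: speaker-turn
# headings with timestamps, then the spoken text. The segment field
# names are assumptions, not a fixed standard.

def fmt_ts(seconds: float) -> str:
    m, s = divmod(int(seconds), 60)
    return f"{m:02d}:{s:02d}"

def segments_to_markdown(segments: list[dict]) -> str:
    lines: list[str] = []
    speaker = None
    for seg in segments:
        if seg["speaker"] != speaker:        # new speaker turn
            speaker = seg["speaker"]
            lines.append(f"\n**{speaker}** [{fmt_ts(seg['start'])}]")
        lines.append(seg["text"].strip())
    return "\n".join(lines).strip()

segments = [
    {"speaker": "Speaker 1", "start": 0.0, "text": "Welcome back to the show."},
    {"speaker": "Speaker 2", "start": 3.2, "text": "Thanks for having me."},
]
print(segments_to_markdown(segments))
```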

The pre- and post-processing pipeline

Production transcription systems do more than just call the model. The actual pipeline:

  1. Audio normalization: resample to 16kHz mono, normalize loudness, optionally apply noise reduction
  2. Voice activity detection (VAD): identify segments containing speech and skip silent segments (saves compute, improves accuracy by not asking the model to hallucinate during silence)
  3. Chunking: split long audio into ~30-second windows with overlap (Whisper has a fixed 30-second context window; longer audio requires chunking)
  4. Inference: run each chunk through the model
  5. Stitching: merge chunk-level outputs into a single transcript, deduplicate at chunk boundaries (the overlap regions can produce duplicate words if not handled)
  6. Diarization (optional): run a separate pyannote-style model to identify speakers, align with transcript
  7. Forced alignment (optional): use a phoneme-level alignment model to produce word-level timestamps (Whisper's default segment timestamps are coarser)
  8. Format conversion: convert from internal segment representation to output format (plain text, SRT, VTT, structured Markdown, JSON)

Each stage has failure modes. Naive implementations skip several stages and produce noticeably worse output. Production systems do all of them.
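Here is a condensed sketch of stages 1 and 3-5, using ffmpeg for normalization and the reference openai-whisper package, which slides its 30-second window and stitches segments internally (its built-in no-speech check approximates, but is not, a true VAD stage). Paths are placeholders.

```python
# Stages 1 and 3-5 of the pipeline in miniature. openai-whisper handles
# chunking and stitching internally; diarization and forced alignment
# (stages 6-7) need separate models. Paths are placeholders.
import subprocess
import whisper

# Stage 1: resample to 16 kHz mono (add loudness normalization or
# denoising here if the source audio needs it)
subprocess.run(
    ["ffmpeg", "-y", "-i", "meeting.mp3", "-ar", "16000", "-ac", "1", "norm.wav"],
    check=True,
)

# Stages 3-5: sliding 30-second windows, inference, stitching
model = whisper.load_model("large-v3")
result = model.transcribe("norm.wav")

# Stage 8: the segments are ready for conversion to SRT/VTT/Markdown
for seg in result["segments"]:
    print(f"[{seg['start']:7.2f}-{seg['end']:7.2f}] {seg['text']}")
```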

The hallucination problem and how it's mitigated

One known failure mode of Whisper specifically: it can hallucinate text during silent segments or noisy non-speech audio. The model was trained on data where audio was always paired with text, so when faced with audio that has no clear speech, it sometimes generates plausible-sounding but fabricated text. This typically surfaces as repeated stock phrases: "thank you for watching," ubiquitous in the YouTube-derived portion of the training data, can show up dozens of times in a row during silent stretches.

Mitigations:

- Run voice activity detection before inference and skip segments with no detected speech (stage 2 of the pipeline above), so the model is never asked to transcribe silence.
- Tune the decoder's thresholds: the reference implementation exposes a no-speech probability threshold, an average log-probability threshold, and a compression-ratio threshold that catches degenerate repetition loops.
- Disable conditioning on previous text, so a hallucination in one 30-second window cannot seed the next one.
- Post-process: flag segments whose text repeats verbatim many times in a row, the characteristic hallucination signature.
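The decoder-level knobs are real parameters of openai-whisper's transcribe(); the values shown below are the library defaults, except condition_on_previous_text, which defaults to True.

```python
# Decoder-level hallucination mitigations, as exposed by openai-whisper.
# Values are library defaults except condition_on_previous_text.
import whisper

model = whisper.load_model("large-v3")
result = model.transcribe(
    "noisy_recording.wav",             # placeholder path
    no_speech_threshold=0.6,           # drop windows judged to be silence
    logprob_threshold=-1.0,            # retry windows with low confidence
    compression_ratio_threshold=2.4,   # reject degenerate repetition loops
    condition_on_previous_text=False,  # stop errors propagating across windows
    temperature=(0.0, 0.2, 0.4, 0.6, 0.8, 1.0),  # fallback schedule
)
print(result["text"])
```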

Beyond transcription: structure matters for downstream LLM use

The argument for Markdown over plain text generalizes beyond transcription. The same logic explains why structured Markdown beats raw HTML for LLM input on web content (see Markdown vs HTML for LLM token comparison) and why structured PDF-to-Markdown beats raw PDF text for any AI-assisted document workflow.

The pattern: any time you're feeding extracted content to an LLM, the format of the extraction determines downstream quality at least as much as the accuracy of the extraction itself. A 99%-accurate plain-text transcript can produce worse summaries than a 95%-accurate structured-Markdown transcript, because the LLM's attention is more efficiently spent.

What's next in transcription

Three trends to watch through the rest of 2026: continued accuracy gains on the hard cases (noise, heavy accents, overlapping speech), faster and cheaper inference pushing high-quality transcription onto edge devices and into real time, and tighter integration between transcription output and downstream LLM workflows.

For practical use today, the structured Markdown pipeline is at audio-to-markdown; for the format debate that drives the structural choice, see Markdown vs plain text for transcripts; for archival workflows that build on transcription, see building a searchable audio archive.

Frequently asked questions

Why is Whisper still considered a reference model when newer ones beat it on benchmarks?
Two reasons. First, Whisper's open weights make it the substrate everyone builds on — faster-whisper, WhisperX, Distil-Whisper, and dozens of fine-tuned domain variants all start from Whisper checkpoints. The ecosystem effect is large. Second, on noisy real-world audio (the kind that doesn't appear in clean academic benchmarks), Whisper's robustness is still competitive with proprietary alternatives. Newer models like gpt-4o-transcribe beat it on specific axes but have less of a developer ecosystem and aren't open-weights, so for any pipeline you want to run yourself, Whisper remains the default starting point.
How does Whisper handle code-switching (speakers mixing languages mid-sentence)?
Whisper handles code-switching at segment granularity reasonably well — it can detect and transcribe both languages within the same audio file. Within a single sentence that mixes languages, performance is more variable; the model tends to pick one language and force-fit the other words, which can produce errors. For multilingual content where code-switching is frequent (bilingual interviews, multilingual meetings), running the model with the dominant language explicitly set typically produces cleaner output than letting it auto-detect, with the secondary-language passages cleaned up in post-processing.
What's the realistic word error rate (WER) I should expect on my own audio?
On clean, near-mic audio in a major language, expect 2-5% WER for large-v3 — close to human transcription. On phone-quality audio with some background noise, 5-12% WER. On heavily degraded audio (noisy room, distant mic, overlapping speech), 15-30%+ WER, sometimes worse. The single biggest factor is signal-to-noise ratio at the mic — improving recording conditions has more impact on WER than choosing a different model. The detailed treatment is in our audio-quality-vs-accuracy article.