
Transcription Accuracy by Audio Quality: What to Expect (Real Tests)

Vendors quote 99% accuracy. Real-world users routinely get 85%. The gap isn't lying — it's audio quality. The same transcription engine produces wildly different results depending on the microphone, the room, and the background noise. We tested this directly: one speaker reading the same 8-minute script under four conditions, run through five top tools. The numbers below let you predict your own results before paying for transcription you'll have to redo.

Test methodology

One speaker, one 8-minute script (a mix of common English plus 20 deliberately tricky words: technical jargon, proper nouns, numbers, foreign loanwords). Recorded four times under four conditions:

  1. Studio mic — Shure SM7B through a Cloudlifter into a Focusrite Scarlett 2i2, treated room, pop filter. The best-case setup.
  2. USB headset — Logitech H390, untreated home office, light ambient HVAC noise.
  3. Phone (held) — iPhone 15 native Voice Memos app, held about 8 inches from mouth, quiet living room.
  4. Phone with background noise — same iPhone, same position, with a TV playing news at conversational volume in the same room.

Each recording was run through five top tools from our broader 12-tool benchmark: MDisBetter, HappyScribe AI, Whisper large-v3 (local), Otter, and TurboScribe. Word Error Rate (WER) was computed against the source script; the table below reports accuracy, i.e. 100% minus WER.
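
For the curious, WER is simply word-level edit distance divided by the number of words in the reference. A minimal Python sketch (illustrative only; real scoring pipelines also normalize punctuation, casing, and number formats before comparing):

```python
# Word-level WER: edit distance between reference and hypothesis, over reference length.
def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.lower().split(), hypothesis.lower().split()
    # dp[i][j] = edits needed to turn the first i reference words into the first j hypothesis words
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + sub)  # substitution (or match)
    return dp[-1][-1] / len(ref)

script = "the quick brown fox jumps over the lazy dog"
transcript = "the quick brown fox jumped over a lazy dog"
print(f"WER: {wer(script, transcript):.1%}")           # 22.2% (2 errors in 9 words)
print(f"Accuracy: {1 - wer(script, transcript):.1%}")  # 77.8%
```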

Headline results

| Tool | Studio mic | USB headset | Phone (quiet) | Phone (TV noise) |
|---|---|---|---|---|
| HappyScribe AI | 99.2% | 97.8% | 95.4% | 83.1% |
| Whisper large-v3 | 98.9% | 97.1% | 94.7% | 87.4% |
| MDisBetter | 98.6% | 96.9% | 94.2% | 82.8% |
| Otter | 98.4% | 96.2% | 93.5% | 80.9% |
| TurboScribe | 98.7% | 96.4% | 93.8% | 81.5% |

What the numbers mean

Studio mic — 98-99% across the board

This is what vendors quote when they say 99%. Clean signal, minimal noise, proper room treatment, single speaker. Every top-tier tool essentially solves the problem here. The 0.8-point spread between best and worst is irrelevant for almost any use case.

If your job permits investing in a real microphone and a quiet space, this is achievable. Podcasters, voice-over artists, and professional interviewers should aim here.

USB headset — 96-98%

The realistic best case for most knowledge-worker recording. A $50 USB headset in a normal home office gets you within 1-2 points of studio quality. The losses come from the headset's smaller diaphragm and the untreated room (some echo, some HVAC drone), not from the digital pipeline.

This is the recommended baseline for anyone doing regular transcription work. Cheap, simple, dramatic improvement over phone recording.

Phone (quiet room) — 93-95%

A held iPhone in a quiet room is surprisingly competent: five to seven errors per 100 words. Most errors are subtle (wrong proper noun, missed comma, occasional dropped word); the gist is fully intact.

This is the right baseline for casual users: voice memos, quick interview snippets, field notes. Acceptable for most non-publication uses.

Phone with TV background noise — 80-87%

The cliff. Adding moderate background noise drops accuracy by 10-15 points. That is one error every five to eight words: a serious editing burden, often more work than transcribing the audio yourself.

Notably, Whisper large-v3 outperformed the cloud tools by 4-5 points in this category. Whisper's training corpus includes a lot of noisy real-world audio, and that breadth shows up specifically when conditions degrade.

Why noise hurts more than expected

Acoustically, doubling the noise (a 3 dB increase) doesn't just double the error rate; it more than doubles it, especially when the noise overlaps speech frequencies. TV news playing in the background is the worst case: it's intelligible speech at similar frequencies to your speech, which confuses the model into transcribing both.
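
To put numbers on that: decibels are logarithmic, so a 3 dB rise in the noise floor means the noise power has doubled. A quick back-of-the-envelope in Python (generic signal math, not measurements from our tests):

```python
import math

def snr_db(signal_power: float, noise_power: float) -> float:
    """Signal-to-noise ratio in decibels, from linear power values."""
    return 10 * math.log10(signal_power / noise_power)

speech = 1.0        # arbitrary linear power units
quiet_room = 0.01   # faint noise floor
tv_on = 0.02        # noise power doubled

print(snr_db(speech, quiet_room))  # 20.0 dB
print(snr_db(speech, tv_on))       # ~17.0 dB: doubling the noise power costs 3 dB of SNR
```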

Music, traffic, HVAC, and dishwasher noise are easier to filter — they're spectrally distant from speech and the models are trained to ignore steady-state noise. People talking in the background is the hardest case in real-world recording.

What the per-tool differences mean

HappyScribe and Whisper alternated for the lead depending on noise. HappyScribe was best on clean audio (99.2% studio); Whisper was best on noisy audio (87.4% phone+TV). MDisBetter, Otter, and TurboScribe clustered tightly within 1-2 points of each other across all conditions.

If your recording quality is reliable (you control your mic and room), the tool choice matters relatively little — pick on workflow features. If your recordings are unpredictable (field interviews, ad hoc capture), Whisper-class noise robustness becomes a real differentiator.

How to improve any recording

Distance to mic — the single biggest fix

Keeping the mic 6-12 inches from your mouth (instead of arm's length on a desk or laptop) is the most impactful single change you can make. Each doubling of distance costs roughly 6 dB of signal-to-noise ratio, because the direct signal falls off while the room noise stays constant. A $5 lapel mic clipped to your collar beats a $200 condenser sitting across the desk.

Room treatment — soft surfaces matter

Echo destroys transcription. The simplest fix: record in a room with soft surfaces. Carpets, curtains, sofas, bookshelves all absorb sound. The hardest places to transcribe are kitchens, bathrooms, and empty rooms — hard reflective surfaces create reverb that smears speech across the timeline.

If you can't change rooms, throw a thick blanket over the recording desk and around the mic. It looks ridiculous and works.

Background noise — kill it before recording, not after

Turn off the TV, the dishwasher, the air purifier. Close the window facing the street. Mute Slack notifications. Five minutes of pre-recording prep saves hours of post-recording cleanup.

Noise reduction software (Krisp, Adobe Enhance Speech, Descript Studio Sound) helps significantly when you can't kill the source — it removes steady-state noise reasonably well. It's not magic for speech-on-speech contamination.

Use the right format

Record in WAV or M4A (lossless or high-bitrate compressed). Avoid low-bitrate MP3 (under 96 kbps) — the compression artifacts confuse transcription models. Most modern phone recorders default to acceptable formats; voice-memo apps from 2020+ are fine.
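
If a tool is picky about formats, or you just want to normalize whatever your phone produced, re-encoding to 16 kHz mono WAV is a safe lowest common denominator for speech models. A small sketch assuming ffmpeg is installed (note this cannot restore quality already lost to low-bitrate compression):

```python
import subprocess

def to_wav_16k_mono(src: str, dst: str) -> None:
    """Re-encode any audio file to 16 kHz mono PCM WAV via ffmpeg."""
    subprocess.run(
        ["ffmpeg", "-y", "-i", src, "-ar", "16000", "-ac", "1", dst],
        check=True,
    )

to_wav_16k_mono("voice_memo.m4a", "voice_memo_16k.wav")
```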

Speak clearly without overdoing it

Articulate consonants, slow down slightly, but don't shift into news-anchor voice — overly stilted speech actually transcribes worse than natural speech (the models are trained on conversational data). Just speak the way you'd want a human to hear you.

Specific tips by recording scenario

Solo voice memo on a phone

Two-person interview in a public place

Multi-person meeting

Lecture or conference talk

Field recording outdoors

Calibrating expectations

Use the table at the top of this article to set your expectations: roughly 98-99% with a studio mic, 96-98% with a USB headset, 93-95% on a phone in a quiet room, and 80-87% on a phone with competing speech in the background.

This is also why the AI-pipeline argument matters: if the transcript is going to ChatGPT or Claude, you want clean structured Markdown output (covered in speech to text vs audio to Markdown) — but the LLM cannot fix accuracy errors in your transcript. Garbage in, garbage out applies. Get the recording quality right first.

What about other input formats?

Audio quality has the biggest impact on transcription accuracy. Document conversion has its own "input quality" considerations — scanned PDFs vs digital PDFs, JS-rendered web pages vs server-rendered. We cover the document side in PDF to Markdown accuracy: what to expect. The principle is the same: clean inputs make every downstream step easier.

Picking your tool

If your audio is reliably clean (studio or USB headset), pick on workflow features. MDisBetter for AI workflows + Markdown output. TurboScribe for unlimited volume. Otter for its meeting bot.

If your audio is unpredictable (field, phone, noisy environments), Whisper large-v3 is meaningfully more robust. Run locally if you have a GPU, or pick a tool that uses a recent Whisper variant under the hood.
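
If you go the local route, the minimal version looks something like this (assuming the openai-whisper package and a GPU with enough memory for large-v3; the smaller checkpoints trade accuracy for speed):

```python
import whisper  # pip install openai-whisper

model = whisper.load_model("large-v3")           # downloads the checkpoint on first run
result = model.transcribe("field_interview.m4a")
print(result["text"])
```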

If accuracy must be perfect (legal, medical, journalism), HappyScribe's human-transcription tier is the answer regardless of recording quality.

Frequently asked questions

Will recording in stereo improve accuracy over mono?
Not directly — most transcription models downmix to mono internally. Stereo helps only if the two channels capture different speakers separately (e.g., a Zoom recording with each participant on their own track). For single-mic capture, stereo doubles file size with no benefit.
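
If you do have a per-speaker stereo file, splitting it into two mono tracks and transcribing each separately is straightforward; a sketch assuming the soundfile package and one speaker per channel:

```python
import soundfile as sf  # pip install soundfile

audio, sample_rate = sf.read("two_person_call.wav")  # stereo shape: (samples, 2)
sf.write("speaker_left.wav", audio[:, 0], sample_rate)
sf.write("speaker_right.wav", audio[:, 1], sample_rate)
# Transcribe each file on its own, then merge the two transcripts by timestamp.
```
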
Does Apple's Voice Memos transcribe well?
iOS 18+ added on-device transcription to Voice Memos with reasonable quality on clean phone-quality audio (around 92-94% in our quick test). It's free, instant, and private. Compared to cloud tools, it loses on multi-speaker handling and on noisy-audio robustness, but it's an excellent default for solo voice memos.
Should I clean up audio with noise reduction before uploading?
Modern transcription models are mostly noise-robust on their own — pre-cleaning with consumer noise reduction tools rarely helps and can hurt (artifacts confuse the model). The exception: if your recording has a steady-state noise (HVAC hum) much louder than the speech, a basic noise gate or high-pass filter at 80 Hz can help. For TV/voice background noise, no preprocessing reliably helps; re-record if you can.
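
For the HVAC case, an 80 Hz high-pass is about the only preprocessing consistently worth trying before upload. A sketch assuming scipy and soundfile (any DSP tool with a high-pass filter does the same job):

```python
import soundfile as sf
from scipy.signal import butter, sosfiltfilt

audio, sample_rate = sf.read("hvac_hum_recording.wav")
# 4th-order Butterworth high-pass at 80 Hz: removes low-frequency rumble
# while leaving speech fundamentals (roughly 85 Hz and up) intact.
sos = butter(4, 80, btype="highpass", fs=sample_rate, output="sos")
sf.write("filtered.wav", sosfiltfilt(sos, audio, axis=0), sample_rate)
```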