Speaker Identification in Transcription: How It Works
Whisper transcribes what was said. It does not, by default, tell you who said it. Separating an audio recording into segments by speaker — "speaker diarization" — is a distinct machine-learning problem with its own models, its own failure modes, and its own accuracy ceilings. For two-speaker conversations, modern diarization systems get it right roughly 95% of the time. For a six-person panel discussion with crosstalk, accuracy drops sharply. Understanding how diarization works, where it succeeds, and where it predictably fails is what separates appropriately trusting your transcript's speaker labels from over-relying on them.
Diarization is a separate task from transcription
The key fact that almost every product obscures: transcription and diarization are distinct neural networks doing different jobs.
Transcription takes an audio waveform and outputs text. The model (Whisper, Conformer, etc.) maps acoustic features to phonemes to words. It does not need to know how many speakers there are or who they are.
Diarization takes the same audio waveform and outputs speaker labels per time segment. The model (pyannote.audio, NVIDIA TitaNet, proprietary alternatives) embeds short audio chunks as voice fingerprints, clusters the embeddings, and assigns each cluster a speaker ID. It does not need to know what was said.
The two outputs get fused at a final stage: each transcribed word gets the speaker label corresponding to its timestamp. The fusion is straightforward when both stages succeed; it gets messy when speaker boundaries fall mid-word, when speakers overlap, or when one of the two stages got something wrong.
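At its core, this fusion step is a timestamp lookup: each word gets the speaker whose diarization turn covers it. A minimal sketch, assuming word-level timestamps are available (the function name and data shapes are illustrative, not WhisperX's actual internals):

```python
def assign_speakers(words, turns):
    """Label each transcribed word with the speaker whose diarization
    turn covers the word's midpoint.
    words: [(text, start, end)]; turns: [(speaker, start, end)]."""
    labeled = []
    for text, start, end in words:
        mid = (start + end) / 2
        speaker = next(
            (spk for spk, t0, t1 in turns if t0 <= mid < t1),
            "UNKNOWN",  # word falls in a gap between detected turns
        )
        labeled.append((speaker, text))
    return labeled

words = [("Welcome", 0.0, 0.4), ("back.", 0.4, 0.8), ("Thanks!", 1.3, 1.7)]
turns = [("SPEAKER_00", 0.0, 1.1), ("SPEAKER_01", 1.1, 2.0)]
print(assign_speakers(words, turns))
# [('SPEAKER_00', 'Welcome'), ('SPEAKER_00', 'back.'), ('SPEAKER_01', 'Thanks!')]
```

Using the word midpoint rather than its start is one simple way to handle boundaries that fall mid-word; words that land in gaps or overlap regions are exactly where the fusion gets messy.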
How modern diarization actually works
The standard pipeline for speaker diarization, simplified:
- Voice activity detection (VAD): identify which segments of the audio contain speech (vs silence, music, noise)
- Segmentation: split the speech into short uniform chunks (typically 1.5-3 seconds), or use a model that detects speaker-change points
- Embedding: convert each chunk into a fixed-dimensional vector (a "voice fingerprint") using a neural network trained on speaker-recognition data. ECAPA-TDNN, x-vectors, and ResNet-based architectures are all common.
- Clustering: group the embeddings into clusters where each cluster represents one speaker. Algorithms include agglomerative hierarchical clustering, spectral clustering, and modern neural-network-based clusterers.
- Re-segmentation: refine the boundaries using the cluster assignments — a word that started during speaker A but extended into speaker B's turn gets re-assigned
- Optional: overlap detection: identify segments where two speakers talk simultaneously and label them appropriately
The end product is a sequence like: "Speaker 1 from 0:00-0:14, Speaker 2 from 0:14-0:23, Speaker 1 from 0:23-0:31..." These are then fused with the transcription to produce labeled text.
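The embedding-and-clustering core of the pipeline can be sketched with a toy greedy clusterer over voice-fingerprint vectors. This is a deliberate simplification: production systems use agglomerative or spectral clustering over ECAPA-TDNN-style embeddings, and the similarity threshold here is arbitrary.

```python
import math

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def greedy_cluster(embeddings, threshold=0.8):
    """Assign each chunk embedding to the first cluster whose centroid
    is similar enough; otherwise start a new cluster (a new speaker)."""
    centroids, labels = [], []
    for emb in embeddings:
        best, best_sim = None, threshold
        for i, c in enumerate(centroids):
            sim = cosine(emb, c)
            if sim >= best_sim:
                best, best_sim = i, sim
        if best is None:
            centroids.append(list(emb))
            labels.append(len(centroids) - 1)
        else:
            # update the centroid as a running mean (simplified)
            centroids[best] = [(c + e) / 2 for c, e in zip(centroids[best], emb)]
            labels.append(best)
    return labels

# Two chunks with near-identical fingerprints, one very different:
print(greedy_cluster([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]))
# [0, 0, 1] -> two speakers detected
```

The "similar voices" failure mode discussed below corresponds directly to two speakers' embeddings landing within the clustering threshold of each other.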
pyannote.audio: the open standard
pyannote.audio is the most widely used open-source diarization toolkit. Built on PyTorch, it provides pretrained models for VAD, segmentation, embedding, and clustering, and bundles them as a pipeline. The current version (3.x) achieves accuracy competitive with proprietary solutions on standard benchmarks.
A minimal pyannote pipeline looks like:

```python
from pyannote.audio import Pipeline

pipeline = Pipeline.from_pretrained(
    "pyannote/speaker-diarization-3.1",
    use_auth_token="YOUR_HF_TOKEN",  # required for pyannote model download
)
diarization = pipeline("interview.wav")

for turn, _, speaker in diarization.itertracks(yield_label=True):
    print(f"[{turn.start:.2f}s -> {turn.end:.2f}s] Speaker {speaker}")
```

Output is a sequence of speaker turns with their start/end times. Combined with Whisper-class transcription, you get speaker-labeled text. This two-step pipeline is what WhisperX wraps into a single function call.
WhisperX: transcription + diarization together
WhisperX is a community wrapper around Whisper that adds:
- Forced alignment using wav2vec2 for word-level timestamps (more precise than Whisper's segment-level timestamps)
- pyannote-based diarization integrated into the same pipeline
- Output formats with speaker labels per word
The full pipeline:

```python
import whisperx

device = "cuda"
audio_file = "interview.wav"
batch_size = 16
compute_type = "float16"

# 1. Transcribe with Whisper
model = whisperx.load_model("large-v3", device, compute_type=compute_type)
audio = whisperx.load_audio(audio_file)
result = model.transcribe(audio, batch_size=batch_size)

# 2. Align for word-level timestamps
model_a, metadata = whisperx.load_align_model(
    language_code=result["language"], device=device
)
result = whisperx.align(
    result["segments"], model_a, metadata, audio, device,
    return_char_alignments=False,
)

# 3. Diarize with pyannote
diarize_model = whisperx.DiarizationPipeline(
    use_auth_token="YOUR_HF_TOKEN", device=device
)
diarize_segments = diarize_model(audio_file)

# 4. Assign speaker labels to words
result = whisperx.assign_word_speakers(diarize_segments, result)

# 5. Output as Markdown with speaker labels
last_speaker = None
with open("interview.md", "w", encoding="utf-8") as f:
    for seg in result["segments"]:
        speaker = seg.get("speaker", "UNKNOWN")
        ts = f"[{int(seg['start'] // 60):02d}:{int(seg['start'] % 60):02d}]"
        if speaker != last_speaker:
            f.write(f"\n**{speaker}:** {ts} {seg['text'].strip()}\n\n")
            last_speaker = speaker
        else:
            f.write(f"{seg['text'].strip()}\n")
```

This is the complete local pipeline: it produces a speaker-labeled, timestamped Markdown transcript and runs on a consumer GPU at roughly 5-15x real-time, depending on the GPU and model size.
Accuracy by number of speakers
Diarization accuracy degrades with the number of speakers, but not linearly. The empirical pattern across published benchmarks (DIHARD III, VoxConverse, AMI):
| Speakers | Typical DER (lower is better) | Practical accuracy |
|---|---|---|
| 2 | 3-7% | ~95%, very reliable |
| 3-4 | 8-15% | 80-90%, mostly reliable |
| 5-6 | 15-25% | 70-80%, noticeable errors |
| 7+ | 25-40%+ | Unreliable, manual cleanup needed |
DER (Diarization Error Rate) is the standard metric: the percentage of total speech time assigned to the wrong speaker (or missed entirely). For a 60-minute recording at 10% DER, roughly 6 minutes of speech is misattributed somewhere in the transcript.
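The intuition behind the metric can be shown with a frame-based sketch. A real DER implementation (e.g. pyannote.metrics) also counts missed and false-alarm speech and finds the optimal mapping between hypothesis and reference labels; this toy version assumes the labels already correspond.

```python
def toy_der(reference, hypothesis, duration, step=0.1):
    """Fraction of reference speech frames whose hypothesized speaker
    is wrong. reference/hypothesis: [(speaker, start, end)]."""
    def speaker_at(turns, t):
        return next((spk for spk, s, e in turns if s <= t < e), None)

    frames = errors = 0
    for i in range(int(duration / step)):
        t = i * step
        ref = speaker_at(reference, t)
        if ref is None:
            continue  # skip non-speech frames
        frames += 1
        if speaker_at(hypothesis, t) != ref:
            errors += 1
    return errors / frames

ref = [("A", 0, 30), ("B", 30, 60)]
hyp = [("A", 0, 33), ("B", 33, 60)]   # speaker boundary off by 3 seconds
print(toy_der(ref, hyp, 60))
# 0.05 -> 5% of speech time misattributed
```

A single boundary error of a few seconds already costs measurable DER, which is why crosstalk-heavy audio with many speaker changes scores so much worse.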
The non-linearity comes from voice similarity. With two speakers, the embeddings rarely cluster ambiguously. With six speakers, two of them often have similar voice profiles (similar fundamental frequency, similar accent, similar speaking style) and the clustering struggles to separate them. For panel discussions and large meetings, this is the practical accuracy ceiling.
Where diarization predictably fails
Several scenarios reliably break diarization:
Overlapping speech
When two speakers talk simultaneously, the audio at that moment contains both voices. Standard diarization assigns the segment to one speaker, missing the other. Some systems (pyannote 3.x with overlap detection) can flag overlapping segments but still don't reliably separate the two voices' words. Crosstalk-heavy recordings (debates, animated discussions, family meals) are systematically harder.
Similar voices
Two adult male speakers with similar pitch and accent. Two siblings. Two colleagues from the same regional background. The voice embeddings end up close in the embedding space and clustering merges them or splits them incorrectly. The transcript shows fewer distinct speakers than were actually present, or assigns one speaker's words to another.
Low audio quality
Background noise, reverb, distant microphones, and codec compression all degrade voice embeddings. A phone call recording diarizes worse than a studio recording of the same conversation. The signal-to-noise ratio matters as much for diarization as for transcription — see audio quality vs transcription accuracy.
Unknown speaker count
Diarization systems either take the number of speakers as an input (pyannote can be configured this way) or estimate it from the clustering structure. The estimation is imperfect: for ambiguous audio with 4-6 speakers, the system might detect 3 or 7. Providing the speaker count, when you know it, improves accuracy substantially.
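With pyannote, the hint goes directly into the pipeline call, via `num_speakers` for an exact count or `min_speakers`/`max_speakers` for a range. A usage sketch reusing the pretrained pipeline from earlier (requires the model download and HF token, so shown without output):

```python
from pyannote.audio import Pipeline

pipeline = Pipeline.from_pretrained(
    "pyannote/speaker-diarization-3.1", use_auth_token="YOUR_HF_TOKEN"
)

# Exact count known (e.g. a two-person interview):
diarization = pipeline("interview.wav", num_speakers=2)

# Only a plausible range known (e.g. a panel of 4-6 people):
diarization = pipeline("panel.wav", min_speakers=4, max_speakers=6)
```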
Long silences
If a speaker is silent for several minutes and then returns, the system may or may not re-cluster them with their original embedding. Re-clustering generally works for the same recording but can fail on edge cases. Speakers who appear only briefly (under 30 seconds total) often get merged with another speaker.
Naming speakers post-hoc
Diarization assigns generic labels (Speaker 0, Speaker 1, Speaker 2). Replacing them with actual names is a manual post-processing step. The simple approach: identify each speaker by listening to a sample of their first few utterances, then find-and-replace across the file.
```python
def rename_speakers(md_path, name_map):
    """name_map = {'SPEAKER_00': 'Sarah', 'SPEAKER_01': 'Guest Name'}"""
    with open(md_path, encoding="utf-8") as f:
        text = f.read()
    for old, new in name_map.items():
        text = text.replace(f"**{old}:**", f"**{new}:**")
    with open(md_path, "w", encoding="utf-8") as f:
        f.write(text)

rename_speakers("interview.md", {
    "SPEAKER_00": "Host",
    "SPEAKER_01": "Dr. Smith",
})
```

For interview shows where the host is consistent across episodes, the host's voice embedding can be cached and matched automatically across new episodes — "speaker enrollment." pyannote supports this via its embedding API: the host's saved embedding is compared against the new audio's clusters, and the matching cluster gets the host label automatically.
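The matching step of enrollment reduces to a nearest-neighbor search in embedding space. A toy sketch, where the vectors stand in for real embeddings produced by pyannote's embedding model (the values and threshold are illustrative):

```python
import math

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def match_host(host_embedding, cluster_centroids, threshold=0.7):
    """Return the cluster label closest to the cached host embedding,
    or None if no cluster clears the similarity threshold."""
    best_label, best_sim = None, threshold
    for label, centroid in cluster_centroids.items():
        sim = cosine(host_embedding, centroid)
        if sim > best_sim:
            best_label, best_sim = label, sim
    return best_label

host = [0.9, 0.1, 0.05]                # cached from a past episode
clusters = {
    "SPEAKER_00": [0.1, 0.9, 0.0],     # the guest
    "SPEAKER_01": [0.88, 0.12, 0.06],  # close to the host's fingerprint
}
print(match_host(host, clusters))
# SPEAKER_01
```

The `None` fallback matters: in an episode where the host is absent, forcing a match would mislabel a guest as the host.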
Markdown representation patterns
Three common patterns for representing multiple speakers in Markdown:
Inline bold labels (most common)
```markdown
**Sarah:** [00:01:14] Welcome back to the podcast.
**Guest:** [00:01:18] Thanks for having me.
```

Compact, readable, easy to grep, and easy for LLMs to parse. The default for most production transcripts.
Block-quote per speaker
```markdown
> **Sarah** [00:01:14]: Welcome back to the podcast.
> **Guest** [00:01:18]: Thanks for having me.
```

More visually distinct in rendered Markdown, but it adds vertical space and complicates LLM parsing slightly. Useful for short transcripts where visual separation matters.
Section per speaker turn
```markdown
### Sarah [00:01:14]

Welcome back to the podcast.

### Guest [00:01:18]

Thanks for having me.
```

Most navigable for very long transcripts, but it creates an excessive number of headings. Useful only for transcripts where each turn is substantial (formal interviews) rather than rapid back-and-forth.
For most use cases, inline bold labels are the right default. mdisbetter.com's audio-to-markdown output uses this format.
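Part of what makes the inline format easy to process downstream is that each turn is one line with a fixed shape, so a regex recovers the structure. A sketch assuming the exact `**Speaker:** [HH:MM:SS] text` layout shown above:

```python
import re

# Matches lines like: **Sarah:** [00:01:14] Welcome back to the podcast.
TURN = re.compile(
    r"^\*\*(?P<speaker>[^*]+):\*\* \[(?P<ts>\d{2}:\d{2}:\d{2})\] (?P<text>.+)$"
)

transcript = """\
**Sarah:** [00:01:14] Welcome back to the podcast.
**Guest:** [00:01:18] Thanks for having me.
"""

turns = [m.groupdict() for m in map(TURN.match, transcript.splitlines()) if m]
print(turns)
```

Continuation lines (same speaker, no label) simply fail to match and would need a second rule; the point is that the format degrades gracefully under line-oriented tools.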
Cross-feature: where diarization meets transcription accuracy
The transcript's word-level accuracy and the speaker labels' accuracy degrade for related but distinct reasons. Both depend on audio quality (treated in detail at audio quality vs transcription accuracy). Diarization specifically also depends on speaker count and voice distinctiveness. A clean studio recording of a 6-person panel will have excellent word-level transcription but mediocre diarization. A noisy phone call between 2 people will have mediocre transcription but acceptable diarization.
For background on the underlying transcription model, see how AI transcription actually works. For why structuring transcripts in Markdown (with speaker labels among other elements) outperforms plain text, see Markdown vs plain text for transcripts.
Practical recommendations
- For 2-speaker interviews: trust the diarization, do a 30-second visual check on the labels, ship.
- For 3-4 speaker conversations: expect to fix 2-5 attribution errors per hour of audio. These are easy to spot on a read-through.
- For 5+ speakers: budget meaningful manual review time. Consider recording each speaker on a separate channel if the recording setup allows — single-channel-per-speaker mixes diarize trivially because each channel is one known speaker.
- For panel discussions and large meetings, multi-track recording (one mic per speaker, recorded to separate channels) is the only reliable way to get clean attribution at scale.
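With one speaker per channel, "diarization" reduces to transcribing each channel separately and interleaving the segments by timestamp. A minimal sketch (the data shapes are illustrative):

```python
def merge_channels(channel_transcripts):
    """Interleave per-channel transcripts into one labeled timeline.
    Each channel is one known speaker, so attribution is exact.
    channel_transcripts: {speaker: [(start, end, text), ...]}"""
    merged = [
        (start, end, speaker, text)
        for speaker, segments in channel_transcripts.items()
        for start, end, text in segments
    ]
    return sorted(merged)

tracks = {
    "Host": [(0.0, 4.2, "Welcome back."), (9.0, 12.5, "Great point.")],
    "Guest": [(4.5, 8.8, "Thanks for having me.")],
}
for start, end, speaker, text in merge_channels(tracks):
    print(f"**{speaker}:** [{start:.1f}s] {text}")
```

Overlapping speech is still possible in the merged timeline, but unlike single-channel diarization, each overlapping word keeps its correct speaker.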
For the user-facing tool that wraps this pipeline, see audio-to-markdown. For the broader workflow context including diarization quality expectations, see the industry-specific guides linked above.