9 min read · MDisBetter

Speaker Identification in Transcription: How It Works

Whisper transcribes what was said. It does not, by default, tell you who said it. Separating an audio recording into segments by speaker — "speaker diarization" — is a distinct machine-learning problem with its own models, its own failure modes, and its own accuracy ceilings. For two-speaker conversations, modern diarization systems get it right roughly 95% of the time. For a six-person panel discussion with crosstalk, accuracy drops sharply. Understanding how diarization works, where it succeeds, and where it predictably fails is the difference between trusting your transcript's speaker labels and knowing when to double-check them.

Diarization is a separate task from transcription

The key fact that almost every product glosses over: transcription and diarization are distinct neural networks doing different jobs.

Transcription takes an audio waveform and outputs text. The model (Whisper, Conformer, etc.) maps acoustic features to phonemes to words. It does not need to know how many speakers there are or who they are.

Diarization takes the same audio waveform and outputs speaker labels per time segment. The model (pyannote.audio, NVIDIA TitaNet, proprietary alternatives) embeds short audio chunks as voice fingerprints, clusters the embeddings, and assigns each cluster a speaker ID. It does not need to know what was said.

The two outputs get fused at a final stage: each transcribed word gets the speaker label corresponding to its timestamp. The fusion is straightforward when both stages succeed; it gets messy when speaker boundaries fall mid-word, when speakers overlap, or when one of the two stages gets something wrong.
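
A minimal sketch of that fusion rule, assuming word-level timestamps from the transcriber and speaker turns from the diarizer (the dict shapes here are illustrative, not any particular library's schema):

def assign_speakers(words, turns):
    """Give each word the speaker whose turn contains the word's midpoint.

    words: [{"word": str, "start": float, "end": float}, ...]
    turns: [{"speaker": str, "start": float, "end": float}, ...]
    """
    labeled = []
    for w in words:
        mid = (w["start"] + w["end"]) / 2
        speaker = next(
            (t["speaker"] for t in turns if t["start"] <= mid < t["end"]),
            "UNKNOWN",  # word falls in a gap between turns
        )
        labeled.append({**w, "speaker": speaker})
    return labeled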

How modern diarization actually works

The standard pipeline for speaker diarization, simplified:

  1. Voice activity detection (VAD): identify which segments of the audio contain speech (vs silence, music, noise)
  2. Segmentation: split the speech into short uniform chunks (typically 1.5-3 seconds), or use a model that detects speaker-change points
  3. Embedding: convert each chunk into a fixed-dimensional vector (a "voice fingerprint") using a neural network trained on speaker-recognition data. ECAPA-TDNN, x-vectors, and ResNet-based architectures are all common.
  4. Clustering: group the embeddings into clusters where each cluster represents one speaker. Algorithms include agglomerative hierarchical clustering, spectral clustering, and modern neural-network-based clusterers.
  5. Re-segmentation: refine the boundaries using the cluster assignments — a word that started during speaker A but extended into speaker B's turn gets re-assigned
  6. Overlap detection (optional): identify segments where two speakers talk simultaneously and label them appropriately

The end product is a sequence like: "Speaker 1 from 0:00-0:14, Speaker 2 from 0:14-0:23, Speaker 1 from 0:23-0:31..." These are then fused with the transcription to produce labeled text.
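
To make the clustering step (step 4) concrete, here is a minimal sketch assuming scikit-learn 1.2+ and one embedding vector per chunk; the random vectors stand in for real ECAPA-TDNN output:

import numpy as np
from sklearn.cluster import AgglomerativeClustering

# Stand-in embeddings: one 192-dim vector per ~2-second chunk.
# A real pipeline gets these from a speaker-embedding network.
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(40, 192))

# No fixed speaker count: chunks within the cosine-distance
# threshold of each other end up in the same cluster (= speaker).
clusterer = AgglomerativeClustering(
    n_clusters=None,
    metric="cosine",
    linkage="average",
    distance_threshold=0.7,  # illustrative value; must be tuned
)
labels = clusterer.fit_predict(embeddings)  # e.g. [0, 0, 1, 0, ...]
print(f"Estimated speaker count: {labels.max() + 1}")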

pyannote.audio: the open standard

pyannote.audio is the most widely used open-source diarization toolkit. Built on PyTorch, it provides pretrained models for VAD, segmentation, embedding, and clustering, and bundles them as a pipeline. The current version (3.x) achieves accuracy competitive with proprietary solutions on standard benchmarks.

A minimal pyannote pipeline looks like:

from pyannote.audio import Pipeline

pipeline = Pipeline.from_pretrained(
    "pyannote/speaker-diarization-3.1",
    use_auth_token="YOUR_HF_TOKEN"  # required for pyannote model download
)

diarization = pipeline("interview.wav")

for turn, _, speaker in diarization.itertracks(yield_label=True):
    print(f"[{turn.start:.2f}s -> {turn.end:.2f}s] {speaker}")  # speaker is e.g. "SPEAKER_00"

Output is a sequence of speaker turns with their start/end times. Combined with Whisper-class transcription, you get speaker-labeled text. That two-step combination is exactly what WhisperX packages, as shown next.

WhisperX: transcription + diarization together

WhisperX is a community wrapper around Whisper that adds batched transcription on a faster-whisper backend, word-level timestamps via forced alignment with a phoneme model, and integrated pyannote diarization.

The full pipeline:

import whisperx

device = "cuda"
audio_file = "interview.wav"
batch_size = 16
compute_type = "float16"

# 1. Transcribe with Whisper
model = whisperx.load_model("large-v3", device, compute_type=compute_type)
audio = whisperx.load_audio(audio_file)
result = model.transcribe(audio, batch_size=batch_size)

# 2. Align for word-level timestamps
model_a, metadata = whisperx.load_align_model(
    language_code=result["language"], device=device
)
result = whisperx.align(
    result["segments"], model_a, metadata, audio, device,
    return_char_alignments=False
)

# 3. Diarize with pyannote
diarize_model = whisperx.DiarizationPipeline(
    use_auth_token="YOUR_HF_TOKEN", device=device
)
diarize_segments = diarize_model(audio_file)

# 4. Assign speaker labels to words
result = whisperx.assign_word_speakers(diarize_segments, result)

# 5. Output as Markdown with speaker labels
last_speaker = None
with open("interview.md", "w", encoding="utf-8") as f:
    for seg in result["segments"]:
        speaker = seg.get("speaker", "UNKNOWN")
        ts = f"[{int(seg['start']//60):02d}:{int(seg['start']%60):02d}]"
        if speaker != last_speaker:
            f.write(f"\n**{speaker}:** {ts} {seg['text'].strip()}\n\n")
            last_speaker = speaker
        else:
            f.write(f"{seg['text'].strip()}\n")

This is the complete local pipeline that produces a speaker-labeled, timestamped Markdown transcript. It runs on a consumer GPU at roughly 5-15x real-time, depending on the GPU and the model size.

Accuracy by number of speakers

Diarization accuracy degrades with the number of speakers, but not linearly. The empirical pattern across published benchmarks (DIHARD III, VoxConverse, AMI):

| Speakers | Typical DER (lower is better) | Practical accuracy |
| --- | --- | --- |
| 2 | 3-7% | ~95%, very reliable |
| 3-4 | 8-15% | 80-90%, mostly reliable |
| 5-6 | 15-25% | 70-80%, noticeable errors |
| 7+ | 25-40%+ | Unreliable, manual cleanup needed |

DER (Diarization Error Rate) is the standard metric. It counts the fraction of total speech time scored as an error, whether the speech was missed, falsely detected, or attributed to the wrong speaker. For a 60-minute recording with 10% DER, roughly 6 minutes of speech is mislabeled somewhere in the transcript.
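
The arithmetic, with made-up component values for that hypothetical hour of audio:

def der(missed, false_alarm, confusion, total_speech):
    """Diarization Error Rate as a fraction of total speech time."""
    return (missed + false_alarm + confusion) / total_speech

# Hypothetical 60-minute recording (all values in minutes):
print(der(missed=1.5, false_alarm=0.5, confusion=4.0, total_speech=60.0))
# 0.1 -> 10% DER, i.e. ~6 minutes of error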

The non-linearity comes from voice similarity. With two speakers, the embeddings rarely cluster ambiguously. With six speakers, two of them often have similar voice profiles (similar fundamental frequency, similar accent, similar speaking style) and the clustering struggles to separate them. For panel discussions and large meetings, this is the practical accuracy ceiling.

Where diarization predictably fails

Several scenarios reliably break diarization:

Overlapping speech

When two speakers talk simultaneously, the audio at that moment contains both voices. Standard diarization assigns the segment to one speaker, missing the other. Some systems (pyannote 3.x with overlap detection) can flag overlapping segments but still don't reliably separate the two voices' words. Crosstalk-heavy recordings (debates, animated discussions, family meals) are systematically harder.

Similar voices

Two adult male speakers with similar pitch and accent. Two siblings. Two colleagues from the same regional background. The voice embeddings end up close in the embedding space and clustering merges them or splits them incorrectly. The transcript shows fewer distinct speakers than were actually present, or assigns one speaker's words to another.

Low audio quality

Background noise, reverb, distant microphones, and codec compression all degrade voice embeddings. A phone call recording diarizes worse than a studio recording of the same conversation. The signal-to-noise ratio matters as much for diarization as for transcription — see audio quality vs transcription accuracy.

Unknown speaker count

Diarization systems either take the number of speakers as input (pyannote can be configured this way) or estimate it from the clustering structure. The estimation is imperfect; for ambiguous audio with 4-6 speakers, the system might detect 3 or 7. Providing the speaker count when you know it improves accuracy substantially, as in the sketch below.
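
With pyannote 3.x, the constraint is a keyword argument on the pipeline call (reusing the pipeline object loaded earlier):

# Exact count known (e.g. a two-person interview):
diarization = pipeline("interview.wav", num_speakers=2)

# Only a range known (e.g. a panel of three to six):
diarization = pipeline("panel.wav", min_speakers=3, max_speakers=6)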

Long silences

If a speaker is silent for several minutes and then returns, the system may or may not re-cluster them with their original embedding. Re-clustering generally works for the same recording but can fail on edge cases. Speakers who appear only briefly (under 30 seconds total) often get merged with another speaker.

Naming speakers post-hoc

Diarization assigns generic labels (Speaker 0, Speaker 1, Speaker 2). Replacing them with actual names is a manual post-processing step. The simple approach: identify each speaker by listening to a sample of their first few utterances, then find-and-replace across the file.

def rename_speakers(md_path, name_map):
    """Replace generic diarization labels with real names.

    name_map = {'SPEAKER_00': 'Sarah', 'SPEAKER_01': 'Guest Name'}
    """
    with open(md_path, encoding="utf-8") as f:
        text = f.read()
    for old, new in name_map.items():
        text = text.replace(f"**{old}:**", f"**{new}:**")
    with open(md_path, "w", encoding="utf-8") as f:
        f.write(text)

rename_speakers("interview.md", {
    "SPEAKER_00": "Host",
    "SPEAKER_01": "Dr. Smith",
})

For interview shows where the host is consistent across episodes, the host's voice embedding can be cached and matched automatically across new episodes — "speaker enrollment." Pyannote supports this via its embedding API; the host's saved embedding is compared against the new audio's clusters and the matching cluster gets the host label automatically.
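
A minimal sketch of that matching step, assuming a short reference clip for the host and a per-cluster sample clip exported from the new episode; the 0.5 cosine-distance threshold is an illustrative starting point to tune, not a pyannote default:

from pyannote.audio import Inference, Model
from scipy.spatial.distance import cdist

# Pretrained speaker-embedding model; window="whole" returns one
# (1 x D) embedding for the entire file.
model = Model.from_pretrained("pyannote/embedding", use_auth_token="YOUR_HF_TOKEN")
inference = Inference(model, window="whole")

host_embedding = inference("host_reference.wav")         # cached once per show
cluster_embedding = inference("episode_cluster_00.wav")  # sample of one cluster

# Small cosine distance means same voice.
distance = cdist(host_embedding, cluster_embedding, metric="cosine")[0, 0]
if distance < 0.5:
    print("SPEAKER_00 is the host")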

Markdown representation patterns

Three common patterns for representing multiple speakers in Markdown:

Inline bold labels (most common)

**Sarah:** [00:01:14] Welcome back to the podcast.

**Guest:** [00:01:18] Thanks for having me.

Compact, readable, easy to grep, easy for LLMs to parse. The default for most production transcripts.
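
"Easy to parse" is literal: a few lines of regex recover the structure, assuming the exact label format shown above:

import re

line = "**Sarah:** [00:01:14] Welcome back to the podcast."
m = re.match(r"\*\*(.+?):\*\* \[(\d{2}:\d{2}:\d{2})\] (.*)", line)
if m:
    speaker, timestamp, text = m.groups()
    # ("Sarah", "00:01:14", "Welcome back to the podcast.")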

Block-quote per speaker

> **Sarah** [00:01:14]: Welcome back to the podcast.

> **Guest** [00:01:18]: Thanks for having me.

More visually distinct in rendered Markdown but adds vertical space and complicates LLM parsing slightly. Useful for short transcripts where visual separation matters.

Section per speaker turn

### Sarah [00:01:14]
Welcome back to the podcast.

### Guest [00:01:18]
Thanks for having me.

Most navigable for very long transcripts but creates an excessive number of headings. Useful only for transcripts where each turn is substantial (formal interviews) rather than rapid back-and-forth.

For most use cases, inline bold labels are the right default. mdisbetter.com's audio-to-markdown output uses this format.

Cross-feature: where diarization meets transcription accuracy

The transcript's word-level accuracy and the speaker labels' accuracy degrade for related but distinct reasons. Both depend on audio quality (treated in detail at audio quality vs transcription accuracy). Diarization specifically also depends on speaker count and voice distinctiveness. A clean studio recording of a 6-person panel will have excellent word-level transcription but mediocre diarization. A noisy phone call between 2 people will have mediocre transcription but acceptable diarization.

For background on the underlying transcription model, see how AI transcription actually works. For why structuring transcripts in Markdown (with speaker labels among other elements) outperforms plain text, see Markdown vs plain text for transcripts.

Practical recommendations

For the user-facing tool that wraps this pipeline, see audio-to-markdown. For the broader workflow context, including what diarization quality to expect in practice, see the related guides linked above.

Frequently asked questions

Can I tell the diarizer how many speakers to expect?

Yes, and you should whenever you know it. Pyannote and most other diarization systems accept a 'num_speakers' parameter that constrains the clustering to that exact count. When you know the speaker count (a 2-person interview, a known panel size), passing it improves accuracy by removing the model's guesswork. When you don't, the system estimates it; the estimate is right most of the time but can be off by 1-2 speakers in either direction on ambiguous audio. The mdisbetter web tool auto-estimates; with command-line pyannote/WhisperX you can pass the count explicitly.

Why does the same speaker sometimes get split into two labels?

Two reasons. First, if the speaker's voice changes over the recording (cold, fatigue, emotion, a change in mic position), their embedding drifts and the clustering may treat the later utterances as a different speaker. Second, if there is a long gap between the speaker's first and second appearance, re-clustering may fail to connect them. Both failures are correctable in post-processing: listen to a sample of each label, identify which labels actually correspond to the same person, and merge them with find-and-replace. For high-stakes recordings, multi-track recording (one mic per speaker, separate channels) sidesteps the problem entirely.

Does diarization work for languages other than English?

Yes. Diarization is largely language-independent because it operates on voice characteristics rather than linguistic content. The same pyannote/ECAPA-TDNN models that work on English work on Mandarin, Spanish, French, Arabic, and so on. Accuracy may shift slightly by language because the training data for the embedding models skews toward English, but the variation is much smaller than the variation by speaker count and audio quality. For multilingual recordings (speakers using different languages), diarization typically still works because it ignores the language being spoken; voice fingerprints don't depend on which language is in use.