Transcribe an Interview with Speaker Labels (Free Guide)
Interviews live or die on attribution. "Someone said X" is useless; "the CEO said X, and the CFO immediately countered with Y" is the actual content. Speaker diarization — the AI capability that separates who said what — has gotten reliably good. Here's how to use it for journalism, qualitative research, HR, or any interview where speaker attribution matters.
What speaker diarization actually does
Speaker diarization is the part of the speech-to-text pipeline that answers "who is speaking right now?" It segments the audio into speaker turns, clusters segments that sound like the same voice under a single speaker, and produces output with explicit speaker boundaries.
Modern diarization works by analyzing voice characteristics — pitch range, formant patterns, speaking rate, microphone-specific qualities — and grouping audio segments that match. The model doesn't know who anyone is by name; it knows there are N distinct voices and labels them "Speaker 1", "Speaker 2", and so on, in order of first appearance.
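The grouping step can be pictured with a small sketch. Assume an embedder has already turned each audio segment into a fixed-length voice vector; clustering does the rest. The embedder, dimensions, and threshold below are illustrative, not any specific tool's actual pipeline:

```python
# Conceptual sketch of diarization's clustering step, assuming you already
# have one voice embedding (a fixed-length vector) per audio segment.
# Real pipelines add segmentation, overlap handling, and tuned thresholds.
import numpy as np
from sklearn.cluster import AgglomerativeClustering

# Hypothetical input: one 256-dim embedding per one-second segment.
embeddings = np.random.randn(120, 256)  # stand-in for a real embedder's output

# Group segments whose voices are similar; the distance threshold decides how
# different two voices must sound before they count as separate speakers.
clustering = AgglomerativeClustering(
    n_clusters=None,
    distance_threshold=1.0,  # tuned per embedder in practice
    metric="cosine",
    linkage="average",
)
labels = clustering.fit_predict(embeddings)

# Label speakers in order of first appearance: "Speaker 1", "Speaker 2", ...
order = {}
for cluster in labels:
    if cluster not in order:
        order[cluster] = f"Speaker {len(order) + 1}"
print([order[c] for c in labels[:10]])
```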
Accuracy on diarization specifically (separate from word-level transcription accuracy):
- Two clearly distinct voices, good audio: 95-99% correct attribution
- 3-4 speakers, good audio: 88-95%
- Similar voices (e.g., two women of similar age, or two men in a similar register): drops to 80-90%
- Cross-talk and interruption: degrades quickly
- Phone-quality audio: 5-10 percentage points worse than studio
The honest implication: diarization is right far more often than wrong on typical interview audio, but you should always verify attribution on quotes you'll publish. The transcript is a starting point, not a final source of truth on every word's owner.
How mdisbetter labels speakers
The audio-to-markdown output uses Markdown bold labels at the start of each turn:
**Speaker 1:** So tell me about the early days. What was the founding story?
**Speaker 2:** It was 2019. We were two people in a coworking space in Berlin. Honestly we had no business plan, just a problem we wanted to solve.
**Speaker 1:** And the problem was?
**Speaker 2:** Anyone working in machine learning at the time was wasting half their week on data plumbing instead of modeling.

The labels are consistent throughout the document: Speaker 1 is the same person every time they appear. This consistency is what makes the rename step a simple find-and-replace.
How to rename speakers in the .md after download
After downloading the transcript:
- Read the first few exchanges to identify which speaker is which ("Speaker 1 is the interviewer asking questions; Speaker 2 is the founder answering").
- Open the file in your editor of choice (VS Code, Obsidian, Typora, plain text editor).
- Find-and-replace each speaker label.
The replacements:
- **Speaker 1:** → **Maria Chen (interviewer):**
- **Speaker 2:** → **Klaus Weber (founder):**
- **Speaker 3:** → **Anna Schmidt (CEO):**
For three-speaker interviews, do all three replacements. Total time: under a minute.
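If you do this often, the rename is trivially scriptable. A minimal sketch; the file name and label mapping are the ones from this example, not anything the converter requires:

```python
# Minimal rename sketch: replace generic speaker labels with names and roles.
from pathlib import Path

RENAMES = {
    "**Speaker 1:**": "**Maria Chen (interviewer):**",
    "**Speaker 2:**": "**Klaus Weber (founder):**",
    "**Speaker 3:**": "**Anna Schmidt (CEO):**",
}

path = Path("2026-05-08-klaus-weber.md")
text = path.read_text(encoding="utf-8")
for old, new in RENAMES.items():
    text = text.replace(old, new)
path.write_text(text, encoding="utf-8")
```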
Important detail: include the role/title in the label, not just the name. "Maria Chen (interviewer)" beats "Maria" because you instantly see the conversational dynamic. For long interviews, this matters when re-reading months later.
Use cases by profession
Journalism
The transcript is a working draft, not a publishable artifact. Workflow:
- Record the interview (in person, by phone, or video call — all work).
- Transcribe with speaker labels.
- Read the transcript with the audio open in another window.
- Mark quotes you might use. Verify each by ear before publishing.
- Pull quotes into the article draft, attributing by name.
The discipline of ear-verification is critical. Even at 95-99% word accuracy, an hour-long interview produces 5,000-9,000 words; at a 1-5% error rate, that's anywhere from roughly 50 to 450 word-level errors. Most are inconsequential ("said" vs "says"); some change meaning. Verify before publishing.
For long-form profiles, the transcript with speaker labels also enables a useful workflow: ask Claude or ChatGPT to extract the 10 most quotable passages from the subject (Speaker 2 in our example), with one sentence of context per quote. The model uses the labels to filter — you get a curated list to choose from rather than re-reading the full transcript.
Qualitative research
For researcher interviews (UX, customer discovery, academic), the structured transcript is the input to coding analysis. Workflow:
- Conduct the interview.
- Transcribe with speaker labels.
- Add YAML frontmatter with participant ID, study, date, condition (if applicable).
- Save into your study folder. Do not include real names in transcripts kept long-term — replace with participant IDs (P01, P02, etc.) at the rename step.
- Code the transcript using whatever method your study uses (manual coding in NVivo, ATLAS.ti, Dovetail; or AI-assisted theme extraction).
The privacy consideration matters: if your IRB or ethics protocol requires de-identified data, do the de-identification at the rename step. Replace participant names, mentioned colleagues, identifiable companies, and specific locations with placeholders before saving the long-term file.
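The de-identification pass is also easy to script. A sketch under stated assumptions: the name-to-placeholder mapping is study-specific and maintained by you, and the frontmatter fields are the ones suggested above:

```python
# Sketch: prepend YAML frontmatter and swap real names for participant IDs.
# The mapping and field values are illustrative; keep the real mapping in a
# separate, access-controlled file, never inside the de-identified transcript.
from pathlib import Path

DEIDENTIFY = {
    "Klaus Weber": "P01",
    "Maria Chen": "P02",
    "Acme GmbH": "[company]",
    "Berlin": "[city]",
}

FRONTMATTER = """---
participant: P01
study: onboarding-2026
date: 2026-05-08
condition: control
---

"""

path = Path("2026-05-08-interview.md")
text = path.read_text(encoding="utf-8")
for name, placeholder in DEIDENTIFY.items():
    text = text.replace(name, placeholder)
path.write_text(FRONTMATTER + text, encoding="utf-8")
```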
HR (exit interviews, performance discussions)
HR interviews are the most sensitive use case. The transcript is high-stakes both for the employee and the organization. Considerations:
- Consent and disclosure: explicit, documented at the start of every recorded conversation.
- Storage: encrypted, access-controlled, retention-policy-compliant.
- Local-only transcription may be required by policy. A local pipeline works here: Whisper for the words plus a separate diarization model such as pyannote, since Whisper alone does not label speakers (a sketch follows this section). The setup pays back the first time policy compliance is questioned.
- Speaker labels: usually "HR" and the employee's name (or initial) suffices.
For HR specifically, treat the transcript as part of the employee record with the same handling as any other personnel document.
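A local pipeline can look like the sketch below: Whisper transcribes, pyannote diarizes, and the two are merged by timestamp overlap. This is one common arrangement, not the only one; the model choices and the midpoint merge heuristic are assumptions, and pyannote's gated model requires a Hugging Face access token:

```python
# Sketch: local transcription (Whisper) + diarization (pyannote), merged by
# timestamp overlap. Model names and the merge heuristic are illustrative.
# pip install openai-whisper pyannote.audio
import whisper
from pyannote.audio import Pipeline

AUDIO = "exit-interview.wav"

# 1. Words with timestamps, fully offline once the model is downloaded.
asr = whisper.load_model("medium")
segments = asr.transcribe(AUDIO)["segments"]

# 2. Speaker turns (requires a Hugging Face token for the gated model).
diarizer = Pipeline.from_pretrained(
    "pyannote/speaker-diarization-3.1", use_auth_token="hf_..."
)
turns = [
    (turn.start, turn.end, speaker)
    for turn, _, speaker in diarizer(AUDIO).itertracks(yield_label=True)
]

def speaker_at(t: float) -> str:
    """Return the speaker whose turn covers time t, else 'unknown'."""
    for start, end, speaker in turns:
        if start <= t <= end:
            return speaker
    return "unknown"

# 3. Attribute each Whisper segment to the speaker at its midpoint.
for seg in segments:
    mid = (seg["start"] + seg["end"]) / 2
    print(f"**{speaker_at(mid)}:** {seg['text'].strip()}")
```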
What if the diarization gets confused?
Common diarization failures and fixes:
Two similar voices merged into one speaker. Symptom: "Speaker 2" sometimes says things you know were said by a different person. Fix during cleanup: read carefully and split where attribution is wrong. Often the audio cue (slight pause, intonation change) tells you where the actual speaker change happened.
One speaker split into two. Symptom: the same person appears as "Speaker 1" early in the recording and "Speaker 3" later (because their voice quality changed — they moved closer to the mic, the audio compression kicked in, etc.). Fix: globally replace Speaker 3 with Speaker 1 if you're confident they're the same person.
Cross-talk segments wrongly attributed. Symptom: during overlap, the model picks one speaker and attributes the whole exchange to them. Fix: usually unrecoverable from the transcript alone; play the audio to disambiguate critical exchanges.
The cross-feature pattern: source documents
Interviews often reference documents — a report the subject sent, a paper they're criticizing, a contract they're describing. To have a complete source pack for the article or analysis, convert any cited PDFs through pdf-to-markdown and any cited URLs through url-to-markdown. The interview transcript plus reference documents in a single Markdown corpus enables AI synthesis you couldn't do across formats: "Compare what the interviewee said about the policy with what the policy actually says."
Pull quotes with AI assistance
Once you have a clean labeled transcript, an LLM can quickly surface high-value quotes. Useful prompt:
Read this interview transcript. From [Speaker name]'s contributions only, identify:
1. The 5 most quotable passages — sharp, complete-thought, under 30 words each.
2. The 3 most surprising claims they made (with verbatim quote).
3. Any specific named entities they cited (companies, people, papers, products) — list as a separate section.
For each quote, also include 1 sentence of context: what was being discussed when they said it.

The output is a curated extraction that takes 30 seconds vs the 30-60 minutes of manual scanning a 1-hour transcript would take. Verify any quote you'll publish against the audio.
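If you'd rather script it, the same prompt works through an API. A sketch using Anthropic's Python SDK; the model name and file path are placeholders:

```python
# Sketch: send the labeled transcript plus the extraction prompt to an LLM.
# Model name and file path are placeholders; any chat-completion API works.
from pathlib import Path
from anthropic import Anthropic

transcript = Path("2026-05-08-klaus-weber.md").read_text(encoding="utf-8")

prompt = f"""Read this interview transcript. From Klaus Weber's contributions only, identify:
1. The 5 most quotable passages - sharp, complete-thought, under 30 words each.
2. The 3 most surprising claims they made (with verbatim quote).
3. Any specific named entities they cited - list as a separate section.
For each quote, include 1 sentence of context.

{transcript}"""

client = Anthropic()  # reads ANTHROPIC_API_KEY from the environment
response = client.messages.create(
    model="claude-sonnet-4-20250514",  # placeholder; use a current model
    max_tokens=2000,
    messages=[{"role": "user", "content": prompt}],
)
print(response.content[0].text)
```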
Storage and organization
For an interview-heavy practice (journalism, research), the file structure that scales:
interviews/
  2026-05/
    2026-05-08-klaus-weber.md
    2026-05-08-klaus-weber.m4a
    2026-05-09-anna-schmidt.md
    2026-05-09-anna-schmidt.m4a

YAML frontmatter on each .md with at minimum: date, subject, role/affiliation, duration, project (if part of a series). Adding tags for topics covered makes thematic queries possible later. For the broader file-organization pattern, see "you can't search audio recordings".
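Once the frontmatter is in place, thematic queries are a few lines of scripting. A sketch; the field names match the suggestions above, and the tag is illustrative:

```python
# Sketch: list every interview tagged with a given topic by reading the
# YAML frontmatter from each .md file under interviews/.
# pip install pyyaml
from pathlib import Path
import yaml

def frontmatter(path: Path) -> dict:
    """Parse the YAML block between the leading '---' fences, if present."""
    text = path.read_text(encoding="utf-8")
    if not text.startswith("---"):
        return {}
    _, block, _ = text.split("---", 2)
    return yaml.safe_load(block) or {}

for md in sorted(Path("interviews").rglob("*.md")):
    meta = frontmatter(md)
    if "pricing" in (meta.get("tags") or []):  # illustrative tag
        print(md, "-", meta.get("subject"))
```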
Working with translators
If your interview is in a language you don't speak fluently, transcribe in the original language first (the audio-to-markdown converter auto-detects language for the major languages), then translate the transcript. Translation works much better on text than on audio because:
- You can review and correct the source transcript.
- Translation tools and human translators both work natively on text.
- The translated version preserves the speaker structure of the original.
The two-step workflow (transcribe in source language, translate to target language) produces materially better results than direct audio-to-translated-text approaches.
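The structure-preserving part is mechanical: translate turn by turn and keep the labels untouched. A sketch with a stand-in `translate()`; swap in whichever translation API or human workflow you actually use:

```python
# Sketch: translate a labeled transcript turn by turn, keeping the
# **Speaker N:** labels as-is. translate() is a stand-in, not a real API.
import re
from pathlib import Path

def translate(text: str, target: str = "en") -> str:
    # Stand-in: wire up your translation API or hand off to a translator.
    return text  # identity placeholder

LABEL = re.compile(r"^(\*\*.+?\*\*)\s*(.*)$")  # e.g. "**Speaker 1:** ..."

lines_out = []
for line in Path("interview-de.md").read_text(encoding="utf-8").splitlines():
    m = LABEL.match(line)
    if m:
        label, turn = m.groups()
        lines_out.append(f"{label} {translate(turn)}")
    else:
        lines_out.append(line)

Path("interview-en.md").write_text("\n".join(lines_out), encoding="utf-8")
```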
Common pitfalls
Forgetting to add role labels. A transcript with "**Maria:**" and "**Klaus:**" reads fine for you today; six months later when you've forgotten which Maria, the role label ("Maria Chen, interviewer" / "Klaus Weber, founder") matters. Always include role.
Trusting the diarization completely. Even at 95% accuracy, that means roughly 1 in 20 turns is mis-attributed. For high-stakes attribution (publishable quotes), spot-check.
Losing the audio. Keep the audio file as the source of truth. The transcript is derivative; if you ever need to verify a contested quote, the audio settles it. Storage is cheap; original recordings are irreplaceable.
Recommendation
The workflow is the same regardless of profession: record cleanly, transcribe with diarization, rename speakers immediately, save both audio and Markdown together with frontmatter. The structured Markdown becomes the working document for everything downstream — quote extraction, analysis, AI synthesis, sharing with collaborators. The full interview library, structured this way, becomes a queryable archive after your first dozen interviews; the compounding value past that grows steeply.