Building a Searchable Audio Archive with AI Transcription

Most people who've used a smartphone for more than a few years are sitting on hundreds of hours of personal audio that's effectively a write-only archive. Voice memos taken at conferences and never replayed. Voicemails from family members that you'd never delete but also never re-listen to. Recorded interviews from old work projects. Meeting recordings from companies you no longer work for. Podcast episodes you produced or appeared on. The audio exists; you can't search it. The fix is the same workflow at any scale: transcribe everything to structured Markdown, organize the .md files by date or topic in a folder hierarchy, and layer search on top (ripgrep for keywords, Obsidian for navigation, vector embeddings for semantic queries) to make decades of audio finally searchable.

The audio archive problem

Audio is the fastest-accumulating yet least searchable category of personal and professional data. The reasons are structural:

- Audio is linear: finding one remark in a two-hour recording means listening, or at least scrubbing, in real time.
- There is no text layer to index, so no search tool can touch the content.
- Manual transcription has historically been too slow and too expensive to apply to an entire archive.

The result is the audio-archive paradox: the recordings exist because you valued them at recording time, but their value declines toward zero because you can never find anything in them. Modern AI transcription removes the cost and time barriers; the remaining work is organizing the resulting transcripts into a searchable corpus.

The corpus structure

One folder per recording category, with consistent date-stamped filenames. The structure scales from dozens to thousands of files without changing.

Audio Archive/
  Meetings/
    2024/
      2024-01-15 - Team standup.mp3
      2024-01-15 - Team standup.md
      2024-01-22 - Quarterly review.mp3
      2024-01-22 - Quarterly review.md
    2025/
    2026/
  Interviews/
    2024/
      2024-03-10 - Interview - Sarah Johnson.mp3
      2024-03-10 - Interview - Sarah Johnson.md
  Podcasts/
    Show-Notes-Generated/
      ep-001.md
      ep-002.md
  Voicemails/
    2024-Q1.md  (concatenated)
    2024-Q2.md
  Personal-VoiceMemos/
    2024-04-12 - Conference notes.mp3
    2024-04-12 - Conference notes.md
    2024-04-15 - Random idea on flight.mp3
    2024-04-15 - Random idea on flight.md

The audio file and its transcript live side by side. The audio is the source of record (don't delete the originals); the Markdown is the searchable index. Both are easily backed up because they're plain files on disk.

The frontmatter convention

Each transcript starts with a YAML frontmatter block holding the metadata you want to filter on later:

---
title: Quarterly review meeting
date: 2024-01-22
type: meeting
speakers: [Alice, Bob, Carol, Dave]
duration_minutes: 67
source_file: 2024-01-22 - Quarterly review.mp3
transcribed_with: whisper-large-v3 (local)
tags: [q1-2024, planning, leadership]
---

## Opening discussion

**Alice:** [00:00:14] Welcome everyone. Let's start with the Q1 numbers.
...

The frontmatter is what enables structured queries later. Want every meeting where Bob spoke in 2024? Filter on type=meeting and Bob in speakers. Want every transcript over an hour long? Filter on duration_minutes > 60. Tools like Obsidian's Dataview, or ripgrep with a YAML-aware filter, can answer these queries instantly across thousands of files.
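
As a minimal sketch of the "YAML-aware filter" approach in plain Python (assuming PyYAML is installed and every transcript follows the frontmatter convention above):

import yaml  # PyYAML
from pathlib import Path

def load_frontmatter(md_path):
    text = md_path.read_text(encoding="utf-8")
    parts = text.split("---", 2)  # frontmatter sits between the first two --- markers
    if len(parts) < 3:
        return {}
    return yaml.safe_load(parts[1]) or {}

# Every 2024 meeting where Bob spoke
for md_path in Path("Audio Archive").rglob("*.md"):
    fm = load_frontmatter(md_path)
    if (fm.get("type") == "meeting"
            and "Bob" in (fm.get("speakers") or [])
            and str(fm.get("date", "")).startswith("2024")):
        print(md_path)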

The batch transcription script

For an existing archive of audio files, the one-time transcription pass is a Python loop over OpenAI's open-weights Whisper model. It runs locally, so no audio leaves the machine, and it costs nothing beyond compute time.

import whisper
from pathlib import Path
from datetime import datetime

# Requires: pip install openai-whisper, plus ffmpeg on the PATH for decoding
model = whisper.load_model("large-v3")  # ~3 GB download, best accuracy

def transcribe_to_markdown(audio_path, output_dir=None):
    audio_path = Path(audio_path)
    output_dir = Path(output_dir) if output_dir else audio_path.parent
    md_path = output_dir / (audio_path.stem + ".md")
    
    # Skip if already transcribed
    if md_path.exists():
        print(f"Skip (exists): {md_path.name}")
        return md_path
    
    print(f"Transcribing: {audio_path.name}")
    result = model.transcribe(str(audio_path))
    
    # Build frontmatter
    duration = result["segments"][-1]["end"] if result["segments"] else 0
    fm = {
        "title": audio_path.stem,
        "date": datetime.fromtimestamp(audio_path.stat().st_mtime).strftime("%Y-%m-%d"),
        "source_file": audio_path.name,
        "duration_minutes": round(duration / 60, 1),
        "language": result["language"],
        "transcribed_with": "whisper-large-v3",
    }
    
    with open(md_path, "w", encoding="utf-8") as f:
        f.write("---\n")
        for k, v in fm.items():
            f.write(f"{k}: {v}\n")
        f.write("---\n\n")
        f.write(f"# {audio_path.stem}\n\n")
        for seg in result["segments"]:
            mins = int(seg["start"] // 60)
            secs = int(seg["start"] % 60)
            f.write(f"[{mins:02d}:{secs:02d}] {seg['text'].strip()}\n\n")
    
    return md_path

# Walk an entire archive
ROOT = Path("Audio Archive")
AUDIO_EXTS = {".mp3", ".m4a", ".wav", ".aac", ".flac", ".ogg", ".opus"}

for audio_path in ROOT.rglob("*"):
    if audio_path.suffix.lower() in AUDIO_EXTS:
        try:
            transcribe_to_markdown(audio_path)
        except Exception as e:
            print(f"Failed: {audio_path.name} ({e})")

This single script walks the archive, transcribes every audio file it finds, writes a sibling .md file with frontmatter, and skips files already transcribed. Run it once for the initial backfill of the archive; run it on a schedule afterward to handle new recordings.

For a 200-hour archive, total processing time is roughly 200 hours on CPU or 20-40 hours on a consumer GPU — typically run overnight in chunks. Once done, the archive is fully searchable.
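
One way to run that backfill in overnight chunks is to time-box each run; because the script skips files that already have a transcript, every rerun resumes where the previous one stopped. A sketch, reusing ROOT, AUDIO_EXTS, and transcribe_to_markdown from above:

import time

BUDGET_SECONDS = 8 * 3600  # roughly one overnight run

start = time.monotonic()
for audio_path in ROOT.rglob("*"):
    if time.monotonic() - start > BUDGET_SECONDS:
        print("Time budget reached; rerun tomorrow to resume.")
        break
    if audio_path.suffix.lower() in AUDIO_EXTS:
        try:
            transcribe_to_markdown(audio_path)
        except Exception as e:
            print(f"Failed: {audio_path.name} ({e})")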

Adding speaker diarization for multi-speaker recordings

For meeting recordings and interviews, speaker labels matter. Extending the script to use WhisperX (which bundles pyannote diarization):

import whisperx
import torch
from pathlib import Path

device = "cuda" if torch.cuda.is_available() else "cpu"
compute_type = "float16" if device == "cuda" else "int8"

asr_model = whisperx.load_model("large-v3", device, compute_type=compute_type)
align_model, align_meta = whisperx.load_align_model("en", device)  # alignment models are per-language
diarize_pipeline = whisperx.DiarizationPipeline(
    use_auth_token="YOUR_HF_TOKEN",  # pyannote's diarization models are gated; accept their terms on Hugging Face first
    device=device,
)

def transcribe_with_speakers(audio_path):
    audio = whisperx.load_audio(str(audio_path))
    result = asr_model.transcribe(audio, batch_size=16)
    result = whisperx.align(result["segments"], align_model, align_meta, audio, device)
    diarize_segments = diarize_pipeline(str(audio_path))
    result = whisperx.assign_word_speakers(diarize_segments, result)
    return result

# Then write the result with **Speaker N:** labels at each turn change
# (see the dedicated diarization article for full output formatting)
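
A minimal sketch of that writer, assuming the segments carry the "speaker" key that assign_word_speakers attaches (labels like SPEAKER_00; mapping them to real names is the speaker-naming workflow mentioned below):

def write_speaker_markdown(result, md_path):
    with open(md_path, "w", encoding="utf-8") as f:
        last_speaker = None
        for seg in result["segments"]:
            speaker = seg.get("speaker", "UNKNOWN")
            mins = int(seg["start"] // 60)
            secs = int(seg["start"] % 60)
            if speaker != last_speaker:
                # New turn: start a fresh paragraph with a bold speaker label
                prefix = "\n\n" if last_speaker else ""
                f.write(f"{prefix}**{speaker}:** [{mins:02d}:{secs:02d}] ")
                last_speaker = speaker
            f.write(seg["text"].strip() + " ")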

The diarization step adds 20-30% to processing time but is essential for any multi-speaker archive. The pyannote details and the speaker-naming workflow are covered in speaker identification: how it works.

Search level 1: ripgrep for keyword search

Once the archive is Markdown, ripgrep gives you instant keyword search across the entire corpus. From the archive root:

# Find every transcript mentioning a specific person
rg -i "sarah johnson" --type md

# Find every transcript discussing a specific topic
rg -i "product roadmap" --type md -C 3

# Find transcripts from the first half of 2024 (frontmatter dates are YYYY-MM-DD)
rg -l "date: 2024-0[1-6]" --type md

# Find every transcript where a given speaker (here, Will) discussed pricing
rg -B 2 "\*\*Will:\*\*.*pricing" --type md

For an archive of 1,000+ transcripts on a modern SSD, ripgrep returns results in under a second. This is the search experience that didn't exist when the audio files were the only artifact.

Search level 2: Obsidian for the navigable archive

Pointing Obsidian at the archive folder turns the corpus into a navigable knowledge base. The features that matter:

- Full-text search with previews across every transcript in the vault.
- Dataview queries over the frontmatter (every meeting Bob attended, every transcript over an hour).
- Frontmatter tags become a browsable index that cuts across folders.
- Links between transcripts, so recurring people and topics connect across recordings.

For most personal-archive use cases, Obsidian on top of the Markdown files is the right interface. The vault structure mirrors the folder structure, and nothing about the archive is locked into Obsidian: switch tools and the files are still standard Markdown.

Search level 3: vector embeddings for semantic search

For larger archives or more nuanced queries ("what did we discuss about the strategic concerns Carol raised in Q3"), keyword search hits its ceiling. Vector embeddings make semantic search possible: the query is embedded, the corpus is embedded, and the closest matches are returned regardless of exact word match.

The minimal embedding pipeline:

import chromadb
from sentence_transformers import SentenceTransformer
from pathlib import Path
import re

model = SentenceTransformer("all-MiniLM-L6-v2")  # small, fast, good for English
client = chromadb.PersistentClient(path="./audio_archive_db")
collection = client.get_or_create_collection("transcripts")

def chunk_transcript(md_path, chunk_size=500):
    """Chunk by paragraph, max ~500 words per chunk."""
    text = md_path.read_text(encoding="utf-8")
    # Strip frontmatter
    text = re.sub(r"^---\n.*?\n---\n", "", text, flags=re.DOTALL)
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks = []
    current = []
    current_words = 0
    for p in paragraphs:
        words = len(p.split())
        if current_words + words > chunk_size and current:
            chunks.append("\n\n".join(current))
            current = [p]
            current_words = words
        else:
            current.append(p)
            current_words += words
    if current:
        chunks.append("\n\n".join(current))
    return chunks

for md_path in Path("Audio Archive").rglob("*.md"):
    chunks = chunk_transcript(md_path)
    for i, chunk in enumerate(chunks):
        embedding = model.encode(chunk).tolist()
        # upsert rather than add, so re-running the indexer is idempotent;
        # path-based ids avoid collisions between same-named files in different folders
        collection.upsert(
            documents=[chunk],
            embeddings=[embedding],
            metadatas=[{"source": str(md_path), "chunk": i}],
            ids=[f"{md_path.as_posix()}::{i}"],
        )

# Query
def semantic_search(query, n=5):
    embedding = model.encode(query).tolist()
    return collection.query(query_embeddings=[embedding], n_results=n)

print(semantic_search("strategic concerns about pricing changes"))

The embeddings live in a local Chroma database; queries return relevant chunks with their source-file metadata, so you can navigate from a search hit back to the full transcript and the original audio. For larger archives or production use, swap Chroma for Qdrant or LanceDB; swap MiniLM for a larger embedding model like bge-large-en-v1.5.
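
A quick usage sketch: Chroma's query response holds parallel lists per query, so hits and their metadata line up by index.

hits = semantic_search("strategic concerns about pricing changes")
for doc, meta in zip(hits["documents"][0], hits["metadatas"][0]):
    # Each hit carries its source path, so a result links straight
    # back to the full transcript (and from there to the audio)
    print(f"--- {meta['source']} (chunk {meta['chunk']})")
    print(doc[:200], "...\n")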

For the parallel pattern on web-derived knowledge, see building a web knowledge base for AI — the same chunk-and-embed approach applies to URL-extracted Markdown corpora.

Voicemails: the special case

Voicemails are the highest-leverage subset of the archive for most people. They're typically short (under 2 minutes), they're personal (family, friends, colleagues), and they're the recordings most likely to acquire emotional value over time. Transcribing every voicemail makes the archive searchable by who said what when.

For iPhone users, voicemails can be exported from the Phone app's Voicemail tab (share or save each message) or via iTunes/Finder backup extraction. For Android, the carrier's voicemail app typically allows export, or third-party apps like YouMail save voicemails as standard audio files. Once exported, run them through the same batch transcription script. A typical archive of several hundred voicemails transcribes in under an hour.

For voicemails specifically, the speaker is usually the same per file (whoever left the message), so diarization is unnecessary. The frontmatter can capture the caller's name, allowing search by who left the message and what they said.
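
A voicemail transcript's frontmatter might then look like this (the caller field is the convention suggested above; the batch script doesn't emit it automatically):

---
title: Voicemail - Mom
date: 2024-02-03
type: voicemail
caller: Mom
duration_minutes: 1.4
source_file: 2024-02-03 - Voicemail - Mom.m4a
transcribed_with: whisper-large-v3
---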

Long-term archive durability

One reason Markdown is the right format for an audio archive: it will still be readable in 30 years. Plain text files in a folder you control will open in any text editor that exists in 2056. Vendor-specific formats (Notion exports, proprietary CAQDAS files, Apple Voice Memo exports) may not.

For long-term durability, the recommended setup:

- Original audio and Markdown transcripts side by side, as plain files in a folder hierarchy you control.
- No database or proprietary container in the storage path; the folder is the archive.
- Backups in at least two separate locations; the transcripts are small enough to keep everywhere.

Three-decade-durable, vendor-independent, fully searchable. The Markdown substrate is the load-bearing piece.

The pipeline summary

Walk the audio archive → transcribe each file with local Whisper (or upload to audio-to-markdown for non-sensitive material) → save Markdown sibling files with frontmatter → search with ripgrep, navigate in Obsidian, semantic-search via embeddings as the use case demands. The investment is one-time backfill plus ongoing transcription of new recordings; the payoff is the audio archive becoming finally searchable. For the technical underpinnings of the transcription model itself, see how AI transcription actually works; for why structured Markdown beats plain text as the storage format, see Markdown vs plain text for transcripts; for the parallel pattern in journalism workflows, see audio to Markdown for journalists.

Frequently asked questions

How long does it take to transcribe a 100-hour archive?
On a modern consumer GPU (RTX 3060/4060 or better), Whisper large-v3 runs at roughly 5-10x real-time, so a 100-hour archive takes 10-20 hours of compute. On CPU only (modern Mac M-series or recent Intel/AMD), expect roughly real-time, so 100 hours of audio takes 100 hours of compute — typically run overnight in chunks over a week. Smaller Whisper models (medium, small) run 2-4x faster with modest accuracy reduction; for archive backfill where you want everything searchable but accuracy matters less than coverage, the medium model is a reasonable trade.
How much disk space does the resulting archive take?
Markdown transcripts are tiny compared to the source audio. A 60-minute transcript is typically 30-80 KB; the source MP3 might be 50-100 MB. The complete transcript collection for a 1,000-hour audio archive lives in roughly 30-80 MB of Markdown — small enough to keep on every device you own and to back up to multiple locations trivially. The source audio is the storage-heavy part; the transcripts are practically free in storage terms.
Can I add transcripts retroactively to old recordings I no longer have device access to?
Yes, as long as you have the audio files in any standard format (MP3, M4A, WAV, FLAC, OGG, AAC). The batch script doesn't care where the files came from: phone exports, computer backups, downloaded podcast files, digitized cassette tapes, archived voicemails from a former carrier. The age of the recording doesn't matter; the audio quality does. Older recordings tend to transcribe less accurately due to compression artifacts and degraded source quality, but the transcripts are still useful for keyword search even at 70-85% word-level accuracy.