Building a Searchable Audio Archive with AI Transcription
Most people who've used a smartphone for more than a few years are sitting on hundreds of hours of personal audio that's effectively a write-only archive. Voice memos taken at conferences and never replayed. Voicemails from family members that you'd never delete but also never re-listen to. Recorded interviews from old work projects. Meeting recordings from companies you no longer work for. Podcast episodes you produced or appeared on. The audio exists; you can't search it. The fix is the same workflow at any scale: transcribe everything to structured Markdown, organize the .md files by date or topic in a folder hierarchy, and use any text search tool — ripgrep, Obsidian, or vector embeddings — to make decades of audio finally searchable.
The audio archive problem
Audio accumulates faster than almost any other category of personal and professional data, and almost none of it is searchable. The reasons are structural:
- Audio files are large (1-100 MB each) so they tend to live across multiple devices and storage tiers — phone, laptop, external drive, cloud backup
- The content of audio is invisible to file search — you can search filenames but not what was said
- Re-listening is real-time — to find one specific moment in a 90-minute recording, you scrub for several minutes
- The cost of full transcription used to be prohibitive ($1-2/min for human transcription) so the default was to never transcribe
The result is the audio-archive paradox: the recordings exist because you valued them at recording time, but their value declines toward zero because you can never find anything in them. Modern AI transcription removes the cost and time barriers; the remaining work is organizing the resulting transcripts into a searchable corpus.
The corpus structure
One folder per recording category, with consistent date-stamped filenames. The structure scales from dozens to thousands of files without changing.
```
Audio Archive/
    Meetings/
        2024/
            2024-01-15 - Team standup.mp3
            2024-01-15 - Team standup.md
            2024-01-22 - Quarterly review.mp3
            2024-01-22 - Quarterly review.md
        2025/
        2026/
    Interviews/
        2024/
            2024-03-10 - Interview - Sarah Johnson.mp3
            2024-03-10 - Interview - Sarah Johnson.md
    Podcasts/
        Show-Notes-Generated/
            ep-001.md
            ep-002.md
    Voicemails/
        2024-Q1.md   (concatenated)
        2024-Q2.md
    Personal-VoiceMemos/
        2024-04-12 - Conference notes.mp3
        2024-04-12 - Conference notes.md
        2024-04-15 - Random idea on flight.mp3
        2024-04-15 - Random idea on flight.md
```

The audio file and its transcript live side by side. The audio is the source of record (don't delete the originals); the Markdown is the searchable index. Both are easily backed up because they're plain files on disk.
The frontmatter convention
Each transcript starts with a YAML frontmatter block holding the metadata you want to filter on later:
```markdown
---
title: Quarterly review meeting
date: 2024-01-22
type: meeting
speakers: [Alice, Bob, Carol, Dave]
duration_minutes: 67
source_file: 2024-01-22 - Quarterly review.mp3
transcribed_with: whisper-large-v3 (local)
tags: [q1-2024, planning, leadership]
---

## Opening discussion

**Alice:** [00:00:14] Welcome everyone. Let's start with the Q1 numbers.

...
```

The frontmatter is what enables structured queries later. Want every meeting where Bob spoke in 2024? Filter on type=meeting and Bob in speakers. Want every transcript over an hour long? Filter on duration_minutes > 60. Tools like Obsidian's Dataview, or ripgrep with a YAML-aware filter, can answer these queries instantly across thousands of files.
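As a sketch of what one of those queries looks like outside Obsidian, here is a minimal Python pass that reads each file's frontmatter with PyYAML and filters on the fields shown above; the archive path and the specific filter (2024 meetings where Bob spoke) are illustrative, not part of any fixed workflow:

```python
# Minimal frontmatter query: every 2024 meeting where Bob spoke.
# Assumes the folder layout and field names shown above; requires PyYAML.
from pathlib import Path
import yaml

def read_frontmatter(md_path):
    text = md_path.read_text(encoding="utf-8")
    if not text.startswith("---"):
        return {}
    try:
        _, fm, _ = text.split("---", 2)
        return yaml.safe_load(fm) or {}
    except ValueError:
        # No closing delimiter; treat as having no frontmatter
        return {}

for md_path in Path("Audio Archive").rglob("*.md"):
    fm = read_frontmatter(md_path)
    if (
        fm.get("type") == "meeting"
        and "Bob" in (fm.get("speakers") or [])
        and str(fm.get("date", "")).startswith("2024")
    ):
        print(f"{fm.get('date')}  {md_path.name}  ({fm.get('duration_minutes', '?')} min)")
```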
The batch transcription script
For an existing archive of audio files, the one-time transcription pass is a Python loop over OpenAI's open-weights Whisper. Runs locally, no audio leaves the machine, free.
```python
import whisper
from pathlib import Path
from datetime import datetime

model = whisper.load_model("large-v3")  # ~3 GB, best accuracy

def transcribe_to_markdown(audio_path, output_dir=None):
    audio_path = Path(audio_path)
    output_dir = Path(output_dir) if output_dir else audio_path.parent
    md_path = output_dir / (audio_path.stem + ".md")

    # Skip if already transcribed
    if md_path.exists():
        print(f"Skip (exists): {md_path.name}")
        return md_path

    print(f"Transcribing: {audio_path.name}")
    result = model.transcribe(str(audio_path))

    # Build frontmatter
    duration = result["segments"][-1]["end"] if result["segments"] else 0
    fm = {
        "title": audio_path.stem,
        "date": datetime.fromtimestamp(audio_path.stat().st_mtime).strftime("%Y-%m-%d"),
        "source_file": audio_path.name,
        "duration_minutes": round(duration / 60, 1),
        "language": result["language"],
        "transcribed_with": "whisper-large-v3",
    }

    with open(md_path, "w", encoding="utf-8") as f:
        f.write("---\n")
        for k, v in fm.items():
            f.write(f"{k}: {v}\n")
        f.write("---\n\n")
        f.write(f"# {audio_path.stem}\n\n")
        # One timestamped paragraph per Whisper segment
        for seg in result["segments"]:
            mins = int(seg["start"] // 60)
            secs = int(seg["start"] % 60)
            f.write(f"[{mins:02d}:{secs:02d}] {seg['text'].strip()}\n\n")

    return md_path

# Walk an entire archive
ROOT = Path("Audio Archive")
AUDIO_EXTS = {".mp3", ".m4a", ".wav", ".aac", ".flac", ".ogg", ".opus"}

for audio_path in ROOT.rglob("*"):
    if audio_path.suffix.lower() in AUDIO_EXTS:
        try:
            transcribe_to_markdown(audio_path)
        except Exception as e:
            print(f"Failed: {audio_path.name} ({e})")
```

This single script walks the archive, transcribes every audio file it finds, writes a sibling .md file with frontmatter, and skips files already transcribed. Run it once for the initial backfill of the archive; run it on a schedule afterward to handle new recordings.
For a 200-hour archive, total processing time is roughly 200 hours on CPU or 20-40 hours on a consumer GPU — typically run overnight in chunks. Once done, the archive is fully searchable.
Adding speaker diarization for multi-speaker recordings
For meeting recordings and interviews, speaker labels matter. Extending the script to use WhisperX (which bundles pyannote diarization):
```python
import whisperx
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
compute_type = "float16" if device == "cuda" else "int8"

asr_model = whisperx.load_model("large-v3", device, compute_type=compute_type)
# Alignment models are language-specific; "en" assumes English recordings
align_model, align_meta = whisperx.load_align_model(language_code="en", device=device)
diarize_pipeline = whisperx.DiarizationPipeline(
    use_auth_token="YOUR_HF_TOKEN",  # pyannote models require a Hugging Face token
    device=device,
)

def transcribe_with_speakers(audio_path):
    audio = whisperx.load_audio(str(audio_path))
    result = asr_model.transcribe(audio, batch_size=16)
    result = whisperx.align(result["segments"], align_model, align_meta, audio, device)
    diarize_segments = diarize_pipeline(str(audio_path))
    result = whisperx.assign_word_speakers(diarize_segments, result)
    return result

# Then write the result with **Speaker N:** labels at each turn change
# (see the dedicated diarization article for full output formatting)
```

The diarization step adds 20-30% to processing time but is essential for any multi-speaker archive. The pyannote details and the speaker-naming workflow are covered in speaker identification: how it works.
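As a sketch of the output step the comments above refer to, and assuming the segments returned by assign_word_speakers carry "speaker", "start", and "text" keys (segments that diarization couldn't attribute may lack a speaker), the turn-change formatting can be as simple as:

```python
# Sketch: write speaker-labeled Markdown from the aligned, diarized result.
# A new bold label is emitted only when the speaker changes between segments.
def write_speaker_markdown(result, md_path):
    current_speaker = None
    with open(md_path, "w", encoding="utf-8") as f:
        for seg in result["segments"]:
            speaker = seg.get("speaker", "UNKNOWN")
            mins, secs = int(seg["start"] // 60), int(seg["start"] % 60)
            if speaker != current_speaker:
                # New turn: bold speaker label plus a timestamp
                f.write(f"\n\n**{speaker}:** [{mins:02d}:{secs:02d}] ")
                current_speaker = speaker
            f.write(seg["text"].strip() + " ")
```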
Search level 1: ripgrep for keyword search
Once the archive is Markdown, ripgrep gives you instant keyword search across the entire corpus. From the archive root:
```bash
# Find every transcript mentioning a specific person
rg -i "sarah johnson" --type md

# Find every transcript discussing a specific topic
rg -i "product roadmap" --type md -C 3

# Find recordings from the first half of 2024 (frontmatter dates are YYYY-MM-DD)
rg -l "^date: 2024-0[1-6]" --type md

# Find every meeting where you yourself spoke about a topic
rg -B 2 "\*\*Will:\*\*.*pricing" --type md
```

For an archive of 1,000+ transcripts on a modern SSD, ripgrep returns results in under a second. This is the search experience that didn't exist when the audio files were the only artifact.
Search level 2: Obsidian for the navigable archive
Pointing Obsidian at the archive folder turns the corpus into a navigable knowledge base. The features that matter:
- Full-text search across all .md files with results in real time
- Backlink graph — when transcripts link to each other or to topic notes, the connections become visible
- Tags pane — tags from frontmatter become navigable filters
- Dataview plugin — query the archive like a database ("every meeting in 2024 where Bob and Carol both spoke")
- Quick switcher — Ctrl-O to filename-search any transcript by date or title
For most personal-archive use cases, Obsidian on top of the Markdown files is the right interface. The vault structure mirrors the folder structure; nothing about the archive is locked into Obsidian — switch tools and the files are still standard Markdown.
Search level 3: vector embeddings for semantic search
For larger archives or more nuanced queries ("what did we discuss about the strategic concerns Carol raised in Q3"), keyword search hits its ceiling. Vector embeddings make semantic search possible: the query is embedded, the corpus is embedded, and the closest matches are returned regardless of exact word match.
The minimal embedding pipeline:
```python
import chromadb
from sentence_transformers import SentenceTransformer
from pathlib import Path
import re

model = SentenceTransformer("all-MiniLM-L6-v2")  # small, fast, good for English
client = chromadb.PersistentClient(path="./audio_archive_db")
collection = client.get_or_create_collection("transcripts")

def chunk_transcript(md_path, chunk_size=500):
    """Chunk by paragraph, max ~500 words per chunk."""
    text = Path(md_path).read_text(encoding="utf-8")
    # Strip frontmatter
    text = re.sub(r"^---\n.*?\n---\n", "", text, flags=re.DOTALL)
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks = []
    current = []
    current_words = 0
    for p in paragraphs:
        words = len(p.split())
        if current_words + words > chunk_size and current:
            chunks.append("\n\n".join(current))
            current = [p]
            current_words = words
        else:
            current.append(p)
            current_words += words
    if current:
        chunks.append("\n\n".join(current))
    return chunks

# Embed every chunk of every transcript
for md_path in Path("Audio Archive").rglob("*.md"):
    chunks = chunk_transcript(md_path)
    for i, chunk in enumerate(chunks):
        embedding = model.encode(chunk).tolist()
        collection.add(
            documents=[chunk],
            embeddings=[embedding],
            metadatas=[{"source": str(md_path), "chunk": i}],
            ids=[f"{md_path.stem}_{i}"],
        )

# Query
def semantic_search(query, n=5):
    embedding = model.encode(query).tolist()
    return collection.query(query_embeddings=[embedding], n_results=n)

print(semantic_search("strategic concerns about pricing changes"))
```

The embeddings live in a local Chroma database; queries return relevant chunks with their source-file metadata, so you can navigate from a search hit back to the full transcript and the original audio. For larger archives or production use, swap Chroma for Qdrant or LanceDB; swap MiniLM for a larger embedding model like bge-large-en-v1.5.
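Whichever store you use, a search hit is only useful if it points back to its transcript. A small formatting helper over the Chroma result (a sketch against the collection built above) makes that navigation step concrete:

```python
# Sketch: print semantic-search hits with the transcript each chunk came from,
# so a hit can be followed back to the full Markdown file and original audio.
def print_hits(results):
    for doc, meta, dist in zip(
        results["documents"][0], results["metadatas"][0], results["distances"][0]
    ):
        preview = doc.replace("\n", " ")[:120]
        print(f"{dist:.3f}  {meta['source']} (chunk {meta['chunk']})")
        print(f"       {preview}...\n")

print_hits(semantic_search("strategic concerns about pricing changes"))
```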
For the parallel pattern on web-derived knowledge, see building a web knowledge base for AI — the same chunk-and-embed approach applies to URL-extracted Markdown corpora.
Voicemails: the special case
Voicemails are the highest-leverage subset of the archive for most people. They're typically short (under 2 minutes), they're personal (family, friends, colleagues), and they're the recordings most likely to acquire emotional value over time. Transcribing every voicemail makes the archive searchable by who said what when.
For iPhone users, voicemails can be exported by sharing them from the Phone app's Voicemail tab, or by extracting them from an iTunes/Finder backup. For Android, the carrier's voicemail app typically allows export, or third-party apps like YouMail save voicemails as standard audio files. Once exported, run them through the same batch transcription script. A typical archive of several hundred voicemails transcribes in under an hour.
For voicemails specifically, the speaker is usually the same per file (whoever left the message), so diarization is unnecessary. The frontmatter can capture the caller's name, allowing search by who left the message and what they said.
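A voicemail transcript's frontmatter can therefore be even simpler than the meeting convention above; the caller field and the specific values here are illustrative, not part of any fixed schema:

```yaml
---
title: Voicemail from Mom
date: 2024-02-03
type: voicemail
caller: Mom
duration_minutes: 1.5
source_file: 2024-02-03 - Mom.m4a
transcribed_with: whisper-large-v3
---
```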
Long-term archive durability
One reason Markdown is the right format for an audio archive: it will still be readable in 30 years. Plain text files in a folder you control will open in any text editor that exists in 2056. Vendor-specific formats (Notion exports, proprietary CAQDAS files, Apple Voice Memo exports) may not.
For long-term durability, the recommended setup:
- Original audio files: archive on at least two storage media (local drive + cloud backup, or external drive + offsite copy)
- Markdown transcripts: archive in the same locations, plus a Git repository for version control if changes are made over time
- Embedding database: regenerable from the transcripts, so doesn't need its own backup discipline
- Tooling state (Obsidian config, Chroma database): rebuildable; back up the source files, not the indexes
Three-decade-durable, vendor-independent, fully searchable. The Markdown substrate is the load-bearing piece.
The pipeline summary
Walk the audio archive → transcribe each file with local Whisper (or upload to audio-to-markdown for non-sensitive material) → save Markdown sibling files with frontmatter → search with ripgrep, navigate in Obsidian, semantic-search via embeddings as the use case demands. The investment is one-time backfill plus ongoing transcription of new recordings; the payoff is the audio archive becoming finally searchable. For the technical underpinnings of the transcription model itself, see how AI transcription actually works; for why structured Markdown beats plain text as the storage format, see Markdown vs plain text for transcripts; for the parallel pattern in journalism workflows, see audio to Markdown for journalists.