Technical Articles

Technical

Audio Quality vs Transcription Accuracy: Complete Guide

Signal-to-noise ratio, microphone choice (USB headsets to SM7B), room treatment, and pre-processing — what actually moves transcription accuracy from 70% to 99%.

10 min read May 2026

Technical

Building an Enterprise Document Migration Pipeline: Word to Markdown

Architecture for migrating thousands of Word documents to Markdown at enterprise scale. Audit, categorize, prioritize, batch-convert with the Pandoc CLI, quality-check, organize, publish. Real bash and Python snippets, realistic timelines.

12 min read May 2026

Technical

How We Built MDisBetter's PDF Converter: Lessons Learned

Engineering retrospective: the architecture decisions, the failure modes we hit, the accuracy improvements that actually moved the needle.

8 min read May 2026

Technical

Building a Searchable Audio Archive with AI Transcription

Decades of voicemails, meetings, podcasts, and interviews, all unindexed and unsearchable. Convert everything to Markdown, organize by date and speaker, search with ripgrep or Obsidian, and optionally embed for semantic retrieval. Includes a local Whisper batch script.

10 min read May 2026

Technical

Building a Searchable Video Library with AI Transcription

Practical guide: identify video sources, transcribe (web tool for one-offs, local yt-dlp + Whisper for batch), organize with frontmatter metadata, run full-text search with ripgrep or Obsidian, and optionally add semantic search.

11 min read May 2026

Technical

Building a Web Knowledge Base for AI: Architecture Guide

End-to-end architecture for converting web sources into a queryable AI knowledge base. Source identification, conversion, chunking, embedding, vector storage, and update strategy — with code and tool recommendations.

11 min read May 2026

Technical

Converting JavaScript-Heavy Pages to Markdown: Technical Deep Dive

Static fetch vs headless browser, Playwright/Puppeteer mechanics, wait conditions, performance and cost tradeoffs. How modern URL-to-Markdown tools handle JS-rendered SPAs.

9 min read May 2026

Technical

How AI Transcription Actually Works (Whisper, ASR, and Beyond)

Technical deep dive: from HMM-era speech recognition through encoder-decoder transformers and Whisper's 680k-hour training set, with notes on why structured Markdown output matters for downstream LLM use.

10 min read May 2026

Technical

How the DOCX Format Works Internally (And Why Conversion Is Hard)

Technical deep dive: a .docx file is a ZIP archive of XML files. Walk through document.xml, styles.xml, and the OOXML structure, and see why naive text extraction loses heading semantics and why styles.xml is the key to good Word-to-Markdown conversion.

11 min read May 2026

Technical

How PDF Works Internally (And Why Text Extraction Always Breaks)

A technical deep dive into the PDF file format: content streams, glyph positioning, why extraction is lossy, and what this means for AI workflows.

9 min read May 2026

Technical

How YouTube Transcript Extraction Actually Works

Technical deep dive: YouTube's caption system explained — auto-generated ASR vs creator-uploaded tracks, why auto-captions are unreliable, and why fresh AI re-transcription beats them on accuracy.

11 min read May 2026

Technical

How HTML to Markdown Conversion Actually Works (Under the Hood)

Technical deep dive: DOM parsing, tree-walking, element-by-element conversion rules, and why naive html2text falls short on modern web pages.

9 min read May 2026

Technical

Mammoth vs Pandoc vs AI: Word to Markdown Conversion Deep Dive

Technical comparison of the three approaches to Word-to-Markdown conversion: Mammoth.js (semantic, JS library), Pandoc (structural, multi-format CLI), and AI-powered (context-aware). When to use each, with realistic accuracy and tradeoff numbers.

11 min read May 2026

Technical

Markdown vs HTML for LLMs: Token Count Comparison (Real Numbers)

We measured token counts for HTML and Markdown versions of 5 representative web pages with tiktoken. Markdown saves 60-85% of tokens. GPT-4o cost math included.

8 min read May 2026

Technical

Markdown vs Plain Text for Transcripts: Why Structure Matters

Side-by-side: what plain-text transcripts lose, what Markdown preserves (speakers, sections, timestamps, emphasis), and the measurable LLM-extraction quality difference between the two formats.

9 min read May 2026

Technical

Markdown vs SRT vs VTT: Which Transcript Format for What?

Side-by-side comparison: SRT and VTT are subtitle formats for video player display; plain text is unstructured; Markdown gives you structure plus readability plus AI-readiness. When to use each.

10 min read May 2026

Technical

Markdown Chunking Strategies for RAG: Headers vs Tokens vs Paragraphs

Three chunking strategies for RAG pipelines: header-based, token-based, paragraph-based. When each wins, with code examples and evaluation metrics.

9 min read May 2026

Technical

Speaker Identification in Transcription: How It Works

Diarization explained: pyannote.audio vs proprietary engines, accuracy by speaker count, when it fails, and how Markdown represents multiple speakers cleanly.

9 min read May 2026

Technical

Token Count: PDF vs Markdown on 20 Real Documents (Hard Numbers)

Methodology and results from a 20-document benchmark measuring token usage on raw PDF vs Markdown for ChatGPT, Claude, and Gemini. With cost implications.

8 min read May 2026

Technical

Using Video Content in RAG Pipelines: Architecture Guide

Technical guide: integrate video content into RAG. Pipeline = video → transcript → Markdown → chunk by H2/H3 → embed → vector DB → retrieve. Multi-hour content handling, parent-document linking, real Python with sentence-transformers + ChromaDB.

12 min read May 2026

Technical

Speaker Identification in Video Transcription: How It Works

Technical deep dive: how diarization combines visual cues (face tracking, lip detection) with audio signals to label speakers in video. Realistic accuracy by speaker count and failure modes.

11 min read May 2026

Technical

Content Extraction: Readability vs Trafilatura vs AI-Powered

Technical deep dive on the main-content extraction problem. Mozilla Readability, Trafilatura, and LLM-based extraction compared: strengths, weaknesses, and when to use each.

10 min read May 2026

Technical

Why Word Tables Are the Hardest Conversion Problem (Technical)

Technical deep dive on table conversion: Word's table model supports nested tables, merged cells, multi-row headers, and complex spans. Markdown's table model is flat rows and columns. What's possible, what breaks, and the best-effort strategies to bridge the gap.

10 min read May 2026
