URL to Markdown for Developer Documentation
Developer documentation lives in a dozen incompatible places: Confluence pages your last platform team set up, a Notion workspace from the brief year you tried Notion, READMEs scattered across 40 repos, an internal wiki nobody updates, and the public docs site nobody reads. Every time you migrate, every time you wire up an AI assistant, every time someone asks "where is the runbook for X" — you hit the same problem. The content exists, but not in a format anything can actually consume. Markdown is the one format that every static site generator, every LLM, every diff tool, and every grep query speaks fluently. Here's how to get there from any URL.
The docs-as-code argument, in 30 seconds
If your documentation lives in a WYSIWYG tool (Confluence, Notion, Google Docs, SharePoint), you've already lost three things engineers care about: version control, code review, and grep. You can't git blame a Confluence page. You can't open a PR against a Notion doc. You can't rg "deprecated" across a SharePoint site. Docs-as-code — Markdown files in a Git repo, rendered by Hugo/Docusaurus/MkDocs/Astro — gives you all three back.
The blocker is rarely "convince the team." The blocker is the migration. You have 800 Confluence pages and no realistic way to move them. URL-to-Markdown collapses the migration cost to the runtime of a script.
Migration patterns by source platform
Confluence to docs-as-code
Confluence's REST API gives you the page tree (/rest/api/content) and rendered HTML for each page. The naive approach is html2md on the rendered output — but Confluence HTML is full of macro div soup, table garbage, and emoji-as-image hacks that pollute the result. Feed the rendered URL through our URL-to-Markdown converter instead and you get clean, readable output that's actually mergeable into a Hugo content directory. The full walkthrough is in URL to Markdown for website content migration.
Notion to MDX
Notion exports are notoriously messy — the official Markdown export loses callouts, breaks tables, and produces filenames with UUIDs appended. Pulling each page via its public URL through a converter is often cleaner than the official export, especially for pages with embedded databases or sync blocks.
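If you do start from the official export, the UUID-suffixed filenames are at least mechanically fixable. A minimal cleanup sketch, assuming the export layout Notion produces today (a 32-character hex UUID appended to each file and folder name) — the function name and paths are illustrative, not a real library API:

```python
import re
from pathlib import Path

# Notion appends a 32-char hex UUID to exported file and folder names,
# e.g. "Getting Started 0123456789abcdef0123456789abcdef.md".
UUID_SUFFIX = re.compile(r"\s+[0-9a-f]{32}(?=\.md$|$)")

def strip_notion_uuids(export_dir: str) -> None:
    # Rename deepest paths first so parent-directory renames
    # don't invalidate the child paths we collected up front.
    for path in sorted(Path(export_dir).rglob("*"), key=lambda p: -len(p.parts)):
        cleaned = UUID_SUFFIX.sub("", path.name)
        if cleaned != path.name:
            path.rename(path.with_name(cleaned))
```

Run it once over the export directory before committing; link targets inside the files will need the same substitution applied to their text.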
Public docs sites you don't own
Sometimes you need a third-party API's docs in your own knowledge base — for offline reading, for internal redistribution, for feeding to an AI assistant. URL-to-Markdown handles this in one call per page. Crawl the sitemap, convert each URL, commit to your repo. Done.
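The "crawl the sitemap" step is a few lines of standard-library Python. A sketch — the sitemap URL is a placeholder, and each extracted URL would then be fed to whatever converter call you use:

```python
import xml.etree.ElementTree as ET

# Sitemaps declare this namespace; findall needs it to match <loc>.
NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

def sitemap_urls(sitemap_xml: str) -> list[str]:
    """Extract every <loc> URL from a sitemap.xml document."""
    root = ET.fromstring(sitemap_xml)
    return [loc.text for loc in root.findall(".//sm:loc", NS)]
```

Fetch `https://docs.example.com/sitemap.xml` (hypothetical), pass the body through `sitemap_urls`, then loop: convert each URL, write the Markdown, commit.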
The internal docs portal pattern
Most engineering orgs end up wanting one searchable surface for everything: API references, architecture decision records (ADRs), onboarding guides, runbooks, postmortems, RFCs. The cleanest pattern is:
- Markdown files in a monorepo (or a single `docs` repo)
- Backstage, Docusaurus, or a custom Astro site renders the tree
- Algolia DocSearch (or Meilisearch self-hosted) indexes everything
- CI rebuilds on every merge to `main`
The hard part is step 1 when half your existing docs aren't Markdown. URL-to-Markdown is the bridge: point it at every legacy URL once, commit the output, deprecate the old surface.
Feeding internal docs to AI coding assistants
Cursor, Copilot Workspace, Claude Code, Continue, Cody — every modern AI coding assistant gets dramatically more useful when it has your team's docs in context. The default UX is some flavor of "point me at a URL" or "index this folder." Both work better when the source is Markdown.
HTML scraping inside the assistant tends to drag in nav, footers, sidebar TOCs, cookie banners, and analytics scripts — all of which eat tokens and dilute retrieval relevance. Pre-converting your URLs to Markdown and indexing the clean output gives you noticeably better answers per token spent. For RAG pipelines specifically, see our RAG pipeline guide — the same patterns apply when your sources are URLs instead of PDFs.
Concrete workflow: migrate a Confluence space to Docusaurus
Suppose you've got a 300-page Confluence space called ENG and you want it rendered by Docusaurus, served at docs.yourcompany.com, searchable via Algolia, and version-controlled in github.com/yourcompany/docs. End-to-end:
```python
import os
import requests
from mdisbetter import url_to_markdown  # pseudo-code; swap in your converter's real API call

CONFLUENCE_BASE = "https://yourcompany.atlassian.net/wiki"
SPACE_KEY = "ENG"
OUT_DIR = "./docs"
USER = os.environ["CONFLUENCE_USER"]    # API credentials from the environment,
TOKEN = os.environ["CONFLUENCE_TOKEN"]  # never hardcoded in the repo

# 1. List all pages in the space
# (for spaces larger than the limit, paginate with the `start` param)
pages = requests.get(
    f"{CONFLUENCE_BASE}/rest/api/content",
    params={"spaceKey": SPACE_KEY, "limit": 500},
    auth=(USER, TOKEN),
).json()["results"]

# 2. Convert each page's public URL to Markdown
for p in pages:
    url = f"{CONFLUENCE_BASE}/spaces/{SPACE_KEY}/pages/{p['id']}"
    md = url_to_markdown(url, include_frontmatter=True)
    # naive slug; real-world titles may need stricter sanitization
    slug = p["title"].lower().replace(" ", "-")
    path = os.path.join(OUT_DIR, f"{slug}.md")
    with open(path, "w", encoding="utf-8") as f:
        f.write(md)

# 3. git add docs/ && git commit && git push
# 4. Docusaurus picks it up on next build
```
You now have 300 reviewable, diffable, greppable Markdown files. ADR-001 lives next to the API reference. The runbook for the payments service is one cmd-P away in any editor. New hires read onboarding docs in the same UI they read code. Algolia indexes the build artifact and search returns answers from the right page in 200ms.
Frontmatter conventions worth adopting
Once you're on Markdown, frontmatter becomes your metadata layer. A schema worth standardizing on:
```yaml
---
title: Payment Service Runbook
slug: runbooks/payments
owner: payments-team
status: stable  # draft | stable | deprecated
last_reviewed: 2026-04-12
related:
  - runbooks/billing
  - rfcs/0042-idempotency-keys
tags: [payments, on-call, runbook]
---
```
This unlocks a few things: a CI check that fails any doc whose last_reviewed is older than 12 months, a sidebar that groups by tags, ownership routing for stale-docs Slack reminders, and a static analysis step that flags broken cross-references in related.
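The broken cross-reference check is small enough to sketch here. Assuming frontmatter has already been parsed into dicts (with PyYAML or similar — the function below is a hypothetical helper, not part of any tool mentioned above), it is pure set arithmetic:

```python
def broken_references(docs: dict[str, dict]) -> list[tuple[str, str]]:
    """docs maps file path -> parsed frontmatter dict.
    Returns (path, ref) pairs where `related` points at no known slug."""
    slugs = {fm.get("slug") for fm in docs.values()}
    return [
        (path, ref)
        for path, fm in docs.items()
        for ref in fm.get("related", [])
        if ref not in slugs
    ]
```

Wire it into CI next to the staleness check: a nonzero result fails the build before a dangling `related` entry ships.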
Handling code blocks, diagrams, and embeds
Developer docs are heavy on three things HTML-based wikis handle badly and Markdown handles natively (or via well-known extensions):
- Code blocks: triple-backtick fenced blocks with language hints, rendered by Prism or Shiki. Syntax highlighting that survives migration is non-negotiable.
- Diagrams: Mermaid blocks. The diagram source lives in the Markdown, gets rendered at build time, and stays in version control. No more "the architecture diagram is on someone's laptop."
- Embeds: For interactive elements (React playgrounds, API explorers), MDX gives you JSX inside Markdown. Docusaurus and Astro both support this natively.
API reference docs: a special case
For OpenAPI/Swagger specs, generate Markdown from the spec instead of converting the rendered Swagger UI. widdershins (OpenAPI to Markdown) is the standard tool. URL-to-Markdown is for the surrounding narrative docs — getting started guides, authentication tutorials, conceptual explanations — that wrap the auto-generated reference.
Stale-docs detection as a CI check
Once your docs are Markdown with structured frontmatter, you can write a CI job that fails the build (or just posts a Slack reminder) for any document whose last_reviewed is older than your team's freshness threshold. A short script in your docs repo:
```python
from datetime import date, timedelta
from pathlib import Path
import sys

import yaml

THRESHOLD = timedelta(days=365)

stale = []
for md in Path("docs").rglob("*.md"):
    text = md.read_text(encoding="utf-8")
    if not text.startswith("---"):
        continue  # no frontmatter, nothing to check
    fm = yaml.safe_load(text.split("---")[1])
    reviewed = fm.get("last_reviewed")  # PyYAML parses ISO dates as datetime.date
    if reviewed and (date.today() - reviewed) > THRESHOLD:
        stale.append((md, reviewed, fm.get("owner", "unknown")))

for path, reviewed, owner in stale:
    print(f"STALE: {path} (owner: {owner}, last reviewed {reviewed})")

sys.exit(1 if stale else 0)
```
Run it nightly. The output routes to the owner team based on the owner frontmatter field. Stale docs become a tracked metric instead of a constant background failure.
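The owner-routing step reduces to grouping the stale list by its owner field. A minimal sketch, assuming the `(path, last_reviewed, owner)` tuples the script above collects — the Slack delivery itself is left out:

```python
from collections import defaultdict

def stale_by_owner(stale: list[tuple[str, str, str]]) -> dict[str, list[str]]:
    """Group stale doc paths by owner so each team gets one reminder, not N."""
    grouped = defaultdict(list)
    for path, _last_reviewed, owner in stale:
        grouped[owner].append(path)
    return dict(grouped)
```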
Onboarding new engineers from the docs tree
One concrete payoff that justifies the migration faster than any other: new-hire onboarding. Engineers in their first week clone the docs repo alongside the main monorepo, open it in their editor of choice, and have grep, fuzzy-find, and AI assistant all working over the same content surface they'll use to read code. The friction of "go log into Confluence, search this term, read this page, switch tabs back to the IDE" disappears. Several teams report new-hire ramp time dropping by 20-30% just from this workflow change — not because the docs got better, but because reading them stopped requiring tab-switching.
Versioned docs for shipped releases
For SDKs and APIs that ship multiple supported major versions, the docs need to fork. Docusaurus and VitePress both support versioned doc trees natively — typically by snapshotting docs/ into versioned_docs/version-X.Y/ at release time. URL-to-Markdown helps when you're bootstrapping the version-history archive: convert your existing live docs at v1.docs.yourcompany.com and v2.docs.yourcompany.com into the versioned tree in one pass, then take over forward maintenance from the converted snapshot.
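The snapshot itself is a directory copy into the layout Docusaurus expects. A sketch, assuming the conventional `versioned_docs/version-X.Y/` path — check your generator's docs for the exact layout it reads:

```python
import shutil
from pathlib import Path

def snapshot_version(docs_dir: str, version: str, root: str = ".") -> Path:
    """Copy the live docs tree into versioned_docs/version-<version>/."""
    dest = Path(root) / "versioned_docs" / f"version-{version}"
    shutil.copytree(docs_dir, dest, dirs_exist_ok=True)
    return dest
```

Run it once per converted legacy site (v1, v2, …), then let the generator's own versioning command take over for future releases.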
Why this lasts
The half-life of a documentation platform is about 4 years. Confluence Cloud, Notion, GitBook, ReadMe, Stoplight, Mintlify — every few years a shinier option appears, and migration is always painful enough that teams stay on legacy platforms long after they should leave. Markdown is the constant. If your docs are Markdown in Git, switching from Docusaurus to Astro is a weekend. Switching from Confluence to Docusaurus is a quarter. The difference is the format. Engineering teams that internalize this stop treating their docs platform as load-bearing infrastructure and start treating it as a render layer over a content tree they actually own.