How to Convert an Entire Documentation Site to Markdown
Modern docs sites are amazing for browsing and terrible for everything else. You can't grep them. You can't feed them to an AI. You can't read them offline. You can't easily compare two pages side-by-side. The fix is to download the whole site as Markdown, once, and then do all those things on plain text. This guide walks through doing it for a real docs site (Stripe API, FastAPI, Django, your pick), end-to-end, using open-source tooling so the script is yours to own and modify.
Why convert a whole docs site
Common reasons:
- Feed to an LLM. Drop the whole site into Claude's 1M-token context for high-quality, full-corpus question answering. No more "the answer is in some doc page somewhere" and no more hallucinations from training-cutoff staleness.
- Build a RAG knowledge base. Index the Markdown chunks once, retrieve relevant ones per query. Especially useful for internal team docs, support knowledge bases, and product documentation.
- Read offline. Long flights, weak hotel Wi-Fi, deep airplane mode. Markdown reads cleanly in any text editor, on any device.
- Migrate to docs-as-code. Moving a wiki to MkDocs, Docusaurus, or Astro Starlight starts with extracting current content as Markdown.
- Audit and search. Run grep, ripgrep, or any text tool across the entire site. Find every mention of a deprecated feature in 200ms.
The two paths
For a small site (under 50 pages) or one-off conversions: open /convert/url-to-markdown, paste each URL, click Convert, save. The MDisBetter web tool handles JS-rendered docs sites that defeat plain Trafilatura, and there's no setup. For 100+ pages or a recurring sync job, the OSS scripted path below is dramatically better — it's reproducible, diffable, and runs on your infrastructure.
The strategy: sitemap.xml
Every well-built docs site exposes a sitemap.xml at its root. This file is gold: it lists every public URL on the site. Stripe, FastAPI, Django, MDN, Python docs, AWS docs — they all have one.
Look for it at one of:
```
https://docs.example.com/sitemap.xml
https://docs.example.com/sitemap_index.xml
https://example.com/sitemap.xml
/robots.txt        # often points to the sitemap location
```
For our walkthrough, let's use FastAPI's docs: https://fastapi.tiangolo.com/sitemap.xml. It exposes ~250 URLs across the tutorial, advanced guide, and reference.
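If none of those guesses hit, robots.txt usually names the sitemap outright via the standard `Sitemap:` directive. A minimal probe (uses the requests library from the install line below):

```python
import requests

def find_sitemaps(base):
    """Return any sitemap URLs advertised in a site's robots.txt."""
    robots = requests.get(f'{base}/robots.txt', timeout=30).text
    return [line.split(':', 1)[1].strip()
            for line in robots.splitlines()
            if line.lower().startswith('sitemap:')]

print(find_sitemaps('https://fastapi.tiangolo.com'))
```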
Install once: `pip install requests trafilatura httpx tqdm`.
Step 1: Fetch and parse the sitemap
A few lines of Python:
```python
import requests
import xml.etree.ElementTree as ET

SITEMAP = 'https://fastapi.tiangolo.com/sitemap.xml'
NS = {'sm': 'http://www.sitemaps.org/schemas/sitemap/0.9'}

xml = requests.get(SITEMAP).text
root = ET.fromstring(xml)
urls = [loc.text for loc in root.findall('.//sm:loc', NS)]
print(f'Found {len(urls)} URLs')
```
For nested sitemap indexes (sites that split their URLs across multiple sub-sitemaps), recurse into each one:
```python
def collect_urls(sitemap_url):
    xml = requests.get(sitemap_url).text
    root = ET.fromstring(xml)
    urls = []
    for loc in root.findall('.//sm:loc', NS):
        if loc.text.endswith('.xml'):
            urls.extend(collect_urls(loc.text))
        else:
            urls.append(loc.text)
    return urls
```
Step 2: Filter to the URLs you actually want
Most sitemaps include irrelevant pages — blog posts, marketing landing pages, login flows, language variants. Filter down to what you care about:
```python
urls = [
    u for u in urls
    if '/tutorial/' in u or '/advanced/' in u or '/reference/' in u
]
print(f'After filter: {len(urls)} URLs')
```
Spend two minutes on this filter. Skipping 80% of irrelevant URLs saves 80% of the time and disk space downstream.
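One more pass worth considering: translated variants. FastAPI's sitemap, for example, also lists pages under language-prefixed paths, and those survive the path filter above. A sketch that keeps only the English originals — the two-letter-prefix heuristic is an assumption, so eyeball your own sitemap first:

```python
import re

# Drop URLs whose first path segment looks like a language code (/de/, /pt-br/, ...)
lang_prefix = re.compile(r'^https://fastapi\.tiangolo\.com/[a-z]{2}(-[a-z]{2})?/')
urls = [u for u in urls if not lang_prefix.match(u)]
print(f'After language filter: {len(urls)} URLs')
```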
Step 3: Convert each URL with Trafilatura + httpx
Trafilatura is the open-source readability extractor we recommend — it does boilerplate stripping, link preservation, and Markdown serialization in one call. Combine it with httpx for concurrent fetching:
```python
import asyncio
import httpx
import trafilatura
from pathlib import Path
from urllib.parse import urlparse
from tqdm.asyncio import tqdm_asyncio

OUT = Path('./fastapi-docs')
OUT.mkdir(exist_ok=True)
SEM = asyncio.Semaphore(10)  # at most 10 requests in flight

def url_to_path(url):
    p = urlparse(url).path.strip('/').replace('/', '_') or 'index'
    return OUT / f'{p}.md'

async def convert(client, url):
    out = url_to_path(url)
    if out.exists():
        return  # already converted; makes re-runs resumable
    async with SEM:
        try:
            r = await client.get(url, timeout=60, follow_redirects=True,
                                 headers={'User-Agent': 'Mozilla/5.0 (compatible; doc-grabber/1.0)'})
        except httpx.RequestError as e:
            print(f'NET FAIL {url}: {e}')
            return
    if r.status_code != 200:
        print(f'HTTP {r.status_code} {url}')
        return
    md = trafilatura.extract(
        r.text,
        output_format='markdown',
        include_links=True,
        include_tables=True,
        favor_precision=True,
    )
    if not md:
        print(f'EXTRACT FAIL {url}')
        return
    out.write_text(f'<!-- {url} -->\n\n{md}', encoding='utf-8')

async def main():
    async with httpx.AsyncClient() as client:
        await tqdm_asyncio.gather(*[convert(client, u) for u in urls])

asyncio.run(main())
```
Notice three details:
- Resume support. The `if out.exists()` check means you can re-run the script after a partial failure and only the missing pages convert.
- Source URL header. Each output Markdown file starts with the source URL as a comment. Indispensable for traceability when an LLM cites it.
- Filename derived from URL path. Predictable mapping back from filename to source URL.
What if the docs site is JS-rendered?
Trafilatura is fast and good at static HTML. For sites that render content client-side (some Stoplight, ReadMe.io, Mintlify, Docusaurus configurations), Trafilatura sees the empty shell. Two responses:
- Add Playwright as a fallback in the same script — render the URL in a headless Chromium, then run `trafilatura.extract` on the post-JS HTML (sketched below). Adds ~3-5 sec/URL but works on every site.
- For a handful of pages, paste them into the MDisBetter web tool at /convert/url-to-markdown — the converter renders JS automatically. Save the resulting .md into the same output folder.
For a deep dive on the Playwright approach, see handling JavaScript-rendered pages for Markdown.
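A minimal sketch of that fallback, assuming Playwright is installed (`pip install playwright`, then `playwright install chromium`); the networkidle wait is a reasonable default, not a universal one:

```python
import trafilatura
from playwright.sync_api import sync_playwright

def render_and_extract(url):
    # Render a JS-heavy page in headless Chromium, then extract Markdown.
    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page()
        page.goto(url, wait_until='networkidle')  # let client-side rendering finish
        html = page.content()
        browser.close()
    return trafilatura.extract(html, output_format='markdown',
                               include_links=True, include_tables=True)
```

Call it only for URLs where the plain-HTTP extract returned nothing — the ~3-5 seconds per URL is exactly why it's a fallback rather than the default.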
Step 4: Verify and clean up
Open 3-5 random output files. Things to spot-check:
- Code blocks have correct language hints (`` ```python `` not just `` ``` ``)
- Headings preserve their hierarchy (no "all-H1" or "all-H2" output)
- Internal links resolve (relative to the docs root or absolute back to the source)
- No menu/sidebar leaked into the body content
If you see leakage, Trafilatura's auto-detection got the wrong region. Targeted fix: pre-extract the content container with BeautifulSoup (`pip install beautifulsoup4 lxml`), then run Trafilatura over just that fragment. For FastAPI, `article.md-content__inner` isolates the content perfectly:
```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(r.text, 'lxml')
main = soup.select_one('article.md-content__inner')
if main:
    md = trafilatura.extract(str(main), output_format='markdown',
                             include_links=True, include_tables=True)
```
Output structure
For a 250-page docs site:
```
fastapi-docs/
  index.md
  tutorial_first-steps.md
  tutorial_path-params.md
  tutorial_query-params.md
  ...
  advanced_security_oauth2-scopes.md
  reference_apirouter.md
  reference_dependencies.md
  ...
```
Total size: typically 2-10 MB of Markdown for a comprehensive docs site. Compares well to the ~50-200 MB you'd get if you tried to mirror the same site as HTML+assets with wget.
Optional: combine into a single file
For LLM ingestion, one giant Markdown file is often more useful than 250 separate files:
```bash
cd fastapi-docs
cat $(ls -1 | sort) > ../fastapi-complete.md
```
Or in Python with a clean separator:
```python
from pathlib import Path

files = sorted(Path('fastapi-docs').glob('*.md'))
with open('fastapi-complete.md', 'w', encoding='utf-8') as out:
    for f in files:
        out.write(f.read_text(encoding='utf-8'))
        out.write('\n\n---\n\n')
```
The result is a single Markdown file you can drop into Claude's context (with the 1M-token model, the entire FastAPI docs fit comfortably with room for many follow-up questions).
Real numbers from FastAPI
- URLs in sitemap: 252
- URLs after filtering to docs only: 198
- Conversion time at concurrency=10: ~4 minutes
- Total Markdown size: 4.1 MB
- Token count (Claude tokenizer): ~1.0M tokens — fits in one Claude 1M-context call
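If you want to sanity-check the token estimate yourself without an exact tokenizer, a chars/4 rule of thumb is close enough for capacity planning (4 characters per token is a heuristic for English prose and code, not Claude's real tokenizer):

```python
from pathlib import Path

text = Path('fastapi-complete.md').read_text(encoding='utf-8')
print(f'{len(text):,} chars, ~{len(text) // 4:,} tokens (rough estimate)')
```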
What about non-sitemap sites?
If the site has no sitemap, you have two options:
Option A: discover URLs via crawling
Start at the root, follow internal links to depth N, deduplicate. A simple BFS works:
```python
import re
import requests
from urllib.parse import urljoin, urlparse

seen, queue = set(), ['https://example.com/docs/']
pattern = re.compile(r'href=["\']([^"\']+)["\']')

while queue:
    url = queue.pop(0)
    if url in seen:
        continue
    seen.add(url)
    html = requests.get(url).text
    for href in pattern.findall(html):
        full = urljoin(url, href)
        # Stay on-site and inside the docs section; drop #fragments.
        # No depth limit here -- add one if the site is huge.
        if urlparse(full).netloc == 'example.com' and '/docs/' in full:
            queue.append(full.split('#')[0])

urls = list(seen)
```
Option B: pull from the table of contents
Many docs sites have a TOC page with links to every section. Extract those links and treat them as your URL list — simpler than crawling.
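A minimal sketch with BeautifulSoup — the TOC URL and the `nav` selector here are hypothetical, so inspect the real page and adjust:

```python
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

TOC = 'https://example.com/docs/'   # hypothetical TOC page
html = requests.get(TOC).text
soup = BeautifulSoup(html, 'lxml')
nav = soup.select_one('nav')        # adjust to the site's actual TOC container
urls = sorted({urljoin(TOC, a['href']).split('#')[0]
               for a in nav.select('a[href]')})
print(f'{len(urls)} URLs from the TOC')
```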
Working with PDFs instead?
If your docs are distributed as PDFs (technical specs, RFC documents, legal frameworks), use our PDF to Markdown converter with the same batch pattern. See batch convert 100+ PDFs to Markdown for the corresponding pipeline. The output Markdown integrates seamlessly with whatever you built from the URL conversion above.
Handling versioned docs
Most large docs sites version their pages: /v1/..., /v2/..., or /3.10/..., /3.11/..., /3.12/... for Python. The sitemap usually exposes all versions. For most consumers you only want the latest:
```python
latest = '/3/'  # latest stable for Python docs
urls = [u for u in urls if latest in u]
```
If you genuinely need multiple versions side-by-side (for migration guides, deprecation analysis), keep them and namespace the output folder by version: out/v3.11/, out/v3.12/. The downstream consumer can then load only the version it needs.
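A sketch of that mapping, assuming the version is the first path segment as in Python's /3.11/... URLs (that split is an assumption; adapt it to the site's URL scheme):

```python
from pathlib import Path
from urllib.parse import urlparse

def versioned_path(url, out_root=Path('out')):
    # '/3.11/library/asyncio.html' -> out/v3.11/library_asyncio.html.md
    parts = urlparse(url).path.strip('/').split('/')
    version, rest = parts[0], parts[1:]
    folder = out_root / f'v{version}'
    folder.mkdir(parents=True, exist_ok=True)
    return folder / f"{'_'.join(rest) or 'index'}.md"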
Cross-version diffing
Once you have two versions in clean Markdown, diffing them tells you exactly what changed between releases. Standard diff works:
```bash
diff -ru out/v3.11/ out/v3.12/ > changes-v311-v312.diff
```
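If you'd rather produce the diff from Python — say, to feed it straight into an API call — difflib emits the same unified format:

```python
import difflib
from pathlib import Path

old_root, new_root = Path('out/v3.11'), Path('out/v3.12')
chunks = []
for new_file in sorted(new_root.glob('*.md')):
    old_file = old_root / new_file.name
    old = old_file.read_text(encoding='utf-8').splitlines() if old_file.exists() else []
    new = new_file.read_text(encoding='utf-8').splitlines()
    chunks.extend(difflib.unified_diff(old, new, fromfile=str(old_file),
                                       tofile=str(new_file), lineterm=''))
Path('changes-v311-v312.diff').write_text('\n'.join(chunks), encoding='utf-8')
```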
For nicer output, point an LLM at the diff and ask for a human-readable changelog:
```
Here is a diff between v3.11 and v3.12 of the Python library docs.
Produce a developer-facing changelog grouped by module. Highlight breaking
changes, deprecations, and new APIs. Skip prose-only edits.
```
Three minutes of LLM time produces a changelog that would take a human a full day to compile manually.
Output quality checklist
After conversion finishes, run this 30-second sanity check:
- `find out/ -size -1k` — files smaller than 1 KB. Likely failures: error pages, login walls, near-empty stub pages. Investigate or delete.
- `grep -l "You need to enable JavaScript" out/*.md` — pages that needed JS rendering but didn't get it. Re-run those URLs through Playwright (or the MDisBetter web tool, one at a time).
- Spot-check 5 random files: open them in a Markdown viewer (VS Code preview, Obsidian, a GitHub gist) and verify the layout matches the source page.
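The first two checks as a single Python pass, if you'd rather stay in one language:

```python
from pathlib import Path

for f in sorted(Path('out').glob('*.md')):
    if f.stat().st_size < 1024:
        print(f'SMALL     {f} ({f.stat().st_size} bytes)')
    if 'You need to enable JavaScript' in f.read_text(encoding='utf-8'):
        print(f'NEEDS-JS  {f}')
```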
Storage and serving the converted corpus
Once you have a clean folder of Markdown, decide how downstream consumers will access it:
- Git repo — Commit the corpus to GitHub, GitLab, or self-hosted Git. Diffs over time become legible. Pull requests for content updates work naturally. Best for small-to-medium corpora (under ~10K files).
- Object storage (S3, R2, GCS) — Sync the folder to a bucket. Cheap, scales to millions of files, easy to expose via CDN. Best for large corpora consumed by many services.
- Vector DB (Pinecone, Qdrant, Chroma) — Skip storing raw Markdown; chunk and embed directly into a vector DB. Best when the only consumer is RAG.
- Static site generator (MkDocs, Hugo, Astro) — Render the Markdown into a browseable site. Best when humans will read it directly.
Recommendation
For static docs sites with sitemaps: this 30-line script handles it. For sites without sitemaps: spend an extra hour on a tiny crawler. For continuously-updated docs: schedule the script to run nightly and re-convert only changed URLs (the lastmod field in sitemaps tells you when each URL was last updated). For the freshest possible LLM context, set up a CI job that converts the latest docs every morning and exposes the result via your internal storage. See also scrape a website to Markdown for RAG for the indexing side.
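A sketch of that lastmod comparison, assuming the sitemap carries a `<lastmod>` entry per URL (many do; not all) and reusing `SITEMAP`, `NS`, and `url_to_path` from the earlier steps. The date parsing handles only the plain YYYY-MM-DD form:

```python
import requests
import xml.etree.ElementTree as ET
from datetime import datetime, timezone

root = ET.fromstring(requests.get(SITEMAP).text)
stale = []
for entry in root.findall('.//sm:url', NS):
    loc = entry.find('sm:loc', NS).text
    lastmod = entry.find('sm:lastmod', NS)
    out = url_to_path(loc)
    if lastmod is None or not out.exists():
        stale.append(loc)  # never converted, or no freshness info: re-fetch
        continue
    changed = datetime.fromisoformat(lastmod.text[:10]).replace(tzinfo=timezone.utc)
    local = datetime.fromtimestamp(out.stat().st_mtime, tz=timezone.utc)
    if changed > local:
        stale.append(loc)  # page updated since our last conversion
print(f'{len(stale)} URLs to re-convert')
```

Delete the stale files and re-run the Step 3 script; the `out.exists()` resume check skips everything else.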