9 min read · MDisBetter

How to Convert an Entire Documentation Site to Markdown

Modern docs sites are amazing for browsing and terrible for everything else. You can't grep them. You can't feed them to an AI. You can't read them offline. You can't easily compare two pages side-by-side. The fix is to download the whole site as Markdown, once, and then do all those things on plain text. This guide walks through doing it for a real docs site (Stripe API, FastAPI, Django, your pick), end-to-end, using open-source tooling so the script is yours to own and modify.

Why convert a whole docs site

Common reasons:

- Full-text search: grep or ripgrep across every page at once.
- LLM context: feed the whole corpus to an AI assistant or a RAG index.
- Offline reading: plain text works anywhere, no connection required.
- Comparison: put two pages, or two versions, side-by-side and diff them.

The two paths

For a small site (under 50 pages) or one-off conversions: open /convert/url-to-markdown, paste each URL, click Convert, save. The MDisBetter web tool handles JS-rendered docs sites that defeat plain Trafilatura, and there's no setup. For 100+ pages or a recurring sync job, the OSS scripted path below is dramatically better — it's reproducible, diffable, and runs on your infrastructure.

The strategy: sitemap.xml

Every well-built docs site exposes a sitemap.xml at its root. This file is gold: it lists every public URL on the site. Stripe, FastAPI, Django, MDN, Python docs, AWS docs — they all have one.

Look for it at one of:

https://docs.example.com/sitemap.xml
https://docs.example.com/sitemap_index.xml
https://example.com/sitemap.xml
/robots.txt   # often points to the sitemap location
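
If none of those paths respond, robots.txt usually names the sitemap outright via a Sitemap: directive. A quick sketch (docs.example.com is a placeholder host):

import requests

robots = requests.get('https://docs.example.com/robots.txt', timeout=30).text
sitemaps = [line.split(':', 1)[1].strip()
            for line in robots.splitlines()
            if line.lower().startswith('sitemap:')]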

For our walkthrough, let's use FastAPI's docs: https://fastapi.tiangolo.com/sitemap.xml. It exposes ~250 URLs across the tutorial, advanced guide, and reference.

Install once: pip install requests trafilatura httpx tqdm beautifulsoup4 lxml (the last two are only needed for the targeted-extraction fix in Step 4).

Step 1: Fetch and parse the sitemap

A few lines of Python:

import requests
import xml.etree.ElementTree as ET

SITEMAP = 'https://fastapi.tiangolo.com/sitemap.xml'
NS = {'sm': 'http://www.sitemaps.org/schemas/sitemap/0.9'}

xml = requests.get(SITEMAP, timeout=30).text
root = ET.fromstring(xml)
urls = [loc.text for loc in root.findall('.//sm:loc', NS)]
print(f'Found {len(urls)} URLs')

For nested sitemap indexes (sites with multiple sub-sitemaps), recurse into each child sitemap:

def collect_urls(sitemap_url):
    xml = requests.get(sitemap_url, timeout=30).text
    root = ET.fromstring(xml)
    urls = []
    for loc in root.findall('.//sm:loc', NS):
        if loc.text.endswith('.xml'):  # child sitemap: descend into it
            urls.extend(collect_urls(loc.text))
        else:
            urls.append(loc.text)
    return urls
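
Call it on the top-level sitemap and you get a flat URL list either way:

urls = collect_urls(SITEMAP)
print(f'Found {len(urls)} URLs')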

Step 2: Filter to the URLs you actually want

Most sitemaps include irrelevant pages — blog posts, marketing landing pages, login flows, language variants. Filter down to what you care about:

urls = [
    u for u in urls
    if '/tutorial/' in u or '/advanced/' in u or '/reference/' in u
]
print(f'After filter: {len(urls)} URLs')

Spend two minutes on this filter. Skipping 80% of irrelevant URLs saves 80% of time and disk space downstream.

Step 3: Convert each URL with Trafilatura + httpx

Trafilatura is the open-source readability extractor we recommend — it does boilerplate stripping, link preservation, and Markdown serialization in one call. Combine it with httpx for concurrent fetching:

import asyncio
import httpx
import trafilatura
from pathlib import Path
from urllib.parse import urlparse
from tqdm.asyncio import tqdm_asyncio

OUT = Path('./fastapi-docs')
OUT.mkdir(exist_ok=True)
SEM = asyncio.Semaphore(10)

def url_to_path(url):
    p = urlparse(url).path.strip('/').replace('/', '_') or 'index'
    return OUT / f'{p}.md'

async def convert(client, url):
    out = url_to_path(url)
    if out.exists():
        return  # skip already done
    async with SEM:
        try:
            r = await client.get(url, timeout=60, follow_redirects=True,
                                 headers={'User-Agent': 'Mozilla/5.0 (compatible; doc-grabber/1.0)'})
        except httpx.RequestError as e:
            print(f'NET FAIL {url}: {e}')
            return
    if r.status_code != 200:
        print(f'HTTP {r.status_code} {url}')
        return
    md = trafilatura.extract(
        r.text,
        output_format='markdown',
        include_links=True,
        include_tables=True,
        favor_precision=True,
    )
    if not md:
        print(f'EXTRACT FAIL {url}')
        return
    out.write_text(f'<!-- source: {url} -->\n\n{md}', encoding='utf-8')

async def main():
    async with httpx.AsyncClient() as client:
        await tqdm_asyncio.gather(*[convert(client, u) for u in urls])

asyncio.run(main())

Notice three details:

  1. Resume support. The if out.exists() check means you can re-run the script after a partial failure and only the missing pages convert.
  2. Source URL header. Each output Markdown file starts with the source URL as a comment. Indispensable for traceability when an LLM cites it.
  3. Filename derived from URL path. Predictable mapping back from filename to source URL; a reverse helper is sketched below.
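
If you ever need that reverse mapping in code, here is a hypothetical path_to_url helper. Caveat: it is ambiguous if a real URL segment contained an underscore, since url_to_path flattens '/' to '_'; the <!-- source: ... --> header written into each file is the authoritative record.

from pathlib import Path

def path_to_url(path, base='https://fastapi.tiangolo.com'):
    # hypothetical inverse of url_to_path; ambiguous if a real URL
    # segment contained an underscore
    stem = Path(path).stem
    if stem == 'index':
        return f'{base}/'
    return f"{base}/{stem.replace('_', '/')}/"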

What if the docs site is JS-rendered?

Trafilatura is fast and good at static HTML. For sites that render content client-side (some Stoplight, ReadMe.io, Mintlify, Docusaurus configurations), Trafilatura sees only the empty shell. Two responses:

  1. For a handful of pages, use the MDisBetter web tool, which renders JavaScript before converting.
  2. For the scripted pipeline, render each page in a headless browser (Playwright) and hand the resulting HTML to Trafilatura.

For a deep dive on the Playwright approach, see handling JavaScript-rendered pages for Markdown.
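
As a starting point, here is a minimal sketch of response 2 using Playwright's sync API (assumes pip install playwright followed by playwright install chromium; wait_until='networkidle' is a pragmatic heuristic, not a guarantee that rendering finished):

from playwright.sync_api import sync_playwright
import trafilatura

def render_and_extract(url):
    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page()
        page.goto(url, wait_until='networkidle')  # let client-side rendering settle
        html = page.content()                     # the rendered DOM, not the empty shell
        browser.close()
    return trafilatura.extract(html, output_format='markdown',
                               include_links=True, include_tables=True)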

Step 4: Verify and clean up

Open 3-5 random output files. Things to spot-check:

- Headings, code blocks, and tables survived extraction.
- Links still point where they should.
- No navigation, sidebar, or footer text leaked into the body.

If you see leakage, Trafilatura's auto-detection got the wrong region. Targeted fix: pre-extract the content container with BeautifulSoup, then run Trafilatura over just that fragment. For FastAPI, article.md-content__inner isolates the content perfectly:

from bs4 import BeautifulSoup
soup = BeautifulSoup(r.text, 'lxml')
main = soup.select_one('article.md-content__inner')
if main:
    md = trafilatura.extract(str(main), output_format='markdown',
                             include_links=True, include_tables=True)

Output structure

For a 250-page docs site:

fastapi-docs/
  index.md
  tutorial_first-steps.md
  tutorial_path-params.md
  tutorial_query-params.md
  ...
  advanced_security_oauth2-scopes.md
  reference_apirouter.md
  reference_dependencies.md
  ...

Total size: typically 2-10 MB of Markdown for a comprehensive docs site, compared with the ~50-200 MB you'd get mirroring the same site as HTML plus assets with wget.

Optional: combine into a single file

For LLM ingestion, one giant Markdown file is often more useful than 250 separate files:

cd fastapi-docs
cat *.md > ../fastapi-complete.md

Or in Python with a clean separator:

from pathlib import Path
files = sorted(Path('fastapi-docs').glob('*.md'))
with open('fastapi-complete.md', 'w', encoding='utf-8') as out:
    for f in files:
        out.write(f.read_text(encoding='utf-8'))
        out.write('\n\n---\n\n')

The result is a single Markdown file you can drop into Claude's context (with the 1M-token model, the entire FastAPI docs fit comfortably with room for many follow-up questions).

Real numbers from FastAPI

What about non-sitemap sites?

If the site has no sitemap, you have two options:

Option A: discover URLs via crawling

Start at the root, follow internal links to depth N, deduplicate. A simple BFS works:

from urllib.parse import urljoin, urlparse
import re
import requests

MAX_DEPTH = 3  # the "depth N" from above
seen, queue = set(), [('https://example.com/docs/', 0)]
pattern = re.compile(r'href=["\']([^"\']+)["\']')

while queue:
    url, depth = queue.pop(0)
    if url in seen or depth > MAX_DEPTH:
        continue
    seen.add(url)
    try:
        html = requests.get(url, timeout=30).text
    except requests.RequestException:
        continue  # skip unreachable pages, keep crawling
    for href in pattern.findall(html):
        full = urljoin(url, href).split('#')[0]  # drop fragments before dedup
        if urlparse(full).netloc == 'example.com' and '/docs/' in full:
            queue.append((full, depth + 1))

urls = list(seen)

Option B: pull from the table of contents

Many docs sites have a TOC page with links to every section. Extract those links and treat them as your URL list. It's simpler than crawling.
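
A sketch with BeautifulSoup, assuming the TOC links live inside a <nav> element (adjust the selector to the site at hand):

import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

TOC = 'https://example.com/docs/'  # the page that links to every section
html = requests.get(TOC, timeout=30).text
soup = BeautifulSoup(html, 'lxml')
urls = sorted({
    urljoin(TOC, a['href']).split('#')[0]  # absolutize and drop fragments
    for a in soup.select('nav a[href]')    # assumption: TOC markup uses <nav>
})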

Working with PDFs instead?

If your docs are distributed as PDFs (technical specs, RFC documents, legal frameworks), use our PDF to Markdown converter with the same batch pattern. See batch convert 100+ PDFs to Markdown for the corresponding pipeline. The output Markdown integrates seamlessly with whatever you built from the URL conversion above.

Handling versioned docs

Most large docs sites version their pages: /v1/..., /v2/..., or /3.10/..., /3.11/..., /3.12/... for Python. The sitemap usually exposes all versions. For most consumers you only want the latest:

latest = '/3/'  # latest stable for Python docs
urls = [u for u in urls if latest in u]

If you genuinely need multiple versions side-by-side (for migration guides, deprecation analysis), keep them and namespace the output folder by version: out/v3.11/, out/v3.12/. The downstream consumer can then load only the version it needs.
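
One way to namespace, a sketch that pulls the version segment out of the URL path (the regex is an assumption; check it against the site's actual URL scheme):

import re
from pathlib import Path
from urllib.parse import urlparse

def versioned_out(url, root=Path('out')):
    # match a version segment like /v2/, /3.12/, or /v1.0/
    m = re.search(r'/(v?\d+(?:\.\d+)*)/', urlparse(url).path)
    folder = root / (m.group(1) if m else 'unversioned')
    folder.mkdir(parents=True, exist_ok=True)
    return folder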

Cross-version diffing

Once you have two versions in clean Markdown, diffing them tells you exactly what changed between releases. Standard diff works:

diff -ru out/v3.11/ out/v3.12/ > changes-v311-v312.diff

For nicer output, point an LLM at the diff and ask for a human-readable changelog:

Here is a diff between v3.11 and v3.12 of the Python library docs.
Produce a developer-facing changelog grouped by module. Highlight breaking
changes, deprecations, and new APIs. Skip prose-only edits.

Three minutes of LLM time produces a changelog that would take a human a full day to compile manually.

Output quality checklist

After conversion finishes, run a 30-second sanity check; the spot checks from Step 4 apply here too.
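
To script the pass, a minimal sketch (the 200-byte threshold is an arbitrary guess at "suspiciously short"; tune it):

from pathlib import Path

files = list(Path('fastapi-docs').glob('*.md'))
print(f'{len(files)} files')  # should match your filtered URL count
for f in files:
    text = f.read_text(encoding='utf-8')
    if len(text) < 200:  # likely an extraction failure
        print(f'SHORT      {f.name} ({len(text)} bytes)')
    if '<div' in text or '<nav' in text:  # raw HTML leaked through
        print(f'HTML LEAK  {f.name}')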

Storage and serving the converted corpus

Once you have a clean folder of Markdown, decide how downstream consumers will access it:

- Commit it to a git repository, so diffs between runs come for free.
- Push it to object storage or an internal docs server for other teams and CI jobs.
- Index it into a vector store if the consumer is a RAG pipeline (see the link below).

Recommendation

For static docs sites with sitemaps: this 30-line script handles it. For sites without sitemaps: spend an extra hour on a tiny crawler. For continuously-updated docs: schedule the script to run nightly and re-convert only changed URLs (the lastmod field in sitemaps tells you when each URL was last updated). For the freshest possible LLM context, set up a CI job that converts the latest docs every morning and exposes the result via your internal storage. See also scrape a website to Markdown for RAG for the indexing side.

Frequently asked questions

What's the best way to handle docs sites that change frequently?
Use the sitemap's <lastmod> field. Compare against your previous run's stored timestamps and only re-convert URLs where the date is newer. This pattern keeps your local Markdown corpus fresh with minimal extraction work. Schedule the job nightly via cron, GitHub Actions, or any task scheduler.
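
A sketch of that incremental loop, assuming SITEMAP from Step 1 and a lastmod.json state file (a made-up name) next to the script:

import json
import requests
import xml.etree.ElementTree as ET
from pathlib import Path

NS = {'sm': 'http://www.sitemaps.org/schemas/sitemap/0.9'}
STATE = Path('lastmod.json')  # state from the previous run
prev = json.loads(STATE.read_text()) if STATE.exists() else {}

root = ET.fromstring(requests.get(SITEMAP, timeout=30).text)
stale = []
for entry in root.findall('.//sm:url', NS):
    loc = entry.findtext('sm:loc', namespaces=NS)
    lastmod = entry.findtext('sm:lastmod', namespaces=NS) or ''
    if prev.get(loc) != lastmod:  # new URL, or lastmod moved since last run
        stale.append(loc)
        prev[loc] = lastmod

STATE.write_text(json.dumps(prev, indent=2))
print(f'{len(stale)} URLs need re-conversion')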
Can I convert a docs site that requires a login?
The MDisBetter web tool fetches anonymously, so authenticated docs sites won't work directly. The right path is a self-hosted script: pass session cookies or bearer tokens via httpx headers, then run Trafilatura over the response. For internal team wikis (Confluence, Notion), the better approach is usually their official export feature plus our HTML-to-Markdown converter on the exported HTML.
How do I preserve the original navigation hierarchy in the output?
Use the URL path as the filename (the script above does this). Slashes become underscores, so /tutorial/path-params/ becomes tutorial_path-params.md. For a more structured layout, replace slashes with directory separators and create matching subdirectories — that preserves browseability.
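
A variant of url_to_path from Step 3 that does exactly that (same caveats as the flat version):

from pathlib import Path
from urllib.parse import urlparse

def url_to_nested_path(url, out=Path('fastapi-docs')):
    # mirror the URL hierarchy: /tutorial/path-params/ -> tutorial/path-params.md
    rel = urlparse(url).path.strip('/') or 'index'
    path = out / f'{rel}.md'
    path.parent.mkdir(parents=True, exist_ok=True)  # create matching subdirectories
    return path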
Does MDisBetter offer a built-in crawler I could use instead?
No — the MDisBetter web tool converts one URL at a time and is best for ad-hoc work. For full-site crawls of the kind described here, the OSS path (Trafilatura + a 30-line sitemap parser) is the right tool: free, programmable, easy to schedule, and you own the script forever.