Why docs sites don't convert cleanly with generic scrapers
A naïve HTML-to-Markdown pass on a ReadTheDocs page produces a 4,000-line file: half of it is the navigation tree expanded inline, a quarter is footer links, and somewhere in there is the actual content. Worse, the same nav tree is duplicated on every page you scrape, so an archive of 200 pages is mostly identical chrome. Our converter detects the main content region using semantic markers (<main>, <article>, role="main") and template-aware heuristics for the major frameworks, then emits only the article body.
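The semantic-marker pass can be sketched with the standard library alone. This is a minimal illustration, not the converter's actual implementation: the marker priority list and the fallback behaviour are assumptions, and a real pipeline would keep the matched subtree rather than just its text.

```python
from html.parser import HTMLParser

# Candidate content markers, tried in priority order.
# (Illustrative list; the converter's real marker table is an assumption.)
MAIN_MARKERS = [
    ("main", None),             # <main>
    ("article", None),          # <article>
    ("div", ("role", "main")),  # <div role="main">
]


class MainExtractor(HTMLParser):
    """Collect text inside the first element matching one marker."""

    def __init__(self, tag, attr):
        super().__init__()
        self.tag, self.attr = tag, attr
        self.depth = 0  # nesting depth of self.tag inside the match
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if self.depth:
            # Already inside the match; track nested same-name tags.
            if tag == self.tag:
                self.depth += 1
            return
        if tag == self.tag and (self.attr is None or self.attr in attrs):
            self.depth = 1

    def handle_endtag(self, tag):
        if self.depth and tag == self.tag:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth:
            self.chunks.append(data)


def extract_main(html: str) -> str:
    """Return text of the main content region, or '' if none is found."""
    for tag, attr in MAIN_MARKERS:
        parser = MainExtractor(tag, attr)
        parser.feed(html)
        text = "".join(parser.chunks).strip()
        if text:
            return text
    return ""  # caller would fall back to whole-page conversion
```

Because the markers are tried in order, a page exposing both `<main>` and `role="main"` resolves to the same region either way; nav trees and footers outside the matched element never reach the Markdown stage.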
Framework-aware extraction
We recognise the common docs frameworks — Sphinx/ReadTheDocs, Docusaurus, MkDocs (Material), GitBook, Mintlify, Nextra, VitePress, Bookdown — and apply per-framework selectors so extraction is reliable. Code blocks keep their language hint (```python), admonitions ("Note", "Warning") become Markdown blockquotes with the label preserved, and internal cross-links are rewritten to relative .md paths so the archive is browsable offline.
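The per-framework logic boils down to a selector table plus a couple of small text transforms. The sketch below is hypothetical: the selector strings and function names are assumptions, not the converter's actual API, but they show the shape of the admonition and cross-link rewrites described above.

```python
# Per-framework content selectors (illustrative values; the converter's
# real selector table is an assumption).
FRAMEWORK_SELECTORS = {
    "sphinx": "div[role='main']",
    "docusaurus": "article",
    "mkdocs-material": "article.md-content__inner",
    "gitbook": "main",
}


def admonition_to_md(label: str, body: str) -> str:
    """Render an admonition as a blockquote, preserving its label."""
    lines = [f"> **{label}**"]
    lines += [f"> {line}" for line in body.splitlines()]
    return "\n".join(lines)


def rewrite_link(href: str) -> str:
    """Rewrite an internal .html link to a relative .md path.

    External URLs and bare fragments pass through untouched so the
    archive stays browsable offline.
    """
    if href.startswith(("http://", "https://", "#")):
        return href
    path, _, frag = href.partition("#")
    if path.endswith((".html", ".htm")):
        path = path.rsplit(".", 1)[0] + ".md"
    return path + (f"#{frag}" if frag else "")
```

So a Sphinx `Warning` box becomes a `> **Warning**` blockquote, and `guide/install.html#setup` becomes `guide/install.md#setup`, while absolute URLs are left alone.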