URL to Markdown for Technical Writers — Web to Docs Migration
The migration from a legacy docs CMS (Confluence, MediaWiki, ancient SharePoint, custom WordPress, Webflow) to docs-as-code (MkDocs, Docusaurus, Hugo, GitBook) is among the most demoralising projects in tech writing: hundreds of pages, none in a usable format. A faster route is to crawl the source URLs, convert to Markdown in batch, and drop the output into your new repo, which can cut the migration from months to days.
Why this is hard without the right tool
- Decade-old docs scattered across Confluence, MediaWiki, SharePoint, and a custom CMS
- Need to migrate to docs-as-code (MkDocs, Docusaurus, Hugo) but the export from the old system is HTML soup
- Content scattered across team wikis with no unified structure
- Manual page-by-page migration would take a tech writer six months full-time
- Old CMS exports lose semantic structure: headings and paragraphs come out as generic blocks styled with hardcoded inline font sizes
Recommended workflow
- Pull the source site's URL list (from sitemap.xml, an internal index page, or a hand-curated CSV)
- For a one-off small migration: paste each URL into /convert/url-to-markdown and download. For hundreds of pages: roll a small Python crawler (Trafilatura + sitemap parser, ~40 lines) that emits one .md per page, mirroring URL structure
- Run markdownlint over the output to normalise style across hundreds of pages
- Drop the folder into your docs-as-code repo (MkDocs / Docusaurus / Hugo all accept Markdown directly)
- Build, review, fix navigation and internal links, ship the new docs site
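The URL-list and "one .md per page, mirroring URL structure" steps above can be sketched with the Python standard library alone. This is a minimal sketch, not a complete crawler: `sitemap_urls` and `url_to_md_path` are hypothetical helper names, the sitemap is assumed to be a flat `urlset` (no nested sitemap index), and the actual HTML-to-Markdown conversion is left to whichever tool you choose.

```python
import xml.etree.ElementTree as ET
from pathlib import PurePosixPath
from urllib.parse import urlparse

# XML namespace used by standard sitemap.xml files.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def sitemap_urls(sitemap_xml: str) -> list:
    """Return every <loc> URL from a flat sitemap (assumes no sitemap index)."""
    root = ET.fromstring(sitemap_xml)
    locs = root.findall("sm:url/sm:loc", SITEMAP_NS)
    return [loc.text.strip() for loc in locs if loc.text]


def url_to_md_path(url: str) -> str:
    """Map a page URL to a relative .md path that mirrors the URL structure."""
    path = urlparse(url).path
    if path == "" or path.endswith("/"):
        path += "index"                    # /guide/ -> guide/index.md
    rel = PurePosixPath(path.lstrip("/"))
    if rel.suffix in {".html", ".htm"}:
        rel = rel.with_suffix("")          # drop legacy page extensions
    return f"{rel}.md"


if __name__ == "__main__":
    # Illustrative sitemap; docs-old.example.com is a placeholder domain.
    sample = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://docs-old.example.com/</loc></url>
  <url><loc>https://docs-old.example.com/guide/install.html</loc></url>
</urlset>"""
    for page in sitemap_urls(sample):
        print(page, "->", url_to_md_path(page))
```

Keeping the path mapping in one function means the same rule can be reused later when rewriting internal links, so file locations and link targets stay consistent.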
Code examples
Self-rolled crawl + lint migration script (no MDisBetter API needed)
# Step 1: crawl + convert with Trafilatura's CLI (OSS; Apache-2.0 licensed since v1.8, GPLv3 before that)
# Install once: pip install trafilatura
mkdir -p ./docs/migrated
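trafilatura --sitemap https://docs-old.example.com/sitemap.xml \
--output-dir ./docs/migrated \
--output-format markdown \
--links
# Tables are extracted by default (pass --no-tables to drop them).
# Markdown output needs a recent Trafilatura release; check `trafilatura --help`.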
# Step 2: normalise style across all migrated pages
markdownlint --fix ./docs/migrated/
# Step 3: build with your static site generator
mkdocs serve # or: docusaurus start, hugo server
# For ad-hoc single pages, mdisbetter.com/convert/url-to-markdown handles
# the JS-rendered cases (Docusaurus, GitBook) without setting up Playwright locally.
Frequently asked questions
How do I migrate Confluence to MkDocs / Docusaurus?
Confluence has a public REST API for page export, but the HTML it returns is full of Confluence-specific markup. Easier path: crawl the public-facing rendered pages with Trafilatura (or with requests + your session cookie for auth'd pages). The Markdown output drops straight into MkDocs / Docusaurus / Hugo. A 500-page Confluence space typically processes in an hour of script time plus a day of nav cleanup. For pages that won't render under Trafilatura, paste them one at a time into mdisbetter.com/convert/url-to-markdown.
Will internal links survive the migration?
Internal HTTP links convert as Markdown links with the original URL preserved. After migration, run a find-and-replace across the new repo to rewrite old-domain URLs to relative paths in the new docs site. Redirect plugins such as mkdocs-redirects and @docusaurus/plugin-client-redirects keep old paths working inside the new site, but they generate client-side redirect pages rather than HTTP 301 responses. For true 301s from the old domain, configure the redirects at the web server or CDN during the transition period.
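The find-and-replace step can be scripted rather than done by hand. The following is a minimal sketch under stated assumptions: `OLD_ORIGIN` is a placeholder for your legacy domain, `rewrite_links` is a hypothetical helper, migrated files mirror the old URL paths as `.md` files, and links carry no query strings or link titles (anything else is left untouched).

```python
import posixpath
import re

OLD_ORIGIN = "https://docs-old.example.com"  # assumption: the legacy docs domain

# Matches the target part of a Markdown link that points at the old domain:
# an optional path, then an optional #fragment, then the closing parenthesis.
_OLD_LINK = re.compile(
    r"\]\(" + re.escape(OLD_ORIGIN) + r"(/[^)\s#]*)?(#[^)\s]*)?\)"
)


def rewrite_links(markdown: str, current_md_path: str) -> str:
    """Rewrite old-domain links to .md paths relative to current_md_path."""
    current_dir = posixpath.dirname(current_md_path) or "."

    def _replace(match):
        path = match.group(1) or "/"
        fragment = match.group(2) or ""
        if path.endswith("/"):
            path += "index"                            # /guide/ -> guide/index.md
        target = re.sub(r"\.html?$", "", path.lstrip("/")) + ".md"
        relative = posixpath.relpath(target, start=current_dir)
        return f"]({relative}{fragment})"

    return _OLD_LINK.sub(_replace, markdown)


if __name__ == "__main__":
    text = "See [install](https://docs-old.example.com/guide/install.html#linux)."
    print(rewrite_links(text, "guide/intro.md"))
```

Run it over each migrated file with that file's own relative path as `current_md_path`, then let the static site generator's build warnings catch any targets that do not exist.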
How do I handle MediaWiki / Wikipedia-syntax docs?
MediaWiki publishes rendered HTML at the public URL — convert that, not the wiki source. The Markdown output is cleaner than what you'd get from a MediaWiki-to-Markdown syntax converter, since it goes through the rendered semantic structure rather than fighting wiki markup.
What about SharePoint pages with complex layouts?
SharePoint's rendered HTML is verbose but tractable. For one-off pages, the MDisBetter web tool produces clean Markdown directly. MDisBetter's web tool fetches anonymously, so for private/auth'd SharePoint you'll need a self-hosted script using requests with your SharePoint session cookie or bearer token, then html2text on the response. Modern SharePoint pages with multi-column layouts flatten into single-column Markdown either way, usually an improvement.
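A self-hosted script for authenticated pages might look like the sketch below. It assumes `requests` and `html2text` are installed (`pip install requests html2text`); `auth_headers` and `fetch_markdown` are hypothetical names, and how you obtain the cookie or token depends on your own tenant's sign-in setup.

```python
from typing import Optional


def auth_headers(session_cookie: Optional[str] = None,
                 bearer_token: Optional[str] = None) -> dict:
    """Build request headers from whichever credential you have."""
    if bearer_token:
        return {"Authorization": f"Bearer {bearer_token}"}
    if session_cookie:
        return {"Cookie": session_cookie}
    raise ValueError("provide a session cookie or a bearer token")


def fetch_markdown(url: str, headers: dict) -> str:
    """Fetch one authenticated page and convert its HTML to Markdown."""
    # Third-party imports are kept local so auth_headers works without them.
    import html2text
    import requests

    response = requests.get(url, headers=headers, timeout=30)
    response.raise_for_status()        # raise on 4xx/5xx such as 401 or 403
    converter = html2text.HTML2Text()
    converter.body_width = 0           # do not hard-wrap output lines
    return converter.handle(response.text)
```

A call would then be `fetch_markdown(page_url, auth_headers(bearer_token=token))`, where `page_url` and `token` are values you supply. An expired session can still return a login page with status 200, so spot-check the first few converted pages.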
Can I keep the old docs site running during migration?
Yes — that's the recommended approach. Crawl-and-convert is read-only on the source. Run both sites in parallel during the migration period, redirect old URLs to new ones once content parity is achieved, decommission the old CMS only when traffic confirms the new site works. The whole transition typically spans 2-8 weeks depending on docs volume.
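Content parity can be checked mechanically before any redirects are switched on. The sketch below is one possible approach, not a required step: `parity_report` is a hypothetical helper that compares the `.md` paths you expect (for example, one per URL in the old sitemap) against the files actually present in the migrated folder.

```python
from pathlib import Path


def parity_report(expected_paths, migrated_root) -> dict:
    """Compare expected relative .md paths with the files found on disk."""
    root = Path(migrated_root)
    actual = {p.relative_to(root).as_posix() for p in root.rglob("*.md")}
    expected = set(expected_paths)
    return {
        "missing": sorted(expected - actual),   # pages not migrated yet
        "extra": sorted(actual - expected),     # files with no source page
    }


if __name__ == "__main__":
    # ./docs/migrated matches the output folder used in the script above.
    report = parity_report(["index.md", "guide/install.md"], "./docs/migrated")
    print(report)
```

An empty `missing` list is a reasonable signal that the page set is complete; it says nothing about content quality, which still needs human review.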