URL to Markdown for Website Content Migration
Every CMS migration starts as a one-week project and ends as a three-month slog. The mismatch is always in the same place: the content. The new platform looks great, the design system is built, the deployment pipeline is wired up, and then someone realizes there are 600 WordPress posts to move and the official export is a wall of XML with inline CSS, broken shortcodes, and image references that point to a server about to be decommissioned.
URL-to-Markdown reframes the problem. Instead of wrestling with platform-specific export formats, you treat your existing site as the source of truth and convert page by page from its public URLs. The output is clean Markdown with frontmatter, ready to drop into a Hugo, Astro, Eleventy, Ghost, or Docusaurus content directory.
Why "just use the export" doesn't work
Every CMS has an export. None of them produce content you can actually use directly. A representative tour:
- WordPress WXR (the .xml export): includes everything — drafts, revisions, attachments, custom fields, a serialized soup of shortcodes and Gutenberg blocks. Importing it into a static-site generator means writing a parser that handles your specific plugin stack. The parser breaks on the first post that uses a Gutenberg block your script doesn't know about.
- Squarespace export: a WordPress-flavored XML that drops most of your formatting. Lists become paragraphs. Image captions disappear. Block layouts collapse.
- Ghost export (.json): cleaner than the others, but only useful if you're moving Ghost-to-Ghost. Importing into Hugo still requires a transformation step.
- Webflow export: pure HTML and CSS, designed for hosting elsewhere as static files — not for migrating the content into a different CMS.
- Notion export: famously broken Markdown with UUID-suffixed filenames, missing callouts, and tables that don't roundtrip.
- Medium export: HTML files in a zip, with Medium's specific class soup baked in.
In every case the rendered page is cleaner than the export. The page behind the URL is what your readers see; it's the content as the original CMS itself decided to present it. Converting from the URL bypasses the export format entirely and gives you exactly what's on the page.
The migration workflow, end to end
Step 1: Crawl the source site for all URLs
The cleanest source of URLs is the site's own sitemap. Most CMSes generate /sitemap.xml automatically. Parse it for the full URL list:
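import requests
import xml.etree.ElementTree as ET

# Fetch the sitemap and fail loudly on HTTP errors instead of parsing an error page
resp = requests.get("https://oldsite.com/sitemap.xml", timeout=30)
resp.raise_for_status()
ns = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}
urls = [u.text.strip() for u in ET.fromstring(resp.content).findall(".//sm:loc", ns)]
print(f"Found {len(urls)} URLs to migrate")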
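One caveat with the snippet above: some CMSes (WordPress among them) serve a sitemap index at /sitemap.xml, so the loc entries point at child sitemaps rather than pages. The following is a minimal sketch of a recursive variant, not a definitive implementation. The names collect_urls and fetch are hypothetical; fetch is any callable that returns the raw bytes for a URL, which keeps the parsing logic testable offline.

```python
import xml.etree.ElementTree as ET
from typing import Callable, List, Optional, Set

NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def collect_urls(
    sitemap_url: str,
    fetch: Callable[[str], bytes],
    seen: Optional[Set[str]] = None,
) -> List[str]:
    """Return page URLs from a sitemap, following sitemap-index entries recursively."""
    seen = set() if seen is None else seen
    if sitemap_url in seen:  # guard against indexes that reference each other
        return []
    seen.add(sitemap_url)

    root = ET.fromstring(fetch(sitemap_url))
    locs = [el.text.strip() for el in root.findall(".//sm:loc", NS) if el.text]

    # A <sitemapindex> root lists child sitemaps; a <urlset> root lists pages.
    if root.tag.endswith("sitemapindex"):
        pages: List[str] = []
        for child in locs:
            pages.extend(collect_urls(child, fetch, seen))
        return pages
    return locs
```

With requests installed, a call might look like collect_urls("https://oldsite.com/sitemap.xml", lambda u: requests.get(u, timeout=30).content).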
For sites without a clean sitemap, fall back to crawling: a small script that follows internal links from the homepage, depth-limited to avoid crawler traps such as calendar archives and endless pagination. Scrapy is the standard tool; for one-off migrations, a 50-line BeautifulSoup loop is enough.
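As a rough illustration of the fallback crawl, here is a depth-limited breadth-first sketch. It swaps BeautifulSoup for the standard library's html.parser so it runs without extra dependencies. LinkParser, crawl, and the fetch parameter are hypothetical names; fetch is any callable that returns the HTML text for a URL.

```python
from collections import deque
from html.parser import HTMLParser
from typing import Callable, List, Set
from urllib.parse import urldefrag, urljoin, urlparse


class LinkParser(HTMLParser):
    """Collects href values from <a> tags."""

    def __init__(self) -> None:
        super().__init__()
        self.hrefs: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def crawl(start_url: str, fetch: Callable[[str], str], max_depth: int = 3) -> List[str]:
    """Breadth-first crawl of same-host links, at most max_depth hops from start_url."""
    host = urlparse(start_url).netloc
    seen: Set[str] = {start_url}
    queue = deque([(start_url, 0)])
    while queue:
        url, depth = queue.popleft()
        if depth >= max_depth:
            continue  # keep the page in the result, but do not follow its links
        parser = LinkParser()
        parser.feed(fetch(url))
        for href in parser.hrefs:
            # Resolve relative links and drop #fragments so each page is counted once
            link, _fragment = urldefrag(urljoin(url, href))
            if urlparse(link).netloc == host and link not in seen:
                seen.add(link)
                queue.append((link, depth + 1))
    return sorted(seen)
```

The depth limit is what keeps the crawl out of traps; a production crawl would also want rate limiting and error handling around fetch.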
Step 2: Convert every URL to Markdown
Pipe the URL list through our URL-to-Markdown converter. The batch API takes URLs in, returns Markdown out, and includes frontmatter generated from the page metadata (title, description, OpenGraph tags, publish date, author):
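from pathlib import Path
from urllib.parse import urlparse
from mdisbetter import batch_convert  # pseudo-code; replace with actual API call

results = batch_convert(
    urls=urls,
    include_frontmatter=True,
    extract_metadata=["title", "description", "og:image", "article:published_time", "article:author"],
    rewrite_internal_links=True,
)
out_dir = Path("./content/posts")
out_dir.mkdir(parents=True, exist_ok=True)
for url, md in results.items():
    # The last path segment becomes the filename; the homepage has none, so name it "index"
    slug = urlparse(url).path.rstrip("/").split("/")[-1] or "index"
    (out_dir / f"{slug}.md").write_text(md, encoding="utf-8")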
For 500-page sites, expect 8-15 minutes runtime. Output is one clean Markdown file per URL, dropped into your new SSG's content directory.
Step 3: Internal link rewriting
This is the step every naive migration script forgets and every senior engineer remembers. Your old URLs were https://oldsite.com/2023/04/my-post-title. Your new URLs will be https://newsite.com/posts/my-post-title. Every internal link in the converted Markdown points to the old structure.
If you're migrating to the same domain (just a platform change), this is partially handled by your redirects (next step). If the URL structure also changes (typical: dropping date prefixes, switching from /category/slug to flat /slug), the links inside the Markdown need rewriting before they ship. The converter's rewrite_internal_links option handles the most common transforms; for custom URL schemes, a regex pass over the Markdown corpus finishes the job.
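For the custom regex pass, a minimal sketch might look like the following. It assumes the date-prefix scheme from the example above (/2023/04/my-post-title becoming /posts/my-post-title) and only touches inline Markdown links whose path matches that pattern. OLD_HOSTS, new_path_for, and rewrite_links are hypothetical names, and reference-style links are not handled.

```python
import re
from urllib.parse import urlparse

OLD_HOSTS = {"oldsite.com", "www.oldsite.com"}  # assumption: hosts that count as internal
DATE_PREFIX = re.compile(r"^/\d{4}/\d{2}/(?P<slug>[^/]+)/?$")
# Matches the target of an inline Markdown link: [text](target) or [text](target "title")
MD_LINK = re.compile(r"(?P<head>\]\()(?P<target>[^)\s]+)")


def new_path_for(old_path: str) -> str:
    """Map /2023/04/my-post-title to /posts/my-post-title; leave other paths unchanged."""
    m = DATE_PREFIX.match(old_path)
    return f"/posts/{m.group('slug')}" if m else old_path


def rewrite_links(markdown: str) -> str:
    """Rewrite internal date-prefixed links to root-relative paths in the new structure."""

    def repl(match: re.Match) -> str:
        parts = urlparse(match.group("target"))
        if parts.netloc and parts.netloc not in OLD_HOSTS:
            return match.group(0)  # external link: leave untouched
        new = new_path_for(parts.path)
        if new == parts.path:
            return match.group(0)  # path does not match the old scheme: leave untouched
        if parts.query:
            new += f"?{parts.query}"
        if parts.fragment:
            new += f"#{parts.fragment}"
        return f"{match.group('head')}{new}"

    return MD_LINK.sub(repl, markdown)
```

Reusing the same path transform for both link rewriting and the redirect map keeps the two from drifting apart.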
Step 4: Generate the redirect map
Every old URL needs a 301 redirect to its new location, or you lose every backlink and every search ranking the old site accumulated. Generate the map alongside the conversion:
from urllib.parse import urlparse

redirects = []
for url in urls:
    old_path = urlparse(url).path
    new_path = compute_new_path(old_path)  # your URL transform logic (you supply this function)
    if old_path != new_path:
        redirects.append((old_path, new_path))
# Output as Netlify _redirects, Vercel vercel.json, or nginx conf depending on host
# (this writes the Netlify-style "old new status" format)
with open("_redirects", "w", encoding="utf-8") as f:
    for old, new in redirects:
        f.write(f"{old} {new} 301\n")
Test the redirect map against the top 100 URLs by traffic before launch (your old analytics has the list). A missed redirect on a high-traffic page is the single most expensive migration error.
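Part of that test can run offline before anything is deployed. The sketch below checks a list of top-traffic paths against the redirect map and the set of paths the new build actually produces. check_redirects, top_paths, and live_paths are hypothetical names; how you derive live_paths (for example from the build output directory) depends on your SSG. It complements, rather than replaces, requesting the old URLs against a staging deploy.

```python
from typing import Dict, Iterable, List, Set, Tuple


def check_redirects(
    top_paths: Iterable[str],
    redirects: List[Tuple[str, str]],
    live_paths: Set[str],
) -> Dict[str, str]:
    """Return {old_path: problem} for every top path that would not resolve cleanly."""
    redirect_map = dict(redirects)
    problems: Dict[str, str] = {}
    for path in top_paths:
        # Paths whose URL did not change have no redirect entry and must exist as-is
        target = redirect_map.get(path, path)
        if target in redirect_map:
            problems[path] = f"redirect chain via {target}"
        elif target not in live_paths:
            problems[path] = f"target {target} not in new site"
    return problems
```

An empty result means every checked path either still exists or redirects in a single hop to a page in the new build.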
Step 5: Image migration
The converter captures image URLs as they appear on the rendered page. For images hosted on your old CMS's media library, you have two options:
- Mirror: download every referenced image to a new /static/images/ folder and rewrite the Markdown links to the new local paths. Belt-and-suspenders durability; you no longer depend on the old host.
- Hotlink: leave images pointing at the old domain (or a CDN), as long as that origin will stay alive. Simpler, but creates a dependency on the legacy host.
For most migrations, mirroring is worth the extra hour. The download script is a parallel requests.get loop over the image URLs you extracted from the Markdown.
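A sketch of the mirroring pass, under a few assumptions: images use inline Markdown syntax, only images on the old host are mirrored, and the original path is preserved under the new prefix to avoid filename collisions. plan_mirror and download are hypothetical names, and the download uses the standard library's urlopen in place of requests.get so the example has no third-party dependencies.

```python
import re
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path
from typing import Dict, Set, Tuple
from urllib.parse import urlparse
from urllib.request import urlopen

# Inline image syntax: ![alt](https://host/path.png) with an optional "title" after the URL
IMG = re.compile(r"(!\[[^\]]*\]\()(https?://[^)\s]+)")


def plan_mirror(markdown: str, hosts: Set[str], prefix: str = "/images") -> Tuple[str, Dict[str, str]]:
    """Return rewritten Markdown plus a {remote_url: local_path} download plan."""
    plan: Dict[str, str] = {}

    def repl(match: re.Match) -> str:
        url = match.group(2)
        parts = urlparse(url)
        if parts.netloc not in hosts:
            return match.group(0)  # image hosted elsewhere: leave the link alone
        plan[url] = prefix.rstrip("/") + parts.path
        return match.group(1) + plan[url]

    return IMG.sub(repl, markdown), plan


def download(plan: Dict[str, str], root: Path) -> None:
    """Fetch every planned image in parallel and write it under root."""

    def fetch(item: Tuple[str, str]) -> None:
        url, local = item
        dest = root / local.lstrip("/")
        dest.parent.mkdir(parents=True, exist_ok=True)
        with urlopen(url, timeout=30) as resp:
            dest.write_bytes(resp.read())

    with ThreadPoolExecutor(max_workers=8) as pool:
        list(pool.map(fetch, plan.items()))  # list() surfaces any download error
```

For a Hugo site, calling download(plan, Path("static")) would place files under static/images/, which Hugo serves at /images/.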
Step 6: Frontmatter normalization
Different SSGs expect slightly different frontmatter shapes. Hugo reads date, Astro's blog starter schema uses pubDate, and Jekyll reads date too (its published key is a boolean visibility flag, not a timestamp). The converter emits a generic schema; a 20-line normalization pass adapts it to your target SSG. Standardize on a frontmatter contract during migration; it pays off later when you build content audit and dashboard tooling on top of the corpus.
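A normalization pass can be as small as a key rename. The sketch below assumes flat key: value frontmatter delimited by --- lines and a hypothetical generic key published_time; the converter's real key names may differ, so treat KEY_MAPS as a placeholder to adjust. It deliberately avoids a YAML dependency and only renames top-level keys.

```python
from typing import Dict

# Assumption: "published_time" is a hypothetical generic key; adjust to the converter's output.
KEY_MAPS: Dict[str, Dict[str, str]] = {
    "hugo": {"published_time": "date"},
    "astro": {"published_time": "pubDate"},
}


def normalize_frontmatter(doc: str, target: str) -> str:
    """Rename top-level frontmatter keys for the target SSG; the body is left untouched."""
    if not doc.startswith("---\n"):
        return doc  # no frontmatter block
    header, sep, body = doc[4:].partition("\n---\n")
    if not sep:
        return doc  # unterminated frontmatter: leave as-is
    renames = KEY_MAPS[target]
    lines = []
    for line in header.split("\n"):
        key, colon, rest = line.partition(":")
        # Indented keys never equal a top-level name, so nested values are not touched
        if colon and key in renames:
            line = f"{renames[key]}:{rest}"
        lines.append(line)
    return "---\n" + "\n".join(lines) + "\n---\n" + body
```

If the frontmatter contains nested structures or multi-line values that need reshaping, a real YAML parser is the safer choice.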
Migrating mixed-source sites
Many real migrations are not single-CMS. The corporate site is WordPress, the docs are on Confluence, the help center is on Zendesk, and there's a folder of legacy whitepapers in PDF. URL-to-Markdown handles the web sources; for the PDF half of the corpus, the parallel pipeline in PDF to Markdown produces output in the same format. You end up with one homogeneous Markdown content tree regardless of where each piece originated. Detail on the PDF leg of mixed-source migrations is in PDF to Markdown for business.
Concrete example: WordPress to Astro
You're moving a 7-year-old marketing site with 420 blog posts from WordPress (hosted on WP Engine) to Astro (deployed on Cloudflare Pages). End-to-end:
- Day 1 (audit): pull sitemap.xml. 420 posts + 30 static pages = 450 URLs. Cross-check against analytics for top-traffic pages.
- Day 1 (convert): run batch URL-to-Markdown over all 450. ~12 minutes. Output: 450 .md files in src/content/posts/ and src/content/pages/.
- Day 2 (normalize): frontmatter normalization for Astro Content Collections schema. Internal link rewrite from /2024/03/slug to /posts/slug. Image mirror to /public/images/.
- Day 2 (build): npm run build. Astro renders the corpus. Local preview looks correct.
- Day 3 (redirects): generate _redirects for Cloudflare Pages with all 450 old→new mappings. Test top 50 by traffic.
- Day 3 (cutover): DNS swap to Cloudflare. WP Engine origin kept warm for 30 days as fallback.
- Day 4-7 (cleanup): monitor 404 logs, fix the inevitable handful of edge cases. Spot-check 50 random posts for formatting regressions.
One week, end to end, for a migration that would have taken six weeks with the official WordPress export and a custom Gutenberg parser.
What about content you actually want to leave behind?
Migrations are also pruning opportunities. Before converting, run analytics over the URL list and tag everything with no traffic in the last 12 months. For most marketing sites, 30-50% of historic content is dead weight that should not migrate. Convert the keepers; redirect the cuts to the closest topical alternative (or to the homepage as a last resort). A leaner, faster, more focused site is half the value of doing the migration in the first place.
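The tagging step reduces to a partition over the URL list. This sketch assumes you have exported 12-month pageview counts from analytics into a dict keyed by URL; split_keep_cut, prune_redirects, and the overrides mapping are hypothetical names. The closest topical alternative is a human decision, so it is supplied by hand, with the homepage as the fallback.

```python
from typing import Dict, List, Tuple
from urllib.parse import urlparse


def split_keep_cut(
    urls: List[str], pageviews_12m: Dict[str, int], min_views: int = 1
) -> Tuple[List[str], List[str]]:
    """Partition URLs into keepers and cuts by trailing 12-month pageviews."""
    keep = [u for u in urls if pageviews_12m.get(u, 0) >= min_views]
    cut = [u for u in urls if pageviews_12m.get(u, 0) < min_views]
    return keep, cut


def prune_redirects(cut: List[str], overrides: Dict[str, str]) -> List[Tuple[str, str]]:
    """Map each cut URL to a hand-picked alternative, or to the homepage as a last resort."""
    return [(urlparse(u).path, overrides.get(u, "/")) for u in cut]
```

Only the keep list goes to the converter; the pairs from prune_redirects are appended to the redirect map.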
Post-migration QA: the spot-check protocol
No automated migration is bit-perfect. The realistic QA strategy is sampling, not exhaustive review. After cutover, the discipline is:
- Top-50 by traffic: read each rendered page on the new platform side-by-side with the (still-up) old platform. Look for missing sections, broken images, code-block formatting regressions, table issues.
- Random 5% sample: random-pick 5% of the corpus regardless of traffic. This catches systematic issues that affect the long tail (a Gutenberg block your converter handled poorly, an embed pattern that consistently breaks).
- 404 log review: 14 days post-cutover, pull every 404 from server logs and triage. Most are bots and legacy long-dead URLs; the rest are missed redirects you patch one-by-one.
- Search query log review: if the old site had on-site search, check the most common queries and confirm the new platform's search returns useful results for the same queries.
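The first two checks in the list above can be turned into a reproducible review list. A minimal sketch, assuming pageview counts keyed by URL; qa_sample and its parameters are hypothetical names. The fixed seed makes the random sample repeatable so two reviewers work from the same list.

```python
import random
from typing import Dict, List, Tuple


def qa_sample(
    urls: List[str],
    pageviews: Dict[str, int],
    top_n: int = 50,
    fraction: float = 0.05,
    seed: int = 0,
) -> Tuple[List[str], List[str]]:
    """Return (top pages by traffic, random sample of the whole corpus)."""
    by_traffic = sorted(urls, key=lambda u: pageviews.get(u, 0), reverse=True)
    top = by_traffic[:top_n]

    # Sample from the full corpus regardless of traffic, so the long tail is represented
    k = min(len(urls), max(1, round(len(urls) * fraction)))
    sample = random.Random(seed).sample(urls, k)
    return top, sample
```

The two lists may overlap; deduplicate them if each page should be reviewed only once.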
A two-engineer team can complete this QA pass in 3-4 days for a 500-page migration. The cost of skipping it is shaped like an iceberg: 2% of pages have visible issues, 5% of users encounter one, and the support tickets trickle in for months.
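For the 404 log review, a short tally is usually enough to separate real misses from noise. This sketch assumes access logs in the common or combined log format, where the request line is quoted and followed by the status code; top_404s is a hypothetical name. Query strings are stripped so variants of the same path are counted together.

```python
import re
from collections import Counter
from typing import Iterable, List, Tuple

# Matches e.g.: "GET /2023/04/a?utm=x HTTP/1.1" 404 153
REQUEST_404 = re.compile(r'"(?:GET|HEAD) (?P<path>\S+) HTTP/[\d.]+" 404 ')


def top_404s(log_lines: Iterable[str], limit: int = 20) -> List[Tuple[str, int]]:
    """Count 404 responses per path and return the most frequent ones."""
    counts: Counter = Counter()
    for line in log_lines:
        m = REQUEST_404.search(line)
        if m:
            counts[m.group("path").split("?", 1)[0]] += 1
    return counts.most_common(limit)
```

Paths near the top of the list that match the old URL scheme are the likeliest missed redirects.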
Why this approach lasts
The migration you do this year will not be your last. In 4-6 years, you'll likely move again — to a platform that doesn't yet exist. If your content was migrated into Markdown, the next migration is almost free: Markdown corpus → next platform's content directory, with whatever frontmatter normalization that platform requires. The expensive migration is the one out of a proprietary format. Once you're in Markdown, you're in the format every future platform will support.