· 10 min read · MDisBetter

URL to Markdown for Website Content Migration

Every CMS migration starts as a one-week project and ends as a three-month slog. The mismatch is always in the same place: the content. The new platform looks great, the design system is built, the deployment pipeline is wired up, and then someone realizes there are 600 WordPress posts to move and the official export is a wall of XML with inline CSS, broken shortcodes, and image references that point to a server about to be decommissioned.

URL-to-Markdown reframes the problem. Instead of wrestling with platform-specific export formats, you treat your existing site as the source of truth and convert page by page from its public URLs. The output is clean Markdown with frontmatter, ready to drop into a Hugo, Astro, Eleventy, Ghost, or Docusaurus content directory.

Why "just use the export" doesn't work

Every CMS has an export. None of them produce content you can actually use directly. A representative tour:

In every case the rendered URL is cleaner than the export. The URL is what your readers see; it's the content as the original CMS itself decided to present it. Converting from the URL bypasses the export format entirely and gives you exactly what's on the page.

The migration workflow, end to end

Step 1: Crawl the source site for all URLs

The cleanest source of URLs is the site's own sitemap. Most CMSes generate /sitemap.xml automatically. Parse it for the full URL list:

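import xml.etree.ElementTree as ET

import requests

# Fetch the sitemap; fail loudly on HTTP errors instead of parsing an error page
resp = requests.get("https://oldsite.com/sitemap.xml", timeout=30)
resp.raise_for_status()

# <loc> elements live in the sitemap namespace
ns = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}
urls = [loc.text.strip() for loc in ET.fromstring(resp.content).findall(".//sm:loc", ns)]
print(f"Found {len(urls)} URLs to migrate")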
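
One caveat worth sketching: some sites publish a sitemap index (a sitemap whose `<loc>` entries point at other sitemaps rather than pages), in which case a flat parse returns sitemap URLs instead of page URLs. The following is a minimal sketch of a recursive parse. The names `collect_urls` and `fetch` are illustrative, not part of any converter API, and `fetch` is injected so the function can be exercised without network access.

```python
import xml.etree.ElementTree as ET
from typing import Callable

SM_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def collect_urls(sitemap_url: str, fetch: Callable[[str], bytes], max_depth: int = 3) -> list[str]:
    """Return page URLs from a sitemap, following sitemap-index children."""
    root = ET.fromstring(fetch(sitemap_url))
    locs = [el.text.strip() for el in root.findall(".//sm:loc", SM_NS) if el.text]
    # A sitemap index lists other sitemaps, not pages, so recurse into each one.
    if root.tag.endswith("sitemapindex"):
        if max_depth <= 0:
            return []
        pages: list[str] = []
        for child in locs:
            pages.extend(collect_urls(child, fetch, max_depth - 1))
        return pages
    return locs
```

With `requests` installed, something like `collect_urls("https://oldsite.com/sitemap.xml", lambda u: requests.get(u, timeout=30).content)` would plug in a real fetcher.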

For sites without a clean sitemap, fall back to crawling: a small script that follows internal links from the homepage, depth-limited to avoid crawler traps such as infinite calendar or filter pages. Scrapy is a standard tool for this; for one-off migrations, a 50-line BeautifulSoup loop is enough.
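
As a rough illustration of that fallback, here is a depth-limited, same-host crawl. To stay dependency-free it uses the standard library's `html.parser` in place of BeautifulSoup, and it takes a `fetch` callable (an assumption of this sketch) so the HTTP layer can be swapped or stubbed.

```python
from collections import deque
from html.parser import HTMLParser
from typing import Callable
from urllib.parse import urldefrag, urljoin, urlparse


class _LinkParser(HTMLParser):
    """Collects the href of every <a> tag fed to it."""

    def __init__(self) -> None:
        super().__init__()
        self.hrefs: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)


def crawl(start_url: str, fetch: Callable[[str], str],
          max_depth: int = 3, max_pages: int = 5000) -> list[str]:
    """Breadth-first crawl of same-host links, bounded by depth and page count."""
    host = urlparse(start_url).netloc
    seen = {start_url}
    queue = deque([(start_url, 0)])
    while queue:
        url, depth = queue.popleft()
        if depth >= max_depth:
            continue  # record the URL but do not follow its links
        parser = _LinkParser()
        parser.feed(fetch(url))
        for href in parser.hrefs:
            link = urldefrag(urljoin(url, href)).url  # absolute URL, fragment removed
            parsed = urlparse(link)
            if parsed.scheme not in ("http", "https") or parsed.netloc != host:
                continue
            if link in seen:
                continue
            if len(seen) >= max_pages:
                return sorted(seen)
            seen.add(link)
            queue.append((link, depth + 1))
    return sorted(seen)
```

The depth and page caps are the guard against traps; the right values depend on how deeply the old site nests its content.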

Step 2: Convert every URL to Markdown

Pipe the URL list through our URL-to-Markdown converter. The batch API takes URLs in, returns Markdown out, and includes frontmatter generated from the page metadata (title, description, OpenGraph tags, publish date, author):

from pathlib import Path

from mdisbetter import batch_convert  # pseudo-code; replace with actual API call

results = batch_convert(
    urls=urls,
    include_frontmatter=True,
    extract_metadata=["title", "description", "og:image", "article:published_time", "article:author"],
    rewrite_internal_links=True
)

# Write one Markdown file per URL, named after the last path segment
out_dir = Path("./content/posts")
out_dir.mkdir(parents=True, exist_ok=True)
for url, md in results.items():
    slug = url.rstrip("/").split("/")[-1]
    (out_dir / f"{slug}.md").write_text(md, encoding="utf-8")

For 500-page sites, expect 8-15 minutes runtime. Output is one clean Markdown file per URL, dropped into your new SSG's content directory.
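
Because the file name comes from the last path segment only, two URLs such as /news/launch and /blog/launch would write to the same file, and the homepage would get the domain as its slug. A small pre-flight check, sketched here with the hypothetical helper names `slug_for` and `find_slug_collisions`, can flag those cases before anything is written:

```python
from collections import defaultdict


def slug_for(url: str) -> str:
    """Same slug rule as the write loop: the last path segment."""
    return url.rstrip("/").split("/")[-1]


def find_slug_collisions(urls: list[str]) -> dict[str, list[str]]:
    """Group URLs by slug and return only the slugs claimed by more than one URL."""
    by_slug: dict[str, list[str]] = defaultdict(list)
    for url in urls:
        by_slug[slug_for(url)].append(url)
    return {slug: group for slug, group in by_slug.items() if len(group) > 1}
```

Any slug that comes back from this check needs a manual decision: rename one of the files, or keep part of the old path in the new one.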

Step 3: Internal link rewriting

This is the step every naive migration script forgets and every senior engineer remembers. Your old URLs were https://oldsite.com/2023/04/my-post-title. Your new URLs will be https://newsite.com/posts/my-post-title. Every internal link in the converted Markdown points to the old structure.

If you're migrating to the same domain (just a platform change), this is partially handled by your redirects (next step). If the URL structure also changes (typical: dropping date prefixes, switching from /category/slug to flat /slug), the links inside the Markdown need rewriting before they ship. The converter's rewrite_internal_links option handles the most common transforms; for custom URL schemes, a regex pass over the Markdown corpus finishes the job.
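
A minimal sketch of such a regex pass, assuming the transform is the one described above (dropping a /YYYY/MM/ date prefix in favor of /posts/). `OLD_HOST`, `compute_new_path`, and `rewrite_internal_links` are illustrative names for this example, not the converter's API, and the pattern only covers inline Markdown links, not reference-style links or raw HTML.

```python
import re
from urllib.parse import urlparse

OLD_HOST = "oldsite.com"  # assumption: the domain being migrated
DATE_PREFIX = re.compile(r"^/\d{4}/\d{2}/(?P<slug>[^/]+)/?$")
# Target of an inline Markdown link: ](target) or ](target "title")
LINK_TARGET = re.compile(r"\]\((?P<target>[^)\s]+)(?P<rest>[^)]*)\)")


def compute_new_path(old_path: str) -> str:
    """Hypothetical transform: /2023/04/my-post-title -> /posts/my-post-title."""
    m = DATE_PREFIX.match(old_path)
    return f"/posts/{m.group('slug')}" if m else old_path


def rewrite_internal_links(markdown: str) -> str:
    """Rewrite links to old-structure URLs; leave everything else untouched."""

    def replace(match: re.Match) -> str:
        parsed = urlparse(match.group("target"))
        internal = parsed.netloc in ("", OLD_HOST, f"www.{OLD_HOST}") and parsed.path.startswith("/")
        if not internal:
            return match.group(0)
        new_path = compute_new_path(parsed.path)
        if new_path == parsed.path:
            return match.group(0)  # no transform applies (e.g. media URLs)
        if parsed.fragment:
            new_path += f"#{parsed.fragment}"
        return f"]({new_path}{match.group('rest')})"

    return LINK_TARGET.sub(replace, markdown)
```

Rewritten links come out root-relative, so they resolve on whichever domain the new site is served from.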

Step 4: Generate the redirect map

Every old URL needs a 301 redirect to its new location, or you lose every backlink and every search ranking the old site accumulated. Generate the map alongside the conversion:

from urllib.parse import urlparse

redirects = []
for url in urls:
    old_path = urlparse(url).path
    new_path = compute_new_path(old_path)  # your URL transform logic (you supply this function)
    if old_path != new_path:
        redirects.append((old_path, new_path))

# Output as Netlify _redirects, Vercel vercel.json, or nginx conf depending on host
with open("_redirects", "w", encoding="utf-8") as f:
    for old, new in redirects:
        f.write(f"{old} {new} 301\n")

Test the redirect map against the top 100 URLs by traffic before launch (your old analytics has the list). A missed redirect on a high-traffic page is the single most expensive migration error.
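
Part of that test can run offline before launch. The sketch below assumes three inputs you would assemble yourself: the top-traffic paths from analytics, the `(old, new)` redirect pairs, and the set of paths the new build actually produces. The function name `missing_redirects` is illustrative.

```python
def _norm(path: str) -> str:
    """Treat /about and /about/ as the same path."""
    return path.rstrip("/") or "/"


def missing_redirects(top_paths, redirects, new_site_paths):
    """Return top-traffic paths that neither redirect to a live page nor exist as-is."""
    redirect_map = {_norm(old): _norm(new) for old, new in redirects}
    live = {_norm(p) for p in new_site_paths}
    missing = []
    for path in top_paths:
        # A path without a redirect entry must exist unchanged on the new site.
        target = redirect_map.get(_norm(path), _norm(path))
        if target not in live:
            missing.append(path)
    return missing
```

An empty result means every high-traffic URL has somewhere to land. After cutover, the live equivalent is to request each old URL with redirect-following disabled and confirm a 301 status with the expected Location header.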

Step 5: Image migration

The converter captures image URLs as they appear on the rendered page. For images hosted on your old CMS's media library, you have two options:

For most migrations, mirroring is worth the extra hour. The download script is a parallel requests.get loop over the image URLs you extracted from the Markdown.
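
A sketch of that mirror step follows. It pulls absolute image URLs out of the Markdown with a regex and downloads them on a thread pool. To keep the example dependency-free it uses the standard library's `urlopen` where the text mentions `requests.get`; the `fetch` parameter is an assumption of this sketch so the downloader can be swapped.

```python
import re
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path
from typing import Callable
from urllib.parse import urlparse
from urllib.request import urlopen

# Absolute URL inside an inline Markdown image: ![alt](https://...)
IMAGE_URL = re.compile(r"!\[[^\]]*\]\((https?://[^)\s]+)")


def extract_image_urls(markdown: str) -> list[str]:
    """Return the unique absolute image URLs referenced in a Markdown string."""
    return sorted(set(IMAGE_URL.findall(markdown)))


def _download(url: str) -> bytes:
    with urlopen(url, timeout=30) as resp:
        return resp.read()


def mirror_images(urls: list[str], dest_dir: str,
                  fetch: Callable[[str], bytes] = _download,
                  workers: int = 8) -> dict[str, Path]:
    """Download each image into dest_dir in parallel; return url -> local path."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)

    def save(url: str) -> tuple[str, Path]:
        # Note: files are named by basename only, so same-named images collide.
        target = dest / Path(urlparse(url).path).name
        target.write_bytes(fetch(url))
        return url, target

    with ThreadPoolExecutor(max_workers=workers) as pool:
        return dict(pool.map(save, urls))
```

The returned mapping is what a follow-up pass would use to swap old image URLs in the Markdown for their new local paths.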

Step 6: Frontmatter normalization

Different SSGs expect slightly different frontmatter shapes. Hugo and Jekyll read the publish date from date, while Astro's blog starter schema uses pubDate (in Jekyll, published is a boolean that hides a post, not a date). The converter emits a generic schema; a 20-line normalization pass adapts it to your target SSG. Standardize on a frontmatter contract during migration: it pays off later when you build content audit and dashboard tooling on top of the corpus.
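
The normalization pass can be as small as a key rename. This sketch assumes flat `key: value` frontmatter delimited by `---` lines, and assumes (hypothetically, since the converter's schema is not spelled out here) that the generic key is `date` and the target wants `pubDate`. Nested YAML would call for a real YAML parser instead.

```python
KEY_MAP = {"date": "pubDate"}  # assumption: generic key -> target SSG key


def normalize_frontmatter(text: str, key_map: dict[str, str] = KEY_MAP) -> str:
    """Rename top-level frontmatter keys; leave the body untouched."""
    lines = text.split("\n")
    if not lines or lines[0].strip() != "---":
        return text  # no frontmatter block
    try:
        end = lines.index("---", 1)  # closing delimiter
    except ValueError:
        return text  # unterminated frontmatter: do not guess
    for i in range(1, end):
        key, sep, rest = lines[i].partition(":")
        # Indented (nested) keys keep their leading spaces and so never match.
        if sep and key in key_map:
            lines[i] = f"{key_map[key]}:{rest}"
    return "\n".join(lines)
```

Running it over every file in the content directory is a single loop with `Path.read_text` and `Path.write_text`.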

Migrating mixed-source sites

Many real migrations are not single-CMS. The corporate site is WordPress, the docs are on Confluence, the help center is on Zendesk, and there's a folder of legacy whitepapers in PDF. URL-to-Markdown handles the web sources; for the PDF half of the corpus, the parallel pipeline in PDF to Markdown produces output in the same format. You end up with one homogeneous Markdown content tree regardless of where each piece originated. Detail on the PDF leg of mixed-source migrations is in PDF to Markdown for business.

Concrete example: WordPress to Astro

You're moving a 7-year-old marketing site with 420 blog posts from WordPress (hosted on WP Engine) to Astro (deployed on Cloudflare Pages). End-to-end:

  1. Day 1 (audit): pull sitemap.xml. 420 posts + 30 static pages = 450 URLs. Cross-check against analytics for top-traffic pages.
  2. Day 1 (convert): run batch URL-to-Markdown over all 450. ~12 minutes. Output: 450 .md files in src/content/posts/ and src/content/pages/.
  3. Day 2 (normalize): frontmatter normalization for Astro Content Collections schema. Internal link rewrite from /2024/03/slug to /posts/slug. Image mirror to /public/images/.
  4. Day 2 (build): npm run build. Astro renders the corpus. Local preview looks correct.
  5. Day 3 (redirects): generate _redirects for Cloudflare Pages with all 450 old→new mappings. Test top 50 by traffic.
  6. Day 3 (cutover): DNS swap to Cloudflare. WP Engine origin kept warm for 30 days as fallback.
  7. Day 4-7 (cleanup): monitor 404 logs, fix the inevitable handful of edge cases. Spot-check 50 random posts for formatting regressions.

One week, end to end, for a migration that would have taken six weeks with the official WordPress export and a custom Gutenberg parser.

What about content you actually want to leave behind?

Migrations are also pruning opportunities. Before converting, run analytics over the URL list and tag everything with no traffic in the last 12 months. For most marketing sites, 30-50% of historic content is dead weight that should not migrate. Convert the keepers; redirect the cuts to the closest topical alternative (or to the homepage as a last resort). A leaner, faster, more focused site is half the value of doing the migration in the first place.
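
The keep/cut split is easy to script once analytics is exported. In this sketch, `pageviews_12mo` (URL to pageviews over the last 12 months) and `alternatives` (a hand-curated map from a cut path to its closest topical replacement) are assumed inputs, and both function names are illustrative.

```python
from urllib.parse import urlparse


def split_by_traffic(urls, pageviews_12mo, min_views=1):
    """Partition URLs into keepers and cuts by trailing-12-month pageviews."""
    keep, cut = [], []
    for url in urls:
        (keep if pageviews_12mo.get(url, 0) >= min_views else cut).append(url)
    return keep, cut


def redirects_for_cuts(cut_urls, alternatives, fallback="/"):
    """Redirect each cut path to its curated alternative, or to the homepage."""
    pairs = []
    for url in cut_urls:
        path = urlparse(url).path
        pairs.append((path, alternatives.get(path, fallback)))
    return pairs
```

The keepers feed the conversion step; the pairs from `redirects_for_cuts` get appended to the redirect map.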

Post-migration QA: the spot-check protocol

No automated migration is bit-perfect. The realistic QA strategy is sampling, not exhaustive review. After cutover, the discipline is:

  1. Top-50 by traffic: read each rendered page on the new platform side-by-side with the (still-up) old platform. Look for missing sections, broken images, code-block formatting regressions, table issues.
  2. Random 5% sample: random-pick 5% of the corpus regardless of traffic. This catches systematic issues that affect the long tail (a Gutenberg block your converter handled poorly, an embed pattern that consistently breaks).
  3. 404 log review: 14 days post-cutover, pull every 404 from server logs and triage. Most are bots and legacy long-dead URLs; the rest are missed redirects you patch one-by-one.
  4. Search query log review: if the old site had on-site search, check the most common queries and confirm the new platform's search returns useful results for the same queries.
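
Two of these steps lend themselves to small scripts: drawing the random sample and triaging 404s. A minimal sketch, assuming you already have the migrated URL list and a list of request paths that returned 404 (extracting those from your server's log format is left out). Function names are illustrative.

```python
import math
import random
from collections import Counter


def qa_sample(urls, fraction=0.05, seed=0):
    """Pick a reproducible random sample of the corpus for manual review."""
    if not urls:
        return []
    k = min(len(urls), max(1, math.ceil(len(urls) * fraction)))
    # Sorting first makes the sample depend only on the seed, not input order.
    return random.Random(seed).sample(sorted(urls), k)


def triage_404s(paths_404, old_site_paths):
    """Split 404 hits into missed redirects (known old paths) and other noise."""
    counts = Counter(paths_404)
    known = set(old_site_paths)
    missed = {p: n for p, n in counts.items() if p in known}
    noise = {p: n for p, n in counts.items() if p not in known}
    return missed, noise
```

Fixing the seed means a second reviewer can regenerate the same sample. Paths in `missed` are the ones worth patching first, ordered by hit count.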

A two-engineer team can complete this QA pass in 3-4 days for a 500-page migration. The cost of skipping it is shaped like an iceberg: 2% of pages have visible issues, 5% of users encounter one, and the support tickets trickle in for months.

Why this approach lasts

The migration you do this year will not be your last. In 4-6 years, you'll likely move again — to a platform that doesn't yet exist. If your content was migrated into Markdown, the next migration is almost free: Markdown corpus → next platform's content directory, with whatever frontmatter normalization that platform requires. The expensive migration is the one out of a proprietary format. Once you're in Markdown, you're in the format every future platform will support.

Frequently asked questions

How do I handle custom shortcodes from my old WordPress site?

The converter reduces most shortcodes to their rendered output, which is usually what you want: a [button] shortcode becomes a Markdown link, a [gallery] becomes a sequence of image references. For shortcodes with no useful rendered fallback (forms, dynamic widgets), the rendered page shows whatever the plugin output, and you'll need a per-page editorial pass for those specific instances. In practice, this affects under 5% of pages on a typical marketing site.

What about comments? Most CMS exports include them.

URL-to-Markdown captures the article body, not the comment thread. For migrations where comment history matters (community sites, niche blogs with deep discussions), export comments separately from the source CMS and store them as a sidecar JSON next to each Markdown file. Comment systems commonly paired with static sites (Giscus, utterances, Disqus, self-hosted Isso) can usually be backfilled from this, typically with an import script written for the target system.

Can I run the migration in stages, page-type by page-type?

Yes, and we recommend it for sites over ~200 pages. Common staging: convert and ship the blog first (highest URL count, most diff-able), then static pages, then docs/help center last. Each stage is independent: you can ship the blog on the new platform while marketing pages still serve from the old CMS, with reverse-proxy routing handling the split.