
Converting JavaScript-Heavy Pages to Markdown: Technical Deep Dive

For half of the web, fetching a URL and converting the HTML works fine. For the other half — every React, Vue, Svelte, Angular, or Next.js app — the raw HTML is a near-empty shell. The actual content is rendered by JavaScript after page load. Converting these pages to Markdown requires running a real browser, waiting for the JS to do its work, then extracting the post-render DOM. Here's the technical detail of how that works and what it costs.

The two-world problem

Web pages today live in one of two worlds:

- Server-rendered or static pages: the HTML arrives complete, and a plain HTTP fetch captures all the content.
- Client-side rendered (CSR) pages: the HTML is an empty shell, and JavaScript builds the content in the browser after load.

For CSR pages, fetching the URL with curl or a basic HTTP client returns something like:

<!doctype html>
<html>
  <head><title>App</title></head>
  <body>
    <div id="root"></div>
    <script src="/bundle.js"></script>
  </body>
</html>

The actual content lives inside #root after the bundle runs. Without running JS, your converter sees an empty div and emits empty Markdown.
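
To see the failure concretely, here is a minimal sketch of a static fetch feeding a converter (the URL is hypothetical, standing in for any CSR app):

import requests
import html2text

# Fetch the static HTML of a client-side rendered app (hypothetical URL)
resp = requests.get("https://example-spa.app")

# html2text sees only the empty shell: #root has no text before JS runs
markdown = html2text.html2text(resp.text)
print(repr(markdown))  # near-empty output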

Headless browsers: the only solution

To get the post-render content, you need to run a real browser. "Headless" means without a visible window — same browser engine, controlled programmatically, no GUI. Two dominant tools:

- Puppeteer: Node.js library from the Chrome team, focused on Chromium.
- Playwright: Microsoft's cross-browser successor, driving Chromium, Firefox, and WebKit, with bindings for Node.js, Python, Java, and .NET.

Both spawn a real browser process, navigate to the URL, execute JS, and expose the post-render DOM via page.content() or similar. With Playwright's sync Python API:

from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch()            # headless by default
    page = browser.new_page()
    page.goto("https://react.dev/learn")
    page.wait_for_load_state("networkidle")  # wait for JS-driven requests to settle
    html = page.content()                    # post-render HTML
    browser.close()

# Now feed `html` into your HTML-to-Markdown pipeline

This is the core of every modern URL-to-Markdown service. The differentiation lives in the wait strategy and the cleanup after.

Wait conditions: knowing when content is ready

Headless browsers expose multiple events you can wait for. Picking the right one matters — too aggressive and you capture an empty page; too patient and you waste time.

| Wait strategy | What it means | When to use |
|---|---|---|
| load | DOMContentLoaded + all stylesheets/images | Almost never (too early for SPAs) |
| domcontentloaded | Initial HTML parsed | Almost never for SPAs |
| networkidle | No network requests for 500ms | Default for SPAs; works for most pages |
| wait_for_selector | Specific element appears | Best when you know what to look for |
| wait_for_function | Custom JS predicate | Complex cases (e.g., "text length > 1000") |

For a generic URL-to-Markdown service, you can't know in advance what selector to wait for — you'd need a custom rule per site. So the default is networkidle, sometimes augmented with a fallback timeout (cap at 10-15 seconds).

Some pages render content but keep firing analytics or polling requests. networkidle never fires. The fix is a hard timeout: "wait for networkidle, but if it doesn't happen in 10 seconds, capture whatever's there." This is what production scrapers like Firecrawl and our pipeline implement.
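
In Playwright, that fallback is a try/except around the wait (the timeout value is illustrative):

from playwright.sync_api import TimeoutError as PlaywrightTimeoutError

try:
    page.wait_for_load_state("networkidle", timeout=10_000)  # 10s cap
except PlaywrightTimeoutError:
    pass  # analytics/polling kept the network busy; capture what rendered
html = page.content()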

The infinite-scroll problem

Some pages (Reddit, Twitter, infinite-scroll feeds) only render the top portion on initial load. Below-the-fold content loads when you scroll. To capture it, you have to programmatically scroll:

previous_height = 0
while True:
    current_height = page.evaluate("document.body.scrollHeight")
    if current_height == previous_height:
        break  # page stopped growing; all content loaded
    page.evaluate(f"window.scrollTo(0, {current_height})")  # jump to the bottom
    page.wait_for_timeout(1000)  # give lazy-loaded content time to arrive
    previous_height = current_height

This is expensive (every scroll triggers more loading) and can run forever if the feed is genuinely infinite. Production tools cap the number of scrolls or the total wait time. For converter use cases, capturing the first ~3 scrolls is usually enough.
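
A capped version of the same loop, using the ~3-scroll rule of thumb above (the cap and wait are illustrative):

MAX_SCROLLS = 3
for _ in range(MAX_SCROLLS):
    height = page.evaluate("document.body.scrollHeight")
    page.evaluate(f"window.scrollTo(0, {height})")
    page.wait_for_timeout(1000)
    if page.evaluate("document.body.scrollHeight") == height:
        break  # nothing new loaded; stop early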

Shadow DOM and iframes

Two more JS rendering challenges:

Shadow DOM

Web components encapsulate their internal DOM in a shadow root. Standard DOM queries don't traverse into shadow roots — you have to explicitly pierce them. Reddit's new UI is the canonical example; many corporate dashboards do this too.

# Naive: misses content inside open shadow roots
page.evaluate("document.querySelector('article').textContent")

# Correct: pierce shadow roots, walking both shadow and light trees
page.evaluate("""
  () => {
    function deepText(node) {
      if (node.nodeType === Node.TEXT_NODE) return node.textContent;
      let text = '';
      if (node.shadowRoot) text += deepText(node.shadowRoot);       // shadow tree
      for (const child of node.childNodes) text += deepText(child); // light tree
      return text;
    }
    return deepText(document.body);
  }
""")

Most converters miss this and produce empty output for shadow-DOM-heavy pages. Production services handle it.

Iframes

Embedded iframes load separate documents that the parent page's converter can't naively access. Cross-origin iframes are sandboxed; same-origin iframes are accessible but require explicit traversal. For URL-to-Markdown, iframes are usually noise (ads, embeds) and best ignored — but YouTube embeds, CodeSandbox embeds, and similar contain content the user might want.
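
A headless browser has more access here than in-page JavaScript: Playwright exposes every iframe as a Frame object you can read directly. A sketch of pulling embed content:

# Each iframe is a Frame; the converter can read documents that the
# parent page's own scripts could not touch
for frame in page.frames:
    if frame is page.main_frame:
        continue
    print(frame.url)              # decide: ad/tracker noise vs. wanted embed
    embed_html = frame.content()  # post-render HTML of the embedded document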

Performance: what JS rendering costs

Spinning up a headless browser is expensive. Approximate costs per request on a single-CPU server:

| Operation | Time | Memory |
|---|---|---|
| HTTP fetch (no JS) | 50-500ms | ~5MB |
| Headless browser cold start | 500-2000ms | 200-500MB |
| Browser warm + page load | 2-8s for typical SPA | 200-500MB |
| + wait for networkidle | +1-5s | same |
| + scroll-to-bottom | +5-30s | same |

For a single URL: 3-10 seconds end-to-end is normal for JS-rendered pages. For batch jobs, browser pooling (keep N browsers warm, dispatch URLs across them) brings per-page time down to ~2-5 seconds amortized.

Memory is the binding constraint at scale. Each browser instance holds 200-500MB. A server with 16GB RAM can comfortably run 20-30 concurrent browsers. Beyond that, you scale horizontally — every production scraping service runs browser farms.
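
A minimal pooling sketch with Playwright's async API: one browser process, N worker pages pulling URLs from a shared queue. Pool size and timeouts are illustrative.

import asyncio
from playwright.async_api import async_playwright

async def render_all(urls, pool_size=5):
    async with async_playwright() as p:
        browser = await p.chromium.launch()
        queue = asyncio.Queue()
        for url in urls:
            queue.put_nowait(url)
        results = {}

        async def worker():
            page = await browser.new_page()  # one page per worker, reused across URLs
            while True:
                try:
                    url = queue.get_nowait()
                except asyncio.QueueEmpty:
                    break
                try:
                    await page.goto(url, timeout=15_000)
                    await page.wait_for_load_state("networkidle", timeout=10_000)
                except Exception:
                    pass  # hard timeout: keep whatever rendered
                results[url] = await page.content()
            await page.close()

        await asyncio.gather(*(worker() for _ in range(pool_size)))
        await browser.close()
        return results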

Cost implications

The cost difference between static fetch and JS rendering is roughly 100x in compute resources. A static URL fetch costs essentially nothing — a few CPU milliseconds and negligible RAM. A JS-rendered conversion ties up a browser for several seconds with hundreds of MB of RAM.

This is why URL-to-Markdown services price differently for static vs JS-rendered. Some services (Jina Reader) include JS rendering in the free tier; others charge per-request based on whether rendering was needed. Production-grade pipelines let you pick:

- Static-only: milliseconds and near-zero cost, but empty output on CSR pages.
- Always-render: reliable on every page, but pays the full browser cost on every URL.
- Cost-aware fallback: fetch statically first, escalate to a headless browser only when the static result looks empty.

The MDisBetter web tool uses the cost-aware fallback by default — try static first, escalate to a headless browser only if the static result looks empty. If you're rolling your own with Playwright + Trafilatura, the same heuristic is a few lines: extract once with Trafilatura on the static HTML; if it returns very little content, re-render with Playwright and re-extract.
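
A sketch of that heuristic, assuming a recent Trafilatura with Markdown output (the 200-character threshold is illustrative):

import trafilatura
from playwright.sync_api import sync_playwright

def url_to_markdown(url: str, min_chars: int = 200) -> str:
    # Step 1: cheap static fetch + extraction
    static_html = trafilatura.fetch_url(url)
    md = trafilatura.extract(static_html, output_format="markdown") if static_html else None
    if md and len(md) >= min_chars:
        return md  # static HTML had real content; no browser needed

    # Step 2: escalate to headless Chromium, then re-extract
    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page()
        page.goto(url)
        try:
            page.wait_for_load_state("networkidle", timeout=10_000)
        except Exception:
            pass  # hard timeout: take whatever rendered
        rendered_html = page.content()
        browser.close()
    return trafilatura.extract(rendered_html, output_format="markdown") or ""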

Anti-bot defenses

Some sites actively block headless browsers. Common defenses:

- JavaScript challenges (Cloudflare and similar) that must execute before any content is served.
- Fingerprinting that spots headless browsers via missing features and default automation flags (e.g., navigator.webdriver).
- IP-based rate limiting and blocking of datacenter IP ranges.
- CAPTCHAs triggered by suspicious request patterns.

Production scrapers counter with stealth plugins (puppeteer-extra-plugin-stealth, playwright-stealth), residential proxies, and request pacing. URL-to-Markdown services that work on protected sites are doing real engineering here; free tools that skip it fail silently on those sites.
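
A sketch with the playwright-stealth Python package, assuming its classic stealth_sync entry point (newer releases may expose a different API):

from playwright.sync_api import sync_playwright
from playwright_stealth import stealth_sync  # pip install playwright-stealth

with sync_playwright() as p:
    browser = p.chromium.launch()
    page = browser.new_page()
    stealth_sync(page)  # patch common headless fingerprints before navigating
    page.goto("https://example.com")
    html = page.content()
    browser.close()

This masks default automation tells; it is not a guarantee against sites that actively defend.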

Worth noting: respecting robots.txt and rate limits is the right default. Aggressive scraping invites blocking and breaks the implicit web-citizen contract.

What this means for picking a converter

If your URLs are JS-heavy, the converter you pick has to ship a headless browser layer. Local OSS libraries that work statically (html2text, Pandoc, Trafilatura on the static HTML) don't render JS — but you can pair them with a headless browser yourself. Two viable paths:

Path A: Self-roll with Playwright + Trafilatura

For full programmatic control, install Playwright + Trafilatura, render each URL in a headless Chromium, then run extraction on the post-JS HTML. ~30 lines of Python. Yours to tune (wait strategies, shadow-DOM piercing, scroll-to-bottom, stealth plugins). Free. See the SPA conversion guide for the runnable recipe.

Path B: Hosted services

For ad-hoc URLs you don't want to script, hosted services do the headless rendering server-side. The MDisBetter URL-to-Markdown web tool auto-detects when JS rendering is needed and falls back to headless Chromium without any toggle from you. Firecrawl and Jina Reader are programmatic alternatives if you need API-level access (MDisBetter is web-tool only). For full-site crawls of difficult, anti-bot-protected sites, Firecrawl invests most heavily in JS-rendering robustness — see MDisBetter vs Firecrawl for the comparison.

For most users, the right call is the MDisBetter web tool for one-off URLs (zero setup) and the Playwright + Trafilatura recipe for batch (full control, free).

Frequently asked questions

Can I run Playwright myself instead of using a service?
Absolutely. Playwright is open source and free. The downsides: you operate the infrastructure (memory, CPU, browser updates), you handle the wait-strategy tuning, you build the anti-bot bypasses yourself. For a personal pipeline at low volume, self-hosting Playwright + html2text is a perfectly valid choice. For production scale, hosted services are usually cheaper end-to-end.
Is there a way to know if a page needs JS rendering before fetching?
Indirectly. You can fetch statically first, check if the HTML contains substantial text (e.g., > 500 visible characters in <body>), and only escalate to JS rendering if it doesn't. This is the cost-aware fallback most services implement. Pure detection without fetching isn't possible — you have to try.
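
A sketch of that check, with the 500-character threshold from the example above (not a standard value):

from bs4 import BeautifulSoup

def needs_js_rendering(static_html: str, threshold: int = 500) -> bool:
    body = BeautifulSoup(static_html, "html.parser").body
    visible = body.get_text(strip=True) if body else ""
    return len(visible) < threshold  # too little text: escalate to a browser
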
Why does Cloudflare block my scraper but not my browser?
Cloudflare runs JavaScript challenges that detect headless browsers via fingerprinting (missing browser features, default automation flags, request patterns). Real browsers pass; default-config Playwright/Puppeteer fail. Stealth plugins help but it's an arms race. For sites that actively defend, residential proxies plus stealth configurations are usually needed.