Converting JavaScript-Heavy Pages to Markdown: Technical Deep Dive
For half of the web, fetching a URL and converting the HTML works fine. For the other half — every React, Vue, Svelte, Angular, or Next.js app — the raw HTML is a near-empty shell. The actual content is rendered by JavaScript after page load. Converting these pages to Markdown requires running a real browser, waiting for the JS to do its work, then extracting the post-render DOM. Here's the technical detail of how that works and what it costs.
The two-world problem
Web pages today live in one of two worlds:
- Server-rendered (SSR): the HTML returned from the server contains the actual content. Examples: most blogs, Wikipedia, traditional news sites, GitHub README pages. Converting is straightforward — fetch, parse, walk.
- Client-rendered (CSR/SPA): the HTML returned from the server is a shell containing a script tag that loads a JavaScript bundle. The bundle then fetches data and renders content into the DOM client-side. Examples: react.dev, Twitter, Reddit's new UI, Notion published pages, most modern web apps.
For CSR pages, fetching the URL with curl or a basic HTTP client returns something like:
```html
<!doctype html>
<html>
  <head><title>App</title></head>
  <body>
    <div id="root"></div>
    <script src="/bundle.js"></script>
  </body>
</html>
```

The actual content lives inside `#root` after the bundle runs. Without running JS, your converter sees an empty div and emits empty Markdown.
Headless browsers: the only solution
To get the post-render content, you need to run a real browser. "Headless" means without a visible window — same browser engine, controlled programmatically, no GUI. Two dominant tools:
- Playwright (Microsoft): cross-browser (Chromium, Firefox, WebKit), modern API, strong wait primitives
- Puppeteer (Google): Chromium-only, mature, lower-level API
Both spawn a real browser process, navigate to the URL, execute JS, and expose the post-render DOM via page.content() or similar. With Playwright's sync Python API:

```python
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch()
    page = browser.new_page()
    page.goto("https://react.dev/learn")
    page.wait_for_load_state("networkidle")
    html = page.content()  # post-render HTML
    browser.close()

# Now feed `html` into your HTML-to-Markdown pipeline
```

This is the core of every modern URL-to-Markdown service. The differentiation lives in the wait strategy and the cleanup afterwards.
Wait conditions: knowing when content is ready
Headless browsers expose multiple events you can wait for. Picking the right one matters — too aggressive and you capture an empty page; too patient and you waste time.
| Wait strategy | What it means | When to use |
|---|---|---|
| load | DOMContentLoaded + all stylesheets/images | Almost never (too early for SPAs) |
| domcontentloaded | Initial HTML parsed | Almost never for SPAs |
| networkidle | No network requests for 500ms | Default for SPAs; works for most pages |
| wait_for_selector | Specific element appears | Best when you know what to look for |
| wait_for_function | Custom JS predicate | Complex cases (e.g., "text length > 1000") |
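The last two strategies need a concrete target or predicate per site. A minimal Playwright sketch, where the selector and the 1000-character threshold are illustrative:

```python
# Wait until a known element exists (site-specific rule):
page.wait_for_selector("article h1", timeout=10_000)

# Or wait until a custom predicate holds, e.g. enough text has rendered:
page.wait_for_function("document.body.innerText.length > 1000", timeout=10_000)
```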
For a generic URL-to-Markdown service, you can't know in advance what selector to wait for — you'd need a custom rule per site. So the default is networkidle, sometimes augmented with a fallback timeout (cap at 10-15 seconds).
Some pages render content but keep firing analytics or polling requests, so networkidle never fires. The fix is a hard timeout: wait for networkidle, but if it doesn't arrive within 10 seconds, capture whatever's there. This is what production scrapers like Firecrawl and our pipeline implement.
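A minimal sketch of that fallback with Playwright's sync API; the helper name and the 10-second cap are illustrative:

```python
from playwright.sync_api import TimeoutError as PlaywrightTimeoutError

def capture_post_render_html(page, url, idle_timeout_ms=10_000):
    # Navigate, then wait for networkidle; if the page keeps polling and
    # never goes quiet, capture whatever has rendered by the deadline.
    page.goto(url, wait_until="domcontentloaded")
    try:
        page.wait_for_load_state("networkidle", timeout=idle_timeout_ms)
    except PlaywrightTimeoutError:
        pass  # hard cap hit; take the DOM as-is
    return page.content()
```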
The infinite-scroll problem
Some pages (Reddit, Twitter, infinite-scroll feeds) only render the top portion on initial load. Below-the-fold content loads when you scroll. To capture it, you have to programmatically scroll:
```python
previous_height = 0
while True:
    current_height = page.evaluate("document.body.scrollHeight")
    if current_height == previous_height:
        break  # page stopped growing; nothing more to load
    page.evaluate(f"window.scrollTo(0, {current_height})")
    page.wait_for_timeout(1000)  # give lazy-loaded content time to arrive
    previous_height = current_height
```

This is expensive (every scroll triggers more loading) and can run forever if the feed is genuinely infinite. Production tools cap the number of scrolls or the total wait time. For converter use cases, capturing the first ~3 scrolls is usually enough.
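The same loop with those caps applied, assuming a three-scroll limit and a 20-second budget:

```python
import time

MAX_SCROLLS = 3      # enough for most converter use cases
TIME_BUDGET_S = 20   # overall cap so a truly infinite feed can't run away

deadline = time.monotonic() + TIME_BUDGET_S
previous_height = 0
for _ in range(MAX_SCROLLS):
    if time.monotonic() > deadline:
        break
    current_height = page.evaluate("document.body.scrollHeight")
    if current_height == previous_height:
        break
    page.evaluate(f"window.scrollTo(0, {current_height})")
    page.wait_for_timeout(1000)
    previous_height = current_height
```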
Shadow DOM and iframes
Two more JS rendering challenges:
Shadow DOM
Web components encapsulate their internal DOM in a shadow root. Standard DOM queries don't traverse into shadow roots — you have to explicitly pierce them. Reddit's new UI is the canonical example; many corporate dashboards do this too.
```python
# Naive: misses content inside web components
page.locator("article").text_content()

# Correct: pierce shadow roots recursively
page.evaluate("""
    () => {
      function deepText(node) {
        if (node.nodeType === Node.TEXT_NODE) return node.textContent;
        const root = node.shadowRoot || node;  // descend into the shadow tree if present
        return Array.from(root.childNodes).map(deepText).join('');
      }
      return deepText(document.body);
    }
""")
```

Most converters miss this and produce empty output for shadow-DOM-heavy pages. Production services handle it.
Iframes
Embedded iframes load separate documents that the parent page's converter can't naively access. Cross-origin iframes are sandboxed; same-origin iframes are accessible but require explicit traversal. For URL-to-Markdown, iframes are usually noise (ads, embeds) and best ignored — but YouTube embeds, CodeSandbox embeds, and similar contain content the user might want.
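If you do want embed content, Playwright exposes every frame on a page; a minimal sketch (the helper name and frame-boundary marker are illustrative) that concatenates whatever frames it can read and skips the rest:

```python
def collect_frames_html(page):
    # Gather HTML from the main frame and any child frames that are still
    # attached and readable; everything else is treated as noise.
    parts = []
    for frame in page.frames:  # main frame first, then child frames
        try:
            parts.append(frame.content())
        except Exception:
            continue
    return "\n<!-- frame boundary -->\n".join(parts)
```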
Performance: what JS rendering costs
Spinning up a headless browser is expensive. Approximate costs per request on a single-CPU server:
| Operation | Time | Memory |
|---|---|---|
| HTTP fetch (no JS) | 50-500ms | ~5MB |
| Headless browser cold start | 500-2000ms | 200-500MB |
| Browser warm + page load | 2-8s for typical SPA | 200-500MB |
| + wait for networkidle | +1-5s | same |
| + scroll-to-bottom | +5-30s | same |
For a single URL: 3-10 seconds end-to-end is normal for JS-rendered pages. For batch jobs, browser pooling (keep N browsers warm, dispatch URLs across them) brings per-page time down to ~2-5 seconds amortized.
Memory is the binding constraint at scale. Each browser instance holds 200-500MB. A server with 16GB RAM can comfortably run 20-30 concurrent browsers. Beyond that, you scale horizontally — every production scraping service runs browser farms.
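A minimal pooling sketch with Playwright's async API, assuming one browser process with a bounded number of concurrent pages (real browser farms pool whole browser processes across machines):

```python
import asyncio
from playwright.async_api import async_playwright

async def convert_batch(urls, pool_size=4):
    async with async_playwright() as p:
        browser = await p.chromium.launch()
        sem = asyncio.Semaphore(pool_size)  # at most pool_size pages in flight

        async def render(url):
            async with sem:
                page = await browser.new_page()
                try:
                    await page.goto(url, wait_until="networkidle", timeout=15_000)
                    return await page.content()
                finally:
                    await page.close()

        results = await asyncio.gather(*(render(u) for u in urls))
        await browser.close()
        return results
```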
Cost implications
The cost difference between static fetch and JS rendering is roughly 100x in compute resources. A static URL fetch costs essentially nothing — a few CPU milliseconds and negligible RAM. A JS-rendered conversion ties up a browser for several seconds with hundreds of MB of RAM.
This is why URL-to-Markdown services price differently for static vs JS-rendered. Some services (Jina Reader) include JS rendering in the free tier; others charge per-request based on whether rendering was needed. Production-grade pipelines let you pick:
- Always render JS: highest quality, highest cost
- Render only if static fetch returns < N characters of content: cost-aware fallback
- Never render JS: cheapest, fails on SPAs
The MDisBetter web tool uses the cost-aware fallback by default — try static first, escalate to a headless browser only if the static result looks empty. If you're rolling your own with Playwright + Trafilatura, the same heuristic is a few lines: extract once with Trafilatura on the static HTML; if it returns very little content, re-render with Playwright and re-extract.
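A sketch of that heuristic, assuming a 500-character threshold (tune it for your corpus):

```python
import trafilatura
from playwright.sync_api import sync_playwright

MIN_CHARS = 500  # below this, assume the page is a client-rendered shell

def url_to_text(url):
    # Cheap path first: static fetch + extraction.
    static_html = trafilatura.fetch_url(url)
    text = trafilatura.extract(static_html) if static_html else None
    if text and len(text) >= MIN_CHARS:
        return text
    # Escalate: render with headless Chromium, then re-extract.
    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page()
        page.goto(url, wait_until="networkidle")
        rendered_html = page.content()
        browser.close()
    return trafilatura.extract(rendered_html)
```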
Anti-bot defenses
Some sites actively block headless browsers. Common defenses:
- User-agent sniffing: reject requests with default Playwright/Puppeteer UA strings
- JavaScript challenges: Cloudflare, Akamai, DataDome serve interstitial pages requiring real-browser fingerprints
- Behavioral detection: mouse-movement absence, suspicious request timing
- CAPTCHA: ultimate fallback when other defenses suspect a bot
Production scrapers counter with stealth plugins (puppeteer-extra-plugin-stealth, playwright-stealth), residential proxies, and request pacing. URL-to-Markdown services that work on protected sites are doing real engineering here; free tools that skip it fail silently on these sites.
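As a small illustration of countering the first defense, Playwright lets you replace the default user agent when creating a browser context (the UA string below is just an example of a current desktop Chrome string; JS challenges and behavioral detection need stealth plugins and proxies on top):

```python
from playwright.sync_api import sync_playwright

UA = ("Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
      "(KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36")

with sync_playwright() as p:
    browser = p.chromium.launch()
    context = browser.new_context(user_agent=UA, viewport={"width": 1280, "height": 800})
    page = context.new_page()
    page.goto("https://example.com", wait_until="networkidle")
    html = page.content()
    browser.close()
```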
Worth noting: respecting robots.txt and rate limits is the right default. Aggressive scraping invites blocking and breaks the implicit web-citizen contract.
What this means for picking a converter
If your URLs are JS-heavy, the converter you pick has to ship a headless browser layer. Local OSS libraries that work statically (html2text, Pandoc, Trafilatura on the static HTML) don't render JS — but you can pair them with a headless browser yourself. Two viable paths:
Path A: Self-roll with Playwright + Trafilatura
For full programmatic control, install Playwright + Trafilatura, render each URL in a headless Chromium, then run extraction on the post-JS HTML. ~30 lines of Python. Yours to tune (wait strategies, shadow-DOM piercing, scroll-to-bottom, stealth plugins). Free. See the SPA conversion guide for the runnable recipe.
Path B: Hosted services
For ad-hoc URLs you don't want to script, hosted services do the headless rendering server-side. The MDisBetter URL-to-Markdown web tool auto-detects when JS rendering is needed and falls back to headless Chromium without any toggle from you. Firecrawl and Jina Reader are programmatic alternatives if you need API-level access (MDisBetter is web-tool only). For full-site crawls of difficult, anti-bot-protected sites, Firecrawl invests most heavily in JS-rendering robustness — see MDisBetter vs Firecrawl for the comparison.
For most users, the right call is the MDisBetter web tool for one-off URLs (zero setup) and the Playwright + Trafilatura recipe for batch (full control, free).