Convert JavaScript-Rendered Pages to Markdown (SPA Guide)
You convert a URL to Markdown and the output is empty. Or it's a one-line message: "You need to enable JavaScript to view this site." Or just a navbar with no body content. The page works perfectly when you open it in your browser, but every conversion tool returns nothing useful. The cause is one of the most common shapes of modern web app — the Single-Page Application (SPA) — and the fix is enabling JavaScript rendering. Here is what's happening, why it matters, and the two practical paths: the MDisBetter web tool (which auto-handles this) for ad-hoc URLs, and a Playwright-based OSS pipeline for batch.
Why static fetch fails on SPAs
When a normal converter fetches https://example-spa.com/article/42, it issues an HTTP GET and parses the response body as HTML. For a traditional server-rendered site (WordPress, Django templates, plain HTML), the response body contains the full article content directly: headings, paragraphs, images, all there in the HTML.
For an SPA built with React, Vue, Angular, Svelte, or similar, the response body is essentially this:
```html
<!DOCTYPE html>
<html>
  <head><title>Loading...</title></head>
  <body>
    <div id="root"></div>
    <script src="/static/js/main.abc123.js"></script>
  </body>
</html>
```

That's it. There is no article in this HTML. The article gets rendered when the JavaScript bundle (`main.abc123.js`) loads, executes, fetches the article data from an API endpoint (often `/api/articles/42`), and dynamically inserts the rendered HTML into the `<div id="root">`. A normal HTTP fetch never sees any of that; it sees only the empty shell.
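You can verify this with a plain HTTP fetch. A minimal sketch using the requests library against the hypothetical URL from above:

```python
import requests

# A static GET returns only the shell; none of the article text is present.
resp = requests.get('https://example-spa.com/article/42')
print(len(resp.text))  # a few hundred bytes of boilerplate
print(resp.text)       # the empty shell above, not the article
```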
What you see vs. what the fetcher sees
In your browser:
1. Browser requests the URL → receives the empty shell HTML
2. Browser downloads and runs `main.abc123.js`
3. The JS makes API calls and populates the DOM
4. You see the rendered article
A static fetcher stops at step 1. Without a JavaScript engine, steps 2-4 never happen, and the fetcher's view of the page is the empty shell.
The fix: headless browser rendering
The solution is to do exactly what a real browser does: load the page, wait for the JavaScript to execute, then capture the resulting DOM. Tools that do this — Playwright, Puppeteer, Selenium — drive a real browser engine (Chromium, Firefox) in the background. The fetched "HTML" is the post-JS DOM, which contains the actual rendered content.
The MDisBetter web tool runs that pipeline server-side automatically when needed. The conversion looks like:
- Load URL in headless Chromium
- Wait until network is idle (no pending API calls)
- Wait an extra ~500ms for any final renders
- Capture the post-JS DOM
- Run readability extraction on the captured DOM
- Convert to Markdown
Slower than static fetch (5-15 seconds vs 0.5-2 seconds) but works on essentially any SPA.
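The same pipeline is straightforward to reproduce locally. A minimal sketch using Playwright's sync API plus Trafilatura for the readability step (both reappear in the OSS path below); the 500 ms settle delay mirrors the list above:

```python
from playwright.sync_api import sync_playwright
import trafilatura

def render_and_convert(url):
    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page()
        page.goto(url, wait_until='networkidle', timeout=30000)
        page.wait_for_timeout(500)  # settle delay for any final renders
        html = page.content()       # the post-JS DOM, not the empty shell
        browser.close()
    # readability extraction + Markdown conversion in one call
    return trafilatura.extract(html, output_format='markdown')
```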
Path 1: the MDisBetter web tool (ad-hoc URLs)
For one-off conversions of SPA pages — a single React-based docs page, a Notion public page, a Stoplight API reference — open /convert/url-to-markdown, paste the URL, click Convert. The converter does a fast static fetch first; if the resulting body has fewer than ~500 characters of meaningful text, it transparently retries with headless browser rendering. You don't have to think about it.
This handles 90%+ of SPA cases correctly with no configuration. The downside: the auto-fallback adds 5-10 seconds when triggered, and the web tool is one-URL-at-a-time. For batches, use Path 2.
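If you want the same static-first, render-on-fallback behavior in your own code, the heuristic is easy to replicate. A sketch, assuming the render_and_convert() helper from the previous snippet and a 500-character threshold (both illustrative, not MDisBetter's exact internals):

```python
import requests
import trafilatura

def convert_url(url, min_chars=500):
    # Fast path: static fetch plus readability extraction.
    html = requests.get(url, timeout=10).text
    md = trafilatura.extract(html, output_format='markdown')
    if md and len(md) >= min_chars:
        return md
    # Too little meaningful text: retry with headless rendering.
    return render_and_convert(url)
```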
Path 2: Playwright + html2text for batch (OSS)
MDisBetter doesn't currently expose a programmatic API or CLI for URL-to-Markdown. For batch conversion of 50+ SPA pages, the right answer is a small Python script using Playwright (drives real Chromium) plus a Markdown serializer (html2text or markdownify). Both are MIT-licensed and free.
Install once:
```bash
pip install playwright html2text
python -m playwright install chromium
```

Minimal converter:
```python
import asyncio
from pathlib import Path
import re

from playwright.async_api import async_playwright
import html2text

OUT = Path('./out')
OUT.mkdir(exist_ok=True)
CONCURRENCY = 5  # browser pages are heavier than HTTP requests

h = html2text.HTML2Text()
h.body_width = 0
h.ignore_images = False
h.ignore_links = False

def url_to_path(url):
    slug = re.sub(r'[^A-Za-z0-9]+', '_',
                  url.replace('https://', '').replace('http://', ''))[:100]
    return OUT / f'{slug}.md'

async def convert(browser, url, sem):
    out = url_to_path(url)
    if out.exists():
        return
    async with sem:
        ctx = await browser.new_context(
            user_agent='Mozilla/5.0 (compatible; SPA-grabber/1.0)'
        )
        page = await ctx.new_page()
        try:
            await page.goto(url, wait_until='networkidle', timeout=30000)
            html = await page.content()
        finally:
            await ctx.close()
    md = h.handle(html)
    out.write_text(md, encoding='utf-8')

async def main(urls):
    sem = asyncio.Semaphore(CONCURRENCY)
    async with async_playwright() as p:
        browser = await p.chromium.launch()
        await asyncio.gather(*[convert(browser, u, sem) for u in urls])
        await browser.close()

urls = Path('urls.txt').read_text().splitlines()
asyncio.run(main([u.strip() for u in urls if u.strip()]))
```

~30 lines, handles the common SPA shapes, and runs entirely on your laptop: 100 URLs in 5-10 minutes.
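Usage: put one URL per line in urls.txt next to the script and run it with Python. Re-runs are incremental because the out.exists() check skips anything already converted, so an interrupted batch can simply be restarted.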
Adding readability extraction (better quality)
html2text on the full DOM keeps menus, footers, and sidebars. For RAG-quality output, add a readability pass between Playwright and html2text. Trafilatura works on the post-JS HTML the same way it works on static HTML:
```python
import trafilatura  # pip install trafilatura

# replace the html2text step with:
md = trafilatura.extract(
    html,
    output_format='markdown',
    include_links=True,
    include_tables=True,
    favor_precision=True,
)
```

You get clean article-only Markdown without the surrounding chrome.
Adding wait conditions
Some SPAs render in stages — the chrome appears fast, the actual content loads later. Three Playwright wait strategies:
```python
# 1. Wait for network idle (default in the script above): catches most cases
await page.goto(url, wait_until='networkidle', timeout=30000)

# 2. Wait for a specific selector: most reliable when you know the page shape
await page.goto(url, wait_until='domcontentloaded')
await page.wait_for_selector('article.content', timeout=15000)

# 3. Wait a fixed delay: last resort for poorly-instrumented apps
await page.goto(url, wait_until='domcontentloaded')
await page.wait_for_timeout(3000)
```

Real-world examples
Examples that NEED JS rendering
- Netlify-hosted React apps. Most modern marketing sites and SaaS dashboards; `create-react-app` defaults produce this shape.
- Vue/Nuxt SPAs in client-rendered mode. (Server-rendered Nuxt is fine; check the source HTML.)
- Angular apps. Almost always client-rendered.
- Single-page docs sites. Examples: many startup product docs that use Stoplight, ReadMe, or in-house React-based doc systems.
- Forum/community platforms. Discord pages, Discourse forum threads (some), Reddit's new design (the old `old.reddit.com` is server-rendered).
- Twitter/X. Entirely client-rendered.
Examples that DO NOT need JS rendering
- Wikipedia. Server-rendered MediaWiki. Static fetch is perfect.
- Most blog platforms. WordPress, Ghost, Hugo, Jekyll, Eleventy all server-render by default.
- Many docs sites with SSG. Stripe docs, FastAPI docs, Django docs, Python docs are statically generated. Fast static fetch works.
- News sites. NYT, Guardian, BBC, etc. all server-render the article body even if the surrounding chrome is interactive.
- GitHub README pages. Server-rendered.
How to tell which mode a page needs
Open the URL in your browser. Right-click → View Source. Search for the article's first paragraph in the source. If you find it, the page is server-rendered (static fetch works). If the source has none of the article's content (just a <div id="root"> and some script tags), it's an SPA (needs JS rendering).
Faster shortcut: disable JavaScript in your browser (DevTools → Settings → Debugger → Disable JavaScript), reload the page. If the content disappears, you need JS rendering. If it stays, you don't.
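To classify a batch of URLs up front, the same test can be scripted. A rough heuristic sketch; the 500-character threshold is an assumption, echoing the fallback rule described earlier:

```python
import requests
import trafilatura

def needs_js_rendering(url):
    """Rough check: does a static fetch yield meaningful article text?"""
    html = requests.get(url, timeout=10).text
    text = trafilatura.extract(html) or ''
    return len(text) < 500  # little extractable text: likely an SPA shell
```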
Performance and resource implications
| Approach | Time per URL | Resource cost |
|---|---|---|
| Static fetch + Trafilatura | 0.5-2 seconds | Tiny — fits on any laptop |
| MDisBetter web tool (auto) | 1-12 seconds | None on your side — runs server-side |
| Playwright + Trafilatura | 3-8 seconds | ~200 MB RAM per browser context |
For 1000 SPA URLs: Playwright at concurrency=5 finishes in 10-20 minutes on a modern laptop. The MDisBetter web tool is the right call for one URL at a time but not viable for 1000 manual pastes — that's why the OSS path exists.
Edge cases JS rendering doesn't fix
Authentication-required pages
Headless rendering loads the page anonymously. If the content requires login, the rendered DOM will show the login screen, not the article. With Playwright, set the auth cookie before navigating:
```python
await ctx.add_cookies([{
    'name': 'session', 'value': '...', 'domain': '.example.com', 'path': '/',
}])
await page.goto(url)
```

The MDisBetter web tool fetches anonymously and doesn't accept session cookies; auth'd content is the OSS-script case.
Bot detection (Cloudflare, Datadome, PerimeterX)
Some sites detect headless browsers and serve a challenge page instead of content. Symptom: the JS-rendered output is the challenge page, not the article. The community `playwright-stealth` package patches some of the fingerprints these checks look for, but not all. There's rarely a clean technical workaround; the publisher is signaling they don't want automated access.
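For completeness, a hedged sketch of wiring it in, assuming the playwright-stealth package and its stealth_async helper (verify against the package's current API, which has varied across forks):

```python
from playwright_stealth import stealth_async  # pip install playwright-stealth

page = await ctx.new_page()
await stealth_async(page)  # patch common headless fingerprints before navigating
await page.goto(url, wait_until='networkidle')
```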
Lazy-loaded infinite scroll
Pages where content loads only as you scroll (Twitter, Instagram, news feeds) won't be fully captured by a single page-load. With Playwright, scroll to the bottom in a loop until the page stops growing:
```python
prev_height = 0
while True:
    height = await page.evaluate('document.body.scrollHeight')
    if height == prev_height:
        break
    prev_height = height
    await page.evaluate('window.scrollTo(0, document.body.scrollHeight)')
    await page.wait_for_timeout(1500)
```

Content behind "click to expand"
Collapsed sections that only render content on click are a hybrid case. The DOM contains the content even when collapsed (just with `display: none`), so Playwright captures it correctly. If a site lazy-loads the content only after the click, add a Playwright `page.click()` step before grabbing `page.content()`, as sketched below.
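A sketch of that click-then-capture step; the aria-expanded selector is hypothetical and must be adapted to the site's actual expand control:

```python
# Hypothetical selector: inspect the page to find the real expand control.
for button in await page.query_selector_all('button[aria-expanded="false"]'):
    await button.click()
    await page.wait_for_timeout(300)  # give lazy-loaded content time to render
html = await page.content()
```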
Custom selectors after rendering
For SPAs that wrap content in dynamic class names (e.g., article._abc123_main where the suffix changes per build), use stable attribute selectors with BeautifulSoup after Playwright:
```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'lxml')
main = soup.select_one('[data-testid="article-body"], [role="article"], main article')
if main:
    md = trafilatura.extract(str(main), output_format='markdown')
```

Most modern frameworks expose `role` or `data-*` attributes that are stable across builds. Inspect the rendered DOM in your browser DevTools to find them.
Worked example: scraping a React-based docs site
Many docs sites (Stoplight, Mintlify, ReadMe.io) are React SPAs. The script below combines the Playwright recipe with the sitemap pattern from convert documentation site to Markdown:
```python
import asyncio
from pathlib import Path
import xml.etree.ElementTree as ET

import requests
from playwright.async_api import async_playwright
import trafilatura

NS = {'sm': 'http://www.sitemaps.org/schemas/sitemap/0.9'}
SITEMAP = 'https://docs.example.com/sitemap.xml'
OUT = Path('./spa-docs')
OUT.mkdir(exist_ok=True)
SEM = asyncio.Semaphore(5)

xml = requests.get(SITEMAP).text
urls = [loc.text for loc in ET.fromstring(xml).findall('.//sm:loc', NS)]

async def convert(browser, url):
    out = OUT / (url.replace('/', '_').replace(':', '') + '.md')
    if out.exists():
        return
    async with SEM:
        ctx = await browser.new_context()
        page = await ctx.new_page()
        try:
            await page.goto(url, wait_until='networkidle', timeout=60000)
            html = await page.content()
        finally:
            await ctx.close()
    md = trafilatura.extract(html, output_format='markdown',
                             include_links=True, include_tables=True)
    if md:
        out.write_text(md, encoding='utf-8')
        print(f'OK {url}')

async def main():
    async with async_playwright() as p:
        browser = await p.chromium.launch()
        await asyncio.gather(*[convert(browser, u) for u in urls])
        await browser.close()

asyncio.run(main())
```

Concurrency=5 is intentional: each Playwright context uses ~200 MB RAM, so 5 concurrent contexts max out around 1 GB, fine for any modern laptop. Bump it higher on a beefier box.
What about PDF?
PDFs don't have a JS-vs-static distinction — they're already a self-contained file format. Conversion is dictated by the PDF's internal structure, not by any rendering decision. See PDF to Markdown. The two converters share output format, so SPA-derived Markdown and PDF-derived Markdown integrate cleanly in the same downstream consumer (vault, RAG corpus, knowledge base).
Recommendation
For one-off SPA URLs: use the MDisBetter web tool — it auto-handles JS rendering and you don't have to install anything. For batch SPA conversion (50+ URLs, recurring jobs, RAG ingestion): the Playwright + Trafilatura script above. Build it once, reuse forever — it's the same 30-50 lines whether you're doing 100 URLs or 100,000. See also how to convert any URL to Markdown for the basics and handling JavaScript-rendered pages for Markdown for additional Playwright patterns.