URL to Markdown Benchmark: 8 Tools Tested on Real Pages
Most "URL to Markdown" benchmarks pick easy pages and grade on a pass/fail axis. Real conversion happens on a long tail of awkward inputs: documentation sites with code blocks, news articles wrapped in three layers of ad scaffolding, JS-rendered SPAs that fetch nothing useful on first paint, and Reddit threads where the actual content lives behind shadow DOM and infinite scroll. We took eight tools, ran them across six representative URLs, and scored each output honestly.
Test methodology
We picked six URLs to stress different failure modes:
- Wikipedia article (en.wikipedia.org/wiki/Markdown) — long-form prose, heavy internal linking, infoboxes, references
- Stripe documentation (docs.stripe.com/api) — code blocks in many languages, deep navigation, syntax-highlighted samples
- New York Times article — paywall hint, ad blocks, related-stories cards mixed inline, image captions
- React documentation (react.dev) — JS-rendered, interactive code editors, MDX components
- GitHub README (a popular OSS project) — already Markdown rendered to HTML, badges, anchor TOCs, embedded SVGs
- Reddit thread — heavy JS, nested comment trees, vote widgets, removed posts
Each tool was scored 0-5 on each of four axes, for a maximum total of 20: Cleanliness (no junk, no nav cruft), Structure preservation (headings, lists, tables intact), JS handling (whether the actual content comes through on JS-rendered pages), and Code block formatting (fenced blocks with correct language hints).
Disclosure: we built one of the tools tested. We tried to be honest below, including where competitors win.
Tools tested
- MDisBetter URL to Markdown
- Firecrawl (firecrawl.dev) — paid, full crawl + extract platform
- Jina Reader (r.jina.ai) — free URL prefix API
- Microlink (microlink.io) — paid, scraping + screenshot platform
- MarkdownDown (urltomarkdown.com) — free hosted utility
- Browsely (browsely.ai) — paid, AI-driven scraper
- Simplescraper (simplescraper.io) — paid, no-code scraper with Markdown export
- html2text (Python library) — local, free, classic baseline
Results table
| Tool | Cleanliness | Structure | JS handling | Code blocks | Total /20 |
|---|---|---|---|---|---|
| MDisBetter | 5 | 5 | 4 | 5 | 19 |
| Firecrawl | 5 | 5 | 5 | 4 | 19 |
| Jina Reader | 4 | 4 | 4 | 4 | 16 |
| Browsely | 4 | 4 | 4 | 3 | 15 |
| Microlink | 4 | 3 | 4 | 3 | 14 |
| MarkdownDown | 3 | 3 | 2 | 3 | 11 |
| Simplescraper | 3 | 3 | 3 | 2 | 11 |
| html2text | 2 | 3 | 1 | 2 | 8 |
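The totals and ordering in the table can be recomputed from the per-axis scores. The snippet below is a minimal sketch of that arithmetic; the names `SCORES` and `rank_tools` are ours for illustration, and the numbers are copied from the table above.

```python
# Per-axis scores copied from the results table:
# (cleanliness, structure, JS handling, code blocks), each on a 0-5 scale.
SCORES = {
    "MDisBetter": (5, 5, 4, 5),
    "Firecrawl": (5, 5, 5, 4),
    "Jina Reader": (4, 4, 4, 4),
    "Microlink": (4, 3, 4, 3),
    "Browsely": (4, 4, 4, 3),
    "MarkdownDown": (3, 3, 2, 3),
    "Simplescraper": (3, 3, 3, 2),
    "html2text": (2, 3, 1, 2),
}


def rank_tools(scores):
    """Return (tool, total) pairs sorted by total, highest first."""
    totals = {tool: sum(axes) for tool, axes in scores.items()}
    # sorted() is stable, so tied tools keep their insertion order.
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


RANKING = rank_tools(SCORES)
for tool, total in RANKING:
    print(f"{tool:<14} {total}/20")
```

An unweighted sum treats all four axes as equally important. If code-block fidelity matters more for your workflow, a weighted sum would reorder the middle of the table.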
Page-by-page breakdown
Wikipedia
The easiest page in the set. All eight tools produced usable Markdown. The differentiators were how cleanly each tool handled the right-side infobox, the references section, and the citation links. MDisBetter, Firecrawl, and Jina kept the article body clean while preserving the references as a list. html2text dumped everything inline, including the navigation chrome.
Stripe API docs
This is where code-block handling shows up. Stripe ships syntax-highlighted samples in five languages per endpoint. MDisBetter and Firecrawl correctly emitted fenced code blocks with language hints (for example, `python` or `ruby` after the opening fence). Jina got the code blocks but lost the language tags. Microlink and Simplescraper flattened code into plain paragraphs, a serious downgrade for any LLM workflow consuming the output.
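One way to check language hints mechanically is to scan a tool's Markdown output for opening fences and record what follows them. The sketch below is an illustrative check, not the grading script used for this benchmark; `fence_languages` and the sample text are hypothetical, and it only handles backtick fences.

```python
import re

# An opening fence: three or more backticks, then an optional language hint.
OPEN_FENCE = re.compile(r"(`{3,})\s*([\w+#.-]*)")


def fence_languages(markdown):
    """Return the language hint of each fenced block ('' when the hint is missing)."""
    languages = []
    open_fence = None  # the backtick run that opened the current block, if any
    for line in markdown.splitlines():
        stripped = line.strip()
        if open_fence is None:
            match = OPEN_FENCE.match(stripped)
            if match:
                open_fence = match.group(1)
                languages.append(match.group(2).lower())
        elif stripped.startswith(open_fence) and not stripped.strip("`"):
            # A line made only of backticks, at least as long as the opener, closes the block.
            open_fence = None
    return languages


FENCE = "`" * 3  # built at runtime so this example does not nest literal fences
SAMPLE = "\n".join([
    "Create a charge:",
    FENCE + "python",
    "print('hello')",
    FENCE,
    FENCE,  # second block: opening fence with no language hint
    "puts 'hello'",
    FENCE,
])

LANGS = fence_languages(SAMPLE)
print(LANGS)  # ['python', '']
```

An empty string in the result corresponds to the "got the code blocks but lost the language tags" case, and an empty list on a page full of samples corresponds to code flattened into paragraphs.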
New York Times article
The hardest cleanliness test. The page is wrapped in newsletters, related-story cards, ad scaffolding, video players, and "continue reading" gates. MDisBetter, Firecrawl, and Browsely correctly identified the article body and stripped the chrome. Jina did well but left some photo captions inline in an awkward format. MarkdownDown and html2text emitted a wall of mixed content where the article body was buried under nav and ad text.
React documentation (react.dev)
This is the JS-rendered test. The page is a React SPA, and fetching the raw HTML returns near-empty markup, so tools that don't run a browser fail here. Firecrawl wins outright (its headless browser layer is mature and configurable). MDisBetter handles it well via its rendering pipeline. Jina also handles JS via its reader. html2text gets nothing useful because it never runs the JS.
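A quick way to tell whether a page needs a browser is to measure how much visible text the raw HTML carries before any JavaScript runs. The sketch below uses only the standard library; the 200-character threshold, the function names, and both sample documents are assumptions for illustration, not values used in this benchmark.

```python
from html.parser import HTMLParser


class _TextCounter(HTMLParser):
    """Counts characters of text outside script, style, and similar tags."""

    SKIP = {"script", "style", "noscript", "template"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chars = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.chars += len(data.strip())


def visible_text_length(html):
    """Return the amount of visible text in raw HTML, ignoring scripts and styles."""
    parser = _TextCounter()
    parser.feed(html)
    parser.close()
    return parser.chars


def looks_js_rendered(html, min_chars=200):
    """Heuristic: very little visible text suggests the content is injected by JS."""
    return visible_text_length(html) < min_chars


# Hypothetical samples: an empty SPA shell and a server-rendered article.
SPA_SHELL = '<html><head><script>var a = "x";</script></head><body><div id="root"></div></body></html>'
ARTICLE = "<html><body><h1>Title</h1><p>" + "word " * 100 + "</p></body></html>"

print(looks_js_rendered(SPA_SHELL))  # True
print(looks_js_rendered(ARTICLE))    # False
```

A converter could use a check like this to decide when to fall back to a headless browser, at the cost of a second fetch on pages that fail it.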
GitHub README
Pages already rendered from Markdown should be the easiest. MDisBetter and Firecrawl produced near-perfect round-trips. Jina kept all content but converted some Markdown-native elements (badges, anchor TOCs) back into HTML-ish artifacts. html2text actually does fine here because the underlying HTML is semantic.
Reddit thread
The hardest test. Reddit's JS-rendered comment tree, with shadow DOM and lazy loading, breaks most tools. Firecrawl handled it best — depth control plus aggressive waiting. MDisBetter got the post and top-level comments. Jina got the post but flattened the comment hierarchy. The non-JS tools got essentially nothing.
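To make "flattened the comment hierarchy" concrete, the sketch below renders a small comment tree as nested Markdown list items, which is what a hierarchy-preserving conversion would look like. The dictionary layout (`author`, `body`, `replies`) and the thread itself are hypothetical; this is not Reddit's data format or any tool's actual output.

```python
def render_comments(comments, depth=0):
    """Render a comment tree as nested Markdown list items, two spaces per level."""
    lines = []
    for comment in comments:
        indent = "  " * depth
        lines.append(f"{indent}- **{comment['author']}**: {comment['body']}")
        # Replies are rendered one level deeper, directly under their parent.
        lines.extend(render_comments(comment.get("replies", []), depth + 1))
    return lines


# Hypothetical thread: two top-level comments, one with a nested reply chain.
THREAD = [
    {
        "author": "alice",
        "body": "Top-level comment",
        "replies": [
            {
                "author": "bob",
                "body": "Reply to alice",
                "replies": [{"author": "carol", "body": "Reply to bob"}],
            }
        ],
    },
    {"author": "dave", "body": "Second top-level comment"},
]

NESTED = render_comments(THREAD)
print("\n".join(NESTED))
```

A flattened conversion would emit the same four lines with no indentation, which loses the information about who replied to whom.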
Honest tradeoffs
Where MDisBetter wins: multi-format breadth (URL is one of 20+ tools we ship), code-block fidelity across documentation pages, free web tool with no signup. We don't ship a programmatic API or CLI for URL-to-Markdown today, so this is a web-tool-vs-web-tool comparison — for scripted automation, see Jina Reader or Firecrawl below. Full positioning in our 2026 ranked review.
Where Firecrawl wins: full-site crawling at scale, more aggressive JS handling with depth and wait controls, better engineering for spider-style use cases. If your job is "crawl this entire docs site," Firecrawl is purpose-built. We compare them directly in MDisBetter vs Firecrawl.
Where Jina Reader wins: API simplicity (prefix any URL with r.jina.ai/), generous free tier, good baseline quality. Hard to beat for one-line developer integration. Our Jina comparison goes deeper.
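The URL-prefix pattern is short enough to sketch with the standard library. The helper names below are ours, the `https://` scheme on the prefix and the UTF-8 response are assumptions, and the fetch needs network access, so only the URL construction runs by default.

```python
from urllib.request import Request, urlopen

# Prefix named in the article; the https scheme is an assumption.
READER_PREFIX = "https://r.jina.ai/"


def reader_url(target_url):
    """Prefix a target URL with the Reader endpoint."""
    return READER_PREFIX + target_url


def fetch_markdown(target_url, timeout=30):
    """Fetch the converted page as text. Requires network access."""
    request = Request(
        reader_url(target_url),
        headers={"User-Agent": "markdown-benchmark-sketch"},  # hypothetical UA string
    )
    with urlopen(request, timeout=timeout) as response:
        # The response body is assumed to be UTF-8 text.
        return response.read().decode("utf-8", errors="replace")


if __name__ == "__main__":
    print(reader_url("https://en.wikipedia.org/wiki/Markdown"))
```

Calling `fetch_markdown("https://en.wikipedia.org/wiki/Markdown")` would return the converted text for the Wikipedia test page, subject to the service's rate limits and availability.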
Where html2text wins: local execution, zero dependency on any service, free forever. Quality is the worst of the bunch on modern web pages, but if your input is server-rendered HTML you control, it's a perfectly fine choice.
The verdict
For most users converting one or many URLs to LLM-ready Markdown, MDisBetter, Firecrawl, and Jina Reader are the three serious options. The choice between them is positioning more than quality:
- You want the cleanest output across diverse page types via a free web tool with no account: MDisBetter
- You're crawling entire sites and need depth/queue control via a programmatic API: Firecrawl
- You want a one-line URL-prefix API with a generous free tier: Jina Reader
For RAG pipelines specifically, see the URL-to-Markdown for RAG guidance page and the runnable Trafilatura-based recipe in "Scrape a website to Markdown for RAG".