URL to Markdown Benchmark: 10 Tools Tested on 30 Real Pages
Our earlier 8-tool benchmark tested six representative pages. This expanded test goes wider: 10 tools (we added Trafilatura and Readability as OSS baselines), 30 real-world URLs across five categories. The goal is to find the per-category winners — there is no single tool that wins everywhere, and being honest about that is more useful than pretending otherwise.
Test methodology
Five page categories, six URLs each, 30 URLs total:
- Documentation — Stripe API docs, FastAPI tutorial, React docs, Tailwind docs, MDN JavaScript reference, Kubernetes concepts
- News — NYT article, Reuters article, BBC News article, The Verge tech post, Hacker News story page, Bloomberg piece
- Wiki — Wikipedia (English long-form), Wiktionary entry, Fandom wiki page, MediaWiki manual page, OSDev wiki, Arch Linux wiki
- Forum — Reddit thread, Stack Overflow Q&A, Hacker News comments, GitHub Issue thread, Discourse forum thread, Discord public archive page
- SPA / JS-rendered — Vercel dashboard public page, Notion public page, Linear public roadmap, Figma community file, Vue 3 docs, Next.js docs
Each tool was scored 0-5 on five axes:
- Cleanliness — no nav, footer, ads, or sidebar cruft in the output
- Structure — headings, lists, blockquotes preserved
- JS handling — does the actual content come through on JS-rendered pages
- Code blocks — fenced blocks with correct language hints
- Tables — pipe-table rendering for tabular content
Maximum total: 25/25. Disclosure: we built one of the tools tested. Where competitors win, we say so.
Tools tested
- MDisBetter URL to Markdown — web tool, free, no signup
- Firecrawl — paid, full crawl + extract platform
- Jina Reader — free URL-prefix API
- Microlink — paid, scraping + screenshot platform
- MarkdownDown — free hosted utility
- Browsely — paid, browser extension + AI sidebar with conversion
- Simplescraper — paid, no-code scraper with Markdown export
- html2text — Python library, local, free
- Trafilatura — Python library, local, free, considered OSS gold standard
- Readability (Mozilla port) — JS library, local, free, the algorithm behind Firefox Reader View
Aggregate results
| Tool | Clean | Struct | JS | Code | Tables | Total /25 |
|---|---|---|---|---|---|---|
| Firecrawl | 5 | 5 | 5 | 4 | 5 | 24 |
| MDisBetter | 5 | 5 | 4 | 5 | 4 | 23 |
| Jina Reader | 4 | 4 | 5 | 4 | 4 | 21 |
| Trafilatura | 4 | 5 | 2 | 4 | 4 | 19 |
| Browsely | 4 | 4 | 5 | 3 | 3 | 19 |
| Microlink | 4 | 3 | 4 | 3 | 3 | 17 |
| Readability | 4 | 4 | 2 | 3 | 3 | 16 |
| MarkdownDown | 3 | 3 | 2 | 3 | 2 | 13 |
| Simplescraper | 3 | 3 | 3 | 2 | 2 | 13 |
| html2text | 2 | 3 | 1 | 2 | 2 | 10 |
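
Each total is the plain sum of the five axis scores. As a quick sanity check, the sketch below recomputes the totals and the ranking from the table; the `SCORES` dictionary and the `totals` helper are illustrative names, not part of any tool tested.

```python
# Axis scores copied from the aggregate table: (clean, struct, js, code, tables).
SCORES = {
    "Firecrawl": (5, 5, 5, 4, 5),
    "MDisBetter": (5, 5, 4, 5, 4),
    "Jina Reader": (4, 4, 5, 4, 4),
    "Trafilatura": (4, 5, 2, 4, 4),
    "Browsely": (4, 4, 5, 3, 3),
    "Microlink": (4, 3, 4, 3, 3),
    "Readability": (4, 4, 2, 3, 3),
    "MarkdownDown": (3, 3, 2, 3, 2),
    "Simplescraper": (3, 3, 3, 2, 2),
    "html2text": (2, 3, 1, 2, 2),
}


def totals(scores):
    """Return (tool, total) pairs sorted by total, highest first."""
    # sorted() is stable, so tied tools keep the order they have in the table.
    return sorted(
        ((tool, sum(axes)) for tool, axes in scores.items()),
        key=lambda pair: pair[1],
        reverse=True,
    )


for tool, total in totals(SCORES):
    print(f"{tool:<14} {total}/25")
```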
Per-category winners
| Category | Winner | Runner-up | Why |
|---|---|---|---|
| Documentation | MDisBetter | Firecrawl | Better code-block fidelity (language hints preserved across all six docs sites) |
| News | Firecrawl | MDisBetter | Most aggressive ad/related-story stripping; both excellent |
| Wiki | Trafilatura | MDisBetter | Well suited to Wikipedia-style server-rendered prose |
| Forum | Firecrawl | Browsely | Best comment-tree preservation; depth + wait controls help |
| SPA | Jina Reader | Firecrawl | Headless rendering tuned well; Jina beat Firecrawl on three of six SPAs and tied on two |
Documentation pages — MDisBetter wins
Six docs sites tested. The differentiator: code blocks. Stripe API docs ship samples in five languages per endpoint. FastAPI mixes Python with shell commands. React docs use MDX. Tailwind ships HTML samples inline.
MDisBetter and Firecrawl both correctly identified the article body and stripped chrome on all six. The split was on code blocks. MDisBetter preserved language hints (```python, ```bash, ```jsx) on 28/30 code blocks across the six pages. Firecrawl preserved them on 24/30, sometimes dropping the language tag on inline samples. Jina Reader preserved them on 22/30. html2text and Readability collapsed several code blocks into plain paragraphs.
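A count like "28/30" can be approximated mechanically on any tool's Markdown output. The sketch below is one way to do it, not the scoring script used for this benchmark: it walks the output line by line, tracks whether it is inside a fenced block, and counts the opening fences that carry a language hint. The function name `count_language_hints` is hypothetical, and the check only tells you a hint is present, not that it names the right language.

```python
import re

# A fence line: up to three spaces of indent, then 3+ backticks or tildes,
# optionally followed by an info string such as "python".
FENCE = re.compile(r"^ {0,3}(`{3,}|~{3,})\s*([^\s`]*)")


def count_language_hints(markdown: str) -> tuple[int, int]:
    """Return (blocks_with_hint, total_blocks) for fenced code blocks."""
    hinted = total = 0
    open_fence = None  # fence string of the block we are currently inside
    for line in markdown.splitlines():
        match = FENCE.match(line)
        if not match:
            continue
        fence, info = match.group(1), match.group(2)
        if open_fence is None:
            # Opening fence: count the block and note whether it has a hint.
            open_fence = fence
            total += 1
            if info:
                hinted += 1
        elif fence[0] == open_fence[0] and len(fence) >= len(open_fence) and not info:
            # Closing fence: same character, at least as long, no info string.
            open_fence = None
    return hinted, total
```

Running it over the output of each tool for the same page gives comparable hinted/total pairs; whether the hints are correct still needs a manual pass.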
Tables in API reference docs (parameter tables): all top tools preserved structure; Microlink and Simplescraper sometimes flattened them.
News pages — Firecrawl wins by a hair
The hardest cleanliness test in the set. NYT, Bloomberg, BBC all wrap articles in newsletters, related-story cards, ad scaffolding, video players, and "continue reading" gates.
Firecrawl scored highest on ad-stripping aggression — their boilerplate detection caught more edge cases (small inline newsletter signup forms, single-line ads disguised as paragraphs). MDisBetter was a close second; the difference was about three pieces of stray copy across six pages. Browsely also did well here. The non-JS tools (html2text, Readability, Trafilatura on rendered HTML) struggled — they kept too much of the page chrome.
Worth noting: NYT and Bloomberg have soft paywalls. None of the tools bypassed them; they all extracted the visible-to-anonymous portion of the article. Trying to extract behind a paywall is the user's responsibility, not the tool's.
Wiki pages — Trafilatura wins
Wikipedia, Wiktionary, MediaWiki: these are server-rendered HTML with predictable structure. Trafilatura is well suited to this kind of text-heavy, server-rendered page, and it shows: cleaner infobox handling, cleaner reference-list extraction, and cleaner internal-link preservation.
MDisBetter was strong here too — basically tied on the Wikipedia article itself. Trafilatura edged ahead on the more obscure wiki engines (Fandom, OSDev). Firecrawl scored well but their JS-handling overhead is overkill for static wiki pages.
If your job is exclusively wiki extraction at scale, Trafilatura locally is the right call. For mixed corpora, MDisBetter handles wiki well alongside everything else.
Forum pages — Firecrawl wins
The hardest category. Reddit's JS-rendered comment tree, Stack Overflow's tabbed answers, Hacker News's flat-but-deep comment threads, GitHub Issue threads with embedded code blocks and image attachments, Discourse's lazy-loaded reply trees.
Firecrawl handled the comment hierarchy best — depth control plus aggressive waiting on JS-heavy pages produced near-complete extraction on all six. MDisBetter got the post and top-level comments but flattened deeper nesting on Reddit specifically. Browsely did well, especially on Stack Overflow and GitHub Issues. Jina got the post but lost most of the comment hierarchy on Reddit.
The non-JS tools all failed on Reddit (returned essentially nothing useful) but did fine on Stack Overflow and Hacker News (server-rendered).
SPA / JS-rendered pages — Jina Reader wins
Surprising result. The SPA category was where we expected Firecrawl to dominate (their headless browser layer is mature). Instead Jina Reader edged ahead on three of six SPAs, tied on two, lost one. The Jina Reader rendering pipeline appears to have been tuned well in 2025, and it shows.
MDisBetter handled SPAs well via its rendering pipeline but scored slightly behind both Jina Reader and Firecrawl, which ranked second in this category. The non-JS tools (Trafilatura, Readability, html2text) all returned the empty SPA shell, which is useless without a browser-render step in front.
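If you run a non-JS tool in a pipeline, it helps to detect the empty-shell case before extraction and route those URLs to a rendering tool instead. The following is a rough heuristic sketch using only the Python standard library: it measures how much visible text the raw HTML contains outside script and style tags. The `looks_like_spa_shell` name and the 200-character threshold are assumptions for illustration and would need tuning on your own pages.

```python
from html.parser import HTMLParser


class VisibleText(HTMLParser):
    """Collect text that sits outside script/style/noscript/template tags."""

    SKIP = {"script", "style", "noscript", "template"}

    def __init__(self):
        super().__init__()
        self.depth = 0  # nesting level inside skipped tags
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth == 0 and data.strip():
            self.chunks.append(data.strip())


def looks_like_spa_shell(html: str, min_chars: int = 200) -> bool:
    """True when the raw HTML carries almost no visible text (assumed threshold)."""
    parser = VisibleText()
    parser.feed(html)
    parser.close()
    return len(" ".join(parser.chunks)) < min_chars
```

A page that trips this check is a candidate for a tool with a browser-render step; a page that passes can usually go straight to a local extractor.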
Honest tradeoffs
Where MDisBetter wins
Code-block fidelity in documentation sites. Free web tool with no signup or quota. Multi-format breadth: URL is one of 20+ converters in our suite. One limitation: we don't ship a programmatic API for URL-to-Markdown, so for scripted automation use Trafilatura locally or Jina Reader.
Where Firecrawl wins
Full-site crawling at scale, JS-handling depth/wait controls, news ad-stripping, forum comment-tree preservation. If your job is "crawl this entire site at scale via API," Firecrawl is purpose-built. See MDisBetter vs Firecrawl for the head-to-head.
Where Jina Reader wins
SPA handling, developer simplicity (URL-prefix API), generous free tier. The simplest possible programmatic integration for URL-to-Markdown.
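To illustrate what "URL-prefix API" means in practice: the integration is string concatenation plus an HTTP GET. The sketch below assumes the public reader endpoint is `https://r.jina.ai/` and that it returns Markdown as the response body; the helper names, the User-Agent string and the timeout are illustrative, and rate limits or API-key headers are not covered here.

```python
from urllib.request import Request, urlopen

READER_PREFIX = "https://r.jina.ai/"  # assumed public Jina Reader endpoint


def reader_url(target: str) -> str:
    """Prefix the target URL so the reader service converts that page."""
    return READER_PREFIX + target


def fetch_markdown(target: str, timeout: float = 30.0) -> str:
    """Fetch the converted page body as text (performs a network request)."""
    request = Request(reader_url(target), headers={"User-Agent": "md-benchmark/0.1"})
    with urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")
```

Calling `fetch_markdown("https://example.com/docs")` would request `https://r.jina.ai/https://example.com/docs`; check the service's current documentation for quotas before using it in a loop.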
Where Trafilatura wins
Wikipedia-shaped server-rendered prose, local execution, zero cost. The right OSS choice when you want to script extraction over hundreds of static pages and don't need JS rendering.
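A local batch over static pages might look like the sketch below. It assumes a recent Trafilatura release in which `extract` accepts `output_format="markdown"`; the `slug_for` and `convert_batch` helpers and the `markdown_out` directory are hypothetical names chosen for this example.

```python
import re
from pathlib import Path


def slug_for(url: str) -> str:
    """Turn a URL into a filesystem-safe file stem."""
    without_scheme = re.sub(r"^https?://", "", url)
    return re.sub(r"[^A-Za-z0-9]+", "-", without_scheme).strip("-").lower()


def convert_batch(urls, out_dir="markdown_out"):
    """Write one .md file per URL; return the URLs that could not be converted."""
    # Imported here so slug_for() stays usable without Trafilatura installed.
    import trafilatura

    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    failed = []
    for url in urls:
        downloaded = trafilatura.fetch_url(url)  # None when the download fails
        text = None
        if downloaded:
            # Assumes markdown output is supported by the installed version.
            text = trafilatura.extract(
                downloaded, output_format="markdown", include_tables=True
            )
        if text is None:
            failed.append(url)
            continue
        (out / f"{slug_for(url)}.md").write_text(text, encoding="utf-8")
    return failed
```

The returned list of failures is a natural input for a second pass through a tool that renders JavaScript.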
Where Browsely wins
In-browser AI workflows. The browser extension lets you convert + chat with a page from any tab without switching tools. See our Browsely comparison for when this matters.
Where html2text wins
Pure-Python, zero dependencies on a service, perfectly fine for clean server-rendered HTML you control. Quality on modern web is the worst of the bunch — but for the right input, it's the simplest tool.
Where Readability wins
The algorithm behind Firefox Reader View. Decent quality, runs locally in any JS environment, well-known and battle-tested. Same caveat as Trafilatura: no JS rendering means SPAs fail.
Picking by use case
One-off URL conversion via the browser, no signup: MDisBetter.
Scripted scrape of one URL at a time, simple API: Jina Reader.
Full-site crawl with depth control: Firecrawl.
Local batch over server-rendered prose: Trafilatura.
In-browser conversion + AI chat in the same tab: Browsely.
Pure local with no service dependency: html2text or Readability.
RAG pipeline construction: see the URL-to-Markdown for RAG guide and the runnable RAG tutorial.
What about PDFs in the same workflow?
Most serious extraction workflows mix web URLs with PDF documents (whitepapers, RFCs, design docs, financial filings). The 10 tools above are URL-only or URL-primary; for the PDF half of the workflow see best free PDF to Markdown converters and the dedicated PDF-to-Markdown tool. Many production pipelines use MDisBetter for both halves because the output Markdown composes cleanly downstream — same chunker, same embedder, same retrieval.
Notes on reproducibility
Web pages change. The Stripe API docs page tested in May 2026 may not be the same in November 2026. We tried to pick stable URLs (long-lived documentation pages, archive-friendly news articles, pinned forum threads) but absolute reproducibility is impossible against the live web.
If you want to verify these numbers, the right pattern is: pick three URLs from each category that match your typical workload, run them through the top three tools (MDisBetter, Firecrawl, Jina), and score the outputs against the five axes. Within an hour you'll have your own data on which tool fits your specific needs. Generic benchmarks (this one included) point in a direction; benchmarks on your own corpus tell you the answer.
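Once you have hand-scored each output, the bookkeeping for this pattern is small. The sketch below assumes you record one row per (tool, category, page) with a 0-5 score on each axis; it averages the per-page totals and ranks tools within each category. The axis keys and the `summarize` function are illustrative names, not an existing package.

```python
from collections import defaultdict
from statistics import mean

AXES = ("clean", "struct", "js", "code", "tables")


def summarize(rows):
    """rows: iterable of (tool, category, {axis: score 0-5}) from manual scoring.

    Returns {category: [(tool, mean_total), ...]} sorted best first.
    """
    per_tool = defaultdict(list)
    for tool, category, scores in rows:
        # One total out of 25 per scored page.
        per_tool[(category, tool)].append(sum(scores[axis] for axis in AXES))

    ranking = defaultdict(list)
    for (category, tool), page_totals in per_tool.items():
        ranking[category].append((tool, mean(page_totals)))
    for results in ranking.values():
        results.sort(key=lambda pair: pair[1], reverse=True)
    return dict(ranking)
```

With three URLs per category and three tools, that is 45 rows to score by hand, which fits the one-hour estimate if each output takes about a minute to review.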
How often will these scores change?
The leading three (MDisBetter, Firecrawl, Jina) push improvements monthly. Specific scores can shift by 1-2 points over a quarter. The broad ranking (Firecrawl, MDisBetter, Jina at the top; Trafilatura/Browsely middle; html2text last) has been stable for a year and is unlikely to change soon. We'll re-run this benchmark quarterly. See also our 2026 ranked review and the original 8-tool benchmark for additional context.