MDisBetter · 13 min read

URL to Markdown Benchmark: 10 Tools Tested on 30 Real Pages

Our earlier 8-tool benchmark tested six representative pages. This expanded test goes wider: 10 tools (we added Trafilatura and Readability as open-source baselines) run against 30 real-world URLs across five categories. The goal is to find the per-category winners; no single tool wins everywhere, and being honest about that is more useful than pretending otherwise.

Test methodology

Five page categories, six URLs each, 30 URLs total:

  1. Documentation — Stripe API docs, FastAPI tutorial, React docs, Tailwind docs, MDN JavaScript reference, Kubernetes concepts
  2. News — NYT article, Reuters article, BBC News article, The Verge tech post, Hacker News story page, Bloomberg piece
  3. Wiki — Wikipedia (English long-form), Wiktionary entry, Fandom wiki page, MediaWiki manual page, OSDev wiki, Arch Linux wiki
  4. Forum — Reddit thread, Stack Overflow Q&A, Hacker News comments, GitHub Issue thread, Discourse forum thread, Discord public archive page
  5. SPA / JS-rendered — Vercel dashboard public page, Notion public page, Linear public roadmap, Figma community file, Vue 3 docs, Next.js docs

Each tool scored 0-5 on five axes:

  1. Clean: boilerplate and ad removal
  2. Struct: preservation of headings, lists, and links
  3. JS: extraction of JS-rendered content
  4. Code: code-block fidelity, including language hints
  5. Tables: table structure preservation

Maximum total: 25/25. Disclosure: we built one of the tools tested. Where competitors win, we say so.

Tools tested

In aggregate-score order: Firecrawl, MDisBetter, Jina Reader, Trafilatura, Browsely, Microlink, Readability, MarkdownDown, Simplescraper, and html2text.

Aggregate results

| Tool | Clean | Struct | JS | Code | Tables | Total /25 |
|---|---|---|---|---|---|---|
| Firecrawl | 5 | 5 | 5 | 4 | 5 | 24 |
| MDisBetter | 5 | 5 | 4 | 5 | 4 | 23 |
| Jina Reader | 4 | 4 | 5 | 4 | 4 | 21 |
| Trafilatura | 4 | 5 | 2 | 4 | 4 | 19 |
| Browsely | 4 | 4 | 5 | 3 | 3 | 19 |
| Microlink | 4 | 3 | 4 | 3 | 3 | 17 |
| Readability | 4 | 4 | 2 | 3 | 3 | 16 |
| MarkdownDown | 3 | 3 | 2 | 3 | 2 | 13 |
| Simplescraper | 3 | 3 | 3 | 2 | 2 | 13 |
| html2text | 2 | 3 | 1 | 2 | 2 | 10 |

Per-category winners

| Category | Winner | Runner-up | Why |
|---|---|---|---|
| Documentation | MDisBetter | Firecrawl | Better code-block fidelity (language hints preserved across all six docs sites) |
| News | Firecrawl | MDisBetter | Most aggressive ad/related-story stripping; both excellent |
| Wiki | Trafilatura | MDisBetter | Designed precisely for Wikipedia-style server-rendered prose |
| Forum | Firecrawl | Browsely | Best comment-tree preservation; depth + wait controls help |
| SPA | Jina Reader | Firecrawl | Headless rendering tuned well; Jina ties Firecrawl on most SPAs |

Documentation pages — MDisBetter wins

Six docs sites tested. The differentiator: code blocks. Stripe API docs ship samples in five languages per endpoint. FastAPI mixes Python with shell commands. React docs use MDX. Tailwind ships HTML samples inline.

MDisBetter and Firecrawl both correctly identified the article body and stripped chrome on all six. The split was on code blocks. MDisBetter preserved language hints (```python, ```bash, ```jsx) on 28/30 code blocks across the six pages. Firecrawl preserved on 24/30 — they sometimes drop the language tag on inline samples. Jina Reader preserved on 22/30. html2text and Readability collapsed several code blocks into plain paragraphs.
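Language-hint preservation is easy to check mechanically. We scored by hand, but a sketch like this produces the same ratio (the regex assumes well-formed, non-indented fences; names and threshold are illustrative, not part of any tool's API):

```python
import re

# Matches a fence line and captures its optional language hint,
# e.g. a "```python" line yields "python" and a bare closing fence yields "".
FENCE = re.compile(r"^```([A-Za-z0-9+-]*)\s*$", re.MULTILINE)

def language_hint_ratio(markdown: str) -> tuple[int, int]:
    """Return (opening fences with a language hint, total opening fences)."""
    hints = FENCE.findall(markdown)
    openers = hints[::2]  # fences alternate open/close in well-formed Markdown
    tagged = sum(1 for hint in openers if hint)
    return tagged, len(openers)
```

Run each tool's output through this and you get per-page counts like the 28/30 and 24/30 figures above.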

Tables in API reference docs (parameter tables): all top tools preserved structure; Microlink and Simplescraper sometimes flattened them.

News pages — Firecrawl wins by a hair

The hardest cleanliness test in the set. NYT, Bloomberg, and BBC all wrap articles in newsletter signup prompts, related-story cards, ad scaffolding, video players, and "continue reading" gates.

Firecrawl scored highest on ad-stripping aggression — their boilerplate detection caught more edge cases (small inline newsletter signup forms, single-line ads disguised as paragraphs). MDisBetter was a close second; the difference was about three pieces of stray copy across six pages. Browsely also did well here. The non-JS tools (html2text, Readability, Trafilatura on rendered HTML) struggled — they kept too much of the page chrome.

Worth noting: NYT and Bloomberg have soft paywalls. None of the tools bypassed them; they all extracted the visible-to-anonymous portion of the article. Trying to extract behind a paywall is the user's responsibility, not the tool's.

Wiki pages — Trafilatura wins

Wikipedia, Wiktionary, MediaWiki: these are server-rendered HTML with predictable structure. Trafilatura was designed for precisely this kind of page, and it shows. Cleaner infobox handling, cleaner reference-list extraction, cleaner internal-link preservation.

MDisBetter was strong here too — basically tied on the Wikipedia article itself. Trafilatura edged ahead on the more obscure wiki engines (Fandom, OSDev). Firecrawl scored well but their JS-handling overhead is overkill for static wiki pages.

If your job is exclusively wiki extraction at scale, Trafilatura locally is the right call. For mixed corpora, MDisBetter handles wiki well alongside everything else.

Forum pages — Firecrawl wins

The hardest category. Reddit's JS-rendered comment tree, Stack Overflow's tabbed answers, Hacker News's flat-but-deep comment threads, GitHub Issue threads with embedded code blocks and image attachments, Discourse's lazy-loaded reply trees.

Firecrawl handled the comment hierarchy best — depth control plus aggressive waiting on JS-heavy pages produced near-complete extraction on all six. MDisBetter got the post and top-level comments but flattened deeper nesting on Reddit specifically. Browsely did well, especially on Stack Overflow and GitHub Issues. Jina got the post but lost most of the comment hierarchy on Reddit.

The non-JS tools all failed on Reddit (returned essentially nothing useful) but did fine on Stack Overflow and Hacker News (server-rendered).

SPA / JS-rendered pages — Jina Reader wins

A surprising result. The SPA category was where we expected Firecrawl to dominate, since their headless browser layer is mature. Instead Jina Reader edged ahead on three of the six SPAs, tied on two, and lost one. Jina's rendering pipeline appears to have been tuned well over 2025.

MDisBetter handled SPAs well via its rendering pipeline but scored slightly behind both. Firecrawl ranked second overall. The non-JS tools (Trafilatura, Readability, html2text) all returned the empty SPA shell — useless without a browser-render step in front.

Honest tradeoffs

Where MDisBetter wins

Code-block fidelity in documentation sites. Free web tool with no signup or quota. Multi-format breadth: URL is one of 20+ converters in our suite. One gap: we don't ship a programmatic API for URL-to-Markdown; for scripted automation, use Trafilatura locally or Jina Reader.

Where Firecrawl wins

Full-site crawling at scale, JS-handling depth/wait controls, news ad-stripping, forum comment-tree preservation. If your job is "crawl this entire site at scale via API," Firecrawl is purpose-built. See MDisBetter vs Firecrawl for the head-to-head.

Where Jina Reader wins

SPA handling, developer simplicity (URL-prefix API), free generous tier. The simplest possible programmatic integration for URL-to-Markdown.
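That URL-prefix pattern means integration is one string concatenation plus a GET. A stdlib-only sketch (the r.jina.ai prefix is Jina Reader's documented endpoint; the header and timeout values are our own choices):

```python
from urllib.request import Request, urlopen

READER_PREFIX = "https://r.jina.ai/"

def reader_url(target: str) -> str:
    """Jina Reader's URL-prefix API: prepend the endpoint to any page URL."""
    return READER_PREFIX + target

def fetch_markdown(target: str, timeout: float = 30.0) -> str:
    """Fetch a page as Markdown via Jina Reader (makes a network call)."""
    request = Request(reader_url(target), headers={"User-Agent": "md-benchmark-sketch"})
    with urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8")

if __name__ == "__main__":
    print(fetch_markdown("https://example.com")[:300])
```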

Where Trafilatura wins

Wikipedia-shaped server-rendered prose, local execution, zero cost. The right OSS choice when you want to script extraction over hundreds of static pages and don't need JS rendering.

Where Browsely wins

In-browser AI workflows. The browser extension lets you convert + chat with a page from any tab without switching tools. See our Browsely comparison for when this matters.

Where html2text wins

Pure-Python, zero dependencies on a service, perfectly fine for clean server-rendered HTML you control. Quality on the modern web is the worst of the bunch, but for the right input it's the simplest tool.

Where Readability wins

The algorithm behind Firefox Reader View. Decent quality, runs locally in any JS environment, well-known and battle-tested. Same caveat as Trafilatura: no JS rendering means SPAs fail.

Picking by use case

One-off URL conversion via the browser, no signup: MDisBetter.
Scripted scrape of one URL at a time, simple API: Jina Reader.
Full-site crawl with depth control: Firecrawl.
Local batch over server-rendered prose: Trafilatura.
In-browser conversion + AI chat in the same tab: Browsely.
Pure local with no service dependency: html2text or Readability.
RAG pipeline construction: see the URL-to-Markdown for RAG guide and the runnable RAG tutorial.

What about PDFs in the same workflow?

Most serious extraction workflows mix web URLs with PDF documents (whitepapers, RFCs, design docs, financial filings). The 10 tools above are URL-only or URL-primary; for the PDF half of the workflow see best free PDF to Markdown converters and the dedicated PDF-to-Markdown tool. Many production pipelines use MDisBetter for both halves because the output Markdown composes cleanly downstream — same chunker, same embedder, same retrieval.

Notes on reproducibility

Web pages change. The Stripe API docs page tested in May 2026 may not be the same in November 2026. We tried to pick stable URLs (long-lived documentation pages, archive-friendly news articles, pinned forum threads) but absolute reproducibility is impossible against the live web.

If you want to verify these numbers, the right pattern is: pick three URLs from each category that match your typical workload, run them through the top-three tools (MDisBetter, Firecrawl, Jina), and score the outputs against the five axes. Within an hour you'll have your own data on which tool fits your specific needs. Generic benchmarks (this one included) point in a direction; your-corpus benchmarks tell you the answer.
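The scoring arithmetic is mechanical once you have outputs in hand. A sketch of the rubric (axis names come from the aggregate table; the per-axis scores are whatever you assign by eye, 0-5):

```python
AXES = ("clean", "struct", "js", "code", "tables")

def total_score(scores: dict[str, int]) -> int:
    """Sum a tool's five axis scores (each 0-5) into the /25 total."""
    for axis in AXES:
        if not 0 <= scores[axis] <= 5:
            raise ValueError(f"{axis} score out of range: {scores[axis]}")
    return sum(scores[axis] for axis in AXES)

# Firecrawl's row from the aggregate table: 5 + 5 + 5 + 4 + 5 = 24.
firecrawl = {"clean": 5, "struct": 5, "js": 5, "code": 4, "tables": 5}
```

Running each of the ten rows above through `total_score` reproduces the Total /25 column.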

How often will these scores change?

The leading three (MDisBetter, Firecrawl, Jina) push improvements monthly. Specific scores can shift by 1-2 points over a quarter. The broad ranking (Firecrawl, MDisBetter, Jina at the top; Trafilatura/Browsely middle; html2text last) has been stable for a year and is unlikely to change soon. We'll re-run this benchmark quarterly. See also our 2026 ranked review and the original 8-tool benchmark for additional context.

Frequently asked questions

Why does Firecrawl beat MDisBetter on news pages but lose on documentation?
Different optimization targets. Firecrawl invests heavily in ad/boilerplate detection because that's the hardest news-page problem. MDisBetter invests heavily in code-block and language-hint preservation because that's the hardest docs-page problem. Both teams chose well for their primary use case; the per-category split reflects those choices.
Is Trafilatura really better than commercial tools on Wikipedia?
On Wikipedia specifically, yes. Trafilatura was designed against Wikipedia-shaped HTML — semantic structure, infoboxes, references, internal links. The commercial tools handle Wikipedia well but are tuned for the general web; Trafilatura is purpose-built for prose-heavy server-rendered pages. Outside that niche (SPAs, forums, anything JS-heavy), Trafilatura falls behind.
Why test 30 pages and not 100 or 1000?
Diminishing returns. Five categories with six URLs each gives enough variance within each category to spot tool-specific failure modes without weeks of manual scoring. We tried 100-URL versions; the ranking was identical to the 30-URL version with one-point shifts at most. Thirty pages is the sweet spot for honest, reproducible benchmarking with reasonable effort.