9 min read · MDisBetter

URL to Markdown Benchmark: 8 Tools Tested on Real Pages

Most "URL to Markdown" benchmarks pick easy pages and grade on a pass/fail axis. Real conversion happens on a long tail of awkward inputs: documentation sites with code blocks, news articles wrapped in three layers of ad scaffolding, JS-rendered SPAs that fetch nothing useful on first paint, and Reddit threads where the actual content lives behind shadow DOM and infinite scroll. We took eight tools, ran them across six representative URLs, and scored each output honestly.

Test methodology

Six URLs picked to stress different failure modes:

  1. Wikipedia article (en.wikipedia.org/wiki/Markdown) — long-form prose, heavy internal linking, infoboxes, references
  2. Stripe documentation (docs.stripe.com/api) — code blocks in many languages, deep navigation, syntax-highlighted samples
  3. New York Times article — paywall hint, ad blocks, related-stories cards mixed inline, image captions
  4. React documentation (react.dev) — JS-rendered, interactive code editors, MDX components
  5. GitHub README (a popular OSS project) — already Markdown rendered to HTML, badges, anchor TOCs, embedded SVGs
  6. Reddit thread — heavy JS, nested comment trees, vote widgets, removed posts

Each tool was scored 0-5 across four axes: Cleanliness (no junk, no nav cruft), Structure preservation (headings, lists, tables intact), JS handling (does the actual content come through on JS-rendered pages), Code block formatting (fenced blocks with correct language hints).
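
The four axis scores roll up to the /20 totals in the results table by plain addition; a minimal sketch of that rollup (the axis keys are our own shorthand, and the sample scores match MDisBetter's row):

```python
# Each tool gets 0-5 on four axes; the total is a simple sum (max 20).
AXES = ("cleanliness", "structure", "js_handling", "code_blocks")

def total(scores: dict) -> int:
    """Sum the four axis scores for one tool."""
    return sum(scores[axis] for axis in AXES)

mdisbetter = {"cleanliness": 5, "structure": 5, "js_handling": 4, "code_blocks": 5}
print(total(mdisbetter))  # 19
```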

Disclosure: we built one of the tools tested. We tried to be honest below, including where competitors win.

Tools tested

The eight tools, in the order they appear in the results table: MDisBetter, Firecrawl, Jina Reader, Microlink, Browsely, MarkdownDown, Simplescraper, and html2text.

Results table

| Tool | Cleanliness | Structure | JS handling | Code blocks | Total /20 |
| --- | --- | --- | --- | --- | --- |
| MDisBetter | 5 | 5 | 4 | 5 | 19 |
| Firecrawl | 5 | 5 | 5 | 4 | 19 |
| Jina Reader | 4 | 4 | 4 | 4 | 16 |
| Microlink | 4 | 3 | 4 | 3 | 14 |
| Browsely | 4 | 4 | 4 | 3 | 15 |
| MarkdownDown | 3 | 3 | 2 | 3 | 11 |
| Simplescraper | 3 | 3 | 3 | 2 | 11 |
| html2text | 2 | 3 | 1 | 2 | 8 |

Page-by-page breakdown

Wikipedia

The easiest page in the set. All eight tools produced usable Markdown. Differentiators: how cleanly each handled the right-side infobox, the references section, and the citation links. MDisBetter, Firecrawl, and Jina kept the article body clean while preserving the references as a list. html2text dumped everything inline including the navigation chrome.

Stripe API docs

This is where code-block handling shows up. Stripe ships syntax-highlighted samples in five languages per endpoint. MDisBetter and Firecrawl correctly emitted fenced code blocks with language hints (```python, ```ruby). Jina got the code blocks but lost the language tags. Microlink and Simplescraper flattened code into plain paragraphs — a serious downgrade for any LLM workflow consuming the output.
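
Language hints typically come from a `language-*` class that highlighters like Prism and highlight.js put on the `<code>` element. A stdlib-only sketch of that extraction, assuming that convention holds (real pages vary widely):

```python
from html.parser import HTMLParser

class CodeBlockExtractor(HTMLParser):
    """Turn <code class="language-x"> blocks into fenced Markdown.

    Assumes the common Prism/highlight.js `language-*` class convention;
    sites using other markup need different handling.
    """
    def __init__(self):
        super().__init__()
        self.in_code = False
        self.lang = ""
        self.buf = []
        self.blocks = []

    def handle_starttag(self, tag, attrs):
        if tag == "code":
            for cls in dict(attrs).get("class", "").split():
                if cls.startswith("language-"):
                    self.lang = cls[len("language-"):]
            self.in_code = True

    def handle_endtag(self, tag):
        if tag == "code" and self.in_code:
            self.blocks.append(f"```{self.lang}\n{''.join(self.buf)}\n```")
            self.in_code, self.lang, self.buf = False, "", []

    def handle_data(self, data):
        if self.in_code:
            self.buf.append(data)

p = CodeBlockExtractor()
p.feed('<pre><code class="language-python">print("hi")</code></pre>')
print(p.blocks[0])  # fenced block with the python hint preserved
```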

New York Times article

The hardest cleanliness test. The page is wrapped in newsletters, related-story cards, ad scaffolding, video players, and "continue reading" gates. MDisBetter, Firecrawl, and Browsely correctly identified the article body and stripped the chrome. Jina did well but kept some of the photo captions in a weird inline format. MarkdownDown and html2text emitted a wall of mixed content where the article body was buried under nav and ad text.

React documentation (react.dev)

This is the JS-rendered test. The page is a React SPA — fetching the raw HTML returns near-empty markup. Tools that don't run a browser fail here. Firecrawl wins outright (their headless browser layer is mature and configurable). MDisBetter handles it well via its rendering pipeline. Jina also handles JS via their reader. html2text gets nothing useful because it never ran the JS.
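
One way to see why the non-browser tools fail here: an SPA shell ships almost no visible text before JS runs, which even a crude heuristic can detect. A stdlib sketch (the 200-character threshold is an arbitrary illustration, not a tuned value):

```python
from html.parser import HTMLParser

class TextCounter(HTMLParser):
    """Accumulate visible text length, ignoring script and style bodies."""
    def __init__(self):
        super().__init__()
        self.skip = 0
        self.text_len = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self.skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self.skip:
            self.skip -= 1

    def handle_data(self, data):
        if not self.skip:
            self.text_len += len(data.strip())

def looks_like_spa_shell(html: str, threshold: int = 200) -> bool:
    """Heuristic: a pre-render SPA shell has almost no visible text."""
    counter = TextCounter()
    counter.feed(html)
    return counter.text_len < threshold

shell = '<html><body><div id="root"></div><script src="app.js"></script></body></html>'
article = "<html><body><article>" + "Real prose. " * 50 + "</article></body></html>"
print(looks_like_spa_shell(shell), looks_like_spa_shell(article))  # True False
```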

GitHub README

Pages already rendered from Markdown should be the easiest. MDisBetter and Firecrawl produced near-perfect round-trips. Jina kept all content but converted some Markdown-native elements (badges, anchor TOCs) back into HTML-ish artifacts. html2text actually does fine here because the underlying HTML is semantic.

Reddit thread

The hardest test. Reddit's JS-rendered comment tree, with shadow DOM and lazy loading, breaks most tools. Firecrawl handled it best — depth control plus aggressive waiting. MDisBetter got the post and top-level comments. Jina got the post but flattened the comment hierarchy. The non-JS tools got essentially nothing.

Honest tradeoffs

Where MDisBetter wins: multi-format breadth (URL is one of 20+ tools we ship), code-block fidelity across documentation pages, free web tool with no signup. We don't ship a programmatic API or CLI for URL-to-Markdown today, so this is a web-tool-vs-web-tool comparison — for scripted automation, see Jina Reader or Firecrawl below. Full positioning in our 2026 ranked review.

Where Firecrawl wins: full-site crawling at scale, more aggressive JS handling with depth and wait controls, better engineering for spider-style use cases. If your job is "crawl this entire docs site," Firecrawl is purpose-built. We compare them directly in MDisBetter vs Firecrawl.

Where Jina Reader wins: API simplicity (prefix any URL with r.jina.ai/), generous free tier, good baseline quality. Hard to beat for one-line developer integration. Our Jina comparison goes deeper.
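
The prefix really is the whole integration. A sketch of building the reader URL (the prefix scheme is as Jina documents it; response details beyond "page as Markdown" are not verified here):

```python
import urllib.request

def jina_reader_url(url: str) -> str:
    """Build a Jina Reader URL by prefixing the target URL with r.jina.ai."""
    return "https://r.jina.ai/" + url

# Fetching the result returns the page as Markdown (network call, so
# left commented out here):
# markdown = urllib.request.urlopen(jina_reader_url("https://example.com")).read().decode()
print(jina_reader_url("https://example.com"))
```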

Where html2text wins: local execution, zero dependency on any service, free forever. Quality is the worst of the bunch on modern web pages, but if your input is server-rendered HTML you control, it's a perfectly fine choice.
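
To make the local-execution point concrete, here is a dependency-free toy converter in the same spirit: pure function call, no service, works fine when the input HTML is semantic. This is a stand-in for the idea behind html2text, not its actual API:

```python
from html.parser import HTMLParser

class MiniMarkdown(HTMLParser):
    """Toy local HTML-to-Markdown pass: headings, paragraphs, links only."""
    def __init__(self):
        super().__init__()
        self.out = []
        self.href = None

    def handle_starttag(self, tag, attrs):
        if tag in ("h1", "h2", "h3"):
            self.out.append("\n" + "#" * int(tag[1]) + " ")
        elif tag == "p":
            self.out.append("\n\n")
        elif tag == "a":
            self.href = dict(attrs).get("href")
            self.out.append("[")

    def handle_endtag(self, tag):
        if tag == "a" and self.href is not None:
            self.out.append(f"]({self.href})")
            self.href = None

    def handle_data(self, data):
        self.out.append(data)

def to_markdown(html: str) -> str:
    parser = MiniMarkdown()
    parser.feed(html)
    return "".join(parser.out).strip()

print(to_markdown('<h2>Title</h2><p>See <a href="https://example.com">the docs</a>.</p>'))
```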

The verdict

For most users converting one or many URLs to LLM-ready Markdown, MDisBetter, Firecrawl, and Jina Reader are the three serious options. The choice between them comes down to positioning more than raw quality: MDisBetter for code-block fidelity and a free no-signup web tool, Firecrawl for full-site crawling and heavy JS, Jina Reader for one-line API integration.

For RAG pipelines specifically, see the URL-to-Markdown for RAG guidance page and the runnable Trafilatura-based recipe in scrape a website to Markdown for RAG.

Frequently asked questions

Why did MDisBetter and Firecrawl tie at 19/20?
Different strengths cancel out. MDisBetter scored higher on code-block formatting (better language-tag preservation across documentation sites). Firecrawl scored higher on JS handling (more aggressive waiting and depth control on heavy SPAs). Net total is identical; the right pick depends on your workload.
Did html2text really score so low, or is the test biased toward modern pages?
html2text is excellent at what it was built for: converting clean, server-rendered HTML to Markdown locally. The modern web is mostly not that. On this test corpus (representative of real-world URL-to-Markdown work), JS-rendered pages and ad-laden sites dominate, and html2text was designed for an earlier era. A different test corpus would give it a different score.
How often do these scores change as tools update?
Quarterly is a reasonable refresh cadence. The leading tools (MDisBetter, Firecrawl, Jina) all push improvements monthly. Specific scores can shift; the broad ranking has been stable for the past year and is likely to remain stable.