
HTML vs Markdown for LLMs: Token Count on 20 Real Web Pages

Everyone says "Markdown uses fewer tokens than HTML," but the actual ratio varies wildly by page type. We tested 20 real web pages across five categories, measured token counts for raw HTML and converted Markdown using tiktoken (OpenAI's tokenizer), and ran the cost math at GPT-4o pricing. The headline number: 7.5x average reduction. The honest qualifier: edge cases exist where HTML preserves data Markdown can't.

Methodology

Twenty pages tested across five categories: documentation, news, wiki, forum, and blog.

For each page:

  1. Fetched raw HTML via curl (no JS rendering — what the server returned)
  2. Converted to Markdown via MDisBetter URL to Markdown
  3. Counted tokens for both versions using OpenAI's tiktoken with the o200k_base encoding (GPT-4o family)

All 20 conversions ran on the publicly-served HTML — no paywall bypass, no login, no JS rendering shortcuts.

Per-page results

| Page | HTML tokens | Markdown tokens | Reduction |
| --- | ---: | ---: | ---: |
| Stripe API docs (one endpoint) | 48,210 | 6,820 | 7.1x |
| FastAPI tutorial page | 22,180 | 3,940 | 5.6x |
| React docs (useState page) | 34,560 | 5,210 | 6.6x |
| Tailwind docs (flexbox) | 41,800 | 4,890 | 8.5x |
| NYT article (medium length) | 62,400 | 4,620 | 13.5x |
| Reuters article | 38,900 | 3,180 | 12.2x |
| BBC News article | 44,260 | 3,950 | 11.2x |
| The Verge post | 52,180 | 5,720 | 9.1x |
| Bloomberg piece | 71,300 | 4,210 | 16.9x |
| Wikipedia (Markdown article) | 89,540 | 21,300 | 4.2x |
| Wiktionary entry | 14,280 | 2,180 | 6.5x |
| Arch Linux wiki page | 26,800 | 7,420 | 3.6x |
| MediaWiki manual page | 31,500 | 8,100 | 3.9x |
| Reddit thread (50 comments) | 184,200 | 14,800 | 12.4x |
| Stack Overflow Q&A (3 answers) | 56,400 | 9,200 | 6.1x |
| Hacker News story (60 comments) | 28,900 | 8,400 | 3.4x |
| GitHub Issue thread (20 replies) | 92,100 | 11,800 | 7.8x |
| Substack longform | 48,200 | 6,300 | 7.7x |
| Medium article | 61,400 | 5,900 | 10.4x |
| Personal dev blog post | 18,200 | 4,200 | 4.3x |

Aggregate stats

| Metric | Value |
| --- | --- |
| Total HTML tokens | 1,067,330 |
| Total Markdown tokens | 142,160 |
| Overall reduction | 7.5x (total HTML ÷ total Markdown); 7.4x median per page |
| Min reduction | 3.4x (Hacker News — already minimal HTML) |
| Max reduction | 16.9x (Bloomberg — heavy ad scaffolding) |
| News category mean | 12.6x |
| Documentation category mean | 7.0x |
| Wiki category mean | 4.6x |
| Forum category mean | 7.4x |
| Blog category mean | 7.5x |

News pages reduce the most because modern news HTML is buried under ad scripts, related-story cards, social-share widgets, and tracking pixels. Wiki pages reduce the least because MediaWiki HTML is already relatively semantic — there's less chrome to strip.

Cost math at GPT-4o pricing

GPT-4o input pricing (2026): $2.50 per million tokens. Concrete scenarios:

Scenario 1: Single article fed to GPT-4o

At the corpus averages above (~53,400 HTML tokens vs ~7,100 Markdown tokens per page), that's about $0.13 of input per call raw versus about $0.018 converted, roughly 87% saved.

Scenario 2: 1000 articles per day fed through a summarization pipeline

At the same averages, about $133/day of input raw versus about $18/day converted: a saving of roughly $115/day, on the order of $42,000/year.

Scenario 3: RAG knowledge base of 1000 web pages, 100 queries/day

The savings scale with how many retrieved pages each query stuffs into context. If each query pulls in five pages (an illustrative assumption), that's about $67/day raw versus about $9/day converted.

The cost case for converting to Markdown before LLM ingestion is overwhelming for any non-trivial volume.
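A sketch you can adapt to your own volumes (the per-page token averages are this article's corpus totals divided by 20; the daily volumes are illustrative):

import tiktoken  # not needed for the math, just the usual companion

# GPT-4o input pricing used in this article: $2.50 per million tokens.
PRICE_PER_TOKEN = 2.50 / 1_000_000

AVG_HTML_TOKENS = 1_067_330 / 20  # ~53,400 tokens per page
AVG_MD_TOKENS = 142_160 / 20      # ~7,100 tokens per page

def daily_cost(pages_per_day: int, tokens_per_page: float) -> float:
    """Input-token cost of feeding this many pages to the model per day."""
    return pages_per_day * tokens_per_page * PRICE_PER_TOKEN

for pages in (1, 1_000, 50_000):
    html_cost = daily_cost(pages, AVG_HTML_TOKENS)
    md_cost = daily_cost(pages, AVG_MD_TOKENS)
    print(f'{pages:>6} pages/day: HTML ${html_cost:,.2f} '
          f'vs Markdown ${md_cost:,.2f} (save ${html_cost - md_cost:,.2f}/day)')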

Why is the gap so large?

HTML carries everything Markdown discards:

  1. Tag overhead: <p class="text-base text-gray-700 leading-relaxed">Hello</p> is ~25 tokens wrapping a single token of content. Markdown is just Hello (see the snippet below).
  2. Inline styles: Tailwind-class-heavy pages are particularly bad — every element is decorated with 5-10 utility classes that mean nothing to an LLM.
  3. Script tags: <script> blocks for analytics, ads, A/B tests can be 30-60% of total page tokens.
  4. JSON-LD blobs: structured data (schema.org markup) often adds 1000-5000 tokens per page that are useful for SEO crawlers and useless for content understanding.
  5. Inline SVG icons: every social-share button, navigation chevron, and decorative icon is dozens of tokens.
  6. Hidden DOM: ARIA labels, screen-reader-only content, hidden modal markup — all in the page source, all counted as tokens, all discarded by Markdown extraction.

Markdown keeps the semantic essence: headings, lists, paragraphs, links, code, emphasis. Everything else is chrome.
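You can check the tag-overhead point directly. A minimal sketch with tiktoken (the HTML string is the illustrative example from the list above, not a measured page):

import tiktoken

enc = tiktoken.get_encoding('o200k_base')

# Same content, with and without utility-class markup.
html = '<p class="text-base text-gray-700 leading-relaxed">Hello</p>'
md = 'Hello'

print(len(enc.encode(html)))  # on the order of 20 tokens, almost all markup
print(len(enc.encode(md)))    # a single token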

Edge cases where HTML wins

Markdown is not strictly better for every input. Honest counterexamples:

Interactive widgets

If the page contains an interactive code editor (CodeSandbox embed, JSFiddle), an interactive chart (D3, Plotly), or a configurator (price calculator, form builder), the interactive behavior lives in HTML+JS and disappears in Markdown. The Markdown version captures the static text but not the interaction.

For LLM consumption this usually doesn't matter — the LLM can't interact with widgets anyway. But if your downstream task is "render this page exactly," Markdown is lossy.

Complex multi-column layouts

A magazine-style layout with sidebars, callouts, and floating images loses spatial structure when converted to Markdown. Linearization happens. For most LLM tasks ("summarize this article") this is fine because the LLM cares about content not layout. For visual archival, screenshot the HTML.

Tables with rowspan or colspan

Markdown tables don't support merged cells. Complex HTML tables with rowspan/colspan get flattened or broken. If your data is in a complex table, consider feeding the HTML table directly to the LLM (modern models read HTML tables fine) or extract to CSV/JSON first.
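For the extract-to-CSV route, pandas can do the flattening for you: read_html expands rowspan/colspan by repeating the merged value across the cells it covered, so nothing is silently dropped. A sketch with a toy table (requires lxml or html5lib installed):

from io import StringIO

import pandas as pd

# A table with a merged cell that a Markdown table can't represent.
html = """
<table>
  <tr><th>Region</th><th>Quarter</th><th>Revenue</th></tr>
  <tr><td rowspan="2">EMEA</td><td>Q1</td><td>1.2M</td></tr>
  <tr><td>Q2</td><td>1.4M</td></tr>
</table>
"""

df = pd.read_html(StringIO(html))[0]  # returns one DataFrame per table
print(df.to_csv(index=False))         # clean CSV, ready to hand to an LLM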

Pages where the data is in attributes

Schema.org product pages encode price, availability, and ratings in <meta> tags and JSON-LD blobs. A Markdown extraction strips those. If you're scraping structured data, parse the HTML directly with a structured-data extractor (e.g., extruct in Python) rather than converting to Markdown first.
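A minimal sketch with extruct (the product URL is a placeholder):

import requests
import extruct

url = 'https://example.com/some-product-page'
html = requests.get(url).text

# Pull schema.org data out of JSON-LD and microdata blocks --
# exactly the material a Markdown conversion would discard.
data = extruct.extract(html, base_url=url, syntaxes=['json-ld', 'microdata'])
for item in data['json-ld']:
    print(item.get('@type'), item.get('name'), item.get('offers'))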

The right pattern for LLM workflows

Convert to Markdown unless you have a specific reason not to. The token savings, cleanliness improvements, and downstream chunkability advantages dominate for the vast majority of LLM use cases.

For one-off URL conversion, paste into /convert/url-to-markdown. For batch conversion in a script, use Trafilatura locally — recipe in scrape a website to Markdown for RAG. For RAG pipelines specifically, see URL-to-Markdown for RAG and the knowledge base tutorial.
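A batch sketch with Trafilatura (Markdown output requires a reasonably recent release; check your installed version, and the URLs here are placeholders):

import trafilatura

urls = [
    'https://example.com/post-1',
    'https://example.com/post-2',
]

for url in urls:
    downloaded = trafilatura.fetch_url(url)  # raw HTML, no JS rendering
    if downloaded is None:
        continue
    md = trafilatura.extract(downloaded, output_format='markdown')
    if md:
        print(md[:200])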

How does this compare to PDF vs Markdown?

The same dynamic shows up there. Native PDFs (text-extracted, not OCR'd) typically see a 2-5x token reduction when converted to Markdown: a smaller ratio than HTML because PDFs don't carry the script/style/tracking overhead, but still substantial, and the cleanliness wins carry over. See token count: PDF vs Markdown real comparison for the empirical numbers there. The combined story across web and document workflows is consistent: convert to Markdown before LLM ingestion, save 70-90% on tokens, get cleaner downstream behavior.

Beyond cost: quality wins

The cost case is the easy story. The quality case is bigger:

  1. Attention: LLMs pay better attention to clean signal; markup noise dilutes the content the model should be reasoning over.
  2. Context budget: the same page fits in a fraction of the context window, leaving room for instructions, examples, and retrieved material.
  3. Chunkability: heading-structured Markdown splits cleanly at semantic boundaries, which is exactly what RAG chunkers want.

Tooling recommendation

tiktoken is the right token counter for OpenAI models. Anthropic's Claude family has its own tokenizer (slightly different ratios but same broad pattern — Markdown wins by 5-10x over raw HTML on most pages). Google Gemini uses a SentencePiece variant — same story.

For batch token-counting on a corpus, our token counter tool handles single files; for scripts, pip install tiktoken is the canonical path. The numbers in this article were generated with tiktoken; we'd encourage you to re-run on your own corpus before making cost decisions.
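For a corpus, a few lines of tiktoken cover it (the directory path is illustrative):

from pathlib import Path

import tiktoken

enc = tiktoken.get_encoding('o200k_base')  # GPT-4o family

# Count tokens for every Markdown file in a directory.
total = 0
for path in sorted(Path('corpus/').glob('*.md')):
    n = len(enc.encode(path.read_text(encoding='utf-8')))
    total += n
    print(f'{path.name}: {n:,} tokens')
print(f'total: {total:,} tokens')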

What the savings look like at startup scale

Concrete: a small AI startup running a daily news digest for 5,000 users, fetching ~10 articles per user per day, summarizing each with GPT-4o. That's 50,000 articles a day. At the news-category averages above (~53,800 HTML tokens vs ~4,300 Markdown tokens per article), raw HTML costs about $6,700/day in input tokens, roughly $2.4M/year; converted Markdown costs about $540/day, roughly $200K/year.

That number sounds absurd because it is. The cost of conversion (Trafilatura locally, or our web tool for ad-hoc) is essentially zero. The savings on the LLM bill are real money. Anyone running production LLM workloads against web content should have a Markdown-conversion step in the pipeline before any model call.

Verifying these numbers yourself

Pick one URL you care about. Run:

import requests, tiktoken

url = 'https://www.bbc.com/news/articles/some-article'
html = requests.get(url).text
enc = tiktoken.get_encoding('o200k_base')
html_tokens = len(enc.encode(html))
print(f'HTML tokens: {html_tokens:,}')

# Convert via the MDisBetter web tool, then:
md = open('downloaded.md').read()
md_tokens = len(enc.encode(md))
print(f'Markdown tokens: {md_tokens:,}')
print(f'Reduction: {html_tokens / md_tokens:.1f}x')

The exact ratio depends on the page; the broad pattern is reliable. If you see a ratio under 3x, the page already has unusually clean HTML (rare on the modern web). If you see a ratio above 15x, the page is unusually heavy on tracking and ad scripts (common on news sites).

Frequently asked questions

Why does Wikipedia have a smaller reduction ratio (4.2x) than news articles (12-17x)?
Wikipedia's HTML is unusually clean — MediaWiki generates semantic markup with minimal ad scaffolding, no inline scripts on most pages, and predictable structure. There's less chrome to strip. Modern news sites carry 5-10 layers of ad tags, social widgets, related-story cards, and tracking — all token bloat. The cleaner the source HTML, the smaller the conversion gain. Markdown still wins on Wikipedia, just by a smaller margin.
Does the Anthropic tokenizer give different ratios than OpenAI's tiktoken?
Slightly different per-token costs but the same broad pattern. Anthropic's tokenizer is byte-level BPE like tiktoken; on English prose the ratio between HTML tokens and Markdown tokens differs by maybe 5-15% from tiktoken's count. The Markdown advantage holds across all major LLM tokenizers (OpenAI, Anthropic, Google, Llama family) because the savings come from removing actual input text (tags, scripts, tracking), not from tokenizer-specific quirks.
Should I always strip HTML before sending to an LLM, even for short pages?
For pages under ~5000 HTML tokens, the cost savings are modest and the engineering overhead of conversion may not be worth it. For pages above 10K HTML tokens (most modern web pages) the cost case becomes significant. The quality case (LLMs pay better attention to clean signal) is independent of size — even on tiny pages, Markdown produces better LLM behavior. Default to converting; skip only when you have a specific reason to keep raw HTML.