
HTML vs Markdown for LLMs: Token Count on 20 Real Web Pages

Everyone says "Markdown uses fewer tokens than HTML," but the actual ratio varies wildly by page type. We tested 20 real web pages across five categories, measured token counts for raw HTML and converted Markdown using tiktoken (OpenAI's tokenizer), and ran the cost math at GPT-4o pricing. The headline number: 7.5x average reduction. The honest qualifier: edge cases exist where HTML preserves data Markdown can't.

Methodology

Twenty pages tested across five categories: documentation, news, wiki, forum, and blog.

For each page:

  1. Fetched raw HTML via curl (no JS rendering — what the server returned)
  2. Converted to Markdown via MDisBetter URL to Markdown
  3. Counted tokens for both versions using OpenAI's tiktoken with the o200k_base encoding (GPT-4o family)

All 20 conversions ran on the publicly-served HTML — no paywall bypass, no login, no JS rendering shortcuts.

Per-page results

| Page | HTML tokens | Markdown tokens | Reduction |
| --- | ---: | ---: | ---: |
| Stripe API docs (one endpoint) | 48,210 | 6,820 | 7.1x |
| FastAPI tutorial page | 22,180 | 3,940 | 5.6x |
| React docs (useState page) | 34,560 | 5,210 | 6.6x |
| Tailwind docs (flexbox) | 41,800 | 4,890 | 8.5x |
| NYT article (medium length) | 62,400 | 4,620 | 13.5x |
| Reuters article | 38,900 | 3,180 | 12.2x |
| BBC News article | 44,260 | 3,950 | 11.2x |
| The Verge post | 52,180 | 5,720 | 9.1x |
| Bloomberg piece | 71,300 | 4,210 | 16.9x |
| Wikipedia (Markdown article) | 89,540 | 21,300 | 4.2x |
| Wiktionary entry | 14,280 | 2,180 | 6.5x |
| Arch Linux wiki page | 26,800 | 7,420 | 3.6x |
| MediaWiki manual page | 31,500 | 8,100 | 3.9x |
| Reddit thread (50 comments) | 184,200 | 14,800 | 12.4x |
| Stack Overflow Q&A (3 answers) | 56,400 | 9,200 | 6.1x |
| Hacker News story (60 comments) | 28,900 | 8,400 | 3.4x |
| GitHub Issue thread (20 replies) | 92,100 | 11,800 | 7.8x |
| Substack longform | 48,200 | 6,300 | 7.7x |
| Medium article | 61,400 | 5,900 | 10.4x |
| Personal dev blog post | 18,200 | 4,200 | 4.3x |

Aggregate stats

| Metric | Value |
| --- | --- |
| Total HTML tokens | 1,067,330 |
| Total Markdown tokens | 142,160 |
| Overall reduction | 7.5x (total HTML ÷ total Markdown); 7.4x median per page |
| Min reduction | 3.4x (Hacker News — already minimal HTML) |
| Max reduction | 16.9x (Bloomberg — heavy ad scaffolding) |
| News category mean | 12.6x |
| Documentation category mean | 7.0x |
| Wiki category mean | 4.6x |
| Forum category mean | 7.4x |
| Blog category mean | 7.5x |

News pages reduce the most because modern news HTML is buried under ad scripts, related-story cards, social-share widgets, and tracking pixels. Wiki pages reduce the least because MediaWiki HTML is already relatively semantic — there's less chrome to strip.

Cost math at GPT-4o pricing

GPT-4o input pricing (2026): $2.50 per million tokens. Concrete scenarios:

Scenario 1: Single article fed to GPT-4o

At the corpus averages above (~53,400 HTML tokens vs ~7,100 Markdown tokens per page), that's about $0.13 of input per call raw versus about $0.018 converted, roughly 87% saved.

Scenario 2: 1000 articles per day fed through a summarization pipeline

At the same averages, about $133/day of input raw versus about $18/day converted: a saving of roughly $115/day, on the order of $42,000/year.

Scenario 3: RAG knowledge base of 1000 web pages, 100 queries/day

The savings scale with how many retrieved pages each query stuffs into context. If each query pulls in five pages (an illustrative assumption), that's about $67/day raw versus about $9/day converted.

The cost case for converting to Markdown before LLM ingestion is overwhelming for any non-trivial volume.
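A sketch you can adapt to your own volumes (the per-page token averages are this article's corpus totals divided by 20; the daily volumes are illustrative):

import tiktoken  # not needed for the math, just the usual companion

# GPT-4o input pricing used in this article: $2.50 per million tokens.
PRICE_PER_TOKEN = 2.50 / 1_000_000

AVG_HTML_TOKENS = 1_067_330 / 20  # ~53,400 tokens per page
AVG_MD_TOKENS = 142_160 / 20      # ~7,100 tokens per page

def daily_cost(pages_per_day: int, tokens_per_page: float) -> float:
    """Input-token cost of feeding this many pages to the model per day."""
    return pages_per_day * tokens_per_page * PRICE_PER_TOKEN

for pages in (1, 1_000, 50_000):
    html_cost = daily_cost(pages, AVG_HTML_TOKENS)
    md_cost = daily_cost(pages, AVG_MD_TOKENS)
    print(f'{pages:>6} pages/day: HTML ${html_cost:,.2f} '
          f'vs Markdown ${md_cost:,.2f} (save ${html_cost - md_cost:,.2f}/day)')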

Why is the gap so large?

HTML carries everything Markdown discards:

  1. Tag overhead: <p class="text-base text-gray-700 leading-relaxed">Hello</p> is ~25 tokens wrapping a single token of content. Markdown is just Hello (see the snippet below).
  2. Inline styles: Tailwind-class-heavy pages are particularly bad — every element is decorated with 5-10 utility classes that mean nothing to an LLM.
  3. Script tags: <script> blocks for analytics, ads, A/B tests can be 30-60% of total page tokens.
  4. JSON-LD blobs: structured data (schema.org markup) often adds 1000-5000 tokens per page that are useful for SEO crawlers and useless for content understanding.
  5. Inline SVG icons: every social-share button, navigation chevron, and decorative icon is dozens of tokens.
  6. Hidden DOM: ARIA labels, screen-reader-only content, hidden modal markup — all in the page source, all counted as tokens, all discarded by Markdown extraction.

Markdown keeps the semantic essence: headings, lists, paragraphs, links, code, emphasis. Everything else is chrome.
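You can check the tag-overhead point directly. A minimal sketch with tiktoken (the HTML string is the illustrative example from the list above, not a measured page):

import tiktoken

enc = tiktoken.get_encoding('o200k_base')

# Same content, with and without utility-class markup.
html = '<p class="text-base text-gray-700 leading-relaxed">Hello</p>'
md = 'Hello'

print(len(enc.encode(html)))  # on the order of 20 tokens, almost all markup
print(len(enc.encode(md)))    # a single token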

Edge cases where HTML wins

Markdown is not strictly better for every input. Honest counterexamples:

Interactive widgets

If the page contains an interactive code editor (CodeSandbox embed, JSFiddle), an interactive chart (D3, Plotly), or a configurator (price calculator, form builder), the interactive behavior lives in HTML+JS and disappears in Markdown. The Markdown version captures the static text but not the interaction.

For LLM consumption this usually doesn't matter — the LLM can't interact with widgets anyway. But if your downstream task is "render this page exactly," Markdown is lossy.

Complex multi-column layouts

A magazine-style layout with sidebars, callouts, and floating images loses spatial structure when converted to Markdown. Linearization happens. For most LLM tasks ("summarize this article") this is fine because the LLM cares about content not layout. For visual archival, screenshot the HTML.

Tables with rowspan or colspan

Markdown tables don't support merged cells. Complex HTML tables with rowspan/colspan get flattened or broken. If your data is in a complex table, consider feeding the HTML table directly to the LLM (modern models read HTML tables fine) or extract to CSV/JSON first.
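For the extract-to-CSV route, pandas can do the flattening for you: read_html expands rowspan/colspan by repeating the merged value across the cells it covered, so nothing is silently dropped. A sketch with a toy table (requires lxml or html5lib installed):

from io import StringIO

import pandas as pd

# A table with a merged cell that a Markdown table can't represent.
html = """
<table>
  <tr><th>Region</th><th>Quarter</th><th>Revenue</th></tr>
  <tr><td rowspan="2">EMEA</td><td>Q1</td><td>1.2M</td></tr>
  <tr><td>Q2</td><td>1.4M</td></tr>
</table>
"""

df = pd.read_html(StringIO(html))[0]  # returns one DataFrame per table
print(df.to_csv(index=False))         # clean CSV, ready to hand to an LLM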

Pages where the data is in attributes

Schema.org product pages encode price, availability, and ratings in <meta> tags and JSON-LD blobs. A Markdown extraction strips those. If you're scraping structured data, parse the HTML directly with a structured-data extractor (e.g., extruct in Python) rather than converting to Markdown first.
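A minimal sketch with extruct (the product URL is a placeholder):

import requests
import extruct

url = 'https://example.com/some-product-page'
html = requests.get(url).text

# Pull schema.org data out of JSON-LD and microdata blocks --
# exactly the material a Markdown conversion would discard.
data = extruct.extract(html, base_url=url, syntaxes=['json-ld', 'microdata'])
for item in data['json-ld']:
    print(item.get('@type'), item.get('name'), item.get('offers'))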

The right pattern for LLM workflows

Convert to Markdown unless you have a specific reason not to. The token savings, cleanliness improvements, and downstream chunkability advantages dominate for the vast majority of LLM use cases.

For one-off URL conversion, paste into /convert/url-to-markdown. For batch conversion in a script, use Trafilatura locally — recipe in scrape a website to Markdown for RAG. For RAG pipelines specifically, see URL-to-Markdown for RAG and the knowledge base tutorial.
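A batch sketch with Trafilatura (Markdown output requires a reasonably recent release; check your installed version, and the URLs here are placeholders):

import trafilatura

urls = [
    'https://example.com/post-1',
    'https://example.com/post-2',
]

for url in urls:
    downloaded = trafilatura.fetch_url(url)  # raw HTML, no JS rendering
    if downloaded is None:
        continue
    md = trafilatura.extract(downloaded, output_format='markdown')
    if md:
        print(md[:200])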

How does this compare to PDF vs Markdown?

The same dynamic shows up there. Native PDFs (text-extracted, not OCR'd) typically see a 2-5x token reduction when converted to Markdown: a smaller ratio than HTML because PDFs don't carry the script/style/tracking overhead, but still substantial, and the cleanliness wins carry over. See token count: PDF vs Markdown real comparison for the empirical numbers there. The combined story across web and document workflows is consistent: convert to Markdown before LLM ingestion, save 70-90% on tokens, get cleaner downstream behavior.

Beyond cost: quality wins

The cost case is the easy story. The quality case is bigger:

  1. Attention: LLMs pay better attention to clean signal; markup noise dilutes the content the model should be reasoning over.
  2. Context budget: the same page fits in a fraction of the context window, leaving room for instructions, examples, and retrieved material.
  3. Chunkability: heading-structured Markdown splits cleanly at semantic boundaries, which is exactly what RAG chunkers want.

Tooling recommendation

tiktoken is the right token counter for OpenAI models. Anthropic's Claude family has its own tokenizer (slightly different ratios but same broad pattern — Markdown wins by 5-10x over raw HTML on most pages). Google Gemini uses a SentencePiece variant — same story.

For batch token-counting on a corpus, our token counter tool handles single files; for scripts, pip install tiktoken is the canonical path. The numbers in this article were generated with tiktoken; we'd encourage you to re-run on your own corpus before making cost decisions.
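For a corpus, a few lines of tiktoken cover it (the directory path is illustrative):

from pathlib import Path

import tiktoken

enc = tiktoken.get_encoding('o200k_base')  # GPT-4o family

# Count tokens for every Markdown file in a directory.
total = 0
for path in sorted(Path('corpus/').glob('*.md')):
    n = len(enc.encode(path.read_text(encoding='utf-8')))
    total += n
    print(f'{path.name}: {n:,} tokens')
print(f'total: {total:,} tokens')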

What the savings look like at startup scale

Concrete: a small AI startup running a daily news digest for 5,000 users, fetching ~10 articles per user per day, summarizing each with GPT-4o. That's 50,000 articles a day. At the news-category averages above (~53,800 HTML tokens vs ~4,300 Markdown tokens per article), raw HTML costs about $6,700/day in input tokens, roughly $2.4M/year; converted Markdown costs about $540/day, roughly $200K/year.

That number sounds absurd because it is. The cost of conversion (Trafilatura locally, or our web tool for ad-hoc) is essentially zero. The savings on the LLM bill are real money. Anyone running production LLM workloads against web content should have a Markdown-conversion step in the pipeline before any model call.

Verifying these numbers yourself

Pick one URL you care about. Run:

import requests, tiktoken

url = 'https://www.bbc.com/news/articles/some-article'
html = requests.get(url).text
enc = tiktoken.get_encoding('o200k_base')
html_tokens = len(enc.encode(html))
print(f'HTML tokens: {html_tokens:,}')

# Convert via the MDisBetter web tool, then:
md = open('downloaded.md').read()
md_tokens = len(enc.encode(md))
print(f'Markdown tokens: {md_tokens:,}')
print(f'Reduction: {html_tokens / md_tokens:.1f}x')

The exact ratio depends on the page; the broad pattern is reliable. If you see a ratio under 3x, the page already has unusually clean HTML (rare on the modern web). If you see a ratio above 15x, the page is unusually heavy on tracking and ad scripts (common on news sites).

Frequently asked questions

Why does Wikipedia have a smaller reduction ratio (4.2x) than news articles (12-17x)?
Wikipedia's HTML is unusually clean — MediaWiki generates semantic markup with minimal ad scaffolding, no inline scripts on most pages, and predictable structure. There's less chrome to strip. Modern news sites carry 5-10 layers of ad tags, social widgets, related-story cards, and tracking — all token bloat. The cleaner the source HTML, the smaller the conversion gain. Markdown still wins on Wikipedia, just by a smaller margin.
Does the Anthropic tokenizer give different ratios than OpenAI's tiktoken?
Slightly different per-token costs but the same broad pattern. Anthropic's tokenizer is byte-level BPE like tiktoken; on English prose the ratio between HTML tokens and Markdown tokens differs by maybe 5-15% from tiktoken's count. The Markdown advantage holds across all major LLM tokenizers (OpenAI, Anthropic, Google, Llama family) because the savings come from removing actual input text (tags, scripts, tracking), not from tokenizer-specific quirks.
Should I always strip HTML before sending to an LLM, even for short pages?
For pages under ~5000 HTML tokens, the cost savings are modest and the engineering overhead of conversion may not be worth it. For pages above 10K HTML tokens (most modern web pages) the cost case becomes significant. The quality case (LLMs pay better attention to clean signal) is independent of size — even on tiny pages, Markdown produces better LLM behavior. Default to converting; skip only when you have a specific reason to keep raw HTML.