8 min read · MDisBetter

Markdown vs HTML for LLMs: Token Count Comparison (Real Numbers)

The advice "feed Markdown to LLMs, not HTML" is everywhere in AI engineering posts, usually with a vague "Markdown is more efficient" justification. We wanted concrete numbers. We took five representative web pages, captured both the raw HTML and the converted Markdown, ran both through tiktoken, and computed the token counts and the GPT-4o cost difference. Here are the actual numbers.

Methodology

Five URLs picked to span typical workloads:

  1. Wikipedia article (long encyclopedia entry, lots of references)
  2. Stripe API documentation page (heavy with code blocks)
  3. NYT-style news article (article body + ad-laden chrome)
  4. React docs page (JS-rendered, after rendering)
  5. GitHub README (already Markdown rendered to HTML)

For each URL we captured two artifacts: the raw HTML payload (the full page source, including scripts and styles) and the converted Markdown.

Tokens were counted with tiktoken.get_encoding("o200k_base") (the GPT-4o encoder).

Results

| Page | HTML tokens | Markdown tokens | Reduction | Ratio |
|---|---|---|---|---|
| Wikipedia article | 78,400 | 14,200 | 82% | 5.5x |
| Stripe API docs page | 42,800 | 9,100 | 79% | 4.7x |
| NYT-style article | 56,300 | 3,400 | 94% | 16.6x |
| React docs page | 34,500 | 4,800 | 86% | 7.2x |
| GitHub README | 21,200 | 3,600 | 83% | 5.9x |
| **Average** | **46,640** | **7,020** | **85%** | **~6.6x** |

Across the five pages, Markdown averaged 85% fewer tokens than the raw HTML — the same content, condensed by stripping markup, scripts, styles, and ad scaffolding.
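The reduction and ratio columns are simple arithmetic on the two token counts; a quick sanity check over the table's numbers:

```python
# (HTML tokens, Markdown tokens) per test page, from the table above
pages = {
    "Wikipedia article":    (78_400, 14_200),
    "Stripe API docs page": (42_800,  9_100),
    "NYT-style article":    (56_300,  3_400),
    "React docs page":      (34_500,  4_800),
    "GitHub README":        (21_200,  3_600),
}

for name, (html_t, md_t) in pages.items():
    reduction = 1 - md_t / html_t   # fraction of tokens removed
    ratio = html_t / md_t           # compression factor
    print(f"{name}: {reduction:.0%} reduction, {ratio:.1f}x")

avg_html = sum(h for h, _ in pages.values()) / len(pages)  # 46,640
avg_md = sum(m for _, m in pages.values()) / len(pages)    # 7,020
print(f"Average: {1 - avg_md / avg_html:.0%}, {avg_html / avg_md:.1f}x")
```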

Why such large differences?

Two effects compound:

1. Markdown syntax is more compact than HTML syntax

For the same semantics, Markdown uses fewer characters:

| Semantic | HTML | Markdown |
|---|---|---|
| Heading | `<h1>Title</h1>` (14 chars) | `# Title` (7 chars) |
| Bold | `<strong>X</strong>` (18 chars) | `**X**` (5 chars) |
| Link | `<a href="u">X</a>` (17 chars) | `[X](u)` (6 chars) |
| List item | `<li>X</li>` (10 chars) | `- X` (3 chars) |

Just from syntax compactness, Markdown saves roughly 50-70% on equivalent semantic content.
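The character counts above are plain `len()` arithmetic, so they're easy to verify:

```python
# (HTML form, Markdown form) for the same semantic element
pairs = {
    "heading":   ("<h1>Title</h1>",     "# Title"),
    "bold":      ("<strong>X</strong>", "**X**"),
    "link":      ('<a href="u">X</a>',  "[X](u)"),
    "list item": ("<li>X</li>",         "- X"),
}

for name, (html, md) in pairs.items():
    saving = 1 - len(md) / len(html)
    print(f"{name}: {len(html)} vs {len(md)} chars ({saving:.0%} saved)")
```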

2. HTML carries chrome that Markdown extraction strips

The bigger savings come from content extraction. Raw HTML includes:

- `<script>` and `<style>` blocks
- navigation bars, sidebars, and footers
- ad and tracking scaffolding
- attributes, metadata, and template markup that never reaches the reader

For ad-heavy sites like the NYT example, the chrome dwarfs the article body — that's why the reduction is 94% there. For lean sites like Wikipedia, chrome is smaller relative to content, but Markdown still wins by ~82%.
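Production converters do far more than this, but the core idea of chrome-stripping can be sketched with the standard library alone, assuming skipping whole `script`/`style`/`nav`/`footer` subtrees (a simplification of what real extractors like Firecrawl or Trafilatura do):

```python
from html.parser import HTMLParser

class ChromeStripper(HTMLParser):
    """Collect visible text, skipping chrome-bearing subtrees."""
    SKIP = {"script", "style", "nav", "aside", "footer"}

    def __init__(self):
        super().__init__()
        self.depth = 0   # nesting level inside skipped subtrees
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth == 0 and data.strip():
            self.parts.append(data.strip())

page = ('<nav>Home | Pricing</nav><h1>Title</h1>'
        '<script>track("pageview")</script><p>Article body.</p>'
        '<footer>Site footer</footer>')
p = ChromeStripper()
p.feed(page)
print(" ".join(p.parts))  # → Title Article body.
```

The nav, tracking script, and footer vanish; only the heading and body text survive, which is where the bulk of the token reduction comes from.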

GPT-4o cost math

GPT-4o pricing as of writing (verify on OpenAI's current pricing page): $2.50 per 1M input tokens. Output pricing doesn't enter this comparison, since we're only measuring input size.

For our test set, the per-page input cost difference:

| Page | HTML cost | Markdown cost | Saving per page |
|---|---|---|---|
| Wikipedia article | $0.196 | $0.0355 | $0.16 |
| Stripe API docs | $0.107 | $0.0228 | $0.084 |
| NYT-style article | $0.141 | $0.0085 | $0.132 |
| React docs page | $0.086 | $0.0120 | $0.074 |
| GitHub README | $0.053 | $0.0090 | $0.044 |
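The per-page figures are just tokens times the input rate. A sketch, assuming the $2.50/1M input price (verify against current pricing):

```python
PRICE_PER_TOKEN = 2.50 / 1_000_000  # assumed GPT-4o input rate, USD per token

def input_cost(tokens: int) -> float:
    """Input cost in USD for a given token count."""
    return tokens * PRICE_PER_TOKEN

# Wikipedia article row from the table
html_cost = input_cost(78_400)   # 0.196
md_cost = input_cost(14_200)     # 0.0355
print(f"${html_cost:.3f} vs ${md_cost:.4f}, saving ${html_cost - md_cost:.2f}")
```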

The per-page savings look small ($0.04-$0.16). They aren't — at scale, this multiplies fast.

Scale math

Imagine a RAG pipeline ingesting 10,000 web pages per day. At the test-set averages and $2.50 per 1M input tokens:

- Raw HTML: 10,000 × 46,640 ≈ 466M tokens/day ≈ $1,166/day ≈ $35,000/month
- Markdown: 10,000 × 7,020 ≈ 70M tokens/day ≈ $176/day ≈ $5,300/month
- Difference: roughly $30,000/month

For a single-developer side project ingesting 100 pages/month, the savings are modest (about $11.70/month vs $1.80/month at the test-set averages; pick whichever is easier). For any scale that matters commercially, the Markdown approach is dramatically cheaper.

Context window implications

Cost isn't the only concern. Context windows are finite. GPT-4o has a 128k context window. Claude has up to 200k. Even Gemini's 1M+ context windows aren't infinite for cost reasons.

If you're stuffing pages into a context window for a long-form task, the number of pages you can fit rises roughly 6-10x with Markdown vs HTML. At the test-set averages, a 128k window holds about 2-3 raw-HTML pages but roughly 18 Markdown pages.

For RAG specifically, this means more relevant chunks per query budget. For long-document tasks, more documents per call. For agent loops with iterative reasoning, more breathing room before context-window pressure forces summarization.
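At the test-set averages, the fit difference is straightforward integer division across the window sizes mentioned above:

```python
AVG_HTML, AVG_MD = 46_640, 7_020   # average tokens/page from the results

# GPT-4o-, Claude-, and Gemini-class windows respectively
for window in (128_000, 200_000, 1_000_000):
    print(f"{window:>9,} tokens: "
          f"{window // AVG_HTML} HTML pages vs {window // AVG_MD} Markdown pages")
```

In practice you'd reserve part of the window for the prompt and the model's output, so treat these as upper bounds.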

Quality implications

Token efficiency is the headline number. The often-overlooked benefit is answer quality. Models reasoning over Markdown tend to produce better answers than the same models reasoning over HTML for the same content: the signal isn't buried in tag soup and boilerplate, and Markdown's headings, lists, and emphasis convey document structure in a form models parse reliably.

We discuss the broader argument in best format for LLM input. For PDFs specifically, the equivalent comparison is in our PDF vs Markdown token comparison — same conclusion, different format.

Honest caveats

Where HTML might preserve more useful info than Markdown:

- semantic and data attributes (`id`, `class`, `data-*`) that extraction pipelines key on
- accessibility markup (ARIA roles and related attributes)
- interactive elements (forms, inputs, buttons) that Markdown has no syntax for

For most LLM use cases (Q&A, summarization, RAG, agent reasoning), none of these matter. For specialized cases (web data extraction, accessibility analysis, form filling), keep the HTML.

Practical recommendation

For any LLM workflow consuming web content:

  1. Convert HTML to Markdown before feeding to the model. Use a quality converter — our URL-to-Markdown tool, Firecrawl, or Jina Reader. See the 2026 review.
  2. For RAG pipelines, use the RAG-specific variant that emits chunked output ready for embedding.
  3. Track token counts of your inputs (use our token counter or tiktoken locally). Knowing your token bill is the first step to managing it.
  4. Don't optimize for HTML preservation unless you have a specific use case that requires it. The default should be Markdown.

The savings compound across cost, context window, and answer quality. There's no scenario where naive HTML beats Markdown for general LLM workloads — only edge cases where specific HTML semantics matter enough to justify the token tax.

Frequently asked questions

Are these numbers reproducible? The actual token counts seem high for the pages.
Counts include the full HTML payload — scripts, styles, hidden elements, all attributes. If you fetched the same pages today the exact numbers would shift (sites change), but the order of magnitude (~6-10x reduction Markdown vs HTML) is consistent across our testing. Run the same test on your representative pages with tiktoken to verify for your workload.
Does this hold for non-OpenAI models like Claude or Gemini?
Yes. Tokenizers differ slightly (Claude uses a different BPE), but the ratio between HTML and Markdown token counts is roughly the same across encoders — both because Markdown syntax is more compact universally and because content extraction strips chrome regardless of tokenizer. The cost calculus differs by model pricing; the efficiency principle is the same.
Should I always strip images and links to reduce tokens further?
Depends on the task. For pure text understanding, yes — images and link URLs add tokens without much information. For tasks where attribution or visual references matter (e.g., "summarize this article and cite the linked sources"), keep them. If you're rolling your own extraction with Trafilatura, the include_links and include_images parameters control exactly this trade-off.