Markdown vs HTML for LLMs: Token Count Comparison (Real Numbers)
The advice "feed Markdown to LLMs, not HTML" is everywhere in AI engineering posts, usually with a vague "Markdown is more efficient" justification. We wanted concrete numbers. We took five representative web pages, captured both the raw HTML and the converted Markdown, ran both through tiktoken, and computed the token counts and the GPT-4o cost difference. Here are the actual numbers.
Methodology
Five URLs picked to span typical workloads:
- Wikipedia article (long encyclopedia entry, lots of references)
- Stripe API documentation page (heavy with code blocks)
- NYT-style news article (article body + ad-laden chrome)
- React docs page (JS-rendered, after rendering)
- GitHub README (already Markdown rendered to HTML)
For each URL we captured:
- Raw HTML: as fetched (after JS rendering for SPA pages), with all chrome intact
- Markdown via MDisBetter: clean output with content extraction
Tokens counted with tiktoken.get_encoding("o200k_base") (the GPT-4o encoder).
Results
| Page | HTML tokens | Markdown tokens | Reduction | Ratio |
|---|---|---|---|---|
| Wikipedia article | 78,400 | 14,200 | 82% | 5.5x |
| Stripe API docs page | 42,800 | 9,100 | 79% | 4.7x |
| NYT-style article | 56,300 | 3,400 | 94% | 16.6x |
| React docs page | 34,500 | 4,800 | 86% | 7.2x |
| GitHub README | 21,200 | 3,600 | 83% | 5.9x |
| Average | 46,640 | 7,020 | 85% | ~6.6x |
Across the five pages, Markdown averaged 85% fewer tokens than the raw HTML — the same content, condensed by stripping markup, scripts, styles, and ad scaffolding.
Why such large differences?
Two effects compound:
1. Markdown syntax is more compact than HTML syntax
For the same semantics, Markdown uses fewer characters:
| Semantic | HTML | Markdown |
|---|---|---|
| Heading | <h1>Title</h1> (14 chars) | # Title (7 chars) |
| Bold | <strong>X</strong> (18 chars) | **X** (5 chars) |
| Link | <a href="u">X</a> (17 chars) | [X](u) (6 chars) |
| List item | <li>X</li> (10 chars) | - X (3 chars) |
Just from syntax compactness, Markdown saves roughly 50-70% on equivalent semantic content.
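The character counts in the table are easy to verify directly:

```python
# (HTML snippet, equivalent Markdown snippet) for each construct in the table
pairs = {
    "heading":   ("<h1>Title</h1>",     "# Title"),
    "bold":      ("<strong>X</strong>", "**X**"),
    "link":      ('<a href="u">X</a>',  "[X](u)"),
    "list item": ("<li>X</li>",         "- X"),
}

for name, (html, md) in pairs.items():
    saving = 1 - len(md) / len(html)
    print(f"{name}: {len(html)} vs {len(md)} chars ({saving:.0%} saved)")
```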
2. HTML carries chrome that Markdown extraction strips
The bigger savings come from content extraction. Raw HTML includes:
- Site navigation (often 5-15% of tokens)
- Ad scaffolding and tracking scripts (often 10-30%)
- Sidebar widgets and related-content cards (often 10-20%)
- Footer with extensive link farms (often 5-10%)
- Inline styles, classes, data-attributes (often 10-20%)
- Hidden elements and ARIA wrappers (often 5-10%)
For ad-heavy sites like the NYT example, the chrome dwarfs the article body — that's why the reduction is 94% there. For lean sites like Wikipedia, chrome is smaller relative to content, but Markdown still wins by ~82%.
GPT-4o cost math
GPT-4o pricing as of writing (verify on OpenAI's current pricing page):
- Input: $2.50 per 1M tokens
- Output: $10.00 per 1M tokens
For our test set, the per-page input cost difference:
| Page | HTML cost | Markdown cost | Saving per page |
|---|---|---|---|
| Wikipedia article | $0.196 | $0.0355 | $0.16 |
| Stripe API docs | $0.107 | $0.0228 | $0.084 |
| NYT-style article | $0.141 | $0.0085 | $0.132 |
| React docs page | $0.086 | $0.0120 | $0.074 |
| GitHub README | $0.053 | $0.0090 | $0.044 |
The per-page savings look small ($0.04-$0.16). They aren't — at scale, this multiplies fast.
Scale math
Imagine a RAG pipeline ingesting 10,000 web pages per day:
- HTML approach: 10,000 × 46,640 tokens × $2.50/M = $1,166/day = ~$35,000/month input alone
- Markdown approach: 10,000 × 7,020 tokens × $2.50/M = $176/day = ~$5,300/month input alone
- Savings: ~$30,000/month, or ~$355,000/year, just on input tokens
For a single-developer side project ingesting 100 pages/month, the savings are small (~$11.70/month vs ~$1.80/month input cost at the test-set averages); use whichever approach is easier. For any scale that matters commercially, the Markdown approach is dramatically cheaper.
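The arithmetic behind both the per-page table and the scale estimates, using the $2.50/M input rate quoted above:

```python
INPUT_RATE = 2.50 / 1_000_000  # dollars per input token (GPT-4o rate quoted above)

def input_cost(tokens_per_page: int, pages: int = 1) -> float:
    """Input-side cost in dollars for a batch of pages."""
    return tokens_per_page * pages * INPUT_RATE

# Per-page: the Wikipedia example
print(round(input_cost(78_400), 4))   # HTML
print(round(input_cost(14_200), 4))   # Markdown

# At scale: 10,000 pages/day at the test-set averages
print(round(input_cost(46_640, 10_000)))  # HTML, dollars per day
print(round(input_cost(7_020, 10_000)))   # Markdown, dollars per day
```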
Context window implications
Cost isn't the only concern. Context windows are finite. GPT-4o has a 128k context window. Claude has up to 200k. Even Gemini's 1M+ context windows aren't infinite for cost reasons.
If you're stuffing pages into a context window for a long-form task, Markdown lets you fit roughly 6-10x more pages than HTML:
- HTML approach: ~2-3 typical web pages fit in 128k context
- Markdown approach: ~15-20 typical web pages fit in 128k context
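A quick back-of-envelope check, using the test-set averages against a 128k window:

```python
CONTEXT = 128_000       # GPT-4o context window, in tokens
AVG_HTML_TOKENS = 46_640  # test-set average, raw HTML
AVG_MD_TOKENS = 7_020     # test-set average, Markdown

print(CONTEXT // AVG_HTML_TOKENS)  # whole average-sized pages that fit as HTML
print(CONTEXT // AVG_MD_TOKENS)    # whole average-sized pages that fit as Markdown
```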
For RAG specifically, this means more relevant chunks per query budget. For long-document tasks, more documents per call. For agent loops with iterative reasoning, more breathing room before context-window pressure forces summarization.
Quality implications
Token efficiency is the headline number. The often-overlooked benefit is answer quality. Models reasoning over Markdown tend to produce better answers than the same models reasoning over the HTML version of the same content. Why:
- Less noise: nav cruft, scripts, styles all distract attention. The model wastes tokens (in its reasoning, not your bill) on figuring out what's content vs chrome.
- Cleaner structure: Markdown headings are unambiguous. HTML headings are sometimes <h2>, sometimes <div class="heading">, sometimes a <p><strong> styled to look like a heading. Markdown forces semantic clarity.
- Better retrieval: for RAG, embedding quality on Markdown tends to be higher because the embedded text is content, not markup.
We discuss the broader argument in best format for LLM input. For PDFs specifically, the equivalent comparison is in our PDF vs Markdown token comparison — same conclusion, different format.
Honest caveats
Where HTML might preserve more useful info than Markdown:
- Tables with rowspan/colspan: Markdown can't represent these natively. HTML keeps the structure; Markdown either flattens or duplicates cells.
- Custom inline styling: a <span style="color: red"> warning marker has no Markdown equivalent. The semantics might matter for certain analyses.
- Forms and interactive elements: HTML preserves form structure; Markdown drops it.
- Microformats and structured data: schema.org JSON-LD, RDFa, Microdata. Markdown drops these.
For most LLM use cases (Q&A, summarization, RAG, agent reasoning), none of these matter. For specialized cases (web data extraction, accessibility analysis, form filling), keep the HTML.
Practical recommendation
For any LLM workflow consuming web content:
- Convert HTML to Markdown before feeding to the model. Use a quality converter — our URL-to-Markdown tool, Firecrawl, or Jina Reader. See the 2026 review.
- For RAG pipelines, use the RAG-specific variant that emits chunked output ready for embedding.
- Track token counts of your inputs (use our token counter or tiktoken locally). Knowing your token bill is the first step to managing it.
- Don't optimize for HTML preservation unless you have a specific use case that requires it. The default should be Markdown.
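For a feel of what conversion involves, here is a deliberately minimal stdlib-only sketch that handles just a handful of tags. It's a toy, not a recommendation: a real pipeline should use one of the converters above, which also handle content extraction, tables, scripts/styles, and the many edge cases this skips:

```python
from html.parser import HTMLParser

class TinyMarkdown(HTMLParser):
    """Toy HTML-to-Markdown converter: headings, bold, list items, paragraphs only."""
    def __init__(self):
        super().__init__()
        self.out = []

    def handle_starttag(self, tag, attrs):
        if tag == "h1":
            self.out.append("\n# ")
        elif tag == "h2":
            self.out.append("\n## ")
        elif tag in ("strong", "b"):
            self.out.append("**")
        elif tag == "li":
            self.out.append("\n- ")
        elif tag == "p":
            self.out.append("\n")

    def handle_endtag(self, tag):
        if tag in ("strong", "b"):
            self.out.append("**")

    def handle_data(self, data):
        self.out.append(data)

def to_markdown(html: str) -> str:
    parser = TinyMarkdown()
    parser.feed(html)
    return "".join(parser.out).strip()

print(to_markdown("<h1>Title</h1><p>Some <strong>bold</strong> text.</p>"))
```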
The savings compound across cost, context window, and answer quality. There's no scenario where naive HTML beats Markdown for general LLM workloads — only edge cases where specific HTML semantics matter enough to justify the token tax.