8 min read · MDisBetter

Markdown vs HTML for LLMs: Token Count Comparison (Real Numbers)

The advice "feed Markdown to LLMs, not HTML" is everywhere in AI engineering posts, usually with a vague "Markdown is more efficient" justification. We wanted concrete numbers. We took five representative web pages, captured both the raw HTML and the converted Markdown, ran both through tiktoken, and computed the token counts and the GPT-4o cost difference. Here are the actual numbers.

Methodology

Five URLs picked to span typical workloads:

  1. Wikipedia article (long encyclopedia entry, lots of references)
  2. Stripe API documentation page (heavy with code blocks)
  3. NYT-style news article (article body + ad-laden chrome)
  4. React docs page (JS-rendered, after rendering)
  5. GitHub README (already Markdown rendered to HTML)

For each URL we captured two artifacts: the raw HTML payload (the full page source, including scripts and styles) and the converted Markdown.

Tokens were counted with tiktoken.get_encoding("o200k_base") (the GPT-4o encoder).

Results

| Page | HTML tokens | Markdown tokens | Reduction | Ratio |
|---|---|---|---|---|
| Wikipedia article | 78,400 | 14,200 | 82% | 5.5x |
| Stripe API docs page | 42,800 | 9,100 | 79% | 4.7x |
| NYT-style article | 56,300 | 3,400 | 94% | 16.6x |
| React docs page | 34,500 | 4,800 | 86% | 7.2x |
| GitHub README | 21,200 | 3,600 | 83% | 5.9x |
| **Average** | **46,640** | **7,020** | **85%** | **~6.6x** |

Across the five pages, Markdown averaged 85% fewer tokens than the raw HTML — the same content, condensed by stripping markup, scripts, styles, and ad scaffolding.
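The reduction and ratio columns are simple arithmetic on the two token counts; a quick sanity check over the table's numbers:

```python
# (HTML tokens, Markdown tokens) per test page, from the table above
pages = {
    "Wikipedia article":    (78_400, 14_200),
    "Stripe API docs page": (42_800,  9_100),
    "NYT-style article":    (56_300,  3_400),
    "React docs page":      (34_500,  4_800),
    "GitHub README":        (21_200,  3_600),
}

for name, (html_t, md_t) in pages.items():
    reduction = 1 - md_t / html_t   # fraction of tokens removed
    ratio = html_t / md_t           # compression factor
    print(f"{name}: {reduction:.0%} reduction, {ratio:.1f}x")

avg_html = sum(h for h, _ in pages.values()) / len(pages)  # 46,640
avg_md = sum(m for _, m in pages.values()) / len(pages)    # 7,020
print(f"Average: {1 - avg_md / avg_html:.0%}, {avg_html / avg_md:.1f}x")
```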

Why such large differences?

Two effects compound:

1. Markdown syntax is more compact than HTML syntax

For the same semantics, Markdown uses fewer characters:

| Semantic | HTML | Markdown |
|---|---|---|
| Heading | `<h1>Title</h1>` (14 chars) | `# Title` (7 chars) |
| Bold | `<strong>X</strong>` (18 chars) | `**X**` (5 chars) |
| Link | `<a href="u">X</a>` (17 chars) | `[X](u)` (6 chars) |
| List item | `<li>X</li>` (10 chars) | `- X` (3 chars) |

Just from syntax compactness, Markdown saves roughly 50-70% on equivalent semantic content.
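The character counts above are plain `len()` arithmetic, so they're easy to verify:

```python
# (HTML form, Markdown form) for the same semantic element
pairs = {
    "heading":   ("<h1>Title</h1>",     "# Title"),
    "bold":      ("<strong>X</strong>", "**X**"),
    "link":      ('<a href="u">X</a>',  "[X](u)"),
    "list item": ("<li>X</li>",         "- X"),
}

for name, (html, md) in pairs.items():
    saving = 1 - len(md) / len(html)
    print(f"{name}: {len(html)} vs {len(md)} chars ({saving:.0%} saved)")
```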

2. HTML carries chrome that Markdown extraction strips

The bigger savings come from content extraction. Raw HTML includes:

- `<script>` and `<style>` blocks
- navigation bars, sidebars, and footers
- ad and tracking scaffolding
- attributes, metadata, and template markup that never reaches the reader

For ad-heavy sites like the NYT example, the chrome dwarfs the article body — that's why the reduction is 94% there. For lean sites like Wikipedia, chrome is smaller relative to content, but Markdown still wins by ~82%.
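Production converters do far more than this, but the core idea of chrome-stripping can be sketched with the standard library alone, assuming skipping whole `script`/`style`/`nav`/`footer` subtrees (a simplification of what real extractors like Firecrawl or Trafilatura do):

```python
from html.parser import HTMLParser

class ChromeStripper(HTMLParser):
    """Collect visible text, skipping chrome-bearing subtrees."""
    SKIP = {"script", "style", "nav", "aside", "footer"}

    def __init__(self):
        super().__init__()
        self.depth = 0   # nesting level inside skipped subtrees
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth == 0 and data.strip():
            self.parts.append(data.strip())

page = ('<nav>Home | Pricing</nav><h1>Title</h1>'
        '<script>track("pageview")</script><p>Article body.</p>'
        '<footer>Site footer</footer>')
p = ChromeStripper()
p.feed(page)
print(" ".join(p.parts))  # → Title Article body.
```

The nav, tracking script, and footer vanish; only the heading and body text survive, which is where the bulk of the token reduction comes from.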

GPT-4o cost math

GPT-4o pricing as of writing (verify on OpenAI's current pricing page): $2.50 per 1M input tokens. Output pricing doesn't enter this comparison, since we're only measuring input size.

For our test set, the per-page input cost difference:

| Page | HTML cost | Markdown cost | Saving per page |
|---|---|---|---|
| Wikipedia article | $0.196 | $0.0355 | $0.16 |
| Stripe API docs | $0.107 | $0.0228 | $0.084 |
| NYT-style article | $0.141 | $0.0085 | $0.132 |
| React docs page | $0.086 | $0.0120 | $0.074 |
| GitHub README | $0.053 | $0.0090 | $0.044 |
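The per-page figures are just tokens times the input rate. A sketch, assuming the $2.50/1M input price (verify against current pricing):

```python
PRICE_PER_TOKEN = 2.50 / 1_000_000  # assumed GPT-4o input rate, USD per token

def input_cost(tokens: int) -> float:
    """Input cost in USD for a given token count."""
    return tokens * PRICE_PER_TOKEN

# Wikipedia article row from the table
html_cost = input_cost(78_400)   # 0.196
md_cost = input_cost(14_200)     # 0.0355
print(f"${html_cost:.3f} vs ${md_cost:.4f}, saving ${html_cost - md_cost:.2f}")
```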

The per-page savings look small ($0.04-$0.16). They aren't — at scale, this multiplies fast.

Scale math

Imagine a RAG pipeline ingesting 10,000 web pages per day. At the test-set averages and $2.50 per 1M input tokens:

- Raw HTML: 10,000 × 46,640 ≈ 466M tokens/day ≈ $1,166/day ≈ $35,000/month
- Markdown: 10,000 × 7,020 ≈ 70M tokens/day ≈ $176/day ≈ $5,300/month
- Difference: roughly $30,000/month

For a single-developer side project ingesting 100 pages/month, the savings are modest (about $11.70/month vs $1.80/month at the test-set averages; pick whichever is easier). For any scale that matters commercially, the Markdown approach is dramatically cheaper.

Context window implications

Cost isn't the only concern. Context windows are finite. GPT-4o has a 128k context window. Claude has up to 200k. Even Gemini's 1M+ context windows aren't infinite for cost reasons.

If you're stuffing pages into a context window for a long-form task, the number of pages you can fit rises roughly 6-10x with Markdown vs HTML. At the test-set averages, a 128k window holds about 2-3 raw-HTML pages but roughly 18 Markdown pages.

For RAG specifically, this means more relevant chunks per query budget. For long-document tasks, more documents per call. For agent loops with iterative reasoning, more breathing room before context-window pressure forces summarization.
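At the test-set averages, the fit difference is straightforward integer division across the window sizes mentioned above:

```python
AVG_HTML, AVG_MD = 46_640, 7_020   # average tokens/page from the results

# GPT-4o-, Claude-, and Gemini-class windows respectively
for window in (128_000, 200_000, 1_000_000):
    print(f"{window:>9,} tokens: "
          f"{window // AVG_HTML} HTML pages vs {window // AVG_MD} Markdown pages")
```

In practice you'd reserve part of the window for the prompt and the model's output, so treat these as upper bounds.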

Quality implications

Token efficiency is the headline number. The often-overlooked benefit is answer quality. Models reasoning over Markdown tend to produce better answers than the same models reasoning over HTML for the same content: the signal isn't buried in tag soup and boilerplate, and Markdown's headings, lists, and emphasis convey document structure in a form models parse reliably.

We discuss the broader argument in best format for LLM input. For PDFs specifically, the equivalent comparison is in our PDF vs Markdown token comparison — same conclusion, different format.

Honest caveats

Where HTML might preserve more useful info than Markdown:

- semantic and data attributes (`id`, `class`, `data-*`) that extraction pipelines key on
- accessibility markup (ARIA roles and related attributes)
- interactive elements (forms, inputs, buttons) that Markdown has no syntax for

For most LLM use cases (Q&A, summarization, RAG, agent reasoning), none of these matter. For specialized cases (web data extraction, accessibility analysis, form filling), keep the HTML.

Practical recommendation

For any LLM workflow consuming web content:

  1. Convert HTML to Markdown before feeding to the model. Use a quality converter — our URL-to-Markdown tool, Firecrawl, or Jina Reader. See the 2026 review.
  2. For RAG pipelines, use the RAG-specific variant that emits chunked output ready for embedding.
  3. Track token counts of your inputs (use our token counter or tiktoken locally). Knowing your token bill is the first step to managing it.
  4. Don't optimize for HTML preservation unless you have a specific use case that requires it. The default should be Markdown.

The savings compound across cost, context window, and answer quality. There's no scenario where naive HTML beats Markdown for general LLM workloads — only edge cases where specific HTML semantics matter enough to justify the token tax.

Frequently asked questions

Are these numbers reproducible? The actual token counts seem high for the pages.
Counts include the full HTML payload — scripts, styles, hidden elements, all attributes. If you fetched the same pages today the exact numbers would shift (sites change), but the order of magnitude (~6-10x reduction Markdown vs HTML) is consistent across our testing. Run the same test on your representative pages with tiktoken to verify for your workload.
Does this hold for non-OpenAI models like Claude or Gemini?
Yes. Tokenizers differ slightly (Claude uses a different BPE), but the ratio between HTML and Markdown token counts is roughly the same across encoders — both because Markdown syntax is more compact universally and because content extraction strips chrome regardless of tokenizer. The cost calculus differs by model pricing; the efficiency principle is the same.
Should I always strip images and links to reduce tokens further?
Depends on the task. For pure text understanding, yes — images and link URLs add tokens without much information. For tasks where attribution or visual references matter (e.g., "summarize this article and cite the linked sources"), keep them. If you're rolling your own extraction with Trafilatura, the include_links and include_images parameters control exactly this trade-off.