
HTML Is Killing Your LLM Token Budget (Real Numbers)

You're feeding webpages to an LLM and the bill is bigger than you expected. The model is reading more than the article — it's reading the entire navigation, the footer, the cookie banner, twelve embedded JSON-LD blocks, and the inline scripts that load three different analytics vendors. By raw token count, the article you actually care about is usually 10-20% of the input. The other 80-90% is noise you're paying for, that's diluting the model's attention, and that's almost trivially removable. Here are the real numbers, on real pages, with real cost math.

Anatomy of a typical webpage: where the bytes go

Open any modern webpage's source code and the article body is a needle in a haystack. The page is built around the article, but the article is a small part of what's actually shipped to the browser. A typical news or blog page contains:

  - Navigation menus, headers, and footers repeated on every page
  - Cookie consent banners and privacy overlays
  - JSON-LD and other structured-data blocks added for SEO
  - Inline scripts for analytics, advertising, and A/B testing
  - Recommendation widgets, related-article carousels, and social share buttons

None of this matters to the LLM you're going to feed the page to. None of it answers your question. All of it costs tokens. The signal-to-noise ratio on a typical webpage is somewhere between 10% and 20%, meaning 80-90% of the bytes you're sending to the model are wasted.

The token test: five representative pages

To put numbers on this, we ran five real pages through a tiktoken-based estimator using the cl100k_base encoding (the tokenizer behind GPT-4 and GPT-3.5; GPT-4o's own o200k_base tokenizer gives counts in the same ballpark). For each page we measured the raw HTML token count and the equivalent clean Markdown token count after extraction. The results are typical of what you'll see on any modern web page.
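If you want to reproduce the measurement yourself, a minimal estimator looks something like this; it assumes you've already saved the raw page and its extracted Markdown to local files (the filenames are placeholders) and needs only the tiktoken package.

```python
# pip install tiktoken
import tiktoken

def count_tokens(text: str, encoding_name: str = "cl100k_base") -> int:
    """Token count as a GPT-4-class tokenizer sees it."""
    enc = tiktoken.get_encoding(encoding_name)
    # disallowed_special=() so stray special-token strings in scraped HTML don't raise
    return len(enc.encode(text, disallowed_special=()))

# Placeholder filenames: the raw page source and the extracted article.
raw_html = open("article.html", encoding="utf-8").read()
clean_md = open("article.md", encoding="utf-8").read()

html_tokens = count_tokens(raw_html)
md_tokens = count_tokens(clean_md)

print(f"HTML:      {html_tokens:,} tokens")
print(f"Markdown:  {md_tokens:,} tokens")
print(f"Reduction: {1 - md_tokens / html_tokens:.0%}")
```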

| Page type | HTML tokens (approx) | Markdown tokens (approx) | Reduction |
|---|---|---|---|
| NYT-style news article (~1,200 words) | ~24,000 | ~1,800 | 92% |
| Stripe-style API documentation page | ~18,000 | ~3,500 | 81% |
| Wikipedia article (medium-length) | ~52,000 | ~6,500 | 87% |
| Reddit thread (~50 comments) | ~95,000 | ~7,200 | 92% |
| GitHub README (medium project) | ~28,000 | ~3,800 | 86% |

Two patterns jump out. First, the reduction is consistently in the 80-92% range — this isn't a fringe case, it's the baseline. Second, the bigger and more interactive the page (Reddit thread, Wikipedia), the more dramatic the savings, because those pages carry more chrome, more widgets, more inline scripts.

These are estimates, not exact figures — token counts vary by tokenizer version and page-specific markup quirks. But the order of magnitude is rock solid. If you're feeding raw HTML to an LLM, you're paying roughly 5-10x what you should be.

The cost math at GPT-4o pricing

GPT-4o input pricing at the time of writing is roughly $2.50 per million input tokens (output is $10/M but we're focused on input here, since the page goes in as input). Let's run the math on a realistic workload: 1,000 pages per month, fed to the model for summarization or Q&A.

Take the news article example. At ~24,000 HTML tokens versus ~1,800 Markdown tokens per page:

  - Raw HTML: 24,000 × 1,000 = 24M input tokens/month, about $60/month or $720/year
  - Clean Markdown: 1,800 × 1,000 = 1.8M input tokens/month, about $4.50/month or $54/year

That's roughly $55 a month left on the table for a single modest workload.

Scale up to 10,000 pages/month — say a content moderation system, a research aggregator, or a competitive intelligence tool — and the same article-type gives:

  - Raw HTML: 240M input tokens/month, about $600/month or $7,200/year
  - Clean Markdown: 18M input tokens/month, about $45/month or $540/year

For Reddit-thread-sized inputs (~95k HTML vs ~7k Markdown) the numbers get genuinely painful: 10,000 pages/month is $2,375/month vs $180/month, or roughly $26,000 saved per year on a single workload by switching format.
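The arithmetic above is simple enough to sanity-check in a few lines. This sketch just restates the per-page token counts and the $2.50/M input price from this section as variables, so you can plug in your own workload:

```python
INPUT_PRICE_PER_M = 2.50  # USD per million input tokens (GPT-4o, at time of writing)

def monthly_cost(tokens_per_page: int, pages_per_month: int) -> float:
    """Input cost in USD for one month of a workload."""
    return tokens_per_page * pages_per_month / 1_000_000 * INPUT_PRICE_PER_M

# (html_tokens_per_page, markdown_tokens_per_page, pages_per_month)
scenarios = {
    "News article, 1k pages/mo":   (24_000, 1_800, 1_000),
    "News article, 10k pages/mo":  (24_000, 1_800, 10_000),
    "Reddit thread, 10k pages/mo": (95_000, 7_200, 10_000),
}

for name, (html_tok, md_tok, pages) in scenarios.items():
    html_cost = monthly_cost(html_tok, pages)
    md_cost = monthly_cost(md_tok, pages)
    print(f"{name}: ${html_cost:,.2f}/mo raw HTML vs ${md_cost:,.2f}/mo Markdown, "
          f"${(html_cost - md_cost) * 12:,.0f} saved per year")
```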

And these are just the input savings. The output side benefits too, because cleaner input means tighter, more accurate outputs — the model doesn't waste its response budget summarizing irrelevant chrome.

The hidden cost: context window and quality

The dollars matter, but the bigger problem is often the context window. GPT-4o has 128k tokens of context. A single Reddit-thread-style page in raw HTML can eat 95k of that, leaving you almost no room for a system prompt, conversation history, or follow-up questions. Convert to Markdown and the same content fits in 7k, leaving 121k for everything else.

Quality also degrades with bloat. The "lost in the middle" effect — where models pay less attention to content in the middle of a long context — is well documented. When you fill 90% of the context with noise, the relevant 10% is statistically more likely to land in the dead zone of model attention. You're not just paying more, you're getting worse answers for the higher price.

The same problem exists with PDFs — see why PDF wastes your AI tokens and token count: PDF vs Markdown real comparison for the document-side equivalent. Different format, same fundamental waste pattern.

Why every webpage is this bad

It's not the developers' fault. Modern websites are this heavy because they have to be: SEO needs the structured metadata, advertising needs the script tags, regulators need the consent banners, marketing needs the recommendation widgets, A/B testing needs the experiment scaffolding. The page is correctly engineered for a browser visit by a human; it's catastrophically wrong as input to an LLM.

The fix isn't to ask websites to ship lighter HTML — they won't, and shouldn't. The fix is to do the cleanup yourself, once, before the page hits the model. Extract the article, drop everything else, hand the LLM the clean Markdown.

The fix: convert before you feed

Open /convert/url-to-markdown. Paste the URL. Hit convert. Download the .md file or copy the Markdown text. Feed that to the LLM instead of the raw page.

That's the entire fix. The work happens once, you get a stable clean artifact, and every subsequent token spent on the file goes to actual content rather than navigation chrome.

For an LLM-pipeline-focused setup, see URL to Markdown for LLM. For LangChain pipelines specifically, see URL to Markdown for LangChain. For ChatGPT-specific workflows, see URL to Markdown for ChatGPT.

Doing it at scale

If you have a pipeline that processes many URLs per day, you have two reasonable patterns:

  1. Use the web tool ad-hoc for one-off pages and research workflows. Paste, convert, copy. Thirty seconds per URL. Fine for human-driven research.
  2. Run an open-source extractor locally for high-volume automation. Trafilatura (Python) is one of the best open-source HTML-to-clean-text extractors available — it's fast, accurate, and can emit Markdown directly in recent versions; for JavaScript-rendered pages, fetch the rendered HTML first (for example with a headless browser like Playwright) and hand that to the extractor. html2text, Mozilla's readability.js, and Newspaper3k are also reasonable options depending on language and use case.

For self-hosted automation, point Trafilatura or Readability at your URL list, write the resulting .md files to disk or a vector store, and feed them to your LLM pipeline. The conversion runs entirely on your machines — there's nothing to integrate with mdisbetter for this case.
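A minimal version of that loop, assuming a recent Trafilatura release (Markdown output is only available in newer versions) and a urls.txt file with one URL per line, might look like this:

```python
# pip install trafilatura
from pathlib import Path
import trafilatura

urls = Path("urls.txt").read_text(encoding="utf-8").splitlines()
out_dir = Path("corpus")
out_dir.mkdir(exist_ok=True)

for i, url in enumerate(urls):
    html = trafilatura.fetch_url(url)  # plain HTTP fetch; no JavaScript rendering
    if html is None:
        print(f"skip (fetch failed): {url}")
        continue
    # output_format="markdown" needs a recent Trafilatura version; use "txt" on older ones
    markdown = trafilatura.extract(html, output_format="markdown", url=url)
    if markdown is None:
        print(f"skip (no article found): {url}")
        continue
    (out_dir / f"{i:05d}.md").write_text(markdown, encoding="utf-8")
```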

What if you really need the HTML structure?

Edge case worth addressing: sometimes you need to know about the original HTML — for instance, if you're building a tool that analyzes the page's actual structure (DOM analysis, accessibility audit, ad-tech inspection). In those cases, raw HTML is the right input.

But for content-oriented tasks — summarization, Q&A, claim extraction, classification, RAG indexing, knowledge base ingestion — you want the article text, not the markup. Convert first, feed second.

Combining with PDF workflow

Most real-world pipelines deal with both webpages and PDF documents — research papers, company reports, regulatory filings. The same principle applies to both. Use PDF to Markdown for documents, URL to Markdown for webpages, then your downstream pipeline operates on a single uniform Markdown corpus regardless of source format. This dramatically simplifies chunking, indexing, and retrieval logic.
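As a sketch of what "one uniform Markdown corpus" looks like in code, a small routing function can normalize both source types before anything downstream runs. pymupdf4llm is used here as one open-source PDF-to-Markdown option; any converter with equivalent output would do, and the function name is just illustrative.

```python
# pip install trafilatura pymupdf4llm
from pathlib import Path
import trafilatura
import pymupdf4llm  # one open-source PDF -> Markdown option; swap in your preferred converter

def to_markdown(source: str) -> str | None:
    """Normalize a URL or a local PDF path to plain Markdown."""
    if source.startswith(("http://", "https://")):
        html = trafilatura.fetch_url(source)
        return trafilatura.extract(html, output_format="markdown") if html else None
    if source.lower().endswith(".pdf"):
        return pymupdf4llm.to_markdown(source)
    return Path(source).read_text(encoding="utf-8")  # assume it's already Markdown or plain text

# Chunking, indexing, and retrieval downstream now see a single format regardless of origin.
```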

For background on why structure matters as much as size, see how to save a webpage so AI can actually read it.

The summary

HTML is roughly 5-10x larger than the equivalent Markdown for the same content. At GPT-4o input pricing, this is the difference between a $54/year and a $720/year bill on a single 1,000-page-per-month workload, or 10x that at higher volumes. The quality of LLM responses also degrades when input is bloated, both because of attention dilution and because chunking and retrieval downstream become noisier.

The fix takes thirty seconds: convert the URL to Markdown before you feed it to the model. The savings are immediate and compound across every interaction. There's no scenario where feeding raw HTML is the right call for content-oriented LLM work — the format is wrong for the job, and a one-step conversion fixes it.

One final calibration

If your AI-powered product feels expensive to run and you haven't audited what you're actually feeding the models, this is probably the highest-leverage place to look. A team migrating from raw-HTML to Markdown ingestion typically sees their LLM input bill drop 80%+ overnight, and answer quality improve in parallel. Two wins from one cheap fix. Worth an afternoon.

Frequently asked questions

Are these token counts exact or estimates?
Estimates. Exact counts depend on the tokenizer version, page-specific markup quirks, and how aggressively you strip the HTML. The order of magnitude (80-92% reduction) is consistent across every realistic page we've measured, but for any specific page expect ±10-20% variance from the figures shown.
Does the same logic apply to Claude or Gemini?
Yes. Different models use different tokenizers, but the underlying ratio of article-to-noise on a webpage is the same regardless of which model consumes it. Pricing differs (Claude and Gemini have their own per-token costs), but the proportional savings are essentially identical.
What about prompt caching — does that change the math?
Prompt caching helps when you re-use the same content across many calls, which somewhat reduces the cost of large inputs. It doesn't change the underlying noise problem (the cached content is still mostly chrome) and it doesn't help when each call uses a different page. For most webpage-to-LLM workflows, the format-change savings dominate the caching savings.