HTML Is Killing Your LLM Token Budget (Real Numbers)
You're feeding webpages to an LLM and the bill is bigger than you expected. The model is reading more than the article — it's reading the entire navigation, the footer, the cookie banner, twelve embedded JSON-LD blocks, and the inline scripts that load three different analytics vendors. By raw token count, the article you actually care about is usually 10-20% of the input. The other 80-90% is noise: you pay for it, it dilutes the model's attention, and it's almost trivially removable. Here are the real numbers, on real pages, with real cost math.
Anatomy of a typical webpage: where the bytes go
Open any modern webpage's source code and the article body is a needle in a haystack. The page is built around the article, but the article is a small part of what's actually shipped to the browser. A typical news or blog page contains:
- The visible article. Headings, paragraphs, lists, occasionally a table or image — the thing you came to read.
- Navigation chrome. Top nav, side nav, breadcrumbs, mega-menus, footer nav. Repeated on every page of the site.
- Inline scripts. Analytics tags, A/B testing scripts, consent management platforms, pixel trackers, ad-network glue.
- Inline styles. Either critical CSS injected at the top, or per-element style attributes from the CMS.
- Hidden metadata. Open Graph tags, Twitter card tags, schema.org JSON-LD blocks (often duplicated 2-4 times for different consumers), canonical URLs, hreflang sets.
- Recommendation widgets. Pre-rendered "related articles" sections that ship dozens of titles, summaries, thumbnails, and click trackers.
- Comment sections. Either inline or as a script template that hydrates later.
- Newsletter modals, social share buttons, cookie banners, exit-intent popups. All shipped in the HTML even when not visible.
None of this matters to the LLM you're going to feed the page to. None of it answers your question. All of it costs tokens. The signal-to-noise ratio on a typical webpage is somewhere between 10% and 20% — meaning 80-90% of the bytes you're sending to the model are wasted.
The token test: five representative pages
To put numbers on this, we ran five real pages through a tiktoken-based estimator using the cl100k_base encoding (the one used by GPT-4-class models; GPT-4o's own o200k_base encoding gives counts in the same range). For each page we measured the raw HTML token count and the equivalent clean Markdown token count after extraction. The results are typical of what you'll see on any modern webpage.
| Page type | HTML tokens (approx) | Markdown tokens (approx) | Reduction |
|---|---|---|---|
| NYT-style news article (~1,200 words) | ~24,000 | ~1,800 | 92% |
| Stripe-style API documentation page | ~18,000 | ~3,500 | 81% |
| Wikipedia article (medium-length) | ~52,000 | ~6,500 | 87% |
| Reddit thread (~50 comments) | ~95,000 | ~7,200 | 92% |
| GitHub README (medium project) | ~28,000 | ~3,800 | 86% |
Two patterns jump out. First, the reduction is consistently in the 80-92% range — this isn't a fringe case, it's the baseline. Second, the bigger the page (Reddit thread, Wikipedia), the larger the absolute savings, because those pages carry more chrome, more widgets, more inline scripts.
These are estimates, not exact figures — token counts vary by tokenizer version and page-specific markup quirks. But the order of magnitude is rock solid. If you're feeding raw HTML to an LLM, you're paying roughly 5-10x what you should be.
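Running this measurement yourself is a few lines of Python. A minimal sketch of the estimator described above, assuming tiktoken is installed (`pip install tiktoken`); exact counts will vary by page and tokenizer version, as noted:

```python
def count_tokens(text: str, encoding_name: str = "cl100k_base") -> int:
    """Token count as seen by GPT-4-class models."""
    import tiktoken  # deferred import: pip install tiktoken
    enc = tiktoken.get_encoding(encoding_name)
    return len(enc.encode(text))

def reduction_pct(html_tokens: int, md_tokens: int) -> float:
    """Percentage of input tokens saved by converting HTML to Markdown."""
    return round(100 * (1 - md_tokens / html_tokens), 1)

# With the news-article figures from the table:
# reduction_pct(24_000, 1_800) -> 92.5
```

Point `count_tokens` at the raw HTML and at the extracted Markdown for the same page, then compare.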
The cost math at GPT-4o pricing
GPT-4o input pricing at the time of writing is roughly $2.50 per million input tokens (output is $10/M but we're focused on input here, since the page goes in as input). Let's run the math on a realistic workload: 1,000 pages per month, fed to the model for summarization or Q&A.
Take the news article example. At ~24,000 HTML tokens versus ~1,800 Markdown tokens per page:
- 1,000 pages × 24,000 HTML tokens = 24M tokens/month → $60/month just for input on the HTML route.
- 1,000 pages × 1,800 Markdown tokens = 1.8M tokens/month → $4.50/month on the Markdown route.
- Annualized: $720 vs $54. A $666 difference per year on a tiny workload.
Scale up to 10,000 pages/month — say a content moderation system, a research aggregator, or a competitive intelligence tool — and the same article-type gives:
- $600/month vs $45/month → roughly $6,660/year saved on input cost alone.
For Reddit-thread-sized inputs (~95k HTML vs ~7k Markdown) the numbers get genuinely painful: 10,000 pages/month is $2,375/month vs $180/month, or roughly $26,000 saved per year on a single workload by switching format.
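The cost arithmetic above is simple enough to script for your own volumes. A minimal sketch using the ~$2.50/M input price quoted earlier; the page sizes and volumes plugged in are the illustrative figures from the table:

```python
GPT4O_INPUT_PRICE = 2.50  # dollars per million input tokens (price quoted above)

def monthly_input_cost(pages_per_month: int, tokens_per_page: int) -> float:
    """Monthly input spend in dollars for a fixed per-page token size."""
    return pages_per_month * tokens_per_page / 1_000_000 * GPT4O_INPUT_PRICE

# News-article workload at 1,000 pages/month:
html_cost = monthly_input_cost(1_000, 24_000)  # 60.0 -> $720/year
md_cost = monthly_input_cost(1_000, 1_800)     # 4.5  -> $54/year

# Reddit-thread workload at 10,000 pages/month:
big_html = monthly_input_cost(10_000, 95_000)  # 2375.0
big_md = monthly_input_cost(10_000, 7_200)     # 180.0
```

Swap in your own page counts and token sizes to see where your bill actually goes.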
And these are just the input savings. The output side benefits too, because cleaner input means tighter, more accurate outputs — the model doesn't waste its response budget summarizing irrelevant chrome.
The hidden cost: context window and quality
The dollars matter, but the bigger problem is often the context window. GPT-4o has 128k tokens of context. A single Reddit-thread-style page in raw HTML can eat 95k of that, leaving you almost no room for a system prompt, conversation history, or follow-up questions. Convert to Markdown and the same content fits in 7k, leaving 121k for everything else.
Quality also degrades with bloat. The "lost in the middle" effect — where models pay less attention to content in the middle of a long context — is well documented. When you fill 90% of the context with noise, the relevant 10% is statistically more likely to land in the dead zone of model attention. You're not just paying more, you're getting worse answers for the higher price.
The same problem exists with PDFs — see why PDF wastes your AI tokens and token count: PDF vs Markdown real comparison for the document-side equivalent. Different format, same fundamental waste pattern.
Why every webpage is this bad
It's not the developers' fault. Modern websites are this heavy because they have to be: SEO needs the structured metadata, advertising needs the script tags, regulators need the consent banners, marketing needs the recommendation widgets, A/B testing needs the experiment scaffolding. The page is correctly engineered for a browser visit by a human; it's catastrophically wrong as input to an LLM.
The fix isn't to ask websites to ship lighter HTML — they won't, and shouldn't. The fix is to do the cleanup yourself, once, before the page hits the model. Extract the article, drop everything else, hand the LLM the clean Markdown.
The fix: convert before you feed
Open /convert/url-to-markdown. Paste the URL. Hit convert. Download the .md file or copy the Markdown text. Feed that to the LLM instead of the raw page.
That's the entire fix. The work happens once, you get a stable clean artifact, and every subsequent token spent on the file goes to actual content rather than navigation chrome.
For an LLM-pipeline-focused setup, see URL to Markdown for LLM. For LangChain pipelines specifically, see URL to Markdown for LangChain. For ChatGPT-specific workflows, see URL to Markdown for ChatGPT.
Doing it at scale
If you have a pipeline that processes many URLs per day, you have two reasonable patterns:
- Use the web tool ad-hoc for one-off pages and research workflows. Paste, convert, copy. Thirty seconds per URL. Fine for human-driven research.
- Run an open-source extractor locally for high-volume automation. Trafilatura (Python) is one of the best open-source HTML-to-clean-text extractors available: it's fast, accurate, and can emit Markdown directly. It fetches static HTML, so for JavaScript-rendered pages you'd pair it with a headless browser such as Playwright to render the page first.
html2text, Mozilla's readability.js, and Newspaper3k are also reasonable options depending on language and use case.
For self-hosted automation, point Trafilatura or Readability at your URL list, write the resulting .md files to disk or a vector store, and feed them to your LLM pipeline. The conversion runs entirely on your machines — there's nothing to integrate with mdisbetter for this case.
What if you really need the HTML structure?
Edge case worth addressing: sometimes you need to know about the original HTML — for instance, if you're building a tool that analyzes the page's actual structure (DOM analysis, accessibility audit, ad-tech inspection). In those cases, raw HTML is the right input.
But for content-oriented tasks — summarization, Q&A, claim extraction, classification, RAG indexing, knowledge base ingestion — you want the article text, not the markup. Convert first, feed second.
Combining with PDF workflow
Most real-world pipelines deal with both webpages and PDF documents — research papers, company reports, regulatory filings. The same principle applies to both. Use PDF to Markdown for documents, URL to Markdown for webpages, then your downstream pipeline operates on a single uniform Markdown corpus regardless of source format. This dramatically simplifies chunking, indexing, and retrieval logic.
For background on why structure matters as much as size, see how to save a webpage so AI can actually read it.
The summary
HTML is roughly 5-10x larger than the equivalent Markdown for the same content. At GPT-4o input pricing, this is the difference between a $54/year and a $720/year bill on a single 1,000-page-per-month workload, or 10x that at higher volumes. The quality of LLM responses also degrades when input is bloated, both because of attention dilution and because chunking and retrieval downstream become noisier.
The fix takes thirty seconds: convert the URL to Markdown before you feed it to the model. The savings are immediate and compound across every interaction. There's no scenario where feeding raw HTML is the right call for content-oriented LLM work — the format is wrong for the job, and a one-step conversion fixes it.
One final calibration
If your AI-powered product feels expensive to run and you haven't audited what you're actually feeding the models, this is probably the highest-leverage place to look. A team migrating from raw-HTML to Markdown ingestion typically sees their LLM input bill drop 80%+ overnight, and answer quality improve in parallel. Two wins from one cheap fix. Worth an afternoon.