7 min read · MDisBetter

Why PDF Wastes 95% of Your AI Tokens (Real Numbers)

Take a 50-page PDF, paste it into ChatGPT, and watch the token counter. You'll see something like 80,000 tokens of input. Now take the same document, convert it to Markdown, paste that. The token counter shows 18,000 — a 78% reduction. On scanned PDFs the reduction routinely exceeds 95%. Where did all those tokens go? They went to noise the model didn't need.
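The arithmetic behind that headline number, using the token counts from the example above:

```python
def reduction(before: int, after: int) -> float:
    """Percent of input tokens eliminated by conversion."""
    return (before - after) / before * 100

pdf_tokens = 80_000       # 50-page document, pasted as raw PDF extraction
markdown_tokens = 18_000  # same document after Markdown conversion

print(f"{reduction(pdf_tokens, markdown_tokens):.0f}% fewer input tokens")
# prints "78% fewer input tokens"
```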

Inside a PDF — what AI actually receives

A PDF file isn't a document in the way you think. It's a sequence of low-level drawing instructions: place glyph U+0048 at coordinate (72.0, 711.5); place glyph U+0065 at (78.4, 711.5); fill rectangle from (72, 700) to (528, 720) with color #f0f0f0. There's no concept of "paragraph", "heading", or "table" anywhere in the file format.

When ChatGPT or Claude receives a PDF, an extraction pipeline tries to reconstruct readable text from those drawing instructions. The pipeline walks every glyph, guesses reading order, infers paragraph boundaries from whitespace, and produces a flat text stream. That flat stream is what the LLM actually sees — not the rendered visual document, not the original Word source, just the reconstructed glyphs in best-guess order.
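A toy sketch of what that reconstruction step does, under a deliberately simplified model where each drawing instruction is an (x, y, char) tuple. Real extractors also handle rotated text, ligatures, font-size changes, and multi-column detection, which is exactly where the guessing goes wrong:

```python
# Each "drawing instruction" places one character at a page coordinate.
# PDF y-coordinates grow upward, so a larger y means nearer the top.
glyphs = [
    (72.0, 711.5, "H"), (78.4, 711.5, "e"), (83.9, 711.5, "l"),
    (86.6, 711.5, "l"), (89.3, 711.5, "o"),
    (72.0, 695.0, "w"), (80.1, 695.0, "o"), (85.6, 695.0, "r"),
    (89.2, 695.0, "l"), (91.9, 695.0, "d"),
]

def reconstruct(glyphs, line_tolerance=2.0):
    """Guess reading order: group glyphs into lines by y, sort each line by x."""
    lines = {}
    for x, y, ch in glyphs:
        # Bucket y-coordinates so small baseline jitter doesn't split a line.
        key = round(y / line_tolerance)
        lines.setdefault(key, []).append((x, ch))
    ordered = []
    for key in sorted(lines, reverse=True):  # top of the page first
        ordered.append("".join(ch for _, ch in sorted(lines[key])))
    return "\n".join(ordered)

print(reconstruct(glyphs))  # prints "Hello" then "world"
```

Everything the LLM receives from a PDF has passed through a heuristic like this one, only with thousands of glyphs per page.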

The reconstruction is messy. Typical artifacts include:

- page headers and footers repeated on every page
- page numbers and other page furniture
- hyphenation breaks and hard line wraps inherited from the print layout
- table cells flattened into loose, unaligned text
- reading-order mistakes from multi-column layouts

You're paying full input-token rates for all of it.

The 95% number — where it comes from

We tested 10 production documents — academic papers, financial reports, contracts, manuals — and measured tokens in three forms: raw PDF (as ChatGPT receives it), our Markdown conversion, and a hypothetical "signal-only" baseline (just the answer-relevant content, hand-curated).

| Document type | PDF tokens | Markdown tokens | Signal-only tokens | PDF noise % |
|---|---|---|---|---|
| Scanned legal contract | 89,000 | 18,500 | 4,200 | 95% |
| Multi-column research paper | 42,800 | 11,200 | 3,800 | 91% |
| Financial 10-K | 198,000 | 72,000 | 22,000 | 89% |
| Slide deck PDF export | 52,000 | 11,800 | 5,400 | 90% |
| SaaS pricing brochure | 48,200 | 14,800 | 4,800 | 90% |
| User manual | 118,000 | 38,500 | 15,000 | 87% |
| Government report | 171,000 | 58,200 | 22,400 | 87% |
| Technical whitepaper | 39,200 | 13,400 | 5,200 | 87% |
| Academic textbook chapter | 67,400 | 22,600 | 9,200 | 86% |
| Marketing one-pager | 9,800 | 3,400 | 1,800 | 82% |

The "PDF noise %" column is computed as (PDF tokens − signal-only tokens) / PDF tokens. The lowest figure is 82% noise, on a clean one-page marketing document; the highest is 95% noise, on a scanned legal contract. Averaged across the 10 documents, 88% of the input tokens ChatGPT receives are noise the model doesn't need.
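The per-document figures and the 88% average can be reproduced directly from the table (a sketch; the rounding convention is an assumption):

```python
# (PDF tokens, signal-only tokens) per document, taken from the table above.
docs = {
    "Scanned legal contract":      (89_000, 4_200),
    "Multi-column research paper": (42_800, 3_800),
    "Financial 10-K":              (198_000, 22_000),
    "Slide deck PDF export":       (52_000, 5_400),
    "SaaS pricing brochure":       (48_200, 4_800),
    "User manual":                 (118_000, 15_000),
    "Government report":           (171_000, 22_400),
    "Technical whitepaper":        (39_200, 5_200),
    "Academic textbook chapter":   (67_400, 9_200),
    "Marketing one-pager":         (9_800, 1_800),
}

def noise_pct(pdf: int, signal: int) -> float:
    """Share of PDF input tokens that aren't answer-relevant signal."""
    return (pdf - signal) / pdf * 100

percentages = [noise_pct(p, s) for p, s in docs.values()]
for name, (p, s) in docs.items():
    print(f"{name}: {noise_pct(p, s):.0f}% noise")
print(f"Average: {sum(percentages) / len(percentages):.0f}% noise")
# Average line prints "Average: 88% noise"
```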

Markdown conversion doesn't get all the way down to the hand-curated signal-only baseline (some structural overhead is necessary), but it cuts the noise by 60–80% in every case. The remaining gap to signal-only is the cost of preserving structure for the LLM to navigate — which is structure the model uses, not noise.

Token breakdown by PDF element

Decomposing the average PDF input by what its tokens encode gives only approximate numbers, but the proportions are stable across document types. The takeaway: when you upload a PDF to ChatGPT, you're paying roughly 9× more for content delivery than you'd pay with Markdown conversion in the loop.

The solution — strip the noise, keep the content

The intervention is one preprocessing step: convert the PDF to Markdown before sending it to your LLM. Our PDF to Markdown converter handles digital and scanned PDFs, preserves headings/lists/tables/equations, and strips the page furniture and layout artifacts that bloat raw PDF extraction.

The savings show up immediately on per-conversation cost (60–80% reduction on input tokens) and on answer quality (the model isn't reasoning over noise). For programmatic workloads, MDisBetter doesn't currently offer a public API — the realistic path is an OSS converter like Marker (Apache 2.0), Docling (MIT), or PyMuPDF dropped in as a Python preprocessing step before your existing OpenAI or Anthropic call. Same token reduction; you control the pipeline.
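The per-conversation savings follow directly from the token counts. A minimal sketch, using the scanned-contract row from the table and an illustrative input price of $3 per million tokens (the price is an assumption; substitute your model's actual rate):

```python
PRICE_PER_MTOK = 3.00  # assumed input price, USD per million tokens

def input_cost(tokens: int) -> float:
    """Input cost in USD at the assumed per-million-token rate."""
    return tokens / 1_000_000 * PRICE_PER_MTOK

pdf_tokens = 89_000       # scanned legal contract, raw PDF extraction
markdown_tokens = 18_500  # same document after Markdown conversion

saved = input_cost(pdf_tokens) - input_cost(markdown_tokens)
pct = (pdf_tokens - markdown_tokens) / pdf_tokens * 100
print(f"${input_cost(pdf_tokens):.3f} vs ${input_cost(markdown_tokens):.3f} "
      f"per query: ${saved:.3f} saved ({pct:.0f}% fewer input tokens)")
```

Per query the dollar amounts look small; multiplied across every message in a long conversation (where the document is re-sent with the context) and across a team, they dominate the bill.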

For workflows that need to keep the document content available across many queries (RAG, knowledge bases), see clean PDF for LLM context. The principle is the same: strip the noise, keep the structure, pay only for what the LLM uses.

Frequently asked questions

Does the 95% noise figure apply to all documents?
No — the worst case (95%+) is scanned PDFs where OCR introduces additional artifacts. Clean digital PDFs are typically 80–90% noise. Either way, conversion to Markdown removes most of it.
Why don't LLM providers strip the noise themselves?
Their PDF parsers do strip some — but they're optimized for reliability across all PDFs (don't crash on weird inputs) rather than for output quality. A specialized tool can be more aggressive about removing noise because it can fail more gracefully on edge cases.
Does this affect Claude and Gemini equally?
Yes — the underlying issue is the PDF format, not any specific model's parser. Claude and Gemini also benefit from converting PDFs to Markdown before input. We have model-specific guides for <a href="/convert/pdf-to-markdown-for-claude">Claude</a> and <a href="/convert/pdf-to-markdown-for-gemini">Gemini</a>.