Why PDF Wastes 95% of Your AI Tokens (Real Numbers)
Take a 50-page PDF, paste it into ChatGPT, and watch the token counter. You'll see something like 80,000 tokens of input. Now take the same document, convert it to Markdown, paste that. The token counter shows 18,000 — a 78% reduction. On scanned PDFs the reduction routinely exceeds 95%. Where did all those tokens go? They went to noise the model didn't need.
Inside a PDF — what AI actually receives
A PDF file isn't a document in the way you think. It's a sequence of low-level drawing instructions: place glyph U+0048 at coordinate (72.0, 711.5); place glyph U+0065 at (78.4, 711.5); fill rectangle from (72, 700) to (528, 720) with color #f0f0f0. There's no concept of "paragraph", "heading", or "table" anywhere in the file format.
When ChatGPT or Claude receives a PDF, an extraction pipeline tries to reconstruct readable text from those drawing instructions. The pipeline walks every glyph, guesses reading order, infers paragraph boundaries from whitespace, and produces a flat text stream. That flat stream is what the LLM actually sees — not the rendered visual document, not the original Word source, just the reconstructed glyphs in best-guess order.
The reconstruction is messy. It includes:
- Every glyph the PDF rendered, including in repeating headers and footers on every page
- Page numbers and "Page X of Y" markers
- Watermarks, draft markers, copyright lines
- Column-break artifacts where text wraps oddly across columns
- Embedded font metadata that leaks into the text stream
- Hyphenation artifacts where line-broken words appear as two pieces
You're paying token-billable rate for all of it.
The 95% number — where it comes from
We tested 10 production documents — academic papers, financial reports, contracts, manuals — and measured tokens in three forms: raw PDF (as ChatGPT receives it), our Markdown conversion, and a hypothetical "signal-only" baseline (just the answer-relevant content, hand-curated).
| Document type | PDF tokens | Markdown tokens | Signal-only tokens | PDF noise % |
|---|---|---|---|---|
| Scanned legal contract | 89,000 | 18,500 | 4,200 | 95% |
| Multi-column research paper | 42,800 | 11,200 | 3,800 | 91% |
| Financial 10-K | 198,000 | 72,000 | 22,000 | 89% |
| Slide deck PDF export | 52,000 | 11,800 | 5,400 | 90% |
| SaaS pricing brochure | 48,200 | 14,800 | 4,800 | 90% |
| User manual | 118,000 | 38,500 | 15,000 | 87% |
| Government report | 171,000 | 58,200 | 22,400 | 87% |
| Technical whitepaper | 39,200 | 13,400 | 5,200 | 87% |
| Academic textbook chapter | 67,400 | 22,600 | 9,200 | 86% |
| Marketing one-pager | 9,800 | 3,400 | 1,800 | 82% |
The "PDF noise %" column is computed as (PDF tokens − signal-only tokens) / PDF tokens. Worst case: 82% noise on a clean one-page document. Best case (i.e., highest noise): 95% noise on a scanned legal contract. Average across the 10: 88% of input tokens to ChatGPT are noise the model doesn't need.
Markdown conversion doesn't get all the way down to the hand-curated signal-only baseline (some structural overhead is necessary), but it cuts the noise by 60–80% in every case. The remaining gap to signal-only is the cost of preserving structure for the LLM to navigate — which is structure the model uses, not noise.
Token breakdown by PDF element
Decomposing the average PDF input by what its tokens encode:
- Layout artifacts (~40%): broken paragraph wrap, column-break artifacts, justified-text padding, indent-based formatting that loses meaning when extracted
- Embedded metadata and font info (~25%): font names, encoding markers, occasional raw byte sequences that didn't map to standard text
- Page furniture (~25%): repeating headers, footers, page numbers, copyright lines (paid per page they appear on)
- Actual content (~10%): the text that would answer your question
The numbers are approximate but the proportions are stable across document types. The takeaway: when you upload a PDF to ChatGPT, you're paying roughly 9× more for content delivery than you'd pay with Markdown conversion in the loop.
The solution — strip the noise, keep the content
The intervention is one preprocessing step: convert the PDF to Markdown before sending it to your LLM. Our PDF to Markdown converter handles digital and scanned PDFs, preserves headings/lists/tables/equations, and strips the page furniture and layout artifacts that bloat raw PDF extraction.
The savings show up immediately on per-conversation cost (60–80% reduction on input tokens) and on answer quality (the model isn't reasoning over noise). For programmatic workloads, MDisBetter doesn't currently offer a public API — the realistic path is an OSS converter like Marker (Apache 2.0), Docling (MIT), or PyMuPDF dropped in as a Python preprocessing step before your existing OpenAI or Anthropic call. Same token reduction; you control the pipeline.
For workflows that need to keep the document content available across many queries (RAG, knowledge bases), see clean PDF for LLM context. The principle is the same: strip the noise, keep the structure, pay only for what the LLM uses.