May 10, 2026 · 7 min read · MDisBetter

Why PDF Wastes 95% of Your AI Tokens (Real Numbers)

Take a 50-page PDF, paste it into ChatGPT, and watch the token counter. You'll see something like 80,000 tokens of input. Now take the same document, convert it to Markdown, paste that. The token counter shows 18,000 — a 78% reduction. On scanned PDFs the reduction routinely exceeds 95%. Where did all those tokens go? They went to noise the model didn't need.

Inside a PDF — what AI actually receives

A PDF file isn't a document in the way you think. It's a sequence of low-level drawing instructions: place glyph U+0048 at coordinate (72.0, 711.5); place glyph U+0065 at (78.4, 711.5); fill rectangle from (72, 700) to (528, 720) with color #f0f0f0. There's no concept of "paragraph", "heading", or "table" anywhere in the file format.

When ChatGPT or Claude receives a PDF, an extraction pipeline tries to reconstruct readable text from those drawing instructions. The pipeline walks every glyph, guesses reading order, infers paragraph boundaries from whitespace, and produces a flat text stream. That flat stream is what the LLM actually sees — not the rendered visual document, not the original Word source, just the reconstructed glyphs in best-guess order.

The reconstruction is messy. It includes:

Every glyph the PDF rendered, including in repeating headers and footers on every page
Page numbers and "Page X of Y" markers
Watermarks, draft markers, copyright lines
Column-break artifacts where text wraps oddly across columns
Embedded font metadata that leaks into the text stream
Hyphenation artifacts where line-broken words appear as two pieces

You're paying token-billable rate for all of it.

The 95% number — where it comes from

We tested 10 production documents — academic papers, financial reports, contracts, manuals — and measured tokens in three forms: raw PDF (as ChatGPT receives it), our Markdown conversion, and a hypothetical "signal-only" baseline (just the answer-relevant content, hand-curated).

Document type	PDF tokens	Markdown tokens	Signal-only tokens	PDF noise %
Scanned legal contract	89,000	18,500	4,200	95%
Multi-column research paper	42,800	11,200	3,800	91%
Financial 10-K	198,000	72,000	22,000	89%
Slide deck PDF export	52,000	11,800	5,400	90%
SaaS pricing brochure	48,200	14,800	4,800	90%
User manual	118,000	38,500	15,000	87%
Government report	171,000	58,200	22,400	87%
Technical whitepaper	39,200	13,400	5,200	87%
Academic textbook chapter	67,400	22,600	9,200	86%
Marketing one-pager	9,800	3,400	1,800	82%

The "PDF noise %" column is computed as (PDF tokens − signal-only tokens) / PDF tokens. Worst case: 82% noise on a clean one-page document. Best case (i.e., highest noise): 95% noise on a scanned legal contract. Average across the 10: 88% of input tokens to ChatGPT are noise the model doesn't need.

Markdown conversion doesn't get all the way down to the hand-curated signal-only baseline (some structural overhead is necessary), but it cuts the noise by 60–80% in every case. The remaining gap to signal-only is the cost of preserving structure for the LLM to navigate — which is structure the model uses, not noise.

Token breakdown by PDF element

Decomposing the average PDF input by what its tokens encode:

Layout artifacts (~40%): broken paragraph wrap, column-break artifacts, justified-text padding, indent-based formatting that loses meaning when extracted
Embedded metadata and font info (~25%): font names, encoding markers, occasional raw byte sequences that didn't map to standard text
Page furniture (~25%): repeating headers, footers, page numbers, copyright lines (paid per page they appear on)
Actual content (~10%): the text that would answer your question

The numbers are approximate but the proportions are stable across document types. The takeaway: when you upload a PDF to ChatGPT, you're paying roughly 9× more for content delivery than you'd pay with Markdown conversion in the loop.

The solution — strip the noise, keep the content

The intervention is one preprocessing step: convert the PDF to Markdown before sending it to your LLM. Our PDF to Markdown converter handles digital and scanned PDFs, preserves headings/lists/tables/equations, and strips the page furniture and layout artifacts that bloat raw PDF extraction.

The savings show up immediately on per-conversation cost (60–80% reduction on input tokens) and on answer quality (the model isn't reasoning over noise). For programmatic workloads, MDisBetter doesn't currently offer a public API — the realistic path is an OSS converter like Marker (Apache 2.0), Docling (MIT), or PyMuPDF dropped in as a Python preprocessing step before your existing OpenAI or Anthropic call. Same token reduction; you control the pipeline.

For workflows that need to keep the document content available across many queries (RAG, knowledge bases), see clean PDF for LLM context. The principle is the same: strip the noise, keep the structure, pay only for what the LLM uses.

Frequently asked questions

Does the 95% noise figure apply to all documents?

No — the worst case (95%+) is scanned PDFs where OCR introduces additional artifacts. Clean digital PDFs are typically 80–90% noise. Either way, conversion to Markdown removes most of it.

Why don't LLM providers strip the noise themselves?

Their PDF parsers do strip some — but they're optimized for reliability across all PDFs (don't crash on weird inputs) rather than for output quality. A specialized tool can be more aggressive about removing noise because it can fail more gracefully on edge cases.

Does this affect Claude and Gemini equally?

Yes — the underlying issue is the PDF format, not any specific model's parser. Claude and Gemini also benefit from converting PDFs to Markdown before input. We have model-specific guides for <a href="/convert/pdf-to-markdown-for-claude">Claude</a> and <a href="/convert/pdf-to-markdown-for-gemini">Gemini</a>.