Token Count: PDF vs Markdown on 20 Real Documents (Hard Numbers)
The claim that converting PDFs to Markdown saves tokens is widely repeated but rarely measured precisely. We took 20 production documents of diverse types, sizes, and quality levels and ran token counts through the tokenizers behind three model families (ChatGPT, Claude, Gemini). Here are the actual numbers, with cost translations and the methodology so you can replicate them.
Methodology
20 documents across 5 categories (4 each):
- Academic papers (single + multi-column, with and without equations)
- Financial reports (10-K, earnings, equity research)
- Legal contracts (NDAs, service agreements)
- Product manuals (mixed layout with tables and code)
- Scanned documents (range of OCR quality)
For each document, we measured token count three ways:
- Raw PDF tokens: tokens that ChatGPT/Claude/Gemini would charge for if you uploaded the PDF directly (estimated by tokenizing the text their internal extractors produce)
- Plain text tokens: tokens for naive text extraction (e.g., pdftotext)
- Markdown tokens: tokens for our converter's Markdown output
Tokenizers used: tiktoken with cl100k_base for OpenAI models, Anthropic's tokenizer for Claude, Google's tokenizer for Gemini.
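As a concrete sketch of the counting step, here is roughly how the plain-text and Markdown counts are produced with tiktoken. File names are illustrative; the raw-PDF figure depends on the provider-side extraction described above and is not reproduced here.

```python
# Count tokens for the two local representations of one document.
# File names are illustrative, not part of any shipped tooling.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # OpenAI tokenizer used for the baseline numbers

def count_tokens(path: str) -> int:
    with open(path, encoding="utf-8") as f:
        return len(enc.encode(f.read()))

plain = count_tokens("arxiv_paper.txt")  # e.g. pdftotext output
md = count_tokens("arxiv_paper.md")      # converter output
print(f"plain text: {plain:,} tokens")
print(f"markdown:   {md:,} tokens ({1 - md / plain:.0%} fewer than plain text)")
```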
Results table
Token counts in thousands. Reduction is Markdown vs. raw PDF.
| Document | Raw PDF | Plain text | Markdown | Reduction |
|---|---|---|---|---|
| arXiv ML paper (24pg) | 52.4 | 32.1 | 13.8 | 74% |
| NeurIPS paper (16pg) | 38.2 | 23.4 | 10.1 | 74% |
| Physics review article (40pg) | 97.6 | 61.2 | 26.4 | 73% |
| Math monograph chapter (28pg) | 71.3 | 44.8 | 18.7 | 74% |
| 10-K filing (96pg) | 198.0 | 124.5 | 72.0 | 64% |
| Equity research note (32pg) | 71.6 | 43.2 | 22.4 | 69% |
| Earnings release (12pg) | 26.4 | 15.8 | 8.2 | 69% |
| Annual report (140pg) | 312.0 | 196.0 | 108.0 | 65% |
| Master service agreement (52pg) | 96.4 | 61.0 | 32.8 | 66% |
| Standard NDA (8pg) | 15.2 | 9.6 | 5.4 | 64% |
| Vendor contract (38pg) | 67.8 | 42.4 | 23.6 | 65% |
| Lease agreement (28pg) | 52.6 | 33.0 | 18.4 | 65% |
| SaaS user manual (54pg) | 118.0 | 74.6 | 38.5 | 67% |
| Hardware spec sheet (24pg) | 54.8 | 34.6 | 16.2 | 70% |
| API reference PDF (62pg) | 132.6 | 83.2 | 42.0 | 68% |
| Installation guide (16pg) | 34.0 | 21.2 | 11.4 | 66% |
| Scanned legal contract (24pg) | 89.0 | 52.4 | 18.5 | 79% |
| Faxed invoice (4pg) | 16.8 | 9.8 | 3.4 | 80% |
| Photographed page (6pg) | 22.4 | 13.6 | 4.8 | 79% |
| Scanned manual (40pg) | 148.0 | 87.6 | 31.0 | 79% |
Average reduction across all 20 documents: 70%. Range: 64-80%. Worst case (cleanest digital PDF): still 64% reduction.
Reductions by category
- Academic papers: 73-74% reduction (consistent across types)
- Financial reports: 65-69% (slightly lower due to dense tables that don't compress as much)
- Legal contracts: 64-66% (relatively clean PDFs to begin with)
- Product manuals: 66-70% (mixed layout)
- Scanned documents: 79-80% (highest reduction — OCR + extraction noise compounds without conversion)
The pattern: scanned documents see the largest reduction because raw PDF processing of them is particularly token-inefficient. Clean digital PDFs see smaller (but still substantial) reductions.
Cost implications
Pricing as of mid-2026:
- GPT-4o: $2.50 / 1M input tokens
- Claude Sonnet 4.6: $3.00 / 1M input tokens
- Gemini 2.5 Pro: $1.25 / 1M input tokens
For a workload of 1,000 conversations per month at ~60K input tokens per conversation (typical document Q&A):
| Model | PDF input cost | Markdown input cost | Saved |
|---|---|---|---|
| GPT-4o | $150/mo | $45/mo | $105/mo |
| Claude Sonnet 4.6 | $180/mo | $54/mo | $126/mo |
| Gemini 2.5 Pro | $75/mo | $22/mo | $53/mo |
Multiply by your actual volume. For agency or consultancy workflows running tens of thousands of conversations monthly, savings hit thousands of dollars per month from a one-line preprocessing change.
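If you want to sanity-check the table or plug in your own volume, the arithmetic is a few lines. Prices and the 70% average reduction are the figures quoted above; whole-dollar rounding may differ slightly from the table.

```python
# Monthly input-token cost, with and without Markdown preprocessing.
PRICE_PER_M = {"GPT-4o": 2.50, "Claude Sonnet 4.6": 3.00, "Gemini 2.5 Pro": 1.25}

conversations = 1_000   # conversations per month
tokens_each = 60_000    # input tokens per conversation
reduction = 0.70        # average reduction measured on the 20-document corpus

monthly_tokens = conversations * tokens_each  # 60M tokens/month
for model, price in PRICE_PER_M.items():
    pdf_cost = monthly_tokens / 1e6 * price
    md_cost = pdf_cost * (1 - reduction)
    print(f"{model}: ${pdf_cost:,.2f}/mo -> ${md_cost:,.2f}/mo (saves ${pdf_cost - md_cost:,.2f}/mo)")
```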
Tokenizer differences across models
An interesting nuance: the same Markdown text tokenizes slightly differently across models because they use different tokenizers. On our 20-document corpus:
- OpenAI cl100k_base: baseline (numbers above)
- Anthropic Claude tokenizer: ~5% more tokens for the same text
- Google Gemini tokenizer: ~3% fewer tokens for the same text
The relative reduction (PDF vs Markdown) is consistent across tokenizers — about 70% on each. The absolute token counts shift slightly. For cost calculations, use the tokenizer matching your target model.
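A sketch of how you might run the same text through all three counts, assuming recent versions of the anthropic and google-genai Python SDKs (both expose token-counting endpoints), API keys in the environment, and illustrative file and model names:

```python
import tiktoken
import anthropic
from google import genai

text = open("arxiv_paper.md", encoding="utf-8").read()  # illustrative file

# OpenAI: local tokenization with tiktoken
openai_tokens = len(tiktoken.get_encoding("cl100k_base").encode(text))

# Anthropic: server-side count via the messages.count_tokens endpoint
claude_tokens = anthropic.Anthropic().messages.count_tokens(
    model="claude-sonnet-4-5",  # substitute your target Claude model ID
    messages=[{"role": "user", "content": text}],
).input_tokens

# Google: server-side count via the count_tokens endpoint
gemini_tokens = genai.Client().models.count_tokens(
    model="gemini-2.5-pro", contents=text
).total_tokens

print(f"OpenAI: {openai_tokens:,}  Claude: {claude_tokens:,}  Gemini: {gemini_tokens:,}")
```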
Why the savings appear at all
Three sources of token reduction:
Stripped page furniture (~25% of savings)
Repeating headers, footers, page numbers, copyright lines. Present on every page in raw PDF extraction; removed by Markdown conversion.
Collapsed whitespace and normalized encoding (~30% of savings)
Raw extraction emits multiple consecutive spaces, broken paragraphs from line wrapping, encoding markers leaked into text. Markdown is normalized.
Eliminated layout artifacts (~45% of savings)
Column-break artifacts, justified-text padding, font metadata leakage, broken hyphenation. None of these carry meaning; all of them tokenize.
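The first two sources are easy to demonstrate on your own extractions. A rough sketch follows; real converters do much more (layout reconstruction, de-hyphenation), this only shows that the noise itself carries a measurable token cost. File names and the repetition threshold are illustrative.

```python
# Strip repeated page furniture and collapse whitespace, then re-count tokens.
import re
from collections import Counter

import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

def strip_noise(raw: str) -> str:
    lines = raw.splitlines()
    # Lines repeated on many pages are almost always headers, footers, or copyright lines.
    counts = Counter(line.strip() for line in lines if line.strip())
    lines = [l for l in lines if not l.strip() or counts[l.strip()] < 5]
    text = "\n".join(lines)
    text = re.sub(r"^\s*\d+\s*$", "", text, flags=re.MULTILINE)  # bare page numbers
    text = re.sub(r"[ \t]+", " ", text)                          # runs of spaces/tabs
    text = re.sub(r"\n{3,}", "\n\n", text)                       # stacked blank lines
    return text

raw = open("extracted.txt", encoding="utf-8").read()  # illustrative pdftotext output
cleaned = strip_noise(raw)
print(len(enc.encode(raw)), "->", len(enc.encode(cleaned)), "tokens")
```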
Replication
Want to verify on your own corpus? Three steps:
- Get token counts on your PDFs as they currently feed to your LLM (use the relevant tokenizer)
- Convert via our web converter for one-offs, or via OSS like Marker/Docling/PyMuPDF for batch (we don't ship a programmatic API today)
- Get token counts on the resulting Markdown using the same tokenizer
Use our token counter tool for a quick comparison without installing tokenizer libraries. The reduction on your specific documents may be slightly different from our benchmark; the direction (substantial reduction) is universal.
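If you'd rather script steps 1 and 3, here's a minimal sketch. It assumes each extracted .txt already has a converted .md next to it (step 2); paths are illustrative.

```python
# Per-document before/after token counts on a local corpus.
from pathlib import Path

import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # swap for your target model's tokenizer

def tokens(path: Path) -> int:
    return len(enc.encode(path.read_text(encoding="utf-8")))

for txt in sorted(Path("corpus").glob("*.txt")):
    md = txt.with_suffix(".md")
    if not md.exists():
        continue
    before, after = tokens(txt), tokens(md)
    print(f"{txt.stem}: {before:,} -> {after:,} tokens ({1 - after / before:.0%} reduction)")
```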