Token Count: PDF vs Markdown on 20 Real Documents (Hard Numbers)
The claim that converting PDFs to Markdown saves tokens is widely repeated but rarely measured precisely. We took 20 production documents of diverse types, sizes, and quality levels and ran token counts through the tokenizers behind three model families (ChatGPT, Claude, Gemini). Here are the actual numbers, with cost translations and the methodology so you can replicate them.
Methodology
20 documents across 5 categories (4 each):
- Academic papers (single + multi-column, with and without equations)
- Financial reports (10-K, earnings, equity research)
- Legal contracts (NDAs, service agreements)
- Product manuals (mixed layout with tables and code)
- Scanned documents (range of OCR quality)
For each document, we measured token count three ways:
- Raw PDF tokens: tokens that ChatGPT/Claude/Gemini would charge for if you uploaded the PDF directly (estimated by tokenizing the text their internal extractors produce)
- Plain text tokens: tokens for naive text extraction (e.g., pdftotext)
- Markdown tokens: tokens for our converter's Markdown output
Tokenizers used: tiktoken with cl100k_base for OpenAI models, Anthropic's tokenizer for Claude, Google's tokenizer for Gemini.
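As a concrete sketch of the counting step, here is roughly how the plain-text and Markdown counts are produced with tiktoken. File names are illustrative; the raw-PDF figure depends on the provider-side extraction described above and is not reproduced here.

```python
# Count tokens for the two local representations of one document.
# File names are illustrative, not part of any shipped tooling.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # OpenAI tokenizer used for the baseline numbers

def count_tokens(path: str) -> int:
    with open(path, encoding="utf-8") as f:
        return len(enc.encode(f.read()))

plain = count_tokens("arxiv_paper.txt")  # e.g. pdftotext output
md = count_tokens("arxiv_paper.md")      # converter output
print(f"plain text: {plain:,} tokens")
print(f"markdown:   {md:,} tokens ({1 - md / plain:.0%} fewer than plain text)")
```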
Results table
Token counts in thousands. Reduction is Markdown vs. raw PDF.
| Document | Raw PDF | Plain text | Markdown | Reduction |
|---|---|---|---|---|
| arXiv ML paper (24pg) | 52.4 | 32.1 | 13.8 | 74% |
| NeurIPS paper (16pg) | 38.2 | 23.4 | 10.1 | 74% |
| Physics review article (40pg) | 97.6 | 61.2 | 26.4 | 73% |
| Math monograph chapter (28pg) | 71.3 | 44.8 | 18.7 | 74% |
| 10-K filing (96pg) | 198.0 | 124.5 | 72.0 | 64% |
| Equity research note (32pg) | 71.6 | 43.2 | 22.4 | 69% |
| Earnings release (12pg) | 26.4 | 15.8 | 8.2 | 69% |
| Annual report (140pg) | 312.0 | 196.0 | 108.0 | 65% |
| Master service agreement (52pg) | 96.4 | 61.0 | 32.8 | 66% |
| Standard NDA (8pg) | 15.2 | 9.6 | 5.4 | 64% |
| Vendor contract (38pg) | 67.8 | 42.4 | 23.6 | 65% |
| Lease agreement (28pg) | 52.6 | 33.0 | 18.4 | 65% |
| SaaS user manual (54pg) | 118.0 | 74.6 | 38.5 | 67% |
| Hardware spec sheet (24pg) | 54.8 | 34.6 | 16.2 | 70% |
| API reference PDF (62pg) | 132.6 | 83.2 | 42.0 | 68% |
| Installation guide (16pg) | 34.0 | 21.2 | 11.4 | 66% |
| Scanned legal contract (24pg) | 89.0 | 52.4 | 18.5 | 79% |
| Faxed invoice (4pg) | 16.8 | 9.8 | 3.4 | 80% |
| Photographed page (6pg) | 22.4 | 13.6 | 4.8 | 79% |
| Scanned manual (40pg) | 148.0 | 87.6 | 31.0 | 79% |
Average reduction across all 20 documents: 70%. Range: 64-80%. Worst case (cleanest digital PDF): still 64% reduction.
Reductions by category
- Academic papers: 73-74% reduction (consistent across types)
- Financial reports: 65-69% (slightly lower due to dense tables that don't compress as much)
- Legal contracts: 64-66% (relatively clean PDFs to begin with)
- Product manuals: 66-70% (mixed layout)
- Scanned documents: 79-80% (highest reduction — OCR + extraction noise compounds without conversion)
The pattern: scanned documents see the largest reduction because raw PDF processing of them is particularly token-inefficient. Clean digital PDFs see smaller (but still substantial) reductions.
Cost implications
Pricing as of mid-2026:
- GPT-4o: $2.50 / 1M input tokens
- Claude Sonnet 4.6: $3.00 / 1M input tokens
- Gemini 2.5 Pro: $1.25 / 1M input tokens
For a workload of 1,000 conversations per month at ~60K input tokens per conversation (typical document Q&A):
| Model | PDF input cost | Markdown input cost | Saved |
|---|---|---|---|
| GPT-4o | $150/mo | $45/mo | $105/mo |
| Claude Sonnet 4.6 | $180/mo | $54/mo | $126/mo |
| Gemini 2.5 Pro | $75/mo | $22/mo | $53/mo |
Multiply by your actual volume. For agency or consultancy workflows running tens of thousands of conversations monthly, savings hit thousands of dollars per month from a one-line preprocessing change.
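If you want to sanity-check the table or plug in your own volume, the arithmetic is a few lines. Prices and the 70% average reduction are the figures quoted above; whole-dollar rounding may differ slightly from the table.

```python
# Monthly input-token cost, with and without Markdown preprocessing.
PRICE_PER_M = {"GPT-4o": 2.50, "Claude Sonnet 4.6": 3.00, "Gemini 2.5 Pro": 1.25}

conversations = 1_000   # conversations per month
tokens_each = 60_000    # input tokens per conversation
reduction = 0.70        # average reduction measured on the 20-document corpus

monthly_tokens = conversations * tokens_each  # 60M tokens/month
for model, price in PRICE_PER_M.items():
    pdf_cost = monthly_tokens / 1e6 * price
    md_cost = pdf_cost * (1 - reduction)
    print(f"{model}: ${pdf_cost:,.2f}/mo -> ${md_cost:,.2f}/mo (saves ${pdf_cost - md_cost:,.2f}/mo)")
```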
Tokenizer differences across models
An interesting nuance: the same Markdown text tokenizes slightly differently across models because they use different tokenizers. On our 20-document corpus:
- OpenAI cl100k_base: baseline (numbers above)
- Anthropic Claude tokenizer: ~5% more tokens for the same text
- Google Gemini tokenizer: ~3% fewer tokens for the same text
The relative reduction (PDF vs Markdown) is consistent across tokenizers — about 70% on each. The absolute token counts shift slightly. For cost calculations, use the tokenizer matching your target model.
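A sketch of how you might run the same text through all three counts, assuming recent versions of the anthropic and google-genai Python SDKs (both expose token-counting endpoints), API keys in the environment, and illustrative file and model names:

```python
import tiktoken
import anthropic
from google import genai

text = open("arxiv_paper.md", encoding="utf-8").read()  # illustrative file

# OpenAI: local tokenization with tiktoken
openai_tokens = len(tiktoken.get_encoding("cl100k_base").encode(text))

# Anthropic: server-side count via the messages.count_tokens endpoint
claude_tokens = anthropic.Anthropic().messages.count_tokens(
    model="claude-sonnet-4-5",  # substitute your target Claude model ID
    messages=[{"role": "user", "content": text}],
).input_tokens

# Google: server-side count via the count_tokens endpoint
gemini_tokens = genai.Client().models.count_tokens(
    model="gemini-2.5-pro", contents=text
).total_tokens

print(f"OpenAI: {openai_tokens:,}  Claude: {claude_tokens:,}  Gemini: {gemini_tokens:,}")
```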
Why the savings appear at all
Three sources of token reduction:
Stripped page furniture (~25% of savings)
Repeating headers, footers, page numbers, copyright lines. Present on every page in raw PDF extraction; removed by Markdown conversion.
Collapsed whitespace and normalized encoding (~30% of savings)
Raw extraction emits multiple consecutive spaces, broken paragraphs from line wrapping, encoding markers leaked into text. Markdown is normalized.
Eliminated layout artifacts (~45% of savings)
Column-break artifacts, justified-text padding, font metadata leakage, broken hyphenation. None of these carry meaning; all of them tokenize.
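The first two sources are easy to demonstrate on your own extractions. A rough sketch follows; real converters do much more (layout reconstruction, de-hyphenation), this only shows that the noise itself carries a measurable token cost. File names and the repetition threshold are illustrative.

```python
# Strip repeated page furniture and collapse whitespace, then re-count tokens.
import re
from collections import Counter

import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

def strip_noise(raw: str) -> str:
    lines = raw.splitlines()
    # Lines repeated on many pages are almost always headers, footers, or copyright lines.
    counts = Counter(line.strip() for line in lines if line.strip())
    lines = [l for l in lines if not l.strip() or counts[l.strip()] < 5]
    text = "\n".join(lines)
    text = re.sub(r"^\s*\d+\s*$", "", text, flags=re.MULTILINE)  # bare page numbers
    text = re.sub(r"[ \t]+", " ", text)                          # runs of spaces/tabs
    text = re.sub(r"\n{3,}", "\n\n", text)                       # stacked blank lines
    return text

raw = open("extracted.txt", encoding="utf-8").read()  # illustrative pdftotext output
cleaned = strip_noise(raw)
print(len(enc.encode(raw)), "->", len(enc.encode(cleaned)), "tokens")
```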
Replication
Want to verify on your own corpus? Three steps:
- Get token counts on your PDFs as they currently feed to your LLM (use the relevant tokenizer)
- Convert via our web converter for one-offs, or via OSS like Marker/Docling/PyMuPDF for batch (we don't ship a programmatic API today)
- Get token counts on the resulting Markdown using the same tokenizer
Use our token counter tool for a quick comparison without installing tokenizer libraries. The reduction on your specific documents may be slightly different from our benchmark; the direction (substantial reduction) is universal.
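If you'd rather script steps 1 and 3, here's a minimal sketch. It assumes each extracted .txt already has a converted .md next to it (step 2); paths are illustrative.

```python
# Per-document before/after token counts on a local corpus.
from pathlib import Path

import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # swap for your target model's tokenizer

def tokens(path: Path) -> int:
    return len(enc.encode(path.read_text(encoding="utf-8")))

for txt in sorted(Path("corpus").glob("*.txt")):
    md = txt.with_suffix(".md")
    if not md.exists():
        continue
    before, after = tokens(txt), tokens(md)
    print(f"{txt.stem}: {before:,} -> {after:,} tokens ({1 - after / before:.0%} reduction)")
```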