
Token Count: PDF vs Markdown on 20 Real Documents (Hard Numbers)

The claim that converting PDFs to Markdown saves tokens is widely repeated but rarely measured precisely. We took 20 production documents — diverse types, sizes, and quality levels — and ran token counts through the tokenizers behind three model families (GPT, Claude, Gemini). Here are the actual numbers, with cost translations and the methodology to replicate them.

Methodology

20 documents across 5 categories (4 each): academic papers, financial filings, legal contracts, technical documentation, and scanned/image-based documents.

For each document, we measured token count three ways:

  1. Raw PDF tokens: tokens that ChatGPT/Claude/Gemini would charge for if you uploaded the PDF directly (estimated by tokenizing the text their internal extractors produce)
  2. Plain text tokens: tokens for naive text extraction (e.g., pdftotext)
  3. Markdown tokens: tokens for our converter's Markdown output

Tokenizers used: tiktoken with cl100k_base for OpenAI models, Anthropic's tokenizer for Claude, Google's tokenizer for Gemini.
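For reference, the OpenAI-side counts can be reproduced with tiktoken directly. The snippet below is a minimal sketch: the file names are hypothetical, and cl100k_base mirrors the setup above.

```python
import tiktoken

# cl100k_base is the OpenAI encoding used for this benchmark
enc = tiktoken.get_encoding("cl100k_base")

def count_tokens(path: str) -> int:
    """Count tokens for a UTF-8 text file with the chosen encoding."""
    with open(path, encoding="utf-8") as f:
        return len(enc.encode(f.read()))

# Compare the same document in both forms (hypothetical file names)
print("plain text:", count_tokens("paper_plain.txt"))
print("markdown:  ", count_tokens("paper.md"))
```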

Results table

Token counts in thousands. Reduction % shown vs raw PDF.

| Document | Raw PDF | Plain text | Markdown | Reduction |
|---|---|---|---|---|
| arXiv ML paper (24pg) | 52.4 | 32.1 | 13.8 | 74% |
| NeurIPS paper (16pg) | 38.2 | 23.4 | 10.1 | 74% |
| Physics review article (40pg) | 97.6 | 61.2 | 26.4 | 73% |
| Math monograph chapter (28pg) | 71.3 | 44.8 | 18.7 | 74% |
| 10-K filing (96pg) | 198.0 | 124.5 | 72.0 | 64% |
| Equity research note (32pg) | 71.6 | 43.2 | 22.4 | 69% |
| Earnings release (12pg) | 26.4 | 15.8 | 8.2 | 69% |
| Annual report (140pg) | 312.0 | 196.0 | 108.0 | 65% |
| Master service agreement (52pg) | 96.4 | 61.0 | 32.8 | 66% |
| Standard NDA (8pg) | 15.2 | 9.6 | 5.4 | 64% |
| Vendor contract (38pg) | 67.8 | 42.4 | 23.6 | 65% |
| Lease agreement (28pg) | 52.6 | 33.0 | 18.4 | 65% |
| SaaS user manual (54pg) | 118.0 | 74.6 | 38.5 | 67% |
| Hardware spec sheet (24pg) | 54.8 | 34.6 | 16.2 | 70% |
| API reference PDF (62pg) | 132.6 | 83.2 | 42.0 | 68% |
| Installation guide (16pg) | 34.0 | 21.2 | 11.4 | 66% |
| Scanned legal contract (24pg) | 89.0 | 52.4 | 18.5 | 79% |
| Faxed invoice (4pg) | 16.8 | 9.8 | 3.4 | 80% |
| Photographed page (6pg) | 22.4 | 13.6 | 4.8 | 79% |
| Scanned manual (40pg) | 148.0 | 87.6 | 31.0 | 79% |

Average reduction across all 20 documents: 70%. Range: 64-80%. Worst case (cleanest digital PDF): still 64% reduction.

Reductions by category

Average reduction by category, computed from the table above: academic papers 74%, financial filings 67%, legal contracts 65%, technical documentation 68%, scanned documents 79%.

The pattern: scanned documents see the largest reduction because raw PDF processing of them is particularly token-inefficient. Clean digital PDFs see smaller (but still substantial) reductions.

Cost implications

Pricing as of mid-2026 (input tokens, as used in the table below): GPT-4o at $2.50 per million, Claude Sonnet 4.6 at $3.00 per million, Gemini 2.5 Pro at $1.25 per million.

For a workload of 1,000 conversations per month at ~60K input tokens per conversation (typical document Q&A):

| Model | PDF input cost | Markdown input cost | Saved |
|---|---|---|---|
| GPT-4o | $150/mo | $45/mo | $105/mo |
| Claude Sonnet 4.6 | $180/mo | $54/mo | $126/mo |
| Gemini 2.5 Pro | $75/mo | $22/mo | $53/mo |
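The arithmetic behind the table is simple to check. The sketch below reproduces it, using the per-million input prices quoted above and the 70% average reduction from the benchmark; the table rounds results to whole dollars.

```python
# Reproduce the cost table: 1,000 conversations/month at ~60K input tokens each
MONTHLY_TOKENS = 1_000 * 60_000  # 60M input tokens per month

# Per-million input-token prices as quoted above (check current pricing yourself)
PRICE_PER_M = {"GPT-4o": 2.50, "Claude Sonnet 4.6": 3.00, "Gemini 2.5 Pro": 1.25}
REDUCTION = 0.70  # average Markdown token reduction measured in this benchmark

for model, price in PRICE_PER_M.items():
    pdf_cost = MONTHLY_TOKENS / 1_000_000 * price
    md_cost = pdf_cost * (1 - REDUCTION)
    print(f"{model}: PDF ${pdf_cost:.2f}/mo, Markdown ${md_cost:.2f}/mo, "
          f"saved ${pdf_cost - md_cost:.2f}/mo")
```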

Multiply by your actual volume. For agency or consultancy workflows running tens of thousands of conversations monthly, savings hit thousands of dollars per month from a one-line preprocessing change.

Tokenizer differences across models

An interesting nuance: the same Markdown text tokenizes slightly differently across models because they use different tokenizers. On our 20-document corpus, the relative reduction (PDF vs Markdown) is consistent across tokenizers at roughly 70% each, while the absolute token counts shift slightly. For cost calculations, use the tokenizer matching your target model.
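As a quick illustration of the effect, here is the same string counted under two OpenAI encodings via tiktoken. This is a stand-in example only; Anthropic's and Google's tokenizers are exposed through their own SDKs and APIs rather than a local library.

```python
import tiktoken

sample = "## Quarterly results\n\n| Region | Revenue |\n|---|---|\n| EMEA | $4.2M |\n"

# Same string, two different encodings: the absolute counts differ slightly
for name in ("cl100k_base", "o200k_base"):
    enc = tiktoken.get_encoding(name)
    print(name, len(enc.encode(sample)))
```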

Why the savings appear at all

Three sources of token reduction:

Stripped page furniture (~25% of savings)

Repeating headers, footers, page numbers, copyright lines. Present on every page in raw PDF extraction; removed by Markdown conversion.

Collapsed whitespace and normalized encoding (~30% of savings)

Raw extraction emits multiple consecutive spaces, broken paragraphs from line wrapping, encoding markers leaked into text. Markdown is normalized.

Eliminated layout artifacts (~45% of savings)

Column-break artifacts, justified-text padding, font metadata leakage, hyphenation artifacts. None of which carry meaning; all of which tokenize.
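To make the three sources concrete, here is a minimal sketch of the kind of cleanup involved. These are illustrative regex rules, not our converter's actual pipeline.

```python
import re
from collections import Counter

def strip_page_furniture(pages: list[str]) -> list[str]:
    """Drop lines that repeat on (nearly) every page: headers, footers, page numbers."""
    line_counts = Counter(
        line.strip() for page in pages for line in page.splitlines() if line.strip()
    )
    threshold = max(2, int(0.8 * len(pages)))  # repeated on >= 80% of pages
    repeated = {line for line, n in line_counts.items() if n >= threshold}
    return [
        "\n".join(l for l in page.splitlines() if l.strip() not in repeated)
        for page in pages
    ]

def normalize(text: str) -> str:
    """Collapse whitespace and repair common layout artifacts."""
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)                  # rejoin hyphenated line breaks
    text = re.sub(r"^[ \t]*\d+[ \t]*$", "", text, flags=re.M)     # drop page-number-only lines
    text = re.sub(r"[ \t]{2,}", " ", text)                        # collapse runs of spaces/tabs
    text = re.sub(r"\n{3,}", "\n\n", text)                        # collapse blank-line runs
    return text.strip()
```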

Replication

Want to verify on your own corpus? Three steps:

  1. Get token counts on your PDFs as they currently feed to your LLM (use the relevant tokenizer)
  2. Convert via our web converter for one-offs, or via OSS like Marker/Docling/PyMuPDF for batch (we don't ship a programmatic API today)
  3. Get token counts on the resulting Markdown using the same tokenizer
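Putting the three steps together, one possible batch setup looks like the sketch below, assuming PyMuPDF for naive text extraction, its pymupdf4llm helper for Markdown output (Marker or Docling would slot in the same way), and tiktoken for counting. It compares naive extraction against Markdown, which is a conservative comparison relative to the raw-PDF numbers above.

```python
import fitz  # PyMuPDF
import pymupdf4llm
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # swap for your target model's tokenizer

def benchmark(pdf_path: str) -> None:
    # Step 1: naive plain-text extraction from the PDF
    doc = fitz.open(pdf_path)
    plain_text = "\n".join(page.get_text() for page in doc)

    # Step 2: convert the same PDF to Markdown
    markdown = pymupdf4llm.to_markdown(pdf_path)

    # Step 3: count both with the same tokenizer
    plain_tokens = len(enc.encode(plain_text))
    md_tokens = len(enc.encode(markdown))
    reduction = 1 - md_tokens / plain_tokens
    print(f"{pdf_path}: {plain_tokens} plain vs {md_tokens} markdown "
          f"({reduction:.0%} reduction)")

benchmark("annual_report.pdf")  # hypothetical file name
```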

Use our token counter tool for a quick comparison without installing tokenizer libraries. The reduction on your specific documents may be slightly different from our benchmark; the direction (substantial reduction) is universal.

Frequently asked questions

Does this benchmark hold for non-English documents?
Yes, with similar magnitudes. We re-ran on a smaller French and Mandarin corpus and saw 65-78% reductions — same range as English. The mechanism (removing layout noise) is language-agnostic.
What about embedded images in PDFs?
Images aren't counted as tokens by these models in their text-only mode. The token reduction we measure is purely on text content. Vision-enabled models that process embedded images add separate per-image costs that aren't affected by Markdown conversion.
Can I get even more reduction beyond Markdown?
Somewhat. Manually pulling out only the relevant sections and dropping the rest can get you past 90% reduction vs raw PDF, but that requires per-document human work; Markdown conversion is the largest reduction available fully automatically.