How much a typical PDF wastes
We benchmarked 10 representative documents: academic papers, product manuals, financial reports, legal contracts, and slide decks. Average token reduction from PDF text to Markdown: 68%. Worst case: 41%, a clean digital paper with little furniture to strip. Best case: 96%, a scanned, multi-column report whose OCR text was mostly noise.
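For concreteness, here is how those reduction figures are computed. The token counts are illustrative, not from the benchmark:

```python
def reduction(before_tokens: int, after_tokens: int) -> float:
    """Percentage of tokens saved going from raw PDF text to Markdown."""
    return 100 * (1 - after_tokens / before_tokens)

# A 10,000-token PDF extraction that converts to 3,200 tokens of
# Markdown is a 68% reduction, matching the benchmark average.
print(reduction(10_000, 3_200))
```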
Where do the savings come from? Roughly 25% from removing repeating headers, footers, and page numbers; 30% from collapsing whitespace and normalising encoding; the remaining 45% from dropping invisible glyphs, watermarks, and broken column boundaries that produced duplicated content in extraction.
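The two biggest savings above can be sketched in a few lines. This is a minimal illustration, not the benchmark's implementation: it detects lines that repeat across most pages (headers, footers, page numbers, with digits folded so "Page 3" and "Page 17" match), strips them, and collapses excess whitespace. The function name and the 60% threshold are assumptions:

```python
import re
from collections import Counter

def strip_repeating_lines(pages, threshold=0.6):
    """Remove lines that recur on at least `threshold` of the pages,
    then collapse trailing spaces and runs of blank lines.

    pages: list of per-page text strings from a PDF extractor.
    """
    def normalise(line):
        # Fold digits so "Page 3 of 12" and "Page 7 of 12" count as one line.
        return re.sub(r"\d+", "#", line.strip())

    # Count, per page, which normalised lines appear (a set, so a line
    # repeated within one page is counted once for that page).
    counts = Counter()
    for page in pages:
        for line in {normalise(l) for l in page.splitlines() if l.strip()}:
            counts[line] += 1

    cutoff = threshold * len(pages)
    repeating = {line for line, n in counts.items() if n >= cutoff}

    cleaned = []
    for page in pages:
        kept = [l for l in page.splitlines() if normalise(l) not in repeating]
        text = "\n".join(kept)
        text = re.sub(r"[ \t]+\n", "\n", text)   # trailing whitespace
        text = re.sub(r"\n{3,}", "\n\n", text)   # runs of blank lines
        cleaned.append(text.strip())
    return cleaned
```

Real pipelines need more care (headers that vary by chapter, running footers with dates), but frequency-across-pages is the core idea.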
What to keep, what to strip
Keep: headings, lists, code, tables, links, math notation, and paragraph breaks that respect the document's argument structure.
Strip: page numbers, repeating headers and footers, watermarks, "Page X of Y" markers, copyright lines on every page, and decorative separators. None of them help the LLM; all of them cost tokens.