How PDF Works Internally (And Why Text Extraction Always Breaks)
Every PDF-to-anything-else converter eventually fails on something. Tables flatten, columns scramble, equations garble, headers turn into body text. The failures aren't bugs — they're inherent to the PDF format itself. Understanding why requires looking at how PDF actually represents content. Once you see the format, the failure modes become obvious and the design of better converters becomes clear.
What's actually inside a PDF file
A PDF is a structured collection of objects: page descriptions, fonts, images, metadata, and a cross-reference table that ties them together. The interesting part for our discussion is the page content stream — the data that determines what gets rendered when you display a page.
A page content stream is a sequence of operators in PostScript-like syntax. Here's a simplified example of how the text "Hello" might be encoded:
BT % Begin text
/F1 12 Tf % Set font F1 at 12pt
100 700 Td % Move to position (100, 700)
(Hello) Tj % Show the string "Hello"
ET % End text

Notice what's there and what isn't. The format encodes: which font, what size, what position, what character codes. The format does NOT encode: that this is a paragraph, that it's a heading, that it's the start of a section, that it's part of a list. Those concepts simply don't exist at the file-format level.
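You can see these operators in any real file with a few lines of Python. This is a minimal sketch, assuming the PyMuPDF library is installed; the file name is a placeholder.

```python
# Minimal sketch: print a page's raw content stream so the BT/Tf/Td/Tj
# operators above are visible. Assumes PyMuPDF (pip install pymupdf);
# "example.pdf" is a placeholder.
import fitz  # PyMuPDF

doc = fitz.open("example.pdf")
page = doc[0]

# read_contents() concatenates and decompresses the page's /Contents streams.
print(page.read_contents().decode("latin-1", errors="replace"))
```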
Glyph positioning vs character semantics
In the example above, (Hello) Tj shows the string "Hello" — which sounds straightforward but isn't. The string contains glyph codes, not Unicode characters. To translate from glyph codes back to Unicode, you need the font's encoding map (a CMap or ToUnicode map).
Many PDFs ship without proper ToUnicode maps, especially:
- PDFs generated by older software
- PDFs with subset fonts (only some glyphs from the original font are embedded)
- PDFs with custom or symbol fonts
- PDFs that have been re-saved by intermediate tools
When the ToUnicode map is missing, a glyph code that renders as 'A' on screen might map to any Unicode character during extraction, or to none at all. The extracted text comes out garbled or empty.
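One way to spot the problem ahead of time is to check which fonts on a page carry a /ToUnicode entry. A minimal sketch, assuming pypdf; the file name is a placeholder, and the dictionary keys (/Resources, /Font, /BaseFont, /ToUnicode) are the standard names from the PDF specification.

```python
# Sketch: report which fonts on page 1 carry a /ToUnicode CMap (pypdf assumed).
from pypdf import PdfReader

reader = PdfReader("example.pdf")
page = reader.pages[0]

# /Resources -> /Font is where the page's font dictionaries live.
fonts = page["/Resources"]["/Font"]
for name, font in fonts.items():
    font = font.get_object()  # resolve indirect references
    print(name, font.get("/BaseFont"), "ToUnicode present:", "/ToUnicode" in font)
# Fonts without a /ToUnicode entry are the ones whose extracted text
# is likely to come out garbled or empty.
```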
Reading order is approximated, not encoded
The order of Tj operators in the content stream is whatever the PDF generator emitted. For simple single-column documents, this is usually top-to-bottom, left-to-right. For multi-column documents, it depends entirely on the generator:
- Word saves content in document order (correct reading order)
- InDesign saves content per text frame (often wrong reading order)
- LaTeX saves content in compilation order (sometimes wrong on complex layouts)
- OCR'd PDFs save content in OCR-engine order (variable)
An extractor that walks the content stream sequentially produces correct text from Word PDFs, scrambled text from InDesign PDFs, and sometimes-wrong text from LaTeX PDFs. The same visual document can yield wildly different output depending on which generator originally produced the PDF.
Sophisticated extractors (including ours) compensate with spatial analysis: they cluster text by X-coordinate to detect columns, then re-emit it in geometric reading order rather than stream order. This works most of the time but breaks on irregular layouts.
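Here is what that spatial step can look like in miniature. The sketch assumes PyMuPDF, a page with exactly two columns, and a split at the horizontal midpoint; real extractors infer the column boundaries instead of hard-coding them.

```python
# Sketch: re-emit words of a two-column page in geometric reading order
# using PyMuPDF word boxes. Assumes exactly two columns split at the
# page's horizontal midpoint.
import fitz  # PyMuPDF

doc = fitz.open("two_column.pdf")
page = doc[0]
words = page.get_text("words")   # (x0, y0, x1, y1, word, block, line, word_no)

mid = page.rect.width / 2
left = [w for w in words if w[0] < mid]
right = [w for w in words if w[0] >= mid]

def in_reading_order(column):
    # Sort by y (top to bottom), then x (left to right); rounding y is a
    # crude way to group words on the same baseline.
    return sorted(column, key=lambda w: (round(w[1]), w[0]))

text = " ".join(w[4] for col in (left, right) for w in in_reading_order(col))
print(text)
```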
Tables don't exist in PDF
This is the failure mode that matters most for AI workflows. When you look at a PDF table, you see rows and columns. The PDF file sees a sequence of glyphs at coordinates, with no annotation that these glyphs form a table.
Some PDFs are tagged: Tagged PDF adds a structure tree that marks content as <Table>, <TR>, <TD>, and so on. Tagging was designed for accessibility, and tagged files are a small minority; most PDFs in the wild are not tagged.
For untagged PDFs, table extraction requires inferring structure from coordinates: detect horizontal and vertical lines, cluster text by spatial alignment, reconstruct cell boundaries. This is hard. Specialized table-extraction tools (Tabula, Camelot) and modern AI converters (ours, Marker, Docling) achieve high accuracy on standard tables; complex tables (merged cells, borderless designs, multi-row headers) remain a frontier.
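For reference, this is roughly what driving one of those specialized tools looks like. A minimal sketch assuming Camelot (camelot-py); the file name and page selection are placeholders.

```python
# Sketch: pull ruled tables from page 1 with Camelot.
# "lattice" uses drawn cell borders; "stream" falls back to whitespace
# alignment for borderless tables.
import camelot

tables = camelot.read_pdf("report.pdf", pages="1", flavor="lattice")
for table in tables:
    print(table.parsing_report)   # per-table accuracy and whitespace metrics
    print(table.df)               # the reconstructed grid as a pandas DataFrame
```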
Why scanned PDFs are worse
Scanned PDFs are PDFs whose pages contain only images: no text operators at all, just a bitmap drawn onto each page. Extraction requires OCR (optical character recognition) before any of the above analysis can happen.
OCR introduces its own error sources:
- Character recognition errors: similar shapes (l/1, O/0, rn/m) get confused
- Word-segmentation errors: spaces dropped between words or inserted inside them
- Layout-detection errors: paragraph boundaries inferred from whitespace, often wrong
- Skew and noise: tilted scans, low resolution, and photo distortion all degrade accuracy
After OCR, you still have all the layout-reconstruction problems of digital PDFs, plus an additional ~1-5% character error rate baked in. Modern OCR is dramatically better than it was even five years ago, but it's still probabilistic.
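A common pattern is to OCR only the pages that have no extractable text layer. A rough sketch assuming PyMuPDF, Pillow, and pytesseract; the 300 DPI setting and the "any text at all" test are simplifications, not a robust scanned-page detector.

```python
# Sketch: OCR fallback for pages with no extractable text. PyMuPDF renders
# the page image, pytesseract does the recognition.
import io

import fitz  # PyMuPDF
import pytesseract
from PIL import Image

doc = fitz.open("maybe_scanned.pdf")
for page in doc:
    text = page.get_text().strip()
    if text:                                   # digital text layer present
        continue
    pix = page.get_pixmap(dpi=300)             # render the page to a bitmap
    img = Image.open(io.BytesIO(pix.tobytes("png")))
    text = pytesseract.image_to_string(img)    # probabilistic recognition
    print(f"page {page.number}: OCR recovered {len(text)} characters")
```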
What this means for AI workflows
When you upload a PDF to ChatGPT, Claude, or Gemini, a server-side extraction pipeline does its best with all of the above. Roughly, the steps are:
- Parse PDF structure (objects, pages, fonts)
- Walk page content streams, decode glyph codes to Unicode via ToUnicode maps
- If text content is sparse (likely scanned), fall back to OCR
- Apply heuristics to recover paragraph boundaries, headings, lists
- Emit a flat text stream to the model
Every step is best-effort. Errors compound. The final text the model sees can be quite far from the original document — and the model has no way to know what was lost.
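The overall shape is easy to describe even though every stage is approximate. The sketch below is purely illustrative: every function name is a hypothetical placeholder for a best-effort step, not a real library call.

```python
# Illustrative skeleton of the pipeline described above. All helpers here
# are hypothetical placeholders, each standing in for a best-effort stage.
def pdf_to_flat_text(path):
    doc = parse_pdf(path)                      # objects, pages, fonts
    pages = []
    for page in doc.pages:
        text = decode_content_stream(page)     # glyph codes -> Unicode
        if looks_scanned(text):                # sparse text: probably an image
            text = run_ocr(render(page))       # rasterize, then OCR
        pages.append(recover_structure(text))  # headings, paragraphs, lists
    return "\n\n".join(pages)                  # flat stream handed to the model
```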
The fix is to do the extraction yourself with a tool optimized for it, then feed clean Markdown to the model. The model spends its tokens reasoning over content rather than parsing reconstruction errors. We document the practical implications in why PDF wastes 95% of your AI tokens.
What a good converter does differently
Compared to naive text extraction, a layout-aware converter:
- Builds a spatial layout model (column boundaries, line baselines, text-block hierarchies)
- Re-emits content in geometric reading order, not stream order
- Detects tables via line clustering and cell-boundary inference
- Identifies and strips repeating page furniture (headers, footers, page numbers)
- Recognizes equation regions and processes them through math-aware OCR
- Handles missing ToUnicode maps via OCR fallback on the rendered glyphs
- Emits structured Markdown (or HTML, JSON) rather than flat text
None of this is magic — it's a stack of specialized models and heuristics, each addressing one of the failure modes inherent to the format. Tools like Marker, Docling, and ours are all variations on this approach. See our 10-tool benchmark for how the major options stack up.
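As one concrete illustration of the page-furniture step, repeated headers and footers can be found by counting lines that recur verbatim across pages. A rough sketch assuming PyMuPDF; the 60% threshold is arbitrary, and a real converter also matches by position so that elements that vary per page, such as page numbers, are caught too.

```python
# Sketch: strip lines that repeat verbatim on most pages (headers, footers).
from collections import Counter

import fitz  # PyMuPDF

doc = fitz.open("report.pdf")
seen = Counter()
for page in doc:
    lines = {l.strip() for l in page.get_text().splitlines() if l.strip()}
    seen.update(lines)                     # count each line once per page

# Treat anything appearing on >= 60% of pages as page furniture (arbitrary cutoff).
furniture = {line for line, n in seen.items() if n >= 0.6 * len(doc)}

for page in doc:
    body = [l for l in page.get_text().splitlines() if l.strip() not in furniture]
    print("\n".join(body))
```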
Why this matters for the next generation of tools
The best PDF-to-Markdown tools today still have failure modes — not because the engineering is bad, but because the format is fundamentally hostile to machine reading. The right long-term fix is at the format level: documents authored in semantic formats (Markdown, AsciiDoc, structured XML) and rendered to PDF only when print is needed.
For now, the workflow is: author or receive in PDF, convert to Markdown, do the actual work in Markdown. The conversion is a tax we pay for PDF's continued dominance. Better tools reduce the tax; they don't eliminate it.