PDF vs Markdown vs TXT — Which Format for AI?
You have a document. You want an LLM to read it. PDF, plain text, or Markdown? They're not equivalent — each format carries different amounts of structure, tokenizes differently, and produces measurably different answer quality. The right pick is rarely PDF, occasionally plain text, and most of the time Markdown.
The 3 formats compared
| Format | Structure | Tokens (relative) | Accuracy | Compatibility |
|---|---|---|---|---|
| PDF | Visual only — none for AI | 3.0× | Poor (extraction lossy) | Universal |
| Plain text | None | 1.5× | Mediocre (no cues) | Universal |
| Markdown | Headings, lists, code, tables | 1.0× (baseline) | High (native to LLMs) | Excellent (every LLM) |
The tokens column is relative, normalized to Markdown = 1.0×. PDF's 3.0× average means a typical document costs three times as many tokens fed as a PDF as it does as Markdown. Plain text sits in between: fewer artifacts than PDF, but it loses the structural cues Markdown carries.
PDF — why it's the worst for AI
PDF was designed for printing, not machine reading. The format encodes glyphs at coordinates with no semantic notion of "paragraph", "heading", or "table". When ChatGPT or Claude receives a PDF, an extraction step has to reconstruct readable text — and that step is best-effort:
- Reading order on multi-column pages comes out wrong roughly 30% of the time
- Tables typically flatten into prose, losing row/column structure
- Page furniture (headers, footers, page numbers) leaks into the body
- Encoding issues on older or international PDFs introduce garbled text
The model receives noisy, structurally-broken text and pays full token price for the noise. The result is bad answers at high cost. We document this in detail in why PDF wastes 95% of your AI tokens.
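To make the cleanup problem concrete, here is a minimal sketch of one small piece of it: stripping repeated page furniture (headers, footers, bare page numbers) from text already extracted page by page. This is an illustrative heuristic of my own, not how any particular extractor works, and it only handles the easy cases.

```python
import re
from collections import Counter

def strip_page_furniture(pages: list[str]) -> str:
    """Drop lines that repeat across most pages (likely headers/footers)
    and bare page numbers from PDF-extracted page text."""
    line_counts = Counter(
        line.strip()
        for page in pages
        for line in page.splitlines()
        if line.strip()
    )
    # A line appearing on at least half the pages is treated as furniture.
    threshold = max(2, len(pages) // 2)
    cleaned_pages = []
    for page in pages:
        kept = []
        for line in page.splitlines():
            s = line.strip()
            if not s:
                continue
            if len(pages) > 1 and line_counts[s] >= threshold:
                continue  # repeated header/footer
            if re.fullmatch(r"(Page\s+)?\d+(\s+of\s+\d+)?", s):
                continue  # bare page number
            kept.append(s)
        cleaned_pages.append("\n".join(kept))
    return "\n\n".join(cleaned_pages)
```

Even this toy version shows why extraction is best-effort: a legitimate line that happens to repeat (a recurring section title, say) gets deleted, and a header that varies per page slips through.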
Plain text — better but you lose structure
If you've already extracted a PDF to plain text, you've stripped the layout artifacts (good) but also stripped every structural cue (bad). The LLM sees a wall of paragraphs with no signals about what's a heading, what's a list, what's code.
Token usage is reasonable — about 1.5× a Markdown equivalent (the gap comes from inline code and tables that Markdown represents compactly). Accuracy is mediocre: the model has to infer structure from sentence patterns, which works for short documents and breaks on long ones.
When plain text is the right answer: short, prose-only content (transcripts, email bodies, novel excerpts) where there's no structure to preserve. Even there, Markdown is at worst neutral and often slightly better.
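To see what "inferring structure" looks like in practice, here is a minimal sketch of the kind of heuristics a text-to-Markdown pass can apply: promote short standalone lines to headings and normalize bullet characters. The rules here are illustrative assumptions of mine, not what any particular tool uses.

```python
def autostructure(text: str) -> str:
    """Heuristically add Markdown structure to plain text:
    short standalone lines become headings, bullets are normalized."""
    out = []
    lines = text.splitlines()
    for i, line in enumerate(lines):
        s = line.strip()
        if s[:1] in ("•", "*"):
            # Normalize common bullet characters to Markdown's '-'.
            out.append("- " + s[1:].strip())
            continue
        prev_blank = i == 0 or not lines[i - 1].strip()
        next_blank = i == len(lines) - 1 or not lines[i + 1].strip()
        # A short line with a blank above and content below reads like a heading.
        if s and len(s) < 60 and not s.endswith((".", ":", ",")) and prev_blank and not next_blank:
            out.append("## " + s)
        else:
            out.append(line)
    return "\n".join(out)
```

The fragility of these guesses is exactly the argument for writing the structure down explicitly as Markdown instead of making the model (or a preprocessor) reconstruct it.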
Markdown — the sweet spot
Markdown gives you the best of both: the cleanness of plain text plus explicit structural cues. Headings as #, lists as -, code as fenced blocks, tables as pipes. The model reads each cue natively because every major LLM was trained on enormous amounts of Markdown content (READMEs, documentation, GitHub).
The benefits compound. Token-wise, Markdown is the most compact — heading syntax adds two characters per heading, list syntax adds two characters per item, and the total syntax overhead is typically around 1% of the content. Accuracy-wise, the structural cues let the model build an internal map of the document, which dramatically improves long-context retrieval. Compatibility-wise, every API (ChatGPT, Claude, Gemini, Llama, Mistral, plus dozens of smaller models) accepts Markdown without configuration.
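The ~1% overhead figure is easy to sanity-check yourself. Here is a rough way to measure it; the sample document and the counting method are mine, and very short samples overstate the ratio (it shrinks as the prose between headings grows).

```python
import re

def markdown_overhead(doc: str) -> float:
    """Rough fraction of characters spent on structural Markdown syntax:
    heading hashes, list bullets, and table/fence punctuation."""
    syntax_chars = 0
    for line in doc.splitlines():
        m = re.match(r"(#+\s|[-*]\s|\d+\.\s)", line.lstrip())
        if m:
            syntax_chars += len(m.group(1))
        # Crude: counts every pipe/backtick as syntax, even in prose.
        syntax_chars += line.count("|") + line.count("`")
    return syntax_chars / max(1, len(doc))

sample = """# Quarterly report

## Highlights

- Revenue grew 12% year over year in the third quarter.
- Operating costs fell by 3% after the vendor consolidation.
- Headcount stayed flat while shipping two major releases.

The board approved the updated budget for the next fiscal year,
with additional funding earmarked for the infrastructure team.
"""
```

On this deliberately heading-heavy sample the overhead is a few percent; on a realistic document with full paragraphs between headings it drops toward the ~1% range.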
Decision tree — pick the right format
You have a PDF
→ Convert to Markdown via our PDF to Markdown converter. Always. The conversion is free, takes 30 seconds, and pays back on every subsequent LLM call.
You have a Word document or PowerPoint
→ Convert to Markdown via the equivalent tool. The same logic applies — every Office format carries layout metadata that wastes tokens.
You have raw text (chat log, transcript, email body)
→ Plain text is fine for very short content; convert to Markdown via our text-to-Markdown tool for anything over a few paragraphs. The auto-detected structure improves retrieval and adds negligible overhead.
You have HTML or a web page
→ Convert to Markdown. HTML carries even more structural noise than PDF — script tags, navigation, ads, sidebars all tokenize wastefully. Use our URL to Markdown tool.
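To make that noise concrete, here is a sketch of the stripping half of the job using Python's standard `html.parser`: it drops script, style, and navigation subtrees and keeps the visible text. A real HTML-to-Markdown converter also has to emit heading and list syntax, which this sketch omits.

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect visible text, skipping noisy subtrees entirely."""
    SKIP = {"script", "style", "nav", "aside", "footer"}

    def __init__(self):
        super().__init__()
        self.depth = 0    # nesting depth inside skipped tags
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth == 0 and data.strip():
            self.chunks.append(data.strip())

def visible_text(html: str) -> str:
    parser = TextExtractor()
    parser.feed(html)
    return "\n".join(parser.chunks)
```

Everything the parser skips here — scripts, nav links, footer boilerplate — is exactly the content that would otherwise tokenize wastefully.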
You're feeding a structured database export (CSV, JSON)
→ Markdown for tables (better readability for the LLM), JSON for structured data the model needs to navigate programmatically. Both work; pick based on your downstream task.
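Rendering a CSV export as a Markdown pipe table is mechanical. A minimal sketch, assuming the first row is a header and cells contain no pipe characters:

```python
import csv
import io

def csv_to_markdown(csv_text: str) -> str:
    """Render CSV text as a Markdown pipe table (first row = header)."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    header, body = rows[0], rows[1:]
    lines = [
        "| " + " | ".join(header) + " |",
        "|" + "|".join("---" for _ in header) + "|",  # separator row
    ]
    for row in body:
        lines.append("| " + " | ".join(row) + " |")
    return "\n".join(lines)
```

The pipe-table form keeps the row/column relationships visible to the model, which is the whole advantage over a flattened or prose rendering.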
The exception — when PDF is acceptable
If your document is one page, has no layout (a single column of prose), no tables, and you'll only ask one question — feeding the PDF directly to ChatGPT works fine and saves you the conversion step. For literally everything else, conversion to Markdown is the right preprocessing step.
The rule of thumb: any document worth asking more than one question about is worth converting once. The conversion takes seconds; the savings compound across every subsequent query.