PDF vs Markdown vs TXT — Which Format for AI?
You have a document. You want an LLM to read it. PDF, plain text, or Markdown? They're not equivalent — each format carries different amounts of structure, tokenizes differently, and produces measurably different answer quality. The right pick is rarely PDF, occasionally plain text, and most of the time Markdown.
The 3 formats compared
| Format | Structure | Tokens (relative) | Accuracy | Compatibility |
|---|---|---|---|---|
| PDF | Visual only — none for AI | 3.0× | Poor (extraction lossy) | Universal |
| Plain text | None | 1.5× | Mediocre (no cues) | Universal |
| Markdown | Headings, lists, code, tables | 1.0× (baseline) | High (native to LLMs) | Excellent (every LLM) |
The tokens column is relative, normalized to Markdown = 1.0×. PDF's 3.0× average means a typical document costs three times as many tokens fed as a PDF as it does as Markdown. Plain text sits in between: fewer artifacts than PDF, but it loses the structural cues Markdown carries.
PDF — why it's the worst for AI
PDF was designed for printing, not machine reading. The format encodes glyphs at coordinates with no semantic notion of "paragraph", "heading", or "table". When ChatGPT or Claude receives a PDF, an extraction step has to reconstruct readable text — and that step is best-effort:
- Reading order on multi-column pages comes out wrong roughly 30% of the time
- Tables typically flatten into prose, losing row/column structure
- Page furniture (headers, footers, page numbers) leaks into the body
- Encoding issues on older or international PDFs introduce garbled text
The model receives noisy, structurally-broken text and pays full token price for the noise. The result is bad answers at high cost. We document this in detail in why PDF wastes 95% of your AI tokens.
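To make the cleanup problem concrete, here is a minimal sketch of one small piece of it: stripping repeated page furniture (headers, footers, bare page numbers) from text already extracted page by page. This is an illustrative heuristic of my own, not how any particular extractor works, and it only handles the easy cases.

```python
import re
from collections import Counter

def strip_page_furniture(pages: list[str]) -> str:
    """Drop lines that repeat across most pages (likely headers/footers)
    and bare page numbers from PDF-extracted page text."""
    line_counts = Counter(
        line.strip()
        for page in pages
        for line in page.splitlines()
        if line.strip()
    )
    # A line appearing on at least half the pages is treated as furniture.
    threshold = max(2, len(pages) // 2)
    cleaned_pages = []
    for page in pages:
        kept = []
        for line in page.splitlines():
            s = line.strip()
            if not s:
                continue
            if len(pages) > 1 and line_counts[s] >= threshold:
                continue  # repeated header/footer
            if re.fullmatch(r"(Page\s+)?\d+(\s+of\s+\d+)?", s):
                continue  # bare page number
            kept.append(s)
        cleaned_pages.append("\n".join(kept))
    return "\n\n".join(cleaned_pages)
```

Even this toy version shows why extraction is best-effort: a legitimate line that happens to repeat (a recurring section title, say) gets deleted, and a header that varies per page slips through.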
Plain text — better but you lose structure
If you've already extracted a PDF to plain text, you've stripped the layout artifacts (good) but also stripped every structural cue (bad). The LLM sees a wall of paragraphs with no signals about what's a heading, what's a list, what's code.
Token usage is reasonable — about 1.5× a Markdown equivalent (the gap comes from inline code and tables that Markdown represents compactly). Accuracy is mediocre: the model has to infer structure from sentence patterns, which works for short documents and breaks on long ones.
When plain text is the right answer: short, prose-only content (transcripts, email bodies, novel excerpts) where there's no structure to preserve. Even there, Markdown is at worst neutral and often slightly better.
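To see what "inferring structure" looks like in practice, here is a minimal sketch of the kind of heuristics a text-to-Markdown pass can apply: promote short standalone lines to headings and normalize bullet characters. The rules here are illustrative assumptions of mine, not what any particular tool uses.

```python
def autostructure(text: str) -> str:
    """Heuristically add Markdown structure to plain text:
    short standalone lines become headings, bullets are normalized."""
    out = []
    lines = text.splitlines()
    for i, line in enumerate(lines):
        s = line.strip()
        if s[:1] in ("•", "*"):
            # Normalize common bullet characters to Markdown's '-'.
            out.append("- " + s[1:].strip())
            continue
        prev_blank = i == 0 or not lines[i - 1].strip()
        next_blank = i == len(lines) - 1 or not lines[i + 1].strip()
        # A short line with a blank above and content below reads like a heading.
        if s and len(s) < 60 and not s.endswith((".", ":", ",")) and prev_blank and not next_blank:
            out.append("## " + s)
        else:
            out.append(line)
    return "\n".join(out)
```

The fragility of these guesses is exactly the argument for writing the structure down explicitly as Markdown instead of making the model (or a preprocessor) reconstruct it.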
Markdown — the sweet spot
Markdown gives you the best of both: the cleanness of plain text plus explicit structural cues. Headings as #, lists as -, code as fenced blocks, tables as pipes. The model reads each cue natively because every major LLM was trained on enormous amounts of Markdown content (READMEs, documentation, GitHub).
The benefits compound. Token-wise, Markdown is the most compact — heading syntax adds two characters per heading, list syntax adds two characters per item, and the total syntax overhead is typically around 1% of the content. Accuracy-wise, the structural cues let the model build an internal map of the document, which dramatically improves long-context retrieval. Compatibility-wise, every API (ChatGPT, Claude, Gemini, Llama, Mistral, plus dozens of smaller models) accepts Markdown without configuration.
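The ~1% overhead figure is easy to sanity-check yourself. Here is a rough way to measure it; the sample document and the counting method are mine, and very short samples overstate the ratio (it shrinks as the prose between headings grows).

```python
import re

def markdown_overhead(doc: str) -> float:
    """Rough fraction of characters spent on structural Markdown syntax:
    heading hashes, list bullets, and table/fence punctuation."""
    syntax_chars = 0
    for line in doc.splitlines():
        m = re.match(r"(#+\s|[-*]\s|\d+\.\s)", line.lstrip())
        if m:
            syntax_chars += len(m.group(1))
        # Crude: counts every pipe/backtick as syntax, even in prose.
        syntax_chars += line.count("|") + line.count("`")
    return syntax_chars / max(1, len(doc))

sample = """# Quarterly report

## Highlights

- Revenue grew 12% year over year in the third quarter.
- Operating costs fell by 3% after the vendor consolidation.
- Headcount stayed flat while shipping two major releases.

The board approved the updated budget for the next fiscal year,
with additional funding earmarked for the infrastructure team.
"""
```

On this deliberately heading-heavy sample the overhead is a few percent; on a realistic document with full paragraphs between headings it drops toward the ~1% range.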
Decision tree — pick the right format
You have a PDF
→ Convert to Markdown via our PDF to Markdown converter. Always. The conversion is free, takes 30 seconds, and pays back on every subsequent LLM call.
You have a Word document or PowerPoint
→ Convert to Markdown via the equivalent tool. The same logic applies — every Office format carries layout metadata that wastes tokens.
You have raw text (chat log, transcript, email body)
→ Plain text is fine for very short content; convert to Markdown via our text-to-Markdown tool for anything over a few paragraphs. The auto-detected structure improves retrieval and adds negligible overhead.
You have HTML or a web page
→ Convert to Markdown. HTML carries even more structural noise than PDF — script tags, navigation, ads, sidebars all tokenize wastefully. Use our URL to Markdown tool.
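To make that noise concrete, here is a sketch of the stripping half of the job using Python's standard `html.parser`: it drops script, style, and navigation subtrees and keeps the visible text. A real HTML-to-Markdown converter also has to emit heading and list syntax, which this sketch omits.

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect visible text, skipping noisy subtrees entirely."""
    SKIP = {"script", "style", "nav", "aside", "footer"}

    def __init__(self):
        super().__init__()
        self.depth = 0    # nesting depth inside skipped tags
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth == 0 and data.strip():
            self.chunks.append(data.strip())

def visible_text(html: str) -> str:
    parser = TextExtractor()
    parser.feed(html)
    return "\n".join(parser.chunks)
```

Everything the parser skips here — scripts, nav links, footer boilerplate — is exactly the content that would otherwise tokenize wastefully.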
You're feeding a structured database export (CSV, JSON)
→ Markdown for tables (better readability for the LLM), JSON for structured data the model needs to navigate programmatically. Both work; pick based on your downstream task.
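Rendering a CSV export as a Markdown pipe table is mechanical. A minimal sketch, assuming the first row is a header and cells contain no pipe characters:

```python
import csv
import io

def csv_to_markdown(csv_text: str) -> str:
    """Render CSV text as a Markdown pipe table (first row = header)."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    header, body = rows[0], rows[1:]
    lines = [
        "| " + " | ".join(header) + " |",
        "|" + "|".join("---" for _ in header) + "|",  # separator row
    ]
    for row in body:
        lines.append("| " + " | ".join(row) + " |")
    return "\n".join(lines)
```

The pipe-table form keeps the row/column relationships visible to the model, which is the whole advantage over a flattened or prose rendering.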
The exception — when PDF is acceptable
If your document is one page, has no layout (a single column of prose), no tables, and you'll only ask one question — feeding the PDF directly to ChatGPT works fine and saves you the conversion step. For literally everything else, conversion to Markdown is the right preprocessing step.
The rule of thumb: any document worth asking more than one question about is worth converting once. The conversion takes seconds; the savings compound across every subsequent query.