## Why Markdown is the lingua franca of LLMs
Across OpenAI, Anthropic, Google DeepMind, Meta, Mistral, and the open-source long tail, training corpora are dominated by Markdown-flavoured text: README files, documentation sites, blog posts, GitHub wikis. The result is that every modern model recognises the same handful of cues: `#` means heading, `-` means list item, fenced code blocks are inviolable, tables are tables.
None of those cues exist in a PDF. A PDF is a sequence of glyphs at coordinates; the structure has to be inferred. Inference costs tokens (the model reasons about layout instead of content) and introduces errors (the model gets the layout wrong). Markdown avoids both.
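The raw token footprint is easy to inspect yourself. Below is a minimal sketch, assuming the `tiktoken` library and two illustrative stand-in strings (not real extractor output), that counts how many tokens each representation of the same small table consumes:

```python
# Rough token-footprint comparison, assuming tiktoken's "o200k_base"
# encoding (the GPT-4o-family tokenizer). Both strings are illustrative
# stand-ins, not real extractor output.
import tiktoken

pdf_extracted = (
    "Revenue      Q1      4.2\n"
    "Revenue      Q2      5.1\n"
    "Costs        Q1      3.0\n"
    "Costs        Q2      3.4\n"
)
markdown_table = (
    "| Metric  | Q1  | Q2  |\n"
    "|---------|-----|-----|\n"
    "| Revenue | 4.2 | 5.1 |\n"
    "| Costs   | 3.0 | 3.4 |\n"
)

enc = tiktoken.get_encoding("o200k_base")
for label, text in (("pdf-extracted", pdf_extracted), ("markdown", markdown_table)):
    print(f"{label}: {len(enc.encode(text))} tokens")
```

Raw counts can go either way depending on the content; the cost Markdown reliably removes is the structural one, where the model has to rediscover that four scattered lines were originally one table.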
### Model-specific guides
Token savings and best practices vary by model, so we maintain a guide for each major destination:
- ChatGPT: token economics on GPT-4o / GPT-5 / o-series
- Claude: Sonnet 4.5 and Opus 4.1 with the 200k context window
- Gemini: the 1M-token context on 2.5 Pro and the AI Studio workflow
- RAG pipelines: chunking by Markdown headers (see the sketch after this list)
- LangChain and LlamaIndex: code-level integration
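As a taste of what the RAG guide covers, here is a minimal header-based chunking sketch using LangChain's `MarkdownHeaderTextSplitter`; the sample document and header labels are illustrative:

```python
# Minimal header-based chunking sketch. Requires the
# langchain-text-splitters package; the sample document is illustrative.
from langchain_text_splitters import MarkdownHeaderTextSplitter

doc = """# Annual Report

## Revenue

Revenue grew 12% year over year.

## Risks

Supply-chain exposure remains the main risk.
"""

splitter = MarkdownHeaderTextSplitter(
    headers_to_split_on=[("#", "h1"), ("##", "h2")],
)

for chunk in splitter.split_text(doc):
    # Each chunk carries its heading path as metadata, so a retriever
    # can cite "Annual Report > Risks" instead of a byte offset.
    print(chunk.metadata, "->", chunk.page_content)
```

Because the boundaries come from the document's own headings, each chunk is a self-contained topic rather than an arbitrary fixed-size window, which is exactly what header-based chunking buys you over naive splitting.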