The three problems with feeding .docx to LLMs
XML overhead. A .docx file is a ZIP archive of XML files. Feed one to an LLM and the entire XML envelope counts toward your token budget: paragraph revision IDs (rsidR markers), run properties, font definitions, theme references, default style schemas. Typical overhead: 30-50% more tokens than the same content expressed as Markdown.
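To make the envelope overhead concrete, here is a minimal sketch comparing the WordprocessingML for a single plain sentence against its Markdown equivalent. The XML is hand-written for illustration (a real document.xml carries even more, plus separate styles.xml and theme files), so treat the ratio as a lower bound, not a measurement.

```python
# Sketch: size of the OOXML envelope around one sentence vs. the same
# sentence as Markdown. Illustrative hand-written XML, not a real file.
sentence = "Quarterly revenue grew 12% year over year."

# A minimal WordprocessingML paragraph: revision IDs, paragraph properties,
# run properties, and font references wrap the actual text in <w:t>.
xml = (
    '<w:p w:rsidR="00A1B2C3" w:rsidRDefault="00A1B2C3">'
    '<w:pPr><w:pStyle w:val="Normal"/></w:pPr>'
    '<w:r><w:rPr><w:rFonts w:ascii="Calibri"/></w:rPr>'
    f'<w:t>{sentence}</w:t></w:r></w:p>'
)
markdown = sentence  # plain body text needs no extra syntax in Markdown

overhead = len(xml) / len(markdown)
print(f"XML: {len(xml)} chars, Markdown: {len(markdown)} chars, "
      f"ratio: {overhead:.1f}x")
```

Token counts track character counts closely enough here that the same gap shows up in your context-window budget.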
Formatting noise. The model spends invisible effort filtering structural metadata before it can reason about content. On long documents that filtering occasionally fails: responses paraphrase style IDs as if they were body text, surface tracked changes the author believed were hidden, or garble list numbering. Markdown removes this failure mode entirely.
Lost structure. When LLMs do extract text from .docx, they often lose the heading hierarchy that gives Markdown its semantic value. A plain-text run reading "Section 3.1" loses the heading level that lets a model treat it as a navigable anchor. Markdown headings persist as stable anchors across model invocations.
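Preserving that hierarchy is mostly a matter of mapping Word's built-in paragraph style names ("Heading 1" through "Heading 9") onto Markdown heading levels. A minimal sketch, assuming you already have (style name, text) pairs from whatever extractor you use:

```python
import re

def style_to_markdown(style_name: str, text: str) -> str:
    """Map a Word paragraph style to a Markdown line, keeping hierarchy.

    "Heading 3" becomes an H3 ("### ..."); anything else passes through
    as body text.
    """
    m = re.fullmatch(r"Heading (\d)", style_name)
    if m:
        return "#" * int(m.group(1)) + " " + text
    return text

print(style_to_markdown("Heading 3", "Section 3.1"))  # → ### Section 3.1
```

Libraries like python-docx expose these style names directly on each paragraph, so the same mapping drops into a real conversion pipeline unchanged.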
Why Markdown specifically (vs plain text)
Plain text strips formatting noise but also strips structure. Markdown keeps the structure (headings, lists, emphasis, tables, code) in a syntax every modern LLM was trained on. The model treats `# H1` as a document title, `## H2` as a section, `**bold**` as emphasis. None of that survives a plain-text export.
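The difference is easy to see side by side. This sketch renders the same three styled paragraphs (hypothetical sample content) both ways: the Markdown output keeps the heading levels, while the plain-text export flattens a section title into a line indistinguishable from body text.

```python
# Same document content, exported two ways.
paragraphs = [
    ("Heading 1", "Release Notes"),
    ("Heading 2", "Bug Fixes"),
    ("Normal", "Fixed a crash on startup."),
]

# Markdown export: heading styles become #-prefixed lines.
md = "\n".join(
    "#" * int(style.split()[1]) + " " + text if style.startswith("Heading")
    else text
    for style, text in paragraphs
)

# Plain-text export: every line is just its text.
txt = "\n".join(text for _, text in paragraphs)

print(md)
print("---")
print(txt)
```

In the plain-text version, nothing tells the model that "Bug Fixes" governs the line after it; in the Markdown version, `## Bug Fixes` does.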
Model-specific guides
- ChatGPT — token economy and custom GPT knowledge bases
- Claude — 200K-context document libraries in Projects
- Gemini — controllable input vs the native .docx path
- RAG — production retrieval over enterprise document corpora
- LangChain and LlamaIndex — code-level integration
Other source modalities: PDF for LLMs, URL for LLMs, Audio for LLMs, Video for LLMs — same principles, different inputs.