Why HTML is a bad LLM input format
HTML interleaves content with presentation: classes, inline styles, ARIA labels, tracking pixels, ad slots, JSON-LD, schema.org microdata. The actual article might be 5% of the bytes. Models can technically parse it, but every token spent on `<div class="kicker-headline-eyebrow">` is a token not spent on understanding the article. And the noise distorts attention: models routinely conflate boilerplate with substance, for instance treating ad copy as if it were a paragraph of the article.
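The "5% of the bytes" claim is easy to check for any given page. A minimal sketch using only Python's standard-library `html.parser` (the class and function names here are illustrative, not from any particular library): collect the visible text, skip `<script>`/`<style>` blocks, and compare its length to the raw HTML.

```python
from html.parser import HTMLParser

class TextRatio(HTMLParser):
    """Collects visible text, skipping script/style, to estimate signal vs. markup."""
    SKIP = {"script", "style", "noscript"}

    def __init__(self):
        super().__init__()
        self.text = []
        self._skip_depth = 0  # >0 while inside a skipped element

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.text.append(data)

def content_ratio(html: str) -> float:
    """Fraction of the raw HTML bytes that is visible text."""
    parser = TextRatio()
    parser.feed(html)
    visible = "".join(parser.text).strip()
    return len(visible) / max(len(html), 1)
```

Even on a tiny synthetic page (one paragraph wrapped in a classed div plus a tracking script), the ratio lands well under 20%; on a real news page with full `<head>`, ad slots, and inlined JSON-LD, it drops far lower.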
Markdown is the inverse: pure structure. `#` means heading. `-` means list. Code is fenced. Tables are tables. Every modern LLM was trained on millions of Markdown documents and reads them as native semantic content.
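The conversion itself is mechanical: walk the HTML tree and emit the Markdown marker for each structural tag, discarding everything else. A toy sketch, again on the stdlib `html.parser` (real converters also handle links, tables, nesting, and whitespace; everything here is illustrative):

```python
from html.parser import HTMLParser

HEADINGS = {"h1", "h2", "h3", "h4", "h5", "h6"}

class HtmlToMarkdown(HTMLParser):
    """Toy HTML-to-Markdown converter: headings, paragraphs, lists, inline code."""

    def __init__(self):
        super().__init__()
        self.out = []

    def handle_starttag(self, tag, attrs):
        if tag in HEADINGS:
            # <h2> becomes "## ", etc. -- attributes are simply dropped
            self.out.append("\n" + "#" * int(tag[1]) + " ")
        elif tag == "p":
            self.out.append("\n")
        elif tag == "li":
            self.out.append("\n- ")
        elif tag == "code":
            self.out.append("`")

    def handle_endtag(self, tag):
        if tag == "code":
            self.out.append("`")
        elif tag == "p" or tag in ("ul", "ol") or tag in HEADINGS:
            self.out.append("\n")

    def handle_data(self, data):
        self.out.append(data.strip())

def to_markdown(html: str) -> str:
    parser = HtmlToMarkdown()
    parser.feed(html)
    return "".join(parser.out).strip()
```

The design point is what the converter does not emit: classes, styles, and script content never reach the output, so the token budget goes entirely to structure and text.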
Model-specific guides
- ChatGPT — bypassing browse-tool failures, token economics
- Claude — Projects-as-web-knowledge-base patterns
- Gemini — controlled input for the 1M context window
- RAG — web-to-knowledge-base scraping pipelines
- LangChain and LlamaIndex — code-level integration
For PDF sources, see PDF to Markdown for LLMs — same principles, different input format.