HTML to Markdown vs HTML to Text: Which Should You Use?
You have an HTML page and you want to extract its content. Two paths: convert to Markdown or convert to plain text. They sound similar but produce very different outputs, and the choice changes everything downstream — what you can do with the result, how AI tools handle it, whether links and tables survive. The right answer depends on what you'll do with the output. Here's the honest breakdown.
Definitions
HTML to Text strips all HTML markup and returns the visible text content as a flat string. Headings become regular text. Lists become paragraphs. Tables become space-or-tab-separated values. Links become bare URLs (or are dropped). Images become their alt text (or are dropped). The output is unambiguously plain text — readable in any editor, but with no structural information.
HTML to Markdown translates HTML into Markdown — a plain text format with structural conventions (# for headings, - for lists, | for tables, [text](url) for links, ![alt](url) for images, > for quotes). The output is also plain text, but it carries the document's structure in a form humans and machines can both read.
Both are conversions from HTML. The difference is what survives the conversion.
Comparison table
| Feature | HTML to Text | HTML to Markdown |
|---|---|---|
| Headings | Flattened to body text | Preserved as #, ##, ### |
| Bulleted lists | Bullets dropped, items become paragraphs | Preserved as - or * |
| Numbered lists | Numbers may survive but list semantics lost | Preserved as 1., 2. |
| Tables | Often unreadable (space/tab-separated) | Preserved with \| separators |
| Links | URL appears as bare text or is dropped | Preserved as [text](url) |
| Images | Alt text or dropped | Preserved as ![alt](url) |
| Code blocks | Flattens into prose | Preserved with triple-backtick fences |
| Quotes | Lost or appear as plain paragraphs | Preserved with > prefix |
| Bold/italic | Lost | Preserved as **bold**, *italic* |
| Document hierarchy | Lost | Preserved via heading levels |
| Token count for LLMs | Lowest | Slightly higher (the #, - markers) |
| Human readability | OK for short text, poor for structured | Good across all content types |
| Machine reasoning quality | Lower (structure has to be inferred) | Higher (structure is explicit) |
Use cases for HTML to Text
Plain text is the right answer when:
- You only need keywords or word counts. Search indexing, frequency analysis, language detection — none of these care about structure.
- You're feeding the content into something that doesn't understand Markdown. Some legacy text-processing pipelines, certain TTS engines, or simple search systems work with raw text only.
- Maximum compactness matters more than structure. Plain text is marginally smaller than Markdown (no #, -, | overhead). For huge corpora where every byte counts, this can matter.
- The source is genuinely unstructured. A page that's a single block of prose with no headings, lists, or tables loses nothing in the text conversion.
For most modern use cases, none of these apply. Plain text is increasingly a niche format.
Use cases for HTML to Markdown
Markdown is the right answer when:
- You're feeding the content to an LLM. ChatGPT, Claude, Gemini, and every other major model read Markdown natively. Structure preservation directly improves answer quality. See why copy-pasting from websites ruins your AI answers.
- You want to keep links. Plain text either drops links or appends raw URLs awkwardly. Markdown's [anchor text](url) syntax keeps both anchor and URL together, readable for humans and machines.
- You're archiving for the long term. A Markdown file is human-readable, structurally rich, future-proof. A plain text file loses information that might matter later. See save web content as Markdown.
- You want to load content into a note tool. Obsidian, Logseq, Notion, Bear — every modern note tool speaks Markdown. Plain text loads but loses the structure these tools rely on.
- You're processing tables or technical documentation. Plain text destroys both. Markdown keeps both intact.
Recommendation by goal
Goal: feed to ChatGPT/Claude/Gemini. Markdown, no question. The quality difference is large. Use URL to Markdown.
Goal: save for later reading or archive. Markdown. You'll thank yourself when you look at the file in five years and headings still make sense.
Goal: load into Obsidian or another note tool. Markdown — these tools are designed for it.
Goal: feed into a legacy text-processing pipeline that requires plain text. Plain text, by necessity. Or convert to Markdown first and post-process.
Goal: simple keyword extraction or search indexing. Either works. Plain text is marginally simpler if you don't need anything else.
Goal: reading the article on a Kindle or e-reader. Markdown converts cleanly to EPUB (for example with Pandoc), and its headings give the e-reader something to navigate by; plain text doesn't.
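For the legacy-pipeline goal above, the "convert to Markdown first and post-process" route can be sketched in a few lines. This is a minimal, regex-based illustration that assumes simple Markdown (no nested lists, tables, or code fences); `markdown_to_plain` is a hypothetical helper name, not part of any library.

```python
import re


def markdown_to_plain(md: str) -> str:
    """Strip common Markdown markers, keeping only the visible words."""
    text = re.sub(r"!\[([^\]]*)\]\([^)]*\)", r"\1", md)               # images -> alt text
    text = re.sub(r"\[([^\]]+)\]\([^)]*\)", r"\1", text)              # links -> anchor text
    text = re.sub(r"^#{1,6}[ \t]+", "", text, flags=re.MULTILINE)     # heading markers
    text = re.sub(r"^>[ \t]?", "", text, flags=re.MULTILINE)          # quote markers
    text = re.sub(r"^[ \t]*[-*][ \t]+", "", text, flags=re.MULTILINE) # bullet markers
    text = re.sub(r"(\*\*|\*)(.+?)\1", r"\2", text)                   # bold / italic
    return text


sample_md = "## Pricing\n\nSee our [documentation](/docs) for **details**.\n\n- one\n- two"
plain = markdown_to_plain(sample_md)
print(plain)
```

Going in this direction is cheap because Markdown is a superset of the information in plain text; the reverse trip (recovering headings and links from flat text) is not possible in general.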
The token cost question
For LLM use, a common worry: "Doesn't Markdown's # and - waste tokens?" Yes, marginally. A Markdown article is typically 2-5% larger than the same article as plain text. The structure markers add a small overhead.
But this is a rounding error compared to what raw HTML would cost (often 5-10x the tokens of Markdown), and it's offset by improved answer quality. The model uses the structural markers as cues — it's not wasted overhead, it's signal. We cover the token economics in detail in why PDF wastes your AI tokens (the pattern transfers to web content).
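The overhead is easy to check against your own content. Here is a rough sketch that uses character counts as a stand-in for tokens; real token counts depend on the model's tokenizer, and the `overhead_percent` helper and the sample strings are illustrative assumptions.

```python
def overhead_percent(plain: str, markdown: str) -> float:
    """Extra size of the Markdown version relative to plain text, in percent."""
    return (len(markdown) - len(plain)) / len(plain) * 100


# The same tiny document in both formats.
plain_version = "Pricing\nOur plans start at $10/mo."
markdown_version = "## Pricing\n\nOur plans start at $10/mo."

print(f"{overhead_percent(plain_version, markdown_version):.1f}% larger")
```

On a two-line snippet like this the percentage comes out far higher than for a full article, because one heading marker is spread over very little body text. The share shrinks as the prose around the markers grows.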
What about HTML to JSON?
A third option some pipelines use: extract HTML to a structured JSON object (e.g., { "title": "...", "sections": [...] }). This is the right format when you need programmatic access to specific fields. It's the wrong format when you want human-readable output or LLM input: JSON's quoting, escaping, and nesting add significant token overhead, and prose buried in string values is harder to read than the same prose in Markdown.
Use JSON for structured data extraction ("give me the price, the rating, the reviewer count"); use Markdown for everything else.
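To show what the JSON route looks like in practice, here is a minimal sketch built on Python's standard-library `html.parser`. It assumes a flat page where `<h1>` is the title and each `<h2>` starts a section; the `SectionExtractor` class and the `heading`/`text` field names are hypothetical, chosen to match the shape mentioned above.

```python
import json
from html.parser import HTMLParser


class SectionExtractor(HTMLParser):
    """Collects an <h1> title and <h2>-delimited sections into a dict."""

    def __init__(self):
        super().__init__()
        self.doc = {"title": "", "sections": []}
        self.tag = None  # the h1/h2/p element we are currently inside

    def handle_starttag(self, tag, attrs):
        if tag in ("h1", "h2", "p"):
            self.tag = tag
            if tag == "h2":
                self.doc["sections"].append({"heading": "", "text": ""})

    def handle_data(self, data):
        if self.tag == "h1":
            self.doc["title"] += data
        elif self.tag == "h2":
            self.doc["sections"][-1]["heading"] += data
        elif self.tag == "p" and self.doc["sections"]:
            self.doc["sections"][-1]["text"] += data

    def handle_endtag(self, tag):
        if tag == self.tag:
            self.tag = None


def extract(html: str) -> dict:
    parser = SectionExtractor()
    parser.feed(html)
    parser.close()
    return parser.doc


doc = extract("<h1>Acme</h1><h2>Pricing</h2><p>Our plans start at $10/mo.</p>")
print(json.dumps(doc, indent=2))
```

The result is convenient for code (`doc["sections"][0]["heading"]`) but noticeably heavier to read than the equivalent two lines of Markdown.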
The simple test
If you're not sure which to pick, ask: "Would I lose anything I care about if I read this output instead of the original page?"
If yes (because headings matter, links matter, tables matter), you want Markdown.
If no (because the page is one block of prose), plain text is fine.
For 90% of modern use cases, the honest answer is Markdown.
What about non-HTML sources?
The same logic applies to PDFs and other formats: convert to Markdown when you want structure preserved, convert to plain text when you only want the words. PDFs in particular benefit enormously from Markdown conversion because they encode structure visually rather than semantically — see PDF to text vs PDF to Markdown for the document version of this same comparison.
Worked examples
To make the difference concrete, three short examples of how the same input HTML produces different outputs in each format.
Example 1: a heading and a paragraph
Input: <h2>Pricing</h2><p>Our plans start at $10/mo.</p>
HTML to Text: Pricing\nOur plans start at $10/mo. — the heading is just another line. Indistinguishable from a one-word paragraph.
HTML to Markdown: ## Pricing\n\nOur plans start at $10/mo. — the heading marker is preserved; an LLM or human reader sees structure immediately.
Example 2: a link inside a sentence
Input: <p>See our <a href="/docs">documentation</a> for details.</p>
HTML to Text: See our documentation for details. — the URL is gone. Nothing in the output shows that a link was ever there, let alone where it pointed.
HTML to Markdown: See our [documentation](/docs) for details. — both anchor and URL preserved.
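Examples 1 and 2 can be reproduced with a toy converter. The sketch below uses only Python's standard-library `html.parser` and handles just headings, paragraphs, and links; `MiniConverter` and `convert` are illustrative names, and a real converter needs far more cases (lists, images, nested inline markup, whitespace handling).

```python
from html.parser import HTMLParser

HEADINGS = ("h1", "h2", "h3", "h4", "h5", "h6")


class MiniConverter(HTMLParser):
    """Toy converter: handles only <h1>-<h6>, <p>, and <a href>."""

    def __init__(self, markdown: bool):
        super().__init__()
        self.markdown = markdown
        self.blocks = []       # finished headings and paragraphs
        self.current = []      # text pieces of the block being built
        self.in_link = False
        self.href = None
        self.link_text = []

    def handle_starttag(self, tag, attrs):
        if tag in HEADINGS and self.markdown:
            # <h2> becomes "## ", <h3> becomes "### ", and so on.
            self.current.append("#" * int(tag[1]) + " ")
        elif tag == "a":
            self.in_link = True
            self.href = dict(attrs).get("href")
            self.link_text = []

    def handle_data(self, data):
        target = self.link_text if self.in_link else self.current
        target.append(data)

    def handle_endtag(self, tag):
        if tag == "a":
            text = "".join(self.link_text)
            if self.markdown and self.href:
                text = f"[{text}]({self.href})"  # keep anchor and URL together
            self.current.append(text)            # text mode keeps the anchor only
            self.in_link = False
        elif tag in HEADINGS or tag == "p":
            self.blocks.append("".join(self.current).strip())
            self.current = []


def convert(html: str, markdown: bool) -> str:
    parser = MiniConverter(markdown)
    parser.feed(html)
    parser.close()
    return ("\n\n" if markdown else "\n").join(parser.blocks)


pricing = "<h2>Pricing</h2><p>Our plans start at $10/mo.</p>"
docs = '<p>See our <a href="/docs">documentation</a> for details.</p>'
for html in (pricing, docs):
    print(repr(convert(html, markdown=False)))
    print(repr(convert(html, markdown=True)))
```

The only difference between the two modes is whether the heading level and the `href` are written out or thrown away, which is exactly the information the text output can never give back.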
Example 3: a comparison table
Input: a 4-row, 3-column HTML table.
HTML to Text: rows and columns mash together with whitespace; the structure is lost. An LLM reading this can usually reconstruct a two-column table, but becomes increasingly error-prone as columns are added, especially when cells are empty or contain spaces.
HTML to Markdown: a clean Markdown table with | separators and a header row. An LLM reading this handles arbitrary column counts.
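Example 3 can be sketched the same way. The table contents below are made up for illustration (4 rows including the header, 3 columns), and the code assumes a simple table with no nesting, `colspan`, or `rowspan`; `TableReader`, `to_text`, and `to_markdown` are hypothetical names.

```python
from html.parser import HTMLParser


class TableReader(HTMLParser):
    """Collects the cell text of a simple table, row by row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self.cell = None  # list of text pieces while inside <th>/<td>

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])
        elif tag in ("th", "td"):
            self.cell = []

    def handle_data(self, data):
        if self.cell is not None:
            self.cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("th", "td") and self.cell is not None:
            self.rows[-1].append("".join(self.cell).strip())
            self.cell = None


def read_rows(html: str) -> list:
    reader = TableReader()
    reader.feed(html)
    reader.close()
    return reader.rows


def to_text(rows: list) -> str:
    # Typical plain-text behaviour: cells joined by whitespace.
    return "\n".join(" ".join(row) for row in rows)


def to_markdown(rows: list) -> str:
    def fmt(row):
        # Escape literal pipes so they are not read as column separators.
        return "| " + " | ".join(c.replace("|", r"\|") for c in row) + " |"

    header, *body = rows
    return "\n".join([fmt(header), "|" + "---|" * len(header), *map(fmt, body)])


table_html = """
<table>
<tr><th>Plan</th><th>Price</th><th>Seats</th></tr>
<tr><td>Free</td><td>$0</td><td>1</td></tr>
<tr><td>Team</td><td>$10</td><td>5</td></tr>
<tr><td>Business</td><td>$25</td><td>20</td></tr>
</table>
"""
rows = read_rows(table_html)
print(to_text(rows))
print(to_markdown(rows))
```

In the text output, a cell such as "Small Team" would be indistinguishable from two separate cells; in the Markdown output the `|` separators keep every cell boundary explicit.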
The cumulative effect across a long document is enormous. Every flattened heading, dropped link, and mashed table compounds, so the plain text of a long page drifts far from the original; Markdown preserves the document essentially intact.
What about Markdown flavors?
One small wrinkle: not all Markdown is identical. Markdown has no single official standard; CommonMark is the most widely adopted formal specification. GitHub Flavored Markdown builds on it with tables, strikethrough, and task lists; some tools support footnotes and definition lists; some support callouts and admonitions. For LLM consumption, none of these differences matter materially — every model reads all common Markdown variants. For round-tripping into a specific tool (Notion, Obsidian, etc.), check that tool's supported flavor and stick to it for cleaner imports.
Migration path
If you have a body of plain-text web archives from years ago, you can mostly leave them alone — they served their purpose. For new content going forward, switching to Markdown is the right default. The marginal cost is essentially zero (the converter does the same work either way) and the marginal benefit accumulates as your archive grows and your AI tooling matures.
The bottom line
Plain text is a legacy format that solved a real problem in an era when many systems couldn't handle structure. Markdown is a more capable format, and nearly all of those systems now handle it. There's no longer a strong reason to default to text — and many strong reasons to default to Markdown. Pick text only when you have a specific constraint that demands it. Pick Markdown for everything else.
Practical FAQ for first-time switchers
People who have spent years defaulting to plain text often have a few small worries when they first switch. Two come up repeatedly:
"Won't the # and - characters look ugly when I read the file?" They don't. After a few minutes of exposure, your eye stops registering the markers as noise. Many text editors fade Markdown punctuation visually so it recedes; some render it inline. Once you've used Markdown for a week, the markers become genuinely invisible — your brain just sees the structure they convey.
"Will my old text-based tools still work?" Almost always yes. Markdown is text — every text editor, search tool, command-line utility, and scripting library reads it with no modification. The structural markers are just additional characters in a text file. You don't lose any tooling by switching.
The transition cost is minimal. The benefits — for AI, for archival, for note tools — are large. The default that made sense in 2005 is no longer the right default in 2026. Pick Markdown unless you have a reason to pick text, and most of the time you'll find you don't.