· 7 min read · MDisBetter

HTML to Markdown vs HTML to Text: Which Should You Use?

You have an HTML page and you want to extract its content. Two paths: convert to Markdown or convert to plain text. They sound similar but produce very different outputs, and the choice changes everything downstream — what you can do with the result, how AI tools handle it, whether links and tables survive. The right answer depends on what you'll do with the output. Here's the honest breakdown.

Definitions

HTML to Text strips all HTML markup and returns the visible text content as a flat string. Headings become regular text. Lists become paragraphs. Tables become space-or-tab-separated values. Links become bare URLs (or are dropped). Images become their alt text (or are dropped). The output is unambiguously plain text — readable in any editor, but with no structural information.

HTML to Markdown translates HTML into Markdown — a plain text format with structural conventions (# for headings, - for lists, | for tables, [text](url) for links, ![alt](src) for images, > for quotes). The output is also plain text, but it carries the document's structure in a form humans and machines can both read.

Both are conversions from HTML. The difference is what survives the conversion.

Comparison table

| Feature | HTML to Text | HTML to Markdown |
| --- | --- | --- |
| Headings | Flattened to body text | Preserved as #, ##, ### |
| Bulleted lists | Bullets dropped, items become paragraphs | Preserved as - or * |
| Numbered lists | Numbers may survive but list semantics lost | Preserved as 1., 2. |
| Tables | Often unreadable (space/tab-separated) | Preserved with \| separators |
| Links | URL appears as bare text or is dropped | Preserved as [text](url) |
| Images | Alt text or dropped | Preserved as ![alt](src) |
| Code blocks | Flattened into prose | Preserved with triple-backtick fences |
| Quotes | Lost or appear as plain paragraphs | Preserved with > prefix |
| Bold/italic | Lost | Preserved as **bold**, *italic* |
| Document hierarchy | Lost | Preserved via heading levels |
| Token count for LLMs | Lowest | Slightly higher (the #, - markers) |
| Human readability | OK for short text, poor for structured | Good across all content types |
| Machine reasoning quality | Lower (structure has to be inferred) | Higher (structure is explicit) |

Use cases for HTML to Text

Plain text is the right answer when:

For most modern use cases, none of these apply. Plain text is increasingly a niche format.

Use cases for HTML to Markdown

Markdown is the right answer when:

Recommendation by goal

Goal: feed to ChatGPT/Claude/Gemini. Markdown, no question. The quality difference is large. Use URL to Markdown.

Goal: save for later reading or archive. Markdown. You'll thank yourself when you look at the file in five years and headings still make sense.

Goal: load into Obsidian or another note tool. Markdown — these tools are designed for it.

Goal: feed into a legacy text-processing pipeline that requires plain text. Plain text, by necessity. Or convert to Markdown first and post-process.

Goal: simple keyword extraction or search indexing. Either works. Plain text is marginally simpler if you don't need anything else.

Goal: reading the article on a Kindle or e-reader. Markdown. It converts cleanly to EPUB with its headings intact; plain text produces an e-book with no chapter structure.

The token cost question

For LLM use, a common worry: "Doesn't Markdown's # and - waste tokens?" Yes, marginally. A Markdown article is typically 2-5% larger than the same article as plain text. The structure markers add a small overhead.

But this is a rounding error compared to what raw HTML would cost (often 5-10x the tokens of Markdown), and it's offset by improved answer quality. The model uses the structural markers as cues — it's not wasted overhead, it's signal. We cover the token economics in detail in why PDF wastes your AI tokens (the pattern transfers to web content).
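To put rough numbers on this, the sketch below applies the ratios quoted above to a hypothetical 10,000-token plain-text article. The function name and the 10,000 figure are illustrative assumptions, and real ratios vary by page and by tokenizer.

```python
def estimate_token_ranges(text_tokens: int) -> dict:
    """Apply the article's rough ratios to a plain-text token count."""
    # Markdown: roughly 2-5% more than plain text (the structure markers).
    md_low = text_tokens * 102 // 100
    md_high = text_tokens * 105 // 100
    # Raw HTML: often 5-10x the Markdown figure.
    return {
        "text": (text_tokens, text_tokens),
        "markdown": (md_low, md_high),
        "html": (md_low * 5, md_high * 10),
    }


for fmt, (low, high) in estimate_token_ranges(10_000).items():
    print(f"{fmt:>8}: {low:,} to {high:,} tokens")
```

Under these assumptions, Markdown adds a few hundred tokens over plain text, while raw HTML adds tens of thousands.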

What about HTML to JSON?

A third option some pipelines use: extract HTML to a structured JSON object (e.g., { "title": "...", "sections": [...] }). This is the right format when you need programmatic access to specific fields. It's the wrong format when you want human-readable output or LLM input: the braces, quotes, and repeated key names add significant token overhead and bury the prose in syntax.

Use JSON for structured data extraction ("give me the price, the rating, the reviewer count"); use Markdown for everything else.
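As a small illustration of that split, the sketch below holds the same content as a JSON object and as Markdown. The document shape follows the title/sections example above; the `heading` and `body` keys and the `to_markdown` helper are assumptions made for this example, and character counts are only a rough proxy for tokens.

```python
import json

# Hypothetical document shape, following the { "title", "sections" } example.
article = {
    "title": "Pricing",
    "sections": [
        {"heading": "Plans", "body": "Our plans start at $10/mo."},
    ],
}


def to_markdown(doc: dict) -> str:
    """Render the structured document as Markdown."""
    lines = [f"# {doc['title']}"]
    for section in doc["sections"]:
        lines += ["", f"## {section['heading']}", "", section["body"]]
    return "\n".join(lines)


as_json = json.dumps(article, indent=2)
as_markdown = to_markdown(article)

# JSON gives direct field access; Markdown gives a compact, readable document.
print(json.loads(as_json)["sections"][0]["heading"])
print(len(as_json), "characters as JSON vs", len(as_markdown), "as Markdown")
```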

The simple test

If you're not sure which to pick, ask: "Would I lose anything I care about if I read this output instead of the original page?"

If yes (because headings matter, links matter, tables matter), you want Markdown.

If no (because the page is one block of prose), plain text is fine.

For 90% of modern use cases, the honest answer is Markdown.
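One way to approximate this test in code, purely as a heuristic, is to check whether the page contains any markup that plain text would throw away. The tag list and the `recommend_format` name below are assumptions for illustration, not a standard rule.

```python
from collections import Counter
from html.parser import HTMLParser

# Tags whose meaning is lost when a page is flattened to plain text.
STRUCTURAL_TAGS = {
    "h1", "h2", "h3", "h4", "h5", "h6",
    "a", "img", "table", "ul", "ol", "pre", "blockquote",
}


class TagCounter(HTMLParser):
    """Counts how often each structural tag opens in the document."""

    def __init__(self):
        super().__init__()
        self.counts = Counter()

    def handle_starttag(self, tag, attrs):
        if tag in STRUCTURAL_TAGS:
            self.counts[tag] += 1


def recommend_format(html: str) -> str:
    counter = TagCounter()
    counter.feed(html)
    counter.close()
    # Any structural tag at all means plain text would lose something.
    return "markdown" if counter.counts else "text"


print(recommend_format("<p>Just one block of prose.</p>"))
print(recommend_format("<h2>Pricing</h2><p>Our plans start at $10/mo.</p>"))
```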

What about non-HTML sources?

The same logic applies to PDFs and other formats: convert to Markdown when you want structure preserved, convert to plain text when you only want the words. PDFs in particular benefit enormously from Markdown conversion because they encode structure visually rather than semantically — see PDF to text vs PDF to Markdown for the document version of this same comparison.

Worked examples

To make the difference concrete, three short examples of how the same input HTML produces different outputs in each format.

Example 1: a heading and a paragraph

Input: <h2>Pricing</h2><p>Our plans start at $10/mo.</p>

HTML to Text: Pricing\nOur plans start at $10/mo. — the heading is just another line. Indistinguishable from a one-word paragraph.

HTML to Markdown: ## Pricing\n\nOur plans start at $10/mo. — the heading marker is preserved; an LLM or human reader sees structure immediately.

Example 2: a link inside a sentence

Input: <p>See our <a href="/docs">documentation</a> for details.</p>

HTML to Text: See our documentation for details. (The URL is gone, and nothing in the output shows that a link was ever there.)

HTML to Markdown: See our [documentation](/docs) for details. — both anchor and URL preserved.
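Examples 1 and 2 can be reproduced with a minimal sketch built on Python's standard-library `html.parser`. The class and function names are hypothetical, and the converter handles only headings, paragraphs, and links; a real converter covers far more of HTML.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Keeps only visible text; block elements end with a line break."""

    BLOCK_TAGS = {"h1", "h2", "h3", "p"}

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_endtag(self, tag):
        if tag in self.BLOCK_TAGS:
            self.parts.append("\n")

    def handle_data(self, data):
        self.parts.append(data)


class MarkdownConverter(HTMLParser):
    """Emits Markdown markers for headings and links."""

    HEADINGS = {"h1": "#", "h2": "##", "h3": "###"}

    def __init__(self):
        super().__init__()
        self.parts = []
        self.href = None

    def handle_starttag(self, tag, attrs):
        if tag in self.HEADINGS:
            self.parts.append(self.HEADINGS[tag] + " ")
        elif tag == "a":
            self.href = dict(attrs).get("href", "")
            self.parts.append("[")

    def handle_endtag(self, tag):
        if tag == "a":
            self.parts.append(f"]({self.href})")
            self.href = None
        elif tag in self.HEADINGS or tag == "p":
            self.parts.append("\n\n")

    def handle_data(self, data):
        self.parts.append(data)


def convert(parser_cls, html: str) -> str:
    parser = parser_cls()
    parser.feed(html)
    parser.close()
    return "".join(parser.parts).strip()


def html_to_text(html: str) -> str:
    return convert(TextExtractor, html)


def html_to_markdown(html: str) -> str:
    return convert(MarkdownConverter, html)


link = '<p>See our <a href="/docs">documentation</a> for details.</p>'
print(html_to_text(link))      # See our documentation for details.
print(html_to_markdown(link))  # See our [documentation](/docs) for details.
```

The two parsers see exactly the same events; the only difference is whether the tag information is written out or discarded.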

Example 3: a comparison table

Input: a 4-row, 3-column HTML table.

HTML to Text: rows and columns mash together with whitespace; the structure is lost. An LLM reading this can often reconstruct a simple two-column table, but column boundaries become increasingly ambiguous with three or more columns or with multi-word cells.

HTML to Markdown: a clean Markdown table with | separators and a header row. Column boundaries stay explicit no matter how many columns the table has.
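A minimal sketch of Example 3, again using only the standard library. The sample table is hypothetical (the example above does not specify its contents), the helper names are assumptions, and the parser expects one simple, well-formed table with a header row.

```python
from html.parser import HTMLParser

# Hypothetical sample data: 4 rows (including the header) by 3 columns.
SAMPLE = """
<table>
  <tr><th>Plan</th><th>Price</th><th>Seats</th></tr>
  <tr><td>Free</td><td>$0</td><td>1 seat</td></tr>
  <tr><td>Pro Plus</td><td>$10</td><td>5 seats</td></tr>
  <tr><td>Team</td><td>$25</td><td>20 seats</td></tr>
</table>
"""


class TableParser(HTMLParser):
    """Collects cell text row by row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self.cell = None  # text fragments of the cell being read, if any

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])
        elif tag in ("th", "td") and self.rows:
            self.cell = []

    def handle_endtag(self, tag):
        if tag in ("th", "td") and self.cell is not None:
            self.rows[-1].append("".join(self.cell).strip())
            self.cell = None

    def handle_data(self, data):
        if self.cell is not None:
            self.cell.append(data)


def parse_rows(html: str) -> list:
    parser = TableParser()
    parser.feed(html)
    parser.close()
    return parser.rows


def table_to_text(html: str) -> str:
    # Cells joined by single spaces: column boundaries are no longer explicit.
    return "\n".join(" ".join(row) for row in parse_rows(html))


def table_to_markdown(html: str) -> str:
    header, *body = parse_rows(html)

    def line(cells):
        # Escape literal pipes so they are not read as column separators.
        return "| " + " | ".join(c.replace("|", "\\|") for c in cells) + " |"

    rows = [line(header), line(["---"] * len(header))]
    return "\n".join(rows + [line(r) for r in body])


print(table_to_text(SAMPLE))
print(table_to_markdown(SAMPLE))
```

In the plain-text output, the row "Pro Plus $10 5 seats" no longer says whether "Plus" belongs to the plan name or the price; the Markdown row "| Pro Plus | $10 | 5 seats |" does.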

The cumulative effect across a long document is large. Every flattened heading, dropped link, and collapsed table removes a little more of the original structure from the plain-text version, while Markdown preserves the document essentially intact.

What about Markdown flavors?

One small wrinkle: not all Markdown is identical. CommonMark is the most widely adopted formal specification; GitHub Flavored Markdown extends it with tables, strikethrough, and task lists (so tables, strictly speaking, are an extension rather than core Markdown); some tools support footnotes and definition lists; some support callouts and admonitions. For LLM consumption, these differences rarely matter, because current models read all the common Markdown variants. For round-tripping into a specific tool (Notion, Obsidian, etc.), check that tool's supported flavor and stick to it for cleaner imports.

Migration path

If you have a body of plain-text web archives from years ago, you can mostly leave them alone — they served their purpose. For new content going forward, switching to Markdown is the right default. The marginal cost is essentially zero (the converter does the same work either way) and the marginal benefit accumulates as your archive grows and your AI tooling matures.

The bottom line

Plain text is a legacy format that solved a real problem in an era when many systems couldn't handle structure. Markdown is a more capable format, and nearly all of those systems handle it today. There's no longer a strong reason to default to text, and there are many strong reasons to default to Markdown. Pick text only when you have a specific constraint that demands it. Pick Markdown for everything else.

Practical FAQ for first-time switchers

People who have spent years defaulting to plain text often have a few small worries when they first switch. Two come up repeatedly:

"Won't the # and - characters look ugly when I read the file?" Not for long. After a few minutes of exposure, your eye stops registering the markers as noise. Many text editors fade Markdown punctuation visually so it recedes; some render it inline. Once you've used Markdown for a week, the markers fade into the background and you mostly see the structure they convey.

"Will my old text-based tools still work?" Almost always yes. Markdown is text — every text editor, search tool, command-line utility, and scripting library reads it with no modification. The structural markers are just additional characters in a text file. You don't lose any tooling by switching.
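As a quick illustration, ordinary string and regex tools work on a Markdown file unchanged, and the markers make a few extra queries possible. The sample document below is made up for this sketch, and the patterns are deliberately simple: they would also match lines inside fenced code blocks, for instance.

```python
import re

# Hypothetical Markdown file contents.
markdown = """# Guide

See the [documentation](/docs) for details.

## Pricing

Our plans start at $10/mo.
"""

# A plain substring search works exactly as it would on plain text.
matches = [line for line in markdown.splitlines() if "plans" in line]

# The markers also allow structural queries with the same tools.
headings = re.findall(r"^(#{1,6}) (.+)$", markdown, flags=re.MULTILINE)
links = re.findall(r"\[([^\]]+)\]\(([^)]+)\)", markdown)

print(matches)   # ['Our plans start at $10/mo.']
print(headings)  # [('#', 'Guide'), ('##', 'Pricing')]
print(links)     # [('documentation', '/docs')]
```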

The transition cost is minimal. The benefits — for AI, for archival, for note tools — are large. The default that made sense in 2005 is no longer the right default in 2026. Pick Markdown unless you have a reason to pick text, and most of the time you'll find you don't.

Frequently asked questions

Is Markdown harder to read than plain text?

For prose, no. The markers (# and -) are unobtrusive and most readers learn to skim past them within minutes. For complex documents (lots of tables, code), Markdown is significantly easier to read than the equivalent plain text because structure is visible.

Can I convert Markdown back to HTML?

Yes, trivially: every static site generator, documentation tool, and many text editors do this. The conversion is lossless for standard Markdown. Plain text to HTML is much harder because you'd have to re-infer the structure that was destroyed.

Which format is more compatible across software?

Plain text is the absolute lowest common denominator: every text editor on every platform reads it. Markdown is the de facto standard for any modern note tool, documentation system, or developer toolchain. For practical purposes today, both are universally readable.