How HTML to Markdown Conversion Actually Works (Under the Hood)
HTML to Markdown looks like a one-liner — pass tags through a converter and emit text. The reality is a stack of decisions: which DOM nodes count as content, how to flatten arbitrary nesting into Markdown's flat structure, what to do with elements that have no Markdown equivalent, how to preserve semantics when the source HTML mixes presentation and structure. Every "naive" converter falls over on the same handful of cases. Here's what's actually happening under the hood.
The pipeline at 30,000 feet
Every HTML-to-Markdown converter follows the same broad pipeline:
- Parse the HTML into a DOM tree
- Clean the tree (strip scripts, styles, ad scaffolding, optionally identify main content)
- Walk the tree depth-first, emitting Markdown for each node
- Post-process the output (collapse whitespace, fix list nesting, normalize line breaks)
Each stage has failure modes. Naive converters skip stages 2 and 4 entirely; quality converters do real work in all four.
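To make the shape of the pipeline concrete, here's a deliberately tiny sketch of the four stages in Python using BeautifulSoup. The stage functions are illustrative stubs, not any particular library's API; real converters do far more work in stages 2 and 4.

```python
import re
from bs4 import BeautifulSoup, Tag

HEADING_PREFIX = {f"h{i}": "#" * i for i in range(1, 7)}

def clean(dom):
    # Stage 2 (tiny version): drop obvious non-content elements.
    for tag in dom.find_all(["script", "style", "nav", "footer", "aside"]):
        tag.decompose()
    return dom

def walk(node):
    # Stage 3 (tiny version): emit Markdown for a handful of block elements.
    if node.name in HEADING_PREFIX:
        return f"{HEADING_PREFIX[node.name]} {node.get_text(strip=True)}\n\n"
    if node.name == "p":
        return node.get_text(" ", strip=True) + "\n\n"
    return "".join(walk(child) for child in node.children if isinstance(child, Tag))

def post_process(md):
    # Stage 4 (tiny version): collapse blank-line runs, end with one newline.
    return re.sub(r"\n{3,}", "\n\n", md).strip() + "\n"

def html_to_markdown(raw_html: str) -> str:
    dom = BeautifulSoup(raw_html, "html.parser")  # Stage 1: lenient parse
    return post_process(walk(clean(dom)))
```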
Stage 1: parsing
HTML in the wild is messy. Unclosed tags, mismatched nesting, attributes without quotes, character-encoding mistakes, embedded CDATA. The parser has to be lenient — strict XML-style parsing fails on roughly 80% of real web pages. Browser engines and spec-compliant libraries (html5ever, html5lib, the latter usable as a BeautifulSoup backend) implement the HTML5 parsing algorithm, which specifies error recovery for every malformed input; other lenient parsers, like lxml in recovery mode, apply their own heuristics.
The output is a DOM tree. A simple HTML string like `<p>Hello <b>world</b></p>` becomes:

```
document
└── p
    ├── text "Hello "
    └── b
        └── text "world"
```

The text nodes are leaf nodes. The element nodes carry tag names, attributes, and children. Walking this tree is the rest of the pipeline.
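You can poke at that tree directly. Here's what BeautifulSoup reports for the same snippet; the text-node/element-node distinction is exactly what the walker keys on later:

```python
from bs4 import BeautifulSoup, NavigableString

p = BeautifulSoup("<p>Hello <b>world</b></p>", "html.parser").p
for child in p.children:
    if isinstance(child, NavigableString):
        print("text node:   ", repr(str(child)))
    else:
        print("element node:", child.name, "with text", repr(child.get_text()))
# text node:    'Hello '
# element node: b with text 'world'
```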
Stage 2: cleaning and content extraction
This is the stage naive converters skip and quality converters live in. Modern web pages are 80%+ chrome — navigation, ads, sidebars, footers, related-stories cards, newsletter modals. Convert the entire DOM and you get noise. Convert just the article body and you get content.
Common approaches:
- Hard-coded selectors: site-specific recipes ("on nytimes.com, the article is in `section[name=articleBody]`"). Highest quality, doesn't generalize.
- Heuristic content extraction: algorithms like Mozilla Readability score nodes by text density, link density, paragraph density, and surrounding-element type. The highest-scoring subtree is treated as the article body. Used by most general-purpose converters.
- LLM-driven extraction: pass the DOM (or a Markdown rendering of it) to an LLM and ask which subtree is the content. Adapts to weird layouts; slower and more expensive.
Output of this stage: a pruned subtree containing what's plausibly the actual content. Everything else (nav, footer, ads, sidebars) is discarded before walking.
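For a local pipeline, one common way to get this stage is the readability-lxml package, a Python port of the Readability heuristics. A minimal sketch (the file name is a placeholder):

```python
from pathlib import Path
from readability import Document   # pip install readability-lxml

raw_page_html = Path("page.html").read_text()   # full page: nav, ads, article, footer
doc = Document(raw_page_html)
print(doc.title())                 # best-guess article title
article_html = doc.summary()       # HTML fragment: just the plausible article body
# article_html is what stage 3 walks; everything else never reaches the walker
```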
Stage 3: tree walking and element conversion
The walker visits each node depth-first and emits Markdown according to a per-element rule. The conceptual mapping is simple in principle:
| HTML element | Markdown emission |
|---|---|
| `<h1>Title</h1>` | `# Title\n\n` |
| `<h2>Sub</h2>` | `## Sub\n\n` |
| `<p>Text</p>` | `Text\n\n` |
| `<strong>X</strong>` | `**X**` |
| `<em>X</em>` | `*X*` |
| `<a href="...">X</a>` | `[X](...)` |
| `<ul><li>...</li></ul>` | `- ...` |
| `<ol><li>...</li></ol>` | `1. ...` |
| `<code>X</code>` | `` `X` `` |
| `<pre><code>X</code></pre>` | ```` ```\nX\n``` ```` |
| `<img src="u" alt="a">` | `![a](u)` |
| `<blockquote>X</blockquote>` | `> X` |
Easy in isolation. The problems start when elements compose.
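Composition is usually handled by converting a node's children first and then wrapping the result with the node's own rule, so a <strong> inside an <a> ends up bold inside the link text. A sketch of that dispatch, with a deliberately abbreviated rule set:

```python
from bs4 import BeautifulSoup, NavigableString, Tag

RULES = {
    "h1": lambda text, node: f"# {text}\n\n",
    "h2": lambda text, node: f"## {text}\n\n",
    "p": lambda text, node: f"{text}\n\n",
    "strong": lambda text, node: f"**{text}**",
    "em": lambda text, node: f"*{text}*",
    "a": lambda text, node: f"[{text}]({node.get('href', '')})",
    "code": lambda text, node: f"`{text}`",
    "img": lambda text, node: f"![{node.get('alt', '')}]({node.get('src', '')})",
}

def emit(node):
    # Depth-first: convert the children, then apply this node's rule to the result.
    parts = []
    for child in node.children:
        if isinstance(child, NavigableString):
            parts.append(str(child))
        elif isinstance(child, Tag):
            parts.append(emit(child))
    text = "".join(parts)
    rule = RULES.get(node.name)
    return rule(text, node) if rule else text

print(emit(BeautifulSoup('<p>See <a href="/docs">the <strong>docs</strong></a></p>',
                         "html.parser")))
# See [the **docs**](/docs)
```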
The hard cases
Nested lists
HTML allows arbitrary nesting; Markdown lists rely on indentation. The walker has to track current depth and emit the right number of spaces (typically 2 per level for unordered, 3 for ordered). Mixed nesting (a <ul> inside an <ol> inside another <ul>) requires careful state. Many naive converters lose nesting and flatten everything to one level.
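A sketch of that state tracking: each recursion level indents by the width of the parent's marker plus the space after it, which is what keeps nested content aligned under its item.

```python
from bs4 import BeautifulSoup

def convert_list(list_tag, indent=0):
    ordered = list_tag.name == "ol"
    lines = []
    for i, li in enumerate(list_tag.find_all("li", recursive=False), start=1):
        marker = f"{i}." if ordered else "-"
        # Text belonging directly to this <li>, excluding any nested list.
        own_text = "".join(li.find_all(string=True, recursive=False)).strip()
        lines.append(f"{' ' * indent}{marker} {own_text}")
        for sub in li.find_all(["ul", "ol"], recursive=False):
            lines.append(convert_list(sub, indent + len(marker) + 1))
    return "\n".join(lines)

html = "<ul><li>outer<ol><li>first inner</li><li>second inner</li></ol></li></ul>"
print(convert_list(BeautifulSoup(html, "html.parser").ul))
# - outer
#   1. first inner
#   2. second inner
```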
Tables
HTML tables are arbitrarily complex (rowspan, colspan, nested tables, header rows in <tfoot>). Markdown tables are a single grid with optional alignment. Conversion is lossy:
- Rowspan/colspan: no Markdown equivalent. Quality converters duplicate cell content; naive ones drop or merge cells.
- Nested tables: no equivalent. Quality converters flatten with awkward formatting; naive ones produce broken output.
- Multi-row headers: no equivalent. Quality converters concatenate header rows with separators; naive ones pick one and drop the rest.
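The duplicate-content strategy for rowspan/colspan is usually implemented by filling a virtual grid cell by cell: a spanning cell claims every position it covers, so the resulting Markdown table stays rectangular. A sketch (nested tables are ignored here):

```python
from bs4 import BeautifulSoup

def table_to_grid(table):
    grid = {}  # (row, col) -> cell text
    for r, tr in enumerate(table.find_all("tr")):
        c = 0
        for cell in tr.find_all(["td", "th"], recursive=False):
            while (r, c) in grid:          # skip columns claimed by a rowspan above
                c += 1
            text = cell.get_text(" ", strip=True)
            for dr in range(int(cell.get("rowspan", 1))):
                for dc in range(int(cell.get("colspan", 1))):
                    grid[(r + dr, c + dc)] = text   # duplicate into every covered slot
            c += int(cell.get("colspan", 1))
    rows = max(r for r, _ in grid) + 1
    cols = max(c for _, c in grid) + 1
    return [[grid.get((r, c), "") for c in range(cols)] for r in range(rows)]
```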
Code blocks
Detecting code is straightforward when the source uses <pre><code>. Detecting the language is harder. The information might be in:
- A `class="language-python"` attribute (CommonMark convention)
- A `data-lang` attribute (some highlighters)
- A wrapping `<div class="highlight-python">`
- Sibling elements (a label above the block)
- Nowhere (the page relied on JS-driven highlighting at load time)
Quality converters check all of these in priority order. Naive converters emit fenced blocks with no language tag, which downstream tools (and LLMs) then guess at.
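A sketch of that priority order, assuming BeautifulSoup tags; the class-name patterns cover the conventions listed above, though real pages invent more:

```python
import re

def detect_language(code_tag):
    # 1. class="language-python" / class="lang-python" on <code> or its <pre> parent
    for el in (code_tag, code_tag.parent):
        for cls in (el.get("class", []) if el else []):
            match = re.match(r"(?:language|lang)-(\w+)", cls)
            if match:
                return match.group(1)
    # 2. data-lang attribute used by some highlighters
    if code_tag.get("data-lang"):
        return code_tag["data-lang"]
    # 3. a wrapping <div class="highlight-python">
    wrapper = code_tag.find_parent("div", class_=re.compile(r"highlight-\w+"))
    if wrapper:
        return re.search(r"highlight-(\w+)", " ".join(wrapper["class"])).group(1)
    return ""  # no hint found: emit a bare fence and let downstream tools guess
```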
Inline styles
HTML allows `<span style="font-weight: bold">X</span>` as a presentational stand-in for `<strong>`. Markdown has no class/style mechanism — the converter has to either translate the style to `**X**` or drop it. Same for `<span style="font-style: italic">`, color, font-family, custom alignment. Quality converters parse the style attribute and translate where there's a Markdown equivalent. Naive converters drop everything.
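A sketch of the translate-or-drop decision for a styled span; only declarations with a Markdown equivalent survive, everything else is discarded while the text is kept:

```python
def convert_styled_span(node, text):
    style = node.get("style", "") or ""
    decls = {}
    for decl in style.split(";"):
        if ":" in decl:
            prop, value = decl.split(":", 1)
            decls[prop.strip().lower()] = value.strip().lower()
    if decls.get("font-weight") in ("bold", "bolder", "700", "800", "900"):
        return f"**{text}**"
    if decls.get("font-style") == "italic":
        return f"*{text}*"
    return text  # no Markdown equivalent: keep the text, drop the styling
```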
Whitespace
HTML collapses whitespace between text nodes; Markdown is whitespace-sensitive (blank lines separate blocks, two trailing spaces force a line break). The walker has to track whitespace deliberately — strip leading/trailing space inside inline elements but preserve it between block elements; collapse runs of whitespace into single spaces but preserve the blank lines that separate blocks. Easy to get subtly wrong.
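A sketch of the inline side of that normalization, mirroring what an HTML renderer does before text reaches the page:

```python
import re

def normalize_inline_text(raw: str) -> str:
    # Collapse runs of whitespace (spaces, tabs, newlines inside a paragraph)
    # into single spaces, the way an HTML renderer would before display.
    return re.sub(r"\s+", " ", raw)

# Block-level separation is the walker's job, not the text node's: each block
# rule appends "\n\n" after its content (see the dispatch sketch above).
```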
Inline HTML inside Markdown
Markdown allows raw HTML to pass through. Quality converters use this for elements with no Markdown equivalent (<sub>, <sup>, <mark>, <kbd>). Naive converters drop these. The trade-off: passing HTML through preserves semantics but produces output that isn't pure Markdown — downstream tools that strip HTML will lose the content.
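A sketch of that fallback as a hypothetical catch-all rule in a walker like the one sketched earlier:

```python
PASSTHROUGH = {"sub", "sup", "mark", "kbd"}

def emit_unknown(node, text):
    # text = the already-converted children of this node
    if node.name in PASSTHROUGH:
        return str(node)   # keep the raw HTML; most Markdown renderers pass it through
    return text            # no equivalent worth preserving: drop the tag, keep the text
```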
Stage 4: post-processing
The walker emits a stream of Markdown that's syntactically correct but visually awkward — blank lines in the wrong places, list indentation inconsistencies, trailing whitespace on lines. Post-processing fixes:
- Collapse three or more consecutive blank lines to two
- Strip trailing whitespace on lines (unless it's an intentional two-space line break)
- Ensure code fences are surrounded by blank lines
- Fix list indentation if the source had irregular spacing
- Normalize line endings (LF, not CRLF)
- Optionally enforce a Markdown dialect (GFM tables, CommonMark headings, etc.)
Skipping post-processing produces output that's technically valid Markdown but renders inconsistently across viewers. Quality converters always do this stage.
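A sketch of a few of those passes. Note that a real implementation has to skip fenced code blocks (blank lines inside a fence are content) and decide whether to preserve intentional two-space line breaks, which this version does not:

```python
import re

def post_process(markdown: str) -> str:
    markdown = markdown.replace("\r\n", "\n")                # CRLF -> LF
    markdown = re.sub(r"[ \t]+$", "", markdown, flags=re.M)  # trailing whitespace per line
    markdown = re.sub(r"\n{3,}", "\n\n", markdown)           # 3+ newlines -> one blank line
    return markdown.strip("\n") + "\n"                       # exactly one trailing newline
```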
Why naive html2text falls short on modern sites
The classic html2text Python library implements stages 1, 3, and most of 4. It does not implement stage 2 (cleaning and content extraction). When you feed it raw HTML from a modern site, the output contains:
- Site navigation menu as a long list at the top
- Sidebar widgets and ad slots inline with content
- Footer copyright and link farm at the bottom
- The actual article body buried somewhere in the middle
The library is doing exactly what it says — converting HTML to Markdown. The problem is that "the HTML" of a modern web page is mostly chrome. To get usable output you have to pre-clean (run Readability, hand-pick a CSS selector, or use a content-extraction service) before passing to html2text.
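The usual local workaround is to chain the two: run Readability for stage 2, then hand the pruned fragment to html2text for stages 3 and 4. A sketch (the URL is a placeholder):

```python
import html2text                   # pip install html2text
import requests
from readability import Document   # pip install readability-lxml

raw_html = requests.get("https://example.com/article").text
article_html = Document(raw_html).summary()   # stage 2: keep only the article body

converter = html2text.HTML2Text()
converter.body_width = 0                      # don't hard-wrap output lines
markdown = converter.handle(article_html)     # stages 3-4 on the cleaned fragment
```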
Hosted converters (MDisBetter, Firecrawl, Jina Reader) all bundle stage 2 as part of the service. That's why they produce cleaner output without configuration. See our 8-tool benchmark for how each stacks up.
JavaScript-rendered pages add a stage 0
For SPAs and JS-heavy sites, none of the above runs against meaningful HTML — the source HTML is a near-empty shell that loads content via JS after page load. Before any of the four stages can run, you need a stage 0: render the JS, wait for content to appear, then capture the post-render HTML.
This requires a headless browser (Playwright, Puppeteer) or a service that runs one for you. Local libraries like html2text don't include this; hosted services do. We cover the technical specifics in handling JavaScript-rendered pages.
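A sketch of stage 0 with Playwright's sync API; the selector that signals "content has appeared" is site-specific and a placeholder here:

```python
from playwright.sync_api import sync_playwright   # pip install playwright

def fetch_rendered_html(url: str) -> str:
    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page()
        page.goto(url, wait_until="networkidle")  # let JS-driven requests settle
        page.wait_for_selector("article")         # placeholder: whatever marks real content
        html = page.content()                     # serialize the post-render DOM
        browser.close()
    return html
```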
What a quality converter does differently
Compared to naive HTML-to-Markdown:
- Renders JS first (when needed) so the DOM contains actual content
- Runs content extraction (Readability or LLM-based) to identify the article body
- Strips known noise patterns (cookie banners, newsletter modals, related-stories cards)
- Walks the cleaned subtree with rules for every common element type
- Handles edge cases (nested lists, tables with rowspan, inline styles, code-block language detection)
- Post-processes the output for consistent formatting
None of this is novel — every quality converter does roughly the same thing. The differentiator is execution: how good are the heuristics, how broad are the element-handling rules, how deep does the JS rendering go. Real-world quality varies more than you'd expect from "everyone implements the same algorithm."
For practical use, see our URL-to-Markdown web tool for ad-hoc conversions, or pair Trafilatura with the chunking patterns in scrape a website to Markdown for RAG for embedding pipelines.