How HTML to Markdown Conversion Actually Works (Under the Hood)
HTML to Markdown looks like a one-liner — pass tags through a converter and emit text. The reality is a stack of decisions: which DOM nodes count as content, how to flatten arbitrary nesting into Markdown's flat structure, what to do with elements that have no Markdown equivalent, how to preserve semantics when the source HTML mixes presentation and structure. Every "naive" converter falls over on the same handful of cases. Here's what's actually happening under the hood.
The pipeline at 30,000 feet
Every HTML-to-Markdown converter follows the same broad pipeline:
- Parse the HTML into a DOM tree
- Clean the tree (strip scripts, styles, ad scaffolding, optionally identify main content)
- Walk the tree depth-first, emitting Markdown for each node
- Post-process the output (collapse whitespace, fix list nesting, normalize line breaks)
Each stage has failure modes. Naive converters skip stages 2 and 4 entirely; quality converters do real work in all four.
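To make the shape of the pipeline concrete, here's a deliberately tiny sketch of the four stages in Python using BeautifulSoup. The stage functions are illustrative stubs, not any particular library's API; real converters do far more work in stages 2 and 4.

```python
import re
from bs4 import BeautifulSoup, Tag

HEADING_PREFIX = {f"h{i}": "#" * i for i in range(1, 7)}

def clean(dom):
    # Stage 2 (tiny version): drop obvious non-content elements.
    for tag in dom.find_all(["script", "style", "nav", "footer", "aside"]):
        tag.decompose()
    return dom

def walk(node):
    # Stage 3 (tiny version): emit Markdown for a handful of block elements.
    if node.name in HEADING_PREFIX:
        return f"{HEADING_PREFIX[node.name]} {node.get_text(strip=True)}\n\n"
    if node.name == "p":
        return node.get_text(" ", strip=True) + "\n\n"
    return "".join(walk(child) for child in node.children if isinstance(child, Tag))

def post_process(md):
    # Stage 4 (tiny version): collapse blank-line runs, end with one newline.
    return re.sub(r"\n{3,}", "\n\n", md).strip() + "\n"

def html_to_markdown(raw_html: str) -> str:
    dom = BeautifulSoup(raw_html, "html.parser")  # Stage 1: lenient parse
    return post_process(walk(clean(dom)))
```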
Stage 1: parsing
HTML in the wild is messy. Unclosed tags, mismatched nesting, attributes without quotes, character-encoding mistakes, embedded CDATA. The parser has to be lenient — strict XML-style parsing fails on roughly 80% of real web pages. Browser engines and spec-compliant libraries (html5ever, html5lib, the latter usable as a BeautifulSoup backend) implement the HTML5 parsing algorithm, which specifies error recovery for every malformed input; other lenient parsers, like lxml in recovery mode, apply their own heuristics.
The output is a DOM tree. A simple HTML string like `<p>Hello <b>world</b></p>` becomes:

```
document
└── p
    ├── text "Hello "
    └── b
        └── text "world"
```

The text nodes are leaf nodes. The element nodes carry tag names, attributes, and children. Walking this tree is the rest of the pipeline.
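You can poke at that tree directly. Here's what BeautifulSoup reports for the same snippet; the text-node/element-node distinction is exactly what the walker keys on later:

```python
from bs4 import BeautifulSoup, NavigableString

p = BeautifulSoup("<p>Hello <b>world</b></p>", "html.parser").p
for child in p.children:
    if isinstance(child, NavigableString):
        print("text node:   ", repr(str(child)))
    else:
        print("element node:", child.name, "with text", repr(child.get_text()))
# text node:    'Hello '
# element node: b with text 'world'
```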
Stage 2: cleaning and content extraction
This is the stage naive converters skip and quality converters live in. Modern web pages are 80%+ chrome — navigation, ads, sidebars, footers, related-stories cards, newsletter modals. Convert the entire DOM and you get noise. Convert just the article body and you get content.
Common approaches:
- Hard-coded selectors: site-specific recipes ("on nytimes.com, the article is in `section[name=articleBody]`"). Highest quality, doesn't generalize.
- Heuristic content extraction: algorithms like Mozilla Readability score nodes by text density, link density, paragraph density, and surrounding-element type. The highest-scoring subtree is treated as the article body. Used by most general-purpose converters.
- LLM-driven extraction: pass the DOM (or a Markdown rendering of it) to an LLM and ask which subtree is the content. Adapts to weird layouts; slower and more expensive.
Output of this stage: a pruned subtree containing what's plausibly the actual content. Everything else (nav, footer, ads, sidebars) is discarded before walking.
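For a local pipeline, one common way to get this stage is the readability-lxml package, a Python port of the Readability heuristics. A minimal sketch (the file name is a placeholder):

```python
from pathlib import Path
from readability import Document   # pip install readability-lxml

raw_page_html = Path("page.html").read_text()   # full page: nav, ads, article, footer
doc = Document(raw_page_html)
print(doc.title())                 # best-guess article title
article_html = doc.summary()       # HTML fragment: just the plausible article body
# article_html is what stage 3 walks; everything else never reaches the walker
```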
Stage 3: tree walking and element conversion
The walker visits each node depth-first and emits Markdown according to a per-element rule. The conceptual mapping is simple in principle:
| HTML element | Markdown emission |
|---|---|
| `<h1>Title</h1>` | `# Title\n\n` |
| `<h2>Sub</h2>` | `## Sub\n\n` |
| `<p>Text</p>` | `Text\n\n` |
| `<strong>X</strong>` | `**X**` |
| `<em>X</em>` | `*X*` |
| `<a href="...">X</a>` | `[X](...)` |
| `<ul><li>...</li></ul>` | `- ...` |
| `<ol><li>...</li></ol>` | `1. ...` |
| `<code>X</code>` | `` `X` `` |
| `<pre><code>X</code></pre>` | ```` ```\nX\n``` ```` |
| `<img src="u" alt="a">` | `![a](u)` |
| `<blockquote>X</blockquote>` | `> X` |
Easy in isolation. The problems start when elements compose.
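Composition is usually handled by converting a node's children first and then wrapping the result with the node's own rule, so a <strong> inside an <a> ends up bold inside the link text. A sketch of that dispatch, with a deliberately abbreviated rule set:

```python
from bs4 import BeautifulSoup, NavigableString, Tag

RULES = {
    "h1": lambda text, node: f"# {text}\n\n",
    "h2": lambda text, node: f"## {text}\n\n",
    "p": lambda text, node: f"{text}\n\n",
    "strong": lambda text, node: f"**{text}**",
    "em": lambda text, node: f"*{text}*",
    "a": lambda text, node: f"[{text}]({node.get('href', '')})",
    "code": lambda text, node: f"`{text}`",
    "img": lambda text, node: f"![{node.get('alt', '')}]({node.get('src', '')})",
}

def emit(node):
    # Depth-first: convert the children, then apply this node's rule to the result.
    parts = []
    for child in node.children:
        if isinstance(child, NavigableString):
            parts.append(str(child))
        elif isinstance(child, Tag):
            parts.append(emit(child))
    text = "".join(parts)
    rule = RULES.get(node.name)
    return rule(text, node) if rule else text

print(emit(BeautifulSoup('<p>See <a href="/docs">the <strong>docs</strong></a></p>',
                         "html.parser")))
# See [the **docs**](/docs)
```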
The hard cases
Nested lists
HTML allows arbitrary nesting; Markdown lists rely on indentation. The walker has to track current depth and emit the right number of spaces (typically 2 per level for unordered, 3 for ordered). Mixed nesting (a <ul> inside an <ol> inside another <ul>) requires careful state. Many naive converters lose nesting and flatten everything to one level.
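A sketch of that state tracking: each recursion level indents by the width of the parent's marker plus the space after it, which is what keeps nested content aligned under its item.

```python
from bs4 import BeautifulSoup

def convert_list(list_tag, indent=0):
    ordered = list_tag.name == "ol"
    lines = []
    for i, li in enumerate(list_tag.find_all("li", recursive=False), start=1):
        marker = f"{i}." if ordered else "-"
        # Text belonging directly to this <li>, excluding any nested list.
        own_text = "".join(li.find_all(string=True, recursive=False)).strip()
        lines.append(f"{' ' * indent}{marker} {own_text}")
        for sub in li.find_all(["ul", "ol"], recursive=False):
            lines.append(convert_list(sub, indent + len(marker) + 1))
    return "\n".join(lines)

html = "<ul><li>outer<ol><li>first inner</li><li>second inner</li></ol></li></ul>"
print(convert_list(BeautifulSoup(html, "html.parser").ul))
# - outer
#   1. first inner
#   2. second inner
```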
Tables
HTML tables are arbitrarily complex (rowspan, colspan, nested tables, header rows in <tfoot>). Markdown tables are a single grid with optional alignment. Conversion is lossy:
- Rowspan/colspan: no Markdown equivalent. Quality converters duplicate cell content; naive ones drop or merge cells.
- Nested tables: no equivalent. Quality converters flatten with awkward formatting; naive ones produce broken output.
- Multi-row headers: no equivalent. Quality converters concatenate header rows with separators; naive ones pick one and drop the rest.
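The duplicate-content strategy for rowspan/colspan is usually implemented by filling a virtual grid cell by cell: a spanning cell claims every position it covers, so the resulting Markdown table stays rectangular. A sketch (nested tables are ignored here):

```python
from bs4 import BeautifulSoup

def table_to_grid(table):
    grid = {}  # (row, col) -> cell text
    for r, tr in enumerate(table.find_all("tr")):
        c = 0
        for cell in tr.find_all(["td", "th"], recursive=False):
            while (r, c) in grid:          # skip columns claimed by a rowspan above
                c += 1
            text = cell.get_text(" ", strip=True)
            for dr in range(int(cell.get("rowspan", 1))):
                for dc in range(int(cell.get("colspan", 1))):
                    grid[(r + dr, c + dc)] = text   # duplicate into every covered slot
            c += int(cell.get("colspan", 1))
    rows = max(r for r, _ in grid) + 1
    cols = max(c for _, c in grid) + 1
    return [[grid.get((r, c), "") for c in range(cols)] for r in range(rows)]
```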
Code blocks
Detecting code is straightforward when the source uses <pre><code>. Detecting the language is harder. The information might be in:
- A `class="language-python"` attribute (CommonMark convention)
- A `data-lang` attribute (some highlighters)
- A wrapping `<div class="highlight-python">`
- Sibling elements (a label above the block)
- Nowhere (the page relied on JS-driven highlighting at load time)
Quality converters check all of these in priority order. Naive converters emit fenced blocks with no language tag, which downstream tools (and LLMs) then guess at.
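A sketch of that priority order, assuming BeautifulSoup tags; the class-name patterns cover the conventions listed above, though real pages invent more:

```python
import re

def detect_language(code_tag):
    # 1. class="language-python" / class="lang-python" on <code> or its <pre> parent
    for el in (code_tag, code_tag.parent):
        for cls in (el.get("class", []) if el else []):
            match = re.match(r"(?:language|lang)-(\w+)", cls)
            if match:
                return match.group(1)
    # 2. data-lang attribute used by some highlighters
    if code_tag.get("data-lang"):
        return code_tag["data-lang"]
    # 3. a wrapping <div class="highlight-python">
    wrapper = code_tag.find_parent("div", class_=re.compile(r"highlight-\w+"))
    if wrapper:
        return re.search(r"highlight-(\w+)", " ".join(wrapper["class"])).group(1)
    return ""  # no hint found: emit a bare fence and let downstream tools guess
```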
Inline styles
HTML allows `<span style="font-weight: bold">X</span>` as a presentational stand-in for `<strong>`. Markdown has no class/style mechanism — the converter has to either translate the style to `**X**` or drop it. Same for `<span style="font-style: italic">`, color, font-family, custom alignment. Quality converters parse the style attribute and translate where there's a Markdown equivalent. Naive converters drop everything.
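A sketch of the translate-or-drop decision for a styled span; only declarations with a Markdown equivalent survive, everything else is discarded while the text is kept:

```python
def convert_styled_span(node, text):
    style = node.get("style", "") or ""
    decls = {}
    for decl in style.split(";"):
        if ":" in decl:
            prop, value = decl.split(":", 1)
            decls[prop.strip().lower()] = value.strip().lower()
    if decls.get("font-weight") in ("bold", "bolder", "700", "800", "900"):
        return f"**{text}**"
    if decls.get("font-style") == "italic":
        return f"*{text}*"
    return text  # no Markdown equivalent: keep the text, drop the styling
```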
Whitespace
HTML collapses whitespace between text nodes; Markdown is whitespace-sensitive (blank lines separate blocks, two trailing spaces force a line break). The walker has to track whitespace deliberately — strip leading/trailing space inside inline elements but preserve it between block elements; collapse runs of whitespace into single spaces but preserve the blank lines that separate blocks. Easy to get subtly wrong.
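A sketch of the inline side of that normalization, mirroring what an HTML renderer does before text reaches the page:

```python
import re

def normalize_inline_text(raw: str) -> str:
    # Collapse runs of whitespace (spaces, tabs, newlines inside a paragraph)
    # into single spaces, the way an HTML renderer would before display.
    return re.sub(r"\s+", " ", raw)

# Block-level separation is the walker's job, not the text node's: each block
# rule appends "\n\n" after its content (see the dispatch sketch above).
```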
Inline HTML inside Markdown
Markdown allows raw HTML to pass through. Quality converters use this for elements with no Markdown equivalent (<sub>, <sup>, <mark>, <kbd>). Naive converters drop these. The trade-off: passing HTML through preserves semantics but produces output that isn't pure Markdown — downstream tools that strip HTML will lose the content.
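A sketch of that fallback as a hypothetical catch-all rule in a walker like the one sketched earlier:

```python
PASSTHROUGH = {"sub", "sup", "mark", "kbd"}

def emit_unknown(node, text):
    # text = the already-converted children of this node
    if node.name in PASSTHROUGH:
        return str(node)   # keep the raw HTML; most Markdown renderers pass it through
    return text            # no equivalent worth preserving: drop the tag, keep the text
```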
Stage 4: post-processing
The walker emits a stream of Markdown that's syntactically correct but visually awkward — blank lines in the wrong places, list indentation inconsistencies, trailing whitespace on lines. Post-processing fixes:
- Collapse three or more consecutive blank lines to two
- Strip trailing whitespace on lines (unless it's an intentional two-space line break)
- Ensure code fences are surrounded by blank lines
- Fix list indentation if the source had irregular spacing
- Normalize line endings (LF, not CRLF)
- Optionally enforce a Markdown dialect (GFM tables, CommonMark headings, etc.)
Skipping post-processing produces output that's technically valid Markdown but renders inconsistently across viewers. Quality converters always do this stage.
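A sketch of a few of those passes. Note that a real implementation has to skip fenced code blocks (blank lines inside a fence are content) and decide whether to preserve intentional two-space line breaks, which this version does not:

```python
import re

def post_process(markdown: str) -> str:
    markdown = markdown.replace("\r\n", "\n")                # CRLF -> LF
    markdown = re.sub(r"[ \t]+$", "", markdown, flags=re.M)  # trailing whitespace per line
    markdown = re.sub(r"\n{3,}", "\n\n", markdown)           # 3+ newlines -> one blank line
    return markdown.strip("\n") + "\n"                       # exactly one trailing newline
```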
Why naive html2text falls short on modern sites
The classic html2text Python library implements stages 1, 3, and most of 4. It does not implement stage 2 (cleaning and content extraction). When you feed it raw HTML from a modern site, the output contains:
- Site navigation menu as a long list at the top
- Sidebar widgets and ad slots inline with content
- Footer copyright and link farm at the bottom
- The actual article body buried somewhere in the middle
The library is doing exactly what it says — converting HTML to Markdown. The problem is that "the HTML" of a modern web page is mostly chrome. To get usable output you have to pre-clean (run Readability, hand-pick a CSS selector, or use a content-extraction service) before passing to html2text.
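The usual local workaround is to chain the two: run Readability for stage 2, then hand the pruned fragment to html2text for stages 3 and 4. A sketch (the URL is a placeholder):

```python
import html2text                   # pip install html2text
import requests
from readability import Document   # pip install readability-lxml

raw_html = requests.get("https://example.com/article").text
article_html = Document(raw_html).summary()   # stage 2: keep only the article body

converter = html2text.HTML2Text()
converter.body_width = 0                      # don't hard-wrap output lines
markdown = converter.handle(article_html)     # stages 3-4 on the cleaned fragment
```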
Hosted converters (MDisBetter, Firecrawl, Jina Reader) all bundle stage 2 as part of the service. That's why they produce cleaner output without configuration. See our 8-tool benchmark for how each stacks up.
JavaScript-rendered pages add a stage 0
For SPAs and JS-heavy sites, none of the above runs against meaningful HTML — the source HTML is a near-empty shell that loads content via JS after page load. Before any of the four stages can run, you need a stage 0: render the JS, wait for content to appear, then capture the post-render HTML.
This requires a headless browser (Playwright, Puppeteer) or a service that runs one for you. Local libraries like html2text don't include this; hosted services do. We cover the technical specifics in handling JavaScript-rendered pages.
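A sketch of stage 0 with Playwright's sync API; the selector that signals "content has appeared" is site-specific and a placeholder here:

```python
from playwright.sync_api import sync_playwright   # pip install playwright

def fetch_rendered_html(url: str) -> str:
    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page()
        page.goto(url, wait_until="networkidle")  # let JS-driven requests settle
        page.wait_for_selector("article")         # placeholder: whatever marks real content
        html = page.content()                     # serialize the post-render DOM
        browser.close()
    return html
```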
What a quality converter does differently
Compared to naive HTML-to-Markdown:
- Renders JS first (when needed) so the DOM contains actual content
- Runs content extraction (Readability or LLM-based) to identify the article body
- Strips known noise patterns (cookie banners, newsletter modals, related-stories cards)
- Walks the cleaned subtree with rules for every common element type
- Handles edge cases (nested lists, tables with rowspan, inline styles, code-block language detection)
- Post-processes the output for consistent formatting
None of this is novel — every quality converter does roughly the same thing. The differentiator is execution: how good are the heuristics, how broad are the element-handling rules, how deep does the JS rendering go. Real-world quality varies more than you'd expect from "everyone implements the same algorithm."
For practical use, see our URL-to-Markdown web tool for ad-hoc conversions, or pair Trafilatura with the chunking patterns in scrape a website to Markdown for RAG for embedding pipelines.