How to Extract Just the Article from a Messy Webpage
The web is built around articles, but webpages are not built to give you the article cleanly. The article is wrapped in navigation, sidebars, ads, recommendation widgets, cookie banners, and three different newsletter modals. Pulling just the article — the actual prose, headings, and structure that you came for — is a surprisingly hard problem with a long history of clever solutions. Here's a tour of the main options in 2026, when each one wins, when each one breaks, and where AI-powered extraction has changed the calculus.
The "main content" extraction problem
The technical challenge is called main-content extraction or boilerplate removal. Given an HTML page that contains ~10-20% article and ~80-90% chrome (navigation, ads, widgets, scripts, footer), produce just the article. It sounds simple. In practice, every site is built differently, the article isn't always in the obvious place, and the heuristics that work on one site fail on another.
The problem has been worked on for decades — academic papers on "web page segmentation" date back to the early 2000s. The dominant approaches have evolved through three generations:
- Heuristic / DOM-based. Look at the HTML structure: text density per node, link-to-text ratio, common selectors (`<article>`, `<main>`, `itemprop="articleBody"`). Pick the dominant block of prose. Mozilla Readability is the canonical example.
- Statistical / machine-learning. Train a model on labeled examples to classify each block as content or boilerplate. Trafilatura uses a hybrid of heuristics plus ML.
- LLM-powered. Use a language model to actually understand what's article and what isn't. Slower and more expensive per page, but handles edge cases that heuristics break on.
Tool 1: Mozilla Readability — the gold standard heuristic
Mozilla Readability is the open-source library that powers Firefox's Reader View. It's well-engineered, well-maintained, and works remarkably well on the long tail of typical article-shaped pages. Readability-derived extraction also underpins Safari Reader and, in part, Edge's Immersive Reader, along with dozens of read-later apps and browser extensions.
How it works: walks the DOM, scores each candidate node by a heuristic combining text density, paragraph count, link-to-text ratio, and common positive/negative class names (`article`, `content`, `main` score positive; `comment`, `sidebar`, `footer` negative), then picks the highest-scoring subtree. The result is a clean DOM fragment containing just the article.
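To make the scoring idea concrete, here's a toy Python sketch in the same spirit (illustrative only, not Readability's actual code; the real library combines many more signals and years of tuning):

```python
# Toy Readability-style scorer: lots of prose is good, lots of links is bad,
# and class/id names nudge the score up or down.
from lxml import html

POSITIVE_HINTS = ("article", "content", "main", "post")
NEGATIVE_HINTS = ("comment", "sidebar", "footer", "nav", "promo")

def score_node(node):
    text = node.text_content()
    if len(text) < 140:                       # too short to be the article
        return 0.0
    link_text = sum(len(a.text_content()) for a in node.findall(".//a"))
    link_density = link_text / len(text)      # share of the text inside links
    score = len(text) * (1.0 - link_density)
    hints = ((node.get("class") or "") + " " + (node.get("id") or "")).lower()
    if any(h in hints for h in POSITIVE_HINTS):
        score *= 1.5                          # class/id looks article-like
    if any(h in hints for h in NEGATIVE_HINTS):
        score *= 0.5                          # class/id looks like page chrome
    return score

def extract_main(html_source):
    tree = html.fromstring(html_source)
    candidates = tree.xpath("//article | //main | //section | //div")
    return max(candidates, key=score_node, default=tree)
```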
Strengths: battle-tested on millions of pages, fast (runs in-browser), free and open source, good defaults.
Limitations: heuristic-based, so it fails on pages that don't fit the assumed shape. Multi-section pages (where the heuristic picks one section), highly customized layouts, dashboards, and JavaScript-heavy SPAs that haven't fully rendered all give Readability trouble. The output is HTML, not Markdown — you need a second step to convert if Markdown is your target.
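That second step is small in practice. Here's a minimal sketch using the Python port readability-lxml plus markdownify (both package choices are assumptions on my part, not part of Mozilla's library; in Node, @mozilla/readability pairs the same way with a converter like Turndown):

```python
# Extract with a Readability-style scorer, then convert the HTML to Markdown.
import requests
from readability import Document
from markdownify import markdownify

html_source = requests.get("https://example.com/some-article", timeout=30).text
doc = Document(html_source)
article_html = doc.summary()     # clean HTML fragment containing just the article
markdown = markdownify(article_html, heading_style="ATX")

print(doc.title())
print(markdown)
```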
When to use: in-browser extension or a self-hosted Node.js script processing typical news/blog pages. If your use case fits Readability's assumptions, it's free, fast, and excellent.
Tool 2: Trafilatura — the production-grade extractor
Trafilatura is a Python library specifically built for high-volume web text extraction in research and ML pipelines. It's used by computational linguists building corpora, by news aggregators, and by anyone who needs to extract article text from millions of URLs reliably.
How it works: hybrid pipeline that tries multiple extraction strategies (Readability-style, jusText, custom rules), picks the best output by quality heuristics, and outputs in your choice of plain text, Markdown, XML-TEI, or JSON. It can handle JavaScript-rendered pages if you render them first with Playwright or Selenium and feed it the resulting HTML.
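In code, the common path is two calls (Markdown output assumes a reasonably recent Trafilatura release):

```python
import trafilatura

# fetch_url handles the HTTP request; extract runs the multi-strategy pipeline.
downloaded = trafilatura.fetch_url("https://example.com/some-article")
if downloaded:
    result = trafilatura.extract(
        downloaded,
        output_format="markdown",   # also: "txt", "xml", "json"
        include_links=True,
    )
    print(result)
```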
Strengths: excellent extraction quality, multiple output formats including Markdown, supports batch processing, designed for high volume, well-maintained, free and open source. Handles edge cases that Readability misses (multilingual content, atypical layouts, pages with mixed content).
Limitations: Python only — if you're not in a Python environment, integration is awkward. Like all heuristic-based extractors, occasionally fails on highly atypical pages. Per-page latency is higher than Readability because of the multi-strategy approach.
When to use: Python-based research or production pipelines processing many URLs. If you're building an automated system and Python is acceptable, Trafilatura is probably the right answer.
Tool 3: Reader Mode in browsers
Safari Reader, Firefox Reader View, and Edge Immersive Reader are all UI wrappers around Readability-style extraction (with some custom tuning per browser). They give you a one-click clean view of the page in the same browser tab.
Strengths: zero install, zero config, integrated into the browsing experience. Excellent for ad-hoc reading.
Limitations: the output stays inside the browser. To export, you copy-paste (which loses structure) or use the browser's print-to-PDF (which gives a clean PDF but not editable text). Same heuristic-based failure modes as Readability — many pages don't trigger Reader Mode at all.
When to use: you want to read a single article distraction-free right now and don't need to export. Don't use when you need a portable archive or AI-ready output.
Tool 4: AI-powered extraction (mdisbetter's approach)
The newest generation of extractors uses an LLM-based step to actually read the page and identify what is and isn't article content. This is slower per page than heuristics, but the quality on edge cases is significantly better. mdisbetter's URL to Markdown converter uses this hybrid approach: a fast heuristic pass for typical pages, and an LLM-augmented pass for pages where heuristics fail or where the structure is genuinely ambiguous.
What this fixes that heuristics miss:
- JavaScript-heavy SPAs where the article is rendered by client-side JS into a generic `<div>` with no semantic markup. Heuristic extractors see the empty pre-render and give up; the LLM-augmented pipeline triggers a headless browser, waits for hydration, then extracts (the render-then-extract pattern is sketched after this list).
- Atypical layouts like multi-column scientific journals, dashboard-style how-to pages with sidebars containing legitimate content, and pages where the article shares the page with substantive related content.
- Heading recovery. When a page's HTML has lost its semantic heading hierarchy (everything is a styled `<div>`), an LLM can often correctly reconstruct the headings from visual cues like font weight, size, and position. Pure heuristic extractors can't do this.
- Mixed content pages where the "article" is actually a curated collection (a Reddit-style thread, a documentation page with multiple subsections, an aggregated review). The LLM can preserve the structure of the collection rather than picking one block.
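Here's the render-then-extract pattern in outline, with Playwright and Trafilatura standing in for the hosted pipeline (a sketch of the general pattern, not mdisbetter's actual implementation):

```python
# Render a JS-heavy page in a headless browser, then extract from the
# hydrated DOM instead of the empty server-side shell.
import trafilatura
from playwright.sync_api import sync_playwright

def extract_spa(url: str) -> str | None:
    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page()
        page.goto(url, wait_until="networkidle")  # wait for client-side hydration
        rendered = page.content()                 # full post-render HTML
        browser.close()
    return trafilatura.extract(rendered, output_format="markdown")
```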
How to use:
- Open /convert/url-to-markdown.
- Paste the URL.
- Hit convert.
- Download the resulting Markdown file or copy the text.
Limitations: per-page latency is a few seconds (vs sub-second for pure heuristics). If you're processing millions of pages on your own infrastructure, you may want to run Trafilatura locally instead of sending every page through a hosted service.
When to use: ad-hoc article extraction, edge-case pages where Readability or Trafilatura return empty/wrong output, or when you specifically need clean Markdown output without writing a post-processing step. mdisbetter is the convenience-first choice; the OSS tools are the build-it-yourself choice.
When AI extraction beats heuristic extraction
The honest answer is: most of the time, on common article shapes, modern heuristic tools (Readability, Trafilatura) are essentially as good as AI extraction. They're free, fast, and reliable. If you're building a pipeline at scale, those should be your default.
AI extraction wins clearly in these cases:
- JavaScript-heavy pages where the heuristic tools can't even see the rendered content without a headless browser setup.
- Pages with atypical structure — dashboards, multi-column scientific layouts, pages where the "article" is actually a collection.
- Pages where the HTML has lost its semantic hierarchy and headings need to be reconstructed from visual cues.
- One-off ad-hoc use where the cost of installing and configuring a heuristic tool isn't worth it for the few pages you actually need to extract.
The decision isn't "AI vs heuristic, which is better?" — it's "for my specific use case, which is the better tool?" For a one-off webpage you want as clean Markdown right now, the web tool wins on convenience. For a high-volume Python pipeline, Trafilatura wins on cost and control. For an in-browser "clean this up so I can read it," Reader Mode wins on speed.
Walkthrough: ad-hoc extraction with mdisbetter
- Find the URL of the messy webpage you want to extract.
- Open /convert/url-to-markdown in any browser.
- Paste the URL into the input field.
- Click Convert. Wait 2-5 seconds.
- Inspect the preview — the Markdown output should contain just the article body, with headings preserved.
- Download the `.md` file or copy the Markdown to your clipboard.
For LLM-pipeline-focused workflows, see URL to Markdown for LLM. For LangChain integrations specifically, see URL to Markdown for LangChain.
What about the same problem on PDFs?
The article-extraction problem on PDFs is different but related — instead of stripping HTML chrome you're recovering structure from positioned glyphs. The tooling and tradeoffs are different but the principle (heuristic vs ML vs LLM) is the same. See PDF to Markdown for the document-side equivalent, and why PDF wastes your AI tokens for context on why structure recovery matters.
Building it yourself if you want to
If you want to roll your own extractor for a specific site or pipeline:
- Start with Readability.js if you're in a JS/Node environment. Tiny dependency, well-documented, good defaults.
- Start with Trafilatura if you're in Python. Better quality on the edge cases, native Markdown output.
- Add a headless browser (Playwright or Puppeteer) if you're processing JS-heavy pages. Both libraries integrate with Readability or Trafilatura naturally — render the page, then hand the rendered DOM to the extractor.
- Add an LLM post-processing step for stubborn pages where heuristic output is wrong. This adds cost but rescues the long tail of edge cases (a minimal sketch follows this list).
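For that last step, the fallback logic can be as simple as "trust the heuristic unless its output looks broken." A minimal sketch assuming the OpenAI Python client, an illustrative model name, and a made-up plausibility threshold (any chat-completion API slots in the same way):

```python
# Heuristic first, LLM rescue second. The 500-character threshold and the
# model name are illustrative assumptions, not recommendations.
import trafilatura
from openai import OpenAI

client = OpenAI()

def extract_with_fallback(html_source: str) -> str | None:
    result = trafilatura.extract(html_source, output_format="markdown")
    if result and len(result) > 500:        # heuristic output looks plausible
        return result
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{
            "role": "user",
            "content": "Extract only the main article from this HTML as "
                       "Markdown, preserving the heading structure:\n\n"
                       + html_source[:100_000],   # stay under the context limit
        }],
    )
    return response.choices[0].message.content
```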
For the convenience-first path that handles most of the above out of the box, just use the web tool — that's what it's for. For background on why the format choice matters at all, see how to save a webpage so AI can actually read it and HTML is killing your LLM token budget.
The honest summary
Article extraction is a 25-year-old problem with mature solutions. For typical pages, Mozilla Readability and Trafilatura are excellent free options — use them when you have the engineering bandwidth and a Python or Node environment. For ad-hoc extraction, edge-case pages, or when you specifically need clean Markdown output without integration work, mdisbetter's AI-augmented converter is the convenience win. There is no single best tool — there's the best tool for your use case, and now you know roughly which one that is.
One workflow note
If you find yourself doing article extraction more than a few times a week, build the muscle memory: pick one tool, learn its quirks, stop second-guessing. Constantly switching between Reader Mode, copy-paste, and assorted browser extensions wastes more time than just settling on a default. For most people that default is either "open the web tool, paste the URL" or "run the Python script." Both are fine. Pick one and use it.