MDisBetter · 10 min read

Content Extraction: Readability vs Trafilatura vs AI-Powered

The main-content extraction problem sounds trivial: given a webpage, return just the article body without the navigation, sidebar, ads, footer, cookie banner, related-stories cards, and newsletter modal. In practice it is one of the messiest problems in web tooling. Three families of solutions dominate today — Mozilla Readability (heuristic, browser-derived), Trafilatura (Python, news/blog optimized), and AI-powered extraction (LLM-driven). Each makes different trade-offs around accuracy, speed, cost, and the long tail of weird real-world pages. Here is the engineering comparison, with code, and an honest take on when to reach for which.

Why "just give me the article body" is hard

A modern webpage is mostly chrome. Look at the rendered DOM of any major news article — the actual article body is typically 10-20% of the total node count. The remaining 80-90% is navigation, sidebar widgets, ad slots, social-share rails, comment sections, related-stories carousels, newsletter modals, footer link farms, and tracking pixels. Convert all of it and you get noise. Convert only the article body and you get content.

The challenge is that there is no universal way to identify which subtree is the article body. Sites use different markup conventions: some wrap content in <article>, some in <main>, some in <div id="story-body">, some in nested generic divs with no semantic hint at all. Class names are no better — article-content, post-body, entry-text, story, content-main all coexist across the web with no agreed convention.
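To see the inconsistency concretely, here is a toy probe (standard library only; the hint list is a small illustrative sample, not a real-world inventory) that reports which of these container conventions a page happens to use:

```python
from html.parser import HTMLParser

# A small, illustrative sample of container hints; real pages use many more.
CONTAINER_HINTS = {'article-content', 'post-body', 'entry-text', 'story',
                   'story-body', 'content-main'}

class ContainerProbe(HTMLParser):
    """Records which common content-container signals a page uses."""
    def __init__(self):
        super().__init__()
        self.signals = set()

    def handle_starttag(self, tag, attrs):
        if tag in ('article', 'main'):
            self.signals.add(f'<{tag}>')
        attr_map = dict(attrs)
        tokens = set((attr_map.get('class') or '').split())
        tokens.add(attr_map.get('id') or '')
        self.signals |= tokens & CONTAINER_HINTS

probe = ContainerProbe()
probe.feed('<main><div class="post-body"><p>Hello</p></div></main>')
print(probe.signals)  # the conventions this particular page uses
```

Run it across a few dozen sites and the output rarely agrees twice, which is exactly why pure selector-based extraction does not generalize.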

The three families of solutions take fundamentally different approaches to inferring the answer.

Family 1: Mozilla Readability

Readability is the algorithm that powers Firefox Reader Mode and Pocket. It is open source, battle-tested across hundreds of millions of pages, and shipped in production browsers — which means its edge cases have been pounded on by real-world traffic for over a decade.

The algorithm at a high level:

  1. Walk every <p> in the document.
  2. Score each paragraph based on its text length, comma count, and the class/ID names of its ancestor elements (anti-patterns like comment, sidebar, ad reduce score; signals like article, content, post increase it).
  3. Propagate scores up the DOM tree to ancestor elements.
  4. The highest-scoring subtree is the article body.
  5. Strip remaining noise (links with no surrounding text, empty divs, social widgets) from the chosen subtree.
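A toy sketch of the scoring idea in Python (an illustration of the density heuristic, not Mozilla's actual implementation; the weights and patterns are invented):

```python
import re

# Illustrative class/id patterns: penalize chrome, reward content-ish names.
NEGATIVE = re.compile(r'comment|sidebar|ad|footer|promo')
POSITIVE = re.compile(r'article|content|post|story|body')

def score_paragraph(text: str, ancestor_names: list) -> float:
    """Score one <p> by text density and its ancestors' class/id names."""
    score = min(len(text) / 100, 3)   # reward longer text, capped
    score += text.count(',')          # commas correlate with real prose
    for name in ancestor_names:
        if NEGATIVE.search(name):
            score -= 25
        if POSITIVE.search(name):
            score += 25
    return score

# A prose-like paragraph inside a content div outscores a sidebar blurb;
# the ancestor subtree accumulating the most paragraph score wins.
print(score_paragraph('Long sentence, with commas, and substance. ' * 3,
                      ['div.article-content']))
print(score_paragraph('Subscribe now!', ['div.sidebar-promo']))
```

The real algorithm layers many more signals (link density, tag weights, sibling absorption) on top of this skeleton, but the core intuition is the same.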

Strengths:

  - Battle-tested: the same algorithm that ships in Firefox Reader Mode and Pocket, hardened by real-world traffic for over a decade.
  - Free, fast, and fully local: no API calls, no per-page cost, deterministic output.
  - Returns structured results in one pass: title, byline, cleaned HTML, and plain text.

Limitations:

  - Needs a DOM to operate on, so server-side use requires JSDOM or a port.
  - The density heuristic assumes one dominant prose block; forum threads, e-commerce pages, and SaaS landing pages score poorly.
  - Cannot see JavaScript-rendered content; an SPA's empty shell extracts nothing without a rendering step.

Calling Readability from Node.js with JSDOM:

import { Readability } from '@mozilla/readability';
import { JSDOM } from 'jsdom';

const dom = new JSDOM(html, { url });
const reader = new Readability(dom.window.document);
const article = reader.parse();  // returns null when no article body is found

console.log(article.title);
console.log(article.byline);
console.log(article.content);  // cleaned HTML
console.log(article.textContent);  // plain text

Family 2: Trafilatura

Trafilatura is a Python library specifically optimized for news, blog, and article-style content. It is fast, has minimal dependencies, and shines for batch local processing — converting hundreds or thousands of pages in a pipeline without per-page API costs.

The approach combines several extraction strategies and picks the best result:

  1. Try precision-oriented selectors first, looking for <article>, common content classes, and OpenGraph hints.
  2. Fall back to a recall-oriented heuristic similar to Readability for pages without semantic markers.
  3. Compare candidate extractions (its own plus fallback algorithms such as jusText and readability-lxml) on ambiguous pages and keep the strongest result.
  4. Strip boilerplate (nav, footer, comments, related links) using a curated rule set.
  5. Optionally output structured metadata (title, author, date, language, categories).

Strengths:

  - Fast, dependency-light, and free to run locally, which makes it the natural choice for batch pipelines.
  - Multiple output formats (plain text, Markdown, XML, JSON) plus structured metadata: title, author, date, language, categories.
  - Returns None rather than partial garbage when extraction fails, which simplifies downstream error handling.

Limitations:

  - Works on raw HTML only; JavaScript-rendered pages come back as empty shells unless rendered first (e.g. with Playwright).
  - Tuned for news and blog-style articles; forum threads and e-commerce pages sit outside its sweet spot.
  - Python-only, so it does not drop into Node.js pipelines the way Readability does.

Basic usage:

import trafilatura

downloaded = trafilatura.fetch_url('https://example.com/article')
text = trafilatura.extract(
    downloaded,
    output_format='markdown',
    include_comments=False,
    include_tables=True,
    with_metadata=True,
)
print(text)

For batch processing, a solid pattern is a thread pool that writes each result to disk; add per-domain throttling if you will hit the same host repeatedly:

import trafilatura
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path
import hashlib

def process_url(url: str, out_dir: Path):
    downloaded = trafilatura.fetch_url(url)
    if downloaded is None:
        return None
    md = trafilatura.extract(downloaded, output_format='markdown', with_metadata=True)
    if md is None:
        return None
    fname = hashlib.sha1(url.encode()).hexdigest()[:12] + '.md'
    (out_dir / fname).write_text(md, encoding='utf-8')
    return fname

urls = Path('urls.txt').read_text().splitlines()
out = Path('extracted'); out.mkdir(exist_ok=True)

with ThreadPoolExecutor(max_workers=8) as pool:
    results = list(pool.map(lambda u: process_url(u, out), urls))

print(f"Extracted {sum(1 for r in results if r)} of {len(urls)} pages")

Eight workers, polite throttling at the source domain, ~50-200 pages per minute on a laptop. This is the right tool for "I need to convert 5,000 blog posts to Markdown for a RAG corpus."
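The per-domain throttling is not shown in the pool code above; a minimal sketch (the one-second default gap is an arbitrary choice, tune it per site):

```python
import threading
import time
from urllib.parse import urlsplit

class DomainThrottle:
    """Enforces a minimum gap between requests to the same host."""
    def __init__(self, min_gap=1.0):   # 1s default gap is an assumption
        self.min_gap = min_gap
        self.next_slot = {}            # host -> earliest allowed request time
        self.lock = threading.Lock()

    def wait(self, url):
        host = urlsplit(url).netloc
        with self.lock:
            now = time.monotonic()
            # Reserve the next slot for this host, then sleep outside the lock.
            slot = max(now, self.next_slot.get(host, now))
            self.next_slot[host] = slot + self.min_gap
        delay = slot - time.monotonic()
        if delay > 0:
            time.sleep(delay)

throttle = DomainThrottle(min_gap=1.0)
# Inside process_url: call throttle.wait(url) before trafilatura.fetch_url(url).
```

Because slots are reserved under the lock, the throttle stays correct even when all eight workers hit the same domain at once.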

Family 3: AI-powered extraction

The newest family. Pass the page (or a Markdown rendering of its DOM) to an LLM and ask it to identify the main content. The LLM uses semantic understanding rather than DOM heuristics — which means it handles cases that defeat both Readability and Trafilatura.

Cases where AI extraction wins:

  - Forum threads, where the content is many small posts rather than one dense prose block.
  - Heavily structured documentation with nested navigation, code samples, and callouts.
  - Unusual SPAs and layouts where the article body is interleaved with non-content elements in ways that defeat density heuristics.
  - Multilingual edge cases that trip language-sensitive heuristics.

Trade-offs:

  - Per-page cost and added latency versus effectively free local extraction.
  - Non-deterministic output: the same page can extract slightly differently across runs.
  - No measurable accuracy gain on conventional news and blog content, where the heuristic libraries are already excellent.

The mdisbetter.com URL-to-Markdown converter uses an AI-powered approach for one-off conversions, which is well-matched to the use case: humans converting individual unusual pages where the long tail matters and the per-page cost does not. For batch pipelines (the Trafilatura sweet spot), running OSS locally is the right tool — see Building a Web Knowledge Base for AI for the architecture.
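For programmatic use against a frontier-model API, the shape is roughly this (a hedged sketch: the prompt wording, truncation limit, model name, and OpenAI client usage are illustrative assumptions, not any particular product's pipeline):

```python
# Prompt wording and the 100k-char truncation limit are assumptions.
EXTRACTION_PROMPT = (
    "Below is the raw HTML of a webpage. Return only the main article body "
    "as Markdown. Omit navigation, ads, footers, cookie banners, and "
    "related-story links.\n\nHTML:\n{html}"
)

def build_messages(html, max_chars=100_000):
    # Truncate oversized DOMs so the request stays inside the context window.
    return [{"role": "user",
             "content": EXTRACTION_PROMPT.format(html=html[:max_chars])}]

# The actual call (requires an API key; shown for shape only):
# from openai import OpenAI
# client = OpenAI()
# resp = client.chat.completions.create(model="gpt-4o-mini",
#                                       messages=build_messages(page_html))
# markdown = resp.choices[0].message.content
```

The per-page cost of a call like this is what makes AI extraction a poor fit for bulk pipelines and a good fit for the long tail.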

Decision framework

The simple version, in two questions.

Question 1: how many pages? For one page or a handful of one-offs, use an AI-powered web tool and skip the engineering entirely. For hundreds to thousands in a pipeline, run Trafilatura (or Readability) locally, where the marginal page costs nothing.

Question 2: what kind of pages? Conventional news and blog articles: the heuristic libraries are excellent. Long-tail pages such as forums, unusual SPAs, and heavily structured docs: budget for AI extraction. JavaScript-rendered pages: render first, then extract.
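The two questions collapse into a small routing function; a sketch with invented category names:

```python
def choose_stack(n_pages, js_rendered, long_tail):
    """Route a job to a toolchain; thresholds and names are illustrative."""
    stack = []
    if js_rendered:
        stack.append('playwright')        # render before extraction
    if n_pages <= 10 or long_tail:
        stack.append('ai-extraction')     # one-offs and weird pages
    else:
        stack.append('trafilatura')       # batch runs of conventional pages
    return ' -> '.join(stack)

print(choose_stack(5000, js_rendered=False, long_tail=False))  # trafilatura
print(choose_stack(1, js_rendered=True, long_tail=True))       # playwright -> ai-extraction
```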

For the JS-rendering step, Playwright in headless mode is the modern default:

from playwright.sync_api import sync_playwright
import trafilatura

url = 'https://example.com/article'

with sync_playwright() as p:
    browser = p.chromium.launch()
    page = browser.new_page()
    page.goto(url, wait_until='networkidle')
    html = page.content()
    browser.close()

markdown = trafilatura.extract(html, output_format='markdown')

This pattern — Playwright for rendering, Trafilatura for extraction — covers ~95% of real-world cases when the volume justifies a local pipeline.

Comparing output quality

A rough qualitative comparison from running the same 100-URL test set (mix of news, blog, docs, SaaS, forum) through each:

Genre | Readability | Trafilatura | AI-powered
------|-------------|-------------|-----------
News article | Excellent | Excellent | Excellent
Blog post | Excellent | Excellent | Excellent
Documentation | Good | Good | Excellent
SaaS landing page | Fair | Fair | Good
Forum thread | Poor | Fair | Good
E-commerce product page | Poor | Poor | Fair
JS-heavy SPA (no rendering) | Fails | Fails | Fails
JS-heavy SPA (with Playwright) | Good | Good | Excellent

The pattern: the three families converge on conventional content, and diverge on the long tail. For conventional content the cheapest tool wins; for the long tail AI extraction earns its cost.

The PDF parallel

The same problem in a different domain: PDF text extraction has its own technical stack (PyMuPDF, pdfplumber, Marker, Docling) with the same shape of trade-offs — cheap deterministic libraries for clean documents, AI-powered tools for the long tail of weird layouts. See how PDF works internally and why extraction breaks for the parallel deep-dive on the PDF side.

Practical recommendation

For application developers building a content pipeline, the canonical stack is Trafilatura + Playwright running locally — predictable cost, fast, deterministic, handles 95% of pages. Reach for AI-powered extraction (via the web tool for one-offs, or via a frontier-model API for programmatic use) when the long tail matters: forums, paywalled content, unusual SPAs, multilingual edge cases.

For end users converting one URL at a time, the web tool is the entire workflow. For developers, the technical depth above is what should drive the build-vs-buy decision. See also handling JavaScript-rendered pages for Markdown for the rendering layer specifically, and the 8-tool benchmark for empirical numbers across hosted services.

Frequently asked questions

Can I run Readability outside the browser?
Yes — the @mozilla/readability package on npm runs in any Node.js environment when paired with JSDOM as the DOM provider. The standalone library is the same algorithm Firefox Reader Mode uses, exposed for programmatic use. There is also a Python port (readability-lxml) and a Go port (go-readability), though the JavaScript original is the most actively maintained and most aligned with browser behavior.
Why does Trafilatura sometimes return None for a URL that loads fine in my browser?
Two common causes. First, the page is rendered by JavaScript and the raw HTML Trafilatura fetched is an empty shell — solve by using Playwright to render and passing the post-render HTML to extract(). Second, the page returned an HTTP error or got blocked by anti-bot middleware — check the response status and consider rotating user agents or using a residential-IP proxy if the site permits it. Trafilatura returns None rather than partial garbage, which is usually the right behavior.
Is AI-powered extraction always more accurate than the heuristic libraries?
On conventional content (news, blog posts), no — Readability and Trafilatura are excellent and the AI model adds latency and cost without measurable accuracy gain. The wins are concentrated in the long tail: forums, documentation with heavy structure, unusual layouts, multilingual edge cases, and pages where the article body is interleaved with non-content elements in ways that defeat density heuristics. Match the tool to the page type, not the other way around.