Content Extraction: Readability vs Trafilatura vs AI-Powered
The main-content extraction problem sounds trivial: given a webpage, return just the article body without the navigation, sidebar, ads, footer, cookie banner, related-stories cards, and newsletter modal. In practice it is one of the messiest problems in web tooling. Three families of solutions dominate today — Mozilla Readability (heuristic, browser-derived), Trafilatura (Python, news/blog optimized), and AI-powered extraction (LLM-driven). Each makes different trade-offs around accuracy, speed, cost, and the long tail of weird real-world pages. Here is the engineering comparison, with code, and an honest take on when to reach for which.
Why "just give me the article body" is hard
A modern webpage is mostly chrome. Look at the rendered DOM of any major news article — the actual article body is typically 10-20% of the total node count. The remaining 80-90% is navigation, sidebar widgets, ad slots, social-share rails, comment sections, related-stories carousels, newsletter modals, footer link farms, and tracking pixels. Convert all of it and you get noise. Convert only the article body and you get content.
The challenge is that there is no universal way to identify which subtree is the article body. Sites use different markup conventions: some wrap content in <article>, some in <main>, some in <div id="story-body">, some in nested generic divs with no semantic hint at all. Class names are no better — article-content, post-body, entry-text, story, content-main all coexist across the web with no agreed convention.
The three families of solutions take fundamentally different approaches to inferring the answer.
Family 1: Mozilla Readability
Readability is the algorithm that powers Firefox Reader Mode and Pocket. It is open source, battle-tested across hundreds of millions of pages, and shipped in production browsers — which means its edge cases have been pounded on by real-world traffic for over a decade.
The algorithm at a high level:
- Walk every `<p>` in the document.
- Score each paragraph based on its text length, comma count, and the class/ID names of its ancestor elements (anti-patterns like `comment`, `sidebar`, `ad` reduce the score; signals like `article`, `content`, `post` increase it).
- Propagate scores up the DOM tree to ancestor elements.
- The highest-scoring subtree is the article body.
- Strip remaining noise (links with no surrounding text, empty divs, social widgets) from the chosen subtree.
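A simplified sketch of the scoring idea, in Python with BeautifulSoup for illustration. This is not the real Readability code (which is JavaScript and runs against the live DOM), and the weights are invented, but it shows the score-paragraphs-then-roll-up shape:

```python
# Toy paragraph-density scorer: an illustration, not Readability itself.
import re
from collections import defaultdict
from bs4 import BeautifulSoup  # assumes beautifulsoup4 is installed

NEGATIVE = re.compile(r'comment|sidebar|footer|nav|promo', re.I)
POSITIVE = re.compile(r'article|content|post|story|entry|main', re.I)

def find_main_candidate(html: str):
    soup = BeautifulSoup(html, 'html.parser')
    scores = defaultdict(float)
    for p in soup.find_all('p'):
        text = p.get_text(' ', strip=True)
        if len(text) < 25:                      # skip tiny fragments
            continue
        base = 1 + text.count(',') + min(len(text) // 100, 3)
        # propagate: full score to the parent, half to the grandparent
        parent = p.parent
        grandparent = parent.parent if parent is not None else None
        for ancestor, share in ((parent, 1.0), (grandparent, 0.5)):
            if ancestor is None:
                continue
            hint = ' '.join(ancestor.get('class', [])) + ' ' + (ancestor.get('id') or '')
            adjust = 25 if POSITIVE.search(hint) else 0
            adjust -= 25 if NEGATIVE.search(hint) else 0
            scores[ancestor] += share * (base + adjust)
    return max(scores, key=scores.get) if scores else None
```

The real implementation layers on link-density penalties, sibling merging, and retries with relaxed thresholds, but the shape (score paragraphs, roll scores up the tree, pick the max) is the same.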
Strengths:
- Battle-tested: real-world coverage from Firefox Reader Mode and Pocket means edge cases have been ground out over years.
- Fast: pure JavaScript, runs in a browser tab in milliseconds. No network calls, no API costs.
- Deterministic: same input produces same output, every time.
- Well-maintained: Mozilla actively maintains the standalone library at github.com/mozilla/readability.
Limitations:
- Single-page-app blindness: Readability operates on the DOM as it exists. SPA-rendered content that loads after initial HTML requires you to render the JS first (Playwright/Puppeteer), then run Readability on the post-render DOM. Readability does not do the rendering for you.
- Listicle and gallery pages: pages that are structurally many short items (image-heavy galleries, top-10 lists with embedded media between items) confuse the paragraph-density heuristic.
- Multi-part articles: Readability picks one subtree. Pages with content split across multiple non-adjacent containers (some news sites, forum threads with original-post + replies as siblings) lose content.
- Forums and discussion threads: not the use case Readability was designed for.
Calling Readability from Node.js with JSDOM:
```js
import { Readability } from '@mozilla/readability';
import { JSDOM } from 'jsdom';

// `html` is the fetched page source, `url` its address (used to resolve relative links)
const dom = new JSDOM(html, { url });
const reader = new Readability(dom.window.document);
const article = reader.parse(); // returns null if no article-like content is found

console.log(article.title);
console.log(article.byline);
console.log(article.content);     // cleaned HTML
console.log(article.textContent); // plain text
```
Family 2: Trafilatura
Trafilatura is a Python library specifically optimized for news, blog, and article-style content. It is fast, has minimal dependencies, and shines for batch local processing — converting hundreds or thousands of pages in a pipeline without per-page API costs.
The approach combines several extraction strategies and picks the best result:
- Try precision-oriented selectors first (looks for <article>, common content classes, and OpenGraph hints).
- Fall back to a recall-oriented heuristic similar to Readability for pages without semantic markers.
- Compare candidate extractions against fallback extractors (readability-lxml, jusText) on ambiguous pages and keep the stronger result.
- Strip boilerplate (nav, footer, comments, related links) using a curated rule set.
- Optionally output structured metadata (title, author, date, language, categories).
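The precision/recall preference is exposed directly on the extraction call. A small sketch using Trafilatura's `favor_precision` and `favor_recall` options, assuming a page has already been fetched:

```python
import trafilatura

html = trafilatura.fetch_url('https://example.com/article')

# Lean toward precision: drop anything that might be boilerplate,
# at the risk of losing a borderline paragraph.
strict = trafilatura.extract(html, favor_precision=True)

# Lean toward recall: keep borderline blocks, at the risk of some noise.
loose = trafilatura.extract(html, favor_recall=True)
```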
Strengths:
- Speed at scale: ~200-500 pages per second per CPU core on modern hardware, no API latency.
- Local execution: runs on your laptop, your server, or a Lambda. No data leaves your environment.
- Output formats: native support for plain text, Markdown, XML, JSON-LD, and TEI-XML for academic corpora.
- Metadata extraction: title, author, publication date, hreflang, categories, license — extracted with the same call.
- Good for news/blog: the genre Trafilatura was designed for. Often beats Readability on long-form journalism.
Limitations:
- Same SPA blindness as Readability: Trafilatura operates on HTML it receives. JS-rendered content needs upstream rendering.
- Less robust on non-news genres: documentation pages, product pages, e-commerce listings, and forum threads are not the optimization target. Quality drops on these.
- Python-only: no first-class Node, Go, or Rust port.
Basic usage:
```python
import trafilatura

downloaded = trafilatura.fetch_url('https://example.com/article')
text = trafilatura.extract(
    downloaded,
    output_format='markdown',
    include_comments=False,
    include_tables=True,
    with_metadata=True,
)
print(text)
```
For batch processing, a workable pattern is a small thread pool that writes one Markdown file per URL:
```python
import hashlib
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path

import trafilatura

def process_url(url: str, out_dir: Path):
    downloaded = trafilatura.fetch_url(url)
    if downloaded is None:
        return None
    md = trafilatura.extract(downloaded, output_format='markdown', with_metadata=True)
    if md is None:
        return None
    # filename derived from the URL hash keeps reruns idempotent
    fname = hashlib.sha1(url.encode()).hexdigest()[:12] + '.md'
    (out_dir / fname).write_text(md, encoding='utf-8')
    return fname

urls = Path('urls.txt').read_text().splitlines()
out = Path('extracted')
out.mkdir(exist_ok=True)

with ThreadPoolExecutor(max_workers=8) as pool:
    results = list(pool.map(lambda u: process_url(u, out), urls))

print(f"Extracted {sum(1 for r in results if r)} of {len(urls)} pages")
```
Eight workers give roughly 50-200 pages per minute on a laptop; when many URLs share a host, add per-domain throttling to stay polite (a minimal sketch follows). This is the right tool for "I need to convert 5,000 blog posts to Markdown for a RAG corpus."
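A minimal per-domain throttle, offered as a sketch (the class and the one-second interval are assumptions here, not part of Trafilatura); call `throttle.wait(url)` at the top of `process_url` before fetching:

```python
# Enforce a minimum delay between requests to the same host, across threads.
import threading
import time
from urllib.parse import urlparse

class DomainThrottle:
    def __init__(self, min_interval: float = 1.0):
        self.min_interval = min_interval
        self.last_hit: dict[str, float] = {}
        self.lock = threading.Lock()

    def wait(self, url: str) -> None:
        host = urlparse(url).netloc
        while True:
            with self.lock:
                now = time.monotonic()
                elapsed = now - self.last_hit.get(host, 0.0)
                if elapsed >= self.min_interval:
                    self.last_hit[host] = now
                    return
                sleep_for = self.min_interval - elapsed
            time.sleep(sleep_for)

throttle = DomainThrottle(min_interval=1.0)
```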
Family 3: AI-powered extraction
The newest family. Pass the page (or a Markdown rendering of the DOM) to an LLM and ask it to identify the main content. The LLM uses semantic understanding rather than DOM heuristics, which means it handles cases that defeat both Readability and Trafilatura.
Cases where AI extraction wins:
- JS-heavy SPAs: when paired with a headless browser that captures the post-render DOM, the LLM operates on actual content rather than empty shells.
- Forum threads and discussion pages: multi-author, threaded content that confuses paragraph-density heuristics is straightforward for an LLM that understands what a thread looks like.
- Documentation pages: heavy structure (nested headings, code blocks, tables, callouts) is preserved with semantic accuracy. Readability often flattens these.
- Unusual layouts: pages that wrap article body in non-standard markup, or mix multiple content types on one page, are recognizable to a model that has seen the long tail of web layouts during training.
- Multilingual edge cases: Trafilatura's heuristics are language-aware but skew toward English/European; LLMs handle CJK, RTL, and rare scripts more uniformly.
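A sketch of the core pattern, assuming the OpenAI Python SDK with an API key in the environment; the model name, prompt, and truncation limit are placeholders, not what any particular tool ships:

```python
from openai import OpenAI

client = OpenAI()

PROMPT = (
    "Below is the HTML of a web page. Return only the main content "
    "(article body, thread, or documentation text) as clean Markdown. "
    "Drop navigation, ads, footers, cookie banners, and related-content widgets."
)

def extract_with_llm(html: str, model: str = "gpt-4o-mini") -> str:
    response = client.chat.completions.create(
        model=model,  # placeholder; pick whatever model fits your cost/quality target
        messages=[
            {"role": "system", "content": PROMPT},
            {"role": "user", "content": html[:200_000]},  # crude guard against huge pages
        ],
    )
    return response.choices[0].message.content
```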
Trade-offs:
- Per-page cost: a frontier-model API call per page is meaningfully more expensive than a CPU pass through Readability or Trafilatura. For 10 pages, irrelevant. For 100,000 pages, a real budget item.
- Latency: model inference adds 1-5 seconds per page. Bulk processing requires async batching.
- Non-determinism: same input does not always produce identical output. For most extraction tasks the variation is immaterial; for reproducibility-critical pipelines it matters.
- Privacy: the page content is sent to the model provider. For sensitive internal content, choose a model deployment (local, on-prem, or contractually private) that satisfies your data policy.
The mdisbetter.com URL-to-Markdown converter uses an AI-powered approach for one-off conversions, which is well-matched to the use case: humans converting individual unusual pages where the long tail matters and the per-page cost does not. For batch pipelines (the Trafilatura sweet spot), running OSS locally is the right tool — see Building a Web Knowledge Base for AI for the architecture.
Decision framework
The simple version, in two questions.
Question 1: how many pages?
- 1-50 pages, one-offs, irregular cadence: web tool with AI extraction. The per-page accuracy and zero setup beat any DIY pipeline at this volume.
- 50-500 pages, periodic: Readability via a script, or Trafilatura. Both run on a laptop in seconds.
- 500+ pages, batch or recurring: Trafilatura locally. The throughput, cost profile, and metadata extraction are unmatched.
Question 2: what kind of pages?
- News articles, blog posts, long-form journalism: Trafilatura first. Its optimization target.
- SaaS marketing pages, documentation, product pages: Readability or AI-powered. Trafilatura is weaker here.
- Forums, discussion threads, multi-author pages: AI-powered. The two heuristic libraries struggle.
- JavaScript-rendered SPAs: any of the three, but you must render first with Playwright or Puppeteer.
For the JS-rendering step, Playwright in headless mode is the modern default:
```python
from playwright.sync_api import sync_playwright
import trafilatura

with sync_playwright() as p:
    browser = p.chromium.launch()
    page = browser.new_page()
    page.goto(url, wait_until='networkidle')  # wait for XHR/fetch traffic to settle
    html = page.content()
    browser.close()

markdown = trafilatura.extract(html, output_format='markdown')
```
This pattern — Playwright for rendering, Trafilatura for extraction — covers ~95% of real-world cases when the volume justifies a local pipeline.
Comparing output quality
A rough qualitative comparison from running the same 100-URL test set (mix of news, blog, docs, SaaS, forum) through each:
| Genre | Readability | Trafilatura | AI-powered |
|---|---|---|---|
| News article | Excellent | Excellent | Excellent |
| Blog post | Excellent | Excellent | Excellent |
| Documentation | Good | Good | Excellent |
| SaaS landing page | Fair | Fair | Good |
| Forum thread | Poor | Fair | Good |
| E-commerce product page | Poor | Poor | Fair |
| JS-heavy SPA (no rendering) | Fails | Fails | Fails |
| JS-heavy SPA (with Playwright) | Good | Good | Excellent |
The pattern: the three families converge on conventional content, and diverge on the long tail. For conventional content the cheapest tool wins; for the long tail AI extraction earns its cost.
The PDF parallel
The same problem in a different domain: PDF text extraction has its own technical stack (PyMuPDF, pdfplumber, Marker, Docling) with the same shape of trade-offs — cheap deterministic libraries for clean documents, AI-powered tools for the long tail of weird layouts. See how PDF works internally and why extraction breaks for the parallel deep-dive on the PDF side.
Practical recommendation
For application developers building a content pipeline, the canonical stack is Trafilatura + Playwright running locally — predictable cost, fast, deterministic, handles 95% of pages. Reach for AI-powered extraction (via the web tool for one-offs, or via a frontier-model API for programmatic use) when the long tail matters: forums, paywalled content, unusual SPAs, multilingual edge cases.
For end users converting one URL at a time, the web tool is the entire workflow. For developers, the technical depth above is what should drive the build-vs-buy decision. See also handling JavaScript-rendered pages for Markdown for the rendering layer specifically, and the 8-tool benchmark for empirical numbers across hosted services.