What gets kept, what gets dropped
Kept: the post title (becomes the # H1 heading), the author name and publish date (emitted as YAML front matter when requested), the article body in its original paragraph order, headings, blockquotes, lists, code blocks, and images with their alt text. Dropped: every ad slot, share buttons, comment threads, related-post widgets, sidebar columns, modal popups (newsletter, GDPR, paywall teasers), and footer link soup. The resulting Markdown is usually 50–100× smaller in bytes than the original HTML.
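To make the output shape concrete, here is a minimal Python sketch of how the kept pieces could be assembled. The function name `render_post`, its parameters, and the front matter keys `author` and `date` are illustrative assumptions, not the converter's actual interface.

```python
import json
from datetime import date
from typing import Optional


def render_post(
    title: str,
    body: str,
    author: Optional[str] = None,
    published: Optional[date] = None,
    front_matter: bool = False,
) -> str:
    """Assemble optional YAML front matter, an H1 title, and the body.

    Hypothetical helper: the key names and layout are assumptions.
    """
    parts = []
    if front_matter and (author or published):
        parts.append("---")
        if author:
            # A JSON-quoted string is also a valid YAML double-quoted scalar,
            # so names containing colons or quotes stay safe.
            parts.append(f"author: {json.dumps(author, ensure_ascii=False)}")
        if published:
            parts.append(f"date: {published.isoformat()}")
        parts.append("---")
        parts.append("")
    parts.append(f"# {title}")
    parts.append("")
    parts.append(body.strip())
    return "\n".join(parts) + "\n"


if __name__ == "__main__":
    print(render_post(
        "Hello",
        "Body text.",
        author="Ada Lovelace",
        published=date(2024, 1, 5),
        front_matter=True,
    ))
```

When `front_matter` is left at `False`, the author and date are simply omitted and the document starts at the H1.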
How we identify the article body
Most modern blogs follow the Open Graph and JSON-LD Article conventions, which give us structured signals for title, author, and date. For the body, we run a Readability-style pass (the same algorithm Firefox Reader View uses, hardened for edge cases) that scores DOM nodes by content density and link-to-text ratio. The highest-scoring subtree is treated as the article. Then we apply Markdown-specific cleanup: strip trailing whitespace, normalise smart quotes, and collapse consecutive blank lines.
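Two of the steps above can be sketched with the Python standard library alone. This is a simplified illustration, not the production pass: `link_density` covers only the link-to-text ratio signal (not the full Readability scoring or subtree selection), and `clean_markdown` assumes that "normalise smart quotes" means converting curly quotes to straight ones. Both function names are hypothetical.

```python
import re
from html.parser import HTMLParser


class _TextCounter(HTMLParser):
    """Counts text characters, and how many of them sit inside <a> tags."""

    def __init__(self):
        super().__init__()
        self.total = 0
        self.linked = 0
        self._a_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._a_depth += 1

    def handle_endtag(self, tag):
        if tag == "a" and self._a_depth > 0:
            self._a_depth -= 1

    def handle_data(self, data):
        n = len(data.strip())
        self.total += n
        if self._a_depth:
            self.linked += n


def link_density(html_fragment: str) -> float:
    """Share of text characters that are inside links (0.0 if no text)."""
    counter = _TextCounter()
    counter.feed(html_fragment)
    counter.close()  # flush any buffered text
    return counter.linked / counter.total if counter.total else 0.0


# Assumption: curly quotes are mapped to their straight ASCII equivalents.
_SMART_QUOTES = str.maketrans({
    "\u2018": "'", "\u2019": "'",
    "\u201c": '"', "\u201d": '"',
})


def clean_markdown(text: str) -> str:
    """Straighten quotes, strip trailing whitespace, collapse blank lines."""
    text = text.translate(_SMART_QUOTES)
    text = re.sub(r"[ \t]+$", "", text, flags=re.MULTILINE)
    text = re.sub(r"\n{3,}", "\n\n", text)  # at most one blank line in a row
    return text.strip("\n") + "\n"


if __name__ == "__main__":
    nav = '<ul><li><a href="/a">Home</a></li><li><a href="/b">About</a></li></ul>'
    print(link_density(nav))                   # navigation: almost all link text
    print(link_density("<p>Hello world</p>"))  # prose: no link text
```

A navigation block scores close to 1.0 and a prose paragraph close to 0.0, which is why a high link-to-text ratio counts against a candidate subtree. Note that this `clean_markdown` sketch treats every line alike, so it would also straighten quotes inside fenced code and remove the two-space hard line break; a real pass would likely need to skip code blocks.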