News-specific extraction challenges
News articles share the blog-post extraction baseline but add four complications of their own. First, multi-paragraph ads are injected mid-article, marked by phrases such as "Story continues below", and they look like content to a naïve extractor. Second, image galleries fragment the prose into one-paragraph slides. Third, "live blog" formats reverse the chronology, so the newest entry comes first. Fourth, soft paywalls leave the paragraphs in the HTML but hide them visually behind an overlay. Our extractor returns those paragraphs because they are already part of the public HTML; we're not bypassing anything.
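As a rough illustration of the first and third complications, the sketch below drops ad-marker lines and reorders live-blog entries oldest first. It is not the extractor's actual logic: the marker list, the function names, and the (timestamp, text) entry shape are all assumptions made for the example. Removing only the marker line is also a simplification, since a real multi-paragraph ad needs its whole block removed.

```python
import re
from datetime import datetime

# Hypothetical marker phrases; a real list would be tuned per publisher.
AD_MARKERS = re.compile(
    r"^(story continues below|advertisement|article continues after)",
    re.IGNORECASE,
)

def drop_ad_markers(paragraphs):
    """Return the paragraphs with known ad-marker lines removed."""
    return [p for p in paragraphs if not AD_MARKERS.match(p.strip())]

def chronological(entries):
    """Sort live-blog entries, given as (ISO timestamp, text), oldest first."""
    return sorted(entries, key=lambda e: datetime.fromisoformat(e[0]))

paragraphs = ["The vote passed.", "Story continues below", "Reaction was mixed."]
print(drop_ad_markers(paragraphs))

live = [("2024-05-01T10:15:00", "Result announced."),
        ("2024-05-01T09:00:00", "Polls open.")]
print([text for _, text in chronological(live)])
```

Sorting by a parsed timestamp rather than simply reversing the list keeps the output correct even when a page pins an older entry to the top.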
Byline, dateline, and source attribution
News conventions matter for citation. We extract the byline (author or wire service), the dateline (location and date), and the publication name from page metadata, and emit them in YAML front matter. Quotes within the article keep their attribution structure. The result is Markdown you can drop into a citation manager or feed to an LLM that needs to attribute sources correctly.
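To make the front-matter step concrete, here is a minimal sketch of rendering extracted metadata as YAML above the article body. The field names (byline, dateline_location, date, publication) and both function names are hypothetical, chosen only for illustration. Values are quoted with json.dumps because a JSON string is also a valid YAML double-quoted scalar, which avoids a third-party YAML dependency.

```python
import json

def front_matter(meta):
    """Render a flat dict of strings as a YAML front matter block."""
    lines = ["---"]
    for key, value in meta.items():
        if value:  # skip fields the page did not provide
            lines.append(f"{key}: {json.dumps(value, ensure_ascii=False)}")
    lines.append("---")
    return "\n".join(lines)

def to_markdown(meta, body):
    """Prepend the front matter to the article body."""
    return front_matter(meta) + "\n\n" + body.strip() + "\n"

# Hypothetical metadata, as it might come from page tags.
meta = {
    "byline": "Jane Doe",
    "dateline_location": "LONDON",
    "date": "2024-05-01",
    "publication": "Example Times",
}
print(to_markdown(meta, "The council voted on Tuesday to approve the plan."))
```

Quoting every value keeps strings such as dates or names containing colons from being reinterpreted by a YAML parser.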