URL to Markdown for Journalism: Archive Sources Safely
The story you publish on Tuesday cites a primary source that goes dark on Thursday. The press release you screenshotted last month gets stealth-edited overnight; the original wording — the wording your story relied on — is gone. The corporate blog post that was your smoking gun gets unpublished within an hour of your reporter calling for comment. None of this is hypothetical. Source disappearance is a daily risk in working journalism, and the standard tools (browser bookmarks, screenshots, Pocket) provide neither searchability nor evidentiary durability. URL-to-Markdown is a working reporter's archive layer: the moment you encounter a source, it becomes a portable, plaintext, timestamped, searchable, share-with-the-lawyer-able file. Combined with the Wayback Machine for public-facing receipts, it's the closest thing to affidavit-quality preservation a newsroom can deploy without a documents desk.
The disappearing-source problem
Web-based primary sources go dark for predictable reasons. A company quietly retracts a regulatory filing. A government agency restructures its website and an entire archive of agency reports stops resolving. A trade publication paywalls content that was free at the time you reported on it. A press release gets a stealth edit — same URL, different text, no diff trail visible to readers. A blog post by a public figure gets deleted after the figure decides it was a mistake. A LinkedIn post that confirmed a key fact in your story gets taken down by the poster after they're contacted for comment.
Each of these has happened to working reporters in the last calendar year. None of them is preventable. But all of them are recoverable from — if you have a clean archived copy of the source as it existed on the date you read it.
What proper journalism archiving needs to do
Three jobs the standard tools don't all do:
- Capture the content as you read it. Not as a screenshot (image, not searchable, not quotable). Not as a bookmark (a pointer to a URL that may already be different). As text, in a format you can grep, quote from, and re-read.
- Preserve enough metadata to be evidentiary. The URL, the access timestamp, the page title, the byline if available, ideally the HTTP status code and response headers at the time of capture.
- Be retrievable across an entire investigation. A 6-month investigation can accumulate 200+ web sources. They need to be searchable as a corpus, not as a folder of files you remember by name.
URL-to-Markdown does the first two natively. For the third, the output drops into Obsidian, DEVONthink, or a simple grep -r workflow that scales to thousands of files.
The reporter's archiving workflow
Step 1: Capture at read-time, not at publish-time
The discipline that separates a recoverable archive from a fragile one is converting the moment you find the source, not when you sit down to write. Sources die on the timeline of news cycles, not on your publishing timeline. The lag between reading and writing is exactly when disappearance happens.
Workflow: every time a source URL becomes interesting enough to potentially cite, paste it into our URL-to-Markdown converter. Output goes into a Sources/Inbox/ folder for the story. The conversion takes 2-3 seconds and the file is now permanent regardless of what happens to the URL.
For routine beat reporting, a browser keyboard shortcut that fires the conversion in one keystroke pays for itself in the first preserved source.
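If your converter is scriptable, the filing step can be automated too. A minimal sketch, assuming the converted Markdown comes back from whatever converter CLI or API your newsroom uses; the helper name and folder layout are illustrative, not required:

# capture.py: file a freshly converted source into the story's inbox.
# The Markdown text itself comes from whatever converter you use.
import datetime
import pathlib
import re

def file_capture(url: str, story: str, markdown: str) -> pathlib.Path:
    today = datetime.date.today().isoformat()
    # Filesystem-safe slug derived from the URL
    slug = re.sub(r"[^A-Za-z0-9]+", "-", url.split("//", 1)[-1]).strip("-")[:80]
    inbox = pathlib.Path("Sources") / "Inbox" / story
    inbox.mkdir(parents=True, exist_ok=True)
    path = inbox / f"{today} - {slug}.md"
    path.write_text(markdown, encoding="utf-8")
    return path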
Step 2: Capture metadata that holds up
Each converted file's frontmatter should contain, at minimum:
---
source_url: https://example.com/press-release
title: "Company X announces Q4 results"
byline: Jane Doe
published_on_source: 2026-04-30
accessed_at: 2026-05-10T14:22:09Z
http_status: 200
fetched_by: a.reporter@newsroom.com
story: investigation-acme-2026
wayback_snapshot: https://web.archive.org/web/20260510142210/https://example.com/press-release
sha256: 9f86d081884c7d65...
---
The sha256 hash of the page contents is what makes this evidentiary. If a colleague — or a lawyer, or a fact-checker, or a court — needs to verify that the archived copy is unchanged since capture, they re-hash the file and compare. Tampering becomes detectable.
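The verification pass is a few lines. A sketch that assumes, as a convention rather than a fixed spec, that the recorded hash covers the Markdown body below the closing frontmatter delimiter:

# verify_hash.py: confirm an archived source is unchanged since capture.
# Convention assumed: sha256 covers the body below the frontmatter.
import hashlib
import re
import sys

def verify(path: str) -> bool:
    text = open(path, encoding="utf-8").read()
    frontmatter, _, body = text.partition("\n---\n")
    match = re.search(r"^sha256:\s*([0-9a-f]{64})", frontmatter, re.MULTILINE)
    if not match:
        raise ValueError(f"no sha256 field in {path}")
    return hashlib.sha256(body.encode("utf-8")).hexdigest() == match.group(1)

if __name__ == "__main__":
    print("hash matches" if verify(sys.argv[1])
          else "HASH MISMATCH: file altered since capture")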
The wayback_snapshot URL is your public-facing receipt. Submitting every captured URL to web.archive.org/save/<url> at the moment of capture creates an independent timestamped record on infrastructure you don't control. Belt-and-suspenders preservation: your local Markdown is the working copy, the Wayback snapshot is the third-party witness.
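Firing the snapshot can live in the same capture script. A sketch against the public save endpoint; the authenticated SPN2 API is sturdier for high-volume use, and the endpoint's exact response behavior can change over time:

# wayback.py: request a Save Page Now capture and return the snapshot URL.
import requests

def wayback_snapshot(url: str) -> str:
    resp = requests.get(f"https://web.archive.org/save/{url}", timeout=120)
    resp.raise_for_status()
    # Once the capture completes, the response points at the timestamped
    # snapshot (via redirect or a Content-Location header, depending on
    # the endpoint's current behavior).
    return resp.headers.get("Content-Location", resp.url)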
Step 3: Build a story corpus
For a single article, a flat Sources/ folder is fine. For investigations that run for weeks or months, organize per-story:
Investigations/
  acme-2026/
    sources/
      web/
        2026-04-30 - Company X press release.md
        2026-04-30 - Reuters coverage.md
        2026-05-02 - SEC filing summary page.md
      pdf/
        2026-04-30 - SEC filing 10-Q.md
      transcripts/
        2026-05-03 - Source A interview.md
    drafts/
      v1.md
      v2-after-legal.md
    fact-check/
      claims-to-verify.md
      verified.md
For mixed-format investigations — your sources are a combination of web pages, leaked PDFs, public records PDFs, and your own interview transcripts — the same Markdown substrate covers everything. The PDF half of the corpus goes through our PDF to Markdown converter; the workflow tailored to records-heavy reporting is detailed in PDF to Markdown for lawyers, which most investigative reporters will recognize as adjacent to their own document workflow.
Step 4: Search the corpus during reporting
The advantage of a Markdown corpus over a folder of PDFs and screenshots is that grep and ripgrep work on it. To find every source that mentions a specific name, dollar figure, or date across an entire investigation:
rg -i "acme corp" investigations/acme-2026/sources/
rg -i "\$[0-9]+(\.[0-9]+)? million" investigations/acme-2026/sources/
rg -i "april 30, 2026" investigations/acme-2026/sources/
For non-technical reporters, Obsidian's full-text search over the same folder structure does the same thing through a UI. DEVONthink (Mac) is the gold standard for journalism archives past the 1,000-document mark: it indexes the Markdown corpus and gives you fuzzy search, related-document suggestions, and semantic similarity across the entire investigation.
Step 5: Hand off to fact-checking and legal
The fact-check pass on every published story involves verifying that every quote, every figure, every claim attributed to a source actually appears in that source. With an HTML/PDF/screenshot mix, this is hours per story. With a Markdown corpus, it's a sequence of greps. Fact-checkers can run the verification themselves without needing the original reporter present.
For legal review on sensitive stories, the same applies. The newsroom lawyer can search the source corpus directly to confirm that any claim flagged in the draft has supporting language in the archived sources. The hash + Wayback receipt + access timestamp combination satisfies most newsroom standards for source preservation.
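The grep sequence is itself scriptable. A sketch that checks every claim against the corpus, assuming one claim per line in claims-to-verify.md and the folder layout from Step 3:

# factcheck.py: confirm every claim appears verbatim somewhere in the
# archived corpus. Assumes one claim per line in claims-to-verify.md.
import pathlib

sources = pathlib.Path("investigations/acme-2026/sources")
claims_file = pathlib.Path("investigations/acme-2026/fact-check/claims-to-verify.md")

corpus = {p.name: p.read_text(encoding="utf-8").lower()
          for p in sources.rglob("*.md")}

for line in claims_file.read_text(encoding="utf-8").splitlines():
    claim = line.strip().lstrip("- ").strip()
    if not claim:
        continue
    hits = [name for name, text in corpus.items() if claim.lower() in text]
    print(f"{claim[:60]!r}: {', '.join(hits) or 'NOT FOUND, flag for reporter'}")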
Detecting stealth edits
One of the most useful side effects of capture-at-read-time: you build a history of how each source has changed over time. If a press release gets edited a week after publication, re-converting the URL produces a new Markdown file you can diff against the original capture.
diff sources/web/2026-04-30-press-release.md sources/web/2026-05-07-press-release-recheck.md
Stealth edits are themselves often news. "Company quietly removed paragraph admitting X from press release a week after our reporter contacted them" is a story. The diff is the evidence.
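The recheck can be automated with Python's standard difflib, assuming the fresh Markdown comes from the same converter that produced the original capture:

# recheck.py: diff an archived capture against a fresh re-conversion
# of the same URL.
import difflib
import pathlib

def stealth_edit_diff(original: pathlib.Path, fresh_markdown: str) -> str:
    old = original.read_text(encoding="utf-8").splitlines()
    new = fresh_markdown.splitlines()
    # Empty output: no edit. Anything else deserves a second look.
    return "\n".join(difflib.unified_diff(
        old, new, fromfile=str(original), tofile="recheck", lineterm=""))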
AI-assisted investigation across the corpus
For investigations with 50+ source documents, the same AI workflows used in academic literature reviews work for journalism. Drop the entire sources/ folder into a Claude Project (200k context comfortably handles dozens of medium-length sources, more with chunking) and ask:
- "Across all these sources, list every dollar figure mentioned and the source it came from."
- "Which sources contradict each other on the timeline of events?"
- "Identify every named individual mentioned more than once and the role each is described in."
- "Flag any claim that appears in multiple sources but with subtly different wording."
This is not the model writing the story. It's the model serving as a tireless research assistant pulling threads across hundreds of documents simultaneously — exactly the work that traditionally requires a senior researcher and three weeks. The Markdown substrate is what makes the corpus AI-readable in the first place.
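If you'd rather script this than use the Projects UI, the same question runs through the API. A sketch with the Anthropic Python SDK; the model name is illustrative, and a corpus that outgrows the context window needs chunking:

# corpus_query.py: one cross-corpus question via the API.
import pathlib
import anthropic

sources = pathlib.Path("investigations/acme-2026/sources")
corpus = "\n\n".join(f"=== {p} ===\n{p.read_text(encoding='utf-8')}"
                     for p in sorted(sources.rglob("*.md")))

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment
reply = client.messages.create(
    model="claude-sonnet-4-20250514",
    max_tokens=2048,
    messages=[{"role": "user", "content": corpus +
               "\n\nAcross all these sources, list every dollar figure "
               "mentioned and the source file it came from."}])
print(reply.content[0].text)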
Concrete example: a one-month investigation
You're working a corruption story over four weeks. Total source intake: ~80 web pages (press releases, news coverage, official statements, social media posts) and ~30 PDFs (court filings, financial disclosures, public records).
- Daily intake: every URL you read goes through URL-to-Markdown the moment you decide it's potentially citable. Wayback snapshot fires in parallel. PDF intake goes through the PDF converter.
- Weekly review: triage Inbox/ into Cited/ and Background/. Tag with story-relevant entities (people, organizations, dates).
- Week 3 (synthesis): AI pass over the full corpus. Build a chronology, a who's-who, and a list of unresolved contradictions.
- Week 4 (drafting + fact-check): write with the corpus open. Every fact in the draft links to its source file by relative path. Fact-checker runs verification entirely from the corpus.
- Publication day: legal sign-off references the same archive. Story ships with confidence that every cited source is preserved, hashed, and Wayback-snapshotted.
- Post-publication: weekly re-conversion of the most consequential source URLs to detect stealth edits. Any diff is a potential follow-up.
This is the discipline that separates investigations that hold up under post-publication scrutiny from investigations that quietly get retracted six months later because a key source disappeared and nobody had a copy.
What about subscription content and paywalls?
If you have legitimate subscription access to a source, our browser extension captures the rendered DOM after you've authenticated — the same content you can read with your eyes, preserved as Markdown. The cloud converter (which fetches URLs anonymously) cannot bypass paywalls, by design. Use the extension for any source behind authentication; use the cloud converter for everything public.
Tools and integrations specific to newsrooms
- Obsidian: per-investigation vault, full-text search, graph view of cross-referenced sources
- DEVONthink: industrial-strength archiving for newsrooms with thousands of accumulated sources across many investigations
- ArchiveBox: self-hosted Wayback Machine alternative for full-fidelity HTML/PDF/screenshot snapshots alongside Markdown
- HashiCorp Vault / 1Password Teams: for shared credentials when capturing subscription content collaboratively across a team
- Git LFS or a private S3 bucket: for backing up the source corpus alongside drafts
The pattern is the same across every category: Markdown is the durable, searchable, AI-ready substrate. URL-to-Markdown is what makes the live web compatible with the rest of a serious investigation's evidentiary infrastructure. Used at read-time, it transforms source preservation from a chronic risk into a solved problem.