Batch Convert 100+ URLs to Markdown at Once
Converting one URL at a time works for ad-hoc clipping. Converting your 800-URL reading list, an entire blog archive, or a competitor's documentation site needs batch — and batch done well needs concurrency, rate-limit handling, retry logic, and an output structure that doesn't collapse into 800 unnamed files. MDisBetter's web converter handles single URLs beautifully, but it's not a programmatic API — for true batch you'll roll a small script over OSS tooling. Here are the patterns that hold up from 100 URLs to 100,000.
When batch is the right move
- Migrating an old blog (200-2000 posts) into a docs-as-code or knowledge-base workflow
- Building a RAG corpus from a documentation site (200-5000 pages)
- Ingesting a saved-articles export from Pocket, Instapaper, Pinboard (1000-10,000 URLs)
- Converting a competitor's public docs for offline analysis
- Bulk-archiving a personal reading list before a year-end cleanup
Below ~10 URLs, batch isn't worth the setup — open the MDisBetter web tool in a few tabs and click Convert on each. From 10 to ~50 URLs, the multi-tab approach still works if you don't mind the elbow grease. Above 50 URLs, scripted batch over OSS extractors is dramatically better.
The honest baseline: MDisBetter is a web tool
MDisBetter doesn't currently expose a programmatic API, CLI, Python SDK, or MCP server for URL-to-Markdown. The web converter at /convert/url-to-markdown takes one URL per click and returns one .md download. That's by design — it's the lowest-friction surface for ad-hoc work. For 100+ URLs you want automation, and the right path is a self-rolled script using the OSS extractors below. None of them are commercial; all are pip-installable.
Web tool with multiple tabs (up to ~50 URLs)
For small batches, the manual path is hard to beat: open /convert/url-to-markdown in 5-10 browser tabs, paste a URL into each, click Convert, save the resulting .md to a folder. Modern browsers can keep dozens of tabs alive without complaint, and the conversions run server-side in parallel — you're effectively batching at the user-agent layer. Best for one-shot work where setting up Python feels heavier than opening tabs.
Sequential Python script using Trafilatura
For batches of 50-200 URLs and someone comfortable in Python, Trafilatura is the best-in-class OSS readability + Markdown extractor. It handles boilerplate stripping, link preservation, and table extraction in one call:
#!/usr/bin/env python
# pip install trafilatura
import re
from pathlib import Path
import trafilatura
OUT = Path('./out')
OUT.mkdir(exist_ok=True)
def url_to_path(url):
slug = re.sub(r'[^A-Za-z0-9]+', '_',
url.replace('https://', '').replace('http://', ''))[:100]
return OUT / f'{slug}.md'
for url in Path('urls.txt').read_text().splitlines():
url = url.strip()
if not url or url.startswith('#'):
continue
out = url_to_path(url)
if out.exists():
print(f'SKIP {url}')
continue
downloaded = trafilatura.fetch_url(url)
if not downloaded:
print(f'FETCH_FAIL {url}')
continue
md = trafilatura.extract(downloaded, output_format='markdown',
include_links=True, include_tables=True)
if md:
out.write_text(md, encoding='utf-8')
print(f'OK {url}')
else:
print(f'EXTRACT_FAIL {url}')
Run with python convert.py. Resume support is built in (the out.exists() check). Sequential, ~1-3 seconds per URL, so 100 URLs takes ~3-5 minutes.
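The expected urls.txt is nothing fancy: one URL per line, with blank lines and lines starting with # skipped by the loop above. For example (placeholder URLs):

# reading-list export, batch 1
https://example.com/posts/first-article
https://example.com/posts/second-article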
Parallel Python script with httpx + Trafilatura
Cut the sequential run time by roughly 10x by parallelising the network fetch (Trafilatura's extraction is CPU-bound and quick):
#!/usr/bin/env python
# pip install httpx trafilatura tqdm
import asyncio
import re
from pathlib import Path
import httpx
import trafilatura
from tqdm.asyncio import tqdm_asyncio
OUT = Path('./out')
OUT.mkdir(exist_ok=True)
CONCURRENCY = 15
MAX_RETRIES = 3
sem = asyncio.Semaphore(CONCURRENCY)
def url_to_path(url):
slug = re.sub(r'[^A-Za-z0-9]+', '_',
url.replace('https://', '').replace('http://', ''))[:100]
return OUT / f'{slug}.md'
async def convert(client, url, attempt=1):
    out = url_to_path(url)
    if out.exists():
        return ('skip', url)
    async with sem:  # hold a concurrency slot only for the network fetch
        try:
            r = await client.get(url, timeout=60, follow_redirects=True,
                headers={'User-Agent': 'Mozilla/5.0 (compatible; URL2MD/1.0)'})
        except (httpx.TimeoutException, httpx.NetworkError):
            r = None
    # retry outside the semaphore so a backing-off task never blocks a slot
    if r is None:
        if attempt < MAX_RETRIES:
            await asyncio.sleep(2 ** attempt)
            return await convert(client, url, attempt + 1)
        return ('net_fail', url)
    if r.status_code == 429:
        if attempt < MAX_RETRIES:
            await asyncio.sleep(int(r.headers.get('Retry-After', 5)))
            return await convert(client, url, attempt + 1)
        return ('rate_limit', url)
    if r.status_code != 200:
        return (f'http_{r.status_code}', url)
    md = trafilatura.extract(r.text, output_format='markdown',
        include_links=True, include_tables=True)
    if not md:
        return ('extract_fail', url)
    out.write_text(md, encoding='utf-8')
    return ('ok', url)
async def main(urls):
async with httpx.AsyncClient() as client:
results = await tqdm_asyncio.gather(*[convert(client, u) for u in urls])
counts = {}
for status, url in results:
counts[status] = counts.get(status, 0) + 1
if status not in ('ok', 'skip'):
print(f'{status}: {url}')
print(f'Summary: {counts}')
if __name__ == '__main__':
urls = [u.strip() for u in Path('urls.txt').read_text().splitlines()
if u.strip() and not u.strip().startswith('#')]
print(f'Processing {len(urls)} URLs at concurrency={CONCURRENCY}')
asyncio.run(main(urls))
15 concurrent workers, progress bar, retry-with-backoff for transient failures. 1000 URLs in 3-6 minutes on a typical home connection. The same 80 lines work for 1000 URLs or 100,000.
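If you prefer a focused re-run list over re-running the full urls.txt and relying on the skip check, a small addition at the end of main() collects every URL whose status wasn't ok or skip (a sketch; failed.txt is just an illustrative filename):

    # at the end of main(): write failures to a separate list for a targeted re-run
    failed = [url for status, url in results if status not in ('ok', 'skip')]
    Path('failed.txt').write_text('\n'.join(failed) + '\n', encoding='utf-8')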
JS-rendered pages: Playwright + html2text
Trafilatura is fast and good at static HTML. For SPAs (React, Vue, Angular, Docusaurus, public Notion pages, GitBook), it'll get only the empty shell — you need a real browser. Playwright is the standard OSS answer:
#!/usr/bin/env python
# pip install playwright html2text
# python -m playwright install chromium
import asyncio
import re
from pathlib import Path
from playwright.async_api import async_playwright
import html2text
OUT = Path('./out')
OUT.mkdir(exist_ok=True)
CONCURRENCY = 5 # browser pages are heavier than HTTP requests
h = html2text.HTML2Text()
h.body_width = 0
h.ignore_images = False
def url_to_path(url):
slug = re.sub(r'[^A-Za-z0-9]+', '_',
url.replace('https://', '').replace('http://', ''))[:100]
return OUT / f'{slug}.md'
async def convert(browser, url, sem):
    out = url_to_path(url)
    if out.exists():
        return
    async with sem:
        ctx = await browser.new_context()
        page = await ctx.new_page()
        try:
            await page.goto(url, wait_until='networkidle', timeout=30000)
            html = await page.content()
        except Exception as exc:
            # one slow or broken page shouldn't abort the whole batch
            print(f'FAIL {url}: {exc}')
            return
        finally:
            await ctx.close()
        md = h.handle(html)
        out.write_text(md, encoding='utf-8')
async def main(urls):
sem = asyncio.Semaphore(CONCURRENCY)
async with async_playwright() as p:
browser = await p.chromium.launch()
await asyncio.gather(*[convert(browser, u, sem) for u in urls])
await browser.close()
urls = [u.strip() for u in Path('urls.txt').read_text().splitlines() if u.strip()]
asyncio.run(main(urls))
Slower per page (~3-5 sec each, including browser-context spin-up) but works on every JS-rendered site. For mixed batches, classify URLs first (by domain, or by a cheap static fetch that checks whether Trafilatura gets any real content back) and route static URLs through the Trafilatura path and JS-rendered URLs through Playwright.
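A minimal domain-based router, as a sketch; the JS_DOMAINS set is an illustrative placeholder you'd replace with whatever your batch actually contains:

from urllib.parse import urlparse

JS_DOMAINS = {'app.example.com', 'docs.example.io'}  # illustrative, not a real list

def needs_browser(url):
    # route by hostname: known client-rendered domains go to Playwright
    return urlparse(url).netloc in JS_DOMAINS

static_urls = [u for u in urls if not needs_browser(u)]  # -> Trafilatura script
js_urls = [u for u in urls if needs_browser(u)]          # -> Playwright script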
Output organization patterns
Flat folder, slugged filenames
Simplest. Every .md in one folder, filename derived from the URL. Works into the thousands of files; beyond roughly 5000, file browsers and sync tools tend to get sluggish well before the filesystem itself does.
Mirror the URL path structure
For converting a docs site, mirror the path so files map back to URL structure:
def url_to_mirror_path(url):
    from urllib.parse import urlparse
    p = urlparse(url)
    parts = [p.netloc] + [x for x in p.path.strip('/').split('/') if x]
    if len(parts) == 1 or p.path.endswith('/'):
        # bare domain or directory-style URL -> index.md inside that folder
        parts.append('index.md')
    elif not parts[-1].endswith('.md'):
        # page-style URL: json.html -> json.md, json -> json.md
        parts[-1] = parts[-1].removesuffix('.html').removesuffix('.htm') + '.md'
    return OUT.joinpath(*parts)
Result: out/docs.python.org/3/library/json.md — full URL hierarchy preserved, easy to navigate manually.
One file per URL with frontmatter
For knowledge-base ingestion, prepend YAML frontmatter to each file so downstream tools (Obsidian, Hugo, MkDocs) can index by source:
from datetime import date
def write_with_frontmatter(out, url, title, md):
fm = (
'---\n'
f'source: {url}\n'
f'title: "{title}"\n'
f'fetched: {date.today().isoformat()}\n'
'---\n\n'
)
out.write_text(fm + md, encoding='utf-8')
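To get a title when you're on the Trafilatura path, one option (hedged; check your installed version's API) is trafilatura's extract_metadata helper, which pulls title, author and date from the fetched HTML:

# sketch: derive the title from the downloaded HTML before writing frontmatter
meta = trafilatura.extract_metadata(downloaded)       # metadata object with .title etc.
title = meta.title if meta and meta.title else url    # fall back to the URL
write_with_frontmatter(out, url, title, md)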
Single mega-file for LLM ingestion
For dropping into Claude's 1M-token context:
with open('mega.md', 'w', encoding='utf-8') as f:
for path in sorted(OUT.glob('*.md')):
f.write(f'\n\n---\n\n# {path.stem}\n\n')
f.write(path.read_text(encoding='utf-8'))
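Whether the result actually fits a context window is worth a sanity check. A rough rule of thumb (an approximation, not a real tokenizer) is about four characters per token for English prose:

text = Path('mega.md').read_text(encoding='utf-8')
print(f'{len(text):,} chars, roughly {len(text) // 4:,} tokens')  # ~4 chars/token heuristic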
Rate-limit hygiene
You're hitting third-party sites, not MDisBetter — so the rate limits in play are theirs, not ours. Concurrency=15 with the User-Agent header set to something honest is gentle enough for almost every public site; for known-aggressive rate limiters (StackOverflow, GitHub, news aggregators), drop to concurrency=3-5 and add a small asyncio.sleep(0.5) per request. Always honour robots.txt on sites you don't own. Pocket / Instapaper exports point at hundreds of distinct domains, so cross-domain rate limits naturally spread; the script above handles 5000 mixed-domain URLs in under 30 minutes without trouble.
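Honouring robots.txt doesn't need a framework; the standard library's robotparser is enough. A minimal per-domain check, as a sketch (it does blocking I/O, so run it once per domain up front rather than inside the async hot path):

from urllib import robotparser
from urllib.parse import urlparse

_robots = {}  # one cached parser per site root

def allowed(url, agent='URL2MD'):
    root = '{}://{}'.format(urlparse(url).scheme, urlparse(url).netloc)
    if root not in _robots:
        rp = robotparser.RobotFileParser(root + '/robots.txt')
        try:
            rp.read()
        except OSError:
            rp = None  # robots.txt unreachable; treat as allowed
        _robots[root] = rp
    rp = _robots[root]
    return rp.can_fetch(agent, url) if rp else True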
Common pitfalls
Filename collisions
Two URLs that differ only in characters the slug collapses (query strings, punctuation, anything past the 100-character truncation) produce identical filenames. Add a hash suffix or include the netloc in the filename to disambiguate, as in the sketch below.
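A drop-in variant of url_to_path with a short, stable hash suffix (a sketch; assumes the re and OUT definitions from the scripts above):

import hashlib

def url_to_path(url):
    slug = re.sub(r'[^A-Za-z0-9]+', '_',
        url.replace('https://', '').replace('http://', ''))[:90]
    digest = hashlib.sha1(url.encode('utf-8')).hexdigest()[:8]  # unique per full URL
    return OUT / f'{slug}_{digest}.md'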
Charset issues
Always write with encoding='utf-8' in Python. If you redirect script output on Windows, note that Windows PowerShell 5.1 defaults to UTF-16 for > and Out-File; pass -Encoding utf8 (or use PowerShell 7, which defaults to UTF-8) so downstream tools don't choke on the files.
Resume after partial failure
The existence check (if out.exists()) means you can re-run the script after a crash and only the missing URLs convert. Don't skip this — at scale, every batch will hit some failures, and the resume capability turns "start over" into "keep going".
Memory for huge batches
The asyncio.gather pattern creates a task per URL up front, so memory grows with the batch. For 100K+ URLs, switch to a producer-consumer pattern: a fixed pool of worker tasks pulling from a bounded asyncio.Queue, which keeps memory at O(concurrency) instead of O(N).
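A sketch of that worker-pool pattern, reusing the convert() coroutine and imports from the async script above (main_streaming is an illustrative name, not part of any library):

async def worker(client, queue, results):
    while True:
        url = await queue.get()
        if url is None:            # sentinel: no more work for this worker
            return
        results.append(await convert(client, url))

async def main_streaming(url_iter, workers=15):
    queue, results = asyncio.Queue(maxsize=workers * 2), []
    async with httpx.AsyncClient() as client:
        tasks = [asyncio.create_task(worker(client, queue, results))
                 for _ in range(workers)]
        for url in url_iter:       # url_iter can stream lazily from a huge file
            await queue.put(url)   # blocks when the queue is full, bounding memory
        for _ in tasks:
            await queue.put(None)  # one sentinel per worker
        await asyncio.gather(*tasks)
    return results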
Working with PDFs in batch?
For batch PDF conversion (a folder of 500 papers, an enterprise spec archive), the architecture is similar — swap the URL fetch for a file walk, swap Trafilatura for a PDF extractor (PyMuPDF, marker, pdftotext). See the batch convert 100+ PDFs to Markdown guide. Many workflows mix the two: scrape URLs from a sitemap and grab PDFs from a folder, both going into the same vector DB or vault.
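For a flavour of the swap, a bare-bones file-walk sketch using PyMuPDF for plain-text extraction (paths are illustrative; Markdown-fidelity PDF conversion is the job of the heavier tools covered in the linked guide):

import fitz  # pip install pymupdf
from pathlib import Path

OUT = Path('./out_pdf')
OUT.mkdir(exist_ok=True)

for pdf in Path('./pdfs').rglob('*.pdf'):
    out = OUT / f'{pdf.stem}.md'
    if out.exists():               # same resume pattern as the URL scripts
        continue
    doc = fitz.open(str(pdf))
    text = '\n\n'.join(page.get_text() for page in doc)
    out.write_text(text, encoding='utf-8')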
Tracking progress on long-running batches
For batches that take 30+ minutes, you want visible progress and the ability to monitor from another terminal. Two lightweight patterns:
tqdm progress bar
The script above already uses tqdm_asyncio.gather — live progress bar, ETA, throughput rate, one extra import. Best default.
Stats file
Periodically write counts to a JSON file. Tail it from another terminal:
import json
from datetime import datetime
stats = {'ok': 0, 'fail': 0, 'started': datetime.now().isoformat()}
async def convert(client, url):
# ... do work ...
if status == 'ok':
stats['ok'] += 1
else:
stats['fail'] += 1
    if (stats['ok'] + stats['fail']) % 10 == 0:
        with open('progress.json', 'w') as fh:  # close the handle so each snapshot is flushed
            json.dump(stats, fh)
From another terminal: watch -n 1 cat progress.json on Linux/macOS, or a small PowerShell loop on Windows.
Cost math at scale
Self-rolled batch conversion is essentially free in dollar terms — your only outlays are the bandwidth and the compute of the box you run it on. A 10,000-URL batch fits comfortably on a laptop. A 100,000-URL batch is a few hours on the same laptop, or 30 minutes on a single $20/month VPS.
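(Rough arithmetic, using the ~1-3 s per URL and concurrency-15 figures from above: 100,000 URLs × ~1.5 s ÷ 15 ≈ 10,000 seconds, a little under three hours.)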
The trade-off is engineering time: building, testing, and hardening the script costs a few hours up front, plus periodic maintenance when target sites change their HTML. For one-off batch jobs, that cost is paid once. For ongoing high-volume URL ingestion, it amortises to nothing.
Comparing OSS extractors
The four OSS choices for the extraction step:
- Trafilatura — best general-purpose. Excellent boilerplate stripping (benchmarks well against Mozilla Readability), Markdown output native, includes metadata extraction (title, date, author). Default recommendation.
- Mozilla Readability (via readability-lxml in Python) — the reference implementation. Slightly better at long-form articles, less good at structured docs. Pair with html2text or markdownify for Markdown output (a minimal pairing sketch follows this list).
- html2text alone — zero readability layer. Use only when you've already isolated the content container with BeautifulSoup. Otherwise you'll get every nav menu and footer in the output.
- Pandoc — extremely capable but heavier. Better for structured input formats (HTML5, DocBook) than for the messy real-world web. Worth using when you need exotic output formats (LaTeX, EPUB) alongside Markdown.
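For the Readability pairing mentioned above, a minimal sketch (assuming html already holds the fetched page source):

from readability import Document        # pip install readability-lxml
from markdownify import markdownify     # pip install markdownify

doc = Document(html)                     # isolate the main content block
md = markdownify(doc.summary(), heading_style='ATX')
title = doc.short_title()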
For 95% of URL-to-Markdown work, Trafilatura is the right default.
Recommendation
For one-shot conversion of <50 URLs: open the MDisBetter web tool in a few tabs. For 50-1000 URLs of static pages: the Trafilatura sequential script. For 1000+ URLs or production workflows: the async httpx + Trafilatura version with tqdm and retry. For JS-heavy targets: the Playwright variant. Build one of these once, reuse forever — they're 30-80 lines whether you're doing 1000 URLs or 1,000,000. See also scrape a website to Markdown for RAG for the full RAG pipeline that uses this batch step as input.