
Batch Convert 100+ URLs to Markdown at Once

Converting one URL at a time works for ad-hoc clipping. Converting your 800-URL reading list, an entire blog archive, or a competitor's documentation site needs batch — and batch done well needs concurrency, rate-limit handling, retry logic, and an output structure that doesn't collapse into 800 unnamed files. MDisBetter's web converter handles single URLs beautifully, but it's not a programmatic API — for true batch you'll roll a small script over OSS tooling. Here are the patterns that hold up from 100 URLs to 100,000.

When batch is the right move

Below ~10 URLs, batch isn't worth the setup — open the MDisBetter web tool in a few tabs and click Convert on each. From 10 to ~50 URLs, the multi-tab approach still works if you don't mind the elbow grease. Above 50 URLs, scripted batch over OSS extractors is dramatically better.

The honest baseline: MDisBetter is a web tool

MDisBetter doesn't currently expose a programmatic API, CLI, Python SDK, or MCP server for URL-to-Markdown. The web converter at /convert/url-to-markdown takes one URL per click and returns one .md download. That's by design — it's the lowest-friction surface for ad-hoc work. For 100+ URLs you want automation, and the right path is a self-rolled script using the OSS extractors below. None of them are commercial; all are pip-installable.

Web tool with multiple tabs (up to ~50 URLs)

For small batches, the manual path is hard to beat: open /convert/url-to-markdown in 5-10 browser tabs, paste a URL into each, click Convert, save the resulting .md to a folder. Modern browsers can keep dozens of tabs alive without complaint, and the conversions run server-side in parallel — you're effectively batching at the user-agent layer. Best for one-shot work where setting up Python feels heavier than opening tabs.

Sequential Python script using Trafilatura

For batches of 50-200 URLs and someone comfortable in Python, Trafilatura is the best-in-class OSS readability + Markdown extractor. It handles boilerplate stripping, link preservation, and table extraction in one call:

#!/usr/bin/env python
# pip install trafilatura
import re
from pathlib import Path
import trafilatura

OUT = Path('./out')
OUT.mkdir(exist_ok=True)

def url_to_path(url):
    # derive a filesystem-safe slug from the URL, capped at 100 chars
    slug = re.sub(r'[^A-Za-z0-9]+', '_',
                  url.replace('https://', '').replace('http://', ''))[:100]
    return OUT / f'{slug}.md'

for url in Path('urls.txt').read_text().splitlines():
    url = url.strip()
    if not url or url.startswith('#'):  # skip blanks and comment lines
        continue
    out = url_to_path(url)
    if out.exists():  # resume support: already converted on a previous run
        print(f'SKIP {url}')
        continue
    downloaded = trafilatura.fetch_url(url)
    if not downloaded:
        print(f'FETCH_FAIL {url}')
        continue
    md = trafilatura.extract(downloaded, output_format='markdown',
                             include_links=True, include_tables=True)
    if md:
        out.write_text(md, encoding='utf-8')
        print(f'OK   {url}')
    else:
        print(f'EXTRACT_FAIL {url}')

Run with python convert.py. Resume support is built in (the out.exists() check skips anything already converted). It runs sequentially at ~1-3 seconds per URL, so 100 URLs take ~3-5 minutes.

Parallel Python script with httpx + Trafilatura

Cut the sequential runtime by roughly 10x by parallelising the network fetches (Trafilatura's extraction itself is CPU-bound and quick):

#!/usr/bin/env python
# pip install httpx trafilatura tqdm
import asyncio
import re
from pathlib import Path
import httpx
import trafilatura
from tqdm.asyncio import tqdm_asyncio

OUT = Path('./out')
OUT.mkdir(exist_ok=True)
CONCURRENCY = 15
MAX_RETRIES = 3

sem = asyncio.Semaphore(CONCURRENCY)

def url_to_path(url):
    slug = re.sub(r'[^A-Za-z0-9]+', '_',
                  url.replace('https://', '').replace('http://', ''))[:100]
    return OUT / f'{slug}.md'

async def convert(client, url, attempt=1):
    out = url_to_path(url)
    if out.exists():
        return ('skip', url)
    async with sem:
        try:
            r = await client.get(url, timeout=60, follow_redirects=True,
                                 headers={'User-Agent': 'Mozilla/5.0 (compatible; URL2MD/1.0)'})
        except (httpx.TimeoutException, httpx.NetworkError):
            r = None
    # retry outside the semaphore block so a sleeping task doesn't hold a slot
    if r is None:
        if attempt < MAX_RETRIES:
            await asyncio.sleep(2 ** attempt)  # exponential backoff: 2s, 4s, 8s
            return await convert(client, url, attempt + 1)
        return ('net_fail', url)
    if r.status_code == 429:
        retry_after = r.headers.get('Retry-After', '5')
        await asyncio.sleep(int(retry_after) if retry_after.isdigit() else 5)
        if attempt < MAX_RETRIES:
            return await convert(client, url, attempt + 1)
        return ('rate_limit', url)
    if r.status_code != 200:
        return (f'http_{r.status_code}', url)
    md = trafilatura.extract(r.text, output_format='markdown',
                             include_links=True, include_tables=True)
    if not md:
        return ('extract_fail', url)
    out.write_text(md, encoding='utf-8')
    return ('ok', url)

async def main(urls):
    async with httpx.AsyncClient() as client:
        results = await tqdm_asyncio.gather(*[convert(client, u) for u in urls])
    counts = {}
    for status, url in results:
        counts[status] = counts.get(status, 0) + 1
        if status not in ('ok', 'skip'):
            print(f'{status}: {url}')
    print(f'Summary: {counts}')

if __name__ == '__main__':
    urls = [u.strip() for u in Path('urls.txt').read_text().splitlines()
            if u.strip() and not u.strip().startswith('#')]
    print(f'Processing {len(urls)} URLs at concurrency={CONCURRENCY}')
    asyncio.run(main(urls))

15 concurrent workers, a live progress bar, and retry-with-backoff for transient failures. Expect 1000 URLs in 3-6 minutes on a typical home connection. The same ~80 lines scale from 1000 URLs to 100,000 (see the memory note under Common pitfalls before going that big).

JS-rendered pages: Playwright + html2text

Trafilatura is fast and good at static HTML. For SPAs (React, Vue, Angular, Docusaurus, public Notion pages, GitBook), it will fetch only the empty shell, so you need a real browser. Playwright is the standard OSS answer:

#!/usr/bin/env python
# pip install playwright html2text
# python -m playwright install chromium
import asyncio
import re
from pathlib import Path
from playwright.async_api import async_playwright
import html2text

OUT = Path('./out')
OUT.mkdir(exist_ok=True)
CONCURRENCY = 5  # browser pages are heavier than HTTP requests

h = html2text.HTML2Text()
h.body_width = 0
h.ignore_images = False

def url_to_path(url):
    slug = re.sub(r'[^A-Za-z0-9]+', '_',
                  url.replace('https://', '').replace('http://', ''))[:100]
    return OUT / f'{slug}.md'

async def convert(browser, url, sem):
    out = url_to_path(url)
    if out.exists():
        return
    async with sem:
        ctx = await browser.new_context()
        page = await ctx.new_page()
        try:
            await page.goto(url, wait_until='networkidle', timeout=30000)
            html = await page.content()
        except Exception as e:
            print(f'FAIL {url}: {e}')  # one bad URL should not abort the whole gather
            return
        finally:
            await ctx.close()
    md = h.handle(html)
    out.write_text(md, encoding='utf-8')
    print(f'OK   {url}')

async def main(urls):
    sem = asyncio.Semaphore(CONCURRENCY)
    async with async_playwright() as p:
        browser = await p.chromium.launch()
        await asyncio.gather(*[convert(browser, u, sem) for u in urls])
        await browser.close()

urls = [u.strip() for u in Path('urls.txt').read_text().splitlines() if u.strip()]
asyncio.run(main(urls))

Slower per page (~3-5 sec each, including browser-context spin-up) but works on every JS-rendered site. For mixed batches, classify URLs first (by domain or by a quick HEAD-check on response size) and route static URLs through the Trafilatura path, JS URLs through Playwright.
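
For the domain-based split, a minimal router sketch (the JS_DOMAINS set is a hypothetical example to curate per batch, not a canonical list):

from urllib.parse import urlparse

# hypothetical: domains known to be client-rendered go through Playwright
JS_DOMAINS = {'www.notion.so', 'app.gitbook.com'}

def route(urls):
    static, js = [], []
    for url in urls:
        bucket = js if urlparse(url).netloc in JS_DOMAINS else static
        bucket.append(url)
    return static, js

# usage: feed the `static` list to the httpx + Trafilatura script
# and the `js` list to the Playwright + html2text script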

Output organization patterns

Flat folder, slugged filenames

Simplest. Every .md in one folder, filename derived from the URL. Works up to ~5000 files before directory operations start to feel sluggish on most modern operating systems.

Mirror the URL path structure

For converting a docs site, mirror the path so files map back to URL structure:

def url_to_mirror_path(url):
    from urllib.parse import urlparse
    p = urlparse(url)
    parts = [p.netloc] + [x for x in p.path.strip('/').split('/') if x]
    # directory-style URLs (trailing slash or bare domain) get an index file
    if p.path.endswith('/') or len(parts) == 1:
        parts.append('index')
    # strip a trailing .html and ensure a .md extension
    last = parts[-1]
    if last.endswith('.html'):
        last = last[:-len('.html')]
    if not last.endswith('.md'):
        last += '.md'
    parts[-1] = last
    return OUT.joinpath(*parts)

Result: out/docs.python.org/3/library/json.md — full URL hierarchy preserved, easy to navigate manually.

One file per URL with frontmatter

For knowledge-base ingestion, prepend YAML frontmatter to each file so downstream tools (Obsidian, Hugo, MkDocs) can index by source:

from datetime import date
def write_with_frontmatter(out, url, title, md):
    safe_title = title.replace('"', "'")  # keep the YAML double-quoting valid
    fm = (
        '---\n'
        f'source: {url}\n'
        f'title: "{safe_title}"\n'
        f'fetched: {date.today().isoformat()}\n'
        '---\n\n'
    )
    out.write_text(fm + md, encoding='utf-8')
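
To wire this into the sequential Trafilatura script, one option is to pull the title via Trafilatura's metadata extraction; a sketch, assuming the downloaded, url, out, and md variables from that script:

# replace out.write_text(md, encoding='utf-8') in the sequential loop with:
meta = trafilatura.extract_metadata(downloaded)  # parses <title>, og:title, etc.
title = meta.title if meta and meta.title else url
write_with_frontmatter(out, url, title, md)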

Single mega-file for LLM ingestion

For dropping into Claude's 1M-token context:

with open('mega.md', 'w', encoding='utf-8') as f:
    for path in sorted(OUT.glob('*.md')):
        f.write(f'\n\n---\n\n# {path.stem}\n\n')
        f.write(path.read_text(encoding='utf-8'))

Rate-limit hygiene

You're hitting third-party sites, not MDisBetter — so the rate limits in play are theirs, not ours. Concurrency=15 with the User-Agent header set to something honest is gentle enough for almost every public site; for known-aggressive rate limiters (StackOverflow, GitHub, news aggregators), drop to concurrency=3-5 and add a small asyncio.sleep(0.5) per request. Always honour robots.txt on sites you don't own. Pocket / Instapaper exports point at hundreds of distinct domains, so cross-domain rate limits naturally spread; the script above handles 5000 mixed-domain URLs in under 30 minutes without trouble.
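
For the known-aggressive cases, a per-domain semaphore caps the load each host sees without slowing the rest of the batch. A sketch that slots into the async script above (PER_DOMAIN and the 0.5 s delay are assumptions to tune, not any site's documented limits):

import asyncio
from collections import defaultdict
from urllib.parse import urlparse

PER_DOMAIN = 3  # assumed cap on in-flight requests per host

domain_sems = defaultdict(lambda: asyncio.Semaphore(PER_DOMAIN))

async def polite_get(client, url, **kwargs):
    # pace requests per host, independent of the global semaphore
    async with domain_sems[urlparse(url).netloc]:
        await asyncio.sleep(0.5)
        return await client.get(url, **kwargs)

Swap client.get(...) for polite_get(client, url, ...) inside convert() and the global concurrency can stay at 15.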

Common pitfalls

Filename collisions

Two URLs with similar paths can slug to identical filenames (trailing-slash variants, or paths longer than the 100-character cap). Add a short hash suffix, as sketched below, or include the netloc in the filename to disambiguate.
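
A sketch of the hash-suffix variant of url_to_path (the 8-character SHA-1 prefix is an arbitrary choice; any stable digest works):

import hashlib
import re
from pathlib import Path

OUT = Path('./out')

def url_to_path(url):
    slug = re.sub(r'[^A-Za-z0-9]+', '_',
                  url.replace('https://', '').replace('http://', ''))[:90]
    digest = hashlib.sha1(url.encode()).hexdigest()[:8]  # unique per full URL
    return OUT / f'{slug}_{digest}.md'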

Charset issues

Always write with encoding='utf-8' in Python. PowerShell users running the scripts on Windows: pass -Encoding utf8 when piping output (e.g. python convert.py | Out-File -Encoding utf8 run.log), otherwise PowerShell writes UTF-16 and breaks downstream tools.

Resume after partial failure

The existence check (if out.exists()) means you can re-run the script after a crash and only the missing URLs convert. Don't skip this — at scale, every batch will hit some failures, and the resume capability turns "start over" into "keep going".

Memory for huge batches

The asyncio.gather pattern loads all coroutines into memory before running. For 100K+ URLs, switch to a streaming pattern with asyncio.as_completed or a producer-consumer queue. Reduces memory from O(N) to O(concurrency).
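
A sketch of the queue variant, reusing convert(), CONCURRENCY, and the httpx client from the parallel script above (the maxsize of 100 is an arbitrary bound):

async def main(urls):
    queue = asyncio.Queue(maxsize=100)  # bounds pending URLs; memory stays O(concurrency)

    async def worker(client):
        while True:
            url = await queue.get()
            try:
                await convert(client, url)
            finally:
                queue.task_done()

    async with httpx.AsyncClient() as client:
        workers = [asyncio.create_task(worker(client)) for _ in range(CONCURRENCY)]
        for url in urls:
            await queue.put(url)  # blocks while the queue is full
        await queue.join()  # wait until every queued URL is processed
        for w in workers:
            w.cancel()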

Working with PDFs in batch?

For batch PDF conversion (a folder of 500 papers, an enterprise spec archive), the architecture is similar — swap the URL fetch for a file walk, swap Trafilatura for a PDF extractor (PyMuPDF, marker, pdftotext). See the batch convert 100+ PDFs to Markdown guide. Many workflows mix the two: scrape URLs from a sitemap and grab PDFs from a folder, both going into the same vector DB or vault.

Tracking progress on long-running batches

For batches that take 30+ minutes, you want visible progress and the ability to monitor from another terminal. Two lightweight patterns:

tqdm progress bar

The script above already uses tqdm_asyncio.gather — live progress bar, ETA, throughput rate, one extra import. Best default.

Stats file

Periodically write counts to a JSON file. Tail it from another terminal:

import json
from datetime import datetime
from pathlib import Path

stats = {'ok': 0, 'fail': 0, 'started': datetime.now().isoformat()}

async def convert(client, url):
    # ... fetch and extract as in the parallel script, setting `status` ...
    if status == 'ok':
        stats['ok'] += 1
    else:
        stats['fail'] += 1
    if (stats['ok'] + stats['fail']) % 10 == 0:  # snapshot every 10 URLs
        Path('progress.json').write_text(json.dumps(stats))

From another terminal: watch -n 1 cat progress.json on Linux/macOS, or a small PowerShell loop on Windows.

Cost math at scale

Self-rolled batch conversion is essentially free in dollar terms: your only outlays are bandwidth and the compute of the box you run it on. A 10,000-URL batch fits comfortably on a laptop. At ~2 seconds per URL and concurrency 15, a 100,000-URL batch works out to roughly 100,000 × 2 / 15 ≈ 13,000 seconds, i.e. a few hours on the same laptop, or about 30 minutes on a single $20/month VPS pushed to higher concurrency.

The trade-off is engineering time: building, testing, and hardening the script costs a few hours up front, plus periodic maintenance when target sites change their HTML. For one-off batch jobs, that cost is paid once. For ongoing high-volume URL ingestion, it amortises to nothing.

Comparing OSS extractors

The extraction step in this guide leans on two complementary OSS libraries: Trafilatura, which does readability-style boilerplate removal and emits Markdown directly (ideal for static HTML), and html2text, which faithfully converts whatever HTML you hand it, including a Playwright-rendered DOM, without stripping boilerplate.

For 95% of URL-to-Markdown work, Trafilatura is the right default.

Recommendation

For one-shot conversion of <50 URLs: open the MDisBetter web tool in a few tabs. For 50-1000 URLs of static pages: the Trafilatura sequential script. For 1000+ URLs or production workflows: the async httpx + Trafilatura version with tqdm and retry. For JS-heavy targets: the Playwright variant. Build one of these once, reuse forever — they're 30-80 lines whether you're doing 1000 URLs or 1,000,000. See also scrape a website to Markdown for RAG for the full RAG pipeline that uses this batch step as input.

Frequently asked questions

How do I avoid duplicating work when running the script multiple times?
The existence check (`if out.exists()` in Python) provides resume support out of the box — already-converted URLs are skipped. For periodic re-runs against changing source content, store a per-URL fetch timestamp and only re-convert URLs whose source has updated since (HEAD request, compare Last-Modified header).
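
A sketch of the Last-Modified check, assuming you have stored last_fetched as a timezone-aware datetime per URL (source_updated is a hypothetical helper name):

from email.utils import parsedate_to_datetime

async def source_updated(client, url, last_fetched):
    # HEAD is a cheap freshness probe; not every server sends Last-Modified
    r = await client.head(url, follow_redirects=True)
    lm = r.headers.get('Last-Modified')
    if lm is None:
        return True  # no header: assume stale and re-convert
    return parsedate_to_datetime(lm) > last_fetched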
What if some URLs in my batch require JavaScript rendering and others don't?
Pre-classify by domain or by a HEAD-check on the response size, then route static URLs through the Trafilatura path and JS URLs through the Playwright path. Or run Trafilatura first; for any URL where extraction returns an empty or near-empty result, fall back to Playwright. The fallback approach catches the surprises (a static-looking site that turns out to be client-rendered).
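
A sketch of the fallback check (MIN_CHARS is an arbitrary threshold; tune it against a few pages you know extract well):

import trafilatura

MIN_CHARS = 200  # assumption: shorter extracts suggest a client-rendered shell

def needs_browser(html):
    md = trafilatura.extract(html, output_format='markdown',
                             include_links=True, include_tables=True)
    return (md is None or len(md) < MIN_CHARS), md

# if needs_browser(html)[0] is True, re-fetch the URL through the Playwright path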
Can I run batch conversion as a scheduled job (cron, GitHub Actions)?
Yes — the Python scripts are fully non-interactive. Schedule the workflow, commit the urls.txt to your repo (or fetch it dynamically from a database/API), and let it run. The output Markdown can be committed back to the repo, uploaded to S3, or pushed to a vector DB depending on your downstream consumer. GitHub Actions handles the Trafilatura path comfortably; for Playwright you'll want a self-hosted runner since the headless browser eats memory beyond the GitHub-hosted runner limits on big batches.
Does MDisBetter offer a programmatic API for this?
Not today. The web tool at /convert/url-to-markdown is the supported surface, designed for ad-hoc single-URL conversions. For batch you build the script yourself with Trafilatura or Playwright as shown above — the OSS path is mature, free, and gives you full control over rate limits, output structure, and authentication for private pages.