
How to Convert Any URL to Markdown (Step-by-Step Guide)

You found a great article, an API doc page, a Wikipedia entry, or a tutorial — and you want it in Markdown. Maybe to feed an LLM, drop into Obsidian, save in a knowledge base, or just read distraction-free. Doing it manually means copy-pasting, then fighting line breaks, ads, navigation menus, sidebar widgets, and HTML cruft for ten minutes. There is a 30-second version. Here is how it works, plus the advanced controls you eventually want.

The 30-second version

The fastest path: open our URL to Markdown converter, paste the URL, click convert, copy the Markdown or download the .md file. That is it. The output strips menus, ads, footers, cookie banners, and JavaScript widgets, then converts the readable content (article body, headings, lists, code blocks, tables, images) into clean Markdown.

Try it on something real: paste https://en.wikipedia.org/wiki/Markdown. You get the article body in GitHub-Flavored Markdown — H1 down to H4, lists nested correctly, tables intact, internal links rewritten as relative-to-Wikipedia URLs. The right column, navigation, edit links, and language picker are gone.

How it works under the hood (briefly)

Three stages run for every URL:

  1. Fetch. The page is loaded server-side. By default the converter uses a fast HTTP fetch; if the page is detected as a JavaScript-only SPA, it transparently falls back to a headless browser (more on that below).
  2. Readability. A boilerplate-removal pass identifies the actual content region versus chrome. This is the same conceptual algorithm Firefox Reader View uses, hardened for edge cases (multi-column blogs, news sites with anti-scraping markup, AMP variants).
  3. Markdown serialization. The cleaned HTML tree is converted to GitHub-Flavored Markdown: headings, lists, tables, fenced code blocks with language hints, images with alt text, links with anchor text.

That is the default. If you need different behavior — render JavaScript explicitly, target a specific section — read on.
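
If you want to see the three stages concretely, here is a minimal OSS approximation, assuming the readability-lxml and markdownify packages. This is a sketch of the concept, not MDisBetter's actual implementation:

# pip install requests readability-lxml markdownify
import requests
from readability import Document
from markdownify import markdownify as to_markdown

# Stage 1: fetch the raw HTML
html = requests.get('https://en.wikipedia.org/wiki/Markdown', timeout=15).text

# Stage 2: boilerplate removal (same conceptual algorithm as Reader View)
doc = Document(html)
content_html = doc.summary()  # cleaned HTML of the main content region

# Stage 3: serialize the cleaned tree to Markdown
markdown = to_markdown(content_html, heading_style='ATX')
print(doc.title())
print(markdown[:500])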

Step-by-step walkthrough

Step 1: Paste the URL

Any public HTTP/HTTPS URL works. Pages behind a login won't work through the web tool — it fetches anonymously, so the URL must be publicly accessible. For auth'd pages, see the OSS path further down.

Step 2: Click Convert

One click. Output appears in a preview pane within 1-3 seconds for static pages, 5-10 seconds for SPA-rendered pages.

Step 3: Copy or download

Two options: copy to clipboard (paste straight into ChatGPT, Notion, Obsidian, your editor) or download the .md file (the page title becomes the filename).

Advanced: when auto-detection isn't enough

Auto-detection works ~95% of the time. The 5% where it falls short: heavily customized blog templates, paginated articles where the content is inside a non-standard wrapper, or sites where the readability pass misidentifies a sidebar widget as the main content.

For those cases, the OSS path gives you full control. With BeautifulSoup you can target a specific element and feed just that fragment to a Markdown converter:

# pip install requests beautifulsoup4 lxml trafilatura
import requests
from bs4 import BeautifulSoup
import trafilatura

html = requests.get('https://example.com/article', timeout=15).text
soup = BeautifulSoup(html, 'lxml')  # the 'lxml' parser needs the lxml package
main = soup.select_one('article.main-content')  # your custom selector
if main is None:
    raise SystemExit('Selector matched nothing — inspect the page and adjust it')
md = trafilatura.extract(
    str(main),  # feed only the selected fragment
    output_format='markdown',
    include_links=True,
    include_tables=True,
)
print(md)

Common selectors to try: article, main article, .post-content, #main-content, div[role="main"]. Inspect the page in your browser DevTools to find the right one.

Advanced: JavaScript rendering

Many modern sites do not render content in the initial HTML. React, Vue, and other SPA frameworks ship a near-empty <div id="root"> in the static HTML and render the actual content client-side after JavaScript executes. Static fetch on these pages returns nothing useful.

The MDisBetter web tool auto-detects this — if the static fetch returns less than a threshold of body text, it falls back to headless browser rendering automatically. You don't need to do anything; the resulting Markdown looks the same as for a static page.

For batch SPA conversion or when you want full control, use Playwright in the OSS path. For a deep dive on which sites need this and the Playwright recipe, see our SPA conversion guide and handling JavaScript-rendered pages for Markdown.
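
As a starting point, here is a minimal Playwright sketch. The URL is a placeholder, and wait_until='networkidle' is one reasonable readiness heuristic among several:

# pip install playwright trafilatura && playwright install chromium
from playwright.sync_api import sync_playwright
import trafilatura

with sync_playwright() as p:
    browser = p.chromium.launch()
    page = browser.new_page()
    page.goto('https://example.com/spa-article', wait_until='networkidle')
    html = page.content()  # the fully rendered DOM, after client-side JS ran
    browser.close()

md = trafilatura.extract(html, output_format='markdown', include_tables=True)
print(md)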

Advanced: batch conversion

One URL at a time is fine for ad-hoc use. For converting an entire blog archive, an API documentation site, or a list of 200 reading-list URLs, you'll want a script. MDisBetter doesn't currently expose a programmatic API, CLI, Python SDK, or MCP server — for true batch conversion you roll your own with OSS extractors. The right tool is Trafilatura (best-in-class readability plus Markdown output, permissively licensed):

# pip install trafilatura
from pathlib import Path
import trafilatura

urls = Path('urls.txt').read_text(encoding='utf-8').splitlines()
for url in urls:
    url = url.strip()
    if not url:
        continue
    downloaded = trafilatura.fetch_url(url)
    if downloaded is None:  # fetch_url returns None on failure
        print(f'FETCH_FAIL {url}')
        continue
    md = trafilatura.extract(
        downloaded,
        output_format='markdown',
        include_links=True,
        include_tables=True,
    )
    if not md:
        print(f'EXTRACT_FAIL {url}')
        continue
    # derive a filename from the last path segment of the URL
    slug = url.rstrip('/').split('/')[-1] or 'index'
    Path(f'{slug}.md').write_text(md, encoding='utf-8')
    print(f'OK {url}')

For full batch patterns (concurrency, rate limiting, retries, JS-rendering fallback), see batch convert 100+ URLs to Markdown.
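
Until then, here is one hedged way to add bounded concurrency to the loop above; the worker count and politeness delay are assumptions to tune per site:

# pip install trafilatura
import time
from pathlib import Path
from concurrent.futures import ThreadPoolExecutor
import trafilatura

def convert(url: str) -> str:
    time.sleep(0.5)  # crude politeness delay per request
    downloaded = trafilatura.fetch_url(url)
    md = trafilatura.extract(downloaded, output_format='markdown') if downloaded else None
    if not md:
        return f'FAIL {url}'
    slug = url.rstrip('/').split('/')[-1] or 'index'
    Path(f'{slug}.md').write_text(md, encoding='utf-8')
    return f'OK {url}'

urls = [u.strip() for u in Path('urls.txt').read_text(encoding='utf-8').splitlines() if u.strip()]
with ThreadPoolExecutor(max_workers=4) as pool:  # bounded concurrency
    for result in pool.map(convert, urls):
        print(result)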

Common use cases

Feed an LLM

The single most common use. ChatGPT and Claude both accept Markdown as input — far more reliably than HTML or PDF. Convert the URL, paste the Markdown into the prompt, ask your question. Token usage drops 60-80% vs pasting raw HTML.
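
You can verify the savings on any page yourself with a quick sketch using tiktoken; the exact ratio varies by page, so treat the 60-80% figure as typical rather than guaranteed:

# pip install tiktoken requests trafilatura
import requests
import tiktoken
import trafilatura

html = requests.get('https://en.wikipedia.org/wiki/Markdown', timeout=15).text
md = trafilatura.extract(html, output_format='markdown')

enc = tiktoken.get_encoding('cl100k_base')
print('HTML tokens:    ', len(enc.encode(html)))
print('Markdown tokens:', len(enc.encode(md or '')))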

Save to Obsidian

Save the .md file to your vault. Obsidian indexes it instantly, links work, graph view picks it up. Often cleaner than browser-extension web clippers because there's no per-site template guessing — the readability pipeline applies the same logic across every domain. See URL to Markdown for Obsidian.

Import into Notion

Notion's import accepts .md directly. Convert the URL, drop the file into Notion, get a fully editable page with proper block structure (versus the read-only embed of the native web import). See Notion import guide.

Build a documentation archive

Convert an entire docs site (Stripe API, FastAPI docs, your own internal docs) into Markdown for offline reading, AI ingestion, or migration to a docs-as-code workflow. See documentation site conversion.

What about PDFs?

If you have PDFs instead of URLs, our PDF to Markdown converter follows the same pattern: same Markdown quality, same downstream uses, same web-tool surface. Many users mix the two: some sources are URLs, some are PDFs, both end up as Markdown in the same vault or RAG pipeline.

Pitfalls and how to handle them

Paywalled articles

If the content sits behind a paywall, the converter sees what an anonymous browser sees — usually a teaser plus a paywall message. Use the page in a way that respects the publisher's terms; for paywalled content you have legitimate access to, fetch via your authenticated session in your own script, then run Trafilatura on the result.

Anti-scraping protections

Some sites (Cloudflare-protected, Datadome-protected, etc.) block automated fetches. Symptom: the converter returns an error or a near-empty page. There is rarely a clean workaround for these — the publisher is signaling they don't want automated access.

Pagination

Multi-page articles need each page converted separately. For long-form content split across 5+ pages, look for a "View all on one page" link first.
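
If there is no single-page view, a small loop works. This sketch assumes a ?page=N URL pattern — check the site's actual pagination scheme in your browser first:

# pip install trafilatura
import trafilatura

parts = []
for page in range(1, 6):  # pages 1-5; adjust to the article's length
    html = trafilatura.fetch_url(f'https://example.com/article?page={page}')
    md = trafilatura.extract(html, output_format='markdown') if html else None
    if md:
        parts.append(md)

print('\n\n'.join(parts))  # stitched single document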

Non-article pages

The converter is tuned for article-shaped content. Home pages, search result pages, dashboards, and forms get poor results because there is no meaningful "main content" region. For these, drop down to the OSS path and target a specific CSS selector with BeautifulSoup.

Auth-walled pages

The MDisBetter web tool fetches anonymously and doesn't accept session cookies. For pages behind a login, use the OSS path with requests (or httpx) plus your session cookie or bearer token in headers, then feed the response HTML to Trafilatura.
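
A minimal sketch of that path, where the cookie value and URL are placeholders you supply:

# pip install requests trafilatura
import requests
import trafilatura

# Use your own session cookie or a bearer token, copied from DevTools
headers = {'Cookie': 'session=YOUR_SESSION_COOKIE'}  # or 'Authorization': 'Bearer ...'
resp = requests.get('https://example.com/members/article', headers=headers, timeout=15)
resp.raise_for_status()

md = trafilatura.extract(resp.text, output_format='markdown', include_links=True)
print(md)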

Quality benchmarks

How does the auto-detected output compare to hand-curated extraction? On a benchmark of 100 mixed pages (50 articles, 25 docs pages, 15 blog posts, 10 SPAs), measured against manually-cleaned ground truth, the auto-detected output matched on roughly 90-95% of pages.

The 5-10% of imperfect cases are exactly where the OSS-path-with-custom-selector earns its place. Two minutes of inspection turns a 92% page into a 100% page.

How this compares to copy-paste

Copy-paste from a browser is the baseline alternative. Three problems with it:

  1. The copy includes invisible Unicode (zero-width spaces, non-breaking spaces, smart quotes) that often breaks downstream tools.
  2. Line breaks are inconsistent — some browsers preserve paragraph breaks, others collapse everything into one line.
  3. You copy the visible page, not the underlying structure, so headings come out as bold text rather than H2/H3, and lists come out as plain paragraphs with bullets in front.

Markdown conversion produces structurally-correct output with none of these issues.
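
You can see the invisible-character problem directly. This snippet flags every non-ASCII character in a pasted string, which is where zero-width spaces and smart quotes hide:

# The string below simulates text copied from a browser
pasted = 'smart \u201cquotes\u201d, a non-breaking\u00a0space, and a zero-width\u200bspace'
for ch in pasted:
    if ord(ch) > 127:
        print(f'U+{ord(ch):04X} {ch!r}')  # code point and character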

Recommendation

For most users: just use the web UI. It handles 95% of pages with no configuration. When you hit a page where auto-detection fails, drop down to the OSS Trafilatura + BeautifulSoup recipe with a custom selector. When you need volume, the OSS Trafilatura batch script scales linearly. The whole approach is built so the easy path stays easy and the advanced path is a few lines of Python away. For more complex workflows, see scraping for RAG, batch conversion, and whole-site documentation conversion.

Frequently asked questions

Does the URL to Markdown converter work on private intranet pages?
Only if the page is reachable from our servers. Internal corporate URLs behind a VPN or firewall aren't accessible — for those, run a small Python script on your own network: requests.get(url) with your session, then trafilatura.extract on the response. Both libraries are pip-installable and work offline.
Will the converted Markdown preserve the original page's images?
Yes — images are preserved as Markdown image syntax with the original URL as the source. The image files themselves are not downloaded by the web tool. If you need self-contained Markdown with images downloaded locally, post-process the .md with a small script: regex-extract every ![alt](url), fetch each image, save locally, rewrite the link.
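A sketch of that post-processing step, where the images/ folder and the article.md filename are assumptions:

# pip install requests
import re
import requests
from pathlib import Path

md_path = Path('article.md')
text = md_path.read_text(encoding='utf-8')
Path('images').mkdir(exist_ok=True)

def localize(match: re.Match) -> str:
    alt, url = match.group(1), match.group(2)
    filename = Path('images') / url.split('/')[-1].split('?')[0]
    filename.write_bytes(requests.get(url, timeout=15).content)
    return f'![{alt}]({filename})'  # rewrite the link to the local copy

# Match every ![alt](http...) image reference and download it
text = re.sub(r'!\[([^\]]*)\]\((https?://[^)\s]+)\)', localize, text)
md_path.write_text(text, encoding='utf-8')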
How do I convert a URL that requires login?
Two options: (1) get a publicly-accessible version of the URL, or (2) write a small Python script using requests with your session cookie or bearer token in headers, then run trafilatura.extract on the response. The MDisBetter web tool fetches anonymously, so auth'd content lives in the OSS-script path.
Does MDisBetter offer a programmatic API, CLI, or Python SDK?
Not today. The supported surface is the web tool at /convert/url-to-markdown — designed for ad-hoc one-off conversions. For automation (batch, scheduled jobs, RAG ingestion), the right path is Trafilatura plus httpx (and optionally Playwright for JS-heavy sites). All permissively licensed and a few pip-installs away.