11 min read · MDisBetter

Scrape a Website to Markdown for RAG (Python Tutorial)

Building a RAG pipeline on top of a website's content is a four-step problem: discover URLs, fetch them, convert to a clean format, chunk for embedding. The format step is where most pipelines silently degrade — raw HTML chunks pollute embeddings with menu text, footer noise, and ad markup, dragging retrieval quality down by 30-50% before you even reach the embedding model. Markdown solves it cleanly.

This tutorial uses entirely open-source tooling — Trafilatura for extraction, ChromaDB for the vector store, OpenAI embeddings — so the whole pipeline is yours to own. MDisBetter ships a one-URL-at-a-time web tool for ad-hoc conversions; for the at-scale RAG case described here you want full programmatic control, which the OSS path gives you.

The pipeline at a glance

sitemap.xml
     ↓
list of URLs
     ↓
[fetch + extract Markdown via Trafilatura] (concurrent)
     ↓
folder of .md files
     ↓
[chunk by H2 headings]
     ↓
list of (chunk_text, source_url, section_title)
     ↓
[embed → vector DB]
     ↓
RAG-ready

We'll build each stage with runnable code. Target site: https://docs.python.org (Python 3 documentation — public, well-structured, has a sitemap, ~600 pages). Install once: pip install requests trafilatura httpx tqdm openai chromadb (the optional hybrid-search and re-ranking sections near the end also use rank_bm25, numpy, and cohere).

Stage 1: Discover URLs from the sitemap

import requests
import xml.etree.ElementTree as ET

NS = {'sm': 'http://www.sitemaps.org/schemas/sitemap/0.9'}

def fetch_sitemap(url):
    """Recursively expand sitemap indexes into a flat URL list."""
    xml = requests.get(url, timeout=30).text
    root = ET.fromstring(xml)
    urls = []
    for loc in root.findall('.//sm:loc', NS):
        if loc.text.endswith('.xml'):
            urls.extend(fetch_sitemap(loc.text))
        else:
            urls.append(loc.text)
    return urls

all_urls = fetch_sitemap('https://docs.python.org/3/sitemap.xml')
# Filter to tutorial + library + reference sections
docs_urls = [u for u in all_urls if any(
    x in u for x in ('/tutorial/', '/library/', '/reference/', '/howto/')
)]
print(f'{len(docs_urls)} doc URLs')
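One wrinkle worth guarding against on other sites: sitemap indexes sometimes point at gzip-compressed files (`sitemap-1.xml.gz`). requests decompresses `Content-Encoding: gzip` transparently, but a literal `.gz` file arrives compressed in `resp.content`. A sketch of the decode-and-parse half (docs.python.org doesn't need the `.gz` branch; it's here for reuse):

```python
import gzip
import xml.etree.ElementTree as ET

NS = {'sm': 'http://www.sitemaps.org/schemas/sitemap/0.9'}

def decode_sitemap_payload(url, raw_bytes):
    """Decode a fetched sitemap body, gunzipping literal .gz files."""
    if url.endswith('.gz'):
        return gzip.decompress(raw_bytes).decode('utf-8')
    return raw_bytes.decode('utf-8')

def parse_locs(xml_text):
    """Return every <loc> value from a sitemap or sitemap-index document."""
    root = ET.fromstring(xml_text)
    return [loc.text for loc in root.findall('.//sm:loc', NS) if loc.text]
```

Feed `decode_sitemap_payload(url, resp.content)` into `parse_locs` in place of the `.text`-based parsing above when a site serves compressed sitemaps.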

Stage 2: Fetch + extract Markdown with Trafilatura

Trafilatura is the OSS gold standard for readability extraction — better boilerplate stripping than html2text, native Markdown output, handles most static sites cleanly. Combine it with httpx for concurrent fetching:

import asyncio
import hashlib
from pathlib import Path
import httpx
import trafilatura
from tqdm.asyncio import tqdm_asyncio

OUT = Path('./corpus')
OUT.mkdir(exist_ok=True)
SEM = asyncio.Semaphore(10)  # 10 concurrent

def url_to_filename(url):
    h = hashlib.sha1(url.encode()).hexdigest()[:10]
    safe = url.replace('https://', '').replace('/', '_')[:100]
    return OUT / f'{safe}_{h}.md'

async def convert_one(client, url):
    out = url_to_filename(url)
    if out.exists():
        return  # idempotent
    async with SEM:
        try:
            r = await client.get(
                url, timeout=60, follow_redirects=True,
                headers={'User-Agent': 'Mozilla/5.0 (compatible; RAG-builder/1.0)'},
            )
        except httpx.RequestError as e:
            print(f'NET FAIL {url}: {e}')
            return
    if r.status_code != 200:
        print(f'HTTP FAIL {url}: {r.status_code}')
        return
    md = trafilatura.extract(
        r.text,
        output_format='markdown',
        include_links=True,
        include_tables=True,
        # Help Trafilatura find the main content on Sphinx-themed sites
        favor_precision=True,
    )
    if not md:
        print(f'EXTRACT FAIL {url}')
        return
    out.write_text(
        # Record the source URL as an HTML comment so chunking can recover it
        f'<!-- source: {url} -->\n\n{md}',
        encoding='utf-8',
    )

async def main(urls):
    async with httpx.AsyncClient() as client:
        await tqdm_asyncio.gather(*[convert_one(client, u) for u in urls])

asyncio.run(main(docs_urls))

At concurrency=10 over a typical home connection, ~600 URLs convert in 3-6 minutes. Cost: zero — everything runs locally.
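Real crawls also hit transient failures (timeouts, 429s, flaky CDNs). A small retry wrapper with jittered exponential backoff is cheap insurance; a sketch, where the attempt count and delay schedule are arbitrary defaults and `fetch` stands in for any awaitable fetcher such as a wrapped `client.get`:

```python
import asyncio
import random

def backoff_delay(attempt, base=1.0, cap=30.0):
    """Full-jitter exponential backoff: uniform in [0, min(cap, base * 2**attempt)]."""
    return random.uniform(0, min(cap, base * 2 ** attempt))

async def fetch_with_retries(fetch, url, attempts=4, base=1.0):
    """Await fetch(url), retrying on any exception with backoff between attempts."""
    for attempt in range(attempts):
        try:
            return await fetch(url)
        except Exception:
            if attempt == attempts - 1:
                raise  # out of retries: surface the last error
            await asyncio.sleep(backoff_delay(attempt, base=base))
```

Wrapping the `client.get` call in `convert_one` with this keeps one flaky URL from being silently dropped from the corpus.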

What about JS-rendered docs?

The Python docs are server-rendered, so Trafilatura works great. For client-rendered docs sites (some Stoplight, ReadMe.io, or Mintlify-built docs), Trafilatura gets back an empty shell. The fix is to render the page in a headless browser first (Playwright is the usual OSS choice) and pass the rendered HTML to Trafilatura.
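A sketch of that render-then-extract step, assuming `pip install playwright` plus `playwright install chromium`, and assuming the page's main content lives under a `main` element (adjust the selector per site):

```python
def extract_rendered(url, wait_selector='main'):
    """Render a client-side page in headless Chromium, then extract Markdown."""
    # Lazy imports: these are optional dependencies for the rest of the pipeline
    from playwright.sync_api import sync_playwright
    import trafilatura

    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page()
        page.goto(url, wait_until='networkidle')
        page.wait_for_selector(wait_selector, timeout=15_000)
        html = page.content()  # the post-JS DOM, not the empty shell
        browser.close()
    return trafilatura.extract(
        html, output_format='markdown',
        include_links=True, include_tables=True,
    )
```

Swap this in for the plain `client.get` + `trafilatura.extract` path only for the domains that need it; headless rendering is roughly an order of magnitude slower per page.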

Stage 3: Chunk by H2 headings

This is the step that most tutorials get wrong. The naive approach is to chunk by character count (every 1000 characters) or token count (every 500 tokens). That destroys semantic boundaries — a single explanation gets sliced mid-sentence, and the embedding model loses meaningful context.

The right pattern for Markdown corpora is to chunk by H2 sections. Each H2 represents a logically self-contained subtopic. Chunks are larger (1000-3000 tokens typically) but each one is a coherent unit:

import re
from dataclasses import dataclass

@dataclass
class Chunk:
    source_url: str
    section_title: str
    text: str
    tokens: int  # approximate

H2_RE = re.compile(r'^## (.+)$', re.MULTILINE)
SOURCE_RE = re.compile(r'<!--\s*source:\s*(\S+)\s*-->')

def approx_tokens(text):
    # rough heuristic: 1 token ≈ 4 chars in English
    return max(1, len(text) // 4)

def chunk_markdown_file(path):
    raw = path.read_text(encoding='utf-8')
    src = SOURCE_RE.search(raw)
    source_url = src.group(1) if src else str(path)

    matches = list(H2_RE.finditer(raw))
    if not matches:
        # No H2s: treat entire file as one chunk
        return [Chunk(source_url, path.stem, raw, approx_tokens(raw))]

    chunks = []
    if matches[0].start() > 0:
        intro = raw[:matches[0].start()].strip()
        if intro:
            chunks.append(Chunk(source_url, '_intro', intro, approx_tokens(intro)))

    for i, m in enumerate(matches):
        title = m.group(1).strip()
        start = m.start()
        end = matches[i+1].start() if i+1 < len(matches) else len(raw)
        text = raw[start:end].strip()
        chunks.append(Chunk(source_url, title, text, approx_tokens(text)))

    return chunks

all_chunks = []
for f in OUT.glob('*.md'):
    all_chunks.extend(chunk_markdown_file(f))

print(f'{len(all_chunks)} chunks from {len(list(OUT.glob("*.md")))} files')
print(f'Avg tokens/chunk: {sum(c.tokens for c in all_chunks) / len(all_chunks):.0f}')

Sub-chunking very large sections

If an H2 section is larger than your embedding model's context (rare but possible — some Python library docs have 10,000-token sections), recursively chunk it by H3, then by paragraph if still too large:

MAX_TOKENS = 4000

def sub_chunk(chunk):
    if chunk.tokens <= MAX_TOKENS:
        return [chunk]
    # Try H3 boundaries first; recurse so an oversized H3 part falls
    # through to the paragraph-packing path below
    parts = re.split(r'\n(?=### )', chunk.text)
    if len(parts) > 1:
        out = []
        for p in parts:
            first_line = p.split('\n', 1)[0]
            # Text before the first H3 starts with the H2 line, not '### '
            sub_title = first_line[4:] if first_line.startswith('### ') else '_intro'
            out.extend(sub_chunk(Chunk(
                chunk.source_url,
                f'{chunk.section_title} / {sub_title}',
                p,
                approx_tokens(p),
            )))
        return out
    # No H3s: greedily pack paragraphs up to the token budget
    paras = chunk.text.split('\n\n')
    out, cur, cur_tokens = [], [], 0
    for p in paras:
        pt = approx_tokens(p)
        if cur_tokens + pt > MAX_TOKENS and cur:
            out.append(Chunk(
                chunk.source_url, chunk.section_title,
                '\n\n'.join(cur), cur_tokens,
            ))
            cur, cur_tokens = [], 0
        cur.append(p)
        cur_tokens += pt
    if cur:
        out.append(Chunk(
            chunk.source_url, chunk.section_title,
            '\n\n'.join(cur), cur_tokens,
        ))
    return out

final_chunks = [c for chunk in all_chunks for c in sub_chunk(chunk)]

Stage 4: Embed and store

Pick your embedding model and vector DB. The example below uses OpenAI embeddings + ChromaDB (other combinations work identically — just swap the imports):

from openai import OpenAI
import chromadb

client = OpenAI()  # OPENAI_API_KEY in env
chroma = chromadb.PersistentClient(path='./chroma')
collection = chroma.get_or_create_collection('python-docs')

BATCH = 100
for i in range(0, len(final_chunks), BATCH):
    batch = final_chunks[i:i+BATCH]
    resp = client.embeddings.create(
        model='text-embedding-3-small',
        input=[c.text for c in batch],
    )
    collection.add(
        ids=[f'{c.source_url}#{c.section_title}#{i+j}'
             for j, c in enumerate(batch)],
        embeddings=[d.embedding for d in resp.data],
        documents=[c.text for c in batch],
        metadatas=[{
            'source_url': c.source_url,
            'section': c.section_title,
            'tokens': c.tokens,
        } for c in batch],
    )
    print(f'Embedded {i+len(batch)}/{len(final_chunks)}')

Stage 5: Query

def rag_query(question, top_k=5):
    q_embed = client.embeddings.create(
        model='text-embedding-3-small',
        input=[question],
    ).data[0].embedding

    results = collection.query(
        query_embeddings=[q_embed],
        n_results=top_k,
    )

    context = '\n\n---\n\n'.join([
        f"## {meta['section']}\nSource: {meta['source_url']}\n\n{doc}"
        for doc, meta in zip(results['documents'][0], results['metadatas'][0])
    ])

    answer = client.chat.completions.create(
        model='gpt-4o-mini',
        messages=[
            {'role': 'system', 'content':
             'Answer using only the provided sources. Cite source URLs.'},
            {'role': 'user', 'content': f'{context}\n\nQuestion: {question}'},
        ],
    )
    return answer.choices[0].message.content

print(rag_query('How do I create a list comprehension with conditional filtering?'))

That's the whole pipeline. ~150 lines of Python, end-to-end runnable, uses real public docs as a target, zero proprietary dependencies.

Why Markdown gives better retrieval than HTML

Three concrete reasons:

  1. Boilerplate-free embeddings. HTML chunks include navigation, footer, copyright text. These tokens dilute the embedding signal — the chunk's vector reflects the average meaning of "copyright + navigation + actual content" rather than just the content.
  2. Cleaner chunk boundaries. H2 sections in clean Markdown align with semantic boundaries (subtopics). HTML chunks are usually split by character count, breaking semantic coherence.
  3. Smaller chunks for the same information. Markdown is 60-80% smaller than equivalent HTML in tokens. More information fits per chunk; the LLM gets richer context per retrieval.

Empirically, replacing HTML with Markdown in a RAG pipeline improves retrieval precision by 15-30% on the same queries, with no other changes. The biggest single quality win you can get for the smallest engineering effort.
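As a toy illustration of the size effect, compare a section wrapped in typical page chrome with its bare Markdown, using the same 4-chars-per-token heuristic from Stage 3 (both snippets are fabricated for the demo):

```python
def approx_tokens(text):
    # rough heuristic: 1 token ≈ 4 chars in English
    return max(1, len(text) // 4)

# The actual content, as clean Markdown
content_md = '## Installing\n\nRun `pip install example` and import it.'

# The same content as it arrives in raw HTML: nav, footer, markup overhead
page_html = (
    '<nav class="site-nav"><ul><li><a href="/">Home</a></li>'
    '<li><a href="/docs">Docs</a></li><li><a href="/pricing">Pricing</a></li></ul></nav>'
    '<main><h2>Installing</h2><p>Run <code>pip install example</code> '
    'and import it.</p></main>'
    '<footer>&copy; 2024 Example Corp. All rights reserved. '
    '<a href="/terms">Terms</a> <a href="/privacy">Privacy</a></footer>'
)

saving = 1 - approx_tokens(content_md) / approx_tokens(page_html)
print(f'HTML: {approx_tokens(page_html)} tokens, '
      f'Markdown: {approx_tokens(content_md)} tokens, '
      f'saving: {saving:.0%}')
```

The exact ratio depends entirely on how chrome-heavy the site is; the point is that every one of those nav and footer tokens is averaged into the chunk's embedding vector.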

Working with PDFs in your RAG corpus?

If your knowledge source is PDFs rather than (or in addition to) URLs, the same pattern applies — just swap the conversion step. See PDF to Markdown for RAG and the comprehensive PDF RAG pipeline guide. Many production RAG systems mix sources: web docs via this tutorial, PDF policy docs via the PDF guide, all chunked and embedded into the same vector DB.

Adding metadata filters at retrieval time

The chunk metadata (source_url, section title) becomes powerful when you let users filter at query time. Examples:

def rag_query_filtered(question, source_filter=None, top_k=5):
    q_embed = client.embeddings.create(
        model='text-embedding-3-small',
        input=[question],
    ).data[0].embedding

    # Chroma's metadata `where` supports exact-match operators ($eq, $in, ...),
    # not substring matching, so over-fetch and filter on the URL client-side
    n = top_k * 4 if source_filter else top_k
    results = collection.query(
        query_embeddings=[q_embed],
        n_results=n,
    )
    pairs = list(zip(results['documents'][0], results['metadatas'][0]))
    if source_filter:
        pairs = [(d, m) for d, m in pairs if source_filter in m['source_url']]
    return pairs[:top_k]

# Only retrieve from the tutorial section
rag_query_filtered('list comprehensions', source_filter='/tutorial/')
# Only retrieve from a specific module
rag_query_filtered('json parsing', source_filter='/library/json')

For a multi-tenant knowledge base (each customer's docs in the same DB), source_filter becomes a tenant scope filter. Same vector DB, isolated retrieval per tenant.
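A sketch of that tenant scoping, assuming a `tenant_id` field was added to each chunk's metadata at indexing time (Stage 4 above does not set one); unlike substring matching, exact-match operators such as `$eq` are supported by Chroma's metadata `where`:

```python
def tenant_where(tenant_id):
    """Exact-match metadata filter; assumes tenant_id was stored at indexing time."""
    return {'tenant_id': {'$eq': tenant_id}}

def tenant_query(collection, q_embed, tenant_id, top_k=5):
    # Retrieval is scoped to one tenant even though all tenants share the DB
    return collection.query(
        query_embeddings=[q_embed],
        n_results=top_k,
        where=tenant_where(tenant_id),
    )
```

Because the filter is applied inside the vector DB, one tenant's query can never surface another tenant's chunks, with no separate collections to manage.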

Hybrid search: dense + keyword

Embedding-only retrieval misses queries with rare terms (proper nouns, error codes, version numbers). Hybrid search combines dense (embedding) with sparse (BM25 keyword) and merges results:

from rank_bm25 import BM25Okapi
import numpy as np

docs = [c.text for c in final_chunks]
tokenized = [d.lower().split() for d in docs]
bm25 = BM25Okapi(tokenized)

def hybrid_query(question, top_k=5, alpha=0.7):
    q_embed = client.embeddings.create(
        model='text-embedding-3-small', input=[question]
    ).data[0].embedding
    dense = collection.query(query_embeddings=[q_embed], n_results=top_k*2)

    sparse_scores = bm25.get_scores(question.lower().split())
    sparse_top = np.argsort(sparse_scores)[::-1][:top_k*2]

    scores = {}
    for i, doc in enumerate(dense['documents'][0]):
        scores[doc] = scores.get(doc, 0) + alpha / (i + 1)
    for rank, idx in enumerate(sparse_top):
        doc = docs[idx]
        scores[doc] = scores.get(doc, 0) + (1 - alpha) / (rank + 1)

    return sorted(scores.items(), key=lambda x: -x[1])[:top_k]

Hybrid typically improves recall by 5-15% over dense-only with no significant cost increase. Worth the extra 30 lines on any production RAG.

Re-ranking the top-K

For higher precision, fetch top-20 from the vector DB and re-rank with a cross-encoder model (Cohere Rerank, BGE Reranker) to top-5 before sending to the LLM:

import cohere, os
co = cohere.Client(os.environ['COHERE_API_KEY'])

def rerank(question, candidates, top_k=5):
    r = co.rerank(
        query=question,
        documents=[c['text'] for c in candidates],
        top_n=top_k,
        model='rerank-english-v3.0',
    )
    return [candidates[hit.index] for hit in r.results]

Re-ranking is the single biggest precision win you can add after good chunking. Empirically improves answer quality 10-25% on hard queries.

Evaluation: don't fly blind

Build a golden set of 50-100 questions with known-good source URLs. After every pipeline change (new chunking, new extractor settings, new embedding model, new retrieval logic), re-run the golden set and measure retrieval hit rate: for each question, is one of its known-good URLs among the top-k retrieved chunks?

Without an eval set, every change is a vibe-based decision. Spend two hours building the golden set; it pays for itself the first time you avoid a regression.
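A minimal harness for that loop, measuring hit@k over the golden set, could look like this (the two golden entries and the `retrieve` signature are placeholders for your own):

```python
GOLDEN = [
    # (question, substring that must appear in a retrieved chunk's source_url)
    ('How do list comprehensions filter items?', '/tutorial/datastructures'),
    ('How do I parse JSON from a string?', '/library/json'),
]

def hit_at_k(retrieve, golden, k=5):
    """Fraction of golden questions whose expected URL shows up in the top-k.

    `retrieve(question, k)` must return a list of source URLs, best first.
    """
    hits = sum(
        any(expected in url for url in retrieve(question, k))
        for question, expected in golden
    )
    return hits / len(golden)
```

Run it before and after each change; a drop in hit@k is a regression regardless of how good the answers "feel".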

Refreshing the corpus over time

Re-fetch the sitemap and check each URL's <lastmod> date against your last-converted timestamp. Re-extract only URLs newer than your last sync. Re-chunk and re-embed those updated URLs (delete old chunks first by source_url metadata filter). Schedule the job nightly via cron or GitHub Actions. The whole pipeline is your own Python — no vendor lock-in to worry about.
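The `<lastmod>` comparison can be sketched as a pure parsing step (wiring it to the fetch stage and persisting the last-sync timestamp are left to your scheduler):

```python
import xml.etree.ElementTree as ET
from datetime import datetime, timezone

NS = {'sm': 'http://www.sitemaps.org/schemas/sitemap/0.9'}

def urls_modified_since(sitemap_xml, last_sync):
    """Yield URLs whose <lastmod> is after last_sync, or that have no <lastmod>."""
    root = ET.fromstring(sitemap_xml)
    for url_el in root.findall('sm:url', NS):
        loc = url_el.findtext('sm:loc', namespaces=NS)
        lastmod = url_el.findtext('sm:lastmod', namespaces=NS)
        if lastmod is None:
            yield loc  # no timestamp: re-fetch to be safe
            continue
        when = datetime.fromisoformat(lastmod.replace('Z', '+00:00'))
        if when.tzinfo is None:
            when = when.replace(tzinfo=timezone.utc)  # date-only lastmod: assume UTC
        if when > last_sync:
            yield loc
```

Only the yielded URLs go back through Stages 2-4; everything else in the corpus stays untouched.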

Recommendation

This pipeline is production-ready as-written for corpora up to ~50K chunks. For larger scale: use a managed vector DB (Pinecone, Qdrant Cloud, Weaviate Cloud) instead of local Chroma, switch to streaming embedding (yield-as-you-go rather than batch), and add observability for retrieval quality regressions. The extraction step (Trafilatura) scales linearly — running 100K URLs through it is the same code, just longer wall-clock time. See also batch convert 100+ URLs for the scaling patterns at the conversion step, and handling JavaScript-rendered pages for the SPA-extraction recipe.

Frequently asked questions

Why chunk by H2 instead of fixed token windows?
Fixed-window chunking (e.g., 'every 500 tokens') breaks across semantic boundaries — a single explanation gets cut mid-sentence, and the embedding model loses coherence. H2 boundaries align with logical subtopics, so each chunk is a self-contained unit. Empirically this improves retrieval precision by 15-30% over fixed-window chunking on the same corpus.
How do I keep the corpus fresh as the source site updates?
Re-fetch the sitemap and check each URL's <lastmod> date against your last-converted timestamp. Re-convert only URLs newer than your last sync. Re-chunk and re-embed those updated URLs (delete old chunks first by source_url metadata filter). Schedule the job nightly via cron, GitHub Actions, or any scheduler.
Which embedding model should I use?
For most cases, OpenAI text-embedding-3-small ($0.02/M tokens, fast, strong baseline). For higher quality at higher cost, text-embedding-3-large or Voyage's voyage-3. For self-hosted: BGE or E5-large. The chunking strategy matters more than the model — clean H2-bounded Markdown chunks with text-embedding-3-small often outperform messy chunks with the priciest model.
Does MDisBetter offer a programmatic URL-to-Markdown API for this kind of pipeline?
Not today. The web tool at /convert/url-to-markdown is the supported surface for ad-hoc one-off conversions. For the at-scale RAG case in this tutorial, Trafilatura (plus Playwright for JS-heavy sites) is the right OSS path — mature, free, and gives you full programmatic control over rate limits, output structure, and authentication for private pages.