Scrape a Website to Markdown for RAG (Python Tutorial)
Building a RAG pipeline on top of a website's content is a four-step problem: discover URLs, fetch them, convert them to a clean format, and chunk for embedding. The conversion step is where most pipelines silently degrade: raw HTML chunks pollute embeddings with menu text, footer noise, and ad markup, dragging retrieval quality down by 30-50% before you even reach the embedding model. Markdown solves this cleanly.
This tutorial uses open-source tooling wherever possible (Trafilatura for extraction, ChromaDB for the vector store) plus OpenAI's API for embeddings, which you can swap for a local model, so the whole pipeline is yours to own. MDisBetter ships a one-URL-at-a-time web tool for ad-hoc conversions; for the at-scale RAG case described here you want full programmatic control, which the open-source path gives you.
The pipeline at a glance
sitemap.xml
↓
list of URLs
↓
[fetch + extract Markdown via Trafilatura] (concurrent)
↓
folder of .md files
↓
[chunk by H2 headings]
↓
list of (chunk_text, source_url, section_title)
↓
[embed → vector DB]
↓
RAG-ready

We'll build each stage with runnable code. Target site: https://docs.python.org (the Python 3 documentation: public, well-structured, has a sitemap, ~600 pages). Install once: pip install requests trafilatura httpx tqdm openai chromadb.
Stage 1: Discover URLs from the sitemap
import requests
import xml.etree.ElementTree as ET

NS = {'sm': 'http://www.sitemaps.org/schemas/sitemap/0.9'}

def fetch_sitemap(url):
    """Recursively expand sitemap indexes into a flat URL list."""
    xml = requests.get(url, timeout=30).text
    root = ET.fromstring(xml)
    urls = []
    for loc in root.findall('.//sm:loc', NS):
        if loc.text.endswith('.xml'):
            urls.extend(fetch_sitemap(loc.text))
        else:
            urls.append(loc.text)
    return urls

all_urls = fetch_sitemap('https://docs.python.org/3/sitemap.xml')

# Filter to tutorial + library + reference sections
docs_urls = [u for u in all_urls if any(
    x in u for x in ('/tutorial/', '/library/', '/reference/', '/howto/')
)]
print(f'{len(docs_urls)} doc URLs')
Stage 2: Fetch + extract Markdown with Trafilatura
Trafilatura is the OSS gold standard for readability extraction — better boilerplate stripping than html2text, native Markdown output, handles most static sites cleanly. Combine it with httpx for concurrent fetching:
import asyncio
import hashlib
from pathlib import Path

import httpx
import trafilatura
from tqdm.asyncio import tqdm_asyncio

OUT = Path('./corpus')
OUT.mkdir(exist_ok=True)
SEM = asyncio.Semaphore(10)  # 10 concurrent requests

def url_to_filename(url):
    h = hashlib.sha1(url.encode()).hexdigest()[:10]
    safe = url.replace('https://', '').replace('/', '_')[:100]
    return OUT / f'{safe}_{h}.md'

async def convert_one(client, url):
    out = url_to_filename(url)
    if out.exists():
        return  # idempotent
    async with SEM:
        try:
            r = await client.get(
                url, timeout=60, follow_redirects=True,
                headers={'User-Agent': 'Mozilla/5.0 (compatible; RAG-builder/1.0)'},
            )
        except httpx.RequestError as e:
            print(f'NET FAIL {url}: {e}')
            return
    if r.status_code != 200:
        print(f'HTTP FAIL {url}: {r.status_code}')
        return
    md = trafilatura.extract(
        r.text,
        output_format='markdown',
        include_links=True,
        include_tables=True,
        # Help Trafilatura find the main content on Sphinx-themed sites
        favor_precision=True,
    )
    if not md:
        print(f'EXTRACT FAIL {url}')
        return
    # Prepend the source URL as an HTML comment so Stage 3 can recover it
    out.write_text(
        f'<!-- source: {url} -->\n\n{md}',
        encoding='utf-8',
    )
async def main(urls):
    async with httpx.AsyncClient() as client:
        await tqdm_asyncio.gather(*[convert_one(client, u) for u in urls])

asyncio.run(main(docs_urls))
At concurrency=10 over a typical home connection, ~600 URLs convert in 3-6 minutes. Cost: zero — everything runs locally.
What about JS-rendered docs?
The Python docs are server-rendered, so Trafilatura works great. For client-rendered docs sites (some Stoplight, ReadMe.io, or Mintlify-built docs), Trafilatura only sees the empty JavaScript shell. Two options:
- Playwright fallback in the same script: render with a headless browser, then run trafilatura.extract on the post-JS HTML (see the sketch below). Adds ~3-5 seconds per URL but works on every site.
- MDisBetter web tool for the awkward URLs: paste them into /convert/url-to-markdown and save the .md files into the same ./corpus folder. Fine for small numbers; not a programmatic API.
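A minimal sketch of that fallback, assuming Playwright is installed (pip install playwright, then playwright install chromium); render_and_extract is a hypothetical helper, not part of the pipeline above:
from playwright.async_api import async_playwright
import trafilatura

async def render_and_extract(url):
    # Render with headless Chromium, then hand the post-JS HTML to Trafilatura
    async with async_playwright() as pw:
        browser = await pw.chromium.launch()
        page = await browser.new_page()
        await page.goto(url, wait_until='networkidle', timeout=60_000)
        html = await page.content()
        await browser.close()
    return trafilatura.extract(
        html,
        output_format='markdown',
        include_links=True,
        include_tables=True,
        favor_precision=True,
    )
You could call it from convert_one whenever the plain-HTML extract comes back empty, so only the JS-heavy pages pay the browser overhead.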
Stage 3: Chunk by H2 headings
This is the step that most tutorials get wrong. The naive approach is to chunk by character count (every 1000 characters) or token count (every 500 tokens). That destroys semantic boundaries — a single explanation gets sliced mid-sentence, and the embedding model loses meaningful context.
The right pattern for Markdown corpora is to chunk by H2 sections. Each H2 represents a logically self-contained subtopic. Chunks are larger (1000-3000 tokens typically) but each one is a coherent unit:
import re
from dataclasses import dataclass

@dataclass
class Chunk:
    source_url: str
    section_title: str
    text: str
    tokens: int  # approximate

H2_RE = re.compile(r'^## (.+)$', re.MULTILINE)
SOURCE_RE = re.compile(r'<!-- source: (.+?) -->')
def approx_tokens(text):
    # rough heuristic: 1 token ≈ 4 chars in English
    return max(1, len(text) // 4)

def chunk_markdown_file(path):
    raw = path.read_text(encoding='utf-8')
    src = SOURCE_RE.search(raw)
    source_url = src.group(1) if src else str(path)
    matches = list(H2_RE.finditer(raw))
    if not matches:
        # No H2s: treat entire file as one chunk
        return [Chunk(source_url, path.stem, raw, approx_tokens(raw))]
    chunks = []
    if matches[0].start() > 0:
        intro = raw[:matches[0].start()].strip()
        if intro:
            chunks.append(Chunk(source_url, '_intro', intro, approx_tokens(intro)))
    for i, m in enumerate(matches):
        title = m.group(1).strip()
        start = m.start()
        end = matches[i+1].start() if i+1 < len(matches) else len(raw)
        text = raw[start:end].strip()
        chunks.append(Chunk(source_url, title, text, approx_tokens(text)))
    return chunks

all_chunks = []
for f in OUT.glob('*.md'):
    all_chunks.extend(chunk_markdown_file(f))

print(f'{len(all_chunks)} chunks from {len(list(OUT.glob("*.md")))} files')
print(f'Avg tokens/chunk: {sum(c.tokens for c in all_chunks) / len(all_chunks):.0f}')
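The 4-characters-per-token heuristic is close enough for H2-sized chunks, but if you want exact counts, tiktoken (not in the install list above; pip install tiktoken) counts against the same tokenizer the OpenAI embedding models use. A drop-in sketch:
import tiktoken

ENC = tiktoken.get_encoding('cl100k_base')  # encoding used by the text-embedding-3-* models

def exact_tokens(text):
    return len(ENC.encode(text))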
Sub-chunking very large sections
If an H2 section is larger than your embedding model's context (rare but possible — some Python library docs have 10,000-token sections), recursively chunk it by H3, then by paragraph if still too large:
MAX_TOKENS = 4000

def sub_chunk(chunk):
    if chunk.tokens <= MAX_TOKENS:
        return [chunk]
    # First split on H3 headings, recursing in case a part is still too large
    parts = re.split(r'\n(?=### )', chunk.text)
    if len(parts) > 1:
        out = []
        for p in parts:
            first_line = p.split('\n')[0]
            # Text before the first H3 keeps an _intro title, like chunk_markdown_file
            sub_title = first_line[4:] if first_line.startswith('### ') else '_intro'
            out.extend(sub_chunk(Chunk(
                chunk.source_url,
                f'{chunk.section_title} / {sub_title}',
                p,
                approx_tokens(p),
            )))
        return out
    # No H3s: pack paragraphs into chunks of at most MAX_TOKENS
    paras = chunk.text.split('\n\n')
    out, cur, cur_tokens = [], [], 0
    for p in paras:
        pt = approx_tokens(p)
        if cur_tokens + pt > MAX_TOKENS and cur:
            out.append(Chunk(
                chunk.source_url, chunk.section_title,
                '\n\n'.join(cur), cur_tokens,
            ))
            cur, cur_tokens = [], 0
        cur.append(p)
        cur_tokens += pt
    if cur:
        out.append(Chunk(
            chunk.source_url, chunk.section_title,
            '\n\n'.join(cur), cur_tokens,
        ))
    return out

final_chunks = [c for chunk in all_chunks for c in sub_chunk(chunk)]
Stage 4: Embed and store
Pick your embedding model and vector DB. The example below uses OpenAI embeddings + ChromaDB (other combinations work identically — just swap the imports):
from openai import OpenAI
import chromadb

client = OpenAI()  # OPENAI_API_KEY in env
chroma = chromadb.PersistentClient(path='./chroma')
collection = chroma.get_or_create_collection('python-docs')

BATCH = 100
for i in range(0, len(final_chunks), BATCH):
    batch = final_chunks[i:i+BATCH]
    resp = client.embeddings.create(
        model='text-embedding-3-small',
        input=[c.text for c in batch],
    )
    collection.add(
        ids=[f'{c.source_url}#{c.section_title}#{i+j}'
             for j, c in enumerate(batch)],
        embeddings=[d.embedding for d in resp.data],
        documents=[c.text for c in batch],
        metadatas=[{
            'source_url': c.source_url,
            'section': c.section_title,
            'tokens': c.tokens,
        } for c in batch],
    )
    print(f'Embedded {i+len(batch)}/{len(final_chunks)}')
Stage 5: Query
def rag_query(question, top_k=5):
    q_embed = client.embeddings.create(
        model='text-embedding-3-small',
        input=[question],
    ).data[0].embedding
    results = collection.query(
        query_embeddings=[q_embed],
        n_results=top_k,
    )
    context = '\n\n---\n\n'.join([
        f"## {meta['section']}\nSource: {meta['source_url']}\n\n{doc}"
        for doc, meta in zip(results['documents'][0], results['metadatas'][0])
    ])
    answer = client.chat.completions.create(
        model='gpt-4o-mini',
        messages=[
            {'role': 'system', 'content':
                'Answer using only the provided sources. Cite source URLs.'},
            {'role': 'user', 'content': f'{context}\n\nQuestion: {question}'},
        ],
    )
    return answer.choices[0].message.content

print(rag_query('How do I create a list comprehension with conditional filtering?'))
That's the whole pipeline: roughly 150 lines of Python, end-to-end runnable against real public docs, with no proprietary dependencies beyond the embedding and chat API.
Why Markdown gives better retrieval than HTML
Three concrete reasons:
- Boilerplate-free embeddings. HTML chunks include navigation, footer, copyright text. These tokens dilute the embedding signal — the chunk's vector reflects the average meaning of "copyright + navigation + actual content" rather than just the content.
- Cleaner chunk boundaries. H2 sections in clean Markdown align with semantic boundaries (subtopics). HTML chunks are usually split by character count, breaking semantic coherence.
- Smaller chunks for the same information. Markdown is 60-80% smaller than equivalent HTML in tokens. More information fits per chunk; the LLM gets richer context per retrieval.
Empirically, replacing HTML with Markdown in a RAG pipeline improves retrieval precision by 15-30% on the same queries, with no other changes. It's the biggest single quality win you can get for the smallest engineering effort.
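A quick way to check the size claim on your own corpus: pull one page and compare approximate token counts for the raw HTML against the Trafilatura Markdown (a rough sketch reusing approx_tokens from Stage 3; the URL is just an example):
import requests
import trafilatura

url = 'https://docs.python.org/3/tutorial/datastructures.html'
html = requests.get(url, timeout=30).text
md = trafilatura.extract(html, output_format='markdown', favor_precision=True)

print('HTML tokens:    ', approx_tokens(html))   # full page, boilerplate included
print('Markdown tokens:', approx_tokens(md))     # main content only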
Working with PDFs in your RAG corpus?
If your knowledge source is PDFs rather than (or in addition to) URLs, the same pattern applies — just swap the conversion step. See PDF to Markdown for RAG and the comprehensive PDF RAG pipeline guide. Many production RAG systems mix sources: web docs via this tutorial, PDF policy docs via the PDF guide, all chunked and embedded into the same vector DB.
Adding metadata filters at retrieval time
The chunk metadata (source_url, section title) becomes powerful when you let users filter at query time. Examples:
def rag_query_filtered(question, source_filter=None, top_k=5):
    q_embed = client.embeddings.create(
        model='text-embedding-3-small',
        input=[question],
    ).data[0].embedding
    # Chroma's metadata `where` filters are exact-match operators ($eq, $in, ...),
    # not substring matches, so over-fetch and post-filter on the URL in Python
    results = collection.query(
        query_embeddings=[q_embed],
        n_results=top_k * 4 if source_filter else top_k,
    )
    hits = list(zip(results['documents'][0], results['metadatas'][0]))
    if source_filter:
        hits = [(doc, meta) for doc, meta in hits if source_filter in meta['source_url']]
    return hits[:top_k]

# Only retrieve from the tutorial section
rag_query_filtered('list comprehensions', source_filter='/tutorial/')

# Only retrieve from a specific module
rag_query_filtered('json parsing', source_filter='/library/json')
For a multi-tenant knowledge base (each customer's docs in the same DB), source_filter becomes a tenant scope filter. Same vector DB, isolated retrieval per tenant.
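For the tenant case specifically, an explicit tenant_id field in the chunk metadata, filtered with an exact-match operator at query time, is cleaner than URL substring matching; a sketch (the tenant IDs are hypothetical):
# At indexing time, extend each chunk's metadata, e.g.
#   metadatas=[{..., 'tenant_id': 'acme'} for c in batch]

def rag_query_tenant(question, tenant_id, top_k=5):
    q_embed = client.embeddings.create(
        model='text-embedding-3-small',
        input=[question],
    ).data[0].embedding
    return collection.query(
        query_embeddings=[q_embed],
        n_results=top_k,
        where={'tenant_id': {'$eq': tenant_id}},  # exact match keeps tenants isolated
    )

rag_query_tenant('list comprehensions', tenant_id='acme')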
Hybrid search: dense + keyword
Embedding-only retrieval misses queries with rare terms (proper nouns, error codes, version numbers). Hybrid search combines dense (embedding) with sparse (BM25 keyword) and merges results:
from rank_bm25 import BM25Okapi
import numpy as np

docs = [c.text for c in final_chunks]
tokenized = [d.lower().split() for d in docs]
bm25 = BM25Okapi(tokenized)

def hybrid_query(question, top_k=5, alpha=0.7):
    q_embed = client.embeddings.create(
        model='text-embedding-3-small', input=[question]
    ).data[0].embedding
    dense = collection.query(query_embeddings=[q_embed], n_results=top_k*2)
    sparse_scores = bm25.get_scores(question.lower().split())
    sparse_top = np.argsort(sparse_scores)[::-1][:top_k*2]
    # Rank-based fusion: dense ranks weighted by alpha, sparse ranks by (1 - alpha)
    scores = {}
    for i, doc in enumerate(dense['documents'][0]):
        scores[doc] = scores.get(doc, 0) + alpha / (i + 1)
    for rank, idx in enumerate(sparse_top):
        doc = docs[idx]
        scores[doc] = scores.get(doc, 0) + (1 - alpha) / (rank + 1)
    return sorted(scores.items(), key=lambda x: -x[1])[:top_k]
Hybrid typically improves recall by 5-15% over dense-only with no significant cost increase. Worth the extra 30 lines on any production RAG.
Re-ranking the top-K
For higher precision, fetch top-20 from the vector DB and re-rank with a cross-encoder model (Cohere Rerank, BGE Reranker) to top-5 before sending to the LLM:
import cohere, os

co = cohere.Client(os.environ['COHERE_API_KEY'])

def rerank(question, candidates, top_k=5):
    r = co.rerank(
        query=question,
        documents=[c['text'] for c in candidates],
        top_n=top_k,
        model='rerank-english-v3.0',
    )
    return [candidates[hit.index] for hit in r.results]
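One way to wire the re-ranker into the pipeline, assuming the client and collection from Stage 4 (the candidate dict shape here is just illustrative):
def rag_query_reranked(question, top_k=5):
    q_embed = client.embeddings.create(
        model='text-embedding-3-small',
        input=[question],
    ).data[0].embedding
    # Over-fetch from the vector DB, then let the cross-encoder pick the best few
    results = collection.query(query_embeddings=[q_embed], n_results=20)
    candidates = [
        {'text': doc, 'meta': meta}
        for doc, meta in zip(results['documents'][0], results['metadatas'][0])
    ]
    return rerank(question, candidates, top_k=top_k)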
Re-ranking is the single biggest precision win you can add after good chunking. Empirically improves answer quality 10-25% on hard queries.
Evaluation: don't fly blind
Build a golden-set of 50-100 questions with known good source URLs. After every pipeline change (new chunking, new extractor settings, new embedding model, new retrieval logic), re-run the golden set and measure:
- Top-K retrieval recall (does the right source appear in top-K?)
- Answer correctness (graded by an LLM judge or by hand)
- Latency per query
- Cost per query
Without an eval set, every change is a vibe-based decision. Spend two hours building the golden set; it pays for itself the first time you avoid a regression.
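A minimal top-K recall check, assuming a hand-written golden_set and the client and collection objects from earlier stages (the two entries below are illustrative):
golden_set = [
    # (question, URL of the page that should be retrieved)
    ('How do I create a list comprehension with conditional filtering?',
     'https://docs.python.org/3/tutorial/datastructures.html'),
    ('How do I parse JSON from a string?',
     'https://docs.python.org/3/library/json.html'),
    # ... 50-100 of these
]

def topk_recall(golden, top_k=5):
    hits = 0
    for question, expected_url in golden:
        q_embed = client.embeddings.create(
            model='text-embedding-3-small',
            input=[question],
        ).data[0].embedding
        results = collection.query(query_embeddings=[q_embed], n_results=top_k)
        retrieved = [m['source_url'] for m in results['metadatas'][0]]
        hits += any(expected_url in u for u in retrieved)
    return hits / len(golden)

print(f'Top-5 recall: {topk_recall(golden_set):.0%}')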
Refreshing the corpus over time
Re-fetch the sitemap and check each URL's <lastmod> date against your last-converted timestamp. Re-extract only URLs newer than your last sync. Re-chunk and re-embed those updated URLs (delete old chunks first by source_url metadata filter). Schedule the job nightly via cron or GitHub Actions. The whole pipeline is your own Python — no vendor lock-in to worry about.
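A sketch of that refresh check, reusing NS, url_to_filename, and collection from earlier stages; it assumes a flat sitemap with <lastmod> entries (for a sitemap index, walk it the way fetch_sitemap does):
from datetime import datetime, timezone

def stale_urls(sitemap_url):
    """Return URLs whose sitemap <lastmod> is newer than the local .md file."""
    xml = requests.get(sitemap_url, timeout=30).text
    root = ET.fromstring(xml)
    stale = []
    for entry in root.findall('.//sm:url', NS):
        loc = entry.find('sm:loc', NS).text
        out = url_to_filename(loc)
        if not out.exists():
            stale.append(loc)          # never converted: fetch it
            continue
        lastmod = entry.find('sm:lastmod', NS)
        if lastmod is None:
            continue                   # no lastmod: assume unchanged
        modified = datetime.fromisoformat(lastmod.text.replace('Z', '+00:00'))
        if modified.tzinfo is None:
            modified = modified.replace(tzinfo=timezone.utc)
        converted = datetime.fromtimestamp(out.stat().st_mtime, tz=timezone.utc)
        if modified > converted:
            stale.append(loc)
    return stale

for url in stale_urls('https://docs.python.org/3/sitemap.xml'):
    url_to_filename(url).unlink(missing_ok=True)            # force re-fetch in Stage 2
    collection.delete(where={'source_url': {'$eq': url}})   # drop its old chunks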
Recommendation
This pipeline is production-ready as-written for corpora up to ~50K chunks. For larger scale: use a managed vector DB (Pinecone, Qdrant Cloud, Weaviate Cloud) instead of local Chroma, switch to streaming embedding (yield-as-you-go rather than batch), and add observability for retrieval quality regressions. The extraction step (Trafilatura) scales linearly — running 100K URLs through it is the same code, just longer wall-clock time. See also batch convert 100+ URLs for the scaling patterns at the conversion step, and handling JavaScript-rendered pages for the SPA-extraction recipe.