Build an AI Knowledge Base from Web Sources (Markdown Method)
Building an AI-queryable knowledge base from web content is a five-step problem: pick the sources, fetch them, convert to a uniform format, organize by topic, embed for retrieval. Most tutorials skip steps 3 and 4 — and that's why their RAG pipelines feel mediocre. This guide walks the entire pipeline using the MDisBetter web tool for the conversion step (one URL at a time) and free open-source tools (sentence-transformers, ChromaDB) for the embedding and retrieval steps. Everything runs locally; nothing leaves your machine after step 2.
What we're building
A local knowledge base where you can paste a question and get an answer cited back to specific URLs. The pipeline:
[picked URLs]
↓
[paste each into mdisbetter.com → download .md]
↓
[~/kb/<topic>/*.md folder structure]
↓
[chunk by H2 / H3 headings]
↓
[embed locally via sentence-transformers]
↓
[ChromaDB persistent local store]
↓
[query: question → top-K chunks + citations]

Total cost: $0 if you stay local. Total time for a 30-source knowledge base: about 90 minutes the first time, 15 minutes for incremental updates.
Step 1: Pick your sources
The single biggest determinant of knowledge base quality is source selection. Be ruthless. For a knowledge base on, say, "FastAPI in production," the right sources are:
- The official FastAPI docs (fastapi.tiangolo.com)
- Two or three highly-cited blog posts from the FastAPI maintainer
- A handful of production case studies from companies that have written publicly about scaling FastAPI
- The relevant Pydantic and Starlette docs (FastAPI's foundations)
What does not belong: random Medium posts, low-quality YouTube transcripts, Stack Overflow answers older than two years, anything you can't verify the publication date of. Garbage in, garbage out.
Aim for 20-50 sources for a focused topic. Beyond that, retrieval precision drops because too many chunks compete for top-K slots.
Step 2: Convert each source to Markdown
This is the conversion step. For each URL in your list:
- Open mdisbetter.com/convert/url-to-markdown
- Paste the URL
- Click Convert
- Click Download to save as a .md file
- Move the file into the appropriate topic folder (we'll organize next)
For 30 URLs this takes about 15 minutes of clicking. Worth it for the quality — the web tool's output is the cleanest you'll get without writing code.
Want to automate this for hundreds of URLs?
The MDisBetter web tool is intentionally one-URL-at-a-time and does not expose a programmatic API. For 100+ URLs, use the OSS path: Trafilatura for the extraction, requests/httpx for the fetching, the same target file structure. The runnable recipe is in scrape a website to Markdown for RAG. The end result — a folder of clean .md files — is identical to running the URLs through the web tool one by one.
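If you go the OSS route, the per-URL logic is only a few lines. A minimal sketch, assuming a recent Trafilatura version (one that supports output_format='markdown') and a hypothetical urls.txt listing one URL per line; sorting into topic folders is left as a manual pass afterward:
from datetime import date
from pathlib import Path
import trafilatura

OUT_DIR = Path.home() / 'kb' / 'unsorted'  # sort into topic folders afterward
OUT_DIR.mkdir(parents=True, exist_ok=True)

def url_to_md(url, dest):
    # Fetch the page and extract the main content as Markdown.
    html = trafilatura.fetch_url(url)
    md = trafilatura.extract(html, output_format='markdown', include_links=True) if html else None
    if not md:
        return False
    # Prepend the same provenance comment the web-tool path uses (see step 3).
    header = f'<!-- source: {url} -->\n<!-- converted: {date.today()} via trafilatura -->\n\n'
    dest.write_text(header + md, encoding='utf-8')
    return True

for url in Path('urls.txt').read_text().split():
    name = url.rstrip('/').rsplit('/', 1)[-1] or 'index'
    url_to_md(url, OUT_DIR / f'{name}.md')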
Step 3: Organize by topic
The folder structure becomes your retrieval namespace later. Aim for one folder per logical topic:
~/kb/
fastapi-fundamentals/
routing.md
dependency-injection.md
pydantic-models.md
fastapi-deployment/
uvicorn-vs-gunicorn.md
docker-image-sizing.md
aws-fargate-setup.md
fastapi-performance/
async-vs-sync-endpoints.md
background-tasks.md
caching-with-redis.md

Each .md file gets a comment at the top with provenance:
<!-- source: https://fastapi.tiangolo.com/tutorial/dependencies/ -->
<!-- converted: 2026-05-10 via mdisbetter.com -->
# Dependencies
...

The provenance lets you re-fetch later when content updates, and lets retrieved chunks cite their original URL.
Step 4: Chunk by H2 (and H3 when needed)
Chunking strategy is the second-biggest quality lever after source selection. The pattern: split by H2 headings (and recursively by H3 if a section is too large). Each chunk becomes a self-contained semantic unit.
import re
from dataclasses import dataclass
from pathlib import Path
@dataclass
class Chunk:
source_url: str
topic: str # folder name
section: str
text: str
H2_RE = re.compile(r'^## (.+)$', re.MULTILINE)
H3_RE = re.compile(r'^### (.+)$', re.MULTILINE)
SOURCE_RE = re.compile(r'<!-- source: (\S+) -->')
MAX_CHARS = 4000 # roughly 1000 tokens
# Split text at each heading match; content before the first heading
# becomes an '_intro' chunk so nothing is lost.
def chunk_by_heading(text, regex, source_url, topic, parent_section=''):
matches = list(regex.finditer(text))
if not matches:
return [Chunk(source_url, topic, parent_section or 'main', text)]
chunks = []
if matches[0].start() > 0:
intro = text[:matches[0].start()].strip()
if intro:
chunks.append(Chunk(
source_url, topic,
parent_section + '/_intro' if parent_section else '_intro',
intro,
))
for i, m in enumerate(matches):
title = m.group(1).strip()
start = m.start()
end = matches[i+1].start() if i+1 < len(matches) else len(text)
section_text = text[start:end].strip()
full_section = f'{parent_section} / {title}' if parent_section else title
chunks.append(Chunk(source_url, topic, full_section, section_text))
return chunks
# Chunk a file by H2, then subdivide any oversized section by H3.
def chunk_file(path, topic):
raw = path.read_text(encoding='utf-8')
src_match = SOURCE_RE.search(raw)
source_url = src_match.group(1) if src_match else str(path)
h2_chunks = chunk_by_heading(raw, H2_RE, source_url, topic)
final = []
for c in h2_chunks:
if len(c.text) <= MAX_CHARS:
final.append(c)
else:
sub = chunk_by_heading(c.text, H3_RE, source_url, topic, c.section)
final.extend(sub)
return final
KB_ROOT = Path.home() / 'kb'
all_chunks = []
for topic_dir in KB_ROOT.iterdir():
if not topic_dir.is_dir():
continue
topic = topic_dir.name
for f in topic_dir.glob('*.md'):
all_chunks.extend(chunk_file(f, topic))
print(f'{len(all_chunks)} chunks from {len(list(KB_ROOT.rglob("*.md")))} files')
Step 5: Embed locally (no OpenAI needed)
Use sentence-transformers — a free, local embedding library. Models like all-MiniLM-L6-v2 (90 MB, fast) or BAAI/bge-base-en-v1.5 (440 MB, higher quality) run on CPU at acceptable speed for knowledge bases up to ~10K chunks. No API costs, no network dependency, no data leaving your machine.
from sentence_transformers import SentenceTransformer
import chromadb
model = SentenceTransformer('BAAI/bge-base-en-v1.5')
chroma = chromadb.PersistentClient(path=str(KB_ROOT / '.chroma'))
collection = chroma.get_or_create_collection(
    'kb',
    metadata={'hnsw:space': 'cosine'},  # so 1 - distance in step 6 is cosine similarity
)
BATCH = 64
for i in range(0, len(all_chunks), BATCH):
batch = all_chunks[i:i+BATCH]
embeds = model.encode(
[c.text for c in batch],
normalize_embeddings=True,
show_progress_bar=False,
)
collection.add(
ids=[f'{c.source_url}#{c.section}#{i+j}' for j, c in enumerate(batch)],
embeddings=embeds.tolist(),
documents=[c.text for c in batch],
metadatas=[{
'source_url': c.source_url,
'topic': c.topic,
'section': c.section,
} for c in batch],
)
print(f'Embedded {i+len(batch)}/{len(all_chunks)}')
For a 500-chunk knowledge base on a modern laptop CPU, this runs in 2-4 minutes. GPU available? Add device='cuda' to the SentenceTransformer constructor — drops to under 30 seconds.
Step 6: Query with citations
# Retrieve the top-K most similar chunks, optionally filtered to one topic folder.
def query(question, top_k=5, topic_filter=None):
q_embed = model.encode([question], normalize_embeddings=True)[0]
where = {'topic': topic_filter} if topic_filter else None
results = collection.query(
query_embeddings=[q_embed.tolist()],
n_results=top_k,
where=where,
)
for doc, meta, dist in zip(
results['documents'][0],
results['metadatas'][0],
results['distances'][0],
):
print(f"--- score: {1-dist:.3f}")
print(f"topic: {meta['topic']} / section: {meta['section']}")
print(f"source: {meta['source_url']}")
print(doc[:400])
print()
query('How do I run background tasks in FastAPI?')
query('uvicorn vs gunicorn for production', topic_filter='fastapi-deployment')
That's the entire pipeline. ~80 lines of Python total, all running on your laptop.
Optional: layer an LLM on top
For natural-language answers (not just retrieved chunks), feed the top-K chunks plus the question to any chat model. Local options: a quantized Llama 3.1 8B via Ollama or LM Studio. Cloud: any OpenAI/Anthropic chat endpoint. The retrieval logic above is unchanged either way.
import requests
def rag_answer(question, top_k=5):
q_embed = model.encode([question], normalize_embeddings=True)[0]
results = collection.query(
query_embeddings=[q_embed.tolist()], n_results=top_k,
)
context = '\n\n---\n\n'.join([
f"## {m['section']}\nSource: {m['source_url']}\n\n{d}"
for d, m in zip(results['documents'][0], results['metadatas'][0])
])
prompt = (
'Answer using only the provided sources. Cite source URLs.\n\n'
f'{context}\n\nQuestion: {question}'
)
# Local Ollama example:
r = requests.post('http://localhost:11434/api/generate', json={
'model': 'llama3.1:8b', 'prompt': prompt, 'stream': False,
})
return r.json()['response']
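To try it (assuming Ollama is running locally and the llama3.1:8b model has been pulled):
print(rag_answer('How do I run background tasks in FastAPI?'))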
Refreshing the knowledge base over time
The web is not static. Sources update; new sources appear. The maintenance pattern:
- Once a month, re-visit your source list. Drop any URL that's gone stale (404, content materially changed for the worse). Add any new URLs that have appeared since last sync.
- For each updated URL, re-convert via the web tool (or re-run your Trafilatura script if you've automated it). Overwrite the old .md file.
- Re-chunk the changed files only.
- In ChromaDB, delete chunks where source_url matches the changed file, then re-embed and add the new chunks.
Incremental updates take 10-15 minutes for a typical 30-source knowledge base.
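The ChromaDB half of that pattern is only a few lines. A minimal sketch reusing model, collection, and chunk_file() from the earlier steps; refresh_file is a hypothetical helper, not a library call:
def refresh_file(path, topic, source_url):
    # Drop every chunk that came from the stale version of this source.
    collection.delete(where={'source_url': source_url})
    # Re-chunk the updated Markdown and embed the fresh chunks.
    chunks = chunk_file(path, topic)
    embeds = model.encode([c.text for c in chunks], normalize_embeddings=True)
    collection.add(
        ids=[f'{c.source_url}#{c.section}#{j}' for j, c in enumerate(chunks)],
        embeddings=embeds.tolist(),
        documents=[c.text for c in chunks],
        metadatas=[{'source_url': c.source_url, 'topic': c.topic,
                    'section': c.section} for c in chunks],
    )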
What about PDF sources?
Many serious knowledge bases mix web sources with PDF whitepapers, RFCs, and design docs. The pipeline above works identically for PDF sources — you just swap the conversion step. See PDF to Markdown for RAG: complete pipeline guide for the PDF half. The chunking, embedding, and retrieval steps are unchanged because everything ends up as Markdown anyway. Many production knowledge bases use both: URL-to-Markdown for the web sources, PDF-to-Markdown for the document sources, all chunked into the same vector store.
How this compares to commercial RAG-as-a-service
Pinecone, Weaviate Cloud, and managed RAG products handle the embedding, storage, and retrieval as a hosted service. Tradeoffs:
- Hosted: faster setup (no model download, no local DB), better scale (millions of chunks), but ongoing cost ($50-500/month typical) and your data leaves your machine.
- Local (this guide): $0 ongoing cost, complete privacy, fits up to ~50K chunks comfortably on a laptop. Above that scale, switch to a managed vector DB but keep the same chunking + embedding logic.
For personal knowledge bases and small-team internal tools, local wins. For multi-tenant SaaS at scale, hosted wins. The conversion and chunking parts of the pipeline are identical either way.
Common pitfalls to avoid
- Over-collecting sources. Beyond ~50 sources for a focused topic, retrieval precision drops. Be ruthless about pruning low-quality URLs from your list before you start converting.
- Skipping the provenance comment. Without a <!-- source: ... --> comment at the top of each .md file, you can't re-fetch later when content updates. Add it during conversion, not afterward.
- Chunking by character count instead of headings. Fixed-window chunking (every 1000 characters) breaks semantic boundaries. Stick to H2/H3-based chunking even if it produces variable-sized chunks.
- Embedding raw HTML chunks. If you skip the Markdown-conversion step and embed HTML directly, your embeddings get diluted by nav/footer/script noise and retrieval quality drops 15-30%. The conversion step is not optional for quality.
- Not building an evaluation set. Without 30-50 known-good question/answer pairs to test against, every change to the pipeline becomes a vibe-based decision. Spend an hour up front building the eval set; it pays for itself the first time you avoid a regression. A minimal retrieval-recall harness is sketched after this list.
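A minimal sketch of such a harness, reusing model and collection from the steps above. EVAL_SET is a hypothetical hand-curated list; the entries are illustrative:
EVAL_SET = [
    {'question': 'How do I run background tasks in FastAPI?',
     'expected_url': 'https://fastapi.tiangolo.com/tutorial/background-tasks/'},
    # ...30-50 of these, curated by hand
]

def recall_at_k(k=5):
    # Fraction of eval questions whose expected source shows up in the top-k results.
    hits = 0
    for case in EVAL_SET:
        q_embed = model.encode([case['question']], normalize_embeddings=True)[0]
        results = collection.query(query_embeddings=[q_embed.tolist()], n_results=k)
        found = {m['source_url'] for m in results['metadatas'][0]}
        hits += case['expected_url'] in found
    return hits / len(EVAL_SET)

print(f'recall@5: {recall_at_k(5):.2f}')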
Recommendation
This is the cheapest, most private, most flexible AI knowledge base architecture you can build. Web tool for the conversion (one-time clicks); OSS for everything else. Total runtime cost: $0. Total setup time: a weekend. Quality: equivalent to or better than most commercial RAG products on focused-topic corpora. For more on extending the retrieval logic (hybrid search, re-ranking, golden-set evaluation), see the larger RAG pipeline tutorial and the batch conversion patterns for scaling the source ingestion.