
Build an AI Knowledge Base from Web Sources (Markdown Method)

Building an AI-queryable knowledge base from web content is a five-step problem: pick the sources, fetch them, convert to a uniform format, organize by topic, embed for retrieval. Most tutorials skip steps 3 and 4 — and that's why their RAG pipelines feel mediocre. This guide walks the entire pipeline using the MDisBetter web tool for the conversion step (one URL at a time) and free open-source tools (sentence-transformers, ChromaDB) for the embedding and retrieval steps. Everything runs locally; nothing leaves your machine after step 2.

What we're building

A local knowledge base where you can paste a question and get an answer cited back to specific URLs. The pipeline:

[picked URLs]
     ↓
[paste each into mdisbetter.com → download .md]
     ↓
[~/kb/&lt;topic&gt;/&lt;page&gt;.md folder structure]
     ↓
[chunk by H2 / H3 headings]
     ↓
[embed locally via sentence-transformers]
     ↓
[ChromaDB persistent local store]
     ↓
[query: question → top-K chunks + citations]

Total cost: $0 if you stay local. Total time for a 30-source knowledge base: about 90 minutes the first time, 15 minutes for incremental updates.

Step 1: Pick your sources

The single biggest determinant of knowledge base quality is source selection. Be ruthless. For a knowledge base on, say, "FastAPI in production," the right sources are official documentation, posts written by maintainers or recognized practitioners, and reference material with a verifiable, recent publication date.

What does not belong: random Medium posts, low-quality YouTube transcripts, Stack Overflow answers older than two years, anything you can't verify the publication date of. Garbage in, garbage out.

Aim for 20-50 sources for a focused topic. Beyond that, retrieval precision drops because too many chunks compete for top-K slots.

Step 2: Convert each source to Markdown

This is the conversion step. For each URL in your list:

  1. Open mdisbetter.com/convert/url-to-markdown
  2. Paste the URL
  3. Click Convert
  4. Click Download to save as a .md file
  5. Move the file into the appropriate topic folder (we'll organize next)

For 30 URLs this takes about 15 minutes of clicking. Worth it for the quality — the web tool's output is the cleanest you'll get without writing code.

Want to automate this for hundreds of URLs?

The MDisBetter web tool is intentionally one-URL-at-a-time and does not expose a programmatic API. For 100+ URLs, use the OSS path: Trafilatura for the extraction, requests/httpx for the fetching, the same target file structure. The runnable recipe is in scrape a website to Markdown for RAG. The end result — a folder of clean .md files — is identical to running the URLs through the web tool one by one.
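
If you script it, the core loop is short. A minimal sketch, assuming a recent Trafilatura release (Markdown output was added in newer versions); url_to_md and the slug logic here are illustrative, not part of the linked recipe:

import requests
import trafilatura
from datetime import date
from pathlib import Path

def url_to_md(url, dest_dir):
    # Fetch ourselves so we control timeout and headers.
    html = requests.get(url, timeout=30).text
    md = trafilatura.extract(html, output_format='markdown', include_links=True)
    if md is None:
        print(f'extraction failed: {url}')
        return
    # Derive a filename from the last URL path segment.
    slug = url.rstrip('/').rsplit('/', 1)[-1] or 'index'
    header = (f'<!-- source: {url} -->\n'
              f'<!-- converted: {date.today()} via trafilatura -->\n\n')
    (Path(dest_dir) / f'{slug}.md').write_text(header + md, encoding='utf-8')

Note the header: it writes the same provenance comments the manual workflow adds in step 3, so the downstream steps can't tell the difference.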

Step 3: Organize by topic

The folder structure becomes your retrieval namespace later. Aim for one folder per logical topic:

~/kb/
  fastapi-fundamentals/
    routing.md
    dependency-injection.md
    pydantic-models.md
  fastapi-deployment/
    uvicorn-vs-gunicorn.md
    docker-image-sizing.md
    aws-fargate-setup.md
  fastapi-performance/
    async-vs-sync-endpoints.md
    background-tasks.md
    caching-with-redis.md

Each .md file gets a comment at the top with provenance:

<!-- source: https://fastapi.tiangolo.com/tutorial/dependencies/ -->
<!-- converted: 2026-05-10 via mdisbetter.com -->

# Dependencies

...

The provenance lets you re-fetch later when content updates, and lets retrieved chunks cite their original URL.

Step 4: Chunk by H2 (and H3 when needed)

Chunking strategy is the second-biggest quality lever after source selection. The pattern: split by H2 headings (and recursively by H3 if a section is too large). Each chunk becomes a self-contained semantic unit.

import re
from dataclasses import dataclass
from pathlib import Path

@dataclass
class Chunk:
    source_url: str
    topic: str  # folder name
    section: str
    text: str

H2_RE = re.compile(r'^## (.+)$', re.MULTILINE)
H3_RE = re.compile(r'^### (.+)$', re.MULTILINE)
SOURCE_RE = re.compile(r'<!-- source: (\S+) -->')
MAX_CHARS = 4000  # roughly 1000 tokens
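
# Split `text` at each heading match; any content before the first
# heading becomes an '_intro' chunk so it isn't silently dropped.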

def chunk_by_heading(text, regex, source_url, topic, parent_section=''):
    matches = list(regex.finditer(text))
    if not matches:
        return [Chunk(source_url, topic, parent_section or 'main', text)]
    chunks = []
    if matches[0].start() > 0:
        intro = text[:matches[0].start()].strip()
        if intro:
            chunks.append(Chunk(
                source_url, topic,
                parent_section + '/_intro' if parent_section else '_intro',
                intro,
            ))
    for i, m in enumerate(matches):
        title = m.group(1).strip()
        start = m.start()
        end = matches[i+1].start() if i+1 < len(matches) else len(text)
        section_text = text[start:end].strip()
        full_section = f'{parent_section} / {title}' if parent_section else title
        chunks.append(Chunk(source_url, topic, full_section, section_text))
    return chunks
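
# Split a file by H2 first; re-split any oversized section by H3
# so every chunk stays within MAX_CHARS.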

def chunk_file(path, topic):
    raw = path.read_text(encoding='utf-8')
    src_match = SOURCE_RE.search(raw)
    source_url = src_match.group(1) if src_match else str(path)
    h2_chunks = chunk_by_heading(raw, H2_RE, source_url, topic)
    final = []
    for c in h2_chunks:
        if len(c.text) <= MAX_CHARS:
            final.append(c)
        else:
            sub = chunk_by_heading(c.text, H3_RE, source_url, topic, c.section)
            final.extend(sub)
    return final

KB_ROOT = Path.home() / 'kb'
all_chunks = []
for topic_dir in KB_ROOT.iterdir():
    if not topic_dir.is_dir():
        continue
    topic = topic_dir.name
    for f in topic_dir.glob('*.md'):
        all_chunks.extend(chunk_file(f, topic))

print(f'{len(all_chunks)} chunks from {len(list(KB_ROOT.rglob("*.md")))} files')

Step 5: Embed locally (no OpenAI needed)

Use sentence-transformers — a free, local embedding library. Models like all-MiniLM-L6-v2 (90 MB, fast) or BAAI/bge-base-en-v1.5 (440 MB, higher quality) run on CPU at acceptable speed for knowledge bases up to ~10K chunks. No API costs, no network dependency, no data leaving your machine.

from sentence_transformers import SentenceTransformer
import chromadb

model = SentenceTransformer('BAAI/bge-base-en-v1.5')
chroma = chromadb.PersistentClient(path=str(KB_ROOT / '.chroma'))
collection = chroma.get_or_create_collection(
    'kb',
    metadata={'hnsw:space': 'cosine'},  # cosine distance, so 1 - distance is a true similarity
)

BATCH = 64
for i in range(0, len(all_chunks), BATCH):
    batch = all_chunks[i:i+BATCH]
    embeds = model.encode(
        [c.text for c in batch],
        normalize_embeddings=True,
        show_progress_bar=False,
    )
    collection.add(
        ids=[f'{c.source_url}#{c.section}#{i+j}' for j, c in enumerate(batch)],
        embeddings=embeds.tolist(),
        documents=[c.text for c in batch],
        metadatas=[{
            'source_url': c.source_url,
            'topic': c.topic,
            'section': c.section,
        } for c in batch],
    )
    print(f'Embedded {i+len(batch)}/{len(all_chunks)}')

For a 500-chunk knowledge base on a modern laptop CPU, this runs in 2-4 minutes. GPU available? Add device='cuda' to the SentenceTransformer constructor — drops to under 30 seconds.

Step 6: Query with citations

def query(question, top_k=5, topic_filter=None):
    q_embed = model.encode([question], normalize_embeddings=True)[0]
    where = {'topic': topic_filter} if topic_filter else None
    results = collection.query(
        query_embeddings=[q_embed.tolist()],
        n_results=top_k,
        where=where,
    )
    for doc, meta, dist in zip(
        results['documents'][0],
        results['metadatas'][0],
        results['distances'][0],
    ):
        print(f"--- score: {1-dist:.3f}")  # cosine similarity, given the cosine-space collection
        print(f"topic: {meta['topic']} / section: {meta['section']}")
        print(f"source: {meta['source_url']}")
        print(doc[:400])
        print()

query('How do I run background tasks in FastAPI?')
query('uvicorn vs gunicorn for production', topic_filter='fastapi-deployment')

That's the entire pipeline. ~80 lines of Python total, all running on your laptop.

Optional: layer an LLM on top

For natural-language answers (not just retrieved chunks), feed the top-K chunks plus the question to any chat model. Local options: a quantized Llama 3.1 8B via Ollama or LM Studio. Cloud: any OpenAI/Anthropic chat endpoint. The retrieval logic above is unchanged either way.

import requests

def rag_answer(question, top_k=5):
    q_embed = model.encode([question], normalize_embeddings=True)[0]
    results = collection.query(
        query_embeddings=[q_embed.tolist()], n_results=top_k,
    )
    context = '\n\n---\n\n'.join([
        f"## {m['section']}\nSource: {m['source_url']}\n\n{d}"
        for d, m in zip(results['documents'][0], results['metadatas'][0])
    ])
    prompt = (
        'Answer using only the provided sources. Cite source URLs.\n\n'
        f'{context}\n\nQuestion: {question}'
    )
    # Local Ollama example:
    r = requests.post('http://localhost:11434/api/generate', json={
        'model': 'llama3.1:8b', 'prompt': prompt, 'stream': False,
    })
    return r.json()['response']
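
Usage mirrors the plain retrieval queries above:

print(rag_answer('How do I run background tasks in FastAPI?'))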

Refreshing the knowledge base over time

The web is not static. Sources update; new sources appear. The maintenance pattern:

  1. Once a month, re-visit your source list. Drop any URL that's gone stale (404, content materially changed for the worse). Add any new URLs that have appeared since last sync.
  2. For each updated URL, re-convert via the web tool (or re-run your Trafilatura script if you've automated it). Overwrite the old .md file.
  3. Re-chunk the changed files only.
  4. In ChromaDB, delete the chunks whose source_url matches the changed file, then re-embed and add the new chunks (sketched below).

Incremental updates take 10-15 minutes for a typical 30-source knowledge base.
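
Here is step 4 as a minimal sketch, assuming the model, collection, and chunk_file() from the earlier steps are still in scope (refresh_source is a hypothetical helper name):

def refresh_source(path, topic):
    chunks = chunk_file(path, topic)
    url = chunks[0].source_url
    # Drop every stored chunk that came from this URL...
    collection.delete(where={'source_url': url})
    # ...then re-embed and re-add the fresh ones.
    embeds = model.encode([c.text for c in chunks], normalize_embeddings=True)
    collection.add(
        ids=[f'{c.source_url}#{c.section}#{j}' for j, c in enumerate(chunks)],
        embeddings=embeds.tolist(),
        documents=[c.text for c in chunks],
        metadatas=[{'source_url': c.source_url, 'topic': c.topic,
                    'section': c.section} for c in chunks],
    )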

What about PDF sources?

Many serious knowledge bases mix web sources with PDF whitepapers, RFCs, and design docs. The pipeline above works identically for PDF sources — you just swap the conversion step. See PDF to Markdown for RAG: complete pipeline guide for the PDF half. The chunking, embedding, and retrieval steps are unchanged because everything ends up as Markdown anyway. Many production knowledge bases use both: URL-to-Markdown for the web sources, PDF-to-Markdown for the document sources, all chunked into the same vector store.

How this compares to commercial RAG-as-a-service

Pinecone, Weaviate Cloud, and managed RAG products handle the embedding, storage, and retrieval as a hosted service. The tradeoff: hosted removes the ops burden and scales well past the ~50K-chunk local ceiling, but it costs money every month and your content leaves your machine.

For personal knowledge bases and small-team internal tools, local wins. For multi-tenant SaaS at scale, hosted wins. The conversion and chunking parts of the pipeline are identical either way.

Common pitfalls to avoid

Each of these was flagged above, and together they account for most weak results: too many sources (retrieval precision drops past ~50 for a focused topic), chunks that blow past the ~1,000-token budget, missing provenance comments (no citations and no refresh path), and a knowledge base that never gets re-synced against the live pages.

Recommendation

This is the cheapest, most private, most flexible AI knowledge base architecture you can build. Web tool for the conversion (one-time clicks); OSS for everything else. Total runtime cost: $0. Total setup time: a weekend. Quality: equivalent to or better than most commercial RAG products on focused-topic corpora. For more on extending the retrieval logic (hybrid search, re-ranking, golden-set evaluation), see the larger RAG pipeline tutorial and the batch conversion patterns for scaling the source ingestion.

Frequently asked questions

Why use sentence-transformers instead of OpenAI embeddings?
Three reasons: zero ongoing cost (OpenAI charges per token; sentence-transformers is free forever once downloaded), no data leaves your machine (privacy), and quality is genuinely competitive at the small/medium scale (BGE models match text-embedding-3-small on most retrieval benchmarks). Switch to OpenAI embeddings only if you need the absolute highest retrieval quality on heterogeneous English+code corpora and you're already paying for an OpenAI account anyway.
What's the practical limit on knowledge base size for a local setup?
Around 50,000 chunks on a typical laptop with 16 GB RAM, holding everything in ChromaDB's persistent local store. Beyond that, query latency becomes noticeable (more than 200 ms per search) and RAM pressure shows up. Past 50K chunks, the right move is a managed vector DB (Qdrant Cloud, Pinecone, Weaviate), but you can keep using sentence-transformers locally for the embedding generation if you want.
How do I keep my knowledge base from drifting from the source web pages?
Two pieces. First, the provenance comment at the top of each .md file (`<!-- source: URL -->`) lets you re-fetch any single source on demand. Second, schedule a monthly review where you walk your source list, re-convert any URLs that have meaningfully changed, and re-embed. For larger corpora, the Trafilatura-based scripted version checks each URL's `<lastmod>` from the sitemap and only re-fetches what changed — that pattern is in the linked RAG tutorial.
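
A minimal sketch of that lastmod check, assuming the site publishes a standard sitemap.xml (changed_since is a hypothetical helper, not something from the linked tutorial):

import requests
import xml.etree.ElementTree as ET

def changed_since(sitemap_url, since_iso):
    # The namespace is fixed by the sitemap protocol spec.
    ns = {'sm': 'http://www.sitemaps.org/schemas/sitemap/0.9'}
    root = ET.fromstring(requests.get(sitemap_url, timeout=30).content)
    stale = []
    for url_el in root.findall('sm:url', ns):
        loc = url_el.findtext('sm:loc', namespaces=ns)
        lastmod = url_el.findtext('sm:lastmod', namespaces=ns)
        # ISO-8601 date strings compare correctly as plain strings.
        if lastmod and lastmod[:10] > since_iso:
            stale.append(loc)
    return stale

# Example (illustrative URL): everything changed since the last conversion date
changed_since('https://fastapi.tiangolo.com/sitemap.xml', '2026-05-10')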