Building a Web Knowledge Base for AI: Architecture Guide
Most teams that say "we want to put our content into ChatGPT" eventually mean: we want a retrieval-augmented system over our own corpus. The corpus is rarely a single tidy database — it is a mix of public docs, blog posts, internal wikis, third-party references, and ad-hoc HTML scattered across the web. The architecture that turns this mess into a useful AI knowledge base has seven layers, each with real choices and real trade-offs. Here is the end-to-end build, with code, tool recommendations, and the failure modes to avoid.
The seven layers, at a glance
- Source identification: enumerate what goes in, decide what stays out.
- Conversion: turn each source into clean Markdown.
- Organization: folder structure, naming, frontmatter metadata.
- Chunking: split documents into retrievable units.
- Embedding: vectorize each chunk.
- Storage: put vectors in a database that supports similarity search.
- Update strategy: keep the corpus fresh as sources change.
Each layer can be solved with hosted services, OSS run locally, or a hybrid. The right choice depends on volume, sensitivity, latency budget, and team comfort with infrastructure. The walkthrough below assumes a mid-sized corpus (1,000-100,000 documents) and prefers OSS-local for the heavier batch layers and hosted for the lower-volume interactive layers.
Layer 1: source identification
Before any code runs, decide what your knowledge base is for. Common scopes:
- Product documentation: your own docs site, your changelog, your API reference, the GitHub README.
- Internal wiki: Confluence, Notion, or Markdown files in a Git repo. Often the largest and messiest source.
- Third-party references: vendor docs you depend on, regulatory text, standards documents.
- Web content you publish: blog posts, conference talks (transcripts), case studies.
- Web content you do not own: competitor docs, public research, news coverage. Respect robots.txt and terms of service; do not index what you have no right to.
Build the index of source URLs (or paths) before building any pipeline. A simple sources.yaml that lists each source with its type, scope, and refresh cadence is the right artifact:
```yaml
sources:
  - id: own_docs
    type: sitemap
    url: https://docs.example.com/sitemap.xml
    refresh: weekly
  - id: own_blog
    type: rss
    url: https://example.com/blog/feed.xml
    refresh: daily
  - id: internal_wiki
    type: filesystem
    path: /mnt/wiki
    refresh: hourly
  - id: vendor_docs
    type: url_list
    file: ./vendor_urls.txt
    refresh: monthly
```
This single file is the source of truth for what the knowledge base contains. Future you will thank present you.
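A minimal sketch of how the pipeline might consume this file, using PyYAML (`load_sources` is a hypothetical helper, not part of any library):

```python
import yaml  # PyYAML

def load_sources(text: str) -> dict:
    """Parse sources.yaml content into an id-keyed dict the pipeline can dispatch on."""
    return {s['id']: s for s in yaml.safe_load(text)['sources']}

EXAMPLE = """
sources:
  - id: own_docs
    type: sitemap
    url: https://docs.example.com/sitemap.xml
    refresh: weekly
"""

sources = load_sources(EXAMPLE)
```

Each downstream layer can then look up its inputs by source id instead of hard-coding URLs.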
Layer 2: conversion
For one-off conversions of unusual pages, the URL-to-Markdown web tool is the right path — paste, click, save. For hundreds or thousands of pages from a sitemap, run OSS locally. The canonical stack:
- Playwright for JS-rendered pages
- Trafilatura for extraction (news/blog/article content)
- Mozilla Readability as an alternative or fallback for non-news content
- BeautifulSoup or lxml for site-specific custom selectors when the generic libraries miss
A batch conversion script for a sitemap-driven source:
```python
import hashlib
from pathlib import Path
from xml.etree import ElementTree

import requests
import trafilatura
from playwright.sync_api import sync_playwright

NS = {'sm': 'http://www.sitemaps.org/schemas/sitemap/0.9'}

def urls_from_sitemap(sitemap_url: str) -> list[str]:
    xml = requests.get(sitemap_url, timeout=30).content
    root = ElementTree.fromstring(xml)
    return [loc.text for loc in root.findall('.//sm:loc', NS)]

def render(url: str, browser) -> str:
    page = browser.new_page()
    page.goto(url, wait_until='networkidle', timeout=30000)
    html = page.content()
    page.close()
    return html

def convert(url: str, browser, out_dir: Path):
    try:
        html = render(url, browser)
        # with_metadata is left off: we build our own frontmatter below.
        md = trafilatura.extract(html, output_format='markdown')
        if not md:
            return None
        meta = trafilatura.extract_metadata(html)  # may return None
        title = (meta.title or '') if meta else ''
        date = (meta.date or '') if meta else ''
        slug = hashlib.sha1(url.encode()).hexdigest()[:12]
        body = (
            f"---\n"
            f"source_url: {url}\n"
            f"title: {title}\n"
            f"date: {date}\n"
            f"---\n\n{md}"
        )
        (out_dir / f"{slug}.md").write_text(body, encoding='utf-8')
        return slug
    except Exception as e:
        print(f"FAIL {url}: {e}")
        return None

urls = urls_from_sitemap('https://docs.example.com/sitemap.xml')
out = Path('corpus')
out.mkdir(exist_ok=True)
with sync_playwright() as p:
    browser = p.chromium.launch()
    for url in urls:
        convert(url, browser, out)
    browser.close()
```
For deeper context on extraction tradeoffs, see content extraction: Readability vs Trafilatura vs AI-powered.
Layer 3: organization
The output of layer 2 is a folder of Markdown files. Folder structure for retrievability:
```
corpus/
  own_docs/
    api-reference/
    guides/
    changelog/
  own_blog/
  internal_wiki/
    engineering/
    product/
    operations/
  vendor_docs/
    stripe/
    aws/
    openai/
```
The frontmatter is the structured metadata block at the top of each file. Useful fields:
```yaml
---
source_url: https://docs.example.com/api/auth
title: Authentication API
source_id: own_docs
fetched_at: 2026-05-10T14:00:00Z
last_modified: 2026-04-22
tags: [auth, api, oauth]
section: api-reference
---
```
The chunker and the retrieval layer can both filter on these fields. "Search only own_docs and vendor_docs/stripe" becomes a metadata filter rather than a string match on file paths.
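With ChromaDB, for example, that becomes a `where` clause on the query. A sketch, assuming each chunk was stored with a `source_id` metadata field (`source_filter` and `search` are illustrative helpers, not library functions):

```python
def source_filter(*source_ids):
    """Build a Chroma `where` clause restricting results to the given source_id values."""
    return {'source_id': {'$in': list(source_ids)}}

def search(collection, query_embedding, *source_ids, k=5):
    # `collection` is a chromadb collection whose chunk metadatas carry source_id.
    return collection.query(
        query_embeddings=[query_embedding],
        n_results=k,
        where=source_filter(*source_ids),
    )
```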
Layer 4: chunking
Embedding works best on chunks of 200-1000 tokens. A whole document is too large; a single sentence is too small. The art is in where to split.
Fixed-size chunking (every N tokens) is easy and bad — splits mid-sentence, mid-paragraph, mid-thought. The retrieval layer then surfaces awkward fragments.
Markdown structure-aware chunking is the right default. Split at H2 boundaries; if a section is still too long, split at H3; if individual paragraphs exceed the limit, split with overlap. Markdown makes this trivial because the structure is explicit.
```python
import re
from pathlib import Path

import tiktoken

ENC = tiktoken.encoding_for_model('text-embedding-3-small')
TARGET = 500  # soft target; MAX below is the hard cap
MAX = 800

def token_len(text: str) -> int:
    return len(ENC.encode(text))

def split_by_heading(md: str, level: int) -> list[str]:
    # Zero-width lookahead keeps each heading attached to the section it opens.
    pattern = rf'(?=^{"#" * level} )'
    return [s for s in re.split(pattern, md, flags=re.M) if s.strip()]

def chunk_markdown(md: str) -> list[str]:
    chunks = []
    for sec in split_by_heading(md, 2):
        if token_len(sec) <= MAX:
            chunks.append(sec)
            continue
        for sub in split_by_heading(sec, 3):
            if token_len(sub) <= MAX:
                chunks.append(sub)
            else:
                # Paragraph-level packing (shown without overlap, for brevity).
                paras = sub.split('\n\n')
                cur, cur_len = [], 0
                for p in paras:
                    pl = token_len(p)
                    if cur_len + pl > MAX and cur:
                        chunks.append('\n\n'.join(cur))
                        cur, cur_len = [p], pl
                    else:
                        cur.append(p)
                        cur_len += pl
                if cur:
                    chunks.append('\n\n'.join(cur))
    return chunks

for f in Path('corpus').rglob('*.md'):
    md = f.read_text(encoding='utf-8')
    for i, chunk in enumerate(chunk_markdown(md)):
        out = Path('chunks') / f.relative_to('corpus').with_suffix('') / f'{i:04d}.md'
        out.parent.mkdir(parents=True, exist_ok=True)
        out.write_text(chunk, encoding='utf-8')
```
Each chunk stays linked to its parent document's frontmatter through its path under chunks/, and should carry enough surrounding context (its own heading, at minimum) to be self-explanatory in retrieval results.
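One way to make that inheritance explicit is to parse the parent document's frontmatter before chunking and attach it to every chunk. A naive sketch that only handles flat `key: value` lines (`split_frontmatter` is a hypothetical helper; a real pipeline would use the python-frontmatter package or PyYAML):

```python
import re

def split_frontmatter(md: str):
    """Return (metadata dict, body) for a document with simple key: value frontmatter."""
    m = re.match(r'---\n(.*?)\n---\n', md, flags=re.S)
    if not m:
        return {}, md
    meta = {}
    for line in m.group(1).splitlines():
        key, _, value = line.partition(':')
        meta[key.strip()] = value.strip()
    return meta, md[m.end():]

meta, body = split_frontmatter("---\ntitle: Auth\nsource_id: own_docs\n---\n\n## Intro\n")
```

The resulting dict can then be stored as per-chunk metadata in the vector store, which is what makes the filtering in layer 3 work.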
Layer 5: embedding
Two viable paths: local embeddings via sentence-transformers, or hosted embeddings via OpenAI/Cohere/Voyage.
Local with sentence-transformers: free per inference after the one-time model download, runs on CPU or GPU, no data leaves your environment. Quality is excellent for English; multilingual models (paraphrase-multilingual-mpnet-base-v2) cover 50+ languages. Ideal when the corpus is sensitive or the volume is high enough that per-call API costs add up.
```python
import json
from pathlib import Path

from sentence_transformers import SentenceTransformer

model = SentenceTransformer('all-mpnet-base-v2')

chunks_data = []
for f in Path('chunks').rglob('*.md'):
    chunks_data.append({'path': str(f), 'text': f.read_text(encoding='utf-8')})

texts = [c['text'] for c in chunks_data]
embeddings = model.encode(texts, batch_size=32, show_progress_bar=True, normalize_embeddings=True)
for c, e in zip(chunks_data, embeddings):
    c['embedding'] = e.tolist()

Path('embeddings.json').write_text(json.dumps(chunks_data))
```
Hosted via OpenAI: trivially scalable, no infrastructure, fewer model-management headaches. Per-call cost is small (text-embedding-3-small is fractions of a cent per 1K tokens) but accumulates at high volumes. Quality is excellent and consistent. Use when you do not want to manage a model.
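The hosted path is a few lines. A sketch assuming the official openai Python client and an OPENAI_API_KEY in the environment (`batch` and `embed_hosted` are illustrative helpers; batching keeps individual requests well under the endpoint's per-call input limit):

```python
def batch(items, size):
    """Yield fixed-size batches so one oversized request doesn't fail the whole run."""
    for i in range(0, len(items), size):
        yield items[i:i + size]

def embed_hosted(texts, model='text-embedding-3-small', batch_size=256):
    from openai import OpenAI  # official client; reads OPENAI_API_KEY from the environment
    client = OpenAI()
    vectors = []
    for group in batch(texts, batch_size):
        resp = client.embeddings.create(model=model, input=group)
        vectors.extend(d.embedding for d in resp.data)
    return vectors
```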
The choice rarely makes a meaningful difference at the retrieval-quality level for typical corpora — both produce good embeddings. The decision usually comes down to data residency and per-call economics.
Layer 6: vector storage
Three reasonable choices, depending on how much infrastructure you want to run.
ChromaDB: the default for local dev and small-to-mid production. Embedded mode runs in-process; client-server mode for shared use. Free, OSS, no managed-service bill. Excellent for corpora up to a few million vectors.
```python
import chromadb

client = chromadb.PersistentClient(path='./chroma_db')
collection = client.get_or_create_collection('knowledge_base')
collection.add(
    ids=[f'chunk_{i}' for i in range(len(chunks_data))],
    documents=[c['text'] for c in chunks_data],
    embeddings=[c['embedding'] for c in chunks_data],
    metadatas=[{'source_path': c['path']} for c in chunks_data],
)

# Embed the query with the same model used at index time (`model` from layer 5).
# Passing query_texts would invoke Chroma's default embedding function, whose
# vectors have a different dimensionality than all-mpnet-base-v2's.
query_vec = model.encode(['How do I rotate API keys?'], normalize_embeddings=True)
results = collection.query(
    query_embeddings=query_vec.tolist(),
    n_results=5,
)
for doc, meta in zip(results['documents'][0], results['metadatas'][0]):
    print(meta['source_path'])
    print(doc[:200])
    print('---')
```
Pinecone: hosted, fully managed, scales effortlessly, good for production where you do not want to operate infrastructure. Pay per index per month plus per-operation. The right choice if your team is small and your traffic is high.
Weaviate: self-hosted or managed, supports hybrid search (vector + BM25) natively, modular vectorization. Strong choice when retrieval quality matters and you want sparse-and-dense hybrid retrieval out of the box.
Other viable options: Qdrant (similar to Chroma, slightly more production-ready), Milvus (when you have hundreds of millions of vectors), pgvector (when you already run Postgres and the volume is modest).
Layer 7: update strategy
The corpus you ship on day one is stale by week four. Sources change; pages get updated; new documents appear; old ones get retired. The update strategy has two dimensions: detection and reprocessing.
Detection: how do you know a source has changed?
- Sitemap or RSS lastmod / pubDate fields — cheap, reliable when the source publishes them honestly.
- HTTP HEAD with Last-Modified — universal, sometimes lies.
- Content hash — fetch and hash; if the hash differs, re-process. Most expensive, most reliable.
- Webhook from the source (if you control it) — the gold standard, when available.
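The content-hash approach is a few lines. A sketch where `fingerprint` and `has_changed` are illustrative helpers and the known hash is whatever was stored on the previous run:

```python
import hashlib

def fingerprint(body: bytes) -> str:
    """Stable content hash; store one per URL and compare on the next fetch."""
    return hashlib.sha256(body).hexdigest()

def has_changed(url: str, known_hash: str) -> bool:
    import requests
    resp = requests.get(url, timeout=30)
    resp.raise_for_status()
    return fingerprint(resp.content) != known_hash
```

Hashing the extracted Markdown rather than the raw HTML avoids spurious re-processing when only ads, timestamps, or nav markup change.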
Reprocessing: when a source changes, you need to re-convert, re-chunk, re-embed, and replace in the vector store. The re-embedding is the most expensive step; for large corpora, only re-embed the chunks whose text actually changed (compare hashes) rather than all chunks of the changed document.
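That chunk-level diff can be sketched as follows (`diff_chunks` is an illustrative helper; `old` maps stored chunk ids to their content hashes):

```python
import hashlib

def chunk_hash(text: str) -> str:
    return hashlib.sha256(text.encode('utf-8')).hexdigest()

def diff_chunks(old: dict, new_chunks: list):
    """Return (chunks needing embedding, stale chunk ids to delete from the store)."""
    new_hashes = {chunk_hash(c): c for c in new_chunks}
    old_hashes = set(old.values())
    to_embed = [c for h, c in new_hashes.items() if h not in old_hashes]
    to_delete = [cid for cid, h in old.items() if h not in new_hashes]
    return to_embed, to_delete
```

Only `to_embed` hits the embedding model; unchanged chunks keep their existing vectors.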
A scheduled job that processes the sources.yaml file according to each source's refresh cadence is the right shape. Cron + Python script is enough for most teams; for higher reliability use Airflow, Prefect, or a managed equivalent.
Cross-reference: PDFs in the corpus
Most real knowledge bases include PDF sources alongside web sources — vendor manuals, regulatory text, scanned reference material. The conversion layer extends naturally: same Markdown output format, same chunking, same embedding, same storage. See PDF to Markdown for RAG: complete pipeline guide for the parallel walkthrough on the PDF side, including OCR for scanned material and special handling for layout-rich documents.
For a corpus that is 70% web and 30% PDF, both pipelines feed the same chunker and the same vector store. The frontmatter source_type field lets retrieval results indicate whether a chunk came from a web page or a PDF.
Common failure modes
Five recurring mistakes worth avoiding.
- Chunking too aggressively: 200-token chunks lose context. Aim for 500-800 tokens with semantic boundaries. The retriever will surface multiple chunks per query — give each chunk enough context to stand on its own.
- Embedding the noisy raw HTML: skipping the conversion step and embedding raw HTML pulls nav-bar text and footer copyright into the vector, polluting the embedding. Convert first, embed clean Markdown.
- No frontmatter metadata: without source_id, fetched_at, and section tags on each chunk, you cannot filter retrievals or audit which sources contributed to a generation. Always carry metadata through.
- Re-embedding everything on every update: hash-based diffing at the chunk level is the difference between a 5-minute incremental update and a 5-hour full re-index.
- No evaluation harness: a knowledge base without a test set of representative queries (with expected sources) cannot be improved. Build the eval harness on day one with 20-50 queries and measure precision-at-5 every time you change the chunker or the embedding model.
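A minimal harness for that metric. `precision_at_k` and `run_eval` are illustrative; `retrieve` is whatever function wraps your vector store query and returns ranked source paths:

```python
def precision_at_k(retrieved: list, expected: set, k: int = 5) -> float:
    """Fraction of the top-k retrieved source paths that are in the expected set."""
    return sum(1 for r in retrieved[:k] if r in expected) / k

def run_eval(queries, retrieve, k=5):
    """queries: list of (query_text, expected_paths). Returns mean precision-at-k."""
    scores = [precision_at_k(retrieve(q), set(expected), k) for q, expected in queries]
    return sum(scores) / len(scores)
```

Run it before and after every chunker or embedding-model change; a regression in mean precision-at-5 is the earliest signal that a "improvement" hurt retrieval.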
For the conversion layer specifically, see content extraction comparisons and JavaScript-rendered page handling. For one-off conversions of unusual sources during corpus development, the web tool is the right shortcut. For everything at scale, run the OSS stack locally — Playwright + Trafilatura + sentence-transformers + ChromaDB covers the entire pipeline with no recurring vendor bill.
The architecture, in one sentence
List your sources, convert each to clean Markdown, organize by source and section with metadata in the frontmatter, chunk by Markdown structure (not fixed size), embed with sentence-transformers locally or OpenAI hosted, store in ChromaDB or Pinecone, and re-process on a per-source cadence with hash-based diffing. Everything else is implementation detail. The substrate that makes the whole thing tractable is Markdown — clean, structured, semantic plaintext is what every layer of the pipeline expects.