Building a Web Knowledge Base for AI: Architecture Guide
Most teams that say "we want to put our content into ChatGPT" eventually mean: we want a retrieval-augmented system over our own corpus. The corpus is rarely a single tidy database — it is a mix of public docs, blog posts, internal wikis, third-party references, and ad-hoc HTML scattered across the web. The architecture that turns this mess into a useful AI knowledge base has seven layers, each with real choices and real trade-offs. Here is the end-to-end build, with code, tool recommendations, and the failure modes to avoid.
The seven layers, at a glance
- Source identification: enumerate what goes in, decide what stays out.
- Conversion: turn each source into clean Markdown.
- Organization: folder structure, naming, frontmatter metadata.
- Chunking: split documents into retrievable units.
- Embedding: vectorize each chunk.
- Storage: put vectors in a database that supports similarity search.
- Update strategy: keep the corpus fresh as sources change.
Each layer can be solved with hosted services, OSS run locally, or a hybrid. The right choice depends on volume, sensitivity, latency budget, and team comfort with infrastructure. The walkthrough below assumes a mid-sized corpus (1,000-100,000 documents) and prefers OSS-local for the heavier batch layers and hosted for the lower-volume interactive layers.
Layer 1: source identification
Before any code runs, decide what your knowledge base is for. Common scopes:
- Product documentation: your own docs site, your changelog, your API reference, the GitHub README.
- Internal wiki: Confluence, Notion, or Markdown files in a Git repo. Often the largest and messiest source.
- Third-party references: vendor docs you depend on, regulatory text, standards documents.
- Web content you publish: blog posts, conference talks (transcripts), case studies.
- Web content you do not own: competitor docs, public research, news coverage. Respect robots.txt and terms of service; do not index what you have no right to.
Build the index of source URLs (or paths) before building any pipeline. A simple sources.yaml that lists each source with its type, scope, and refresh cadence is the right artifact:
```yaml
sources:
  - id: own_docs
    type: sitemap
    url: https://docs.example.com/sitemap.xml
    refresh: weekly
  - id: own_blog
    type: rss
    url: https://example.com/blog/feed.xml
    refresh: daily
  - id: internal_wiki
    type: filesystem
    path: /mnt/wiki
    refresh: hourly
  - id: vendor_docs
    type: url_list
    file: ./vendor_urls.txt
    refresh: monthly
```
This single file is the source of truth for what the knowledge base contains. Future you will thank present you.
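A minimal sketch of how the pipeline might consume this file, using PyYAML (`load_sources` is a hypothetical helper, not part of any library):

```python
import yaml  # PyYAML

def load_sources(text: str) -> dict:
    """Parse sources.yaml content into an id-keyed dict the pipeline can dispatch on."""
    return {s['id']: s for s in yaml.safe_load(text)['sources']}

EXAMPLE = """
sources:
  - id: own_docs
    type: sitemap
    url: https://docs.example.com/sitemap.xml
    refresh: weekly
"""

sources = load_sources(EXAMPLE)
```

Each downstream layer can then look up its inputs by source id instead of hard-coding URLs.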
Layer 2: conversion
For one-off conversions of unusual pages, the URL-to-Markdown web tool is the right path — paste, click, save. For hundreds or thousands of pages from a sitemap, run OSS locally. The canonical stack:
- Playwright for JS-rendered pages
- Trafilatura for extraction (news/blog/article content)
- Mozilla Readability as an alternative or fallback for non-news content
- BeautifulSoup or lxml for site-specific custom selectors when the generic libraries miss
A batch conversion script for a sitemap-driven source:
```python
import hashlib
from pathlib import Path
from xml.etree import ElementTree

import requests
import trafilatura
from playwright.sync_api import sync_playwright

NS = {'sm': 'http://www.sitemaps.org/schemas/sitemap/0.9'}

def urls_from_sitemap(sitemap_url: str) -> list[str]:
    xml = requests.get(sitemap_url, timeout=30).content
    root = ElementTree.fromstring(xml)
    return [loc.text for loc in root.findall('.//sm:loc', NS)]

def render(url: str, browser) -> str:
    page = browser.new_page()
    page.goto(url, wait_until='networkidle', timeout=30000)
    html = page.content()
    page.close()
    return html

def convert(url: str, browser, out_dir: Path):
    try:
        html = render(url, browser)
        # with_metadata is left off: we build our own frontmatter below.
        md = trafilatura.extract(html, output_format='markdown')
        if not md:
            return None
        meta = trafilatura.extract_metadata(html)  # may return None
        title = (meta.title or '') if meta else ''
        date = (meta.date or '') if meta else ''
        slug = hashlib.sha1(url.encode()).hexdigest()[:12]
        body = (
            f"---\n"
            f"source_url: {url}\n"
            f"title: {title}\n"
            f"date: {date}\n"
            f"---\n\n{md}"
        )
        (out_dir / f"{slug}.md").write_text(body, encoding='utf-8')
        return slug
    except Exception as e:
        print(f"FAIL {url}: {e}")
        return None

urls = urls_from_sitemap('https://docs.example.com/sitemap.xml')
out = Path('corpus')
out.mkdir(exist_ok=True)
with sync_playwright() as p:
    browser = p.chromium.launch()
    for url in urls:
        convert(url, browser, out)
    browser.close()
```
For deeper context on extraction tradeoffs, see content extraction: Readability vs Trafilatura vs AI-powered.
Layer 3: organization
The output of layer 2 is a folder of Markdown files. Folder structure for retrievability:
```
corpus/
  own_docs/
    api-reference/
    guides/
    changelog/
  own_blog/
  internal_wiki/
    engineering/
    product/
    operations/
  vendor_docs/
    stripe/
    aws/
    openai/
```
The frontmatter is the structured metadata block at the top of each file. Useful fields:
```yaml
---
source_url: https://docs.example.com/api/auth
title: Authentication API
source_id: own_docs
fetched_at: 2026-05-10T14:00:00Z
last_modified: 2026-04-22
tags: [auth, api, oauth]
section: api-reference
---
```
The chunker and the retrieval layer can both filter on these fields. "Search only own_docs and vendor_docs/stripe" becomes a metadata filter rather than a string match on file paths.
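With ChromaDB, for example, that becomes a `where` clause on the query. A sketch, assuming each chunk was stored with a `source_id` metadata field (`source_filter` and `search` are illustrative helpers, not library functions):

```python
def source_filter(*source_ids):
    """Build a Chroma `where` clause restricting results to the given source_id values."""
    return {'source_id': {'$in': list(source_ids)}}

def search(collection, query_embedding, *source_ids, k=5):
    # `collection` is a chromadb collection whose chunk metadatas carry source_id.
    return collection.query(
        query_embeddings=[query_embedding],
        n_results=k,
        where=source_filter(*source_ids),
    )
```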
Layer 4: chunking
Embedding works best on chunks of 200-1000 tokens. A whole document is too large; a single sentence is too small. The art is in where to split.
Fixed-size chunking (every N tokens) is easy and bad — splits mid-sentence, mid-paragraph, mid-thought. The retrieval layer then surfaces awkward fragments.
Markdown structure-aware chunking is the right default. Split at H2 boundaries; if a section is still too long, split at H3; if individual paragraphs exceed the limit, split with overlap. Markdown makes this trivial because the structure is explicit.
```python
import re
from pathlib import Path

import tiktoken

ENC = tiktoken.encoding_for_model('text-embedding-3-small')
TARGET = 500  # soft target; MAX below is the hard cap
MAX = 800

def token_len(text: str) -> int:
    return len(ENC.encode(text))

def split_by_heading(md: str, level: int) -> list[str]:
    # Zero-width lookahead keeps each heading attached to the section it opens.
    pattern = rf'(?=^{"#" * level} )'
    return [s for s in re.split(pattern, md, flags=re.M) if s.strip()]

def chunk_markdown(md: str) -> list[str]:
    chunks = []
    for sec in split_by_heading(md, 2):
        if token_len(sec) <= MAX:
            chunks.append(sec)
            continue
        for sub in split_by_heading(sec, 3):
            if token_len(sub) <= MAX:
                chunks.append(sub)
            else:
                # Paragraph-level packing (shown without overlap, for brevity).
                paras = sub.split('\n\n')
                cur, cur_len = [], 0
                for p in paras:
                    pl = token_len(p)
                    if cur_len + pl > MAX and cur:
                        chunks.append('\n\n'.join(cur))
                        cur, cur_len = [p], pl
                    else:
                        cur.append(p)
                        cur_len += pl
                if cur:
                    chunks.append('\n\n'.join(cur))
    return chunks

for f in Path('corpus').rglob('*.md'):
    md = f.read_text(encoding='utf-8')
    for i, chunk in enumerate(chunk_markdown(md)):
        out = Path('chunks') / f.relative_to('corpus').with_suffix('') / f'{i:04d}.md'
        out.parent.mkdir(parents=True, exist_ok=True)
        out.write_text(chunk, encoding='utf-8')
```
Each chunk stays linked to its parent document's frontmatter through its path under chunks/, and should carry enough surrounding context (its own heading, at minimum) to be self-explanatory in retrieval results.
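One way to make that inheritance explicit is to parse the parent document's frontmatter before chunking and attach it to every chunk. A naive sketch that only handles flat `key: value` lines (`split_frontmatter` is a hypothetical helper; a real pipeline would use the python-frontmatter package or PyYAML):

```python
import re

def split_frontmatter(md: str):
    """Return (metadata dict, body) for a document with simple key: value frontmatter."""
    m = re.match(r'---\n(.*?)\n---\n', md, flags=re.S)
    if not m:
        return {}, md
    meta = {}
    for line in m.group(1).splitlines():
        key, _, value = line.partition(':')
        meta[key.strip()] = value.strip()
    return meta, md[m.end():]

meta, body = split_frontmatter("---\ntitle: Auth\nsource_id: own_docs\n---\n\n## Intro\n")
```

The resulting dict can then be stored as per-chunk metadata in the vector store, which is what makes the filtering in layer 3 work.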
Layer 5: embedding
Two viable paths: local embeddings via sentence-transformers, or hosted embeddings via OpenAI/Cohere/Voyage.
Local with sentence-transformers: free per inference after the one-time model download, runs on CPU or GPU, no data leaves your environment. Quality is excellent for English; multilingual models (paraphrase-multilingual-mpnet-base-v2) cover 50+ languages. Ideal when the corpus is sensitive or the volume is high enough that per-call API costs add up.
```python
import json
from pathlib import Path

from sentence_transformers import SentenceTransformer

model = SentenceTransformer('all-mpnet-base-v2')

chunks_data = []
for f in Path('chunks').rglob('*.md'):
    chunks_data.append({'path': str(f), 'text': f.read_text(encoding='utf-8')})

texts = [c['text'] for c in chunks_data]
embeddings = model.encode(texts, batch_size=32, show_progress_bar=True, normalize_embeddings=True)
for c, e in zip(chunks_data, embeddings):
    c['embedding'] = e.tolist()

Path('embeddings.json').write_text(json.dumps(chunks_data))
```
Hosted via OpenAI: trivially scalable, no infrastructure, fewer model-management headaches. Per-call cost is small (text-embedding-3-small is fractions of a cent per 1K tokens) but accumulates at high volumes. Quality is excellent and consistent. Use when you do not want to manage a model.
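The hosted path is a few lines. A sketch assuming the official openai Python client and an OPENAI_API_KEY in the environment (`batch` and `embed_hosted` are illustrative helpers; batching keeps individual requests well under the endpoint's per-call input limit):

```python
def batch(items, size):
    """Yield fixed-size batches so one oversized request doesn't fail the whole run."""
    for i in range(0, len(items), size):
        yield items[i:i + size]

def embed_hosted(texts, model='text-embedding-3-small', batch_size=256):
    from openai import OpenAI  # official client; reads OPENAI_API_KEY from the environment
    client = OpenAI()
    vectors = []
    for group in batch(texts, batch_size):
        resp = client.embeddings.create(model=model, input=group)
        vectors.extend(d.embedding for d in resp.data)
    return vectors
```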
The choice rarely makes a meaningful difference at the retrieval-quality level for typical corpora — both produce good embeddings. The decision usually comes down to data residency and per-call economics.
Layer 6: vector storage
Three reasonable choices, depending on how much infrastructure you want to run.
ChromaDB: the default for local dev and small-to-mid production. Embedded mode runs in-process; client-server mode for shared use. Free, OSS, no managed-service bill. Excellent for corpora up to a few million vectors.
```python
import chromadb

client = chromadb.PersistentClient(path='./chroma_db')
collection = client.get_or_create_collection('knowledge_base')
collection.add(
    ids=[f'chunk_{i}' for i in range(len(chunks_data))],
    documents=[c['text'] for c in chunks_data],
    embeddings=[c['embedding'] for c in chunks_data],
    metadatas=[{'source_path': c['path']} for c in chunks_data],
)

# Embed the query with the same model used at index time (`model` from layer 5).
# Passing query_texts would invoke Chroma's default embedding function, whose
# vectors have a different dimensionality than all-mpnet-base-v2's.
query_vec = model.encode(['How do I rotate API keys?'], normalize_embeddings=True)
results = collection.query(
    query_embeddings=query_vec.tolist(),
    n_results=5,
)
for doc, meta in zip(results['documents'][0], results['metadatas'][0]):
    print(meta['source_path'])
    print(doc[:200])
    print('---')
```
Pinecone: hosted, fully managed, scales effortlessly, good for production where you do not want to operate infrastructure. Pay per index per month plus per-operation. The right choice if your team is small and your traffic is high.
Weaviate: self-hosted or managed, supports hybrid search (vector + BM25) natively, modular vectorization. Strong choice when retrieval quality matters and you want sparse-and-dense hybrid retrieval out of the box.
Other viable options: Qdrant (similar to Chroma, slightly more production-ready), Milvus (when you have hundreds of millions of vectors), pgvector (when you already run Postgres and the volume is modest).
Layer 7: update strategy
The corpus you ship on day one is stale by week four. Sources change; pages get updated; new documents appear; old ones get retired. The update strategy has two dimensions: detection and reprocessing.
Detection: how do you know a source has changed?
- Sitemap or RSS lastmod / pubDate fields — cheap, reliable when the source publishes them honestly.
- HTTP HEAD with Last-Modified — universal, sometimes lies.
- Content hash — fetch and hash; if the hash differs, re-process. Most expensive, most reliable.
- Webhook from the source (if you control it) — the gold standard, when available.
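The content-hash approach is a few lines. A sketch where `fingerprint` and `has_changed` are illustrative helpers and the known hash is whatever was stored on the previous run:

```python
import hashlib

def fingerprint(body: bytes) -> str:
    """Stable content hash; store one per URL and compare on the next fetch."""
    return hashlib.sha256(body).hexdigest()

def has_changed(url: str, known_hash: str) -> bool:
    import requests
    resp = requests.get(url, timeout=30)
    resp.raise_for_status()
    return fingerprint(resp.content) != known_hash
```

Hashing the extracted Markdown rather than the raw HTML avoids spurious re-processing when only ads, timestamps, or nav markup change.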
Reprocessing: when a source changes, you need to re-convert, re-chunk, re-embed, and replace in the vector store. The re-embedding is the most expensive step; for large corpora, only re-embed the chunks whose text actually changed (compare hashes) rather than all chunks of the changed document.
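That chunk-level diff can be sketched as follows (`diff_chunks` is an illustrative helper; `old` maps stored chunk ids to their content hashes):

```python
import hashlib

def chunk_hash(text: str) -> str:
    return hashlib.sha256(text.encode('utf-8')).hexdigest()

def diff_chunks(old: dict, new_chunks: list):
    """Return (chunks needing embedding, stale chunk ids to delete from the store)."""
    new_hashes = {chunk_hash(c): c for c in new_chunks}
    old_hashes = set(old.values())
    to_embed = [c for h, c in new_hashes.items() if h not in old_hashes]
    to_delete = [cid for cid, h in old.items() if h not in new_hashes]
    return to_embed, to_delete
```

Only `to_embed` hits the embedding model; unchanged chunks keep their existing vectors.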
A scheduled job that processes the sources.yaml file according to each source's refresh cadence is the right shape. Cron + Python script is enough for most teams; for higher reliability use Airflow, Prefect, or a managed equivalent.
Cross-reference: PDFs in the corpus
Most real knowledge bases include PDF sources alongside web sources — vendor manuals, regulatory text, scanned reference material. The conversion layer extends naturally: same Markdown output format, same chunking, same embedding, same storage. See PDF to Markdown for RAG: complete pipeline guide for the parallel walkthrough on the PDF side, including OCR for scanned material and special handling for layout-rich documents.
For a corpus that is 70% web and 30% PDF, both pipelines feed the same chunker and the same vector store. The frontmatter source_type field lets retrieval results indicate whether a chunk came from a web page or a PDF.
Common failure modes
Five recurring mistakes worth avoiding.
- Chunking too aggressively: 200-token chunks lose context. Aim for 500-800 tokens with semantic boundaries. The retriever will surface multiple chunks per query — give each chunk enough context to stand on its own.
- Embedding the noisy raw HTML: skipping the conversion step and embedding raw HTML pulls nav-bar text and footer copyright into the vector, polluting the embedding. Convert first, embed clean Markdown.
- No frontmatter metadata: without source_id, fetched_at, and section tags on each chunk, you cannot filter retrievals or audit which sources contributed to a generation. Always carry metadata through.
- Re-embedding everything on every update: hash-based diffing at the chunk level is the difference between a 5-minute incremental update and a 5-hour full re-index.
- No evaluation harness: a knowledge base without a test set of representative queries (with expected sources) cannot be improved. Build the eval harness on day one with 20-50 queries and measure precision-at-5 every time you change the chunker or the embedding model.
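A minimal harness for that metric. `precision_at_k` and `run_eval` are illustrative; `retrieve` is whatever function wraps your vector store query and returns ranked source paths:

```python
def precision_at_k(retrieved: list, expected: set, k: int = 5) -> float:
    """Fraction of the top-k retrieved source paths that are in the expected set."""
    return sum(1 for r in retrieved[:k] if r in expected) / k

def run_eval(queries, retrieve, k=5):
    """queries: list of (query_text, expected_paths). Returns mean precision-at-k."""
    scores = [precision_at_k(retrieve(q), set(expected), k) for q, expected in queries]
    return sum(scores) / len(scores)
```

Run it before and after every chunker or embedding-model change; a regression in mean precision-at-5 is the earliest signal that a "improvement" hurt retrieval.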
For the conversion layer specifically, see content extraction comparisons and JavaScript-rendered page handling. For one-off conversions of unusual sources during corpus development, the web tool is the right shortcut. For everything at scale, run the OSS stack locally — Playwright + Trafilatura + sentence-transformers + ChromaDB covers the entire pipeline with no recurring vendor bill.
The architecture, in one sentence
List your sources, convert each to clean Markdown, organize by source and section with metadata in the frontmatter, chunk by Markdown structure (not fixed size), embed with sentence-transformers locally or OpenAI hosted, store in ChromaDB or Pinecone, and re-process on a per-source cadence with hash-based diffing. Everything else is implementation detail. The substrate that makes the whole thing tractable is Markdown — clean, structured, semantic plaintext is what every layer of the pipeline expects.