The pipeline at a glance
Convert each .docx to .md (the Word to Markdown web tool for occasional, one-at-a-time use; Pandoc locally for batch). Split by heading structure with a Markdown-aware splitter so that chunks follow section boundaries. Embed each chunk with your model of choice (for example text-embedding-3-large, voyage-3, or BGE-large). Upsert to your vector DB with chunk metadata: document title, section path, owner team, and last-modified date.
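The split-and-attach-metadata step can be sketched in plain Python. This is a minimal illustration, not a specific library's splitter: the names `Chunk` and `split_by_headings`, the " > " separator in the section path, and the sample metadata values are all assumptions made for the example. It assumes the .docx has already been converted to Markdown with ATX-style (`#`) headings, for instance with something like `pandoc policy.docx -t gfm -o policy.md`, where the file names are placeholders.

```python
import re
from dataclasses import dataclass

# ATX headings only ("# Title" to "###### Title"); Setext headings are not handled.
HEADING_RE = re.compile(r"^(#{1,6})\s+(.+?)\s*$")
FENCE_RE = re.compile(r"^(```|~~~)")


@dataclass
class Chunk:
    text: str
    metadata: dict


def split_by_headings(markdown: str, base_metadata: dict) -> list[Chunk]:
    """Split Markdown into one chunk per heading section.

    Each chunk carries the document-level metadata plus a section path
    such as "Travel Policy > Expenses > Meals".
    """
    chunks: list[Chunk] = []
    path: list[tuple[int, str]] = []  # stack of (heading level, title)
    buffer: list[str] = []
    in_fence = False

    def flush() -> None:
        text = "\n".join(buffer).strip()
        if text:
            section_path = " > ".join(title for _, title in path)
            chunks.append(Chunk(text, {**base_metadata, "section_path": section_path}))
        buffer.clear()

    for line in markdown.splitlines():
        if FENCE_RE.match(line):
            in_fence = not in_fence  # do not treat "# comment" inside code as a heading
        match = None if in_fence else HEADING_RE.match(line)
        if match:
            flush()  # close the previous section under its own path
            level = len(match.group(1))
            while path and path[-1][0] >= level:
                path.pop()
            path.append((level, match.group(2)))
        buffer.append(line)
    flush()
    return chunks


# Hypothetical document and metadata values, for illustration only.
doc = "# Travel Policy\nIntro.\n\n## Expenses\nKeep receipts.\n\n### Meals\nCap applies.\n\n## Approval\nManager signs.\n"
meta = {"title": "Travel Policy", "owner_team": "finance", "last_modified": "2024-01-15"}
for chunk in split_by_headings(doc, meta):
    print(chunk.metadata["section_path"])
```

Each `Chunk.text` would then go to the embedding model, and `Chunk.metadata` would be stored alongside the vector at upsert time. Very long sections would still need a secondary size-based split, which this sketch leaves out.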
Vector DB choice for Word corpora
Pinecone: managed, fast, no infra. Best for teams that want to ship without operating a database.
Chroma: local, simple, ideal for prototyping a few hundred documents before scaling.
Weaviate: hybrid retrieval (BM25 + vector). Useful for policy documents where exact phrase matches matter (regulatory clause names, defined terms).
Qdrant: filter-heavy retrieval. Useful when queries scope to specific document types, owners, or date ranges.
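To make the idea of hybrid retrieval concrete, here is a small sketch of one common way to merge a keyword (BM25) ranking with a vector ranking: reciprocal rank fusion. This is a generic illustration and not Weaviate's API; the fusion a given database applies internally may differ. The function name, the constant `k = 60`, and the chunk ids are assumptions chosen for the example.

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge several ranked lists of ids into one.

    Each id scores 1 / (k + rank) per list it appears in (rank starts at 1),
    so ids ranked well by both keyword and vector search rise to the top.
    Ties are broken alphabetically to keep the output deterministic.
    """
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical result lists for a query that names a regulatory clause.
keyword_hits = ["clause-7", "clause-2", "clause-9"]  # exact phrase matches first
vector_hits = ["clause-2", "clause-4", "clause-7"]   # semantically similar chunks
print(reciprocal_rank_fusion([keyword_hits, vector_hits]))
```

Chunks that appear in both lists ("clause-2" and "clause-7" here) outrank chunks found by only one retriever, which is the behavior that makes hybrid search useful for defined terms and clause names.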
Multi-source corpus pattern
Most enterprise knowledge bases mix modalities: Word policies + PDF contracts + URL reference docs + audio meeting recordings + video training. Convert each modality to Markdown (Word as described here, plus PDF, URL, audio, and video), split the result with the same Markdown-aware splitter, then embed and upsert with source-modality metadata. Retrieval can then scope to "policies only" or "all modalities" depending on the query.
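Scoping retrieval by modality amounts to a metadata filter applied before similarity ranking. The sketch below uses an in-memory list as a stand-in for the vector DB; the `source_modality` field name, the record layout, the two-dimensional toy vectors, and the `search` function are all assumptions for illustration, not any particular database's API.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def search(index, query_vector, modalities=None, top_k=3):
    """Return the ids of the top_k most similar records.

    If `modalities` is given, only records whose source_modality is in
    that set are considered; None means "all modalities".
    """
    candidates = [
        record for record in index
        if modalities is None or record["metadata"]["source_modality"] in modalities
    ]
    ranked = sorted(candidates, key=lambda r: cosine(query_vector, r["vector"]), reverse=True)
    return [record["id"] for record in ranked[:top_k]]


# Toy index with made-up ids and vectors.
index = [
    {"id": "policy-travel", "vector": [1.0, 0.0], "metadata": {"source_modality": "word"}},
    {"id": "contract-acme", "vector": [0.9, 0.1], "metadata": {"source_modality": "pdf"}},
    {"id": "meeting-q3", "vector": [0.0, 1.0], "metadata": {"source_modality": "audio"}},
]

print(search(index, [1.0, 0.0]))                      # all modalities
print(search(index, [1.0, 0.0], modalities={"word"}))  # policies only
```

In a real deployment the filter would be passed to the vector DB's own query interface so that it runs server-side rather than over results already fetched.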