Where Word-to-RAG pipelines fall apart without Markdown
Two failure modes show up immediately. First, plain-text loaders such as Docx2txt flatten heading structure: H1/H2/H3 become indistinguishable runs of text, so chunking by character count slices through topic boundaries. Second, a .docx file is a ZIP archive of XML parts, and any markup or metadata that leaks into the extracted text adds noise for embedding models that expect natural-language input.
Pre-converting to clean Markdown solves both. Heading hierarchy is preserved as #/##/### markers, which Markdown-aware splitters can treat as semantic boundaries. The XML envelope is stripped, so embeddings encode prose, not metadata.
Honest workflow note
The web tool at Word to Markdown converts one document at a time. For a corpus of 20 to 200 documents, this is a manageable progressive workflow: convert as you onboard each policy or spec into the knowledge base. For true mass migration of thousands of documents, run Pandoc locally (pandoc input.docx -o output.md in a shell loop). If you need programmatic control, python-docx can read each paragraph and its style, but it does not write Markdown itself, so you have to emit the #/##/### markers in your own code. The web tool is the right surface for ad-hoc and progressive enterprise use; local OSS is the right surface for batch automation.
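For the batch case, the shell loop can also be driven from Python. The following is a minimal sketch, assuming pandoc is installed and on PATH; the function names pandoc_command and convert_corpus and the directory layout are hypothetical, and the only Pandoc invocation used is the pandoc input.docx -o output.md form quoted above.

```python
import shutil
import subprocess
from pathlib import Path


def pandoc_command(src: Path, out_dir: Path) -> list[str]:
    """Build the invocation for one file: pandoc input.docx -o output.md."""
    return ["pandoc", str(src), "-o", str(out_dir / (src.stem + ".md"))]


def convert_corpus(src_dir: Path, out_dir: Path) -> list[Path]:
    """Convert every .docx directly inside src_dir; return the .md paths written."""
    if shutil.which("pandoc") is None:
        raise RuntimeError("pandoc not found on PATH")
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for src in sorted(src_dir.glob("*.docx")):
        cmd = pandoc_command(src, out_dir)
        # check=True raises CalledProcessError if pandoc fails on a file,
        # so a broken document stops the run instead of being skipped silently.
        subprocess.run(cmd, check=True)
        written.append(Path(cmd[-1]))
    return written
```

Calling convert_corpus(Path("policies"), Path("policies_md")) would convert one folder in a single pass; both folder names are placeholders. Stopping on the first failure is a deliberate choice for a migration, where a silently missing document is harder to notice than a crash.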
Recommended pipeline
Convert each .docx to .md (web tool for progressive, Pandoc locally for batch). Split first by H1/H2 (H1 for the document, H2 for each section), then sub-split anything over 800 tokens with a recursive character splitter. Keep the document title and section path as chunk metadata, so that retrieval can filter by document, scope to specific sections, or boost by metadata. Building a multi-source pipeline? Combine with PDFs, web pages, audio, and video the same way.
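The split-then-sub-split step can be sketched with the standard library alone. This is an illustration under stated assumptions, not a reference implementation: tokens are approximated by whitespace-separated words (a real pipeline would count with the embedding model's tokenizer), the separator order is one common choice for a recursive character splitter, and the names Chunk, split_sections, recursive_split and chunk_markdown are hypothetical.

```python
import re
from dataclasses import dataclass

MAX_TOKENS = 800  # sub-split threshold from the text
HEADING = re.compile(r"^(#{1,2})\s+(.*\S)\s*$")  # matches H1 and H2 only


@dataclass
class Chunk:
    title: str         # document title (first H1)
    section_path: str  # e.g. "Travel Policy > Expenses"
    text: str


def count_tokens(text: str) -> int:
    # Assumption: whitespace words stand in for model tokens.
    return len(text.split())


def split_sections(markdown: str) -> list[tuple[str, str, str]]:
    """Return (h1, h2, body) per H1/H2 section, ignoring '#' inside code fences."""
    sections, h1, h2, body, in_fence = [], "", "", [], False
    for line in markdown.splitlines():
        if line.lstrip().startswith("```"):
            in_fence = not in_fence
        match = None if in_fence else HEADING.match(line)
        if not match:
            body.append(line)  # H3 and deeper stay inside the section body
            continue
        if "".join(body).strip():
            sections.append((h1, h2, "\n".join(body).strip()))
        body = []
        if len(match.group(1)) == 1:
            h1, h2 = match.group(2), ""
        else:
            h2 = match.group(2)
    if "".join(body).strip():
        sections.append((h1, h2, "\n".join(body).strip()))
    return sections


def recursive_split(text: str, limit: int,
                    seps: tuple[str, ...] = ("\n\n", "\n", ". ", " ")) -> list[str]:
    """Pack pieces on the coarsest separator, recursing only into oversize ones."""
    if count_tokens(text) <= limit or not seps:
        return [text]
    sep, finer = seps[0], seps[1:]
    pieces, current = [], ""
    for part in text.split(sep):
        candidate = part if not current else current + sep + part
        if count_tokens(candidate) <= limit:
            current = candidate
        else:
            if current:
                pieces.append(current)
            current = part
    if current:
        pieces.append(current)
    out = []
    for piece in pieces:
        out.extend(recursive_split(piece, limit, finer))
    return out


def chunk_markdown(markdown: str, limit: int = MAX_TOKENS) -> list[Chunk]:
    """Split by H1/H2, sub-split oversize sections, attach title and section path."""
    sections = split_sections(markdown)
    title = next((h1 for h1, _, _ in sections if h1), "")
    chunks = []
    for h1, h2, body in sections:
        path = " > ".join(p for p in (h1, h2) if p)
        for piece in recursive_split(body, limit):
            chunks.append(Chunk(title=title, section_path=path, text=piece))
    return chunks
```

Each Chunk carries the two metadata fields named above, so a vector store can filter on title or match a prefix of section_path. The heading line itself is kept out of the chunk text here; prepending section_path to the text before embedding is a reasonable variation.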