Why MarkdownNodeParser is the right primitive for Word content
LlamaIndex thinks in nodes: discrete units of content with metadata that flow through indexing, retrieval, and synthesis. MarkdownNodeParser is purpose-built to turn structured Markdown into well-formed nodes. Each heading boundary starts a new node, the heading path is stored in that node's metadata, and downstream retrieval therefore follows the document's actual section structure.
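To make the heading-boundary idea concrete, here is a minimal standard-library sketch of that kind of split. It is a simplified stand-in for illustration, not MarkdownNodeParser's actual implementation: the `Node` class, the `split_on_headings` function, and the `heading_path` metadata key are all hypothetical names, and the sample policy text is made up.

```python
import re
from dataclasses import dataclass, field

# ATX-style headings only ("# Title" through "###### Title").
HEADING = re.compile(r"^(#{1,6})\s+(.*\S)\s*$")


@dataclass
class Node:
    """Hypothetical stand-in for a node: text plus a metadata dict."""
    text: str
    metadata: dict = field(default_factory=dict)


def split_on_headings(markdown: str) -> list[Node]:
    """Start a new node at every heading and record the heading path."""
    nodes: list[Node] = []
    stack: list[tuple[int, str]] = []  # (level, title) of the open headings
    buffer: list[str] = []
    path = "/"
    in_fence = False

    def flush() -> None:
        # Emit the section collected so far, tagged with its heading path.
        text = "\n".join(buffer).strip()
        if text:
            nodes.append(Node(text, {"heading_path": path}))
        buffer.clear()

    for line in markdown.splitlines():
        if line.lstrip().startswith("```"):
            in_fence = not in_fence  # ignore '#' lines inside code fences
        match = None if in_fence else HEADING.match(line)
        if match:
            flush()  # close the previous section under its own path
            level, title = len(match.group(1)), match.group(2)
            while stack and stack[-1][0] >= level:
                stack.pop()  # leave sibling or deeper sections
            stack.append((level, title))
            path = "/" + "/".join(t for _, t in stack)
        buffer.append(line)
    flush()
    return nodes


# Made-up sample document for illustration.
sample = """# Leave policy
Applies to all staff.

## Annual leave
25 days per year.

## Sick leave
Notify your manager.
"""

nodes = split_on_headings(sample)
for node in nodes:
    print(node.metadata["heading_path"], "->", len(node.text), "chars")
```

In this sketch each node carries its own heading as the last element of the path. The real parser may format the path differently, so check the metadata on the nodes it actually produces before relying on a particular shape.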
Convert each Word document to Markdown, persist the .md file, then load and parse it with the standard LlamaIndex stack. Hand-correct any conversion errors in the .md file before parsing: the corrected version persists, so every subsequent index rebuild benefits.
Multi-source corpus pattern
For knowledge bases that mix Word policies, PDF contracts, web reference docs, and audio meeting notes, convert each modality to Markdown (Word here, PDFs via PDF for LlamaIndex, URLs via URL for LlamaIndex, audio via Audio for LlamaIndex), then run the same MarkdownNodeParser pipeline across all of them. Source-modality metadata (e.g. type: contract vs type: meeting-notes) then serves as a retrieval filter.