Clean HTML5, not Word's polluted export
Word's native HTML export ("Save as Web Page") is notorious for producing bloated output: Mso* classes, mso-* style properties, <o:p> tags, Microsoft conditional comments, and inline styles on nearly every element. mdisbetter's DOCX to HTML pipeline ignores all of that and re-emits the document as semantic HTML5: <h1> through <h6>, <p>, <ul>/<ol>, <table>, <strong>, <em>, <a>. Paste the result into WordPress, Ghost, Contentful, or a static site without a cleanup pass.
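To make the difference concrete, here is a minimal Python sketch that scans a piece of markup for the Word artifacts listed above. The function name `find_word_artifacts` and the regular expressions are illustrative assumptions, not part of mdisbetter, and the patterns only catch the most common cases.

```python
import re

# Hypothetical detectors for common Word export residue (illustrative only).
WORD_ARTIFACTS = {
    "mso-style-property": re.compile(r"\bmso-[a-z-]+", re.IGNORECASE),
    "Mso-class": re.compile(r"class=[\"']?Mso\w+", re.IGNORECASE),
    "o:p-tag": re.compile(r"</?o:p\b", re.IGNORECASE),
    "conditional-comment": re.compile(r"<!--\[if\b", re.IGNORECASE),
}


def find_word_artifacts(markup: str) -> list[str]:
    """Return the names of the artifact kinds found in the markup."""
    return [name for name, pattern in WORD_ARTIFACTS.items() if pattern.search(markup)]


word_like = (
    "<p class=MsoNormal><span style='mso-bidi-font-weight:bold'>"
    "Hi<o:p></o:p></span></p>"
)
clean = "<p>Hi</p>"

print(find_word_artifacts(word_like))
print(find_word_artifacts(clean))
```

A check like this can run in a test suite to confirm that converted output stays free of Word-specific markup.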
What gets preserved
Heading hierarchy (Word's Heading 1/2/3 styles map to <h1>/<h2>/<h3>). Paragraph breaks. Bullet and numbered lists with nesting. Tables including merged cells (colspan/rowspan). Bold, italic, underline. Hyperlinks. Embedded images (base64-encoded by default for self-contained output). Document metadata (title, author) emitted as <meta> tags in the <head>.
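The heading mapping can be sketched in a few lines of Python. This is an assumption about how such a mapping might look, not mdisbetter's actual code: `style_to_tag` and `render_block` are hypothetical helpers, and any style other than Heading 1 through Heading 6 falls back to a paragraph.

```python
import html
import re

# Matches Word's built-in heading style names, e.g. "Heading 2".
HEADING_STYLE = re.compile(r"Heading ([1-6])")


def style_to_tag(style_name: str) -> str:
    """Map a Word paragraph style name to an HTML5 tag name."""
    match = HEADING_STYLE.fullmatch(style_name.strip())
    return f"h{match.group(1)}" if match else "p"


def render_block(style_name: str, text: str) -> str:
    """Wrap escaped text in the tag chosen for its Word style."""
    tag = style_to_tag(style_name)
    return f"<{tag}>{html.escape(text)}</{tag}>"


print(render_block("Heading 1", "Quarterly Report"))
print(render_block("Normal", "Revenue & costs"))
```

Escaping the text before wrapping it keeps characters such as & and < from breaking the emitted markup.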
For AI workflows, Markdown is cheaper
HTML typically uses 30-40% more tokens than the equivalent Markdown for the same content: every <tag> and </tag> pair costs tokens, where Markdown uses a short marker of one or two characters. DOCX to Markdown is the right target when the document is destined for ChatGPT, Claude, or any LLM API.
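As a rough illustration of where the overhead comes from, the Python sketch below compares markup overhead for the same content in both formats. It counts characters rather than tokens, which is only a proxy, since real token counts depend on the model's tokenizer. The sample snippets and the `markup_overhead` helper are made up for this example.

```python
# The same content expressed in HTML and in Markdown (illustrative samples).
visible_words = ["Setup", "Install", "Run"]
html_snippet = "<h2>Setup</h2><ul><li>Install</li><li>Run</li></ul>"
markdown_snippet = "## Setup\n\n- Install\n- Run\n"


def markup_overhead(document: str, words: list[str]) -> int:
    """Characters in the document that are not part of the visible words."""
    return len(document) - sum(len(word) for word in words)


html_overhead = markup_overhead(html_snippet, visible_words)
markdown_overhead = markup_overhead(markdown_snippet, visible_words)

print(f"HTML overhead: {html_overhead} characters")
print(f"Markdown overhead: {markdown_overhead} characters")
```

For a real measurement, run both versions of a document through the tokenizer of the model you plan to use.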