How the DOCX Format Works Internally (And Why Conversion Is Hard)
If you have only ever interacted with .docx files through Microsoft Word's UI, the format looks simple: a document with text, headings, tables, and images. If you have ever tried to write code that extracts content from a .docx programmatically, the format reveals itself as anything but simple. A .docx file is a ZIP archive containing a dozen or more XML files, each describing a different aspect of the document — the content, the styles, the relationships between parts, the embedded media, the document properties. Understanding this internal structure is the difference between writing a naive text extractor that loses half the semantic information and writing a real conversion pipeline that preserves heading hierarchy, table structure, and the style information that makes Markdown output usable. This article walks through what's actually inside a .docx, why naive extraction fails, and why styles.xml is the secret to high-quality Word-to-Markdown conversion.
The Office Open XML (OOXML) standard
The .docx format is a member of the Office Open XML (OOXML) family, standardized as ECMA-376 and ISO/IEC 29500. Microsoft introduced it with Office 2007, replacing the older binary .doc format. The one-letter difference in the extension obscures a complete architectural rewrite — .doc files were proprietary binary streams that only Microsoft's own libraries could parse reliably; .docx files are ZIP archives of standardized XML that any developer can inspect with built-in tools.
The OOXML standard itself runs to thousands of pages — it covers Word documents (.docx), Excel workbooks (.xlsx), and PowerPoint presentations (.pptx) under one architectural umbrella. The shared structure is called the Open Packaging Convention (OPC): a ZIP container holding XML "parts" that reference each other through "relationships." Word's contribution is the WordprocessingML schema that defines the actual document structure inside the container.
Despite its formal-standards heritage, real .docx files in the wild deviate from the spec in dozens of ways. Word's own implementation is the de facto reference. Conversion tools have to handle the documented spec plus the undocumented deviations Word introduced over the years — which is one of the reasons this is harder than it looks.
What a .docx file actually contains
Take any .docx file and rename it to .zip — most operating systems will then open it as an archive. The contents look something like this:
my-document.docx (renamed to .zip)
├── [Content_Types].xml
├── _rels/
│ └── .rels
├── docProps/
│ ├── app.xml
│ ├── core.xml
│ └── custom.xml
├── word/
│ ├── document.xml ← the actual content
│ ├── styles.xml ← style definitions
│ ├── numbering.xml ← list-numbering definitions
│ ├── settings.xml ← document-wide settings
│ ├── webSettings.xml
│ ├── fontTable.xml
│ ├── theme/
│ │ └── theme1.xml
│ ├── _rels/
│ │ └── document.xml.rels
│ ├── media/
│ │ ├── image1.png
│ │ └── image2.jpeg
│ └── header1.xml, footer1.xml, etc.
└── customXml/

The interesting parts for a conversion tool:
- document.xml: the body of the document — paragraphs, runs, tables, in document order
- styles.xml: the style definitions document.xml references (Heading 1, Heading 2, Body Text, etc.)
- numbering.xml: how numbered and bulleted lists are constructed
- media/: the actual binary image files referenced from document.xml
- document.xml.rels: the relationships file that connects references in document.xml (image IDs, hyperlink IDs) to actual targets
Most other parts (settings, webSettings, fontTable, theme) carry information that's irrelevant for Markdown conversion — Markdown has no concept of fonts, themes, or page-rendering settings. The conversion tool ignores them.
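Because the container is plain ZIP, the standard library is enough to look inside. Here is a minimal sketch using Python's zipfile module; to keep it self-contained it builds a tiny stand-in archive in memory rather than reading a real file:

```python
# Sketch: a .docx is just a ZIP, so zipfile can open it directly.
# We build a minimal stand-in archive in memory so this runs anywhere;
# with a real file you would pass its path to ZipFile instead.
import io
import zipfile

buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as z:
    z.writestr("[Content_Types].xml", "<Types/>")
    z.writestr("word/document.xml", "<w:document/>")
    z.writestr("word/styles.xml", "<w:styles/>")

with zipfile.ZipFile(buf) as docx:
    parts = docx.namelist()  # every part in the package
    body = docx.read("word/document.xml").decode("utf-8")

print(parts)   # ['[Content_Types].xml', 'word/document.xml', 'word/styles.xml']
print(body)    # the raw WordprocessingML of the body
```

The same two calls — namelist() to enumerate parts and read() to pull one out — are the starting point of every .docx pipeline.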
Inside document.xml: paragraphs, runs, and tables
The body content of a Word document, in document.xml, is a sequence of paragraphs (<w:p>) and tables (<w:tbl>) and a few other top-level elements. Each paragraph is itself a sequence of "runs" (<w:r>), where a run is a span of text with consistent formatting.
A simple paragraph that says "Hello bold world" looks like this in document.xml:
<w:p>
<w:r>
<w:t xml:space="preserve">Hello </w:t>
</w:r>
<w:r>
<w:rPr>
<w:b/>
</w:rPr>
<w:t>bold</w:t>
</w:r>
<w:r>
<w:t xml:space="preserve"> world</w:t>
</w:r>
</w:p>

Three runs, the middle one with <w:b/> in its run properties (rPr) marking it as bold. The conversion tool walking document.xml needs to (a) extract the text from each run's <w:t> child, (b) apply the formatting from rPr, (c) concatenate runs within a paragraph, and (d) emit the paragraph in Markdown — which means emitting Hello **bold** world for this example.
This is already more nuanced than a naive text extractor would handle. The naive approach — scan the XML for <w:t> elements and dump their text content — produces "Hello bold world" with no formatting. The semantic information about which span was bold is in the run-properties metadata that has to be read alongside the text.
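A run-aware walk of that paragraph can be sketched with the standard-library ElementTree. This is a simplification — a real converter handles many more rPr children than <w:b/> — but it shows the principle:

```python
# Sketch: run-aware extraction of the "Hello bold world" paragraph,
# using only xml.etree.ElementTree. Real converters handle italics,
# underline, hyperlinks, and more; this checks only <w:b/>.
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W}

para = ET.fromstring(f"""
<w:p xmlns:w="{W}">
  <w:r><w:t xml:space="preserve">Hello </w:t></w:r>
  <w:r><w:rPr><w:b/></w:rPr><w:t>bold</w:t></w:r>
  <w:r><w:t xml:space="preserve"> world</w:t></w:r>
</w:p>
""")

def run_to_markdown(run):
    # Join the run's text nodes, then wrap in ** if rPr marks it bold.
    text = "".join(t.text or "" for t in run.findall("w:t", NS))
    bold = run.find("w:rPr/w:b", NS) is not None
    return f"**{text}**" if bold else text

markdown = "".join(run_to_markdown(r) for r in para.findall("w:r", NS))
print(markdown)  # Hello **bold** world
```

Note that the text of each run is taken as-is, trailing space included; that is the xml:space="preserve" behavior a naive extractor tends to break.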
The role of styles.xml: where headings come from
Now consider a Heading 1 in document.xml. The XML looks like this:
<w:p>
<w:pPr>
<w:pStyle w:val="Heading1"/>
</w:pPr>
<w:r>
<w:t>Chapter One</w:t>
</w:r>
</w:p>

The paragraph's properties (<w:pPr>) reference a style named "Heading1". The run inside contains the heading text. But document.xml does not contain the definition of what Heading1 means — that lives in styles.xml.
Inside styles.xml, the Heading1 style is defined:
<w:style w:type="paragraph" w:styleId="Heading1">
<w:name w:val="heading 1"/>
<w:basedOn w:val="Normal"/>
<w:next w:val="Normal"/>
<w:pPr>
<w:outlineLvl w:val="0"/>
</w:pPr>
<w:rPr>
<w:b/>
<w:sz w:val="32"/>
</w:rPr>
</w:style>

The critical piece for Markdown conversion: <w:outlineLvl w:val="0"/>. The outline level (0 for H1, 1 for H2, etc.) is the semantic indicator that this style is a heading and at which level. The visual properties (bold, font size 32) are how Word renders the heading visually but are not what makes it semantically a heading.
This is the secret that separates good and bad Word-to-Markdown converters. A naive converter sees a paragraph styled "Heading1" and might guess it's a heading based on the style name alone. A real converter resolves the style reference into styles.xml, finds the outline level, and emits the right number of pound signs in Markdown. For documents using non-default style names ("MyCustomTitle" instead of "Heading1"), only the styles.xml-aware approach gets it right.
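The resolution step can be sketched as a lookup from styleId to outline level. The style names here ("Heading1", "MyCustomTitle") are illustrative; the point is that the lookup keys on outlineLvl, not on the style's name:

```python
# Sketch: resolve a paragraph's pStyle reference against styles.xml and
# derive the Markdown heading prefix from the outline level. The two
# styles below are illustrative stand-ins for a real styles.xml part.
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W}

styles = ET.fromstring(f"""
<w:styles xmlns:w="{W}">
  <w:style w:type="paragraph" w:styleId="Heading1">
    <w:pPr><w:outlineLvl w:val="0"/></w:pPr>
  </w:style>
  <w:style w:type="paragraph" w:styleId="MyCustomTitle">
    <w:pPr><w:outlineLvl w:val="1"/></w:pPr>
  </w:style>
</w:styles>
""")

def outline_level(style_id):
    """Return the style's outline level, or None if it is not a heading."""
    for style in styles.findall("w:style", NS):
        if style.get(f"{{{W}}}styleId") == style_id:
            lvl = style.find("w:pPr/w:outlineLvl", NS)
            if lvl is not None:
                return int(lvl.get(f"{{{W}}}val"))
    return None

def heading_prefix(style_id):
    # Outline level 0 means H1, so emit level + 1 pound signs.
    lvl = outline_level(style_id)
    return "#" * (lvl + 1) + " " if lvl is not None else ""

print(heading_prefix("Heading1") + "Chapter One")      # # Chapter One
print(heading_prefix("MyCustomTitle") + "Subsection")  # ## Subsection
```

A style with no outlineLvl yields an empty prefix, so ordinary body styles fall through to plain paragraphs; that is exactly how the custom-named heading case keeps working.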
Why naive text extraction fails
Putting the pieces together, the naive approach to Word extraction — "open the .docx, extract document.xml, scan for <w:t> elements, concatenate their content" — fails in many ways:
- Loses heading vs paragraph distinction: every block of text becomes a flat paragraph; H1, H2, H3 all collapse to undifferentiated body text
- Loses bold/italic/underline: the text is preserved but inline formatting disappears
- Loses list structure: numbered and bulleted lists become indistinguishable from paragraphs
- Loses table structure: tables flatten to a stream of cell contents with no row/column structure
- Loses links: hyperlinks become plain text; the URL information is in document.xml.rels which the naive extractor ignored
- Loses images: image references appear in document.xml as relationship IDs that need resolution through .rels to find the actual file in /media/
- Mishandles whitespace: xml:space="preserve" matters; tools that don't respect it concatenate adjacent runs without spaces
For a one-page document the naive output might look acceptable. For a multi-page document with structure, it produces an undifferentiated wall of text that is meaningfully worse than even a copy-paste from Word into a plain editor. For Markdown conversion, the naive approach is non-viable.
The harder cases: lists, tables, and embedded objects
The basics above (paragraphs, runs, headings) are the easy part. The cases that make .docx parsing genuinely hard:
Numbered and bulleted lists. Lists in Word are not nested elements — each list item is a top-level paragraph with style "ListParagraph" or a similar style, plus a reference to a numbering definition in numbering.xml. The numbering definition specifies the format (decimal, lower-alpha, bullet, etc.), the indent levels, and the per-level format string. Reconstructing the visual list (1. 2. 3., a. b. c., etc.) requires walking numbering.xml and tracking per-level counters across the document. Pandoc and Mammoth both do this; naive extractors don't.
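The counter-tracking step can be sketched in isolation. Assume the numbering.xml resolution has already happened and each item arrives as an (indent level, format) pair; the real restart rules are more involved than the simplification here:

```python
# Sketch: reconstruct Markdown list markers from (level, format) pairs,
# as a converter would after resolving numbering.xml. Simplification:
# a new item at a level resets all deeper counters, which approximates
# (but does not fully implement) OOXML's restart rules.
def render_list(items):
    """items: sequence of (indent_level, num_format) where num_format is
    'decimal' or 'bullet'. Returns Markdown list lines."""
    counters = {}  # per-level counters, keyed by indent level
    lines = []
    for level, fmt in items:
        # Reset counters for any level deeper than the current item.
        for deeper in [l for l in counters if l > level]:
            del counters[deeper]
        counters[level] = counters.get(level, 0) + 1
        indent = "    " * level
        marker = f"{counters[level]}." if fmt == "decimal" else "-"
        lines.append(f"{indent}{marker} item")
    return lines

for line in render_list([(0, "decimal"), (1, "bullet"),
                         (1, "bullet"), (0, "decimal")]):
    print(line)
# 1. item
#     - item
#     - item
# 2. item
```

The key point is the statefulness: unlike paragraphs and runs, list markers cannot be emitted from a single element in isolation — the converter has to carry counters across the whole document walk.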
Tables. Tables in OOXML support nested tables, merged cells (vMerge for vertical merges, gridSpan for horizontal), per-cell borders, vertical alignment, and complex row/column structures. Markdown tables, by contrast, support exactly this much: rows separated by newlines, cells separated by pipes, and one header row marked by a row of dashes. The mismatch is large; the technical depth is covered in why Word tables are the hardest conversion problem.
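The Markdown side of that mismatch is small enough to sketch directly. Assume the cell text has already been extracted from the w:tbl structure; everything Markdown cannot represent (merges, borders, alignment) has by this point been discarded:

```python
# Sketch: flatten already-extracted table cells into a Markdown pipe
# table. Real OOXML tables also carry vMerge/gridSpan merges, which
# this deliberately ignores — Markdown cannot represent them anyway.
def table_to_markdown(rows):
    """rows: list of lists of cell text; the first row becomes the header."""
    header, *body = rows
    lines = [
        "| " + " | ".join(header) + " |",
        "| " + " | ".join("---" for _ in header) + " |",  # separator row
    ]
    for row in body:
        lines.append("| " + " | ".join(row) + " |")
    return "\n".join(lines)

print(table_to_markdown([["Name", "Role"],
                         ["Ada", "Engineer"],
                         ["Grace", "Admiral"]]))
```

Everything hard about table conversion happens before this function is called: deciding what to do with a merged cell, a nested table, or a cell containing multiple paragraphs.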
Embedded objects. Word can embed Excel sheets, equations (modern Office Math or legacy MathType), drawings, and arbitrary OLE objects. Each embedded object type has its own XML representation, often referencing additional parts in the .docx archive. Equations may render as MathML (which can convert to LaTeX-syntax math in Markdown) or as bitmap fallback (which extracts as image references with no semantic recovery).
Headers, footers, and footnotes. Headers (header1.xml, header2.xml, etc.) and footers carry per-page or per-section content that's separate from the document body. Most Markdown converters drop them — Markdown doesn't have a native header/footer concept. Footnotes (footnotes.xml) are referenced from document.xml via run-level references and need to be reattached as Markdown footnote syntax in the output.
Comments and revisions. Tracked changes (insertions, deletions) and comments live in their own parts. Most converters either accept the changes (output as if accepted) or drop them entirely; preserving the revision metadata in Markdown is unusual.
The library landscape
Three major libraries handle the heavy lifting of .docx parsing across the ecosystem:
- python-docx (Python): mature, focused on .docx specifically, good API for reading and writing. Used in many Python-based document automation pipelines.
- Mammoth.js (JavaScript / Python ports): semantic-focused, designed specifically for Word-to-HTML conversion with style mapping. Produces cleaner output than naive extraction by default; deep-dive on its tradeoffs is in Mammoth vs Pandoc vs AI.
- Pandoc (Haskell, with command-line interface): the universal document converter. Implements its own .docx reader as part of its multi-format pipeline. Excellent for batch conversion via CLI; the bash batch script throughout these articles uses Pandoc.
For more general-purpose Office-format extraction (text only, no semantic structure), Apache Tika, oletools, and docx2txt are common choices. They're useful for search-indexing pipelines where text-only is fine; not useful for Markdown conversion where structure matters.
The web tool's place in this ecosystem
The web tool at word-to-markdown sits on top of the same library landscape — using semantic .docx parsing to produce structured Markdown output rather than naive text extraction. The choice of underlying engine matters less than the architectural decision to read styles.xml properly and emit semantic Markdown structure. For an individual document conversion through a browser, the web tool produces output equivalent to running Pandoc locally; for bulk migration, run Pandoc directly on a corporate machine as covered in building an enterprise document migration pipeline.
Practical implications for Markdown quality
Knowing how .docx works internally explains the practical Markdown-output failure modes:
- Documents authored without using Heading styles produce flat Markdown — the converter cannot infer headings that aren't styled as headings, regardless of how visually heading-like they appear
- Documents with custom styles based on Heading1 sometimes lose the outline-level information depending on how the style was created; this is why style-guide compliance from contributors matters (covered in word to Markdown for content teams)
- Documents with complex tables often need manual cleanup post-conversion because the OOXML table model has features Markdown cannot represent
- Equation-heavy documents need careful review because the OOXML math representation and LaTeX-in-Markdown have impedance mismatches
- Heavily-styled documents (institutional templates with locked styles) sometimes have styles that look like headings visually but lack the outline-level metadata to convert as headings — these need manual fixup
For users converting a single document and wondering why the output isn't perfect, the answer is almost always: the input document didn't use Word's semantic styles consistently, so the converter couldn't infer the structure. The fix is upstream — train authors to use Heading 1, Heading 2, list buttons, and table inserts instead of manual formatting. This is what makes the conversion meaningfully better across the board.
For more on the underlying architecture see Mammoth vs Pandoc vs AI; for the real-world conversion challenges see why Word tables are the hardest conversion problem; for the enterprise pipeline see building an enterprise document migration pipeline.