· 11 min read · MDisBetter

How the DOCX Format Works Internally (And Why Conversion Is Hard)

If you have only ever interacted with .docx files through Microsoft Word's UI, the format looks simple: a document with text, headings, tables, and images. If you have ever tried to write code that extracts content from a .docx programmatically, the format reveals itself as anything but simple. A .docx file is a ZIP archive containing a dozen or more XML files, each describing a different aspect of the document — the content, the styles, the relationships between parts, the embedded media, the document properties. Understanding this internal structure is the difference between writing a naive text extractor that loses half the semantic information and writing a real conversion pipeline that preserves heading hierarchy, table structure, and the style information that makes Markdown output usable. This article walks through what's actually inside a .docx, why naive extraction fails, and why styles.xml is the secret to high-quality Word-to-Markdown conversion.

The Office Open XML (OOXML) standard

The .docx format is a member of the Office Open XML (OOXML) family, standardized as ECMA-376 and ISO/IEC 29500. Microsoft introduced it with Office 2007, replacing the older binary .doc format. The one-letter difference in the extension obscures a complete architectural rewrite: .doc files were proprietary binary streams that were hard to parse reliably outside Microsoft's own software, while .docx files are ZIP archives of standardized XML that any developer can inspect with built-in tools.

The OOXML standard itself runs to thousands of pages; it covers Word documents (.docx), Excel workbooks (.xlsx), and PowerPoint presentations (.pptx) under one architectural umbrella. The shared structure is defined by the Open Packaging Conventions (OPC): a ZIP container holding XML "parts" that reference each other through "relationships." Word's contribution is the WordprocessingML schema that defines the actual document structure inside the container.
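
To make the "parts" and "relationships" idea concrete, here is a minimal sketch that finds the main document part by reading the package-level relationships file. The demo package is a hypothetical two-part archive built in memory, not a complete .docx; the namespace and relationship-type URIs are the ones OOXML packages use.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

REL_NS = "http://schemas.openxmlformats.org/package/2006/relationships"
OFFICE_DOC = ("http://schemas.openxmlformats.org/officeDocument/2006/"
              "relationships/officeDocument")


def main_document_part(package: zipfile.ZipFile) -> str:
    """Return the archive path of the main document part named in _rels/.rels."""
    root = ET.fromstring(package.read("_rels/.rels"))
    for rel in root.iter(f"{{{REL_NS}}}Relationship"):
        if rel.get("Type") == OFFICE_DOC:
            # Targets in the root .rels are relative to the package root.
            return rel.get("Target").lstrip("/")
    raise ValueError("no officeDocument relationship found")


def build_demo_package() -> bytes:
    """Hypothetical stand-in package, just enough to exercise the lookup."""
    rels = (
        f'<Relationships xmlns="{REL_NS}">'
        f'<Relationship Id="rId1" Type="{OFFICE_DOC}" Target="word/document.xml"/>'
        "</Relationships>"
    )
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        zf.writestr("_rels/.rels", rels)
        zf.writestr("word/document.xml", "<document/>")
    return buf.getvalue()


with zipfile.ZipFile(io.BytesIO(build_demo_package())) as pkg:
    main_part = main_document_part(pkg)
    print(main_part)
```

Following the relationship rather than hard-coding word/document.xml is the more defensive choice, since the package format only promises the relationship, not the path.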

Despite its formal-standards heritage, real .docx files in the wild deviate from the spec in dozens of ways. Word's own implementation is the de facto reference. Conversion tools have to handle the documented spec plus the undocumented deviations Word introduced over the years — which is one of the reasons this is harder than it looks.

What a .docx file actually contains

Take any .docx file and rename it to .zip — most operating systems will then open it as an archive. The contents look something like this:

my-document.docx (renamed to .zip)
├── [Content_Types].xml
├── _rels/
│   └── .rels
├── docProps/
│   ├── app.xml
│   ├── core.xml
│   └── custom.xml
├── word/
│   ├── document.xml          ← the actual content
│   ├── styles.xml            ← style definitions
│   ├── numbering.xml         ← list-numbering definitions
│   ├── settings.xml          ← document-wide settings
│   ├── webSettings.xml
│   ├── fontTable.xml
│   ├── theme/
│   │   └── theme1.xml
│   ├── _rels/
│   │   └── document.xml.rels
│   ├── media/
│   │   ├── image1.png
│   │   └── image2.jpeg
│   └── header1.xml, footer1.xml, etc.
└── customXml/
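
The same listing can be produced without renaming anything, since Python's standard zipfile module reads the archive directly. This is a small sketch; `demo_docx` builds a hypothetical stand-in archive with placeholder contents so the example runs on its own.

```python
import io
import zipfile


def list_parts(docx_bytes: bytes) -> list[str]:
    """Return the part names inside a .docx archive, sorted."""
    with zipfile.ZipFile(io.BytesIO(docx_bytes)) as zf:
        return sorted(zf.namelist())


def demo_docx() -> bytes:
    """Hypothetical stand-in archive; real parts would hold real XML."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
        for name in ("[Content_Types].xml", "_rels/.rels",
                     "word/document.xml", "word/styles.xml",
                     "word/_rels/document.xml.rels"):
            zf.writestr(name, "<placeholder/>")
    return buf.getvalue()


parts = list_parts(demo_docx())
for name in parts:
    print(name)
```

For a file on disk, `zipfile.ZipFile("my-document.docx").namelist()` gives the same kind of list.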

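The interesting parts for a conversion tool:
word/document.xml: the body content (paragraphs, runs, tables).
word/styles.xml: the style definitions that paragraphs and runs refer to.
word/numbering.xml: the list-numbering definitions.
word/_rels/document.xml.rels: the relationships that link document.xml to images and other parts.
word/media/: the embedded image files.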

Most other parts (settings, webSettings, fontTable, theme) carry information that's irrelevant for Markdown conversion — Markdown has no concept of fonts, themes, or page-rendering settings. The conversion tool ignores them.

Inside document.xml: paragraphs, runs, and tables

The body content of a Word document, in document.xml, is a sequence of paragraphs (<w:p>), tables (<w:tbl>), and a few other block-level elements inside <w:body>. Each paragraph is itself a sequence of "runs" (<w:r>), where a run is a span of text with consistent formatting.

A simple paragraph that says "Hello bold world" looks like this in document.xml:

<w:p>
  <w:r>
    <w:t xml:space="preserve">Hello </w:t>
  </w:r>
  <w:r>
    <w:rPr>
      <w:b/>
    </w:rPr>
    <w:t>bold</w:t>
  </w:r>
  <w:r>
    <w:t xml:space="preserve"> world</w:t>
  </w:r>
</w:p>

Three runs, the middle one with <w:b/> in its run properties (rPr) marking it as bold. The conversion tool walking document.xml needs to (a) extract the text from each run's <w:t> child, (b) apply the formatting from rPr, (c) concatenate runs within a paragraph, and (d) emit the paragraph in Markdown — which means emitting Hello **bold** world for this example.

This is already more nuanced than a naive text extractor would handle. The naive approach — scan the XML for <w:t> elements and dump their text content — produces "Hello bold world" with no formatting. The semantic information about which span was bold is in the run-properties metadata that has to be read alongside the text.
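
Steps (a) through (d) can be sketched with the standard library alone. This is a minimal illustration that handles only bold, under the assumption that formatting is set directly on the run; a fuller converter would also consult character styles and other run properties. The function names are illustrative, not from any library.

```python
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W}

PARAGRAPH = f"""
<w:p xmlns:w="{W}">
  <w:r><w:t xml:space="preserve">Hello </w:t></w:r>
  <w:r><w:rPr><w:b/></w:rPr><w:t>bold</w:t></w:r>
  <w:r><w:t xml:space="preserve"> world</w:t></w:r>
</w:p>
""".strip()


def is_on(rpr, tag: str) -> bool:
    """True if a toggle property such as w:b is present and not switched off."""
    if rpr is None:
        return False
    el = rpr.find(f"w:{tag}", NS)
    if el is None:
        return False
    return el.get(f"{{{W}}}val", "true") not in ("0", "false")


def paragraph_to_markdown(p) -> str:
    out = []
    for run in p.findall("w:r", NS):
        # (a) text of the run, (b) its formatting, (c) concatenation below
        text = "".join(t.text or "" for t in run.findall("w:t", NS))
        if text.strip() and is_on(run.find("w:rPr", NS), "b"):
            text = f"**{text}**"
        out.append(text)
    return "".join(out)  # (d) the finished Markdown paragraph


markdown = paragraph_to_markdown(ET.fromstring(PARAGRAPH))
print(markdown)  # Hello **bold** world
```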

The role of styles.xml: where headings come from

Now consider a Heading 1 in document.xml. The XML looks like this:

<w:p>
  <w:pPr>
    <w:pStyle w:val="Heading1"/>
  </w:pPr>
  <w:r>
    <w:t>Chapter One</w:t>
  </w:r>
</w:p>

The paragraph's properties (<w:pPr>) reference a style by its ID, "Heading1". The run inside contains the heading text. But document.xml does not contain the definition of what Heading1 means; that lives in styles.xml.

Inside styles.xml, the Heading1 style is defined:

<w:style w:type="paragraph" w:styleId="Heading1">
  <w:name w:val="heading 1"/>
  <w:basedOn w:val="Normal"/>
  <w:next w:val="Normal"/>
  <w:pPr>
    <w:outlineLvl w:val="0"/>
  </w:pPr>
  <w:rPr>
    <w:b/>
    <w:sz w:val="32"/>
  </w:rPr>
</w:style>

The critical piece for Markdown conversion: <w:outlineLvl w:val="0"/>. The outline level (0 for H1, 1 for H2, etc.) is the semantic indicator that this style is a heading and at which level. The visual properties (bold, and <w:sz w:val="32"/>, which is 16 pt because w:sz is measured in half-points) are how Word renders the heading but are not what makes it semantically a heading.

This is the secret that separates good and bad Word-to-Markdown converters. A naive converter sees a paragraph styled "Heading1" and might guess it's a heading based on the style name alone. A real converter resolves the style reference into styles.xml, finds the outline level, and emits the right number of pound signs in Markdown. For documents using non-default style names ("MyCustomTitle" instead of "Heading1"), only the styles.xml-aware approach gets it right.
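
A sketch of that lookup, assuming styles.xml has already been read from the archive. The sample styles are hypothetical; "MyCustomTitle" declares no outline level of its own and inherits one through basedOn, which is why the resolver walks the style chain. Function names are illustrative.

```python
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W}
VAL = f"{{{W}}}val"

STYLES = f"""
<w:styles xmlns:w="{W}">
  <w:style w:type="paragraph" w:styleId="Heading1">
    <w:name w:val="heading 1"/>
    <w:pPr><w:outlineLvl w:val="0"/></w:pPr>
  </w:style>
  <w:style w:type="paragraph" w:styleId="MyCustomTitle">
    <w:name w:val="My Custom Title"/>
    <w:basedOn w:val="Heading1"/>
  </w:style>
  <w:style w:type="paragraph" w:styleId="Normal">
    <w:name w:val="Normal"/>
  </w:style>
</w:styles>
""".strip()


def outline_level(styles_root, style_id):
    """Follow basedOn links until a style declares w:outlineLvl; None if none does."""
    by_id = {s.get(f"{{{W}}}styleId"): s for s in styles_root.findall("w:style", NS)}
    seen = set()  # guards against a malformed basedOn cycle
    while style_id in by_id and style_id not in seen:
        seen.add(style_id)
        style = by_id[style_id]
        lvl = style.find("w:pPr/w:outlineLvl", NS)
        if lvl is not None:
            return int(lvl.get(VAL))
        parent = style.find("w:basedOn", NS)
        style_id = parent.get(VAL) if parent is not None else None
    return None


def heading_prefix(styles_root, style_id) -> str:
    level = outline_level(styles_root, style_id)
    # Outline level 9 means body text, and Markdown stops at six heading levels.
    if level is None or level > 5:
        return ""
    return "#" * (level + 1) + " "


styles_root = ET.fromstring(STYLES)
print(repr(heading_prefix(styles_root, "MyCustomTitle")))
```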

Why naive text extraction fails

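Putting the pieces together, the naive approach to Word extraction ("open the .docx, extract document.xml, scan for <w:t> elements, concatenate their content") fails in several ways: it drops run-level formatting such as bold, it loses heading levels because it never consults styles.xml, and it flattens paragraph boundaries, lists, and tables into one stream of text.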

For a one-page document the naive output might look acceptable. For a multi-page document with structure, it produces an undifferentiated wall of text that is meaningfully worse than even a copy-paste from Word into a plain editor. For Markdown conversion, the naive approach is non-viable.
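
For comparison, the naive approach takes only a few lines, and running it on a heading followed by a formatted paragraph shows what disappears. The sample body below is a hypothetical fragment, not a full document.xml.

```python
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"

BODY = f"""
<w:body xmlns:w="{W}">
  <w:p>
    <w:pPr><w:pStyle w:val="Heading1"/></w:pPr>
    <w:r><w:t>Chapter One</w:t></w:r>
  </w:p>
  <w:p>
    <w:r><w:t xml:space="preserve">Hello </w:t></w:r>
    <w:r><w:rPr><w:b/></w:rPr><w:t>bold</w:t></w:r>
    <w:r><w:t xml:space="preserve"> world</w:t></w:r>
  </w:p>
</w:body>
""".strip()


def naive_extract(xml_text: str) -> str:
    """Concatenate every w:t in document order, ignoring all structure."""
    root = ET.fromstring(xml_text)
    return "".join(t.text or "" for t in root.iter(f"{{{W}}}t"))


flat = naive_extract(BODY)
print(repr(flat))  # 'Chapter OneHello bold world'
```

The heading, the paragraph break, and the bold span are all gone; nothing in the output says where one paragraph ended and the next began.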

The harder cases: lists, tables, and embedded objects

The basics above (paragraphs, runs, headings) are the easy part. The cases that make .docx parsing genuinely hard:

Numbered and bulleted lists. Lists in Word are not nested elements — each list item is a top-level paragraph with style "ListParagraph" or a similar style, plus a reference to a numbering definition in numbering.xml. The numbering definition specifies the format (decimal, lower-alpha, bullet, etc.), the indent levels, and the per-level format string. Reconstructing the visual list (1. 2. 3., a. b. c., etc.) requires walking numbering.xml and tracking per-level counters across the document. Pandoc and Mammoth both do this; naive extractors don't.
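
The counter tracking can be sketched as follows. In document.xml, a list paragraph carries a <w:numPr> with a numId and an indent level (ilvl); numbering.xml maps that numId through an abstract definition to a format per level. This is a simplified sketch under stated assumptions: it ignores start values, level overrides, and every format other than bullet versus numbered. It is not how Pandoc or Mammoth implement it.

```python
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W}
VAL = f"{{{W}}}val"

NUMBERING = f"""
<w:numbering xmlns:w="{W}">
  <w:abstractNum w:abstractNumId="0">
    <w:lvl w:ilvl="0"><w:numFmt w:val="decimal"/></w:lvl>
    <w:lvl w:ilvl="1"><w:numFmt w:val="bullet"/></w:lvl>
  </w:abstractNum>
  <w:num w:numId="1"><w:abstractNumId w:val="0"/></w:num>
</w:numbering>
""".strip()


def item(text, ilvl):
    """Hypothetical helper that builds one list paragraph for the demo."""
    return (f'<w:p><w:pPr><w:numPr><w:ilvl w:val="{ilvl}"/><w:numId w:val="1"/>'
            f'</w:numPr></w:pPr><w:r><w:t>{text}</w:t></w:r></w:p>')


BODY = f'<w:body xmlns:w="{W}">{item("First", 0)}{item("Nested", 1)}{item("Second", 0)}</w:body>'


def level_formats(numbering_root):
    """Map (numId, ilvl) to the numFmt string by following num -> abstractNum."""
    abstract = {}
    for a in numbering_root.findall("w:abstractNum", NS):
        levels = {int(l.get(f"{{{W}}}ilvl")): l.find("w:numFmt", NS).get(VAL)
                  for l in a.findall("w:lvl", NS)}
        abstract[a.get(f"{{{W}}}abstractNumId")] = levels
    formats = {}
    for num in numbering_root.findall("w:num", NS):
        a_id = num.find("w:abstractNumId", NS).get(VAL)
        for ilvl, fmt in abstract.get(a_id, {}).items():
            formats[(num.get(f"{{{W}}}numId"), ilvl)] = fmt
    return formats


def lists_to_markdown(body_root, formats) -> str:
    counters, lines = {}, []
    for p in body_root.findall("w:p", NS):
        text = "".join(t.text or "" for t in p.iter(f"{{{W}}}t"))
        num_pr = p.find("w:pPr/w:numPr", NS)
        if num_pr is None:
            lines.append(text)
            continue
        num_id = num_pr.find("w:numId", NS).get(VAL)
        ilvl = int(num_pr.find("w:ilvl", NS).get(VAL))
        # Advance this level and reset any deeper levels of the same list.
        counters[(num_id, ilvl)] = counters.get((num_id, ilvl), 0) + 1
        for key in [k for k in counters if k[0] == num_id and k[1] > ilvl]:
            del counters[key]
        bullet = formats.get((num_id, ilvl)) == "bullet"
        marker = "-" if bullet else f"{counters[(num_id, ilvl)]}."
        lines.append("   " * ilvl + f"{marker} {text}")
    return "\n".join(lines)


formats = level_formats(ET.fromstring(NUMBERING))
list_md = lists_to_markdown(ET.fromstring(BODY), formats)
print(list_md)
```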

Tables. Tables in OOXML support nested tables, merged cells (vMerge for vertical merges, gridSpan for horizontal), per-cell borders, vertical alignment, and complex row/column structures. Markdown tables support exactly: rows separated by newlines, cells separated by pipes, one header row marked by a dashes row. The mismatch is large. The details are covered in why Word tables are the hardest conversion problem.
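
One way to bridge the mismatch, sketched under simplifying assumptions: a cell with gridSpan is padded with empty cells so every row has the same column count, and a cell that continues a vertical merge is left blank. Nested tables and cell formatting are ignored, the table is assumed to have at least one row, and the first row is treated as the header. This is an illustrative policy, not the one any particular converter uses.

```python
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W}
VAL = f"{{{W}}}val"

TABLE = f"""
<w:tbl xmlns:w="{W}">
  <w:tr>
    <w:tc><w:p><w:r><w:t>Name</w:t></w:r></w:p></w:tc>
    <w:tc><w:p><w:r><w:t>Value</w:t></w:r></w:p></w:tc>
  </w:tr>
  <w:tr>
    <w:tc><w:tcPr><w:gridSpan w:val="2"/></w:tcPr>
      <w:p><w:r><w:t>Total</w:t></w:r></w:p></w:tc>
  </w:tr>
</w:tbl>
""".strip()


def table_to_markdown(tbl) -> str:
    rows = []
    for tr in tbl.findall("w:tr", NS):
        cells = []
        for tc in tr.findall("w:tc", NS):
            # Join the cell's paragraphs with spaces and escape literal pipes.
            text = " ".join(
                "".join(t.text or "" for t in p.iter(f"{{{W}}}t"))
                for p in tc.findall("w:p", NS)
            ).replace("|", "\\|")
            v_merge = tc.find("w:tcPr/w:vMerge", NS)
            # vMerge without val="restart" continues the cell above: leave it blank.
            if v_merge is not None and v_merge.get(VAL) != "restart":
                text = ""
            span = tc.find("w:tcPr/w:gridSpan", NS)
            width = int(span.get(VAL)) if span is not None else 1
            # A horizontally merged cell occupies `width` grid columns.
            cells.extend([text] + [""] * (width - 1))
        rows.append(cells)
    n_cols = max(len(r) for r in rows)
    rows = [r + [""] * (n_cols - len(r)) for r in rows]
    lines = ["| " + " | ".join(r) + " |" for r in rows]
    lines.insert(1, "|" + " --- |" * n_cols)  # header separator row
    return "\n".join(lines)


table_md = table_to_markdown(ET.fromstring(TABLE))
print(table_md)
```

The merge information is lost in the output, which is the point: pipe tables have nowhere to put it, so a converter has to pick a lossy policy or fall back to HTML.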

Embedded objects. Word can embed Excel sheets, equations (modern Office Math or legacy MathType), drawings, and arbitrary OLE objects. Each embedded object type has its own XML representation, often referencing additional parts in the .docx archive. Modern equations are stored as Office Math Markup Language (OMML), which can be converted to MathML or to LaTeX-syntax math in Markdown; legacy equations often survive only as a bitmap fallback (which extracts as image references with no semantic recovery).

Headers, footers, and footnotes. Headers (header1.xml, header2.xml, etc.) and footers carry per-page or per-section content that's separate from the document body. Most Markdown converters drop them — Markdown doesn't have a native header/footer concept. Footnotes (footnotes.xml) are referenced from document.xml via run-level references and need to be reattached as Markdown footnote syntax in the output.
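
Reattaching footnotes can be sketched like this: a run in document.xml holds a <w:footnoteReference> whose w:id matches a <w:footnote> in footnotes.xml. The sample fragments are hypothetical, and the `[^id]` output is the footnote extension supported by Pandoc and several other Markdown flavors rather than core Markdown.

```python
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W}
SKIP_TYPES = {"separator", "continuationSeparator", "continuationNotice"}

DOC_P = (f'<w:p xmlns:w="{W}"><w:r><w:t>See the spec.</w:t></w:r>'
         '<w:r><w:footnoteReference w:id="2"/></w:r></w:p>')

FOOTNOTES = (f'<w:footnotes xmlns:w="{W}">'
             '<w:footnote w:type="separator" w:id="0">'
             '<w:p><w:r><w:separator/></w:r></w:p></w:footnote>'
             '<w:footnote w:id="2"><w:p><w:r><w:footnoteRef/></w:r>'
             '<w:r><w:t xml:space="preserve"> ECMA-376, Part 1.</w:t></w:r></w:p>'
             '</w:footnote></w:footnotes>')


def footnote_texts(footnotes_root) -> dict:
    """Map footnote id -> plain text, skipping separator entries."""
    notes = {}
    for fn in footnotes_root.findall("w:footnote", NS):
        if fn.get(f"{{{W}}}type") in SKIP_TYPES:
            continue
        text = "".join(t.text or "" for t in fn.iter(f"{{{W}}}t"))
        notes[fn.get(f"{{{W}}}id")] = text.strip()
    return notes


def paragraph_with_footnotes(p, notes) -> str:
    body, defs = [], []
    for run in p.findall("w:r", NS):
        for child in run:
            if child.tag == f"{{{W}}}t":
                body.append(child.text or "")
            elif child.tag == f"{{{W}}}footnoteReference":
                fid = child.get(f"{{{W}}}id")
                body.append(f"[^{fid}]")
                defs.append(f"[^{fid}]: {notes.get(fid, '')}")
    result = "".join(body)
    if defs:
        result += "\n\n" + "\n".join(defs)
    return result


notes = footnote_texts(ET.fromstring(FOOTNOTES))
footnote_md = paragraph_with_footnotes(ET.fromstring(DOC_P), notes)
print(footnote_md)
```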

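Comments and revisions. Tracked changes (insertions, deletions) are recorded inline in document.xml, wrapped in <w:ins> and <w:del> elements, while comments live in their own part (comments.xml) and are anchored from the body. Most converters either accept the changes (output as if accepted) or drop them entirely; preserving the revision metadata in Markdown is unusual.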
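
The "output as if accepted" behavior can be sketched as follows. In WordprocessingML, tracked insertions wrap their runs in <w:ins>, and tracked deletions wrap theirs in <w:del> with the removed text held in <w:delText>. The sample paragraph and the `accept` flag are illustrative assumptions.

```python
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"

PARAGRAPH = f"""
<w:p xmlns:w="{W}">
  <w:r><w:t xml:space="preserve">The </w:t></w:r>
  <w:del w:id="1" w:author="A"><w:r><w:delText>old</w:delText></w:r></w:del>
  <w:ins w:id="2" w:author="A"><w:r><w:t>new</w:t></w:r></w:ins>
  <w:r><w:t xml:space="preserve"> text</w:t></w:r>
</w:p>
""".strip()


def paragraph_text(p, accept: bool = True) -> str:
    """Text of a paragraph with all tracked changes accepted or all rejected."""
    parts = []

    def walk(el):
        for child in el:
            if child.tag == f"{{{W}}}ins" and not accept:
                continue  # rejecting: inserted runs never happened
            if child.tag == f"{{{W}}}del" and accept:
                continue  # accepting: deleted runs are gone
            if child.tag == f"{{{W}}}t":
                parts.append(child.text or "")
            elif child.tag == f"{{{W}}}delText" and not accept:
                parts.append(child.text or "")
            walk(child)

    walk(p)
    return "".join(parts)


p = ET.fromstring(PARAGRAPH)
accepted = paragraph_text(p, accept=True)
rejected = paragraph_text(p, accept=False)
print(accepted)
print(rejected)
```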

The library landscape

Three major libraries handle the heavy lifting of .docx parsing across the ecosystem:

For more general-purpose Office-format extraction (text only, no semantic structure), Apache Tika, oletools, and docx2txt are common choices. They're useful for search-indexing pipelines where text-only is fine; not useful for Markdown conversion where structure matters.

The web tool's place in this ecosystem

The web tool at word-to-markdown sits on top of the same library landscape — using semantic .docx parsing to produce structured Markdown output rather than naive text extraction. The choice of underlying engine matters less than the architectural decision to read styles.xml properly and emit semantic Markdown structure. For an individual document conversion through a browser, the web tool produces output equivalent to running Pandoc locally; for bulk migration, run Pandoc directly on a corporate machine as covered in building an enterprise document migration pipeline.

Practical implications for Markdown quality

Knowing how .docx works internally explains the practical Markdown-output failure modes:

For users converting a single document and wondering why the output isn't perfect, the answer is almost always: the input document didn't use Word's semantic styles consistently, so the converter couldn't infer the structure. The fix is upstream — train authors to use Heading 1, Heading 2, list buttons, and table inserts instead of manual formatting. This is what makes the conversion meaningfully better across the board.
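
One way to find the upstream problem before converting is to flag paragraphs that look like headings but carry no paragraph style. The heuristic below (no pStyle, short text, every non-empty run directly bold) is an assumption made for illustration, not a rule from the spec or from any converter, and it does not handle bold switched off with w:val="0".

```python
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W}


def run_text(run) -> str:
    return "".join(t.text or "" for t in run.findall("w:t", NS))


def looks_like_manual_heading(p, max_len: int = 80) -> bool:
    """Heuristic: unstyled, short, and bold throughout."""
    if p.find("w:pPr/w:pStyle", NS) is not None:
        return False  # the author used a real style; nothing to flag
    runs = [r for r in p.findall("w:r", NS) if run_text(r).strip()]
    if not runs:
        return False
    text = "".join(run_text(r) for r in runs)
    all_bold = all(r.find("w:rPr/w:b", NS) is not None for r in runs)
    return all_bold and len(text) <= max_len


def para(inner: str):
    """Hypothetical helper that wraps a fragment in a namespaced w:p."""
    return ET.fromstring(f'<w:p xmlns:w="{W}">{inner}</w:p>')


manual = para('<w:r><w:rPr><w:b/></w:rPr><w:t>Quarterly Results</w:t></w:r>')
styled = para('<w:pPr><w:pStyle w:val="Heading1"/></w:pPr>'
              '<w:r><w:t>Quarterly Results</w:t></w:r>')
body = para('<w:r><w:t>Revenue grew in the third quarter.</w:t></w:r>')

flags = [looks_like_manual_heading(p) for p in (manual, styled, body)]
print(flags)  # [True, False, False]
```

A report built from flags like these gives authors a concrete list of paragraphs to restyle with Heading 1 or Heading 2 before the document is converted.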

For more on the underlying architecture see Mammoth vs Pandoc vs AI; for the real-world conversion challenges see why Word tables are the hardest conversion problem; for the enterprise pipeline see building an enterprise document migration pipeline.

Frequently asked questions

Why is .docx a ZIP file? Couldn't Microsoft have used a simpler format?
The ZIP-of-XML structure was a deliberate architectural choice with several benefits: it produces meaningfully smaller files than the legacy .doc binary format thanks to ZIP compression, it's standards-based so any developer can read and write the format without reverse-engineering, it composes well (each part has a single responsibility), and it makes incremental updates possible (Word can rewrite just the parts that changed rather than the whole document). The downside is that the simplicity of the container layer hides genuine complexity in the WordprocessingML schema inside — which is why writing a real .docx parser is harder than the file-format documentation makes it look. The same architectural pattern (ZIP-of-XML) underlies .xlsx and .pptx, so Office's ecosystem benefits from a consistent approach across applications.
Can I write my own Word-to-Markdown converter from scratch?
You can, and many engineers have, but there are pitfalls. The basic case (paragraphs, runs, simple headings) is a few hundred lines of Python with python-docx. The hard cases (lists with nested numbering, tables with merged cells, equations, footnotes, embedded objects, tracked changes) compound complexity quickly and are where most homegrown converters produce buggy output. Practical guidance: if your use case is narrow (your team's documents follow a known template), write a focused converter on top of python-docx that handles your specific patterns well. If your use case is broad (arbitrary documents from arbitrary authors), use Pandoc or Mammoth — they have absorbed years of edge-case handling that's painful to reinvent. Most teams who try the from-scratch approach end up adopting Pandoc or Mammoth within a few months.
How does the .docx format compare to .odt (OpenDocument Text)?
Architecturally they're similar: both are ZIP archives of XML files. The .odt format is the text-document member of the OpenDocument Format (ODF) family, which was standardized by OASIS and later as ISO/IEC 26300, and it is the native format of LibreOffice and OpenOffice. The internal XML schemas differ (OOXML uses w:p, w:r naming; ODT uses text:p, text:span) but the conceptual structure (document content, styles, relationships) is parallel. Pandoc supports both formats with comparable quality; Mammoth focuses on .docx specifically. For most real-world workflows in 2026, .docx is dominant because Microsoft Word is the most-used word processor, but ODT has a meaningful presence in European public-sector environments where OpenDocument adoption was mandated. The Word-to-Markdown conversion principles in this article translate directly to ODT-to-Markdown conversion via Pandoc with -f odt instead of -f docx.