Why Word Tables Are the Hardest Conversion Problem (Technical)
Most parts of Word-to-Markdown conversion are tractable. Paragraphs become paragraphs. Headings become headings. Lists become lists. Bold and italic survive the trip. Hyperlinks transfer cleanly. The format-translation grammar is rich enough that the structural mapping is mostly straightforward.
And then you hit a table, and everything breaks. Word's table model, inherited from decades of word-processor evolution and standardized in Office Open XML, supports nested tables, merged cells in both directions, multi-row and multi-column headers, vertical alignment, complex spans, per-cell borders, and arbitrary content in any cell. Markdown's table model, designed for plain-text readability, supports rows separated by newlines, cells separated by pipes, and a single header row. The mismatch is enormous.
This article walks through what Word tables can do, what Markdown tables cannot, and the best-effort strategies that working conversion pipelines actually use.
The Word table model in detail
A Word table is a sequence of rows (<w:tr>), each containing a sequence of cells (<w:tc>), each containing arbitrary block content (paragraphs, runs, even nested tables). The cell can have properties (<w:tcPr>) controlling:
- Vertical merge (vMerge): cells span across multiple rows. Implemented as: the first cell has <w:vMerge w:val="restart"/>, subsequent cells have <w:vMerge/> (continue) and contain no content of their own.
- Horizontal merge (gridSpan): cells span across multiple columns. Implemented as: the cell has <w:gridSpan w:val="3"/>, indicating it spans 3 columns.
- Cell width: explicit width in twentieths of a point, or percentage of table width
- Vertical alignment: top, center, bottom of the cell
- Borders: per-side, per-cell, with width, color, and style
- Shading: background color or pattern
- Cell padding: per-side margins
Because cells contain arbitrary block content, a cell can hold multiple paragraphs, lists, sub-headings, code blocks, images, and even nested tables. Word documents in the wild routinely contain tables where a single cell has half a page of formatted content. This is where conversion gets genuinely hard — there is no Markdown construct that represents "a cell containing multiple paragraphs and an embedded list."
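The merge properties described above can be read straight out of the table XML. The following is a minimal sketch using only the Python standard library; the `SAMPLE` fragment and the `describe_cells` helper are hypothetical illustrations, not part of any converter's API, and the text extraction deliberately ignores everything except `<w:t>` runs.

```python
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W}

# Hypothetical fragment: "Region" is merged down two rows,
# "FY2025" spans two grid columns.
SAMPLE = f"""
<w:tbl xmlns:w="{W}">
  <w:tr>
    <w:tc><w:tcPr><w:vMerge w:val="restart"/></w:tcPr>
      <w:p><w:r><w:t>Region</w:t></w:r></w:p></w:tc>
    <w:tc><w:tcPr><w:gridSpan w:val="2"/></w:tcPr>
      <w:p><w:r><w:t>FY2025</w:t></w:r></w:p></w:tc>
  </w:tr>
  <w:tr>
    <w:tc><w:tcPr><w:vMerge/></w:tcPr><w:p/></w:tc>
    <w:tc><w:p><w:r><w:t>Q1</w:t></w:r></w:p></w:tc>
    <w:tc><w:p><w:r><w:t>Q2</w:t></w:r></w:p></w:tc>
  </w:tr>
</w:tbl>
"""

def describe_cells(tbl_xml):
    """Return one list per row of (text, grid_span, v_merge) tuples."""
    tbl = ET.fromstring(tbl_xml)
    rows = []
    for tr in tbl.findall("w:tr", NS):
        row = []
        for tc in tr.findall("w:tc", NS):
            # Collects all run text in the cell (including any nested table).
            text = "".join(t.text or "" for t in tc.iter(f"{{{W}}}t"))
            span_el = tc.find("w:tcPr/w:gridSpan", NS)
            span = int(span_el.get(f"{{{W}}}val")) if span_el is not None else 1
            vm_el = tc.find("w:tcPr/w:vMerge", NS)
            # A vMerge element without w:val means "continue".
            v_merge = None if vm_el is None else vm_el.get(f"{{{W}}}val", "continue")
            row.append((text, span, v_merge))
        rows.append(row)
    return rows

for row in describe_cells(SAMPLE):
    print(row)
```

Note that the two rows hold different numbers of `<w:tc>` elements even though they cover the same three grid columns, which is exactly what a converter has to reconcile.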
The Markdown table model in detail
The original Markdown specification (Gruber, 2004) didn't include tables at all. The widely-implemented modern table syntax comes from extensions — primarily PHP Markdown Extra and GitHub Flavored Markdown (GFM). The GFM table syntax is what most Markdown renderers in 2026 support:
| Header 1 | Header 2 | Header 3 |
|----------|----------|----------|
| Cell A | Cell B | Cell C |
| Cell D | Cell E | Cell F |
The constraints:
- Tables are rectangular: every row has the same number of cells
- Exactly one header row, marked by the dashes row immediately below
- Cell content is single-line — no paragraph breaks within a cell, no embedded lists, no embedded headings, no embedded tables
- Cell alignment can be specified in the dashes row (:-- left, :-: center, --: right) but vertical alignment is not expressible
- No merged cells in any direction
- No cell-level styling (borders, shading, padding)
Compare this to Word's model and the structural mismatch is enormous. Word can represent essentially any tabular layout you can draw on paper; Markdown can represent the simplest possible rectangular grid.
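Those constraints are simple enough to enforce in a few lines when emitting a table. The sketch below is one possible serializer, assuming the table has already been reduced to a header list and a list of rows; `to_gfm_table` is a hypothetical helper name, not a function from Pandoc or Mammoth.

```python
def to_gfm_table(header, rows, align=None):
    """Render a header and rows as a GFM pipe table.

    align: optional list with "left", "center", "right", or None per column.
    """
    width = len(header)
    if any(len(row) != width for row in rows):
        raise ValueError("table is not rectangular")
    align = align or [None] * width
    markers = {"left": ":---", "center": ":---:", "right": "---:", None: "---"}

    def clean(cell):
        # Cells must be single-line, and a literal pipe has to be escaped.
        text = " ".join(str(cell).split())
        return text.replace("|", "\\|")

    def line(cells):
        return "| " + " | ".join(cells) + " |"

    out = [line([clean(c) for c in header]),
           line([markers[a] for a in align])]
    out += [line([clean(c) for c in row]) for row in rows]
    return "\n".join(out)

print(to_gfm_table(["Item", "Qty"], [["Widget", 3], ["Gadget", 12]],
                   align=["left", "right"]))
```

Raising on a non-rectangular input, rather than padding silently, is a design choice: it forces the caller to decide how merged cells should be flattened before the table is written.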
What converts cleanly
The good news first: many real-world Word tables are simple data tables that fit Markdown's model just fine. The pattern that converts well:
- Rectangular: every row has the same number of cells
- One header row at the top, with bold or shaded styling
- No merged cells
- One paragraph per cell, no embedded lists or sub-tables
- Cell content is short text or numbers, no inline images or complex formatting
For tables matching this pattern, Pandoc, Mammoth, and the AI-based converters all produce clean Markdown output that renders with the same structure and content as the source. Most data tables in technical reports, financial summaries, and reference documents fall in this category. As a rough estimate, 60-70% of real-world Word tables convert cleanly with no manual intervention.
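A pipeline can test for this pattern before choosing how to convert a table. The following sketch assumes each cell has already been parsed into a small record; the `Cell` dataclass and the `blockers` function are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass

@dataclass
class Cell:
    # Hypothetical, simplified view of one parsed table cell.
    paragraphs: int = 1
    grid_span: int = 1
    v_merge: bool = False
    has_list: bool = False
    has_nested_table: bool = False

def blockers(rows):
    """Return the reasons a table does not fit the simple pattern."""
    reasons = set()
    if len({len(row) for row in rows}) > 1:
        reasons.add("not rectangular")
    for row in rows:
        for cell in row:
            if cell.grid_span > 1 or cell.v_merge:
                reasons.add("merged cells")
            if cell.paragraphs > 1:
                reasons.add("multi-paragraph cell")
            if cell.has_list:
                reasons.add("embedded list")
            if cell.has_nested_table:
                reasons.add("nested table")
    return sorted(reasons)

simple = [[Cell(), Cell()], [Cell(), Cell()]]
print(blockers(simple))  # an empty list means the table can be converted directly
```

Returning the list of reasons, instead of a plain boolean, lets a later stage pick a fallback that matches the specific problem.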
What converts poorly
The harder cases, ranked roughly by frequency:
Multi-row headers. A table where the top two rows together form the column headers — first row says "2025" "2026" spanning multiple data columns each, second row says "Q1 Q2 Q3 Q4" "Q1 Q2 Q3 Q4" providing the per-quarter columns. This pattern doesn't map to Markdown's single-header-row model. Best-effort: flatten to a single header row with concatenated labels ("2025 Q1", "2025 Q2", etc.), or convert to HTML embedded in the Markdown if the reader's renderer supports HTML.
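The flattening approach can be sketched as follows, assuming the top header row is available as (label, colspan) pairs and the second row as one label per grid column. `flatten_headers` is a hypothetical helper, and an empty second-row label is taken to mean the top cell spans both header rows.

```python
def flatten_headers(top, bottom):
    """Combine a two-row header into one row of composite labels.

    top:    list of (label, colspan) pairs from the first header row
    bottom: list of labels, one per grid column, from the second row
    """
    expanded = []
    for label, span in top:
        expanded.extend([label] * span)
    if len(expanded) != len(bottom):
        raise ValueError("header rows cover different numbers of columns")
    # Skip empty parts so a cell spanning both rows keeps its own label.
    return [" ".join(part for part in (t, b) if part)
            for t, b in zip(expanded, bottom)]

print(flatten_headers(
    [("Region", 1), ("2025", 2), ("2026", 2)],
    ["", "Q1", "Q2", "Q1", "Q2"],
))
```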
Vertically-merged cells (rowspan). A table where a label cell on the left spans three data rows. Markdown can't represent the merge. Best-effort: repeat the merged cell's content in each of the spanned rows, accepting some visual redundancy. Some converters drop the content into the first row only and leave subsequent rows empty, which is worse for readability.
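Repeating the merged content is a short fill-down pass. This sketch assumes continuation cells have been marked as `None` during parsing (an assumption of this example, not a convention of any particular tool); `fill_down` is a hypothetical helper name.

```python
def fill_down(rows):
    """Replace continuation cells (None) with the value from the row above."""
    filled, previous = [], []
    for row in rows:
        current = []
        for i, cell in enumerate(row):
            if cell is None:  # continuation of a merge started above
                cell = previous[i] if i < len(previous) else ""
            current.append(cell)
        filled.append(current)
        previous = current
    return filled

rows = [
    ["North", "Hardware", "120"],
    [None, "Software", "95"],
    ["South", "Hardware", "80"],
]
for row in fill_down(rows):
    print(row)
```

Using `None` instead of an empty string keeps genuinely empty cells distinct from merge continuations, so blank data is not overwritten by accident.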
Horizontally-merged cells (colspan). A table where a section-header cell spans the full width of the data columns below it. Markdown can't represent the merge. Best-effort: convert the merged cell into a separate paragraph above the table, or split the table into multiple sub-tables with the section header as text between them.
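The split-into-sub-tables approach might look like the sketch below. It assumes each row is a list of (text, colspan) pairs and treats a single cell spanning the full width as a section header; `split_on_section_rows` is a hypothetical helper.

```python
def split_on_section_rows(rows, width):
    """Split a table at full-width rows.

    rows:  lists of (text, colspan) pairs
    width: number of grid columns in the table
    Returns a list of (heading, data_rows); heading is None for rows
    that appear before the first section header.
    """
    sections = [(None, [])]
    for row in rows:
        if len(row) == 1 and row[0][1] == width:
            sections.append((row[0][0], []))
        else:
            sections[-1][1].append([text for text, _ in row])
    # Drop the leading placeholder if nothing preceded the first header.
    return [s for s in sections if s[0] is not None or s[1]]

rows = [
    [("Hardware", 3)],
    [("Laptop", 1), ("2", 1), ("$1800", 1)],
    [("Software", 3)],
    [("IDE", 1), ("5", 1), ("$500", 1)],
]
for heading, data in split_on_section_rows(rows, 3):
    print(heading, data)
```

Each returned section can then be written as a text heading followed by its own rectangular table.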
Cells with multiple paragraphs. A cell containing two or three paragraphs of explanation. Markdown table cells are single-line. Best-effort: concatenate the paragraphs with explicit line-break syntax (<br> in HTML, or two-trailing-spaces newlines that some renderers respect). The result renders as a multi-line cell in compatible renderers and as concatenated text in stricter ones.
Cells with embedded lists. A cell containing a bulleted list of items. Markdown table cells are single-line; bulleted lists are multi-line block elements. Best-effort: concatenate list items with bullet characters and line breaks ("- item 1<br>- item 2"), accepting that the rendered result looks less clean than the source.
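Both the multi-paragraph and the embedded-list cases reduce to collapsing a cell's blocks onto one line. The sketch below assumes a cell has been parsed into ("p", text) and ("ul", items) tuples, which is a hypothetical intermediate form; `cell_to_inline` is likewise an illustrative name.

```python
def cell_to_inline(blocks):
    """Collapse a cell's block content into single-line Markdown.

    blocks: list of ("p", text) or ("ul", [item, ...]) tuples
    """
    parts = []
    for kind, value in blocks:
        if kind == "p":
            parts.append(value)
        elif kind == "ul":
            parts.extend(f"- {item}" for item in value)
        else:
            raise ValueError(f"unsupported block type: {kind}")
    # Normalize internal whitespace, then join blocks with HTML line breaks.
    joined = "<br>".join(" ".join(part.split()) for part in parts)
    # A literal pipe would otherwise be read as a cell separator.
    return joined.replace("|", "\\|")

print(cell_to_inline([("p", "Supported modes:"), ("ul", ["fast", "safe"])]))
```

The `<br>` variant is used here because it survives inside a pipe-table cell, whereas a trailing-spaces line break would end the table row.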
Cells with images. An image inside a table cell. Most Markdown renderers handle inline image syntax (![alt text](image.png)) inside table cells, but the result is often visually different from the source: Word controls the image scaling explicitly per cell, while Markdown renderers use whatever default the CSS produces.
Nested tables. A table containing another table in one of its cells. Markdown's table syntax cannot represent this at all — the row-and-column grammar requires each cell to be single-line content. Best-effort: extract the nested table as a separate Markdown table immediately below the outer table, with a note about the original placement.
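Extracting a nested table can be sketched as a single pass over the outer grid. This example assumes a cell holding a nested table is represented as a dict with "text" and "table" keys (a hypothetical shape), handles one level of nesting only, and uses an illustrative placeholder wording for the note.

```python
def hoist_nested(rows):
    """Pull nested tables out of cells, leaving a placeholder note.

    rows: grid whose cells are either str or {"text": str, "table": grid}
    Returns (outer_rows, extracted_tables).
    """
    outer, extracted = [], []
    for row in rows:
        new_row = []
        for cell in row:
            if isinstance(cell, dict):
                extracted.append(cell["table"])
                note = f"(see nested table {len(extracted)} below)"
                new_row.append(f'{cell["text"]} {note}'.strip())
            else:
                new_row.append(cell)
        outer.append(new_row)
    return outer, extracted

outer, nested = hoist_nested([
    ["Step", "Details"],
    ["1", {"text": "Settings", "table": [["Key", "Value"], ["mode", "fast"]]}],
])
print(outer)
print(nested)
```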
Tables with mixed content types. A table where some cells contain text, others contain code samples, others contain images, others contain bulleted lists. The combination compounds the per-cell challenges; the resulting Markdown is either heavily-degraded plain text or HTML-embedded-in-Markdown that renderers may or may not handle.
Best-effort strategies
Working conversion pipelines use a mix of strategies depending on the table's complexity:
Strategy 1: simple rectangular conversion
For tables that fit Markdown's model, just convert. This is what Pandoc and Mammoth do by default. Output is clean Markdown table syntax. No special handling needed.
Strategy 2: flatten and degrade gracefully
For tables with merged cells or multi-row headers, flatten the structure. Repeat merged content in each spanned row; concatenate multi-row headers into single-row composite labels. The output is structurally simpler than the source but still readable. Acceptable for reference and reading; problematic if the source's structure carried important semantic information.
Strategy 3: embed HTML in the Markdown
For tables that genuinely cannot be flattened (complex multi-level headers in a financial report, for example), the pragmatic alternative is to embed an HTML table inside the Markdown:
## Quarterly results
<table>
<thead>
<tr>
<th rowspan="2">Region</th>
<th colspan="4">FY2025</th>
<th colspan="4">FY2026</th>
</tr>
<tr>
<th>Q1</th><th>Q2</th><th>Q3</th><th>Q4</th>
<th>Q1</th><th>Q2</th><th>Q3</th><th>Q4</th>
</tr>
</thead>
...
</table>
Narrative continues here.
Most Markdown renderers (Pandoc, GFM, MkDocs, Docusaurus) pass HTML through unmodified, so it renders correctly in the final output. The source file is uglier than pure Markdown, but the rendered result preserves the source's structure. For accessibility, the HTML table should include proper scope attributes on header cells and a caption element.
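Generating that HTML with the accessibility attributes in place can be sketched as follows. `html_table` is a hypothetical helper; it assumes header cells arrive as (text, rowspan, colspan) tuples, treats the first cell of each body row as a row header, and uses plain scope="col" and scope="row" values, which may not be enough for every multi-level layout.

```python
from html import escape

def html_table(caption, header_rows, body_rows):
    """Build an HTML table with a caption and scope attributes.

    header_rows: lists of (text, rowspan, colspan) tuples
    body_rows:   lists of str; the first cell is emitted as a row header
    """
    lines = ["<table>", f"  <caption>{escape(caption)}</caption>", "  <thead>"]
    for row in header_rows:
        lines.append("    <tr>")
        for text, rowspan, colspan in row:
            attrs = ' scope="col"'
            if rowspan > 1:
                attrs += f' rowspan="{rowspan}"'
            if colspan > 1:
                attrs += f' colspan="{colspan}"'
            lines.append(f"      <th{attrs}>{escape(text)}</th>")
        lines.append("    </tr>")
    lines += ["  </thead>", "  <tbody>"]
    for row in body_rows:
        lines.append("    <tr>")
        lines.append(f'      <th scope="row">{escape(row[0])}</th>')
        lines.extend(f"      <td>{escape(cell)}</td>" for cell in row[1:])
        lines.append("    </tr>")
    lines += ["  </tbody>", "</table>"]
    # No blank lines are emitted: in CommonMark a blank line ends an HTML block.
    return "\n".join(lines)

print(html_table(
    "Quarterly results",
    [[("Region", 2, 1), ("FY2025", 1, 2)], [("Q1", 1, 1), ("Q2", 1, 1)]],
    [["Europe", "$80K", "$88K"]],
))
```

Escaping every cell with `html.escape` matters here because cell text taken from a Word document can contain `<` or `&`, which would otherwise corrupt the embedded markup.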
Strategy 4: replace the table with a different presentation
The most aggressive strategy: rebuild the content in a non-table form. A complex three-level header table might be better as a series of nested headings with bullet lists below. A table that's really a categorization grid might be better as a multi-section document. The conversion tool can't make this judgment automatically; this is editorial work during the post-conversion review.
Strategy 5: render the table as an image
For tables that are visually structured in ways that no text format can represent (genuinely-complex layouts where the visual structure carries meaning), the honest fallback is to render the table as an image and embed the image in the Markdown with descriptive alt text. The source-of-truth becomes the image; the alt text describes the data textually for accessibility and AI-readability. This is unsatisfying but sometimes the right call.
Real-world output examples
To make this concrete, here's how Pandoc handles a moderately-complex Word table by default. Source: a 5-column financial table with a vertically-merged "Region" column on the left and one row of bold totals at the bottom.
The Pandoc output:
| Region | FY2025 Q1 | FY2025 Q2 | FY2025 Q3 | FY2025 Q4 |
|---------------|-----------|-----------|-----------|-----------|
| North America | $120K | $135K | $142K | $158K |
| | $95K | $102K | $115K | $128K |
| Europe | $80K | $88K | $94K | $103K |
| | $62K | $68K | $72K | $79K |
| **Total** | **$357K** | **$393K** | **$423K** | **$468K** |
The vertical merge degraded: the second row of "North America" data has an empty Region label, and the reader has to infer that the empty label means "continuation of North America." For data analysis this is workable; for reading and comprehension it's lossy. Manual editorial cleanup might re-fill the Region labels ("North America" repeated, or "NA - subtotal") to make the table self-explanatory.
The AI conversion advantage on tables
This is the area where AI-powered conversion (covered comparatively in Mammoth vs Pandoc vs AI) shows the biggest quality differential over rule-based converters. An LLM reading a complex table can:
- Recognize that two rows together form a multi-row header and produce flattened composite labels intelligently ("FY2025 Q1" rather than just leaving the second-row label)
- Recognize that an empty cell after a merged cell means continuation and re-fill the implied content
- Recognize that a section-header row spanning the full table width should become a Markdown heading above a sub-table
- Produce HTML-embedded tables when the source structure genuinely requires it, with proper accessibility attributes
- Generate a brief textual description of the table's structure to accompany the converted output
The cost trade-offs of AI conversion (per-document API cost, latency, non-determinism) apply, but for documents that are heavy on complex tables, the quality boost is meaningful. For most enterprise migrations the right approach is rule-based for the bulk, AI for the table-heavy edge cases — also covered in the comparative article.
Practical recommendations for authors
The highest-leverage change is upstream — authors who understand the table-conversion challenge can author Word documents that convert better:
- Prefer simple rectangular tables when possible. Resist the urge to use merged cells unless they're genuinely necessary.
- If a table needs complex structure, use a single header row rather than multi-row headers. "FY2025 Q1" instead of "FY2025" spanning over "Q1, Q2, Q3, Q4."
- For categorization grids that aren't really data tables, consider expressing them as headings with content below rather than as tables — better for both Word and Markdown reading
- For tables with rich cell content (multiple paragraphs per cell, embedded lists), consider whether the content really wants to be tabular or whether a different presentation would serve readers better
Style guides for documentation teams often include a "prefer simple tables" rule for exactly this reason. The rule is good Word practice and good Markdown practice simultaneously.
For the broader conversion landscape see Mammoth vs Pandoc vs AI; for the underlying format details see how the DOCX format works internally; for the bulk-migration pipeline that has to handle tables at scale see building an enterprise document migration pipeline; for an industry view of where tables matter most see Word to Markdown for academic publishing (data tables in research papers) or Word to Markdown for legal contracts (clause-comparison tables).