May 10, 2026 · 10 min read · MDisBetter

Why Word Tables Are the Hardest Conversion Problem (Technical)

Most parts of Word-to-Markdown conversion are tractable. Paragraphs become paragraphs. Headings become headings. Lists become lists. Bold and italic survive the trip. Hyperlinks transfer cleanly. The format-translation grammar is rich enough that the structural mapping is mostly straightforward. And then you hit a table, and everything breaks. Word's table model — inherited from decades of word-processor evolution and standardized in Office Open XML — supports nested tables, merged cells in both directions, multi-row and multi-column headers, vertical alignment, complex spans, per-cell borders, and arbitrary content in any cell. Markdown's table model — designed for plain-text readability — supports rows separated by newlines, cells separated by pipes, and a single header row. The mismatch is enormous. This article walks through what Word tables can do, what Markdown tables cannot, and the best-effort strategies that working conversion pipelines actually use.

The Word table model in detail

A Word table is a sequence of rows (<w:tr>), each containing a sequence of cells (<w:tc>), each containing arbitrary block-level content (paragraphs, which hold the text runs, and even nested tables). Each cell can carry properties (<w:tcPr>) controlling:

Vertical merge (vMerge): cells span across multiple rows. Implemented as: the first cell has <w:vMerge w:val="restart"/>, and each subsequent cell has <w:vMerge/> (continue) and carries no content of its own beyond an empty paragraph.
Horizontal merge (gridSpan): cells span across multiple columns. Implemented as: the cell has <w:gridSpan w:val="3"/> indicating it spans 3 columns.
Cell width: explicit width in twentieths of a point, or percentage of table width
Vertical alignment: top, center, bottom of the cell
Borders: per-side, per-cell, with width, color, and style
Shading: background color or pattern
Cell padding: per-side margins
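
To make the merge properties concrete, here is a minimal sketch that reads gridSpan and vMerge from a hand-written WordprocessingML fragment using only the Python standard library. The SAMPLE fragment and the cell_merge_info helper are illustrative assumptions, not part of any converter; a real pipeline would read word/document.xml out of the .docx archive first.

```python
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W}


def cell_merge_info(tc):
    """Return (grid_span, v_merge) for one <w:tc> element.

    v_merge is None (not merged), "restart" (first cell of a vertical
    merge) or "continue" (a later cell in the same merge).
    """
    tc_pr = tc.find("w:tcPr", NS)
    if tc_pr is None:
        return 1, None
    span_el = tc_pr.find("w:gridSpan", NS)
    grid_span = int(span_el.get(f"{{{W}}}val", "1")) if span_el is not None else 1
    merge_el = tc_pr.find("w:vMerge", NS)
    if merge_el is None:
        v_merge = None
    else:
        # A bare <w:vMerge/> with no w:val attribute means "continue".
        v_merge = merge_el.get(f"{{{W}}}val", "continue")
    return grid_span, v_merge


# Hypothetical two-row table: a vertical merge in column 1, a 3-column span in row 1.
SAMPLE = f"""
<w:tbl xmlns:w="{W}">
  <w:tr>
    <w:tc><w:tcPr><w:vMerge w:val="restart"/></w:tcPr><w:p/></w:tc>
    <w:tc><w:tcPr><w:gridSpan w:val="3"/></w:tcPr><w:p/></w:tc>
  </w:tr>
  <w:tr>
    <w:tc><w:tcPr><w:vMerge/></w:tcPr><w:p/></w:tc>
    <w:tc><w:p/></w:tc>
  </w:tr>
</w:tbl>
""".strip()

table = ET.fromstring(SAMPLE)
merge_map = [
    [cell_merge_info(tc) for tc in tr.findall("w:tc", NS)]
    for tr in table.findall("w:tr", NS)
]
print(merge_map)
```

The resulting merge map is the raw material for every strategy discussed later: a converter has to decide what to do with each cell whose span is greater than 1 or whose vertical-merge state is not None.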

Because cells contain arbitrary block content, a cell can hold multiple paragraphs, lists, sub-headings, code blocks, images, and even nested tables. Word documents in the wild routinely contain tables where a single cell has half a page of formatted content. This is where conversion gets genuinely hard — there is no Markdown construct that represents "a cell containing multiple paragraphs and an embedded list."

The Markdown table model in detail

The original Markdown specification (Gruber, 2004) didn't include tables at all. The widely-implemented modern table syntax comes from extensions — primarily PHP Markdown Extra and GitHub Flavored Markdown (GFM). The GFM table syntax is what most Markdown renderers in 2026 support:

| Header 1 | Header 2 | Header 3 |
|----------|----------|----------|
| Cell A   | Cell B   | Cell C   |
| Cell D   | Cell E   | Cell F   |

The constraints:

Tables are rectangular: every rendered row has the same number of cells as the header row (GFM pads short rows with empty cells and ignores extra ones)
Exactly one header row, marked by the dashes row immediately below
Cell content is single-line — no paragraph breaks within a cell, no embedded lists, no embedded headings, no embedded tables
Cell alignment can be specified in the dashes row (:-- left, :-: center, --: right) but vertical alignment is not expressible
No merged cells in any direction
No cell-level styling (borders, shading, padding)

Compare this to Word's model and the structural mismatch is enormous. Word can represent essentially any tabular layout you can draw on paper; Markdown can represent the simplest possible rectangular grid.
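
The constraints above are simple enough to encode directly. The following is a minimal sketch of a GFM table writer; the function name to_gfm_table and its argument shapes are assumptions made for illustration. It refuses non-rectangular input and neutralizes the two characters that would break the grammar inside a cell (a literal pipe and a newline).

```python
def to_gfm_table(header, rows, align=None):
    """Render a header and data rows as a GFM pipe table.

    align is an optional list with one entry per column:
    "left", "center", "right" or None (renderer default).
    """
    width = len(header)
    if any(len(row) != width for row in rows):
        raise ValueError("pad or trim rows so that the table is rectangular")
    markers = {"left": ":--", "center": ":-:", "right": "--:", None: "---"}
    align = align or [None] * width

    def clean(cell):
        # A literal pipe would end the cell; a newline would end the row.
        return str(cell).replace("|", "\\|").replace("\n", " ").strip()

    def line(cells):
        return "| " + " | ".join(cells) + " |"

    out = [line([clean(c) for c in header]), line([markers[a] for a in align])]
    out += [line([clean(c) for c in row]) for row in rows]
    return "\n".join(out)


print(to_gfm_table(["Item", "Price"], [["Widget", "$4"], ["A|B", "$9"]],
                   align=["left", "right"]))
```

Everything Word can express beyond this (spans, block content, per-cell styling) has nowhere to go in the output, which is why the rest of the pipeline is about deciding what to give up.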

What converts cleanly

The good news first: many real-world Word tables are simple data tables that fit Markdown's model just fine. The pattern that converts well:

Rectangular: every row has the same number of cells
One header row at the top, with bold or shaded styling
No merged cells
One paragraph per cell, no embedded lists or sub-tables
Cell content is short text or numbers, no inline images or complex formatting

For tables matching this pattern, Pandoc, Mammoth, and the AI-based converters all produce clean Markdown output whose rows, columns, and header match the source. Most data tables in technical reports, financial summaries, and reference documents fall in this category. As a rough estimate, 60-70% of real-world Word tables convert cleanly with no manual intervention.
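
The "converts cleanly" pattern can be checked mechanically before choosing a conversion path. Below is a rough sketch assuming a hypothetical in-memory Cell model (the field names are invented for this example and do not come from any library); it only tests the structural criteria from the list above, not whether the cell text is short.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Cell:
    """Hypothetical cell model; the fields mirror the criteria listed above."""
    paragraphs: List[str] = field(default_factory=list)
    grid_span: int = 1
    v_merge: Optional[str] = None  # None, "restart" or "continue"
    has_list: bool = False
    has_table: bool = False
    has_image: bool = False


def is_simple_table(rows):
    """True if the table is rectangular with no merges and flat, single-paragraph cells."""
    if not rows:
        return False
    width = len(rows[0])
    for row in rows:
        if len(row) != width:
            return False
        for cell in row:
            if cell.grid_span != 1 or cell.v_merge is not None:
                return False
            if len(cell.paragraphs) > 1:
                return False
            if cell.has_list or cell.has_table or cell.has_image:
                return False
    return True


simple = [[Cell(["Region"]), Cell(["Q1"])], [Cell(["Europe"]), Cell(["$80K"])]]
merged = [[Cell(["Region"], v_merge="restart"), Cell(["Q1"])],
          [Cell([], v_merge="continue"), Cell(["$80K"])]]
print(is_simple_table(simple), is_simple_table(merged))
```

Tables that pass this check can go straight to a pipe-table writer; the ones that fail are the cases covered in the next section.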

What converts poorly

The harder cases, ranked roughly by frequency:

Multi-row headers. A table where the top two rows together form the column headers — first row says "2025" "2026" spanning multiple data columns each, second row says "Q1 Q2 Q3 Q4" "Q1 Q2 Q3 Q4" providing the per-quarter columns. This pattern doesn't map to Markdown's single-header-row model. Best-effort: flatten to a single header row with concatenated labels ("2025 Q1", "2025 Q2", etc.), or convert to HTML embedded in the Markdown if the reader's renderer supports HTML.
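
The flattening step for a two-row header can be sketched in a few lines. The input shape here is an assumption: the top row is a list of (label, span) pairs and the bottom row has one label per grid column, with an empty string under a header cell that already covers both rows.

```python
def flatten_headers(top, bottom):
    """Combine a spanning top header row with a per-column bottom row.

    top:    list of (label, span) pairs, e.g. [("2025", 4), ("2026", 4)]
    bottom: one label per grid column, "" where the top cell covers both rows
    """
    # Repeat each top label once for every column it spans.
    expanded = [label for label, span in top for _ in range(span)]
    if len(expanded) != len(bottom):
        raise ValueError("top-row spans must add up to the number of columns")
    return [f"{t} {b}".strip() for t, b in zip(expanded, bottom)]


print(flatten_headers(
    [("Region", 1), ("2025", 4), ("2026", 4)],
    ["", "Q1", "Q2", "Q3", "Q4", "Q1", "Q2", "Q3", "Q4"],
))
```

The same idea extends to three header levels by applying the function twice, although labels that long are usually a sign that the HTML route is the better choice.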

Vertically-merged cells (rowspan). A table where a label cell on the left spans three data rows. Markdown can't represent the merge. Best-effort: repeat the merged cell's content in each of the spanned rows, accepting some visual redundancy. Some converters drop the content into the first row only and leave subsequent rows empty, which is worse for readability.
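
Repeating the merged content is a "fill down" over the affected columns. A minimal sketch, assuming the table has already been reduced to rows of strings in which continuation cells are empty: note that this version cannot tell a merge continuation from a cell that was genuinely blank in the source, so a stricter pipeline would consult the vMerge state instead of testing for emptiness.

```python
def fill_down(rows, columns=(0,)):
    """Repeat the last non-empty value in the given columns.

    Assumption: an empty cell in these columns always means
    "continuation of the merged cell above".
    """
    filled = []
    last = {}
    for row in rows:
        new_row = list(row)  # leave the caller's rows untouched
        for col in columns:
            if new_row[col].strip():
                last[col] = new_row[col]
            elif col in last:
                new_row[col] = last[col]
        filled.append(new_row)
    return filled


rows = [
    ["North America", "$120K"],
    ["", "$95K"],
    ["Europe", "$80K"],
    ["", "$62K"],
]
for row in fill_down(rows):
    print(row)
```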

Horizontally-merged cells (colspan). A table where a section-header cell spans the full width of the data columns below it. Markdown can't represent the merge. Best-effort: convert the merged cell into a separate paragraph above the table, or split the table into multiple sub-tables with the section header as text between them.
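
Splitting on full-width section rows can be sketched as follows. The input shape is an assumption for illustration: each row is a list of (text, grid_span) pairs, and a row counts as a section header when its single cell spans the whole table width.

```python
def split_on_section_rows(rows, width):
    """Split a table into (section_title, data_rows) groups.

    rows:  list of rows, each a list of (text, grid_span) pairs
    width: number of grid columns in the table
    Rows before the first section header are grouped under title None.
    """
    sections = []
    title, data = None, []
    for row in rows:
        if len(row) == 1 and row[0][1] == width:
            # Full-width cell: close the current group and start a new one.
            if data or title is not None:
                sections.append((title, data))
            title, data = row[0][0], []
        else:
            data.append([text for text, _span in row])
    if data or title is not None:
        sections.append((title, data))
    return sections


table_rows = [
    [("Hardware", 3)],
    [("Laptop", 1), ("12", 1), ("$14K", 1)],
    [("Software", 3)],
    [("Licenses", 1), ("40", 1), ("$6K", 1)],
]
for title, data in split_on_section_rows(table_rows, width=3):
    print(title, data)
```

Each group can then be written as its own sub-table with the shared column header repeated on top and the section title emitted as a paragraph or heading above it.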

Cells with multiple paragraphs. A cell containing two or three paragraphs of explanation. Markdown table cells are single-line. Best-effort: join the paragraphs with an inline HTML <br>. The two-trailing-spaces hard break does not help here, because a pipe-table row must stay on one physical line and a newline ends the row. The result renders as a multi-line cell in renderers that allow inline HTML and as one run of text (possibly with a literal <br>) in stricter ones.

Cells with embedded lists. A cell containing a bulleted list of items. Markdown table cells are single-line; bulleted lists are multi-line block elements. Best-effort: concatenate list items with bullet characters and <br> line breaks ("- item 1<br>- item 2"), accepting that the rendered result looks less clean than the source.
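
Both the multi-paragraph case and the embedded-list case come down to collapsing block content into one line. A small sketch, assuming each cell has been parsed into a list of ("p", text) and ("ul", [items]) tuples (a made-up intermediate form, not any library's model):

```python
def cell_to_inline(blocks):
    """Collapse a cell's block content into a single pipe-table-safe line.

    blocks: list of ("p", text) or ("ul", [item, ...]) tuples.
    """
    parts = []
    for kind, payload in blocks:
        if kind == "p":
            parts.append(payload.strip())
        elif kind == "ul":
            parts.extend(f"- {item.strip()}" for item in payload)
        else:
            raise ValueError(f"unsupported block type: {kind}")
    joined = "<br>".join(part for part in parts if part)
    # Pipes would end the cell and newlines would end the row.
    return joined.replace("|", "\\|").replace("\n", " ")


print(cell_to_inline([
    ("p", "Supported platforms:"),
    ("ul", ["Linux", "macOS", "Windows"]),
]))
```

Inside a table cell the "- " prefix is rendered as literal text rather than as a list, so the output only looks like a list; that is the degradation this strategy accepts.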

Cells with images. An image inside a table cell. Most Markdown renderers handle inline image syntax (![alt](path)) inside table cells, but the result is often visually different from the source — Word controls the image scaling explicitly per cell; Markdown renderers use whatever default the CSS produces.

Nested tables. A table containing another table in one of its cells. Markdown's table syntax cannot represent this at all — the row-and-column grammar requires each cell to be single-line content. Best-effort: extract the nested table as a separate Markdown table immediately below the outer table, with a note about the original placement.

Tables with mixed content types. A table where some cells contain text, others contain code samples, others contain images, others contain bulleted lists. The combination compounds the per-cell challenges; the resulting Markdown is either heavily-degraded plain text or HTML-embedded-in-Markdown that renderers may or may not handle.

Best-effort strategies

Working conversion pipelines use a mix of strategies depending on the table's complexity:

Strategy 1: simple rectangular conversion

For tables that fit Markdown's model, just convert. This is what Pandoc and Mammoth do by default. Output is clean Markdown table syntax. No special handling needed.

Strategy 2: flatten and degrade gracefully

For tables with merged cells or multi-row headers, flatten the structure. Repeat merged content in each spanned row; concatenate multi-row headers into single-row composite labels. The output is structurally simpler than the source but still readable. Acceptable for reference and reading; problematic if the source's structure carried important semantic information.

Strategy 3: embed HTML in the Markdown

For tables that genuinely cannot be flattened (complex multi-level headers in a financial report, for example), the pragmatic alternative is to embed an HTML table inside the Markdown:

## Quarterly results

<table>
  <thead>
    <tr>
      <th rowspan="2">Region</th>
      <th colspan="4">FY2025</th>
      <th colspan="4">FY2026</th>
    </tr>
    <tr>
      <th>Q1</th><th>Q2</th><th>Q3</th><th>Q4</th>
      <th>Q1</th><th>Q2</th><th>Q3</th><th>Q4</th>
    </tr>
  </thead>
  ...
</table>

Narrative continues here.

Most Markdown renderers that produce HTML (Pandoc, GitHub's GFM renderer, MkDocs, Docusaurus) pass an HTML table through unmodified, so it renders correctly in the final output. Raw HTML is not portable to non-HTML targets, though: Pandoc, for example, drops it when writing LaTeX or PDF. The source file is uglier than pure Markdown, but the rendered result preserves the source's structure. For accessibility, the HTML table should include proper scope attributes on header cells and a caption element.
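
Emitting the HTML fallback with the accessibility attributes in place is mostly string assembly. Here is a minimal sketch; the html_table function and its input shapes are assumptions for illustration, and it treats the first cell of every body row as a row header.

```python
from html import escape


def html_table(caption, header_rows, body_rows):
    """Build an HTML table with a caption and scoped header cells.

    header_rows: list of rows, each a list of (text, rowspan, colspan)
    body_rows:   list of rows of plain strings; the first cell becomes a row header
    """
    def th(text, rowspan=1, colspan=1, scope="col"):
        attrs = f' scope="{scope}"'
        if rowspan > 1:
            attrs += f' rowspan="{rowspan}"'
        if colspan > 1:
            attrs += f' colspan="{colspan}"'
        return f"<th{attrs}>{escape(text)}</th>"

    lines = ["<table>", f"  <caption>{escape(caption)}</caption>", "  <thead>"]
    for row in header_rows:
        lines.append("    <tr>" + "".join(th(t, r, c) for t, r, c in row) + "</tr>")
    lines += ["  </thead>", "  <tbody>"]
    for first, *rest in body_rows:
        cells = th(first, scope="row") + "".join(f"<td>{escape(v)}</td>" for v in rest)
        lines.append(f"    <tr>{cells}</tr>")
    lines += ["  </tbody>", "</table>"]
    return "\n".join(lines)


print(html_table(
    "Quarterly results",
    [
        [("Region", 2, 1), ("FY2025", 1, 4)],
        [("Q1", 1, 1), ("Q2", 1, 1), ("Q3", 1, 1), ("Q4", 1, 1)],
    ],
    [["North America", "$120K", "$135K", "$142K", "$158K"]],
))
```

Escaping every cell with html.escape matters here: cell text taken from a Word document can contain ampersands or angle brackets that would otherwise corrupt the markup.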

Strategy 4: replace the table with a different presentation

The most aggressive strategy: rebuild the content in a non-table form. A complex three-level header table might be better as a series of nested headings with bullet lists below. A table that's really a categorization grid might be better as a multi-section document. The conversion tool can't make this judgment automatically; this is editorial work during the post-conversion review.

Strategy 5: render the table as an image

For tables that are visually structured in ways that no text format can represent (genuinely-complex layouts where the visual structure carries meaning), the honest fallback is to render the table as an image and embed the image in the Markdown with descriptive alt text. The source-of-truth becomes the image; the alt text describes the data textually for accessibility and AI-readability. This is unsatisfying but sometimes the right call.
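
One way to wire the strategies together is a small dispatcher that maps table features to a strategy name. The sketch below is an assumption-heavy illustration: the feature keys and the cut-off points are invented for this example and would need tuning against real documents. Strategy 4 is deliberately absent because, as noted above, it is an editorial decision rather than an automatic one.

```python
def choose_strategy(features):
    """Pick a conversion strategy from a dict of table features.

    All keys are hypothetical:
      visual_layout (bool), nested_table (bool), merged_cells (bool),
      multi_block_cells (bool), header_rows (int, default 1)
    """
    header_rows = features.get("header_rows", 1)
    if features.get("visual_layout"):
        return "image"    # Strategy 5: the layout itself carries the meaning
    if features.get("nested_table") or header_rows > 2:
        return "html"     # Strategy 3: too much structure to flatten
    if features.get("merged_cells") or header_rows == 2 or features.get("multi_block_cells"):
        return "flatten"  # Strategy 2: degrade gracefully
    return "simple"       # Strategy 1: plain pipe table


print(choose_strategy({"merged_cells": True}))
print(choose_strategy({"header_rows": 3}))
```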

Real-world output examples

To make this concrete, here's representative Pandoc output for a moderately complex Word table when writing a pipe table (exact output varies with the Pandoc version and output format). Source: a 5-column financial table with a vertically merged "Region" column on the left and one row of bold totals at the bottom.

The Pandoc output:

| Region        | FY2025 Q1 | FY2025 Q2 | FY2025 Q3 | FY2025 Q4 |
|---------------|-----------|-----------|-----------|-----------|
| North America | $120K     | $135K     | $142K     | $158K     |
|               | $95K      | $102K     | $115K     | $128K     |
| Europe        | $80K      | $88K      | $94K      | $103K     |
|               | $62K      | $68K      | $72K      | $79K      |
| **Total**     | **$357K** | **$393K** | **$423K** | **$468K** |

The vertical merge degraded — the second row of "North America" data has an empty Region label, and the reader has to infer that the empty label means "continuation of North America." For data analysis this is workable; for reading-and-comprehension it's lossy. Manual editorial cleanup might re-fill the Region labels ("North America" repeated, or "NA - subtotal") to make the table self-explanatory.

The AI conversion advantage on tables

This is the area where AI-powered conversion (covered comparatively in Mammoth vs Pandoc vs AI) shows the biggest quality differential over rule-based converters. An LLM reading a complex table can:

Recognize that two rows together form a multi-row header and produce flattened composite labels intelligently ("FY2025 Q1" rather than just leaving the second-row label)
Recognize that an empty cell after a merged cell means continuation and re-fill the implied content
Recognize that a section-header row spanning the full table width should become a Markdown heading above a sub-table
Produce HTML-embedded tables when the source structure genuinely requires it, with proper accessibility attributes
Generate a brief textual description of the table's structure to accompany the converted output

The cost trade-offs of AI conversion (per-document API cost, latency, non-determinism) apply, but for documents that are heavy on complex tables, the quality boost is meaningful. For most enterprise migrations the right approach is rule-based for the bulk, AI for the table-heavy edge cases — also covered in the comparative article.

Practical recommendations for authors

The highest-leverage change is upstream — authors who understand the table-conversion challenge can author Word documents that convert better:

Prefer simple rectangular tables when possible. Resist the urge to use merged cells unless they're genuinely necessary.
If a table needs complex structure, use a single header row rather than multi-row headers. "FY2025 Q1" instead of "FY2025" spanning over "Q1, Q2, Q3, Q4."
For categorization grids that aren't really data tables, consider expressing them as headings with content below rather than as tables — better for both Word and Markdown reading
For tables with rich cell content (multiple paragraphs per cell, embedded lists), consider whether the content really wants to be tabular or whether a different presentation would serve readers better

Style guides for documentation teams often include a "prefer simple tables" rule for exactly this reason. The rule is good Word practice and good Markdown practice simultaneously.

For the broader conversion landscape see Mammoth vs Pandoc vs AI; for the underlying format details see how the DOCX format works internally; for the bulk-migration pipeline that has to handle tables at scale see building an enterprise document migration pipeline; for an industry view of where tables matter most see Word to Markdown for academic publishing (data tables in research papers) or Word to Markdown for legal contracts (clause-comparison tables).

Frequently asked questions

Can Markdown represent tables with merged cells at all?

Not in standard Markdown table syntax. The GitHub Flavored Markdown table grammar requires every row to have the same number of pipe-separated cells with no merge concept. Three workarounds exist: (1) flatten the table by repeating content in merged-row positions or concatenating headers, accepting some redundancy in the output; (2) embed an HTML table inside the Markdown, which most renderers pass through unmodified — this preserves merged-cell structure via rowspan/colspan attributes; (3) replace the table with a non-table presentation (headings + lists) when the merged structure was being used to group conceptually rather than to display tabular data. The HTML-embedded approach is the highest-fidelity for genuinely complex tables; the flatten approach is the most readable in pure Markdown contexts; the replace approach is sometimes the editorially-cleanest answer.

How do I handle a Word table where each cell contains a paragraph of explanatory text?

This is one of the cases where Markdown's table model genuinely doesn't fit. Markdown table cells are single-line; multi-paragraph cells don't translate. Three options based on your priority: (1) if you need output that stays in pipe-table syntax, join the paragraphs with an inline HTML <br> (the two-trailing-spaces hard break does not work inside a table row, because a newline ends the row), which is readable but lossy; (2) if you need full fidelity, use HTML-embedded-in-Markdown for that table, which lets cells contain real paragraphs at the cost of a less-portable Markdown source; (3) if the table is really being used as a side-by-side comparison rather than as data, consider restructuring as a 'feature comparison' section with headings and bullet lists per item, which is often what readers actually want anyway. The right call depends on whether the table is structurally essential or whether it's a presentation artifact that could be reformulated.

Why don't Markdown extensions support advanced tables like Word does?

The Markdown design philosophy is plain-text readability: the source should be readable as plain text without a renderer. Complex table syntax (merged cells, multi-row headers, nested tables) makes the plain-text source unreadable. CommonMark's core specification defines no table syntax at all, and the GFM table extension has stayed minimal, with no syntax for merges or multi-row headers. The pragmatic compromise: use Markdown's simple tables for the roughly 60-70% of cases where they work, and embed HTML tables for the rest, where you need full fidelity. Most modern Markdown renderers pass HTML through unchanged, so the HTML-embedded escape hatch is widely available. AsciiDoc and reStructuredText offer richer table syntax natively; if your team's documents have heavy complex-table content, those formats are sometimes a better fit than Markdown, though they have their own tradeoffs in tooling and ecosystem support.