PDF Tables to Markdown: The Complete Guide
Tables are the canary in the coal mine for PDF conversion. Get tables right and the rest of the document usually works too. Get tables wrong and you have flattened-into-prose data, broken column boundaries, or worse — output that looks correct but has the wrong cells in the wrong rows. Here's the complete picture: why tables are hard, what good extraction looks like, and how to handle the edge cases.
Why tables break in PDF extraction
A PDF table is a visual construct: rectangles drawn on a page with text positioned inside them. There's no schema in the file format saying "this is row 3 of column B" — that has to be inferred from coordinates. Most PDF text extractors treat tables as flat text streams, which destroys row/column structure entirely. The output: a wall of cell values with no boundaries, often interleaved between rows in unpredictable order.
The naïve flat extraction is what tools like pdftotext produce. The result is often unreadable for tables of any complexity:
Q1 Q2 Q3 Q4
2025
12,400 14,200 13,800 15,100
2026
13,700 15,800 14,500 16,300
You can see what was supposed to be a table, but extracting clean rows requires guessing where each row starts and ends. Across hundreds of tables in a document, this is impractical.
How MDisBetter preserves tables
Our converter detects tables in three steps. First, line-detection: scan the PDF for horizontal and vertical lines that bound table cells. Second, whitespace clustering: even tables without borders often have consistent spacing patterns that separate cells. Third, text-block grouping: text fragments that align horizontally and vertically are bundled into table cells.
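The whitespace-clustering step can be illustrated with a minimal sketch. This is not the converter's actual implementation: the function name `split_columns`, the `(x0, x1, text)` fragment format, and the `min_gap` threshold are all assumptions made for illustration. The idea is simply that a wide horizontal gap between neighboring text fragments on one line marks a cell boundary.

```python
from typing import List, Optional, Tuple

# Hypothetical fragment format: (left x, right x, text) for one line of a page.
Fragment = Tuple[float, float, str]


def split_columns(fragments: List[Fragment], min_gap: float = 10.0) -> List[str]:
    """Group the text fragments of one line into cells by horizontal gaps."""
    cells: List[List[str]] = []
    prev_right: Optional[float] = None
    for x0, x1, text in sorted(fragments):
        # A gap wider than min_gap starts a new cell; a narrower gap
        # means the fragment continues the current cell.
        if prev_right is None or x0 - prev_right > min_gap:
            cells.append([text])
        else:
            cells[-1].append(text)
        prev_right = x1
    return [" ".join(parts) for parts in cells]


row = [(50, 70, "Net"), (73, 110, "revenue"), (150, 190, "12,400"), (230, 270, "14,200")]
print(split_columns(row))  # ['Net revenue', '12,400', '14,200']
```

A fixed threshold is the simplest choice here; a real detector would more likely derive the gap size from the spacing observed across many lines of the same table.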
Once cells are identified, the converter emits GitHub-Flavored Markdown (GFM) tables, the pipe-table syntax that most modern Markdown viewers render. The example above becomes:
| Year | Q1 | Q2 | Q3 | Q4 |
|------|--------|--------|--------|--------|
| 2025 | 12,400 | 14,200 | 13,800 | 15,100 |
| 2026 | 13,700 | 15,800 | 14,500 | 16,300 |
This pastes cleanly into spreadsheets, renders consistently across Markdown viewers that support GFM tables, and feeds to LLMs as an actual table (not prose), which improves downstream answer quality on table-related questions.
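For readers who want to produce the same output from their own cell data, here is a small sketch of a GFM table emitter. The function name `to_gfm` and its padding behavior are illustrative assumptions, not the converter's code.

```python
from typing import List


def to_gfm(header: List[str], rows: List[List[str]]) -> str:
    """Render a header and rows as a GFM table with padded columns."""
    table = [header] + rows
    # Column width is the longest cell in that column (at least 3 for the dash row).
    widths = [max(3, *(len(r[i]) for r in table)) for i in range(len(header))]

    def line(cells: List[str]) -> str:
        return "| " + " | ".join(c.ljust(w) for c, w in zip(cells, widths)) + " |"

    divider = "|" + "|".join("-" * (w + 2) for w in widths) + "|"
    return "\n".join([line(header), divider] + [line(r) for r in rows])


header = ["Year", "Q1", "Q2", "Q3", "Q4"]
rows = [
    ["2025", "12,400", "14,200", "13,800", "15,100"],
    ["2026", "13,700", "15,800", "14,500", "16,300"],
]
print(to_gfm(header, rows))
```

Padding the cells is optional in GFM; it only makes the raw Markdown easier to read and diff.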
Complex tables — merged cells and multi-row headers
Merged cells
Real-world tables often have merged cells ("2025" spanning four quarter columns, for example). GFM doesn't support merged cells natively. Our converter handles this by flattening: the merged value is repeated across the cells it spans, with a comment marker preserving the original merge for downstream tools that care.
| Year | Q1 / Rev | Q1 / Cost | Q2 / Rev | Q2 / Cost |
|------|---------:|----------:|---------:|----------:|
| 2025 | 12.4 | 8.1 | 14.2 | 8.9 |
Multi-row headers
Headers that span multiple rows (e.g., "Q1" on top, "Revenue"/"Cost" beneath) are flattened into a single GFM header row using / as a separator. The reasoning: GFM tables only support a single header row, and a flattened compound header preserves the information in a way LLMs and humans both parse correctly.
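The flattening described above can be sketched in a few lines. The function `flatten_headers` and its input convention (a `None` entry marks a column covered by a merged cell to its left) are assumptions for illustration only; the converter's internal representation is not documented here.

```python
from typing import List, Optional


def flatten_headers(
    top: List[Optional[str]], bottom: List[str], sep: str = " / "
) -> List[str]:
    """Combine a two-row header into a single row.

    `top` uses None for columns covered by a merged cell to their left;
    `bottom` uses "" where the column has no second-level label.
    """
    flat: List[str] = []
    current = ""
    for upper, lower in zip(top, bottom):
        if upper is not None:
            current = upper  # start of a new (possibly merged) top-level cell
        # Repeat the merged value across every column it spans.
        parts = [p for p in (current, lower) if p]
        flat.append(sep.join(parts))
    return flat


top = ["Year", "Q1", None, "Q2", None]
bottom = ["", "Rev", "Cost", "Rev", "Cost"]
print(flatten_headers(top, bottom))
# ['Year', 'Q1 / Rev', 'Q1 / Cost', 'Q2 / Rev', 'Q2 / Cost']
```

The same repeat-across-span idea applies to merged cells in the table body: the merged value is copied into each column it covers.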
Footnotes attached to tables
Footnotes like (a), *, or numbered annotations stay attached to their cell content. Footnote text appears as a paragraph immediately after the table.
GFM table syntax explained
If you'll edit converted tables manually, the GFM syntax is straightforward:
| Header 1 | Header 2 | Header 3 |
|----------|---------:|:--------:|
| Left | Right | Center |
| Aligned | Aligned | Aligned |
- | separates cells
- The dash row defines column alignment: :-- left, --: right, :-: center
- Pipes inside cell content are escaped as \|
- Whitespace inside cells is collapsed; multi-line content is rare and rendered with <br>
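If you generate or patch table cells with a script, the escaping rules above can be applied with a helper like the following sketch. The name `escape_cell` is hypothetical, and the function assumes its input is raw cell text that has not been escaped yet.

```python
import re


def escape_cell(text: str) -> str:
    """Prepare raw (not yet escaped) cell text for a GFM table cell."""
    # Turn line breaks into <br> first so they survive whitespace collapsing.
    text = re.sub(r"\s*\n\s*", "<br>", text.strip())
    # Collapse remaining runs of spaces and tabs into one space.
    text = re.sub(r"[ \t]+", " ", text)
    # Escape pipes so they are not read as cell separators.
    return text.replace("|", "\\|")


print(escape_cell("yes | no"))
print(escape_cell("line one\n  line two"))
```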
Troubleshooting common issues
Borderless tables
Some tables (especially modern designs) have no borders at all — just whitespace separating cells. The converter falls back to whitespace-clustering, which works for most consistent layouts. If your table has irregular spacing or uses background colors instead of whitespace, the output may need manual cleanup.
Rotated header text
Headers rotated 90° (vertical text) are recognized via an OCR pass and emitted as horizontal text in the GFM output. Stylized labels with unusual fonts may need manual review.
Tables with embedded images
If a cell contains an image (a chart icon, a logo), the image is extracted separately and the cell shows a [image] placeholder. Reattach images by hand if they're meaningful to your output.
Multi-page tables
Tables that span multiple pages are stitched into one GFM table in the output, with repeating headers (which appear on every page in the source PDF) deduplicated to one header row.
When to use the dedicated PDF-to-CSV tool instead
If you only care about tables and want each one as a separate spreadsheet, use our PDF to CSV tool. It outputs one CSV per detected table, named by page and table index, with no surrounding document content. It is useful for invoice line items, financial-report data extraction, and scientific data tables.
Use PDF to Markdown when you want the tables in the context of the surrounding document (for AI summarization, RAG ingestion, or doc-site migration). Use PDF to CSV when you only want the data.
Verifying your converted tables
Quick spot-check workflow: open the converted Markdown in any preview tool (VS Code, Obsidian, GitHub) and visually compare the rendered tables against the source PDF. Look for: row count match, column count match, no shifted values, footnotes attached to the right cell.
For a closer check, paste the GFM tables into Excel or Google Sheets, open the source PDF alongside, and scan for obvious mismatches. Five minutes of spot-checking saves hours of downstream confusion.
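The row-count and column-count checks can also be scripted. The sketch below assumes each table line starts and ends with a pipe and that the expected counts come from your own reading of the source PDF; `parse_gfm_table` and `check_shape` are hypothetical helper names.

```python
import re
from typing import List


def parse_gfm_table(markdown: str) -> List[List[str]]:
    """Parse one GFM table into rows of cell strings, skipping the dash row."""
    rows: List[List[str]] = []
    for i, line in enumerate(markdown.strip().splitlines()):
        if i == 1:
            continue  # in GFM the second line is always the delimiter row
        # Split on pipes that are not escaped with a backslash.
        cells = [c.strip() for c in re.split(r"(?<!\\)\|", line.strip())]
        rows.append(cells[1:-1])  # drop the empty strings outside the outer pipes
    return rows


def check_shape(rows: List[List[str]], expected_rows: int, expected_cols: int) -> bool:
    """True when the table has the expected data-row and column counts."""
    data = rows[1:]  # rows[0] is the header
    return len(data) == expected_rows and all(len(r) == expected_cols for r in rows)


table = """
| Year | Q1 | Q2 | Q3 | Q4 |
|------|--------|--------|--------|--------|
| 2025 | 12,400 | 14,200 | 13,800 | 15,100 |
| 2026 | 13,700 | 15,800 | 14,500 | 16,300 |
"""
rows = parse_gfm_table(table)
print(check_shape(rows, expected_rows=2, expected_cols=5))  # True
```

A shape check catches dropped rows and split or merged columns, but not shifted values, so it complements the visual comparison rather than replacing it.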