May 10, 2026 · 8 min read · MDisBetter

How to Extract Tables from PDF: The Complete Guide

Tables are the most frequently extracted content in PDFs, and the most frequently screwed up. Bank statements, financial reports, scientific data tables, invoices, schedules: all locked in PDF, all needing to become spreadsheet-shaped data. Here are five methods that work in 2026, ranked by ease and accuracy.

Why tables are hard to extract

A PDF table is a visual construct: rectangles drawn on a page with text inside them. There's no schema in the PDF saying "this is row 3 of column B" — that has to be inferred from the coordinates of each piece of text. Different table styles produce different challenges:

Bordered tables: relatively easy (line detection finds cell boundaries)
Borderless tables: hard (whitespace clustering needed)
Merged cells: very hard (the merge has to be inferred from missing borders)
Multi-row headers: hard (need to detect that two rows belong together as headers)
Tables spanning pages: hard (need to stitch across pages with repeating headers)

The right tool for your task depends on which of these your tables exhibit.
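To make the coordinate inference concrete, here is a minimal sketch of the first step most extractors perform in some form: grouping text fragments into rows by their vertical position. The input format (a list of `(x, y, text)` tuples) and the `y_tolerance` default are assumptions for illustration, not the format of any particular library.

```python
def group_into_rows(words, y_tolerance=3.0):
    """Group (x, y, text) fragments into table rows.

    Fragments whose y coordinate is within y_tolerance of the row's first
    fragment are treated as the same row; cells are ordered left to right.
    """
    rows = []  # each entry: [reference_y, [(x, text), ...]]
    for x, y, text in sorted(words, key=lambda w: (w[1], w[0])):
        if rows and abs(y - rows[-1][0]) <= y_tolerance:
            rows[-1][1].append((x, text))
        else:
            rows.append([y, [(x, text)]])
    return [[text for _, text in sorted(cells)] for _, cells in rows]


# Hypothetical fragments, as a PDF text layer might report them
words = [
    (50, 100.0, "Date"), (200, 100.5, "Amount"),
    (50, 120.0, "2026-01-03"), (200, 119.2, "42.00"),
]
rows = group_into_rows(words)
```

Column assignment works the same way on the x axis, which is where borderless tables, merged cells, and multi-row headers start to cause trouble.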

Method 1: Online converter to CSV

Easiest. Upload PDF, get one CSV per detected table. Best for: one-off extractions, non-technical users.

Our PDF to CSV converter outputs one CSV per table, named by page and table index, with header rows preserved. Output is a ZIP file plus a manifest mapping each CSV back to its location in the source PDF.

Pros: zero setup; handles bordered and borderless tables; OCR included for scanned PDFs

Cons: complex tables (heavy merges, exotic layouts) may need manual cleanup

Method 2: Tabula (free desktop app)

Tabula is the open-source standard for PDF table extraction. Download, open, draw rectangles around tables, click "Preview & Export". Manual but precise.

Pros: free; works offline; you control table boundaries

Cons: manual per-table; doesn't scale to many documents; no OCR

Best for: a few specific tables in known locations, where you'd rather spend 5 minutes per document than risk wrong extraction

Method 3: Camelot (Python library)

Camelot is a Python library in the same spirit as Tabula. It offers two extraction algorithms, selected with the flavor argument (lattice for bordered tables, stream for borderless ones):

import camelot

tables = camelot.read_pdf('document.pdf', pages='all', flavor='lattice')
print(f'Found {len(tables)} tables')

for i, table in enumerate(tables):
    table.to_csv(f'table_{i}.csv')

Pros: programmable; quality is high on the tables it handles; integrates into pipelines

Cons: Python + dependencies; sometimes misses tables that have unusual styling; no built-in OCR (it works on text-based PDFs only)
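When you don't know in advance whether a document's tables are bordered, one possible approach is to try lattice first and fall back to stream. The sketch below assumes Camelot's per-table `parsing_report` dictionary with its `accuracy` field; the 80.0 threshold and both function names are illustrative choices, not Camelot recommendations.

```python
def mean_accuracy(reports):
    """Average the 'accuracy' field over a list of parsing-report dicts."""
    if not reports:
        return 0.0
    return sum(r["accuracy"] for r in reports) / len(reports)


def extract_with_fallback(path, min_accuracy=80.0):
    """Try lattice first; retry with stream if nothing usable was found."""
    import camelot  # imported lazily so mean_accuracy works without Camelot

    tables = camelot.read_pdf(path, pages="all", flavor="lattice")
    reports = [t.parsing_report for t in tables]
    if len(tables) == 0 or mean_accuracy(reports) < min_accuracy:
        tables = camelot.read_pdf(path, pages="all", flavor="stream")
    return tables
```

Stream mode tends to find something on almost any page, so it is worth inspecting its output rather than trusting the table count alone.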

Method 4: AI-based extraction (LLM with the right prompt)

Modern LLMs handle tables well when given clean input. Workflow: convert PDF to Markdown first (preserving tables as GFM), then ask Claude or ChatGPT to extract specific tables.

For interactive use: convert your PDF in our web tool, paste the resulting Markdown into ChatGPT/Claude, ask for the table extraction in CSV format. For programmatic pipelines, use an OSS converter (we don't offer an API today):

import pymupdf4llm

# Step 1: PDF to Markdown locally with PyMuPDF4LLM, the Markdown helper built on PyMuPDF
# (page.get_text() itself offers text/HTML/dict modes but no Markdown mode)
md = pymupdf4llm.to_markdown('report.pdf')

# Step 2: Pass md to your LLM of choice with extraction instructions
# (OpenAI/Anthropic/Gemini SDKs all accept the Markdown string directly)

Pros: extremely flexible (extract specific tables matching certain criteria); handles complex layouts where rule-based tools fail

Cons: per-call cost; can hallucinate cell values on edge cases; slower than direct extraction
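Because hallucinated cell values are the main risk, a cheap guard is to check that every number in the LLM's CSV output also appears somewhere in the Markdown you sent. The sketch below is one way to do that; the function name and the number pattern are assumptions, and it only catches invented numbers, not numbers placed in the wrong cell.

```python
import csv
import io
import re

# Matches 35, 1,200 and 1,200.00 (US-style thousands separators assumed)
NUMBER = re.compile(r"\d+(?:,\d{3})*(?:\.\d+)?")


def unsupported_numbers(llm_csv, source_markdown):
    """Return numbers in the LLM's CSV that never occur in the source text."""
    source = {n.replace(",", "") for n in NUMBER.findall(source_markdown)}
    missing = []
    for row in csv.reader(io.StringIO(llm_csv)):
        for cell in row:
            for number in NUMBER.findall(cell):
                if number.replace(",", "") not in source:
                    missing.append(number)
    return missing


source_md = "| Item | Total |\n|---|---|\n| Rent | 1,200.00 |\n| Fees | 35 |"
llm_csv = 'Item,Total\nRent,"1,200.00"\nFees,53\n'
suspect = unsupported_numbers(llm_csv, source_md)
```

An empty result is not proof of correctness, but a non-empty one is a strong signal to re-run or review the extraction.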

Method 5: Specialized commercial tools

For very high-volume table extraction or particularly hard documents (financial statements with merged cells, scientific data tables with footnotes), commercial tools fill the gap:

Docparser: cloud-based, template-driven extraction
Rossum: invoice-specific, very accurate on that vertical
Hypatos: financial documents, ML-based
Adobe PDF Extract API: Adobe's own extractor, very strong on Adobe-generated PDFs

Pros: strong on their target verticals; enterprise support

Cons: expensive; vertical-specific; vendor lock-in

Comparison table

Method	Setup	OCR	Borderless tables	Best for
Online (PDF to CSV)	None	Yes	Good	One-off use, mixed quality docs
Tabula desktop	Download app	No	Manual	Few tables, precise control
Camelot Python	pip install	No (text-based PDFs only)	Decent (stream mode)	Programmatic batch
AI-based	API key	Yes (via Markdown)	Excellent	Complex tables, smart filtering
Commercial	Vendor onboarding	Yes	Excellent (vertical-specific)	High-volume specific use cases

What if you want tables AND surrounding context?

If your goal is feeding a document to AI rather than building a spreadsheet pipeline, you don't want CSV — you want the tables in the context of the surrounding document. That's exactly what our PDF tables to Markdown tool does: GFM tables in their original document context.

For pure data-pipeline work, CSV is right. For document understanding (LLM context, knowledge bases, RAG), Markdown with tables-in-context is right. Pick based on what your downstream consumer needs.

Common pitfalls

Currency markers and parenthetical negatives

Financial tables use currency symbols such as "$" and write negative values in parentheses, as in "(123)". Most extractors preserve these as text; your downstream parser may need to convert them to numeric values. Plan for this in your pipeline.
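A minimal sketch of that conversion step, assuming US-style formatting (dollar sign, comma as thousands separator, period as decimal point). The function name `parse_amount` is illustrative, and `Decimal` is used to avoid floating-point rounding on money.

```python
from decimal import Decimal


def parse_amount(text):
    """Convert strings like '$1,234.50', '(123)' or '-$5' to a Decimal."""
    cleaned = text.strip().replace("$", "").replace(",", "").strip()
    negative = False
    if cleaned.startswith("(") and cleaned.endswith(")"):
        negative = True            # accounting-style negative
        cleaned = cleaned[1:-1].strip()
    if cleaned.startswith("-"):
        negative = True            # explicit minus sign
        cleaned = cleaned[1:]
    value = Decimal(cleaned)       # raises InvalidOperation on non-numeric text
    return -value if negative else value


amounts = [parse_amount(s) for s in ["$1,234.50", "(123)", "($45.10)", "-$5"]]
```

Other locales (for example "1.234,50 €") need different separator handling, so it is safer to make the convention explicit per document source than to guess.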

Multi-page tables

Tables that span pages often have repeating headers on each page. Good extractors stitch them into one logical table; lesser ones produce one table per page that you have to merge.
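If your extractor returns one table per page, the merge can be sketched as follows. It assumes each table is a list of rows (lists of strings) in page order and that a continuation page repeats the header as its first row; `stitch_pages` is a hypothetical helper name.

```python
def stitch_pages(page_tables):
    """Merge per-page tables into one, keeping a single header row."""
    tables = [t for t in page_tables if t]  # ignore empty pages
    if not tables:
        return []
    header = tables[0][0]
    merged = [header]
    for table in tables:
        # Only the first row of a page is checked, so a data row that
        # happens to equal the header is never dropped by accident.
        start = 1 if table[0] == header else 0
        merged.extend(table[start:])
    return merged


page1 = [["Date", "Amount"], ["Jan 1", "10"]]
page2 = [["Date", "Amount"], ["Jan 2", "20"]]
page3 = [["Jan 3", "30"]]  # continuation page without a repeated header
stitched = stitch_pages([page1, page2, page3])
```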

Merged cells

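GFM tables don't support merged cells; CSV doesn't either. Extractors flatten a merge in one of two ways: repeating the value across the merged range, or keeping it in the first cell and leaving the rest empty. Check which behavior your tool uses and verify it matches your downstream expectation.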
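If an extractor leaves the cells of a vertically merged range empty instead of repeating the value, a fill-down pass restores it. This is a small sketch under that assumption; `fill_down` and the sample rows are illustrative.

```python
def fill_down(rows, column):
    """Copy the last non-empty value in `column` into the empty cells below it."""
    filled = []
    last = ""
    for row in rows:
        row = list(row)  # copy so the input is not modified
        if row[column].strip():
            last = row[column]
        else:
            row[column] = last
        filled.append(row)
    return filled


rows = [["Q1", "Jan", "10"], ["", "Feb", "12"], ["Q2", "Apr", "9"]]
filled = fill_down(rows, 0)
```

Only apply this to columns where an empty cell really means "same as above"; in a value column an empty cell is usually genuinely missing data.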

Footnoted cells

Cells with footnote markers (a, b, *) need special handling — the marker is data; the footnote text is separate but related. Different extractors handle this differently; spot-check your tool's behavior.
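One way to handle this downstream is to split the marker off numeric cells and keep it in a separate field. The pattern below is an assumption about what markers look like (a single lowercase letter, asterisks, or a dagger directly after a number); it would also strip a one-letter unit suffix, so adjust it to your documents.

```python
import re

# A value ending in a digit, followed by a footnote marker
FOOTNOTE = re.compile(r"^(?P<value>.*?\d)\s*(?P<marker>[a-z]|\*+|†)$")


def split_footnote(cell):
    """Return (value, marker); marker is None when the cell has none."""
    text = cell.strip()
    match = FOOTNOTE.match(text)
    if match:
        return match.group("value"), match.group("marker")
    return text, None


results = [split_footnote(c) for c in ["1,234a", "12.5*", "45", "Total"]]
```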

Recommendation

For most users: start with our online PDF to CSV converter (no setup) or PDF to Markdown (if context matters). For programmatic pipelines: Camelot for simple tables, Marker or Docling for complex layouts (we don't ship a public API today, so OSS is the path). For very high volume on a specific document type: commercial tools, where the ROI justifies the cost.

Frequently asked questions

Will my extracted tables match the source exactly?

On bordered tables with consistent layout: yes, near-perfect. On complex tables (merged cells, multi-row headers): expect manual cleanup of edge cases. Always spot-check critical numbers against the source.

Can I extract tables from scanned PDFs?

Yes — our online tool runs OCR first, then table detection. Quality depends on scan resolution; bordered tables in 200+ DPI scans usually come through cleanly.

How do I get extracted tables into Excel?

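Open the CSV directly in Excel, or import it via Data > From Text/CSV if you need control over delimiters and encoding. GFM Markdown tables do not split into columns on a plain paste: either convert them to CSV first, or paste and then use Text to Columns with "|" as a custom delimiter.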
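When the source is a GFM table rather than a CSV, converting it first gives Excel a format it imports reliably. A minimal sketch, assuming a simple table with no escaped pipes (`\|`) inside cells; `gfm_table_to_csv` is an illustrative name.

```python
import csv
import io


def gfm_table_to_csv(markdown_table):
    """Convert a simple GFM table to CSV text, dropping the delimiter row."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    for line in markdown_table.strip().splitlines():
        cells = [c.strip() for c in line.strip().strip("|").split("|")]
        # The delimiter row consists only of dashes, colons and spaces
        if all("-" in c and set(c) <= set("-: ") for c in cells):
            continue
        writer.writerow(cells)
    return out.getvalue()


table = "| Item | Total |\n| --- | ---: |\n| Rent | 1,200.00 |\n"
csv_text = gfm_table_to_csv(table)
```

The csv module quotes cells that contain commas, so values like 1,200.00 stay in a single column when Excel opens the file.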