How to Extract Tables from PDF: The Complete Guide
Tables are the most-frequently-extracted thing from PDFs, and the most-frequently-screwed-up. Bank statements, financial reports, scientific data tables, invoices, schedules — all locked in PDF, all needing to become spreadsheet-shaped data. Five methods that work in 2026, ranked by ease and accuracy.
Why tables are hard to extract
A PDF table is a visual construct: rectangles drawn on a page with text inside them. There's no schema in the PDF saying "this is row 3 of column B" — that has to be inferred from the coordinates of each piece of text. Different table styles produce different challenges:
- Bordered tables: relatively easy (line detection finds cell boundaries)
- Borderless tables: hard (whitespace clustering needed)
- Merged cells: very hard (the merge has to be inferred from missing borders)
- Multi-row headers: hard (need to detect that two rows belong together as headers)
- Tables spanning pages: hard (need to stitch across pages with repeating headers)
The right tool for your task depends on which of these your tables exhibit.
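To make the "inferred from coordinates" point concrete, here is a minimal sketch of the simplest inference step: grouping words into rows by vertical position. The `Word` type, the `group_into_rows` helper, and the 3-point tolerance are illustrative assumptions, not any particular library's API, and the sketch assumes y grows downward on the page.

```python
from typing import NamedTuple


class Word(NamedTuple):
    text: str
    x: float  # left edge of the word (hypothetical page units)
    y: float  # vertical position of the word, growing downward


def group_into_rows(words: list[Word], y_tolerance: float = 3.0) -> list[list[str]]:
    """Cluster words into rows by y position, then order each row left to right."""
    rows: list[list[Word]] = []
    for word in sorted(words, key=lambda w: w.y):
        # Same row if the word is vertically close to the first word of the current row.
        if rows and abs(word.y - rows[-1][0].y) <= y_tolerance:
            rows[-1].append(word)
        else:
            rows.append([word])
    return [[w.text for w in sorted(row, key=lambda w: w.x)] for row in rows]


words = [
    Word("Qty", 200, 100.4), Word("Item", 50, 100.0),
    Word("Widget", 50, 120.1), Word("3", 200, 119.8),
]
print(group_into_rows(words))
```

Real extractors go much further (column detection, merged cells, multi-line cells), which is where the difficulty levels in the list above come from.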
Method 1: Online converter to CSV
Easiest. Upload PDF, get one CSV per detected table. Best for: one-off extractions, non-technical users.
Our PDF to CSV converter outputs one CSV per table, named by page and table index, with header rows preserved. Output is a ZIP file plus a manifest mapping each CSV back to its location in the source PDF.
Pros: zero setup; handles bordered and borderless tables; OCR included for scanned PDFs
Cons: complex tables (heavy merges, exotic layouts) may need manual cleanup
Method 2: Tabula (free desktop app)
Tabula is the open-source standard for PDF table extraction. Download, open, draw rectangles around tables, click "Preview & Export". Manual but precise.
Pros: free; works offline; you control table boundaries
Cons: manual per-table; doesn't scale to many documents; no OCR
Best for: a few specific tables in known locations, where you'd rather spend 5 minutes per document than risk wrong extraction
Method 3: Camelot (Python library)
Camelot is a Python library in the spirit of Tabula. It has two extraction algorithms, chosen with the flavor argument: lattice for bordered tables and stream for borderless ones:
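```python
import camelot

# lattice uses ruling lines to find cells; use flavor='stream' for borderless tables
tables = camelot.read_pdf('document.pdf', pages='all', flavor='lattice')
print(f'Found {len(tables)} tables')

# Write each detected table to its own CSV file
for i, table in enumerate(tables):
    table.to_csv(f'table_{i}.csv')
```

Pros: programmable; quality is high on the tables it handles; integrates into pipelines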
Cons: requires Python and its dependencies; sometimes misses tables with unusual styling; no built-in OCR, so scanned PDFs need an OCR pass first
Method 4: AI-based extraction (LLM with the right prompt)
Modern LLMs handle tables well when given clean input. Workflow: convert the PDF to Markdown first (preserving tables as GitHub-Flavored Markdown, or GFM, tables), then ask Claude or ChatGPT to extract specific tables.
For interactive use: convert your PDF in our web tool, paste the resulting Markdown into ChatGPT or Claude, and ask for the table in CSV format. For programmatic pipelines, use an OSS converter (we don't offer an API today):
```python
import pymupdf4llm

# Step 1: PDF to Markdown locally with PyMuPDF4LLM (the Markdown helper built on PyMuPDF)
md = pymupdf4llm.to_markdown('report.pdf')

# Step 2: Pass md to your LLM of choice with extraction instructions
# (OpenAI/Anthropic/Gemini SDKs all accept the Markdown string directly)
```

Pros: extremely flexible (extract specific tables matching certain criteria); handles complex layouts where rule-based tools fail
Cons: per-call cost; can hallucinate cell values on edge cases; slower than direct extraction
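Because hallucinated cell values are the main risk here, a cheap sanity check can help. The sketch below flags any cell in the LLM's CSV output that does not appear verbatim in the Markdown it was given. The `unsupported_cells` helper and the sample strings are hypothetical, and the substring test is deliberately crude.

```python
import csv
import io


def unsupported_cells(llm_csv: str, source_markdown: str) -> list[str]:
    """Return cell values from the LLM's CSV that never appear in the source text."""
    missing = []
    for row in csv.reader(io.StringIO(llm_csv)):
        for cell in row:
            value = cell.strip()
            if value and value not in source_markdown:
                missing.append(value)
    return missing


# Hypothetical source Markdown and LLM answer; the LLM changed 950 to 980.
source_md = "| Region | Revenue |\n|---|---|\n| North | 1,200 |\n| South | 950 |"
llm_output = 'Region,Revenue\nNorth,"1,200"\nSouth,980\n'
print(unsupported_cells(llm_output, source_md))
```

A flagged cell is not always an error: if the model was asked to normalize values (for example "1,200" to "1200"), the check will flag those too, so treat the result as a list of cells to review rather than a verdict.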
Method 5: Specialized commercial tools
For very high-volume table extraction or particularly hard documents (financial statements with merged cells, scientific data tables with footnotes), commercial tools fill the gap:
- Docparser: cloud-based, template-driven extraction
- Rossum: invoice-specific, very accurate on that vertical
- Hypatos: financial documents, ML-based
- Adobe PDF Extract API: Adobe's own extractor, very strong on Adobe-generated PDFs
Pros: strong on their target verticals; enterprise support
Cons: expensive; vertical-specific; vendor lock-in
Comparison table
| Method | Setup | OCR | Borderless tables | Best for |
|---|---|---|---|---|
| Online (PDF to CSV) | None | Yes | Good | One-off use, mixed quality docs |
| Tabula desktop | Download app | No | Manual | Few tables, precise control |
| Camelot Python | pip install | No (text-based PDFs only) | Decent (stream mode) | Programmatic batch |
| AI-based | API key | Yes (via Markdown) | Excellent | Complex tables, smart filtering |
| Commercial | Vendor onboarding | Yes | Excellent (vertical-specific) | High-volume specific use cases |
What if you want tables AND surrounding context?
If your goal is feeding a document to AI rather than building a spreadsheet pipeline, you don't want CSV — you want the tables in the context of the surrounding document. That's exactly what our PDF tables to Markdown tool does: GFM tables in their original document context.
For pure data-pipeline work, CSV is right. For document understanding (LLM context, knowledge bases, RAG), Markdown with tables-in-context is right. Pick based on what your downstream consumer needs.
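Choosing Markdown first does not lock you out of CSV later, since a GFM table converts mechanically. The following is a rough sketch using only the standard library; `gfm_table_to_csv` is an illustrative helper that assumes one well-formed table with no escaped pipes inside cells.

```python
import csv
import io


def gfm_table_to_csv(markdown_table: str) -> str:
    """Convert one GFM table (header, delimiter row, body rows) to CSV text."""
    rows = []
    for line in markdown_table.strip().splitlines():
        # Remove at most one leading and one trailing pipe, then split into cells.
        inner = line.strip().removeprefix("|").removesuffix("|")
        cells = [c.strip() for c in inner.split("|")]
        # Skip the delimiter row, e.g. |---|:---:|
        if all(c and set(c) <= set("-:") for c in cells):
            continue
        rows.append(cells)
    out = io.StringIO()
    csv.writer(out, lineterminator="\n").writerows(rows)
    return out.getvalue()


table_md = "| Region | Revenue |\n|---|---|\n| North | 1,200 |"
print(gfm_table_to_csv(table_md))
```

The `csv` module takes care of quoting, so a value such as 1,200 stays in a single CSV field.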
Common pitfalls
Currency markers and parenthetical negatives
Financial tables mark currency with symbols such as "$" and write negatives in parentheses, as in "(123)". Most extractors preserve both as text, so your downstream parser may need to convert them to numeric values. Plan for this in your pipeline.
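One way to handle that conversion is sketched below. `parse_amount` is an illustrative helper, and it assumes US-style formatting (comma as thousands separator, period as decimal point); other locales would need different rules.

```python
from decimal import Decimal, InvalidOperation
from typing import Optional


def parse_amount(cell: str) -> Optional[Decimal]:
    """Convert strings like "$1,234.50" or "(123)" to Decimal; None if not numeric."""
    text = cell.strip()
    # Accounting style: a value wrapped in parentheses is negative.
    negative = text.startswith("(") and text.endswith(")")
    if negative:
        text = text[1:-1]
    # Drop currency symbols, thousands separators, and stray spaces.
    for ch in "$€£, ":
        text = text.replace(ch, "")
    try:
        value = Decimal(text)
    except InvalidOperation:
        return None
    if not value.is_finite():
        return None
    return -value if negative else value


for raw in ["$1,234.50", "(123)", "n/a"]:
    print(raw, "->", parse_amount(raw))
```

`Decimal` is used rather than `float` so that amounts such as 0.10 are not subject to binary rounding.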
Multi-page tables
Tables that span pages often have repeating headers on each page. Good extractors stitch them into one logical table; lesser ones produce one table per page that you have to merge.
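If your extractor returns one table per page, the merge can be sketched as follows. `stitch_pages` is a hypothetical helper that assumes every fragment belongs to the same logical table and that the first row of the first fragment is the header.

```python
def stitch_pages(page_tables: list[list[list[str]]]) -> list[list[str]]:
    """Merge per-page fragments of one table, keeping the header row only once."""
    if not page_tables or not page_tables[0]:
        return []
    header = page_tables[0][0]
    merged = [header]
    for fragment in page_tables:
        # Skip the fragment's first row when it repeats the header.
        start = 1 if fragment and fragment[0] == header else 0
        merged.extend(fragment[start:])
    return merged


page1 = [["Date", "Amount"], ["01/02", "10.00"]]
page2 = [["Date", "Amount"], ["01/03", "(4.50)"]]
print(stitch_pages([page1, page2]))
```

Pages where the header is not repeated pass through unchanged. Headers that differ slightly between pages (extra whitespace, "continued" labels) would need normalizing before the comparison.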
Merged cells
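GFM tables don't support merged cells; CSV doesn't either. Extractors flatten a merge in one of two ways: by repeating the value across the merged range, or by keeping it in one cell and leaving the rest empty. Check which your tool does and whether that matches your downstream expectation.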
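Where an extractor leaves the spanned cells empty and your downstream consumer expects the value on every row, a fill-down pass is one option. `fill_down` below is an illustrative helper for vertically merged cells; it assumes an empty cell in the chosen columns always means "same as above", which is not safe for columns where blanks are real data.

```python
def fill_down(rows: list[list[str]], columns: list[int]) -> list[list[str]]:
    """Copy the last non-empty value downward in the given column indexes."""
    filled = []
    last_seen: dict[int, str] = {}
    for row in rows:
        new_row = list(row)
        for col in columns:
            if new_row[col].strip():
                last_seen[col] = new_row[col]
            elif col in last_seen:
                new_row[col] = last_seen[col]
        filled.append(new_row)
    return filled


rows = [["North", "Q1", "100"], ["", "Q2", "120"], ["South", "Q1", "90"]]
print(fill_down(rows, columns=[0]))
```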
Footnoted cells
Cells with footnote markers (a, b, *) need special handling — the marker is data; the footnote text is separate but related. Different extractors handle this differently; spot-check your tool's behavior.
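For tools that leave the marker glued to the value, a small post-processing step can separate the two. The sketch below is a heuristic with an assumed marker set (`*`, `†`, `‡`, or a single lowercase letter after a number); `split_footnote` is a hypothetical helper, and it would misread unit suffixes such as "5m" as markers.

```python
import re

# A value ending in a digit (optionally a closing parenthesis), then a trailing marker.
_MARKER = re.compile(r"^(?P<value>.*?\d\)?)\s*(?P<marker>[*†‡]+|[a-z])$")


def split_footnote(cell: str) -> tuple[str, str]:
    """Return (value, marker); marker is "" when the cell has none."""
    text = cell.strip()
    match = _MARKER.match(text)
    if match:
        return match.group("value"), match.group("marker")
    return text, ""


for raw in ["1,234*", "45.2a", "(123)b", "Total"]:
    print(repr(raw), "->", split_footnote(raw))
```

Keeping the marker in its own column preserves the link to the footnote text while leaving the value clean enough to parse as a number.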
Recommendation
For most users: start with our online PDF to CSV converter (no setup) or PDF to Markdown (if context matters). For programmatic pipelines: Camelot for simple tables, Marker or Docling for complex layouts (we don't ship a public API today, so OSS is the path). For very high volume on a specific document type: commercial tools, where the ROI justifies the cost.