PDF to Markdown Benchmark: 10 Tools Tested on 50 Real Documents
Most "best PDF to Markdown" lists are vibes-based — the author tried two tools on one document and ranked them. We took 50 production documents across five categories and ran them through 10 tools, scoring each on accuracy, structure preservation, OCR quality, and speed. Here's what actually performs.
Methodology
Test corpus: 50 documents, 10 each across five categories — academic papers, financial reports, legal contracts, product manuals, scanned documents. Mix of public-domain and synthetically-redacted enterprise sources.
Scoring rubric (0-10 per dimension):
- Heading detection: do `##` headings match the source's section hierarchy?
- Table fidelity: do GFM tables round-trip into spreadsheets without manual fixup?
- Equation handling: do equations come through as LaTeX (where present)?
- OCR accuracy: character error rate on scanned documents
- Reading order: does multi-column text flow correctly?
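The OCR dimension above is scored as character error rate: Levenshtein edit distance between the reference text and the extracted text, divided by reference length. A minimal sketch (function names are ours, not from any tool):

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance via the standard two-row dynamic program."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,             # deletion
                curr[j - 1] + 1,         # insertion
                prev[j - 1] + (r != h),  # substitution (free on match)
            ))
        prev = curr
    return prev[-1]

def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: edits needed / reference length."""
    return edit_distance(reference, hypothesis) / max(len(reference), 1)
```

A CER of 0.0 is a perfect extraction; 0.25 means one in four reference characters needed an edit.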
Each tool ran with default settings; tuning would change individual scores but not the broad ranking. Speed was measured on a single-document conversion (digital PDF, 30 pages, no OCR needed).
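The table-fidelity criterion can be checked mechanically too: parse the emitted GFM table back into rows and diff them against the source spreadsheet. A minimal parser sketch (it ignores escaped pipes and assumes a well-formed pipe table):

```python
import csv
import io

def gfm_table_to_rows(md: str) -> list[list[str]]:
    """Parse a GFM pipe table into rows of cell strings."""
    rows = []
    for line in md.strip().splitlines():
        line = line.strip()
        if not line.startswith("|"):
            continue
        cells = [c.strip() for c in line.strip("|").split("|")]
        # Skip the header/body separator row (---, :---:, etc.)
        if all(c and set(c) <= set(":-") for c in cells):
            continue
        rows.append(cells)
    return rows

def rows_to_csv(rows: list[list[str]]) -> str:
    """Serialize parsed rows so they can be diffed against the source CSV."""
    buf = io.StringIO()
    csv.writer(buf).writerows(rows)
    return buf.getvalue()
```

If `rows_to_csv(gfm_table_to_rows(output))` matches the source export, the table round-tripped without manual fixup.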
Results table
| Tool | Headings | Tables | Equations | OCR | Reading order | Speed (s/30pg) | Total /50 |
|---|---|---|---|---|---|---|---|
| MDisBetter | 9 | 9 | 9 | 9 | 9 | 2.1 | 45 |
| Marker | 8 | 8 | 10 | 9 | 9 | 1.4 (GPU) | 44 |
| Docling | 9 | 8 | 9 | 9 | 8 | 3.2 (GPU) | 43 |
| LlamaParse | 8 | 9 | 7 | 8 | 8 | 2.8 | 40 |
| MarkItDown | 7 | 5 | 0 | 5 | 6 | 1.1 | 23 |
| Adobe PDF Extract | 9 | 7 | 4 | 9 | 9 | 4.5 | 38 |
| PyMuPDF (text only) | 3 | 2 | 0 | 0 | 5 | 0.3 | 10 |
| pdfplumber | 4 | 7 | 0 | 0 | 5 | 0.5 | 16 |
| Pandoc + pdftotext | 3 | 2 | 0 | 0 | 3 | 0.4 | 8 |
| PDF2MD-OSS | 5 | 3 | 0 | 0 | 4 | 0.6 | 12 |
Top tier — the four that work
1. MDisBetter (45/50)
Strongest balance across all dimensions. Tied with Marker on quality, easier to use (no setup), continuously improving. Best for teams that want quality without operating infrastructure. Free tier covers light use; full comparison on the listicle page.
2. Marker (44/50)
Datalab's open-source converter. Best equation handling in the field; its specialized math model wins on technical content. Requires Python, a GPU, and a ~5 GB model download. Best for teams with ops capacity who need a self-hosted option.
3. Docling (43/50)
IBM Research's vision-language-based parser. Comparable to Marker on most documents; slight edge on figure-heavy content. Heavier setup. Best for complex layouts where the vision model adds context.
4. LlamaParse (40/50)
LlamaIndex's hosted parser. Strong on academic and structured docs. Tighter LlamaIndex integration. Per-page pricing higher than alternatives at scale, but the integration value matters if you're already deep in LlamaIndex.
Mid tier — adequate for some uses
Adobe PDF Extract (38/50)
Adobe's hosted service. Excellent on Adobe-generated PDFs (unsurprisingly), weaker elsewhere. Outputs JSON rather than Markdown, so an extra conversion step is required. Mature enterprise offering, with pricing too steep for casual use.
MarkItDown (23/50)
Microsoft's general-purpose document converter. Wins on format breadth (DOCX, XLSX, PPTX, and PDF in one library), loses on PDF-specific quality: tables flatten, equations are dropped, and OCR is limited. Use it for unified Office ingestion; don't expect best-in-class PDF.
Bottom tier — text only, no structure
PyMuPDF, pdfplumber, Pandoc + pdftotext, PDF2MD-OSS — all give you text out of digital PDFs but produce no real Markdown structure. Tables flatten or are missed. No OCR. No equation support. Useful as primitives in your own pipeline if you'll write the structure layer yourself; not useful as finished tools.
Performance by document type
Different tools win on different categories:
- Academic papers: Marker leads (equations + multi-column), MDisBetter close second
- Financial reports: MDisBetter leads (table preservation), Adobe close
- Legal contracts: MDisBetter and Adobe tied (clause structure preserved)
- Product manuals: MDisBetter leads (mixed layout, embedded images)
- Scanned documents: MDisBetter and Marker tied on OCR quality
If your workload concentrates on one document type, evaluate the top 2-3 tools on your specific category rather than relying on overall rankings.
Cost comparison at scale
For 100,000 pages/month:
- MDisBetter Pro: ~$80/month
- Marker self-hosted: ~$200/month (GPU instance + ops time)
- Docling self-hosted: ~$250/month (heavier model)
- LlamaParse: ~$300/month (per-page pricing)
- Adobe PDF Extract: ~$1,200/month (per-call pricing)
For the self-hosted options, fixed GPU and ops costs stop looking expensive somewhere around 50k-100k pages/month; below that, hosted services win on TCO. Above roughly 500k pages/month, self-hosting tends to win on raw compute.
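The crossover arithmetic is easy to sanity-check yourself. A sketch assuming linear hosted pricing (the per-page rate is back-derived from the ~$80 per 100k pages figure above; real plans are tiered) against a flat $200/month self-hosted floor. The inputs are illustrative, and your own per-page rate and ops cost move the answer:

```python
def hosted_cost(pages: int, per_page: float = 0.0008) -> float:
    """Hosted converter, assuming linear per-page pricing.
    0.0008 USD/page is back-derived from ~$80 per 100k pages."""
    return pages * per_page

def self_hosted_cost(pages: int, fixed: float = 200.0) -> float:
    """Self-hosted floor: GPU instance plus ops time, roughly flat
    until volume forces a second GPU."""
    return fixed

def break_even_pages(per_page: float = 0.0008, fixed: float = 200.0) -> int:
    """Monthly volume at which self-hosting starts to undercut hosted."""
    return round(fixed / per_page)
```

With these inputs the break-even lands at 250k pages/month, inside the range the figures above imply; a cheaper GPU or a pricier hosted tier shifts it accordingly.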
Methodology caveats
Three honest disclosures: (1) we make MDisBetter, so weight our scoring with appropriate skepticism; the methodology and corpus are documented so you can replicate the results. (2) Tool versions matter: this benchmark used the latest stable release of each as of May 2026, and quality shifts as new versions ship. (3) Default settings only: tuning improves all of these tools on specific document types.
For a quicker decision aid focused on the top picks, see our Best PDF to Markdown tools 2026 listicle. For free-tier-only options, see 5 best free PDF to Markdown converters.