PDF to Markdown Accuracy: What to Expect by Document Type
"How accurate is PDF to Markdown?" doesn't have one answer. Accuracy varies by document type, source quality, and what you're measuring — character-level recognition, structural fidelity, table preservation, equation handling. Here are realistic expectations across document categories, with guidance on what to spot-check and what to trust.
What "accuracy" actually means
Three separate things people lump under "accuracy":
- Character accuracy: did each letter come through correctly? Mostly relevant for OCR.
- Structural accuracy: do headings, lists, tables match the source's logical structure?
- Semantic accuracy: would a human reader extract the same meaning from the converted Markdown as from the source PDF?
Different document types stress different dimensions. For a scanned legal contract, character accuracy is critical because legal text must be exact. For an academic paper, structural and equation accuracy matter most. For a financial report, table accuracy is everything.
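Character accuracy is the easiest of the three to put a number on. A minimal sketch, assuming you have a hand-verified reference transcription of a page to compare against: it scores the converted text as 1 minus the edit distance divided by the reference length. The function names are illustrative and not part of any converter API, and the percentages quoted in this article are not claimed to be computed this way.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum inserts, deletes, and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete from a
                            curr[j - 1] + 1,      # insert into a
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def character_accuracy(reference: str, converted: str) -> float:
    """1 - (edit distance / reference length), clamped at 0."""
    if not reference:
        return 1.0 if not converted else 0.0
    distance = edit_distance(reference, converted)
    return max(0.0, 1.0 - distance / len(reference))


# One wrong character out of ten ("1" read as "l") scores 0.9.
score = character_accuracy("Section 10", "Section l0")
```

A score like this says nothing about structural or semantic accuracy: a page can score 99% on characters and still have its columns interleaved.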
Accuracy by document type
Clean digital PDFs (99%+ across the board)
PDFs generated by modern Word, LaTeX, or InDesign with proper text layers and clean fonts. Single-column or simple multi-column. No exotic formatting. Examples: most modern reports, well-typeset books, recent academic papers.
Expected: near-perfect character recovery, headings detected correctly, tables preserved, equations as LaTeX. Errors are rare and tend to appear in unusual notation or complex multi-row table headers.
What to spot-check: nothing systematic. Sample-check one or two pages if it's a critical document; otherwise trust the output.
Multi-column academic papers (95-98%)
Two- or three-column layouts with figures, equations, and references. Examples: arXiv preprints, IEEE/ACM conference papers, journal articles.
Expected: high character accuracy, good column-order recovery, equations as LaTeX (slight degradation on unusual notation), references preserved as a final section.
What to spot-check: column boundaries on pages with unusual layouts (figures spanning columns, sidebars). Equation accuracy if the paper is math-heavy.
Financial reports (94-98%)
Mixed layouts with extensive tables, charts, narrative text. Examples: 10-Ks, annual reports, market research.
Expected: tables come through as GFM (GitHub Flavored Markdown) tables with high fidelity, bordered and borderless alike; charts are replaced with placeholder text ("[chart]"); narrative text is clean. Footnotes stay attached to the data they reference.
What to spot-check: complex tables (merged cells, multi-row headers) for cell alignment. Pie/bar chart captions to ensure context isn't lost when the chart itself isn't reproduced.
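Cell misalignment from merged cells usually shows up as a row with the wrong number of cells, which is cheap to detect before reading the table by eye. A rough sketch, assuming the converted tables are written with a leading pipe on every row (GFM also permits rows without one, which this check ignores); the helper names are hypothetical.

```python
import re


def split_row(row: str) -> list[str]:
    """Split one pipe-delimited table row into cells, honoring escaped pipes."""
    row = row.strip()
    if row.startswith("|"):
        row = row[1:]
    if row.endswith("|") and not row.endswith("\\|"):
        row = row[:-1]
    return [cell.strip() for cell in re.split(r"(?<!\\)\|", row)]


def table_column_mismatches(markdown: str) -> list[tuple[int, int, int]]:
    """Return (line_number, expected_cells, found_cells) for ragged rows."""
    problems = []
    expected = None
    for number, line in enumerate(markdown.splitlines(), start=1):
        stripped = line.strip()
        if not stripped.startswith("|"):
            expected = None  # a non-table line ends the current table
            continue
        cells = split_row(stripped)
        if expected is None:
            expected = len(cells)  # the header row sets the column count
        elif len(cells) != expected:
            problems.append((number, expected, len(cells)))
    return problems
```

A clean result only means every row has the same number of cells. Values shifted into the wrong column still need a visual comparison against the PDF.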
Legal contracts (96-99%)
Numbered clauses, defined terms, exhibits, signatures. Examples: NDAs, service agreements, vendor contracts.
Expected: clause numbering preserved as Markdown ordered-list nesting or numbered headings (depending on depth), defined terms unchanged, exhibits as separate sections.
What to spot-check: cross-references ("as defined in Section 2.1") for accuracy. Signature blocks for proper formatting. The output is for review and search, not legal force — keep the original PDF as canonical.
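The cross-reference check can be partly automated. A minimal sketch, assuming clauses begin a line with their full number (for example "2.1 ..." or "## 2. ...") and references use the wording "Section N.N"; both patterns are assumptions about the output and will need adjusting for nested lists that restart numbering or for contracts that say "Clause" or "Article".

```python
import re

# A line that starts (optionally after '#' heading markers) with a clause number.
CLAUSE_NUMBER = re.compile(r"^\s*(?:#+\s*)?(\d+(?:\.\d+)*)\.?\s+\S", re.MULTILINE)
# An in-text reference such as "Section 2.1".
SECTION_REF = re.compile(r"\bSection\s+(\d+(?:\.\d+)*)")


def dangling_section_refs(markdown: str) -> list[str]:
    """Return referenced section numbers that never appear as a clause number."""
    defined = set(CLAUSE_NUMBER.findall(markdown))
    referenced = set(SECTION_REF.findall(markdown))
    missing = referenced - defined
    return sorted(missing, key=lambda ref: [int(part) for part in ref.split(".")])
```

A reported number means either the clause numbering was altered during conversion or the reference itself was misread, so both places are worth comparing against the PDF.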
Scanned documents (85-98%)
Image-only PDFs that need OCR. Quality varies hugely with source. Examples: faxed contracts, archived documents, photographed pages.
Expected:
- 300+ DPI clean scans: 98%+ character accuracy
- 200 DPI scans: 95-98%
- 150 DPI / fax-quality: 90-95%
- Phone-photographed pages in good light: 95-98%
- Phone-photographed in poor light or skewed: 85-92%
What to spot-check: confused character pairs (l/1, O/0, rn/m), proper nouns, numbers in tables. The OCR engine flags low-confidence regions in the output for inspection.
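For numbers, the l/1 and O/0 confusions can be surfaced mechanically. The sketch below is a heuristic of our own for illustration, not the converter's confidence flagging: it reports tokens that contain at least one digit and become purely numeric once lookalike letters are swapped for the digits they resemble. It does not catch rn/m confusions, which need a dictionary or a human reader.

```python
import re

# Letters commonly misread in place of digits, mapped to the digit they resemble.
LOOKALIKES = str.maketrans({"l": "1", "I": "1", "O": "0", "o": "0"})


def suspicious_numbers(text: str) -> list[tuple[int, str]]:
    """Return (line_number, token) for tokens that look like misread numbers."""
    hits = []
    for number, line in enumerate(text.splitlines(), start=1):
        for token in re.findall(r"\w+", line):
            has_digit = any(ch.isdigit() for ch in token)
            if (has_digit
                    and not token.isdigit()
                    and token.translate(LOOKALIKES).isdigit()):
                hits.append((number, token))
    return hits
```

On the text "Invoice total: 1O5.00 USD" this flags "1O5", while an identifier such as "A1234" is left alone because it does not reduce to a plain number.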
Slide decks exported as PDF (95-98%)
One slide per page, sparse content, large fonts. Examples: PowerPoint exports, Keynote PDFs, Google Slides PDFs.
Expected: each slide as a ## Slide N section, bullets as Markdown lists, speaker notes as blockquotes when present.
What to spot-check: image content (extracted but not described), animation/build-step content (PDF only shows final state).
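Because a deck has one slide per page, a quick completeness check is to confirm that every page produced a heading. A small sketch, assuming the headings are written literally as "## Slide N" as described above and that you know the PDF's page count; the function name is hypothetical.

```python
import re


def missing_slides(markdown: str, page_count: int) -> list[int]:
    """Return slide numbers in 1..page_count that have no '## Slide N' heading."""
    found = {
        int(n)
        for n in re.findall(r"^## Slide (\d+)\b", markdown, flags=re.MULTILINE)
    }
    return [n for n in range(1, page_count + 1) if n not in found]
```

An empty list means every slide is represented; it does not tell you whether the bullets under each heading are complete.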
Handwritten content (50-90%)
Hand-printed: usable, with confused-character corrections. Cursive: variable. Doctor-style or rapid notes: not reliable enough to trust.
What to spot-check: everything. Treat the converted Markdown as a draft; do not rely on it for high-stakes use without close review.
What systematically goes wrong
Tables in scans
OCR + table recovery is harder than either alone. Cleanly-bordered tables in 300 DPI scans usually come through; borderless tables in scans often need manual cleanup.
Equations in scans
Equation OCR on scanned source is a known weak spot. For equation-heavy scanned documents, expect placeholder text where equations should be — manual replacement with proper LaTeX is the realistic finishing step.
Unusual fonts and decorative elements
Display fonts, hand-lettering, calligraphy: OCR struggles. Rotated or vertically-stacked text: usually works but worth verifying.
Multi-language documents
Mixed scripts (e.g., English text with Arabic quotations) reduce accuracy on the secondary language. The OCR pipeline auto-detects and switches per region but isn't perfect.
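To decide which passages deserve a closer look, it helps to find the lines that mix scripts. The sketch below uses the first word of each letter's Unicode character name as a rough proxy for its script. That is a heuristic of ours, not how the OCR pipeline detects languages, and it lumps some scripts together.

```python
import unicodedata


def scripts_in(text: str) -> set[str]:
    """Approximate the scripts used in text from Unicode character names."""
    scripts = set()
    for ch in text:
        if ch.isalpha():
            name = unicodedata.name(ch, "")
            if name:
                scripts.add(name.split(" ")[0])  # e.g. "LATIN", "ARABIC"
    return scripts


def mixed_script_lines(markdown: str) -> list[int]:
    """Return 1-based line numbers whose letters come from more than one script."""
    return [
        number
        for number, line in enumerate(markdown.splitlines(), start=1)
        if len(scripts_in(line)) > 1
    ]
```

The returned line numbers are a reasonable place to start the side-by-side comparison with the PDF.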
How to verify your converted output
Quick spot-check workflow:
- Open the converted Markdown in a preview tool (VS Code, Obsidian, GitHub)
- Compare side-by-side with the source PDF
- Sample-check 3-5 pages from different sections
- Look for: missing tables, scrambled paragraphs, garbled text, dropped equations
Five minutes of spot-checking saves hours of downstream confusion. For high-stakes documents (legal, medical, regulatory), do a more thorough review — the source PDF remains canonical.
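To make "3-5 pages from different sections" concrete, one option is to split the document into equal slices and pick one random page from each, so the sample cannot cluster at the front. A minimal sketch; the function name and the equal-slice strategy are our own illustration.

```python
import random


def pages_to_check(page_count: int, samples: int = 5, seed=None) -> list[int]:
    """Pick one random 1-based page number from each equal slice of the document."""
    if page_count < 1:
        return []
    samples = max(1, min(samples, page_count))
    rng = random.Random(seed)  # pass a seed to get a repeatable selection
    picks = []
    for i in range(samples):
        start = i * page_count // samples + 1
        end = (i + 1) * page_count // samples
        picks.append(rng.randint(start, end))
    return picks
```

For a 40-page report with samples=4, this returns one page from each of pages 1-10, 11-20, 21-30, and 31-40.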
When to choose a different tool
If your documents consistently fall outside our strong zones (purely handwritten content, very low-quality scans, exotic typesetting), consider whether a specialized tool fits better. For handwriting specifically, Google Document AI and AWS Textract have larger models tuned for that case. For commercial OCR on bad scans, ABBYY FineReader is still the gold standard.
Our converter handles the vast majority of digital and reasonably well-scanned PDFs at the accuracy levels above. For the broader competitive picture, see our 10-tool benchmark.