
PDF to Markdown Accuracy: What to Expect by Document Type

"How accurate is PDF to Markdown?" doesn't have one answer. Accuracy varies by document type, source quality, and what you're measuring — character-level recognition, structural fidelity, table preservation, equation handling. Here are realistic expectations across document categories, with guidance on what to spot-check and what to trust.

What "accuracy" actually means

Three separate things people lump under "accuracy":

  1. Character accuracy: is every word transcribed correctly?
  2. Structural accuracy: are headings, lists, and reading order preserved?
  3. Element accuracy: do tables come through as tables and equations as LaTeX?

Different document types stress different dimensions. A scanned legal contract: character accuracy is critical (legal text is exact). An academic paper: structural and equation accuracy matter most. A financial report: table accuracy is everything.
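
If you have a trusted transcription of a page, you can estimate character accuracy yourself. A minimal sketch using only Python's standard library (difflib gives a rough edit-distance-style ratio; the function name is ours, not part of any product API):

```python
import difflib

def char_accuracy(ground_truth: str, converted: str) -> float:
    """Rough character-level accuracy: the fraction of source
    characters that survive conversion intact."""
    matcher = difflib.SequenceMatcher(None, ground_truth, converted)
    matched = sum(block.size for block in matcher.get_matching_blocks())
    return matched / max(len(ground_truth), 1)

# One transcribed sentence vs. the converter's output (O/0 swap)
truth = "Payment is due within 30 days of invoice."
output = "Payment is due within 3O days of invoice."
print(f"{char_accuracy(truth, output):.1%}")  # 97.6%
```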

Accuracy by document type

Clean digital PDFs (99%+ across the board)

PDFs generated by modern Word, LaTeX, or InDesign with proper text layers and clean fonts. Single-column or simple multi-column. No exotic formatting. Examples: most modern reports, well-typeset books, recent academic papers.

Expected: near-perfect character recovery, headings detected correctly, tables preserved, equations as LaTeX. The rare errors cluster in unusual notation or complex multi-row table headers.

What to spot-check: nothing systematic. Sample-check one or two pages if it's a critical document; otherwise trust the output.

Multi-column academic papers (95-98%)

Two- or three-column layouts with figures, equations, and references. Examples: arXiv preprints, IEEE/ACM conference papers, journal articles.

Expected: high character accuracy, good column-order recovery, equations as LaTeX (slight degradation on unusual notation), references preserved as a final section.

What to spot-check: column boundaries on pages with unusual layouts (figures spanning columns, sidebars). Equation accuracy if the paper is math-heavy.

Financial reports (94-98%)

Mixed layouts with extensive tables, charts, narrative text. Examples: 10-Ks, annual reports, market research.

Expected: tables come through as GitHub-Flavored Markdown (GFM) with high fidelity, borderless and bordered alike; charts are replaced with placeholder text ("[chart]"); narrative text is clean. Footnotes attach correctly to their data references.

What to spot-check: complex tables (merged cells, multi-row headers) for cell alignment. Pie/bar chart captions to ensure context isn't lost when the chart itself isn't reproduced.
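
One cheap automated check for converted tables: every row of a GFM table should have the same number of cells, and merged cells or mangled multi-row headers usually break that invariant. A small illustrative sketch (not part of the product):

```python
def row_width(row: str) -> int:
    """Cell count of a GFM table row like '| Q1 | Q2 |' (ignores escaped pipes)."""
    return row.strip().strip("|").count("|") + 1

def check_gfm_tables(markdown: str) -> list[str]:
    """Flag tables whose rows disagree on cell count."""
    problems, table = [], []
    # Trailing sentinel line flushes a table that ends the file
    for lineno, line in enumerate(markdown.splitlines() + [""], start=1):
        if line.lstrip().startswith("|"):
            table.append((lineno, line))
        elif table:
            widths = {row_width(r) for _, r in table}
            if len(widths) > 1:
                problems.append(f"table at line {table[0][0]}: cell counts {sorted(widths)}")
            table = []
    return problems
```

Run it over the converted 10-K and hand-check only the tables it flags.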

Legal contracts (96-99%)

Numbered clauses, defined terms, exhibits, signatures. Examples: NDAs, service agreements, vendor contracts.

Expected: clause numbering preserved as Markdown ordered-list nesting or numbered headings (depending on depth), defined terms unchanged, exhibits as separate sections.

What to spot-check: cross-references ("as defined in Section 2.1") for accuracy. Signature blocks for proper formatting. The output is for review and search, not legal force — keep the original PDF as canonical.
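
Cross-reference checking is scriptable too. A hedged sketch that collects clause numbers from numbered headings and flags references to sections that don't exist; the regexes are our assumptions about how clause numbers come out, so adjust them to your output:

```python
import re

def check_cross_refs(markdown: str) -> list[str]:
    """Flag 'Section X.Y' references with no matching numbered heading."""
    # Matches headings like '## 2.1 Definitions' or plain '2.1. Definitions'
    headings = set(re.findall(r"^#{0,6}\s*(\d+(?:\.\d+)*)[.\s]", markdown, re.M))
    refs = re.findall(r"Section\s+(\d+(?:\.\d+)*)", markdown)
    return [f"Section {ref} referenced but not found" for ref in refs if ref not in headings]
```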

Scanned documents (85-98%)

Image-only PDFs that need OCR. Quality varies hugely with source. Examples: faxed contracts, archived documents, photographed pages.

Expected: strongly dependent on scan quality. Clean 300 DPI scans of typed text land near the top of the range, approaching digital-PDF accuracy; low-resolution, skewed, or degraded scans (faxes, photographed pages) fall toward the bottom.

What to spot-check: confused character pairs (l/1, O/0, rn/m), proper nouns, numbers in tables. The OCR engine flags low-confidence regions in the output for inspection.
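
Most l/1 and O/0 swaps produce tokens that mix letters and digits, which makes them easy to surface mechanically (rn/m confusions need a dictionary check instead). A rough sketch; it will also flag legitimate IDs like "A380", so treat hits as candidates, not errors:

```python
import re

# A token containing both a letter and a digit is where l/1 and O/0 swaps hide
SUSPECT = re.compile(r"\b(?=\w*[A-Za-z])(?=\w*\d)\w+\b")

def suspect_tokens(text: str) -> list[str]:
    """Return tokens worth eyeballing against the source scan."""
    return SUSPECT.findall(text)

print(suspect_tokens("Invoice t0tal: $4,5O0 due in 3O days"))
# ['t0tal', '5O0', '3O']
```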

Slide decks exported as PDF (95-98%)

One slide per page, sparse content, large fonts. Examples: PowerPoint exports, Keynote PDFs, Google Slides PDFs.

Expected: each slide as a ## Slide N section, bullets as Markdown lists, speaker notes as blockquotes when present.

What to spot-check: image content (extracted but not described), animation/build-step content (PDF only shows final state).
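
Because each slide arrives as a ## Slide N section, a converted deck is easy to split back into per-slide chunks for review. A minimal sketch, assuming the ## Slide N convention described above:

```python
import re

def split_slides(markdown: str) -> dict[int, str]:
    """Split converted deck Markdown into {slide_number: body}."""
    parts = re.split(r"^## Slide (\d+)\s*$", markdown, flags=re.M)
    # re.split with a capture group yields [preamble, n1, body1, n2, body2, ...]
    return {int(n): body.strip() for n, body in zip(parts[1::2], parts[2::2])}

with open("deck.md", encoding="utf-8") as f:
    slides = split_slides(f.read())
print(slides[3])  # review slide 3 against the original
```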

Handwritten content (50-90%)

Expected: hand-printed text is usable with confused-character corrections; cursive is variable; doctor-style or rapid notes are not reliable enough to trust.

What to spot-check: everything. Treat the converted Markdown as a draft; do not rely on it for high-stakes use without close review.

What systematically goes wrong

Tables in scans

OCR + table recovery is harder than either alone. Cleanly bordered tables in 300 DPI scans usually come through; borderless tables in scans often need manual cleanup.

Equations in scans

Equation OCR on scanned source is a known weak spot. For equation-heavy scanned documents, expect placeholder text where equations should be — manual replacement with proper LaTeX is the realistic finishing step.

Unusual fonts and decorative elements

Display fonts, hand-lettering, calligraphy: OCR struggles. Rotated or vertically stacked text: usually works but worth verifying.

Multi-language documents

Mixed scripts (e.g., English text with Arabic quotations) reduce accuracy on the secondary language. The OCR pipeline auto-detects and switches per region but isn't perfect.

How to verify your converted output

Quick spot-check workflow:

  1. Open the converted Markdown in a preview tool (VS Code, Obsidian, GitHub)
  2. Compare side-by-side with the source PDF
  3. Sample-check 3-5 pages from different sections
  4. Look for: missing tables, scrambled paragraphs, garbled text, dropped equations

Five minutes of spot-checking saves hours of downstream confusion. For high-stakes documents (legal, medical, regulatory), do a more thorough review — the source PDF remains canonical.
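
To make the spot-check repeatable, a structural summary of the output catches gross failures fast: zero table rows in a converted 10-K is an immediate red flag. A minimal sketch:

```python
import re
import sys

def summarize(markdown: str) -> dict[str, int]:
    """Count structural elements to sanity-check against a skim of the PDF."""
    return {
        "headings":   len(re.findall(r"^#{1,6} ", markdown, re.M)),
        "table_rows": len(re.findall(r"^\s*\|", markdown, re.M)),
        "equations":  len(re.findall(r"\$\$", markdown)) // 2,
        "list_items": len(re.findall(r"^\s*(?:[-*+]|\d+\.) ", markdown, re.M)),
    }

if __name__ == "__main__":
    with open(sys.argv[1], encoding="utf-8") as f:
        print(summarize(f.read()))
```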

When to choose a different tool

If your documents consistently fall outside our strong zones (pure handwritten content, very low-quality scans, exotic typesetting), consider whether a specialized tool fits better. For handwriting specifically, Google Document AI and AWS Textract have larger models tuned for that case. For commercial OCR on bad scans, ABBYY FineReader is still the gold standard.

Our converter handles the vast majority of digital and reasonably scanned PDFs at the accuracy levels above. For the broader competitive picture, see our 10-tool benchmark.

Frequently asked questions

How do I check OCR confidence on my converted document?
The API exposes per-region confidence scores; the web UI marks suspect regions with subtle highlighting in the preview. For high-stakes content, call the API with `?include_confidence=true` and review regions below 0.85 confidence.
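
A sketch of that workflow. The endpoint URL and response shape here are placeholders (check the API docs); only the `?include_confidence=true` parameter and the 0.85 threshold come from this article:

```python
import requests

# Hypothetical endpoint and response shape; only ?include_confidence=true
# and the 0.85 threshold come from the docs above.
with open("scan.pdf", "rb") as f:
    resp = requests.post(
        "https://api.example.com/v1/convert?include_confidence=true",
        files={"file": f},
        headers={"Authorization": "Bearer YOUR_API_KEY"},
    )
resp.raise_for_status()

for region in resp.json().get("regions", []):
    if region.get("confidence", 1.0) < 0.85:
        print(f"page {region.get('page')}: confidence {region['confidence']:.2f}")
```
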
Will accuracy improve if I rescan the source at higher DPI?
Yes, especially for the borderline cases. Going from 150 DPI to 300 DPI typically gains 3-5 percentage points of OCR accuracy. Going above 300 DPI helps marginally; going above 600 DPI doesn't help at all.
Can I get an accuracy guarantee for legal or medical work?
We don't make accuracy guarantees on individual documents; OCR is probabilistic. For legal and medical workloads, the right pattern is converted Markdown for search and review, with the original PDF kept as canonical. Our Enterprise tier adds a BAA and audit logging.