May 10, 2026 · 8 min read · MDisBetter

Scanned PDF to Markdown with OCR — Complete Guide

A scanned PDF is just images of pages — there's no text underneath. Most PDF tools stare at it blankly. The fix is OCR (optical character recognition), but doing OCR well takes more than running Tesseract on each page. Here's the complete workflow that produces usable Markdown from any scanned source, plus the tips that separate "OCR worked" from "OCR worked well".

What is a scanned PDF?

Two PDFs can look identical to a human and be totally different to a machine. A digital PDF generated from Word or LaTeX has a text layer — invisible-to-you data that maps glyphs to actual Unicode characters. A scanned PDF is just rasterized images: photographs of pages, no text layer at all. You can verify by trying to select text in a PDF viewer; if the selection rectangle covers blank space and grabs nothing, it's a scan.

Common sources of scanned PDFs: archived documents, faxes saved as PDF, photos of pages taken with a phone, legal exhibits printed and rescanned, old books digitized before OCR was standard. Modern scanners often add an OCR text layer automatically, but the quality varies wildly — and many older or budget workflows skip it.

How OCR works (briefly)

OCR converts pixels to characters in three stages. First, page segmentation separates blocks of text from images and whitespace. Second, line and word detection groups pixels into reading units. Third, character recognition uses a model (classically feature-based classifiers, later CNN and LSTM networks, increasingly transformer-based) to predict characters from their pixel patterns.
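To make the second stage concrete, here is a minimal sketch of line detection using a horizontal projection profile, a classic textbook technique. It is an illustration only, not how Tesseract or our pipeline does it. The input format (a binarized page as nested lists, 1 for ink and 0 for background) and the function name `find_text_lines` are assumptions made for the example.

```python
from typing import List, Tuple


def find_text_lines(image: List[List[int]]) -> List[Tuple[int, int]]:
    """Return (first_row, last_row) for each run of consecutive rows that contain ink.

    `image` is a binarized page: 1 = dark pixel, 0 = background (assumed format).
    """
    lines: List[Tuple[int, int]] = []
    start = None  # row index where the current text line began, if any
    for y, row in enumerate(image):
        has_ink = any(row)
        if has_ink and start is None:
            start = y  # a new text line begins
        elif not has_ink and start is not None:
            lines.append((start, y - 1))  # a blank row closes the line
            start = None
    if start is not None:
        lines.append((start, len(image) - 1))  # page ended inside a line
    return lines
```

Real pages are skewed and noisy, so production engines use far more robust layout analysis. The sketch only shows the idea of grouping pixels into reading units.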

Modern OCR (Tesseract 5, Google Document AI, AWS Textract, our pipeline) achieves 99%+ accuracy on clean 300+ DPI typed documents and 90–98% on lower-quality sources. Handwriting OCR is a separate model class with much wider variance — neat block printing works, fast cursive remains unreliable.
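Percentages like these are commonly read as character-level accuracy, meaning 1 minus the character error rate (CER). This article doesn't state which metric its figures use, so treat that reading as an assumption. If you have a hand-corrected reference for a page, a sketch like the following lets you measure it yourself. The function names are illustrative.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum insertions, deletions, and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def character_accuracy(reference: str, ocr_output: str) -> float:
    """1 - CER, clamped at 0. `reference` is the hand-checked ground truth."""
    if not reference:
        return 1.0 if not ocr_output else 0.0
    cer = edit_distance(reference, ocr_output) / len(reference)
    return max(0.0, 1.0 - cer)
```

For example, reading "modern" as "modem" (the rn/m confusion) costs two edits out of six characters, so `character_accuracy("modern", "modem")` is about 0.67 for that single word.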

Step-by-step with MDisBetter

1. Detect that your PDF is scanned

You don't actually need to do this manually — our converter detects scanned PDFs automatically. The detection signal: text-layer density below ~10% of typical page area, or text characters that look like prior bad OCR (mojibake, garbled spacing).
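If you want to run a similar check yourself, the sketch below approximates the idea. It is not our converter's actual logic. It assumes you already have the text-layer string for each page (any PDF library can extract that), it stands in for "density" with a character count relative to a typical typed page, and every constant other than the ~10% figure is an assumption.

```python
from typing import List

TYPICAL_CHARS_PER_PAGE = 2000  # assumption: rough size of a full page of typed text
DENSITY_THRESHOLD = 0.10       # the ~10% figure mentioned above
GARBLED_THRESHOLD = 0.05       # assumption: 5% junk characters is already suspicious


def looks_garbled(text: str) -> bool:
    """Crude test for a bad prior OCR layer: many replacement or control characters."""
    if not text:
        return False
    bad = sum(1 for ch in text
              if ch == "\ufffd" or (ord(ch) < 32 and ch not in "\n\r\t"))
    return bad / len(text) > GARBLED_THRESHOLD


def page_needs_ocr(text: str) -> bool:
    """A page needs OCR if its text layer is nearly empty or looks garbled."""
    density = len(text.strip()) / TYPICAL_CHARS_PER_PAGE
    return density < DENSITY_THRESHOLD or looks_garbled(text)


def pdf_needs_ocr(page_texts: List[str]) -> bool:
    """Treat the document as scanned when more than half of its pages need OCR."""
    if not page_texts:
        return False
    flagged = sum(page_needs_ocr(t) for t in page_texts)
    return flagged / len(page_texts) > 0.5
```

Deciding per page rather than per document matters for mixed files, such as a digital contract with a scanned signature page appended.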

2. Upload to the converter

Drop your PDF into the PDF to Markdown converter. No special flag needed — the converter automatically routes scanned PDFs through the OCR pipeline before applying the same Markdown layout reconstruction as for digital PDFs.

3. Wait for conversion

OCR is the slow step in the pipeline. Expect 3–10× longer processing than digital PDFs of the same length: a 50-page scan typically takes 15–60 seconds depending on quality and language. The progress indicator shows where you are.

4. Review the output

The output is regular Markdown — headings, paragraphs, lists, tables. For high-stakes documents (legal contracts, medical records), spot-check a few sections against the source for OCR errors. The most common errors: similar character pairs (l/1, O/0, rn/m), unusual proper nouns, and column-boundary confusion on complex layouts.
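A spot-check can be partly automated. The following is a rough sketch, under the assumption that you only want a list of tokens worth a second look. It flags tokens where a digit sits next to l, I, O, or o, and words where swapping "rn" for "m" produces a word from a vocabulary you supply. The function name and the optional `vocabulary` parameter are illustrative, and the heuristics will produce both misses and false alarms.

```python
import re
from typing import Iterable, List, Set

# A digit directly next to l, I, O, or o often hides an l/1 or O/0 swap.
DIGIT_LETTER_MIX = re.compile(r"\d[lIOo]|[lIOo]\d")


def suspicious_tokens(text: str, vocabulary: Iterable[str] = ()) -> List[str]:
    """Return tokens that deserve a manual look, in document order."""
    vocab: Set[str] = {w.lower() for w in vocabulary}
    flagged: List[str] = []
    for token in re.findall(r"[A-Za-z0-9]+", text):
        lower = token.lower()
        if DIGIT_LETTER_MIX.search(token):
            flagged.append(token)
        elif ("rn" in lower and lower not in vocab
              and lower.replace("rn", "m") in vocab):
            # Unknown word that becomes a known one when "rn" is read as "m".
            flagged.append(token)
    return flagged
```

For instance, `suspicious_tokens("Invoice 2O24 total l00 for the docurnent", {"document", "invoice"})` flags `2O24`, `l00`, and `docurnent`, while a legitimate word like "modern" passes as long as it is in the vocabulary.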

Tips for better OCR results

Source quality matters most

The biggest determinant of OCR accuracy is the quality of the original scan. From best to worst:

Best: 300+ DPI, color or grayscale, typed text, unfolded pages
Good: 200 DPI, B&W, typed text, slight skew
OK: 150 DPI scans, faxes, phone photos in good light
Hard: <100 DPI, photocopies of photocopies, poorly-lit photos, heavily-skewed pages
Borderline: handwriting, decorative fonts, rotated text

If you can re-scan the source at higher quality, do that before converting. The accuracy gains from a 300 DPI rescan compound through the rest of the pipeline.
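To see where an existing scan falls on that scale, you can estimate its effective DPI from the pixel width of a page image and the physical width of the paper. The sketch below maps the result onto the tiers listed above. Two assumptions to note: the list leaves the 100 to 150 DPI range unspecified, so the sketch lumps it in with "hard", and DPI is only one of the factors the tiers describe.

```python
def effective_dpi(pixel_width: int, page_width_inches: float) -> float:
    """Dots per inch implied by the image width and the physical page width."""
    return pixel_width / page_width_inches


def quality_tier(dpi: float) -> str:
    """Map a DPI value onto the tiers above (DPI only; ignores skew, lighting, etc.)."""
    if dpi >= 300:
        return "best"
    if dpi >= 200:
        return "good"
    if dpi >= 150:
        return "ok"
    return "hard"  # assumption: anything under 150 DPI is treated as hard
```

A US Letter page (8.5 inches wide) scanned to a 2550-pixel-wide image works out to 300 DPI, while the same page at 1275 pixels is only 150 DPI.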

Language matters

Our OCR pipeline auto-detects the language and switches models accordingly. Latin-script languages (English, French, German, Spanish, Italian, Portuguese, Dutch, Polish) are well supported. Cyrillic, CJK (Chinese, Japanese, Korean), and Arabic are supported with separate language packs and slightly lower accuracy. Mixed-language documents work, but with modestly reduced accuracy on the secondary language.

Pre-process if you can

If you have control over the source PDF, simple pre-processing improves OCR results: deskew tilted scans, increase contrast on faded text, remove shadows from phone-photographed pages. Free tools like ScanTailor or commercial alternatives like Adobe Scan handle this well. Our converter does light auto-correction but won't fix severe issues.
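As an illustration of what "increase contrast" means in practice, here is a minimal min-max contrast stretch on a grayscale image represented as nested lists of 0 to 255 values. The representation and the function name are assumptions for the example. Dedicated tools do this (and deskewing, which is considerably harder) much better.

```python
from typing import List


def stretch_contrast(pixels: List[List[int]]) -> List[List[int]]:
    """Linearly rescale grayscale values so the darkest pixel maps to 0 and the lightest to 255."""
    flat = [p for row in pixels for p in row]
    if not flat:
        return []
    lo, hi = min(flat), max(flat)
    if hi == lo:
        return [row[:] for row in pixels]  # flat image: nothing to stretch
    span = hi - lo
    # Integer arithmetic with round-half-up keeps the result deterministic.
    return [[((p - lo) * 255 + span // 2) // span for p in row] for row in pixels]
```

A faded scan whose values only span 100 to 200 gets spread over the full 0 to 255 range, which makes the gap between ink and paper easier for the recognizer to pick up.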

Limitations to expect

Tables in scans

OCR + table reconstruction is hard. Cleanly bordered tables in 300 DPI scans usually come through as GitHub Flavored Markdown (GFM) tables. Borderless tables, merged cells, or tables with rotated header text are best-effort and often need manual cleanup. For table-critical documents, check whether the source exists as a digital file somewhere, because converting from digital is always cleaner.
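When a table does need manual cleanup, it is often quicker to retype the cells as rows and regenerate the GFM table than to fix pipes by hand. A small sketch, assuming the first row is the header (the function name is illustrative):

```python
from typing import List


def to_gfm_table(rows: List[List[str]]) -> str:
    """Render rows of cells as a GFM table; the first row is used as the header."""
    if not rows:
        return ""
    width = max(len(r) for r in rows)

    def fmt(cells: List[str]) -> str:
        padded = list(cells) + [""] * (width - len(cells))  # fill short rows
        escaped = [c.replace("|", "\\|").strip() for c in padded]
        return "| " + " | ".join(escaped) + " |"

    header, *body = rows
    lines = [fmt(header), "| " + " | ".join(["---"] * width) + " |"]
    lines += [fmt(r) for r in body]
    return "\n".join(lines)
```

Short rows are padded with empty cells, and literal pipe characters are escaped so they don't split a cell. Merged cells have no GFM equivalent, so they still need a manual decision.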

Handwriting

Block printing in clean ink: usable, with manual review of confused characters. Cursive: variable. Doctor-style notes: don't trust the output without close review. The OCR engine flags low-confidence regions in the output for inspection.

Equations and special notation

Mathematical equations in scanned source rarely OCR cleanly — most engines treat them as garbled text. For equation-heavy scanned documents (old physics papers, engineering manuals), the output Markdown will have placeholder text where equations should be. Manual replacement with proper LaTeX is the realistic finishing step.

What to do with the output

The Markdown output behaves like any other Markdown — feed it to ChatGPT, drop it in Obsidian, push to a docs site, anything you'd do with hand-written Markdown. For AI workflows specifically, the converted Markdown carries the same token-savings advantage as digital-PDF conversions: 60–80% fewer tokens than re-uploading the scanned PDF directly. See how to reduce ChatGPT token usage for the cost math.

For workflows where you have many scanned documents to process, MDisBetter today is web-only: you drag each PDF into the converter one at a time. For true automation across hundreds of scans, open-source tools are the realistic path. You can pair an OCR engine (Tesseract, EasyOCR, or PaddleOCR) with your own Markdown post-processing, or use an end-to-end converter like Marker, which has built-in OCR via Surya. The full breakdown is in batch convert 100+ PDFs to Markdown.
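For a sense of what the do-it-yourself route looks like, here is a minimal sketch that pairs PyMuPDF (for rendering pages) with Tesseract through the `pytesseract` wrapper. It assumes PyMuPDF, pytesseract, Pillow, and a local Tesseract install are available. The function names and folder layout are illustrative, and the output is plain OCR text separated by page rules, with no heading, list, or table reconstruction.

```python
from pathlib import Path
from typing import Iterable, List


def pages_to_markdown(page_texts: Iterable[str]) -> str:
    """Join per-page OCR text into one Markdown string, with a rule between pages."""
    cleaned = [t.strip() for t in page_texts if t.strip()]
    if not cleaned:
        return ""
    return "\n\n---\n\n".join(cleaned) + "\n"


def ocr_pdf(pdf_path: Path, lang: str = "eng", dpi: int = 300) -> List[str]:
    """OCR every page of one PDF and return the text of each page."""
    # Third-party imports live here so the helpers above work without them.
    import io
    import fitz  # PyMuPDF
    import pytesseract
    from PIL import Image

    texts: List[str] = []
    with fitz.open(str(pdf_path)) as doc:
        for page in doc:
            pix = page.get_pixmap(dpi=dpi)  # rasterize the page
            image = Image.open(io.BytesIO(pix.tobytes("png")))
            texts.append(pytesseract.image_to_string(image, lang=lang))
    return texts


def convert_folder(src: Path, dst: Path) -> List[Path]:
    """Convert every *.pdf in `src` to a .md file in `dst`; returns the written paths."""
    dst.mkdir(parents=True, exist_ok=True)
    written: List[Path] = []
    for pdf_path in sorted(src.glob("*.pdf")):
        out = dst / (pdf_path.stem + ".md")
        out.write_text(pages_to_markdown(ocr_pdf(pdf_path)), encoding="utf-8")
        written.append(out)
    return written
```

Calling `convert_folder(Path("scans"), Path("markdown"))` would process a hypothetical `scans` directory sequentially. For hundreds of files you would likely add error handling per file and some parallelism, which this sketch leaves out.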

Frequently asked questions

Do I need to install Tesseract or any OCR tools?

No — OCR runs on our servers as part of the conversion. You just upload the PDF and download Markdown. The OCR engine and language packs are managed for you.

How accurate is OCR on phone-photographed pages?

Variable, mostly determined by lighting and angle. Well-lit, straight-on photos at modern phone resolution: 95–98% accuracy. Tilted, shadowed, or low-light photos: drops to 80–90% with more confused-character errors. Re-shoot in better light if accuracy matters.

Can I batch OCR many scans at once?

Not via MDisBetter directly: we're a web tool today, with no API. For true batch OCR + Markdown conversion, use OSS like Marker (built-in OCR via Surya), Docling, or pair Tesseract with PyMuPDF. See our batch conversion guide (/blog/batch-convert-100-pdfs-to-markdown) for code examples.