Scanned PDF to Markdown with OCR — Complete Guide
A scanned PDF is just images of pages, with no text underneath. Most PDF tools stare at it blankly. The fix is OCR (optical character recognition), but doing OCR well takes more than running Tesseract on each page. Here's the complete workflow for getting usable Markdown from a scanned source, plus the tips that separate "OCR worked" from "OCR worked well".
What is a scanned PDF?
Two PDFs can look identical to a human and be totally different to a machine. A digital PDF generated from Word or LaTeX has a text layer — invisible-to-you data that maps glyphs to actual Unicode characters. A scanned PDF is just rasterized images: photographs of pages, no text layer at all. You can verify by trying to select text in a PDF viewer; if the selection rectangle covers blank space and grabs nothing, it's a scan.
Common sources of scanned PDFs: archived documents, faxes saved as PDF, photos of pages taken with a phone, legal exhibits printed and rescanned, old books digitized before OCR was standard. Modern scanners often add an OCR text layer automatically, but the quality varies wildly — and many older or budget workflows skip it.
How OCR works (briefly)
OCR converts pixels to characters in three stages. First, page segmentation separates blocks of text from images and whitespace. Second, line and word detection groups pixels into reading units. Third, character recognition uses a model to predict characters from their pixel patterns. Older engines matched hand-crafted shape features; later ones (Tesseract 4 and 5, for example) use LSTM neural networks, and newer systems are increasingly transformer-based.
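To make the second stage concrete, here is a minimal sketch of one classic line-detection technique, the horizontal projection profile: count the ink in each pixel row and treat each run of inked rows as a text line. It assumes the page is already binarized into rows of 0 (background) and 1 (ink). The function name and the `min_ink` parameter are illustrative, and real engines use more robust methods than this.

```python
def find_text_lines(binary_page: list[list[int]], min_ink: int = 1) -> list[tuple[int, int]]:
    """Return (first_row, last_row) for each run of rows that contain ink.

    binary_page: rows of pixels, 1 = ink, 0 = background (assumed input format).
    min_ink: minimum inked pixels for a row to count as text (illustrative knob).
    """
    lines = []
    start = None
    for y, row in enumerate(binary_page):
        has_ink = sum(row) >= min_ink
        if has_ink and start is None:
            start = y                      # a new text line begins
        elif not has_ink and start is not None:
            lines.append((start, y - 1))   # the line ended on the previous row
            start = None
    if start is not None:                  # page ended while still inside a line
        lines.append((start, len(binary_page) - 1))
    return lines
```

Skewed pages break this approach quickly, because a tilted line smears ink across the gaps between rows. That is one reason deskewing helps so much.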
Modern OCR (Tesseract 5, Google Document AI, AWS Textract, our pipeline) achieves 99%+ accuracy on clean 300+ DPI typed documents and 90–98% on lower-quality sources. Handwriting OCR is a separate model class with much wider variance — neat block printing works, fast cursive remains unreliable.
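Accuracy figures like these are usually reported per character. The sketch below assumes that convention (the numbers above don't state which metric they use) and computes accuracy as one minus the character error rate: the edit distance between a hand-checked reference and the OCR output, divided by the reference length. The function names are illustrative.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions, or substitutions."""
    if len(a) < len(b):
        a, b = b, a                        # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution or match
        previous = current
    return previous[-1]


def character_accuracy(reference: str, ocr_output: str) -> float:
    """1 - CER, clamped at 0. Assumes `reference` is a trusted transcription."""
    if not reference:
        raise ValueError("reference text must not be empty")
    cer = levenshtein(reference, ocr_output) / len(reference)
    return max(0.0, 1.0 - cer)
```

Transcribing one page by hand and running it through `character_accuracy` is a cheap way to find out which accuracy band a particular batch of scans falls into.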
Step-by-step with MDisBetter
1. Detect that your PDF is scanned
You don't actually need to do this manually — our converter detects scanned PDFs automatically. The detection signal: text-layer density below ~10% of typical page area, or text characters that look like prior bad OCR (mojibake, garbled spacing).
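If you want to run a similar check yourself, here is a rough sketch. It is not the converter's implementation: it uses extracted characters per page as a stand-in for text-layer density, and the thresholds are made-up illustrative values. It assumes you have already pulled each page's text out with a PDF library of your choice.

```python
import re

# Hypothetical threshold: a page of body text usually has well over 1,000
# characters, so fewer than 100 suggests there is no real text layer.
MIN_CHARS_PER_PAGE = 100
# Hypothetical threshold: share of suspicious characters that signals bad prior OCR.
MAX_SUSPICIOUS_RATIO = 0.05
# U+FFFD (replacement character) and low control characters rarely appear in clean text.
SUSPICIOUS = re.compile("[\ufffd\x00-\x08]")


def page_looks_scanned(text: str) -> bool:
    """True if a page's extracted text is nearly empty or looks garbled."""
    stripped = "".join(text.split())       # drop all whitespace
    if len(stripped) < MIN_CHARS_PER_PAGE:
        return True
    bad = len(SUSPICIOUS.findall(stripped))
    return bad / len(stripped) > MAX_SUSPICIOUS_RATIO


def pdf_looks_scanned(page_texts: list[str], majority: float = 0.5) -> bool:
    """True if more than `majority` of the pages look scanned."""
    if not page_texts:
        return True
    scanned = sum(page_looks_scanned(t) for t in page_texts)
    return scanned / len(page_texts) > majority
```

Checking page by page matters because mixed documents are common: a digital contract with a scanned signature page, for instance.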
2. Upload to the converter
Drop your PDF into the PDF to Markdown converter. No special flag needed — the converter automatically routes scanned PDFs through the OCR pipeline before applying the same Markdown layout reconstruction as for digital PDFs.
3. Wait for conversion
OCR is the slow step in the pipeline. Expect 3–10× longer processing than digital PDFs of the same length: a 50-page scan typically takes 15–60 seconds depending on quality and language. The progress indicator shows where you are.
4. Review the output
The output is regular Markdown — headings, paragraphs, lists, tables. For high-stakes documents (legal contracts, medical records), spot-check a few sections against the source for OCR errors. The most common errors: similar character pairs (l/1, O/0, rn/m), unusual proper nouns, and column-boundary confusion on complex layouts.
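A quick way to focus a spot-check is to list tokens that mix letters and digits, which is where l/1 and O/0 swaps tend to show up. This is a heuristic sketch with an illustrative function name, not a feature of the converter. It will also flag legitimate tokens such as "A4" or "3D", and it cannot catch rn/m confusion.

```python
import re

# A word that contains at least one letter AND at least one digit,
# e.g. "c1ient" or "2O24".
MIXED = re.compile(r"\b(?=\w*[A-Za-z])(?=\w*\d)\w+\b")


def flag_suspect_tokens(markdown: str) -> list[tuple[int, str]]:
    """Return (line_number, token) pairs worth checking against the source scan."""
    hits = []
    for line_no, line in enumerate(markdown.splitlines(), start=1):
        for match in MIXED.finditer(line):
            hits.append((line_no, match.group()))
    return hits
```

For example, `flag_suspect_tokens("The c1ient paid in 2O24.")` returns `[(1, "c1ient"), (1, "2O24")]`, giving you two exact places to compare against the page image.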
Tips for better OCR results
Source quality matters most
The biggest determinant of OCR accuracy is the quality of the original scan. From best to worst:
- Best: 300+ DPI, color or grayscale, typed text, unfolded pages
- Good: 200 DPI, B&W, typed text, slight skew
- OK: 150 DPI scans, faxes, phone photos in good light
- Hard: <100 DPI, photocopies of photocopies, poorly-lit photos, heavily-skewed pages
- Borderline: handwriting, decorative fonts, rotated text
If you can re-scan the source at higher quality, do that before converting. The accuracy gains from a 300 DPI rescan compound through the rest of the pipeline.
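If you're not sure what resolution a scan really has, you can work it out from the page image's pixel dimensions and the physical paper size. The sketch below assumes US Letter (8.5 × 11 in) by default and maps the result onto the DPI bands from the list above. Treating everything under 150 DPI as "hard" is a simplification of this sketch, and DPI is only one of the factors in that list.

```python
def effective_dpi(width_px: int, height_px: int,
                  width_in: float = 8.5, height_in: float = 11.0) -> float:
    """Pixels per inch of a page image, using the weaker of the two axes."""
    if min(width_px, height_px, width_in, height_in) <= 0:
        raise ValueError("dimensions must be positive")
    return min(width_px / width_in, height_px / height_in)


def dpi_tier(dpi: float) -> str:
    """Map a DPI value to the resolution bands in the list above."""
    if dpi >= 300:
        return "best"
    if dpi >= 200:
        return "good"
    if dpi >= 150:
        return "ok"
    return "hard"  # assumption: anything under 150 DPI is treated as hard
```

A 2550 × 3300 pixel image of a Letter page works out to 300 DPI, while the same page at 1275 × 1650 pixels is only 150 DPI and is worth rescanning if you still have the paper.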
Language matters
Our OCR pipeline auto-detects language and switches model accordingly. Latin-script languages (English, French, German, Spanish, Italian, Portuguese, Dutch, Polish) are well-supported. Cyrillic, CJK (Chinese, Japanese, Korean), and Arabic are supported with separate language packs and slightly lower accuracy. Mixed-language documents work but at modestly reduced accuracy on the secondary language.
Pre-process if you can
If you have control over the source PDF, simple pre-processing improves OCR results: deskew tilted scans, increase contrast on faded text, remove shadows from phone-photographed pages. Tools like the open-source ScanTailor or the Adobe Scan mobile app handle this well. Our converter does light auto-correction but won't fix severe issues.
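To show what "increase contrast" means in practice, here is a minimal sketch of linear contrast stretching on a grayscale image stored as rows of 0–255 integers. That storage format and the function name are assumptions of the example; dedicated tools do this (and much more) on real image files.

```python
def stretch_contrast(pixels: list[list[int]]) -> list[list[int]]:
    """Rescale gray values so the darkest pixel becomes 0 and the lightest 255."""
    flat = [value for row in pixels for value in row]
    lo, hi = min(flat), max(flat)
    if hi == lo:
        # A perfectly flat image has no contrast to stretch; return a copy.
        return [row[:] for row in pixels]
    span = hi - lo
    # Integer arithmetic avoids floating-point rounding surprises.
    return [[(value - lo) * 255 // span for value in row] for row in pixels]
```

A faded page whose gray values only span 100 to 200 comes out using the full 0 to 255 range, which gives the OCR engine's binarization step a cleaner separation between ink and paper.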
Limitations to expect
Tables in scans
OCR + table reconstruction is hard. Cleanly-bordered tables in 300 DPI scans usually come through as GFM tables. Borderless tables, merged cells, or tables with rotated header text are best-effort and often need manual cleanup. For table-critical documents, consider whether the source exists as a digital file somewhere — converting from digital is always cleaner.
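When a table does need manual cleanup, it is often quicker to retype the cells and regenerate the table than to repair broken pipes by hand. The helper below is a small illustrative sketch (the function name is made up): it takes rows of cell strings, treats the first row as the header, pads short rows, and escapes literal pipe characters as GFM requires.

```python
def to_gfm_table(rows: list[list[str]]) -> str:
    """Render rows of cells as a GitHub Flavored Markdown table."""
    if not rows:
        return ""
    width = max(len(row) for row in rows)

    def format_row(cells: list[str]) -> str:
        padded = list(cells) + [""] * (width - len(cells))   # fill missing cells
        escaped = [cell.strip().replace("|", "\\|") for cell in padded]
        return "| " + " | ".join(escaped) + " |"

    header, *body = rows
    lines = [format_row(header), "| " + " | ".join(["---"] * width) + " |"]
    lines.extend(format_row(row) for row in body)
    return "\n".join(lines)
```

Merged cells have no GFM equivalent, so for those you still have to decide whether to repeat the value in each row or leave the extra cells empty.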
Handwriting
Block printing in clean ink: usable, with manual review of confused characters. Cursive: variable. Doctor-style notes: don't trust the output without close review. The OCR engine flags low-confidence regions in the output for inspection.
Equations and special notation
Mathematical equations in scanned source rarely OCR cleanly — most engines treat them as garbled text. For equation-heavy scanned documents (old physics papers, engineering manuals), the output Markdown will have placeholder text where equations should be. Manual replacement with proper LaTeX is the realistic finishing step.
What to do with the output
The Markdown output behaves like any other Markdown — feed it to ChatGPT, drop it in Obsidian, push to a docs site, anything you'd do with hand-written Markdown. For AI workflows specifically, the converted Markdown carries the same token-savings advantage as digital-PDF conversions: 60–80% fewer tokens than re-uploading the scanned PDF directly. See how to reduce ChatGPT token usage for the cost math.
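As a back-of-the-envelope illustration of that 60–80% range, the sketch below turns a baseline token count into the expected range after conversion. The baseline is whatever your model reports for the raw upload; the function names are illustrative, and the percentages are simply the ones quoted above.

```python
def tokens_after_conversion(baseline_tokens: int, reduction_pct: int) -> int:
    """Tokens left after cutting `reduction_pct` percent from the baseline."""
    if not 0 <= reduction_pct <= 100:
        raise ValueError("reduction_pct must be between 0 and 100")
    return baseline_tokens * (100 - reduction_pct) // 100


def expected_token_range(baseline_tokens: int) -> tuple[int, int]:
    """(best case, worst case) using the 80% and 60% reduction figures."""
    return (tokens_after_conversion(baseline_tokens, 80),
            tokens_after_conversion(baseline_tokens, 60))
```

A document that costs 100,000 tokens as a raw upload would land between 20,000 and 40,000 tokens as Markdown under those figures.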
MDisBetter is currently web-only, so if you have many scanned documents to process you drag each PDF into the converter one at a time. For true automation across hundreds of scans, open-source tools are the realistic path: an OCR engine such as Tesseract, EasyOCR, or PaddleOCR paired with a Markdown post-processor, or an end-to-end converter like Marker, which has built-in OCR via Surya. The full breakdown is in batch convert 100+ PDFs to Markdown.
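As a starting point for the open-source route, here is a minimal sketch that rasterizes a PDF with Poppler's `pdftoppm` and runs the Tesseract command-line tool on each page image. It assumes both programs are installed and on your PATH, the function names are illustrative, and the result is plain text rather than Markdown: layout reconstruction is a separate step this sketch does not attempt.

```python
import subprocess
from pathlib import Path


def rasterize_command(pdf: Path, out_dir: Path, dpi: int = 300) -> list[str]:
    """pdftoppm writes <out_dir>/<pdf stem>-<page>.png, one PNG per page."""
    return ["pdftoppm", "-r", str(dpi), "-png", str(pdf), str(out_dir / pdf.stem)]


def ocr_command(image: Path, lang: str = "eng") -> list[str]:
    """tesseract writes <output base>.txt next to the image."""
    return ["tesseract", str(image), str(image.with_suffix("")), "-l", lang]


def ocr_pdf(pdf: Path, out_dir: Path, lang: str = "eng") -> str:
    """Rasterize one PDF, OCR every page, and return the text in page order."""
    out_dir.mkdir(parents=True, exist_ok=True)
    subprocess.run(rasterize_command(pdf, out_dir), check=True)
    # pdftoppm zero-pads page numbers, so a plain sort keeps page order.
    pages = sorted(out_dir.glob(f"{pdf.stem}-*.png"))
    for image in pages:
        subprocess.run(ocr_command(image, lang), check=True)
    return "\n\n".join(
        page.with_suffix(".txt").read_text(encoding="utf-8") for page in pages
    )
```

Looping `ocr_pdf` over `Path("scans").glob("*.pdf")` handles a whole folder. With `check=True`, a failure on any file raises an exception, so wrap the call in a `try` block if you would rather log the error and continue with the next PDF.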