How We Built MDisBetter's PDF Converter: Lessons Learned
Building a PDF-to-Markdown converter sounds straightforward — and is, for the simple cases. The interesting engineering is in the long tail: the 5% of PDFs where naive approaches fall apart. After two years of iterating on production traffic, here's what we learned about which problems matter, which approaches generalize, and which optimizations were dead ends.
The architecture that emerged
Our v1 was a single Python service that called PyMuPDF for text extraction and applied regex-based heuristics for structure recovery. It worked on 70% of inputs and failed badly on the rest. The current architecture is a four-stage pipeline:
- Page analysis: classify each page (digital vs scanned, simple vs multi-column, with vs without tables)
- Layout extraction: spatial reading-order analysis, column detection, table-region detection
- Content extraction: text extraction with proper Unicode mapping, OCR fallback for scanned regions, equation recognition for math regions
- Markdown emission: structure recovery (headings, lists, tables), furniture stripping, output formatting
Each stage is replaceable; failures are localized and debuggable. The architectural cost (more moving parts) is paid back in maintainability — when a customer reports an issue, we can usually trace it to one stage and fix it without affecting the others.
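The stage-isolation idea can be sketched in a few lines of Python. Everything here is illustrative, not our actual services: the point is that any exception is attributed to a named stage, which is what makes failures localized and debuggable.

```python
from typing import Any, Callable

class StageError(Exception):
    """Wraps a failure with the name of the pipeline stage that raised it."""
    def __init__(self, stage: str, cause: Exception):
        super().__init__(f"{stage}: {cause}")
        self.stage = stage

class Pipeline:
    """Runs stages in order; any exception is attributed to its stage."""
    def __init__(self, stages: list[tuple[str, Callable[[Any], Any]]]):
        self.stages = stages

    def run(self, doc: Any) -> Any:
        for name, fn in self.stages:
            try:
                doc = fn(doc)
            except Exception as exc:
                raise StageError(name, exc) from exc
        return doc

# Toy stages standing in for the real services.
pipeline = Pipeline([
    ("page_analysis", lambda d: {**d, "kind": "digital"}),
    ("layout_extraction", lambda d: {**d, "columns": 1}),
    ("content_extraction", lambda d: {**d, "text": d["raw"].strip()}),
    ("markdown_emission", lambda d: d["text"]),
])

print(pipeline.run({"raw": "  Hello  "}))  # → Hello
```

When a conversion fails in production, the `StageError.stage` field is what lets us route the report to the right component without touching the others.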
Lesson 1: Heuristics scale further than you think, then they don't
For the first year, almost everything was rules-based: regex patterns for heading detection, geometric thresholds for column boundaries, hand-tuned weights for furniture stripping. This got us to ~85% accuracy on our internal benchmark, and we shipped it.
Around 90% accuracy, the heuristic approach started fighting itself. Every fix for one document type broke something else. The font-size threshold for "this is a heading" varied by publisher; the column-detection algorithm worked on academic two-column layouts but broke on magazine three-column spreads.
The breakthrough was replacing the structure-recovery layer with a small layout-aware model — not a giant LLM, just a focused classifier trained on thousands of labeled documents. Rules-based heuristics still handle the simple cases (faster); the model handles the long tail. The hybrid approach beats either pure one.
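The routing between rules and model can be sketched as follows. The function names, confidence scheme, and font-size rule are hypothetical stand-ins; the real classifier looks at many layout features, not a size ratio alone.

```python
def heuristic_heading(line: str, font_size: float, body_size: float) -> tuple[bool, float]:
    """Cheap rule: text much larger than the body font is probably a heading.
    Returns (is_heading, confidence)."""
    ratio = font_size / body_size
    if ratio >= 1.5:
        return True, 0.95
    if ratio <= 1.05:
        return False, 0.90
    return False, 0.40  # the ambiguous zone where pure rules start to fight

def classify_heading(line, font_size, body_size, model_predict, threshold=0.8):
    """Trust the rule when it is confident; otherwise pay for the model."""
    label, conf = heuristic_heading(line, font_size, body_size)
    if conf >= threshold:
        return label
    return model_predict(line, font_size, body_size)
```

The design choice is the confidence gate: the cheap path answers most queries, and the model's cost is only incurred in the zone where rules were breaking each other.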
Lesson 2: OCR quality matters more than OCR engine choice
We initially obsessed over which OCR engine to use — Tesseract vs PaddleOCR vs Google Document AI vs Surya. The accuracy differences between them on clean inputs are real but small (one to two points of character error rate).
What actually moves the needle is image preprocessing: deskewing, contrast normalization, noise reduction, resolution upscaling. A document that scores 91% on Tesseract with default preprocessing scores 97% on Tesseract with proper preprocessing. The gap dwarfs the inter-engine differences.
Our current pipeline routes per language (different engines have different strengths) but spends most of its compute budget on preprocessing rather than recognition.
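Two of the preprocessing steps can be illustrated on a toy grayscale image. Real pipelines use OpenCV or PIL for this; the pure-Python version below just shows the shape of the transforms, and the threshold value is illustrative.

```python
def normalize_contrast(img: list[list[int]]) -> list[list[int]]:
    """Min-max contrast stretch for an 8-bit grayscale image.
    A washed-out scan often occupies a narrow band (say 90..160);
    stretching it to the full 0..255 range gives the recognizer
    clean, high-contrast strokes."""
    lo = min(min(row) for row in img)
    hi = max(max(row) for row in img)
    if hi == lo:
        return [[0 for _ in row] for row in img]
    return [[round((p - lo) * 255 / (hi - lo)) for p in row] for row in img]

def binarize(img: list[list[int]], threshold: int = 128) -> list[list[int]]:
    """Global threshold after the stretch. Otsu or adaptive thresholding
    handles uneven lighting better; this only shows the step's shape."""
    return [[255 if p >= threshold else 0 for p in row] for row in img]
```

Deskewing and noise reduction follow the same pattern: cheap image-level transforms applied before recognition, each one lifting the score of whatever engine runs afterwards.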
Lesson 3: Table extraction is genuinely hard
Tables are the failure mode customers notice first and complain about loudest. We've shipped four different table extraction approaches:
- Whitespace-only: detect tables by looking for grids of whitespace. Worked on 60% of tables, broke on borderless designs.
- Line-detection: find horizontal/vertical lines, infer cell structure. Worked on 80% of tables, broke on borderless tables and on PDFs where "lines" were drawn as filled rectangles.
- Hybrid (lines + whitespace + clustering): combine all signals. 90% accuracy.
- Layout-aware model: small CNN trained on labeled table regions. 94% accuracy.
The model approach generalized best — it doesn't care whether your table has lines, color backgrounds, indentation, or just whitespace. But the model still gets confused on tables with merged cells across multiple header rows. Table extraction remains an active area; we expect another step-change in 2027.
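The first, whitespace-only approach is simple enough to sketch. It assumes text lines already aligned to a character grid, which real PDFs rarely guarantee; that assumption is one reason it topped out around 60%.

```python
def whitespace_columns(lines: list[str]) -> list[int]:
    """Character positions that are blank in every line: the
    'grid of whitespace' signal for column separators."""
    width = max(len(line) for line in lines)
    padded = [line.ljust(width) for line in lines]
    return [x for x in range(width) if all(line[x] == " " for line in padded)]

def split_row(line: str, seps: list[int]) -> list[str]:
    """Split one row at the separator columns found above."""
    cells, cur = [], []
    for x, ch in enumerate(line.ljust(max(seps, default=0) + 1)):
        if x in seps:
            if "".join(cur).strip():
                cells.append("".join(cur).strip())
            cur = []
        else:
            cur.append(ch)
    if "".join(cur).strip():
        cells.append("".join(cur).strip())
    return cells

rows = ["Name  Age", "Ann   34", "Bob   51"]
seps = whitespace_columns(rows)
table = [split_row(r, seps) for r in rows]
# table == [["Name", "Age"], ["Ann", "34"], ["Bob", "51"]]
```

A single ragged row destroys the all-lines-blank signal, which is exactly the borderless-and-misaligned failure mode that pushed us toward combining signals and, eventually, the model.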
Lesson 4: Customer-reported edge cases are gold
Most accuracy improvements came from inspecting failed conversions reported by users. Categories of failure that we fixed by adding to the test corpus:
- PDFs from old versions of LaTeX that emit pre-Unicode font encoding
- PDFs generated by 3D CAD software with non-standard text rendering
- PDFs with rotated or vertical text in headers
- Multi-script PDFs (English text with Arabic citations)
- Forms with overlapping fields
- Heavily-stamped or watermarked documents
None of these were anticipated; all of them mattered to specific customers. The lesson: build observability into your tool from day one, capture (with consent) failed conversions, work through them weekly. Synthetic benchmarks don't substitute for production failures.
Lesson 5: Speed vs quality is a real tradeoff
Customers want both. We can't deliver both for the hardest cases — running our most accurate pipeline takes 30+ seconds per page, which would be unacceptable on the typical 30-page document.
Our solution: tiered processing. Quick analysis on every page; route to the heavy pipeline only for pages that need it (scanned, dense layout, equation-heavy). Most pages get fast processing; the slow path only runs where it adds value. Average per-page time is 0.3s; worst-case pages can take several seconds.
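The tiered routing amounts to a small decision function over the quick page-analysis results. Field names and thresholds here are illustrative, not our production values.

```python
def route(page: dict) -> str:
    """Send a page down the fast or heavy path based on the quick
    page-analysis pass."""
    if page["is_scanned"]:
        return "heavy"  # needs OCR
    if page["equation_regions"] > 0:
        return "heavy"  # needs equation recognition
    if page["columns"] > 2 or page["text_density"] > 0.85:
        return "heavy"  # dense or unusual layout
    return "fast"
```

Because the analysis pass is cheap relative to either path, misrouting a page toward "heavy" costs only latency, while misrouting toward "fast" costs accuracy; the thresholds are tuned to err on the slow side.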
Customers care about end-to-end latency, not per-page latency. As long as the overall experience feels fast (sub-5s for most documents, sub-30s for the hardest), the tradeoff is acceptable.
Lesson 6: Markdown is the right output format
An early debate: should we output Markdown, HTML, JSON, or some custom semantic format? Each had advocates.
Markdown won because:
- Every downstream consumer (LLMs, docs sites, knowledge bases) accepts it
- It's human-readable for spot-checking
- It diffs cleanly in Git
- The format itself is small (no schema overhead)
- Conversion to HTML/JSON downstream is one CLI command
JSON is appropriate for some users (structured data pipelines) and we expose it as a query-string option, but Markdown remains the primary output. The clearest signal: when we ask customers what format they want, the answer is overwhelmingly Markdown.
Lesson 7: The boring infrastructure matters most
The most-impactful improvements over the past year weren't in the conversion pipeline — they were in the infrastructure around it:
- Better error reporting (customers know exactly what failed and how to retry)
- Per-tier rate limiting (Pro vs Team vs Enterprise gets predictable throughput)
- Zero-retention guarantees with audit logging (unblocked legal and healthcare customers)
- Streaming for long conversions (no more 30s timeouts on big documents)
- Webhook callbacks (programmatic users no longer poll)
None of this is glamorous. All of it gates whether customers can actually deploy the tool in production. Ship the boring infrastructure; the conversion accuracy gets noticed only when it's wrong, but the infrastructure gets noticed every time it works.
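As one concrete example of the boring-but-gating work: webhook callbacks are worth signing so receivers can authenticate them. A sketch using Python's stdlib `hmac` module (the payload fields are illustrative, not our actual event schema):

```python
import hashlib
import hmac
import json

def sign_webhook(payload: dict, secret: bytes) -> tuple[bytes, str]:
    """Serialize a completion event and sign it so receivers can verify
    the callback really came from the converter."""
    body = json.dumps(payload, sort_keys=True, separators=(",", ":")).encode()
    signature = hmac.new(secret, body, hashlib.sha256).hexdigest()
    return body, signature

def verify_webhook(body: bytes, signature: str, secret: bytes) -> bool:
    """Receiver side: recompute the signature and compare in constant time."""
    expected = hmac.new(secret, body, hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, signature)
```

`compare_digest` rather than `==` avoids leaking the signature through timing, the kind of detail that security reviews at enterprise customers check for.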
What we'd do differently
Three things, in retrospect:
- Start with observability. We added it in year two. It should have been there from day one.
- Layered approach earlier. The pure-heuristics era worked, but it cost six months we could have spent building the model-based layer.
- OCR preprocessing earlier. We tried four OCR engines before realizing the engine choice was less important than what we fed them.
If you're building a similar tool: invest in evaluation infrastructure first, build a hybrid (rules + model) architecture from the start, and treat preprocessing as a first-class concern.
For users curious about how the converter compares to alternatives, see our 10-tool benchmark. For our take on the underlying format problem, see how PDF works internally.