
How We Built MDisBetter's PDF Converter: Lessons Learned

Building a PDF-to-Markdown converter sounds straightforward — and is, for the simple cases. The interesting engineering is in the long tail: the 5% of PDFs where naive approaches fall apart. After two years of iterating on production traffic, here's what we learned about which problems matter, which approaches generalize, and which optimizations were dead ends.

The architecture that emerged

Our v1 was a single Python service that called PyMuPDF for text extraction and applied regex-based heuristics for structure recovery. It worked on 70% of inputs and failed badly on the rest. The current architecture is a four-stage pipeline:
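
For context, here is a rough sketch of what that v1 looked like, assuming PyMuPDF (imported as fitz) and a couple of illustrative regex rules; the real service had many more heuristics than this.

```python
import re
import fitz  # PyMuPDF

def convert_v1(pdf_path: str) -> str:
    """Roughly the v1 approach: extract text per page, then guess structure with regex."""
    doc = fitz.open(pdf_path)
    lines = []
    for page in doc:
        lines.extend(page.get_text("text").splitlines())

    out = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        # Rule: short numbered lines like "3.2 Results" are probably headings.
        if re.match(r"^\d+(\.\d+)*\s+\S", line) and len(line) < 80:
            out.append(f"## {line}")
        # Rule: lines starting with a bullet glyph become list items.
        elif line.startswith(("•", "-", "*")):
            out.append(f"- {line.lstrip('•-* ')}")
        else:
            out.append(line)
    return "\n".join(out)
```

It works surprisingly often, which is exactly the trap: the failures only show up on layouts the rules never imagined.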

  1. Page analysis: classify each page (digital vs scanned, simple vs multi-column, with vs without tables)
  2. Layout extraction: spatial reading-order analysis, column detection, table-region detection
  3. Content extraction: text extraction with proper Unicode mapping, OCR fallback for scanned regions, equation recognition for math regions
  4. Markdown emission: structure recovery (headings, lists, tables), furniture stripping, output formatting

Each stage is replaceable; failures are localized and debuggable. The architectural cost (more moving parts) is paid back in maintainability — when a customer reports an issue, we can usually trace it to one stage and fix it without affecting the others.
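
As a rough sketch of how the stages compose (names and data shapes are illustrative, not our production API), each stage is an injected callable and any failure stays attached to the page that raised it:

```python
from dataclasses import dataclass, field

# Illustrative stage interfaces; not the production API.

@dataclass
class PageInfo:
    number: int
    is_scanned: bool = False
    is_multicolumn: bool = False
    has_tables: bool = False
    has_equations: bool = False

@dataclass
class PageResult:
    info: PageInfo
    markdown: str = ""
    errors: list = field(default_factory=list)

def convert_page(page, analyze, extract_layout, extract_content, emit_markdown):
    """Run one page through the four stages; each stage is an injected callable."""
    info = analyze(page)                                # 1. page analysis
    result = PageResult(info=info)
    try:
        layout = extract_layout(page, info)             # 2. layout extraction
        content = extract_content(page, layout, info)   # 3. content extraction (OCR fallback inside)
        result.markdown = emit_markdown(content)        # 4. markdown emission
    except Exception as exc:
        # A failure stays localized to the page and stage that raised it.
        result.errors.append(repr(exc))
    return result
```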

Lesson 1: Heuristics scale further than you think, then they don't

For the first year, almost everything was rules-based: regex patterns for heading detection, geometric thresholds for column boundaries, hand-tuned weights for furniture stripping. This got us to ~85% accuracy on our internal benchmark, and we shipped it.

Around 90% accuracy, the heuristic approach started fighting itself. Every fix for one document type broke something else. The font-size threshold for "this is a heading" varied by publisher; the column-detection algorithm worked on academic two-column layouts but broke on magazine three-column spreads.

The breakthrough was replacing the structure-recovery layer with a small layout-aware model — not a giant LLM, just a focused classifier trained on thousands of labeled documents. Rules-based heuristics still handle the simple cases (faster); the model handles the long tail. The hybrid approach beats either pure approach.
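
A minimal sketch of that routing, with illustrative attribute names and thresholds; the real rules and the real classifier are more involved:

```python
def classify_block(block, layout_model=None):
    """Decide whether a text block is a heading, trying cheap rules first.

    `block` is assumed to carry its font size, the page's median font size,
    and its text; `layout_model` stands in for the small layout-aware classifier.
    """
    # Rule 1: much larger than the page's body text -> almost certainly a heading.
    if block.font_size >= 1.5 * block.page_median_font_size and len(block.text) < 120:
        return "heading"
    # Rule 2: body-sized and long -> almost certainly a paragraph.
    if block.font_size <= 1.05 * block.page_median_font_size and len(block.text) > 200:
        return "paragraph"
    # Ambiguous zone: defer to the learned classifier when one is available.
    if layout_model is not None:
        return layout_model.predict(block)
    return "paragraph"
```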

Lesson 2: OCR quality matters more than OCR engine choice

We initially obsessed over which OCR engine to use — Tesseract vs PaddleOCR vs Google Document AI vs Surya. On clean inputs the accuracy differences between them are real but small, on the order of one or two percentage points of character error rate.

What actually moves the needle is image preprocessing: deskewing, contrast normalization, noise reduction, resolution upscaling. A document that scores 91% on Tesseract with default preprocessing scores 97% on Tesseract with proper preprocessing. The gap dwarfs the inter-engine differences.
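
To illustrate the kind of preprocessing that moves the needle, here is a minimal pass using OpenCV; the thresholds, kernel sizes, and angle handling are illustrative, not what we run in production:

```python
import cv2
import numpy as np

def preprocess_for_ocr(image: np.ndarray) -> np.ndarray:
    """Deskew, normalize contrast, denoise, and upscale a page image before OCR."""
    gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY) if image.ndim == 3 else image

    # Deskew: estimate the dominant skew angle from the thresholded foreground.
    thresh = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)[1]
    coords = np.column_stack(np.where(thresh > 0)).astype(np.float32)
    angle = cv2.minAreaRect(coords)[-1]
    if angle > 45:          # fold OpenCV's angle convention into [-45, 45]
        angle -= 90
    h, w = gray.shape
    rot = cv2.getRotationMatrix2D((w / 2, h / 2), angle, 1.0)
    gray = cv2.warpAffine(gray, rot, (w, h),
                          flags=cv2.INTER_CUBIC, borderMode=cv2.BORDER_REPLICATE)

    # Contrast normalization (CLAHE), then light denoising.
    gray = cv2.createCLAHE(clipLimit=2.0, tileGridSize=(8, 8)).apply(gray)
    gray = cv2.fastNlMeansDenoising(gray, None, 10)

    # Upscale low-resolution scans so the OCR engine sees enough pixels per glyph.
    if max(h, w) < 2000:
        gray = cv2.resize(gray, None, fx=2, fy=2, interpolation=cv2.INTER_CUBIC)
    return gray
```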

Our current pipeline routes per language (different engines have different strengths) but spends most of its compute budget on preprocessing rather than recognition.

Lesson 3: Table extraction is genuinely hard

Tables are the failure mode customers notice first and complain about loudest. We've shipped four different table extraction approaches:

  1. Whitespace-only: detect tables by looking for grids of whitespace. Worked on 60% of tables, broke on borderless designs.
  2. Line-detection: find horizontal/vertical lines, infer cell structure. Worked on 80% of tables, broke on borderless and on PDFs where "lines" were drawn as filled rectangles.
  3. Hybrid (lines + whitespace + clustering): combine all signals. 90% accuracy.
  4. Layout-aware model: small CNN trained on labeled table regions. 94% accuracy.

The model approach generalized best — it doesn't care whether your table has lines, color backgrounds, indentation, or just whitespace. But the model still gets confused on tables with merged cells across multiple header rows. Table extraction remains an active area; we expect another step-change in 2027.
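
For readers curious what the line-detection approach (number 2 in the list above) looks like in practice, here is a minimal morphology-based sketch using OpenCV; it only finds tables that are actually drawn with rules, which is exactly the limitation that pushed us toward the later approaches:

```python
import cv2
import numpy as np

def find_ruled_table_mask(page_image: np.ndarray) -> np.ndarray:
    """Recover horizontal/vertical rules with morphological opening.

    Only works when the table is drawn with lines; borderless tables (and
    "lines" drawn as filled rectangles) need the hybrid or model approaches.
    """
    gray = cv2.cvtColor(page_image, cv2.COLOR_BGR2GRAY)
    binary = cv2.adaptiveThreshold(~gray, 255, cv2.ADAPTIVE_THRESH_MEAN_C,
                                   cv2.THRESH_BINARY, 15, -2)

    # Long thin kernels keep only strokes spanning a large fraction of the page.
    h_kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (max(gray.shape[1] // 30, 10), 1))
    v_kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (1, max(gray.shape[0] // 30, 10)))
    horizontal = cv2.morphologyEx(binary, cv2.MORPH_OPEN, h_kernel)
    vertical = cv2.morphologyEx(binary, cv2.MORPH_OPEN, v_kernel)

    # Where both directions intersect, we likely have a ruled table grid.
    return cv2.bitwise_and(horizontal, vertical)
```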

Lesson 4: Customer-reported edge cases are gold

Most accuracy improvements came from inspecting failed conversions reported by users; each new category of failure went straight into the test corpus as a regression case.

None of these were anticipated; all of them mattered to specific customers. The lesson: build observability into your tool from day one, capture (with consent) failed conversions, work through them weekly. Synthetic benchmarks don't substitute for production failures.
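
As a hypothetical sketch of what consent-gated failure capture can look like (the paths and field names here are illustrative, not our production schema):

```python
import hashlib
import json
import time
from pathlib import Path

FAILURE_DIR = Path("/var/lib/converter/failures")  # illustrative location

def record_failed_conversion(pdf_bytes: bytes, error: str, user_opted_in: bool) -> None:
    """Queue a failed conversion for weekly triage.

    The document itself is stored only if the customer opted in; otherwise
    we keep anonymous metadata so failure rates stay observable.
    """
    digest = hashlib.sha256(pdf_bytes).hexdigest()[:16]
    record = {
        "id": digest,
        "timestamp": time.time(),
        "error": error,
        "size_bytes": len(pdf_bytes),
        "document_stored": user_opted_in,
    }
    FAILURE_DIR.mkdir(parents=True, exist_ok=True)
    (FAILURE_DIR / f"{digest}.json").write_text(json.dumps(record))
    if user_opted_in:
        (FAILURE_DIR / f"{digest}.pdf").write_bytes(pdf_bytes)
```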

Lesson 5: Speed vs quality is a real tradeoff

Customers want both. We can't deliver both for the hardest cases — running our most accurate pipeline takes 30+ seconds per page, which would be unacceptable on the typical 30-page document.

Our solution: tiered processing. Quick analysis on every page; route to the heavy pipeline only for pages that need it (scanned, dense layout, equation-heavy). Most pages get fast processing; the slow path only runs where it adds value. Average per-page time is 0.3s; worst-case pages can take several seconds.
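
A minimal sketch of the routing decision, with illustrative flags (matching the PageInfo sketch earlier); the real analysis stage produces a much richer page profile:

```python
def route_page(info) -> str:
    """Pick a processing tier from the quick per-page analysis."""
    needs_heavy = (
        info.is_scanned          # needs preprocessing + OCR
        or info.is_multicolumn   # needs full reading-order analysis
        or info.has_equations    # needs equation recognition
    )
    return "heavy" if needs_heavy else "fast"

def convert_document(pages, analyze, fast_pipeline, heavy_pipeline):
    """Quick analysis on every page; the slow path runs only where it pays off."""
    results = []
    for page in pages:
        info = analyze(page)
        pipeline = heavy_pipeline if route_page(info) == "heavy" else fast_pipeline
        results.append(pipeline(page, info))
    return results
```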

Customers care about end-to-end latency, not per-page latency. As long as the overall experience feels fast (sub-5s for most documents, sub-30s for the hardest), the tradeoff is acceptable.

Lesson 6: Markdown is the right output format

An early debate: should we output Markdown, HTML, JSON, or some custom semantic format? Each had advocates.

Markdown won the debate.

JSON is appropriate for some users (structured data pipelines) and we expose it as a query-string option, but Markdown remains the primary output. The clearest signal: when we ask customers what format they want, the answer is overwhelmingly Markdown.

Lesson 7: The boring infrastructure matters most

The most impactful improvements over the past year weren't in the conversion pipeline at all; they were in the infrastructure around it.

None of this is glamorous. All of it gates whether customers can actually deploy the tool in production. Ship the boring infrastructure; the conversion accuracy gets noticed only when it's wrong, but the infrastructure gets noticed every time it works.

What we'd do differently

Three things, in retrospect:

  1. Start with observability. We added it in year two. It should have been there from day one.
  2. Layered approach earlier. The pure-heuristics era worked, but we spent six months on it that we could have used to build the model-based layer.
  3. OCR preprocessing earlier. We tried four OCR engines before realizing the engine choice was less important than what we fed them.

If you're building a similar tool: invest in evaluation infrastructure first, build a hybrid (rules + model) architecture from the start, and treat preprocessing as a first-class concern.

For users curious how the converter compares to alternatives, see our 10-tool benchmark. For our take on the underlying format problem, see our post on how PDF works internally.

Frequently asked questions

Will you open source the conversion engine?
Not currently — the model weights and the labeled training data represent significant investment. We participate in the OSS ecosystem by sponsoring and contributing to adjacent projects (markdown linters, document-AI evaluation harnesses).
What's the next big improvement coming?
Better handling of complex tables (merged cells, multi-row headers) and improved equation OCR for scanned mathematical content. Both are active development areas; expect noticeable improvements in 2027 releases.
How much of this stack runs on GPUs vs CPUs?
Most of the pipeline is CPU-bound (PDF parsing, layout analysis, Markdown emission). The OCR step uses GPUs when available for speed, but our quality wouldn't change running purely on CPU. Cost vs latency tradeoff.