· 7 min read · MDisBetter

Batch Convert 100+ PDFs to Markdown

Converting one PDF at a time is fine for ad-hoc work. Migrating a research library of 500 papers, an enterprise spec archive, or a year of board decks needs batch. We'll be honest up front: MDisBetter today is a web tool, not a programmatic API. For true batch automation you'll want to combine our web UI for spot conversions with open-source libraries (marker, PyMuPDF, pdftotext) for the bulk pipeline. Here's how to pick.

When you actually need batch

Common scenarios:

  - Under ~10 PDFs: batch isn't worth the setup. Convert them one at a time in the web UI.
  - 10+ PDFs as a one-off: the web tool is still the easiest path.
  - 100+ PDFs on a recurring basis: you need OSS automation.

Option 1 — Web UI for one-off batches

The simplest path. Open the PDF to Markdown converter, drag a PDF onto the dropzone, click Convert, download the resulting .md file. Repeat for each file. Tedious for 100+ files, but if it's a one-time migration and you can spare an afternoon, it requires zero setup and gives you our cleanest output per document.

For 10–30 files, this is genuinely the right answer — total time end-to-end is under an hour, no Python environment to set up, no model weights to download.

Best for: one-off batches of a few dozen PDFs, no recurring automation needs, no engineering capacity to deploy OSS.

Honest limitation: we don't currently offer a public API, CLI, or watch-folder integration. If you need true unattended automation, the OSS routes below are the way.

Option 2 — Marker for high-quality batch (Python, OSS)

Marker (Apache 2.0, github.com/VikParuchuri/marker) is the open-source library most comparable to our web tool's quality on PDF-to-Markdown. It runs locally, requires a GPU for reasonable speed, and handles hundreds of PDFs concurrently without per-page fees.

pip install marker-pdf

# Convert a whole folder of PDFs in parallel
marker ./pdfs ./markdown_out --workers 4

For programmatic use inside your own pipeline:

from marker.converters.pdf import PdfConverter
from marker.models import create_model_dict
from marker.output import text_from_rendered
from pathlib import Path

# Load the model weights once, then reuse the converter for every file
converter = PdfConverter(artifact_dict=create_model_dict())

for pdf in Path('./pdfs').glob('*.pdf'):
    rendered = converter(str(pdf))
    md, _, _ = text_from_rendered(rendered)
    pdf.with_suffix('.md').write_text(md, encoding='utf-8')
    print(f'Done: {pdf.name}')

Throughput on a single consumer GPU (RTX 4090): roughly 1–3 PDFs per second on average academic-paper-sized documents. Scales horizontally — run multiple workers across multiple machines for thousands of PDFs.
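One way to sketch that horizontal split: round-robin the file list into one shard per machine, then point a Marker worker at each shard. The four-machine count and shard filenames are illustrative, not a prescribed layout.

```python
from pathlib import Path

def shard(paths, n_machines):
    """Round-robin a sorted file list into one shard per machine."""
    shards = [[] for _ in range(n_machines)]
    for i, p in enumerate(sorted(paths)):
        shards[i % n_machines].append(p)
    return shards

# Write shard_0.txt .. shard_3.txt; ship each list to its own GPU machine
pdfs = [str(p) for p in Path('./pdfs').glob('*.pdf')]
for i, s in enumerate(shard(pdfs, 4)):
    Path(f'shard_{i}.txt').write_text('\n'.join(s))
```

Round-robin keeps shard sizes within one file of each other, which matters when every machine should finish its overnight run at roughly the same time.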

Best for: teams with engineering capacity and a GPU, recurring large-volume conversion, strict data-residency needs (everything stays local).

Option 3 — pdftotext + a Markdown step (lightweight, no GPU)

If your PDFs are simple (mostly digital, single-column, no critical tables) and you don't need GPU-grade structure recovery, the classic UNIX pipeline is hard to beat:

apt install poppler-utils  # or brew install poppler

# Convert every PDF in a folder, in parallel, no model downloads
find ./pdfs -name '*.pdf' | parallel -j 8 \
  'pdftotext -layout {} {.}.txt'

To get something closer to Markdown (headings inferred from font size, basic structure), skip the pdftotext step and use PyMuPDF's Markdown helper, pymupdf4llm (pip install pymupdf4llm):

import pymupdf4llm
from pathlib import Path

for pdf_path in Path('./pdfs').glob('*.pdf'):
    # pymupdf4llm emits Markdown-flavored text: headings inferred
    # from font size, plus lists and simple tables
    md = pymupdf4llm.to_markdown(str(pdf_path))
    pdf_path.with_suffix('.md').write_text(md, encoding='utf-8')

Throughput: hundreds of PDFs per minute on a laptop, no GPU required. Quality: lower than Marker on complex layouts, but acceptable for clean digital PDFs.

Best for: clean digital PDFs, no GPU available, willingness to accept lower table fidelity.

Option 4 — Docling for structure-heavy batch (OSS)

IBM's Docling (github.com/DS4SD/docling, MIT) is another strong OSS choice, with particularly good handling of complex layouts and tables. CLI usage:

pip install docling

docling ./pdfs --to md --output ./markdown_out

Heavier to set up than Marker, but worth it for documents with intricate tables or multi-column layouts where every cell matters.

Performance tips for OSS batch

Match workers to the bottleneck

For CPU-bound libraries (pdftotext, PyMuPDF), use GNU parallel with -j set to your core count. For GPU-bound libraries (Marker, Docling), GPU memory is the bottleneck — start with 2–4 workers per GPU and tune up.

Skip already-converted

For incremental batches (re-running the script as new PDFs arrive), check whether the .md already exists before converting. Five lines of Python; saves time on re-runs.
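A minimal version of that check, reusing the .md-next-to-the-PDF naming from the earlier snippets (also skips files whose output is older than a re-uploaded source):

```python
from pathlib import Path

def needs_conversion(pdf_path):
    """True if the .md output is missing or older than the source PDF."""
    md = pdf_path.with_suffix('.md')
    return not md.exists() or md.stat().st_mtime < pdf_path.stat().st_mtime

todo = [p for p in Path('./pdfs').glob('*.pdf') if needs_conversion(p)]
print(f'{len(todo)} PDFs left to convert')
```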

Batch errors gracefully

One PDF in a hundred will fail (corrupted file, password-protected, exotic format). Wrap each conversion in try/except, log failures, continue. Don't abort a 1000-file batch because one document was bad.
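A minimal sketch of that pattern; convert here stands in for whichever converter you chose above (Marker, Docling, pymupdf4llm):

```python
def convert_all(paths, convert):
    """Run convert() on every path; collect failures instead of aborting."""
    failures = []
    for p in paths:
        try:
            convert(p)
        except Exception as exc:  # corrupted, password-protected, exotic format
            failures.append((p, str(exc)))
            print(f'FAILED {p}: {exc}')
    return failures
```

Review the returned failure list at the end of the run and re-queue anything fixable (e.g. decrypt password-protected files and retry).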

Watch for OCR slowness

Scanned PDFs require OCR (Marker has it built in via Surya; pdftotext does not — pair with Tesseract). Expect 3–10× longer processing than digital. For mixed batches, sort by digital-vs-scanned and route accordingly.

Cost math at scale

OSS routes have no per-page cost — you pay only for compute. A consumer GPU at home converts 100k pages per month for the cost of electricity (~$15/month at typical rates). A cloud GPU instance (RTX A4000, ~$0.50/hour) handles the same workload for under $50/month if you only run during conversion windows.
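The cloud-side arithmetic, with an illustrative sustained throughput; pages_per_second is an assumption, so substitute your measured rate:

```python
pages_per_month = 100_000
pages_per_second = 10       # illustrative; measure your own pipeline
hourly_rate = 0.50          # USD/hour for a rented GPU instance

hours = pages_per_month / (pages_per_second * 3600)
cost = hours * hourly_rate
print(f'{hours:.1f} GPU-hours/month -> ${cost:.2f}')
```

Even at a tenth of that throughput, the monthly bill stays well under the $50 ceiling quoted above.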

The trade-off is engineering and ops time. If you have it, OSS is dramatically cheaper at scale. If you don't, the web tool covers small batches without setup overhead.

The hybrid pattern

The realistic production setup most teams land on:

  1. Bulk pipeline: Marker (or Docling, or PyMuPDF) running locally on a GPU machine, converting everything in your archive overnight
  2. Spot conversions: web tool at /convert/pdf-to-markdown for the one-off PDF you receive in email and need converted in 30 seconds
  3. Quality check: spot-check 5–10 randomly-sampled OSS conversions against the web tool output to confirm the OSS settings produce comparable quality on your document type

This gives you scalable automation for the bulk of work plus a no-setup option for the edge cases.
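The quality-check step is easy to script; a small helper to draw the random sample, with the output directory name carried over from the earlier examples:

```python
import random
from pathlib import Path

def pick_sample(out_dir, k=10, seed=None):
    """Randomly pick k converted files for a manual quality check."""
    files = sorted(Path(out_dir).glob('*.md'))
    return random.Random(seed).sample(files, k=min(k, len(files)))

for md in pick_sample('./markdown_out', k=10):
    print(md.name)  # convert the matching PDF in the web UI and compare
```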

Frequently asked questions

Does MDisBetter offer a batch API or CLI?
Not today — we're a web tool. For programmatic batch processing, we recommend OSS libraries like Marker, Docling, or PyMuPDF (linked in this article). They run locally and have no per-page cost.
Can I batch-convert password-protected PDFs?
Most OSS libraries (PyMuPDF, pdftotext) accept a password parameter. Remove the password upstream with qpdf if you'd rather decouple that step. The web tool does not currently accept password-protected PDFs.
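For the qpdf route, a sketch with hypothetical filenames and password:

```shell
# Decrypt one file (qpdf rewrites it without the password)
qpdf --password=SECRET --decrypt locked.pdf unlocked.pdf

# Decrypt a whole folder before feeding it to the batch pipeline
mkdir -p ./decrypted
for f in ./pdfs/*.pdf; do
  qpdf --password=SECRET --decrypt "$f" "./decrypted/$(basename "$f")"
done
```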
What's the realistic throughput for OSS conversion?
Marker on a consumer GPU: 1–3 PDFs/second on academic-paper-sized documents. PyMuPDF on a laptop CPU: 20–50 PDFs/second on simple digital PDFs. Scanned PDFs (any tool, requires OCR) are 3–10× slower.