Batch Convert 100+ PDFs to Markdown
Converting one PDF at a time is fine for ad-hoc work. Migrating a research library of 500 papers, an enterprise spec archive, or a year of board decks needs a batch pipeline. We'll be honest up front: MDisBetter today is a web tool, not a programmatic API. For true batch automation you'll want to combine our web UI for spot conversions with open-source libraries (marker, PyMuPDF, pdftotext) for the bulk pipeline. Here's how to pick.
When you actually need batch
Common scenarios:
- Migrating a folder of legacy PDF documentation to a docs-as-code workflow
- Building a knowledge base from hundreds of past PDFs at once
- Preparing a corpus for RAG ingestion (academic papers, contracts, reports)
- Bulk-converting team archives during a tooling change
- Continuous ingestion of new PDFs from an inbox folder
For under ~10 PDFs, batch isn't worth the setup — just convert them one at a time in the web UI. For 10+ on a one-off basis, the web tool is still the easiest path. For 100+ on a recurring basis, you need OSS automation.
Option 1 — Web UI for one-off batches
The simplest path. Open the PDF to Markdown converter, drag a PDF onto the dropzone, click Convert, download the resulting .md file. Repeat for each file. Tedious for 100+ files, but if it's a one-time migration and you can spare an afternoon, it requires zero setup and gives you our cleanest output per document.
For 10–30 files, this is genuinely the right answer — total time end-to-end is under an hour, no Python environment to set up, no model weights to download.
Best for: one-off batches of a few dozen PDFs, no recurring automation needs, no engineering capacity to deploy OSS.
Honest limitation: we don't currently offer a public API, CLI, or watch-folder integration. If you need true unattended automation, the OSS routes below are the way.
Option 2 — Marker for high-quality batch (Python, OSS)
Marker (GPL-3.0, github.com/VikParuchuri/marker) is the open-source library most comparable to our web tool's quality on PDF-to-Markdown. It runs locally, requires a GPU for reasonable speed, and handles hundreds of PDFs concurrently without per-page fees.
pip install marker-pdf
# Convert a whole folder of PDFs in parallel
marker ./pdfs ./markdown_out --workers 4

For programmatic use inside your own pipeline:
from marker.converters.pdf import PdfConverter
from marker.models import create_model_dict
from marker.output import text_from_rendered
from pathlib import Path
converter = PdfConverter(artifact_dict=create_model_dict())
for pdf in Path('./pdfs').glob('*.pdf'):
    rendered = converter(str(pdf))
    md, _, _ = text_from_rendered(rendered)
    pdf.with_suffix('.md').write_text(md, encoding='utf-8')
    print(f'Done: {pdf.name}')

Throughput on a single consumer GPU (RTX 4090): roughly 1–3 PDFs per second on average academic-paper-sized documents. Scales horizontally — run multiple workers across multiple machines for thousands of PDFs.
Best for: teams with engineering capacity and a GPU, recurring large-volume conversion, strict data-residency needs (everything stays local).
Option 3 — pdftotext + a Markdown step (lightweight, no GPU)
If your PDFs are simple (mostly digital, single-column, no critical tables) and you don't need GPU-grade structure recovery, the classic UNIX pipeline is hard to beat:
apt install poppler-utils # or brew install poppler
# Convert every PDF in a folder, in parallel, no model downloads
find ./pdfs -name '*.pdf' | parallel -j 8 \
'pdftotext -layout {} {.}.txt'

To get something closer to Markdown (headings inferred from font size, basic structure), use PyMuPDF's pymupdf4llm helper, which emits Markdown directly (plain PyMuPDF's get_text() has no Markdown mode):

pip install pymupdf4llm

import pymupdf4llm
from pathlib import Path

for pdf_path in Path('./pdfs').glob('*.pdf'):
    # pymupdf4llm infers headings from font size and emits Markdown-flavored text
    md = pymupdf4llm.to_markdown(str(pdf_path))
    pdf_path.with_suffix('.md').write_text(md, encoding='utf-8')

Throughput: hundreds of PDFs per minute on a laptop, no GPU required. Quality: lower than Marker on complex layouts, but acceptable for clean digital PDFs.
Best for: clean digital PDFs, no GPU available, willingness to accept lower table fidelity.
Option 4 — Docling for structure-heavy batch (OSS)
IBM's Docling (github.com/DS4SD/docling, MIT) is another strong OSS choice, with particularly good handling of complex layouts and tables. CLI usage:
pip install docling
docling ./pdfs --to md --output ./markdown_out

Heavier to set up than Marker, but worth it for documents with intricate tables or multi-column layouts where every cell matters.
Performance tips for OSS batch
Tune parallelism to the bottleneck
For CPU-bound tools (pdftotext, PyMuPDF), use GNU parallel with -j set to your core count. For GPU-bound tools (Marker, Docling), GPU memory is the bottleneck, so start with 2–4 workers per GPU and tune up from there.
Skip already-converted
For incremental batches (re-running the script as new PDFs arrive), check whether the .md already exists before converting. Five lines of Python; saves time on re-runs.
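A minimal version of that check, assuming the .md lands next to its source PDF (adjust the path logic if your output directory differs). It also re-converts when the source PDF is newer than its Markdown output:

```python
from pathlib import Path

def needs_conversion(pdf: Path) -> bool:
    """Skip PDFs whose .md already exists and is newer than the source."""
    md = pdf.with_suffix('.md')
    return not md.exists() or md.stat().st_mtime < pdf.stat().st_mtime

# Usage: filter the batch before converting anything.
# pdfs_to_do = [p for p in Path('./pdfs').glob('*.pdf') if needs_conversion(p)]
```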
Batch errors gracefully
One PDF in a hundred will fail (corrupted file, password-protected, exotic format). Wrap each conversion in try/except, log failures, continue. Don't abort a 1000-file batch because one document was bad.
Watch for OCR slowness
Scanned PDFs require OCR (Marker has it built in via Surya; pdftotext does not — pair it with Tesseract). Expect 3–10× longer processing than for digital PDFs. For mixed batches, sort digital from scanned up front and route each group accordingly.
Cost math at scale
OSS routes have no per-page cost — you pay only for compute. A consumer GPU at home converts 100k pages per month for the cost of electricity (~$15/month at typical rates). A cloud GPU instance (RTX A4000, ~$0.50/hour) handles the same workload for under $50/month if you only run during conversion windows.
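As a sanity check on the cloud figure, here is the arithmetic, with the prices and volumes taken from the text and the required throughput derived from them:

```python
RATE_PER_HOUR = 0.50       # cloud RTX A4000 price quoted above
MONTHLY_BUDGET = 50.0      # the "under $50/month" ceiling
PAGES_PER_MONTH = 100_000

gpu_hours = MONTHLY_BUDGET / RATE_PER_HOUR        # 100 GPU-hours per month
pages_per_hour = PAGES_PER_MONTH / gpu_hours      # 1,000 pages/hour required
pages_per_second = pages_per_hour / 3600          # ~0.28 pages/s: far below what
                                                  # Marker sustains on one GPU
```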
The trade-off is engineering and ops time. If you have it, OSS is dramatically cheaper at scale. If you don't, the web tool covers small batches without setup overhead.
The hybrid pattern
The realistic production setup most teams land on:
- Bulk pipeline: Marker (or Docling, or PyMuPDF) running locally on a GPU machine, converting everything in your archive overnight
- Spot conversions: web tool at /convert/pdf-to-markdown for the one-off PDF you receive in email and need converted in 30 seconds
- Quality check: spot-check 5–10 randomly-sampled OSS conversions against the web tool output to confirm the OSS settings produce comparable quality on your document type
This gives you scalable automation for the bulk of work plus a no-setup option for the edge cases.
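The quality-check sample can be drawn reproducibly; a sketch, with the directory name as a placeholder:

```python
import random
from pathlib import Path

def sample_for_review(md_dir: str, n: int = 10, seed: int = 42) -> list[Path]:
    """Pick n converted files at random for a manual side-by-side check
    against the web tool's output on the same source PDFs."""
    files = sorted(Path(md_dir).glob('*.md'))
    rng = random.Random(seed)        # fixed seed: same sample on every re-run
    return rng.sample(files, min(n, len(files)))
```

A fixed seed means the review set stays stable as you iterate on OSS settings, so you are comparing like with like across runs.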