Converting Technical Documentation from PDF to Markdown
Every established product carries a backlog of PDF documentation: user manuals from a previous decade, API references that nobody updates, training materials in binders. Migrating to Markdown turns those static PDFs into living documentation: version-controlled in Git, reviewable in pull requests, reusable across docs sites, and readable by the AI tools your users query. For a few hundred documents, the migration takes weeks, not months.
Why migrate now
Five compounding reasons:
- Search and findability: PDF search is per-document; docs site search is across the entire corpus
- Version control: Git diffs show exactly what changed in any doc, by whom, when
- Translation: translating Markdown is straightforward; translating PDFs requires re-doing layout in InDesign
- AI integration: customers using Cursor/Copilot/Claude expect to drop documentation in their workspace as Markdown
- Style consistency: markdownlint catches style issues automatically; PDF style review is manual
Teams that delay migration accumulate dead documentation that nobody reads or maintains. Teams that migrate find their docs become a competitive advantage in customer experience.
Audit before migrating
The first 30% of the migration is deciding what to migrate. Most teams find half their legacy PDFs are obsolete and can be archived without conversion. Categorize:
- Keep + migrate: still authoritative, customers reference
- Keep + don't migrate: legal/regulatory artifacts that need to stay as PDF
- Archive: outdated, superseded, or never used — don't waste effort migrating
- Replace: rewrite from scratch in Markdown rather than convert (rare but right for very dated content)
This audit usually takes a few days but saves weeks of wasted migration on dead content.
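One way to make the audit actionable is to record each decision in a small inventory file and let later steps read from it. The sketch below assumes a hypothetical CSV with path and decision columns; the column names and category labels are illustrative, not part of any tool.

```python
import csv
from enum import Enum
from io import StringIO


class Decision(Enum):
    """The four audit categories (labels are illustrative)."""
    MIGRATE = "migrate"
    KEEP_AS_PDF = "keep-as-pdf"
    ARCHIVE = "archive"
    REPLACE = "replace"


def load_inventory(csv_text: str) -> list[tuple[str, Decision]]:
    """Parse 'path,decision' rows into (path, Decision) pairs."""
    reader = csv.DictReader(StringIO(csv_text))
    return [(row["path"], Decision(row["decision"])) for row in reader]


def files_to_convert(inventory: list[tuple[str, Decision]]) -> list[str]:
    """Only documents marked for migration go on to batch conversion."""
    return [path for path, decision in inventory if decision is Decision.MIGRATE]
```

An unknown label raises a ValueError when the file is loaded, which catches typos in the inventory before any conversion runs.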
The migration workflow
Step 1: Batch convert the keepers
Call our API from a Python loop (see the batch conversion guide). For 100-500 documents, the conversion runs in tens of minutes. Output: a folder of .md files that mirrors your PDF inventory.
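The loop itself might look like the sketch below. The actual API call is deliberately left out: convert is a hypothetical callable that takes PDF bytes and returns Markdown text, so wrap whatever client the batch conversion guide describes in a function with that shape.

```python
from pathlib import Path
from typing import Callable


def batch_convert(
    pdf_dir: Path,
    out_dir: Path,
    convert: Callable[[bytes], str],  # hypothetical wrapper around the conversion API
) -> list[Path]:
    """Convert every PDF under pdf_dir, mirroring the folder layout in out_dir."""
    written = []
    for pdf_path in sorted(pdf_dir.rglob("*.pdf")):
        target = out_dir / pdf_path.relative_to(pdf_dir).with_suffix(".md")
        if target.exists():
            continue  # already converted, so reruns after a failure are cheap
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_text(convert(pdf_path.read_bytes()), encoding="utf-8")
        written.append(target)
    return written
```

Skipping files that already exist means an interrupted run can simply be restarted without converting the same documents twice.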
Step 2: Style review
Run markdownlint with your team's standard config to catch:
- Inconsistent heading hierarchy (skipped levels, multiple H1s)
- Trailing whitespace, tab/space mixing
- Inconsistent list markers
- Long lines (if your style guide has a max width)
Whitespace and list-marker issues are auto-fixable with markdownlint --fix. Heading hierarchy and line-length problems need manual edits, which usually take a few minutes per document.
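To make the heading checks concrete, here is a minimal Python sketch of the two hierarchy problems listed above. It is an illustration only, not how markdownlint is implemented, and it assumes ATX-style headings (lines starting with #).

```python
import re

HEADING = re.compile(r"^(#{1,6})\s+\S")
# Fence markers are built with multiplication so this example stays fence-safe.
FENCE_MARKERS = ("`" * 3, "~" * 3)


def heading_issues(markdown: str) -> list[str]:
    """Report extra H1s and skipped heading levels, ignoring fenced code."""
    issues = []
    previous_level = 0
    h1_count = 0
    in_fence = False
    for number, line in enumerate(markdown.splitlines(), start=1):
        if line.lstrip().startswith(FENCE_MARKERS):
            in_fence = not in_fence
            continue
        if in_fence:
            continue
        match = HEADING.match(line)
        if not match:
            continue
        level = len(match.group(1))
        if level == 1:
            h1_count += 1
            if h1_count > 1:
                issues.append(f"line {number}: extra H1")
        if previous_level and level > previous_level + 1:
            issues.append(f"line {number}: jumps from H{previous_level} to H{level}")
        previous_level = level
    return issues
```

Converted PDFs tend to trip both checks, because visual font sizes in the PDF do not map cleanly onto heading levels.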
Step 3: Reorganize for the docs site
The PDF organization probably doesn't match what makes sense for a web docs site. Re-organize into the structure your target tool expects:
- MkDocs: organize the docs/ folder structure, edit the mkdocs.yml nav
- Docusaurus: organize docs/ + edit sidebars.js
- Hugo: organize content/ sections + add front matter
- Jekyll: _posts/ for dated content, collections for evergreen
Splitting long PDFs into multiple Markdown files (one per chapter) is often the right call — improves navigability on the web and keeps individual pages digestible.
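A chapter split can be sketched in a few lines of Python, assuming each chapter starts with a top-level # heading in the converted file. The file-naming scheme (a numeric prefix plus a slug) is an illustrative choice, not a requirement of any docs tool.

```python
import re

# Fence markers are built with multiplication so this example stays fence-safe.
FENCE_MARKERS = ("`" * 3, "~" * 3)


def slugify(title: str) -> str:
    """Lowercase the title and replace runs of other characters with hyphens."""
    slug = re.sub(r"[^a-z0-9]+", "-", title.lower()).strip("-")
    return slug or "untitled"


def split_by_h1(markdown: str) -> list[tuple[str, str]]:
    """Return (filename, content) pairs, one per top-level heading."""
    chapters = []  # each entry is (title, list of lines)
    in_fence = False
    for line in markdown.splitlines():
        if line.lstrip().startswith(FENCE_MARKERS):
            in_fence = not in_fence
        starts_chapter = not in_fence and line.startswith("# ")
        if starts_chapter or not chapters:
            # Text before the first heading lands in a "preamble" chapter.
            title = line[2:].strip() if starts_chapter else "preamble"
            chapters.append((title, []))
        chapters[-1][1].append(line)
    return [
        (f"{index:02d}-{slugify(title)}.md", "\n".join(lines).rstrip() + "\n")
        for index, (title, lines) in enumerate(chapters, start=1)
    ]
```

The numeric prefix preserves the original chapter order on disk, which helps when writing the nav or sidebar file afterwards.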
Step 4: Add docs-as-code infrastructure
Once content is in Markdown:
- Set up CI to build and deploy on every commit
- Add a markdownlint check in CI to enforce style
- Set up PR templates so doc changes get reviewed like code changes
- Tag versions when you ship major updates
The infrastructure is one-time setup; the workflow benefits compound on every subsequent doc change.
Translation strategy
If you have multilingual docs (or want to add languages), do source translation in Markdown, not PDF. Tools like Crowdin, Lokalise, and Phrase accept Markdown natively, preserve formatting through translation memory, and integrate with Git workflows.
Pattern:
- Source language Markdown lives in docs/en/
- Translation tool pulls source files, distributes to translators
- Translated files land in docs/{lang}/
- Docs site builds language-specific versions
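With this layout, translation coverage becomes a simple path comparison. The sketch below assumes translated files keep the same relative path and filename as their source; the function name is hypothetical.

```python
from pathlib import Path


def missing_translations(docs_root: Path, source_lang: str, target_lang: str) -> list[str]:
    """List source files that have no counterpart under docs/{target_lang}/."""
    source_dir = docs_root / source_lang
    target_dir = docs_root / target_lang
    return sorted(
        path.relative_to(source_dir).as_posix()
        for path in source_dir.rglob("*.md")
        if not (target_dir / path.relative_to(source_dir)).exists()
    )
```

Running a check like this in CI gives an early warning when a new source page ships without its translations.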
Far more sustainable than the InDesign-per-language workflow many teams still run for PDF docs.
Handling DITA / DocBook source
If your legacy docs are in DITA or DocBook (XML-based formats), you have two paths:
- Direct XML to Markdown: Pandoc reads DocBook natively, so DocBook can go straight to Markdown via Pandoc. Pandoc has no DITA reader; for DITA, the DITA Open Toolkit can publish to Markdown, or you can render to HTML first and convert that with Pandoc. Either way, skip our converter entirely.
- PDF intermediate: if you've lost the source XML and only have the rendered PDF, convert the PDF via our tool. You lose semantic intent (DITA's <cmd> etc.) but recover most of the structure.
For shops modernizing DITA or DocBook, the cleanest migration is XML→Markdown directly from the source files. Our PDF converter is the fallback when source XML isn't available.
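For the DocBook path, the Pandoc call can be sketched as below. It assumes pandoc is installed and on the PATH; the choice of gfm (GitHub-Flavored Markdown) as the output format and of --wrap=none are illustrative defaults, and the function names are hypothetical.

```python
import subprocess
from pathlib import Path


def docbook_to_markdown_command(source: Path, target: Path) -> list[str]:
    """Build the pandoc argument list for one DocBook file."""
    return [
        "pandoc",
        "--from", "docbook",
        "--to", "gfm",
        "--wrap=none",  # keep paragraphs on one line for cleaner Git diffs
        "--output", str(target),
        str(source),
    ]


def convert_docbook(source: Path, target: Path) -> None:
    """Run pandoc; raises CalledProcessError if the conversion fails."""
    subprocess.run(docbook_to_markdown_command(source, target), check=True)
```

Building the argument list separately from running it makes the command easy to log or dry-run across a whole folder of XML files.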
What about PDFs that customers must continue receiving?
Some doc types still need PDF distribution: regulatory filings, formal contracts, certified training materials. The right pattern: author once in Markdown, generate PDF as a derivative.
- Source of truth: Markdown in your repo
- For web: docs site builds HTML from Markdown
- For PDF: Pandoc or our Markdown to PDF tool generates PDFs from the same source
Single source, multiple outputs. Easier than maintaining parallel Markdown and PDF copies that drift.
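As a sketch of the PDF leg using Pandoc, the helper below builds one command per Markdown source. It assumes pandoc and a LaTeX engine are installed; xelatex is an illustrative choice, and the function name and output layout are hypothetical.

```python
from pathlib import Path


def pdf_commands(sources: list[Path], out_dir: Path, engine: str = "xelatex") -> list[list[str]]:
    """Build one pandoc command per Markdown file, writing PDFs into out_dir."""
    commands = []
    for source in sources:
        target = out_dir / source.with_suffix(".pdf").name
        commands.append(
            ["pandoc", str(source), "--output", str(target), f"--pdf-engine={engine}"]
        )
    return commands
```

Each command can be passed to subprocess.run in the same CI job that builds the docs site, so the PDF and the HTML always come from the same commit.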
Realistic timeline
For a typical migration of 200 PDF documents:
- Week 1: audit and categorization
- Week 2: batch conversion + style review
- Week 3: docs site setup, content reorganization
- Weeks 4-6: editorial polish, broken-link cleanup, deployment
One technical writer can complete this in 4-6 weeks part-time. Larger migrations (1000+ documents) parallelize well — you can split the corpus across multiple writers and merge in Git.