PDF Formulas to Markdown (LaTeX Output)
Mathematical equations in PDFs are stored as individually positioned glyphs, not as structured text. Default text extraction turns them into gibberish: half the symbols go missing and the rest come out garbled. The right approach is equation-aware conversion that detects equation regions, runs them through a math recognizer, and emits valid LaTeX that any Markdown viewer with MathJax or KaTeX support renders correctly.
How math is stored in PDF
An equation in a typical PDF is rendered the same way as any other content: glyphs at coordinates. The integral sign is one glyph, the variable is another, the differential is two more glyphs side-by-side. There's no underlying notion of "this is an integral" — to the file format, it's just shapes.
This is why standard PDF text extraction garbles math: the extractor sees individual glyphs out of context, doesn't know they form an equation, and emits them as plain characters. Greek letters become Roman approximations or get dropped. Subscripts and superscripts collapse onto the baseline. Two-dimensional layouts (matrices, fractions, summation bounds) flatten into noise.
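To make the failure mode concrete, here is a minimal Python sketch. The `Glyph` structure, the coordinates, and both extractor functions are hypothetical illustrations, not the internals of any real PDF library. It shows how reading glyphs in left-to-right order loses a superscript, and how even a crude layout check recovers it.

```python
from dataclasses import dataclass

@dataclass
class Glyph:
    char: str
    x: float      # horizontal position of the glyph origin
    y: float      # baseline height (larger means higher on the page)
    size: float   # font size in points

# Hypothetical glyph stream for "E = mc^2": the exponent is just a smaller
# glyph drawn above the baseline, with no "superscript" tag attached.
glyphs = [
    Glyph("E", 10.0, 100.0, 12.0),
    Glyph("=", 22.0, 100.0, 12.0),
    Glyph("m", 34.0, 100.0, 12.0),
    Glyph("c", 44.0, 100.0, 12.0),
    Glyph("2", 50.0, 104.0, 8.0),
]

def naive_extract(glyphs):
    """Emit characters in left-to-right order, ignoring baseline and size."""
    return "".join(g.char for g in sorted(glyphs, key=lambda g: g.x))

def layout_aware_extract(glyphs, body_size=12.0):
    """Treat smaller glyphs raised above the lowest baseline as superscripts.

    Simplification: assumes the line has no subscripts, so the lowest
    baseline is the main one.
    """
    baseline = min(g.y for g in glyphs)
    parts = []
    for g in sorted(glyphs, key=lambda g: g.x):
        raised = g.y > baseline and g.size < body_size
        parts.append(f"^{{{g.char}}}" if raised else g.char)
    return "".join(parts)

print(naive_extract(glyphs))         # E=mc2
print(layout_aware_extract(glyphs))  # E=mc^{2}
```

A real math recognizer has to handle far more than this (fraction bars, nested scripts, stretched delimiters), which is why heuristics like the one above do not scale to full equations.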
LaTeX in Markdown
Markdown doesn't define equation syntax in its core spec, but a community convention has become the de facto standard: dollar-sign-delimited LaTeX. Inline equations use single dollars: $E = mc^2$. Display equations (their own line, often centered) use double dollars: $$\nabla \cdot \vec{E} = \rho/\epsilon_0$$. The LaTeX between the dollars is rendered by MathJax or KaTeX in the viewer.
Our converter detects equation regions in the PDF, runs them through a math-aware OCR pipeline, and emits valid LaTeX. The result: an inline equation that looked like "E = mc²" in the PDF comes out as $E = mc^2$ in the Markdown — renderable as crisp typeset math in any compatible viewer.
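If you want to inspect the math in a converted file, the dollar-sign convention is simple enough to scan with a regular expression. The sketch below is an illustrative helper (the name `find_math` is made up here, not part of the converter). It treats every `$...$` pair as math, so prose containing currency amounts would produce false positives.

```python
import re

# Display math is tried first so "$$...$$" is not read as two inline spans.
MATH_PATTERN = re.compile(
    r"\$\$(?P<display>.+?)\$\$"                    # $$ ... $$ (may span lines)
    r"|(?<![\\$])\$(?P<inline>[^$\n]+?)\$(?!\$)",  # $ ... $ on a single line
    re.DOTALL,
)

def find_math(markdown):
    """Return (kind, latex) pairs in document order."""
    spans = []
    for m in MATH_PATTERN.finditer(markdown):
        if m.group("display") is not None:
            spans.append(("display", m.group("display").strip()))
        else:
            spans.append(("inline", m.group("inline").strip()))
    return spans

sample = (
    "Inline: $E = mc^2$.\n\n"
    "$$\\nabla \\cdot \\vec{E} = \\rho/\\epsilon_0$$\n"
)
for kind, latex in find_math(sample):
    print(kind, latex)
```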
MathJax compatibility
The output works in any environment with MathJax or KaTeX support:
- Obsidian: built-in support for both inline and display math
- GitHub: README rendering supports $$ blocks (since June 2022) and $...$ inline
- MkDocs Material: enable the pymdownx.arithmatex extension
- Docusaurus: remark-math + rehype-katex plugins
- Jupyter notebooks: native MathJax in Markdown cells
- Quarto: native math rendering
- VS Code: Markdown Preview Enhanced extension
For environments without math rendering, the LaTeX still reads as plain text and can be carried over into math software (Mathematica, MATLAB, SymPy), though how directly depends on each tool's LaTeX import support.
Examples
Inline equations
Source PDF: "the energy E equals m c squared, where m is mass and c is the speed of light."
Converted Markdown: the energy $E$ equals $mc^2$, where $m$ is mass and $c$ is the speed of light.
Display equations
Source PDF (centered, on its own line): integral from 0 to infinity of e^(-x^2) dx equals sqrt(pi)/2.
Converted Markdown:
$$\int_0^{\infty} e^{-x^2}\, dx = \frac{\sqrt{\pi}}{2}$$
Equation numbers
When the source PDF labels equations (e.g., "(3.14)"), we emit them as comments after the closing dollars. Cross-references in the body that say "see equation (3.14)" remain unchanged so they still match.
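Because labels and body references are both kept, you can check that every reference still has a target. The sketch below assumes, purely for illustration, that a label is emitted as an HTML comment right after the closing dollars, e.g. `$$...$$ <!-- (3.14) -->`. The converter's actual comment syntax is not specified here and may differ, so adjust the `LABEL` pattern to match your output.

```python
import re

# ASSUMPTION: labels appear as HTML comments right after the closing "$$",
# e.g.  $$ ... $$ <!-- (3.14) -->   (the real comment syntax may differ).
LABEL = re.compile(r"\$\$\s*<!--\s*\((\d+(?:\.\d+)*)\)\s*-->")

# Body references such as "equation (3.14)" or "Eq. (2)".
REFERENCE = re.compile(r"\b[Ee]q(?:uation|\.)?\s*\((\d+(?:\.\d+)*)\)")

def unresolved_references(markdown):
    """Return referenced equation numbers that have no matching label."""
    labels = set(LABEL.findall(markdown))
    refs = set(REFERENCE.findall(markdown))
    return sorted(refs - labels)

doc = (
    "$$a^2 + b^2 = c^2$$ <!-- (3.14) -->\n\n"
    "As shown in equation (3.14), and later in equation (3.15)."
)
print(unresolved_references(doc))  # ['3.15']
```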
Matrices and aligned environments
Multi-row math (matrices, aligned equations, cases) is preserved with the appropriate LaTeX environments:
$$\begin{pmatrix} a & b \\ c & d \end{pmatrix}$$
Step-by-step workflow
- Take your math-heavy PDF (academic paper, physics textbook, engineering manual)
- Drop it into the PDF with formulas to Markdown converter
- Wait — equation recognition is the slow step (5-30 seconds extra for an equation-heavy document)
- Download the resulting .md
- Open in any MathJax/KaTeX-enabled viewer (Obsidian, GitHub, etc.); equations render as typeset math
Limitations
Equation OCR is genuinely hard, and no tool is perfect. What works well:
- Standard arithmetic, algebra, calculus, linear algebra
- Set notation, summations, integrals, fractions
- Sub- and superscripts, Greek letters, common operators
- Matrices and aligned multi-line equations
What's harder:
- Commutative diagrams (often need manual recreation)
- Complex tensor notation with multiple indices
- Hand-drawn equations in scanned source
- Custom or non-standard symbols
- Chemical structures (different problem entirely)
For the rare equation that comes out wrong, the surrounding text is unaffected — you can hand-fix the LaTeX without re-running conversion. The OCR engine flags low-confidence equation regions in the output for inspection.
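When hand-fixing an equation, a quick structural check catches the most common slips before you open a viewer. The function below is a hypothetical helper, not part of the converter: it only verifies that braces balance and that `\begin`/`\end` environments pair up. It does not validate that the LaTeX is meaningful or supported by MathJax or KaTeX.

```python
import re

def latex_problems(latex):
    """Return a list of structural problems found in one LaTeX snippet."""
    problems = []

    # Balance check for braces; "\\." consumes escaped characters such as
    # "\{" so they are not counted as grouping braces.
    depth = 0
    for m in re.finditer(r"\\.|[{}]", latex, re.DOTALL):
        tok = m.group()
        if tok == "{":
            depth += 1
        elif tok == "}":
            depth -= 1
            if depth < 0:
                problems.append("unmatched closing brace")
                depth = 0
    if depth > 0:
        problems.append("unclosed opening brace")

    # \begin{env} and \end{env} must pair up in nesting order.
    stack = []
    for kind, env in re.findall(r"\\(begin|end)\{([^}]*)\}", latex):
        if kind == "begin":
            stack.append(env)
        elif not stack or stack.pop() != env:
            problems.append(f"unexpected \\end{{{env}}}")
    problems.extend(f"missing \\end{{{env}}}" for env in stack)
    return problems

print(latex_problems(r"\begin{pmatrix} a & b \\ c & d \end{pmatrix}"))  # []
print(latex_problems(r"\frac{\sqrt{\pi}{2}"))  # ['unclosed opening brace']
```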
Why this matters for AI use
If you're feeding equation-heavy content to ChatGPT or Claude, LaTeX is the format the models understand best. They'll reason over equations correctly, derive related expressions, and even render LaTeX in their responses. Plain-text approximations ("E = mc^2" without the LaTeX delimiters) work but reduce the model's ability to manipulate the math symbolically.
For data scientists and engineers building RAG pipelines on technical documents, equation preservation is often the difference between a useful retrieval system and one that mangles every math reference. Markdown with proper LaTeX is the right intermediate format.
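One practical consequence for RAG pipelines: a chunker that splits on blank lines or character counts can cut a display equation in half. The sketch below is one possible approach, not a prescribed one; `chunk_markdown` and its `max_chars` parameter are names invented for this example. It treats each `$$...$$` block as an indivisible unit when packing paragraphs into chunks.

```python
import re

def chunk_markdown(markdown, max_chars=500):
    """Pack paragraphs into chunks without ever splitting a $$...$$ block."""
    # re.split with a capturing group keeps the display-math blocks:
    # even indices are ordinary text, odd indices are $$...$$ blocks.
    pieces = re.split(r"(\$\$.*?\$\$)", markdown, flags=re.DOTALL)

    units = []
    for i, piece in enumerate(pieces):
        if i % 2 == 1:
            units.append(piece)  # keep the equation whole
        else:
            paragraphs = re.split(r"\n\s*\n", piece)
            units.extend(p.strip() for p in paragraphs if p.strip())

    # Greedily pack units into chunks of at most max_chars characters.
    # A single unit longer than max_chars becomes its own oversized chunk.
    chunks, current = [], ""
    for unit in units:
        if current and len(current) + len(unit) + 2 > max_chars:
            chunks.append(current)
            current = unit
        else:
            current = f"{current}\n\n{unit}" if current else unit
    if current:
        chunks.append(current)
    return chunks

doc = "Intro paragraph.\n\n$$\na = b\n\n+ c\n$$\n\nClosing paragraph."
for chunk in chunk_markdown(doc, max_chars=30):
    print(repr(chunk))
```

Here the equation contains a blank line, which a naive paragraph splitter would break on; the sketch keeps it in one chunk.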