Pricing Dashboard Sign up
Recent
· 7 min read · MDisBetter

PDF Formulas to Markdown (LaTeX Output)

Mathematical equations in PDFs are images of glyphs, not text. The default extraction is gibberish — half the symbols missing, the rest garbled. The right approach is equation-aware conversion that detects equation regions, runs them through a math recognizer, and emits valid LaTeX that any modern Markdown viewer renders correctly.

How math is stored in PDF

An equation in a typical PDF is rendered the same way as any other content: glyphs at coordinates. The integral sign is one glyph, the variable is another, the differential is two more glyphs side-by-side. There's no underlying notion of "this is an integral" — to the file format, it's just shapes.

This is why standard PDF text extraction garbles math: the extractor sees individual glyphs out of context, doesn't know they form an equation, and emits them as plain characters. Greek letters become Roman approximations or get dropped. Subscripts and superscripts collapse into the baseline. Multi-line equations (matrices, fractions, summation bounds) flatten into noise.

LaTeX in Markdown

Markdown doesn't define equation syntax in its core spec, but a community convention has standardized: dollar-sign-delimited LaTeX. Inline equations use single dollars: $E = mc^2$. Display equations (their own line, often centered) use double dollars: $$\nabla \cdot \vec{E} = \rho/\epsilon_0$$. The LaTeX between the dollars is rendered by MathJax or KaTeX in the viewer.

Our converter detects equation regions in the PDF, runs them through a math-aware OCR pipeline, and emits valid LaTeX. The result: an inline equation that looked like "E = mc²" in the PDF comes out as $E = mc^2$ in the Markdown — renderable as crisp typeset math in any compatible viewer.

MathJax compatibility

The output works in any environment with MathJax or KaTeX support:

For environments without math rendering, the LaTeX still reads as plain text and round-trips cleanly into proper math software (Mathematica, MATLAB, SymPy).

Examples

Inline equations

Source PDF: "the energy E equals m c squared, where m is mass and c is the speed of light."

Converted Markdown: the energy $E$ equals $mc^2$, where $m$ is mass and $c$ is the speed of light.

Display equations

Source PDF (centered, on its own line): integral from 0 to infinity of e^(-x^2) dx equals sqrt(pi)/2.

Converted Markdown:

$$\int_0^{\infty} e^{-x^2}\, dx = \frac{\sqrt{\pi}}{2}$$

Equation numbers

When the source PDF labels equations (e.g., "(3.14)"), we emit them as comments after the closing dollars. Cross-references in the body that say "see equation (3.14)" remain unchanged so they still match.

Matrices and aligned environments

Multi-row math (matrices, aligned equations, cases) is preserved with the appropriate LaTeX environments:

$$\begin{pmatrix} a & b \\ c & d \end{pmatrix}$$

Step-by-step workflow

  1. Take your math-heavy PDF (academic paper, physics textbook, engineering manual)
  2. Drop it into the PDF with formulas to Markdown converter
  3. Wait — equation recognition is the slow step (5-30 seconds extra for an equation-heavy document)
  4. Download the resulting .md
  5. Open in any MathJax/KaTeX-enabled viewer (Obsidian, GitHub, etc.) — equations render as typeset math

Limitations

Equation OCR is genuinely hard, and no tool is perfect. What works well:

What's harder:

For the rare equation that comes out wrong, the surrounding text is unaffected — you can hand-fix the LaTeX without re-running conversion. The OCR engine flags low-confidence equation regions in the output for inspection.

Why this matters for AI use

If you're feeding equation-heavy content to ChatGPT or Claude, LaTeX is the format the models understand best. They'll reason over equations correctly, derive related expressions, and even render LaTeX in their responses. Plain-text approximations ("E = mc^2" without the LaTeX delimiters) work but reduce the model's ability to manipulate the math symbolically.

For data scientists and engineers building RAG pipelines on technical documents, equation preservation is often the difference between a useful retrieval system and one that mangles every math reference. Markdown with proper LaTeX is the right intermediate format.

Frequently asked questions

What math notation works best?
Standard LaTeX — anything you'd find in a physics or math undergraduate textbook converts cleanly. Specialized notation (commutative diagrams, complex tensor indices) may degrade and need manual review.
Will the LaTeX render in Obsidian and GitHub?
Yes — both have native MathJax-compatible rendering. Obsidian supports both inline and display math out of the box; GitHub has rendered math in Markdown since June 2022.
Can I convert physics or engineering papers?
Yes — those are the most common use case. Vector calculus, partial derivatives, tensor notation, and standard physics symbols all convert with high fidelity.