What makes academic PDFs hard
Three problems compound. First, two-column layout: a naïve top-to-bottom extractor reads straight across both columns, interleaving their lines into gibberish. Second, citations: in-text references like [12] need to stay attached to their sentence rather than drifting off into a footnote. Third, math: a rendered equation is stored as positioned glyphs, not text, so getting LaTeX back requires recognising the equation regions first.
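The first problem is easy to reproduce. The sketch below is a toy illustration, not the converter's code: the coordinates, the sample text and the function name naive_order are all made up for the example. It sorts text lines by vertical position only, which is roughly what a naïve extractor does.

```python
# Toy data: each tuple is (x0, y0, text) for one line of text on a page.
# Assumption: y grows downward, so a smaller y0 means higher on the page.
lines = [
    (50, 100, "Left column, line 1."),
    (50, 115, "Left column, line 2."),
    (310, 100, "Right column, line 1."),
    (310, 115, "Right column, line 2."),
]


def naive_order(lines):
    """Sort top-to-bottom, then left-to-right, ignoring column structure."""
    return [text for _, _, text in sorted(lines, key=lambda l: (l[1], l[0]))]


for text in naive_order(lines):
    print(text)
```

The printed lines alternate between the left and right columns, so two separate paragraphs are spliced together line by line.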
Our converter detects multi-column layouts and reads them in correct order, preserves in-text citations as inline references, converts displayed equations to LaTeX ($$...$$) and inline equations to $...$, and keeps figure captions with their figure numbers so you can find them later.
Reading order on column breaks
The converter analyses block bounding boxes per page, identifies column geometry, and emits text in the order a human reader would follow. Footnotes are collected at the end of the section that referenced them; references appear as a bibliography section at the end. The result reads like the paper, not like the file.
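As a rough illustration of the idea, here is a minimal sketch of column-aware ordering. It is not the converter's actual implementation. It assumes at most two columns, a top-left coordinate origin, and that any block wider than a chosen fraction of the page spans both columns. The names Block, reading_order and span_ratio are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Block:
    x0: float  # left edge
    y0: float  # top edge (assumption: y grows downward)
    x1: float  # right edge
    y1: float  # bottom edge
    text: str


def reading_order(blocks, page_width, span_ratio=0.6):
    """Return blocks in two-column reading order.

    Full-width blocks (titles, wide figures) split the page into bands.
    Within a band, the left column is read top to bottom, then the right.
    """
    mid = page_width / 2
    ordered, left, right = [], [], []

    def flush():
        # Emit the pending band: whole left column, then whole right column.
        ordered.extend(sorted(left, key=lambda b: b.y0))
        ordered.extend(sorted(right, key=lambda b: b.y0))
        left.clear()
        right.clear()

    for b in sorted(blocks, key=lambda b: b.y0):
        if (b.x1 - b.x0) > span_ratio * page_width:
            flush()            # finish the columns above the wide block
            ordered.append(b)  # then emit the wide block itself
        elif (b.x0 + b.x1) / 2 < mid:
            left.append(b)
        else:
            right.append(b)
    flush()
    return ordered


page = [
    Block(310, 100, 550, 180, "R1"),
    Block(50, 40, 550, 70, "Title"),
    Block(50, 210, 290, 300, "L2"),
    Block(310, 190, 550, 300, "R2"),
    Block(50, 100, 290, 200, "L1"),
]
print([b.text for b in reading_order(page, page_width=600)])
```

A real page needs more than this: column boundaries have to be detected rather than assumed to sit at the page midline, and footnotes, captions and three-column layouts need their own handling.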