The table problem in PDFs
PDF tables are visual constructs: rectangles drawn on a page, with text placed inside them. There's no schema saying "this is row 3 of column B"; the converter has to infer that from coordinates. Simple tables (a single header row, regular cell sizes) are easy. Tables with merged cells, multi-row headers, financial-report-style nested rows, or rotated header text are the long tail where most converters fall apart.
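To make "infer that from coordinates" concrete, here is a minimal sketch of one common heuristic: group text runs into rows when their vertical positions are close, then order each row left to right. The `TextRun` type, the `group_into_rows` function, and the `y_tolerance` value are assumptions made up for illustration, not part of any particular converter.

```python
from dataclasses import dataclass


@dataclass
class TextRun:
    """A piece of text with its position on the page (hypothetical type)."""
    text: str
    x: float  # left edge of the run
    y: float  # vertical position, measured from the top of the page


def group_into_rows(runs, y_tolerance=2.0):
    """Group runs into rows by vertical proximity, then sort each row by x.

    A run joins the current row if its y is within y_tolerance of the first
    run in that row; otherwise it starts a new row.
    """
    rows = []
    for run in sorted(runs, key=lambda r: (r.y, r.x)):
        if rows and abs(run.y - rows[-1][0].y) <= y_tolerance:
            rows[-1].append(run)
        else:
            rows.append([run])
    return [[r.text for r in sorted(row, key=lambda r: r.x)] for row in rows]


runs = [
    TextRun("B", 120, 100.4),
    TextRun("A", 20, 100.0),
    TextRun("1", 20, 115.0),
    TextRun("2", 120, 114.8),
]
table = group_into_rows(runs)
```

A fixed tolerance like this works for regular tables and breaks down on exactly the cases listed above: a merged cell or a rotated header does not share a baseline with its neighbours.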
Our table strategy
We detect tables via line detection and whitespace clustering, then reconstruct cells from the bounding boxes of text runs. GFM tables have no cell spanning, so merged cells are flattened with the merged value repeated in every cell the merge covered (this keeps the GFM output rendering correctly), with a comment marking the original merge for downstream tools that care. Multi-row headers are collapsed into a single header row, with the levels joined by a separator (for example, "Q1 / Revenue"). The result is GFM-valid Markdown that round-trips into spreadsheets cleanly.
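The flattening and header-collapsing steps can be sketched as follows. This is an illustrative sketch, not our actual implementation: the function names are hypothetical, it assumes a cell covered by a horizontal merge arrives as `None`, and it leaves out the comment that marks the original merge.

```python
def fill_merged(row):
    """Repeat the last seen value into cells covered by a horizontal merge.

    Assumption: covered cells are represented as None.
    """
    filled, last = [], ""
    for cell in row:
        if cell is not None:
            last = cell
        filled.append(last)
    return filled


def collapse_headers(header_rows, sep=" / "):
    """Collapse several header rows into one, joining the levels with sep."""
    filled = [fill_merged(row) for row in header_rows]
    # Walk column by column and skip empty levels so "Region" stays "Region".
    return [sep.join(part for part in column if part) for column in zip(*filled)]


def to_gfm(header_rows, body_rows):
    """Render a GFM table: header row, delimiter row, then the body rows."""
    header = collapse_headers(header_rows)
    rows = [header, ["---"] * len(header)]
    rows += [fill_merged(row) for row in body_rows]
    return "\n".join(
        # Escape literal pipes so they are not read as column separators.
        "| " + " | ".join(cell.replace("|", "\\|") for cell in row) + " |"
        for row in rows
    )


header_rows = [
    ["Region", "Q1", None, "Q2", None],
    ["", "Revenue", "Cost", "Revenue", "Cost"],
]
body_rows = [["EMEA", "10", "4", "12", "5"]]
markdown = to_gfm(header_rows, body_rows)
print(markdown)
```

Repeating the merged value rather than leaving the covered cells blank means each column header is self-describing ("Q1 / Cost" instead of just "Cost"), which is what lets the table survive an import into a spreadsheet without extra context.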