Where PDF kills your RAG accuracy
The two failure modes are predictable. First, naive fixed-size chunking of extracted PDF text routinely splits sentences mid-clause and joins unrelated columns — each chunk's embedding then blends meaning with layout noise, and retrieval surfaces irrelevant chunks. Second, the chunks that are retrieved often contain page numbers and running headers that confuse the LLM during synthesis ("the document mentions page 14 in answer 4…").
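To make the first failure mode concrete, here is a minimal sketch of naive fixed-size chunking applied to extracted PDF text. The sample text and the leaked page footer are invented for illustration; the point is that every chunk boundary falls mid-sentence and the footer noise lands inside a content chunk.

```python
# Naive fixed-size chunking: cut the extracted text every N characters,
# ignoring sentence, column, and section boundaries.
def fixed_size_chunks(text: str, size: int) -> list[str]:
    return [text[i:i + size] for i in range(0, len(text), size)]

# Hypothetical extraction output: a page footer has leaked into the
# middle of the text stream, as commonly happens with PDF extractors.
extracted = (
    "Revenue grew 12% year over year, driven mainly by the cloud segment. "
    "14  ACME Corp Annual Report  "
    "Operating margin, however, declined for the third consecutive quarter."
)

chunks = fixed_size_chunks(extracted, 60)
for c in chunks:
    print(repr(c))
# The middle chunk mixes a sentence fragment, the footer, and the start
# of an unrelated sentence — its embedding averages all three together.
```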
Markdown solves both. Headings give you semantic chunk boundaries that respect the document's own structure. Cleaner text gives you embeddings that cluster on meaning instead of layout artefacts.
Recommended chunking strategy
Split first by Markdown headings (header-aware splitter), then sub-split anything still over your token budget with a recursive character splitter. Typical settings: target 800 tokens, overlap 100. Keep the heading path as metadata on each chunk so the LLM gets context for free at synthesis time.
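The strategy above can be sketched in plain Python. This is a simplified stand-in for library splitters (e.g. LangChain's MarkdownHeaderTextSplitter and RecursiveCharacterTextSplitter), and it approximates the token budget at ~4 characters per token — a real pipeline would count tokens with the embedding model's tokenizer. The constant names and the " > " path separator are choices made here, not a standard.

```python
import re

TARGET_TOKENS = 800     # per-chunk budget from the text above
OVERLAP_TOKENS = 100    # overlap between adjacent sub-chunks
CHARS_PER_TOKEN = 4     # rough proxy; use a real tokenizer in production

def split_by_headings(markdown: str) -> list[tuple[str, str]]:
    """Yield (heading_path, body) pairs, one per heading-delimited section."""
    path: dict[int, str] = {}   # heading level -> heading text
    sections, current = [], []

    def flush():
        body = "".join(current).strip()
        if body:
            sections.append((" > ".join(path[k] for k in sorted(path)), body))
        current.clear()

    for line in markdown.splitlines(keepends=True):
        m = re.match(r"^(#{1,6})\s+(.*)", line)
        if m:
            flush()
            level = len(m.group(1))
            # A new heading invalidates any deeper levels in the path.
            path = {k: v for k, v in path.items() if k < level}
            path[level] = m.group(2).strip()
        else:
            current.append(line)
    flush()
    return sections

def sub_split(text, separators=("\n\n", ". ", " "),
              limit=TARGET_TOKENS * CHARS_PER_TOKEN):
    """Recursively split oversized text on paragraph, then sentence,
    then word boundaries, keeping a character-level overlap."""
    if len(text) <= limit:
        return [text]
    if not separators:
        return [text[i:i + limit] for i in range(0, len(text), limit)]
    sep, rest = separators[0], separators[1:]
    chunks, buf = [], ""
    for part in text.split(sep):
        piece = part + sep
        if len(piece) > limit:
            # This part alone busts the budget: recurse with finer separators.
            if buf:
                chunks.append(buf)
                buf = ""
            chunks.extend(sub_split(part, rest, limit))
            continue
        if len(buf) + len(piece) > limit:
            chunks.append(buf)
            # Seed the next chunk with the tail of the previous one.
            buf = buf[-OVERLAP_TOKENS * CHARS_PER_TOKEN:]
        buf += piece
    if buf.strip():
        chunks.append(buf)
    return [c.strip() for c in chunks if c.strip()]

def chunk_markdown(markdown: str) -> list[dict]:
    """Header-aware split, then sub-split, with heading path as metadata."""
    return [
        {"heading_path": path, "text": chunk}
        for path, body in split_by_headings(markdown)
        for chunk in sub_split(body)
    ]

chunks = chunk_markdown("# Report\n\nIntro.\n\n## Methods\n\nDetails here.\n")
# → [{'heading_path': 'Report', 'text': 'Intro.'},
#    {'heading_path': 'Report > Methods', 'text': 'Details here.'}]
```

At synthesis time the heading path can be prepended to each retrieved chunk, so the LLM sees where in the document the text came from without any extra retrieval work.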