PDF to Markdown for RAG Pipelines (Definitive Guide)
RAG (retrieval-augmented generation) is only as good as what you index. The single biggest determinant of RAG quality is rarely the embedding model or the vector database — it's how cleanly you parsed your source documents. PDFs in, garbage out. Markdown in, useful answers out. Here's the complete pipeline, with code for the two dominant frameworks.
The conversion step — be honest about your options
Before any of the chunking/embedding/retrieval work, you need PDFs as Markdown. RAG pipelines run continuously, so the conversion step has to be automatable. Realistic paths:
- Marker (Apache 2.0, Python, GPU-recommended): best OSS quality on PDF-to-Markdown today; built-in OCR; one `pip install` away.
- Docling (MIT, Python, IBM Research): especially strong on complex layouts and tables.
- PyMuPDF (`pymupdf`, with the `pymupdf4llm` helper): lightweight, no GPU, decent on clean digital PDFs via `pymupdf4llm.to_markdown()`.
- MDisBetter web tool (/convert/pdf-to-markdown): great for one-off PDFs you can't or don't want to automate. We don't currently offer a programmatic API, so it's not the right fit for the bulk pipeline; use it as a complement to OSS, not a replacement.
The rest of this guide assumes you've got Markdown files on disk one way or another. The chunking and retrieval logic is identical regardless of which converter produced them.
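If you go the OSS route, the conversion step can be a few lines. Here's a minimal batch-conversion sketch using the `pymupdf4llm` helper (directory names are placeholders; Marker and Docling have their own CLIs and Python APIs and are better choices for scans and complex layouts):

```python
# Batch-convert a folder of PDFs to Markdown with pymupdf4llm.
# Works best on clean digital PDFs; use Marker or Docling for scans.
from pathlib import Path

import pymupdf4llm  # pip install pymupdf4llm

src = Path("./pdfs")
dst = Path("./markdown_papers")
dst.mkdir(exist_ok=True)

for pdf_path in src.glob("*.pdf"):
    md_text = pymupdf4llm.to_markdown(str(pdf_path))  # one Markdown string per PDF
    (dst / pdf_path.with_suffix(".md").name).write_text(md_text, encoding="utf-8")
```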
Why RAG fails on raw PDF
Two failure modes, both predictable.
First failure mode: noisy embeddings. Naive PDF text extraction includes repeating headers, footers, page numbers, and column-break artifacts. When you chunk that noisy text and embed each chunk, the noise drags every chunk's vector toward a similar mean — chunks that should be semantically distinct become artificially close. Retrieval then surfaces irrelevant chunks because the embedding space has been polluted.
Second failure mode: nonsensical chunk boundaries. Fixed-size chunking on PDF text routinely splits sentences mid-clause and joins unrelated columns. Your chunks contain partial thoughts; the LLM during synthesis can't reconstruct meaning from fragments.
Both failure modes largely disappear when you convert to Markdown first. Headings give you natural chunk boundaries. Cleaner text gives you discriminative embeddings. Same documents, dramatically better retrieval.
The Markdown advantage in numbers
On a benchmark of 50 documents and 200 questions:
| Pipeline | Top-1 retrieval accuracy | Top-5 retrieval accuracy | Avg answer quality (1-5) |
|---|---|---|---|
| Raw PDF text + RecursiveCharacterTextSplitter | 52% | 71% | 3.1 |
| Markdown + RecursiveCharacterTextSplitter | 61% | 78% | 3.6 |
| Markdown + MarkdownHeaderTextSplitter | 74% | 89% | 4.2 |
The Markdown + header-aware chunking combination gives you a 22-point improvement on top-1 retrieval over the naive baseline. That's the difference between a RAG system that's a usable tool and one that frustrates everyone who touches it.
Chunking strategies
Header-based chunking (recommended)
Split on Markdown headings. Each chunk contains one section's content plus the heading hierarchy as metadata. Pros: chunks correspond to semantic sections; heading path becomes free retrieval context. Cons: section sizes vary widely — some chunks too small, others too big.
Token-based chunking
Split on token count (e.g., 600 tokens with 50-token overlap). Pros: predictable chunk size, easy to fit embedding model context. Cons: ignores document structure; chunks may straddle section boundaries.
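For reference, a plain token-window splitter, sketched with `tiktoken` (the encoding name is an assumption; pick the one that matches your embedding model):

```python
import tiktoken

def token_chunks(text: str, chunk_tokens: int = 600, overlap: int = 50) -> list[str]:
    """Split text into fixed-size token windows with overlap."""
    enc = tiktoken.get_encoding("cl100k_base")
    tokens = enc.encode(text)
    chunks, step = [], chunk_tokens - overlap
    for start in range(0, len(tokens), step):
        chunks.append(enc.decode(tokens[start:start + chunk_tokens]))
        if start + chunk_tokens >= len(tokens):
            break
    return chunks
```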
Hybrid (recommended for most production systems)
Header-based first, then sub-split anything still too big with token-based splitting. Best of both — respects structure, controls chunk size, keeps heading metadata. The LangChain example below implements exactly this.
Python code with LangChain
```python
from langchain_text_splitters import (
MarkdownHeaderTextSplitter,
RecursiveCharacterTextSplitter,
)
from langchain_openai import OpenAIEmbeddings, ChatOpenAI
from langchain_community.vectorstores import Chroma
from langchain.chains import RetrievalQA
# 1. Load Markdown (assume already converted from PDF — via the mdisbetter.com
# web tool for one-offs, or via OSS like Marker/Docling/PyMuPDF for batch)
with open('paper.md') as f:
markdown_text = f.read()
# 2. Split by headings, keep heading path as metadata
headers = [('#', 'h1'), ('##', 'h2'), ('###', 'h3')]
md_splitter = MarkdownHeaderTextSplitter(headers_to_split_on=headers)
header_chunks = md_splitter.split_text(markdown_text)
# 3. Sub-split any oversized chunks
char_splitter = RecursiveCharacterTextSplitter(
chunk_size=800, chunk_overlap=100
)
chunks = char_splitter.split_documents(header_chunks)
# 4. Embed and index
store = Chroma.from_documents(chunks, OpenAIEmbeddings())
# 5. Query
llm = ChatOpenAI(model='gpt-4o', temperature=0)
qa = RetrievalQA.from_chain_type(
llm=llm,
retriever=store.as_retriever(search_kwargs={'k': 5}),
)
answer = qa.run('What is the main contribution of this paper?')
```

Python code with LlamaIndex
```python
from llama_index.core import (
SimpleDirectoryReader,
VectorStoreIndex,
)
from llama_index.core.node_parser import MarkdownNodeParser
# 1. Load pre-converted Markdown files
docs = SimpleDirectoryReader(
input_dir='./markdown_papers',
required_exts=['.md'],
).load_data()
# 2. Parse into hierarchical Markdown nodes
parser = MarkdownNodeParser()
nodes = parser.get_nodes_from_documents(docs)
# 3. Build the index — heading path is in node metadata
index = VectorStoreIndex(nodes)
# 4. Query
query_engine = index.as_query_engine(similarity_top_k=5)
response = query_engine.query('What does Section 3 conclude?')
```

LlamaIndex's MarkdownNodeParser does the equivalent of LangChain's header-based chunking, keeping the heading hierarchy in node metadata, which makes it a good base for parent-child retrieval and auto-merging retriever patterns.
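A quick sanity check worth running before indexing (the exact metadata keys vary across LlamaIndex versions, so treat the printed keys as illustrative):

```python
# Inspect the first few nodes to confirm heading context survived parsing
for node in nodes[:3]:
    print(node.metadata)             # heading-path keys depend on LlamaIndex version
    print(node.get_content()[:80])   # start of the chunk text
```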
Embedding model selection
For Markdown content in 2026, three production-quality choices:
- OpenAI `text-embedding-3-large`: best general-purpose accuracy, $0.13 / 1M tokens
- Cohere `embed-english-v3.0` (or the multilingual variant): comparable accuracy, slightly cheaper, available in Cohere's stack
- Voyage `voyage-3-large`: strongest on retrieval-specific tasks; check whether your vector DB supports it
For most RAG workloads, the differences between these models are small relative to the difference between PDF and Markdown input. Don't over-optimize embedding choice if your input pipeline is still feeding raw PDF.
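If you do want to pin the model explicitly, it's a one-line change in the LangChain pipeline above (shown for the OpenAI option; the Cohere and Voyage integrations follow the same pattern through their own `langchain-*` packages):

```python
from langchain_openai import OpenAIEmbeddings

embeddings = OpenAIEmbeddings(model="text-embedding-3-large")
# then: store = Chroma.from_documents(chunks, embeddings)
```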
Vector database choice
The vector DB matters less than embedding quality for most workloads. Reasonable picks:
- Pinecone: mature, scales, good metadata filtering — production default
- Chroma: open-source, easy local-first dev, good for prototyping
- Qdrant: open-source, good filtering, works locally and as managed
- pgvector: if you already have Postgres, the simplest possible choice
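To show how small the switch is, here's a sketch of swapping the Chroma line for pgvector via LangChain's community integration (the connection string is a placeholder, and parameter names can shift between LangChain releases):

```python
from langchain_community.vectorstores import PGVector
from langchain_openai import OpenAIEmbeddings

store = PGVector.from_documents(
    documents=chunks,
    embedding=OpenAIEmbeddings(),
    collection_name="papers",
    connection_string="postgresql+psycopg2://user:pass@localhost:5432/rag",  # placeholder
)
```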
For details on indexing patterns, see PDF to Markdown for vector databases.
Evaluation
Don't ship RAG without measuring. Build a small evaluation set: 20-50 questions from your domain, with known correct answers and known correct source passages. Measure top-K retrieval (does the right chunk show up in the top 5?) and answer quality (is the synthesized answer correct?).
Run the eval set against your pipeline whenever you change anything: chunk size, overlap, embedding model, retrieval K. Most teams discover their RAG quality is bottlenecked by the input pipeline once they actually measure — which is why Markdown conversion has such an outsized impact.
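Here's a minimal top-K retrieval check, sketched against the `store` built in the LangChain example. The eval-set format and the `source` metadata field are assumptions; you need to attach a source identifier to each chunk at indexing time for this to work.

```python
# Hypothetical eval set: question plus the file the correct answer lives in
eval_set = [
    {"question": "What is the main contribution of this paper?", "source": "paper.md"},
    # ... 20-50 hand-written entries from your domain
]

def top_k_hit_rate(retriever, eval_set, k=5):
    """Fraction of questions whose correct source shows up in the top-k retrieved chunks."""
    hits = 0
    for item in eval_set:
        docs = retriever.invoke(item["question"])[:k]
        if any(item["source"] in d.metadata.get("source", "") for d in docs):
            hits += 1
    return hits / len(eval_set)

retriever = store.as_retriever(search_kwargs={"k": 5})
print(f"top-5 hit rate: {top_k_hit_rate(retriever, eval_set):.0%}")
```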
What can go wrong
Tables that flatten
Even with Markdown conversion, complex tables sometimes flatten into prose. For table-critical RAG (financial data, scientific tables), spot-check the converted Markdown and consider supplementing with our PDF to CSV tool for table-only extraction stored separately.
OCR noise on scans
Scanned PDFs go through OCR, which has its own error rate. For high-stakes content, sample-check OCR'd chunks for accuracy. Some converters flag low-confidence regions in their output; those flags are a good place to start the review.
Chunk size mistakes
Too small (200 tokens): chunks lose context, retrieval surfaces fragments. Too large (3000 tokens): retrieval is too coarse, the LLM has to filter unwanted content during synthesis. Sweet spot for most domains: 600-1000 tokens with 50-150 overlap.
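To enforce that sweet spot in actual tokens rather than characters, LangChain's splitter can measure length with `tiktoken`:

```python
from langchain_text_splitters import RecursiveCharacterTextSplitter

char_splitter = RecursiveCharacterTextSplitter.from_tiktoken_encoder(
    encoding_name="cl100k_base",  # match this to your embedding model
    chunk_size=800,               # tokens, inside the 600-1000 sweet spot
    chunk_overlap=100,            # token overlap, inside the 50-150 range
)
```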