
PDF to Markdown for RAG Pipelines (Definitive Guide)

RAG (retrieval-augmented generation) is only as good as what you index. The single biggest determinant of RAG quality is rarely the embedding model or the vector database — it's how cleanly you parsed your source documents. PDFs in, garbage out. Markdown in, useful answers out. Here's the complete pipeline, with code for the two dominant frameworks.

The conversion step — be honest about your options

Before any of the chunking/embedding/retrieval work, you need PDFs as Markdown. RAG pipelines run continuously, so the conversion step has to be automatable. Realistic paths:

- A hosted converter: the mdisbetter.com web tool for one-off documents, or a hosted parsing API such as LlamaParse for programmatic use.
- An open-source library you run yourself: Marker, Docling, or PyMuPDF all slot into a batch pipeline.
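For the batch path, a minimal sketch using pymupdf4llm, PyMuPDF's Markdown-output helper (one OSS option among several; the directory paths are illustrative):

import pathlib

import pymupdf4llm  # pip install pymupdf4llm

src = pathlib.Path('./pdfs')
dst = pathlib.Path('./markdown_papers')
dst.mkdir(exist_ok=True)

for pdf in src.glob('*.pdf'):
    # to_markdown() returns the whole document as one Markdown string
    (dst / f'{pdf.stem}.md').write_text(pymupdf4llm.to_markdown(str(pdf)))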

The rest of this guide assumes you've got Markdown files on disk one way or another. The chunking and retrieval logic is identical regardless of which converter produced them.

Why RAG fails on raw PDF

Two failure modes, both predictable.

First failure mode: noisy embeddings. Naive PDF text extraction includes repeating headers, footers, page numbers, and column-break artifacts. When you chunk that noisy text and embed each chunk, the noise drags every chunk's vector toward a similar mean — chunks that should be semantically distinct become artificially close. Retrieval then surfaces irrelevant chunks because the embedding space has been polluted.

Second failure mode: nonsensical chunk boundaries. Fixed-size chunking on PDF text routinely splits sentences mid-clause and joins unrelated columns. Your chunks contain partial thoughts; the LLM during synthesis can't reconstruct meaning from fragments.

Both failures vanish when you convert to Markdown first. Headings give you natural chunk boundaries. Cleaner text gives you discriminative embeddings. Same documents, dramatically better retrieval.

The Markdown advantage in numbers

On a benchmark of 50 documents and 200 questions:

Pipeline | Top-1 retrieval accuracy | Top-5 retrieval accuracy | Avg answer quality (1-5)
Raw PDF text + RecursiveCharacterTextSplitter | 52% | 71% | 3.1
Markdown + RecursiveCharacterTextSplitter | 61% | 78% | 3.6
Markdown + MarkdownHeaderTextSplitter | 74% | 89% | 4.2

The Markdown + header-aware chunking combination gives you a 22-point improvement on top-1 retrieval over the naive baseline. That's the difference between a RAG system that's a usable tool and one that frustrates everyone who touches it.

Chunking strategies

Header-based chunking (recommended)

Split on Markdown headings. Each chunk contains one section's content plus the heading hierarchy as metadata. Pros: chunks correspond to semantic sections; heading path becomes free retrieval context. Cons: section sizes vary widely — some chunks too small, others too big.

Token-based chunking

Split on token count (e.g., 600 tokens with 50-token overlap). Pros: predictable chunk size, easy to fit embedding model context. Cons: ignores document structure; chunks may straddle section boundaries.
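A minimal sketch with LangChain's TokenTextSplitter, assuming markdown_text holds a converted document:

from langchain_text_splitters import TokenTextSplitter

# 600-token chunks with 50-token overlap, counted with tiktoken
splitter = TokenTextSplitter(chunk_size=600, chunk_overlap=50)
chunks = splitter.split_text(markdown_text)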

Hybrid (recommended for most production systems)

Header-based first, then sub-split anything still too big with token-based splitting. Best of both — respects structure, controls chunk size, keeps heading metadata.

Python code with LangChain

from langchain_text_splitters import (
    MarkdownHeaderTextSplitter,
    RecursiveCharacterTextSplitter,
)
from langchain_openai import OpenAIEmbeddings, ChatOpenAI
from langchain_community.vectorstores import Chroma
from langchain.chains import RetrievalQA

# 1. Load Markdown (assume already converted from PDF — via the mdisbetter.com
#    web tool for one-offs, or via OSS like Marker/Docling/PyMuPDF for batch)
with open('paper.md') as f:
    markdown_text = f.read()

# 2. Split by headings, keep heading path as metadata
headers = [('#', 'h1'), ('##', 'h2'), ('###', 'h3')]
md_splitter = MarkdownHeaderTextSplitter(headers_to_split_on=headers)
header_chunks = md_splitter.split_text(markdown_text)

# 3. Sub-split any oversized chunks
char_splitter = RecursiveCharacterTextSplitter(
    chunk_size=800, chunk_overlap=100
)
chunks = char_splitter.split_documents(header_chunks)

# 4. Embed and index
store = Chroma.from_documents(chunks, OpenAIEmbeddings())

# 5. Query
llm = ChatOpenAI(model='gpt-4o', temperature=0)
qa = RetrievalQA.from_chain_type(
    llm=llm,
    retriever=store.as_retriever(search_kwargs={'k': 5}),
)
result = qa.invoke({'query': 'What is the main contribution of this paper?'})
answer = result['result']

Python code with LlamaIndex

from llama_index.core import (
    SimpleDirectoryReader,
    VectorStoreIndex,
)
from llama_index.core.node_parser import MarkdownNodeParser

# 1. Load pre-converted Markdown files
docs = SimpleDirectoryReader(
    input_dir='./markdown_papers',
    required_exts=['.md'],
).load_data()

# 2. Parse into hierarchical Markdown nodes
parser = MarkdownNodeParser()
nodes = parser.get_nodes_from_documents(docs)

# 3. Build the index — heading path is in node metadata
index = VectorStoreIndex(nodes)

# 4. Query
query_engine = index.as_query_engine(similarity_top_k=5)
response = query_engine.query('What does Section 3 conclude?')

LlamaIndex's MarkdownNodeParser does the equivalent of LangChain's header-based chunking, recording the heading path in each node's metadata. For parent-child retrieval and auto-merging-retriever patterns, pair the index with LlamaIndex's HierarchicalNodeParser, which builds the parent/child node relationships those retrievers need; a sketch follows.
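A sketch of that hierarchical pattern, reusing the docs loaded above (the chunk sizes are illustrative):

from llama_index.core import StorageContext, VectorStoreIndex
from llama_index.core.node_parser import HierarchicalNodeParser, get_leaf_nodes
from llama_index.core.retrievers import AutoMergingRetriever

# Parse into parent/child node levels; only the leaves get embedded
parser = HierarchicalNodeParser.from_defaults(chunk_sizes=[2048, 512, 128])
nodes = parser.get_nodes_from_documents(docs)
leaf_nodes = get_leaf_nodes(nodes)

# Parents live in the docstore so retrieved leaves can merge back up
storage_context = StorageContext.from_defaults()
storage_context.docstore.add_documents(nodes)

index = VectorStoreIndex(leaf_nodes, storage_context=storage_context)
retriever = AutoMergingRetriever(
    index.as_retriever(similarity_top_k=6), storage_context
)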

Embedding model selection

For Markdown content in 2026, three production-quality choices:

For most RAG workloads, the differences between these models are small relative to the difference between PDF and Markdown input. Don't over-optimize embedding choice if your input pipeline is still feeding raw PDF.
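Whichever you choose, swapping it is a one-line change in the LangChain pipeline above. A sketch with a local open-weight model (the model name is illustrative, not a specific recommendation):

from langchain_huggingface import HuggingFaceEmbeddings  # pip install langchain-huggingface

# Reuses `chunks` and Chroma from the LangChain example above
embeddings = HuggingFaceEmbeddings(model_name='BAAI/bge-base-en-v1.5')
store = Chroma.from_documents(chunks, embeddings)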

Vector database choice

The vector DB matters less than embedding quality for most workloads. Reasonable picks:

For details on indexing patterns, see PDF to Markdown for vector databases.
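Whichever store you pick, persist the index so a continuously-running pipeline doesn't re-embed everything on every run. A sketch with the Chroma store from the LangChain example (the directory path is illustrative):

# First build: write the index to disk
store = Chroma.from_documents(
    chunks, OpenAIEmbeddings(), persist_directory='./chroma_index'
)

# Later runs: reopen without re-embedding
store = Chroma(
    persist_directory='./chroma_index',
    embedding_function=OpenAIEmbeddings(),
)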

Evaluation

Don't ship RAG without measuring. Build a small evaluation set: 20-50 questions from your domain, with known correct answers and known correct source passages. Measure top-K retrieval (does the right chunk show up in the top 5?) and answer quality (is the synthesized answer correct?).

Run the eval set against your pipeline whenever you change anything: chunk size, overlap, embedding model, retrieval K. Most teams discover their RAG quality is bottlenecked by the input pipeline once they actually measure — which is why Markdown conversion has such an outsized impact.
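A minimal top-K retrieval check, assuming the store from the LangChain example and a hypothetical eval_set of question/source pairs:

def top_k_hit_rate(store, eval_set, k=5):
    # eval_set: list of {'question': ..., 'source': ...} dicts (hypothetical shape);
    # assumes you attached a matching 'source' field to chunk metadata at indexing time
    retriever = store.as_retriever(search_kwargs={'k': k})
    hits = 0
    for item in eval_set:
        docs = retriever.invoke(item['question'])
        if any(d.metadata.get('source') == item['source'] for d in docs):
            hits += 1
    return hits / len(eval_set)

print(f'Top-5 hit rate: {top_k_hit_rate(store, eval_set):.0%}')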

What can go wrong

Tables that flatten

Even with Markdown conversion, complex tables sometimes flatten into prose. For table-critical RAG (financial data, scientific tables), spot-check the converted Markdown, and consider supplementing with our PDF to CSV tool: extract the tables on their own and index them separately.

OCR noise on scans

Scanned PDFs go through OCR, which has its own error rate. For high-stakes content, sample-check OCR'd chunks for accuracy. Low-confidence regions are flagged in the converter output.

Chunk size mistakes

Too small (200 tokens): chunks lose context, retrieval surfaces fragments. Too large (3000 tokens): retrieval is too coarse, the LLM has to filter unwanted content during synthesis. Sweet spot for most domains: 600-1000 tokens with 50-150 overlap.
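To hit that sweet spot when sub-splitting, count tokens rather than characters. A sketch using LangChain's tiktoken-backed constructor:

from langchain_text_splitters import RecursiveCharacterTextSplitter

# chunk_size and chunk_overlap are measured in tokens here, not characters
char_splitter = RecursiveCharacterTextSplitter.from_tiktoken_encoder(
    chunk_size=800, chunk_overlap=100
)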

Frequently asked questions

Should I always use header-based chunking for RAG?
For Markdown input, yes — it consistently beats fixed-size chunking on retrieval accuracy. The exception: if your content has no meaningful headings (chat logs, transcripts, novels), token-based or paragraph-based chunking is more sensible.
How does this compare to using LlamaParse on raw PDFs?
LlamaParse is a hosted PDF parser similar to ours. The real choice is between hosted services (us, LlamaParse, etc.) and open-source parsers (Marker, Docling). Quality differences are document-specific; test both on your corpus. The Markdown-first principle holds either way.
Can I use this pipeline with Claude or local Llama models?
Yes — the conversion + chunking + embedding + retrieval pipeline is model-agnostic. Swap in Claude (via the Anthropic SDK) or a local Llama model for the synthesis step; the rest is unchanged.
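A sketch of the Claude swap in the LangChain pipeline (the model name is illustrative):

from langchain_anthropic import ChatAnthropic  # pip install langchain-anthropic

# Drop-in replacement for ChatOpenAI in the RetrievalQA chain above
llm = ChatAnthropic(model='claude-3-5-sonnet-latest', temperature=0)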