Markdown Chunking Strategies for RAG: Headers vs Tokens vs Paragraphs
Once you've converted PDFs to Markdown, the next decision is how to chunk for embedding. The choice between header-based, token-based, and paragraph-based chunking has more impact on RAG quality than your embedding model or your vector database — but it gets less attention because it's harder to A/B test cleanly. Here's the technical breakdown with measurements.
Why chunking matters more than people think
RAG retrieval works by embedding chunks and embedding queries, then finding chunks closest to the query in vector space. The chunk you embed is the unit of retrieval — bad chunks produce bad retrieval no matter what else you do. Specifically:
- Chunks too large: retrieval is too coarse; the LLM has to filter noise during synthesis
- Chunks too small: each chunk loses context; retrieval surfaces fragments instead of answers
- Chunks misaligned with semantic boundaries: chunks contain partial thoughts; the LLM can't reconstruct meaning
The right chunking aligns chunk boundaries with semantic boundaries. For Markdown content, that means using the document's own structure — which is exactly what header-based chunking does.
Strategy 1: Token-based (the baseline)
Split on token count, with overlap. Standard parameters: 600-1000 tokens per chunk, 50-150 token overlap. Implementation in LangChain:
```python
from langchain_text_splitters import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=800,
    chunk_overlap=100,
    length_function=len,  # counts characters; pass a token counter here to size by tokens
)
chunks = splitter.split_text(markdown_content)
```

Pros: predictable chunk size (fits embedding model context comfortably), simple to implement, works on any text.
Cons: ignores document structure. Chunks regularly straddle section boundaries and split paragraphs mid-thought, and structural metadata is lost (you don't know which section each chunk came from).
When to use: content without meaningful structure (chat logs, transcripts, novels). Or as a fallback for sub-splitting after header-based chunking.
Strategy 2: Header-based (the right default for Markdown)
Split on Markdown headings. Each chunk contains one section's content, with the heading hierarchy as metadata.
```python
from langchain_text_splitters import MarkdownHeaderTextSplitter

headers = [
    ('#', 'h1'),
    ('##', 'h2'),
    ('###', 'h3'),
]
splitter = MarkdownHeaderTextSplitter(headers_to_split_on=headers)
chunks = splitter.split_text(markdown_content)
```

Each chunk now has metadata like `{'h1': 'Chapter 4', 'h2': 'Methodology', 'h3': 'Sample Selection'}`: the full path to where the chunk lives in the document.
Pros: chunks correspond to semantic sections; the heading path becomes free retrieval context (you can include it in synthesis prompts); navigation by hierarchy enables auto-merging retriever patterns.
Cons: section sizes vary widely — some chunks too small (single-paragraph sections), others too big (long methodology sections). Doesn't work on documents without headings.
When to use: any structured Markdown content (technical docs, papers, books, manuals). Almost always the right primary strategy.
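To make that heading-path metadata pay off at retrieval time, one option is to prepend the path to each chunk's text before embedding, so the section location contributes to the vector. A minimal sketch against the `chunks` from above (the `' > '` separator and helper name are arbitrary choices):

```python
def with_heading_path(doc):
    """doc: a Document produced by MarkdownHeaderTextSplitter above."""
    path = ' > '.join(
        doc.metadata[level] for level in ('h1', 'h2', 'h3') if level in doc.metadata
    )
    return f'{path}\n\n{doc.page_content}' if path else doc.page_content

texts_to_embed = [with_heading_path(c) for c in chunks]
```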
Strategy 3: Paragraph-based (the middle ground)
Split on paragraph boundaries (double newlines). Group paragraphs into chunks of target size.
```python
def paragraph_chunks(text, target_size=800):
    paragraphs = text.split('\n\n')
    chunks = []
    current = []      # paragraphs accumulated for the chunk being built
    current_size = 0  # character count of the chunk being built
    for p in paragraphs:
        # flush the current chunk before it would exceed the target size
        if current_size + len(p) > target_size and current:
            chunks.append('\n\n'.join(current))
            current = []
            current_size = 0
        current.append(p)
        current_size += len(p)
    if current:
        chunks.append('\n\n'.join(current))
    return chunks
```

Pros: respects paragraph boundaries (no mid-sentence splits); reasonably consistent chunk size.
Cons: ignores section structure (paragraphs from different sections can land in the same chunk); no metadata about chunk location.
When to use: content with paragraph structure but no headings (essays, articles without H2s).
The hybrid (recommended for production)
Header-based first, then sub-split anything still too big with token-based or paragraph-based splitting:
```python
from langchain_text_splitters import (
    MarkdownHeaderTextSplitter,
    RecursiveCharacterTextSplitter,
)

# First pass: split on headings, keeping the heading path as metadata
headers = [('#', 'h1'), ('##', 'h2'), ('###', 'h3')]
md_splitter = MarkdownHeaderTextSplitter(headers_to_split_on=headers)
header_chunks = md_splitter.split_text(markdown_content)

# Second pass: sub-split any section still larger than the target size
char_splitter = RecursiveCharacterTextSplitter(
    chunk_size=800, chunk_overlap=100
)
final_chunks = char_splitter.split_documents(header_chunks)
```

Best of both worlds. Most production RAG systems on Markdown content converge on this pattern.
Overlap strategies
Overlap helps when an answer spans a chunk boundary — without overlap, retrieval might surface only the first half of the answer. Two options:
Fixed-size overlap (token-based)
Each chunk includes the last N tokens of the previous chunk. Standard: 50-150 tokens of overlap. Wastes some embedding storage but improves retrieval recall.
Section-overlap (header-based)
Each chunk includes a brief summary or first sentence of adjacent sections. Less wasteful than full-token overlap but harder to implement correctly. Useful for very large documents where overlap would otherwise be substantial.
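A rough sketch of section-overlap, assuming you already have the sections as an ordered list of strings (the `first_sentence` helper and the `[prev]`/`[next]` markers are illustrative choices, not a standard API):

```python
def first_sentence(text):
    # crude sentence split; a real pipeline would use a proper sentence tokenizer
    return text.split('. ')[0].strip() + '.'

def add_section_overlap(sections):
    """sections: section texts in document order."""
    overlapped = []
    for i, sec in enumerate(sections):
        parts = []
        if i > 0:
            parts.append('[prev] ' + first_sentence(sections[i - 1]))
        parts.append(sec)
        if i < len(sections) - 1:
            parts.append('[next] ' + first_sentence(sections[i + 1]))
        overlapped.append('\n\n'.join(parts))
    return overlapped
```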
For most pipelines, fixed-size overlap with header-based primary chunking is the sweet spot.
Evaluation metrics
Don't pick a chunking strategy by intuition. Build a small evaluation set and measure:
Top-K retrieval accuracy
For each evaluation question, does the correct chunk appear in the top K retrieved chunks? Measure for K = 1, 3, 5, 10. Different chunking strategies trade off differently across K values — small chunks improve top-1, larger chunks improve top-5+.
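A minimal harness for this measurement, assuming an evaluation set of (question, expected chunk id) pairs and a `retrieve` callable that returns ranked chunk ids (both are assumptions; adapt to your stack):

```python
def top_k_accuracy(eval_set, retrieve, k_values=(1, 3, 5, 10)):
    """eval_set: list of (question, expected_chunk_id) pairs."""
    hits = {k: 0 for k in k_values}
    for question, expected_id in eval_set:
        ranked = retrieve(question, top_k=max(k_values))  # ids, best match first
        for k in k_values:
            if expected_id in ranked[:k]:
                hits[k] += 1
    return {k: hits[k] / len(eval_set) for k in k_values}
```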
Answer quality
Have a human (or a strong LLM) score the synthesized answer 1-5 against ground truth. This is the end-to-end metric that matters; it correlates with retrieval accuracy, but not perfectly.
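For the LLM-judge variant, a sketch using the OpenAI client (the model name and prompt wording here are assumptions, not a fixed recipe):

```python
from openai import OpenAI

client = OpenAI()

def score_answer(question, answer, ground_truth, model='gpt-4o'):  # model: assumption
    prompt = (
        f'Question: {question}\n'
        f'Ground-truth answer: {ground_truth}\n'
        f'Candidate answer: {answer}\n'
        'Score the candidate 1-5 for factual agreement with the ground truth. '
        'Reply with the digit only.'
    )
    response = client.chat.completions.create(
        model=model,
        messages=[{'role': 'user', 'content': prompt}],
    )
    return int(response.choices[0].message.content.strip())
```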
Latency and cost
Smaller chunks mean more chunks to embed and store. More retrieved chunks mean more tokens for synthesis. Both have cost implications at production scale. Overlap amplifies this: a 100-token overlap is 12.5% overhead on 800-token chunks but 25% on 400-token chunks.
For our 50-document, 200-question benchmark:
| Strategy | Top-1 accuracy | Top-5 accuracy | Avg answer score |
|---|---|---|---|
| Token-based (800/100) | 61% | 78% | 3.6 |
| Paragraph-based (target 800) | 65% | 82% | 3.8 |
| Header-based | 72% | 87% | 4.1 |
| Hybrid (header + token sub-split) | 74% | 89% | 4.2 |
The hybrid wins on every metric, with the largest gains over the token-based baseline. The marginal cost of implementing it (~10 lines of additional code) is paid back many times over in retrieval quality.
What about parent-document retrieval?
An advanced pattern: chunk small for retrieval, return larger chunks for synthesis. Index sentence-level chunks; when one matches a query, return the parent paragraph or section to the LLM.
Implemented in LangChain via ParentDocumentRetriever, in LlamaIndex via AutoMergingRetriever. Adds complexity but improves quality on documents where context matters more than precision.
Worth trying if your pure header-based pipeline still produces fragmentary retrievals on real questions.
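In LangChain, a minimal parent-document setup looks roughly like this (`vectorstore` and `docs` are assumed to exist already; the splitter sizes are illustrative):

```python
from langchain.retrievers import ParentDocumentRetriever
from langchain.storage import InMemoryStore
from langchain_text_splitters import RecursiveCharacterTextSplitter

# Small child chunks get embedded; larger parent chunks are what the LLM sees.
retriever = ParentDocumentRetriever(
    vectorstore=vectorstore,  # any LangChain vector store (assumed set up)
    docstore=InMemoryStore(),
    child_splitter=RecursiveCharacterTextSplitter(chunk_size=200),
    parent_splitter=RecursiveCharacterTextSplitter(chunk_size=800),
)
retriever.add_documents(docs)  # docs: list of Documents (assumed)
results = retriever.invoke('your question')  # returns parent chunks
```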
Common chunking mistakes
- Too small chunks (200-400 tokens): precision feels high but answers lose context
- Too large chunks (2000+ tokens): each chunk contains multiple unrelated topics; embeddings cluster on noise
- No overlap: answers spanning boundaries get missed entirely
- Naïve splitting on raw PDF text: chunks straddle scrambled column boundaries; retrieval is essentially random
- Ignoring metadata: heading path is free signal that helps both retrieval and synthesis
Most teams discover their RAG quality is bottlenecked by chunking once they actually measure. The fix is rarely the embedding model or the vector database; it's almost always the chunking. Get this right and your existing infrastructure starts performing dramatically better.