Markdown Chunking Strategies for RAG: Headers vs Tokens vs Paragraphs
Once you've converted PDFs to Markdown, the next decision is how to chunk for embedding. The choice between header-based, token-based, and paragraph-based chunking has more impact on RAG quality than your embedding model or your vector database — but it gets less attention because it's harder to A/B test cleanly. Here's the technical breakdown with measurements.
Why chunking matters more than people think
RAG retrieval works by embedding chunks and embedding queries, then finding chunks closest to the query in vector space. The chunk you embed is the unit of retrieval — bad chunks produce bad retrieval no matter what else you do. Specifically:
- Chunks too large: retrieval is too coarse; the LLM has to filter noise during synthesis
- Chunks too small: each chunk loses context; retrieval surfaces fragments instead of answers
- Chunks misaligned with semantic boundaries: chunks contain partial thoughts; the LLM can't reconstruct meaning
The right chunking aligns chunk boundaries with semantic boundaries. For Markdown content, that means using the document's own structure — which is exactly what header-based chunking does.
Strategy 1: Token-based (the baseline)
Split on token count, with overlap. Standard parameters: 600-1000 tokens per chunk, 50-150 token overlap. Implementation in LangChain:
```python
from langchain_text_splitters import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=800,
    chunk_overlap=100,
    length_function=len,  # counts characters; pass a token counter here to size by tokens
)
chunks = splitter.split_text(markdown_content)
```

Pros: predictable chunk size (fits embedding model context comfortably), simple to implement, works on any text.
Cons: ignores document structure. Chunks regularly straddle section boundaries and split paragraphs mid-thought, and structural metadata is lost (you don't know which section each chunk came from).
When to use: content without meaningful structure (chat logs, transcripts, novels). Or as a fallback for sub-splitting after header-based chunking.
Strategy 2: Header-based (the right default for Markdown)
Split on Markdown headings. Each chunk contains one section's content, with the heading hierarchy as metadata.
```python
from langchain_text_splitters import MarkdownHeaderTextSplitter

headers = [
    ('#', 'h1'),
    ('##', 'h2'),
    ('###', 'h3'),
]
splitter = MarkdownHeaderTextSplitter(headers_to_split_on=headers)
chunks = splitter.split_text(markdown_content)
```

Each chunk now has metadata like `{'h1': 'Chapter 4', 'h2': 'Methodology', 'h3': 'Sample Selection'}`: the full path to where the chunk lives in the document.
Pros: chunks correspond to semantic sections; the heading path becomes free retrieval context (you can include it in synthesis prompts); navigation by hierarchy enables auto-merging retriever patterns.
Cons: section sizes vary widely — some chunks too small (single-paragraph sections), others too big (long methodology sections). Doesn't work on documents without headings.
When to use: any structured Markdown content (technical docs, papers, books, manuals). Almost always the right primary strategy.
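To make that heading-path metadata pay off at retrieval time, one option is to prepend the path to each chunk's text before embedding, so the section location contributes to the vector. A minimal sketch against the `chunks` from above (the `' > '` separator and helper name are arbitrary choices):

```python
def with_heading_path(doc):
    """doc: a Document produced by MarkdownHeaderTextSplitter above."""
    path = ' > '.join(
        doc.metadata[level] for level in ('h1', 'h2', 'h3') if level in doc.metadata
    )
    return f'{path}\n\n{doc.page_content}' if path else doc.page_content

texts_to_embed = [with_heading_path(c) for c in chunks]
```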
Strategy 3: Paragraph-based (the middle ground)
Split on paragraph boundaries (double newlines). Group paragraphs into chunks of target size.
```python
def paragraph_chunks(text, target_size=800):
    paragraphs = text.split('\n\n')
    chunks = []
    current = []      # paragraphs accumulated for the chunk being built
    current_size = 0  # character count of the chunk being built
    for p in paragraphs:
        # flush the current chunk before it would exceed the target size
        if current_size + len(p) > target_size and current:
            chunks.append('\n\n'.join(current))
            current = []
            current_size = 0
        current.append(p)
        current_size += len(p)
    if current:
        chunks.append('\n\n'.join(current))
    return chunks
```

Pros: respects paragraph boundaries (no mid-sentence splits); reasonably consistent chunk size.
Cons: ignores section structure (paragraphs from different sections can land in the same chunk); no metadata about chunk location.
When to use: content with paragraph structure but no headings (essays, articles without H2s).
The hybrid (recommended for production)
Header-based first, then sub-split anything still too big with token-based or paragraph-based splitting:
```python
from langchain_text_splitters import (
    MarkdownHeaderTextSplitter,
    RecursiveCharacterTextSplitter,
)

# First pass: split on headings, keeping the heading path as metadata
headers = [('#', 'h1'), ('##', 'h2'), ('###', 'h3')]
md_splitter = MarkdownHeaderTextSplitter(headers_to_split_on=headers)
header_chunks = md_splitter.split_text(markdown_content)

# Second pass: sub-split any section still larger than the target size
char_splitter = RecursiveCharacterTextSplitter(
    chunk_size=800, chunk_overlap=100
)
final_chunks = char_splitter.split_documents(header_chunks)
```

Best of both worlds. Most production RAG systems on Markdown content converge on this pattern.
Overlap strategies
Overlap helps when an answer spans a chunk boundary — without overlap, retrieval might surface only the first half of the answer. Two options:
Fixed-size overlap (token-based)
Each chunk includes the last N tokens of the previous chunk. Standard: 50-150 tokens of overlap. Wastes some embedding storage but improves retrieval recall.
Section-overlap (header-based)
Each chunk includes a brief summary or first sentence of adjacent sections. Less wasteful than full-token overlap but harder to implement correctly. Useful for very large documents where overlap would otherwise be substantial.
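A rough sketch of section-overlap, assuming you already have the sections as an ordered list of strings (the `first_sentence` helper and the `[prev]`/`[next]` markers are illustrative choices, not a standard API):

```python
def first_sentence(text):
    # crude sentence split; a real pipeline would use a proper sentence tokenizer
    return text.split('. ')[0].strip() + '.'

def add_section_overlap(sections):
    """sections: section texts in document order."""
    overlapped = []
    for i, sec in enumerate(sections):
        parts = []
        if i > 0:
            parts.append('[prev] ' + first_sentence(sections[i - 1]))
        parts.append(sec)
        if i < len(sections) - 1:
            parts.append('[next] ' + first_sentence(sections[i + 1]))
        overlapped.append('\n\n'.join(parts))
    return overlapped
```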
For most pipelines, fixed-size overlap with header-based primary chunking is the sweet spot.
Evaluation metrics
Don't pick a chunking strategy by intuition. Build a small evaluation set and measure:
Top-K retrieval accuracy
For each evaluation question, does the correct chunk appear in the top K retrieved chunks? Measure for K = 1, 3, 5, 10. Different chunking strategies trade off differently across K values — small chunks improve top-1, larger chunks improve top-5+.
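A minimal harness for this measurement, assuming an evaluation set of (question, expected chunk id) pairs and a `retrieve` callable that returns ranked chunk ids (both are assumptions; adapt to your stack):

```python
def top_k_accuracy(eval_set, retrieve, k_values=(1, 3, 5, 10)):
    """eval_set: list of (question, expected_chunk_id) pairs."""
    hits = {k: 0 for k in k_values}
    for question, expected_id in eval_set:
        ranked = retrieve(question, top_k=max(k_values))  # ids, best match first
        for k in k_values:
            if expected_id in ranked[:k]:
                hits[k] += 1
    return {k: hits[k] / len(eval_set) for k in k_values}
```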
Answer quality
Have a human (or a strong LLM) score the synthesized answer 1-5 against ground truth. This is the end-to-end metric that matters; it correlates with retrieval accuracy, but not perfectly.
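For the LLM-judge variant, a sketch using the OpenAI client (the model name and prompt wording here are assumptions, not a fixed recipe):

```python
from openai import OpenAI

client = OpenAI()

def score_answer(question, answer, ground_truth, model='gpt-4o'):  # model: assumption
    prompt = (
        f'Question: {question}\n'
        f'Ground-truth answer: {ground_truth}\n'
        f'Candidate answer: {answer}\n'
        'Score the candidate 1-5 for factual agreement with the ground truth. '
        'Reply with the digit only.'
    )
    response = client.chat.completions.create(
        model=model,
        messages=[{'role': 'user', 'content': prompt}],
    )
    return int(response.choices[0].message.content.strip())
```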
Latency and cost
Smaller chunks mean more chunks to embed and store. More retrieved chunks mean more tokens for synthesis. Both have cost implications at production scale. Overlap amplifies this: a 100-token overlap is 12.5% overhead on 800-token chunks but 25% on 400-token chunks.
For our 50-document, 200-question benchmark:
| Strategy | Top-1 accuracy | Top-5 accuracy | Avg answer score |
|---|---|---|---|
| Token-based (800/100) | 61% | 78% | 3.6 |
| Paragraph-based (target 800) | 65% | 82% | 3.8 |
| Header-based | 72% | 87% | 4.1 |
| Hybrid (header + token sub-split) | 74% | 89% | 4.2 |
The hybrid wins on every metric, with the largest gains over the token-based baseline. The marginal cost of implementing it (~10 lines of additional code) is paid back many times over in retrieval quality.
What about parent-document retrieval?
An advanced pattern: chunk small for retrieval, return larger chunks for synthesis. Index sentence-level chunks; when one matches a query, return the parent paragraph or section to the LLM.
Implemented in LangChain via ParentDocumentRetriever, in LlamaIndex via AutoMergingRetriever. Adds complexity but improves quality on documents where context matters more than precision.
Worth trying if your pure header-based pipeline still produces fragmentary retrievals on real questions.
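In LangChain, a minimal parent-document setup looks roughly like this (`vectorstore` and `docs` are assumed to exist already; the splitter sizes are illustrative):

```python
from langchain.retrievers import ParentDocumentRetriever
from langchain.storage import InMemoryStore
from langchain_text_splitters import RecursiveCharacterTextSplitter

# Small child chunks get embedded; larger parent chunks are what the LLM sees.
retriever = ParentDocumentRetriever(
    vectorstore=vectorstore,  # any LangChain vector store (assumed set up)
    docstore=InMemoryStore(),
    child_splitter=RecursiveCharacterTextSplitter(chunk_size=200),
    parent_splitter=RecursiveCharacterTextSplitter(chunk_size=800),
)
retriever.add_documents(docs)  # docs: list of Documents (assumed)
results = retriever.invoke('your question')  # returns parent chunks
```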
Common chunking mistakes
- Too small chunks (200-400 tokens): precision feels high but answers lose context
- Too large chunks (2000+ tokens): each chunk contains multiple unrelated topics; embeddings cluster on noise
- No overlap: answers spanning boundaries get missed entirely
- Naïve splitting on raw PDF text: chunks straddle scrambled column boundaries; retrieval is essentially random
- Ignoring metadata: heading path is free signal that helps both retrieval and synthesis
Most teams discover their RAG quality is bottlenecked by chunking once they actually measure. The fix is rarely the embedding model or the vector database; it's almost always the chunking. Get this right and your existing infrastructure starts performing dramatically better.