Better chunks, better RAG
Retrieval-augmented generation is only as good as your chunking strategy. Naive chunkers split mid-sentence and destroy context; size-only chunkers ignore document structure. Ours splits along semantic boundaries — paragraphs, sections, headings — while honoring your token budget and overlap settings.
You get back a list of chunks ready to embed, each tagged with its source position so you can reconstruct context for the LLM at query time. Use it to prepare data for OpenAI, Anthropic, Cohere, or any other embedding model.
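The core idea can be sketched in a few lines. This is an illustrative simplification, not the tool's implementation: it packs whole paragraphs into chunks under a token budget and tags each chunk with its paragraph range, using a whitespace word count as a stand-in for a real model tokenizer. The function name, field names, and default budget are all hypothetical.

```python
def chunk_paragraphs(paragraphs, max_tokens=100):
    """Pack paragraphs into chunks under max_tokens, never splitting a
    paragraph, and record each chunk's source position (paragraph range)."""
    chunks, current, count, start = [], [], 0, 0
    for i, para in enumerate(paragraphs):
        n = len(para.split())  # crude token proxy; a real tool uses a tokenizer
        if current and count + n > max_tokens:
            # flush the current chunk before it would exceed the budget
            chunks.append({"text": "\n\n".join(current), "start": start, "end": i - 1})
            current, count, start = [], 0, i
        current.append(para)
        count += n
    if current:
        chunks.append({"text": "\n\n".join(current), "start": start,
                       "end": len(paragraphs) - 1})
    return chunks
```

Because each chunk carries its paragraph range, you can fetch neighboring paragraphs at query time to rebuild context around a retrieved chunk.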
Chunking strategies
- Token-based with configurable model tokenizer (GPT-4, Claude, Llama)
- Recursive character splitting that respects newlines, sentences, words
- Markdown-aware splitting along headings, lists, and code blocks
- Configurable overlap (in tokens or percentage)
- Min and max chunk size with smart merging of small tail chunks
- Optional metadata per chunk (source file, heading path, position)
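To make the recursive strategy concrete, here is a hedged sketch of the pattern: try coarse separators first (blank lines), and fall back to finer ones (newlines, sentence ends, words) only for pieces that are still too large. The separator order and the character-based size limit are illustrative choices, not the tool's actual defaults.

```python
SEPARATORS = ["\n\n", "\n", ". ", " "]  # coarse to fine

def recursive_split(text, max_chars=200, seps=SEPARATORS):
    """Split text under max_chars, preferring the coarsest separator
    that produces pieces small enough to pack."""
    if len(text) <= max_chars or not seps:
        return [text]
    sep, rest = seps[0], seps[1:]
    pieces = text.split(sep)
    if len(pieces) == 1:  # separator not present: try a finer one
        return recursive_split(text, max_chars, rest)
    chunks, buf = [], ""
    for piece in pieces:
        candidate = buf + sep + piece if buf else piece
        if len(candidate) <= max_chars:
            buf = candidate  # keep packing into the current chunk
        else:
            if buf:
                chunks.append(buf)
            if len(piece) > max_chars:
                # a single piece can still be too big: recurse with finer separators
                chunks.extend(recursive_split(piece, max_chars, rest))
                buf = ""
            else:
                buf = piece
    if buf:
        chunks.append(buf)
    return chunks
```

Packing adjacent small pieces before recursing is what keeps sentences and words intact whenever the budget allows.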
Export as JSON, JSONL, CSV, or one Markdown file per chunk. The output drops straight into LangChain, LlamaIndex, or your custom pipeline.
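The JSONL export, one JSON object per line, is the shape most ingestion pipelines expect. A minimal sketch, assuming chunks are dicts with a `text` field and optional `metadata` (the helper name and record layout are illustrative):

```python
import json

def export_jsonl(chunks, path):
    """Write one JSON object per chunk, flattening metadata into the record."""
    with open(path, "w", encoding="utf-8") as f:
        for i, chunk in enumerate(chunks):
            record = {"id": i, "text": chunk["text"], **chunk.get("metadata", {})}
            f.write(json.dumps(record, ensure_ascii=False) + "\n")
```

Each line is then a self-contained record you can stream into an embedding job or load with a document loader.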