Chunking strategies for production RAG
Why chunking matters
In a RAG pipeline, you embed each chunk into a vector and store it in a vector database. At query time you find the K closest chunks to the question and pass them to the LLM as context.
If chunks are wrong, retrieval is wrong. If retrieval is wrong, the answer is wrong. Chunking is silent infrastructure that determines 30 to 50% of your RAG quality.
Most teams underweight it. They use the default chunker from their RAG library, get mediocre answers, and blame the LLM.
The three failure modes of naive chunking
Naive chunking splits every 500 tokens, ignoring document structure. This produces three predictable failure modes.
Context-breaking splits. A 7-step password-reset procedure gets split at step 4. The bot retrieves steps 1 to 4 and answers incompletely. The customer follows the partial instructions, fails, comes back angry.
Hierarchical signal loss. A "Returns Policy" page with "Refunds", "Exchanges", and "Damaged Items" sub-sections gets split into pieces. Each piece's parent context ("this is about returns") is gone. Retrieval finds the chunk but can't tell which sub-policy it covers.
Vector noise. Chunks under 100 tokens produce low-quality embeddings that match too liberally. Many false-positive retrievals. Chunks over 1,500 tokens dilute the semantic signal because the vector averages too many ideas.
All three are silent failures. The bot still returns answers; they're just wrong.
What good chunking looks like
Structure-aware chunking respects how the document is organized. Three principles:
- Split on natural boundaries. Paragraph breaks, list item boundaries, heading transitions, table edges. Never mid-sentence.
- Preserve hierarchical context. Each chunk carries its parent section's heading as a prefix. The chunk knows it's "Returns Policy > Damaged Items > Step 3".
- Right-sized chunks. 400 to 800 tokens. Smaller pieces fragment too much; larger pieces dilute too much.
For most production B2B SaaS workloads, these three principles cover 90% of the win. The remaining 10% comes from format-specific handling.
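The split-on-boundaries and right-sizing principles can be sketched minimally. This is an illustrative sketch, not AskVault's implementation; it assumes blank lines mark paragraph boundaries and uses whitespace word counts as a crude stand-in for a real tokenizer:

```python
def pack_paragraphs(text, max_tokens=800):
    """Pack whole paragraphs into chunks of at most max_tokens,
    never splitting mid-paragraph (let alone mid-sentence)."""
    chunks, current, count = [], [], 0
    for para in (p.strip() for p in text.split("\n\n")):
        if not para:
            continue
        n = len(para.split())  # crude token proxy
        if current and count + n > max_tokens:
            chunks.append("\n\n".join(current))
            current, count = [], 0
        current.append(para)
        count += n
    if current:
        chunks.append("\n\n".join(current))
    return chunks
```

A paragraph larger than the budget still becomes its own chunk here; a production chunker would fall back to sentence boundaries in that case.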
Chunk size and overlap
The 400-to-800-token range is empirical, not theoretical. Smaller chunks (100 to 300 tokens) produce noisy embeddings that retrieve too broadly. Larger chunks (over 1,500 tokens) dilute the semantic signal because each chunk's vector averages too many ideas.
Overlap of 50 to 100 tokens between adjacent chunks helps context survive boundaries. Example: chunk N ends with "...the policy requires manager approval." Chunk N+1 starts with "...manager approval. Approval takes 2 business days..." instead of jumping straight to a new sentence.
Without overlap, queries that hinge on cross-boundary context (rare but real) get incomplete results.
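The overlap rule reduces to a sliding window: step forward by chunk size minus overlap, so each window repeats the tail of its predecessor. A sketch, assuming the text is already tokenized into a list:

```python
def chunk_with_overlap(tokens, size=600, overlap=75):
    """Windows of `size` tokens, each starting `size - overlap`
    tokens after the previous one."""
    step = size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):  # last window reached the end
            break
    return chunks
```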
Format-specific rules
Different document types need different chunking strategies.
HTML and web pages
Strip navigation, header, footer, ads, and sidebar before chunking. These add noise that distorts embeddings. After stripping, split on heading boundaries (<h2>, <h3>).
For long articles, prefer splitting on <h2> rather than <h3> to keep chunks coherent. A 30-page article might produce 8 to 15 chunks at the <h2> level vs 50+ at the <h3> level.
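The strip-then-split step can be sketched with the standard-library HTML parser. The set of boilerplate tags here is an assumption, and real pages are much messier, so treat this as an illustration of the idea rather than a production extractor:

```python
from html.parser import HTMLParser

STRIP = {"nav", "header", "footer", "aside", "script", "style"}

class H2Splitter(HTMLParser):
    """Drop boilerplate tags; start a new section at each <h2>."""
    def __init__(self):
        super().__init__()
        self.depth = 0          # nesting depth inside stripped tags
        self.sections = [[]]    # text buffers, one per <h2> section
    def handle_starttag(self, tag, attrs):
        if tag in STRIP:
            self.depth += 1
        elif tag == "h2" and self.depth == 0:
            self.sections.append([])
    def handle_endtag(self, tag):
        if tag in STRIP and self.depth:
            self.depth -= 1
    def handle_data(self, data):
        if self.depth == 0 and data.strip():
            self.sections[-1].append(data.strip())

def split_html(html):
    parser = H2Splitter()
    parser.feed(html)
    return [" ".join(s) for s in parser.sections if s]
```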
PDFs
PDFs are the hardest format because they encode visual layout, not semantic structure. Three strategies:
- Reconstruct logical reading order. Multi-column layouts need to be re-flowed to single-column text. Footnotes need to be separated from body text. Header/footer repetition (page numbers, document title) needs to be stripped.
- Detect and preserve tables. PDFs encode tables as positioned text. A naive text extraction returns column 1 + column 2 + column 3 as a single line. Use a table detector and serialize tables to Markdown table syntax.
- Handle images. Optionally OCR embedded screenshots if they contain text relevant to retrieval (manual screenshots in technical docs often do).
Markdown
Markdown is the easiest case. Headings, lists, and code blocks are already semantic. Split on ## boundaries; keep code blocks intact (never split mid-code-block).
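A sketch of that rule: track fence state so a `##` inside a code block is never treated as a heading. The whole-line `## ` check is a simplification; a real Markdown parser handles more edge cases:

```python
def split_markdown(md):
    """Split on '## ' headings, but never inside a fenced code block."""
    sections, current, in_fence = [], [], False
    for line in md.splitlines():
        if line.startswith("```"):
            in_fence = not in_fence
        if line.startswith("## ") and not in_fence and current:
            sections.append("\n".join(current))
            current = []
        current.append(line)
    if current:
        sections.append("\n".join(current))
    return sections
```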
DOCX
DOCX preserves styles (Heading 1, Heading 2, etc.), which are direct semantic markers. Split on heading style transitions. Preserve tables as-is. Bullet lists stay together as one chunk if they're under 1,500 tokens; longer lists get split into per-item chunks.
Tables specifically
Tables are special. LLMs cite tables verbatim at much higher rates than fragmented data. Three rules:
- Never split a table mid-row. A table row is the atomic unit.
- Serialize to Markdown. Even if the source is PDF or DOCX, output Markdown table syntax. LLMs handle Markdown tables fluently.
- Include the table header in every chunk. If a 50-row table gets split into 3 chunks (rows 1-17, 18-34, 35-50), prepend the header row to each chunk.
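The header-repetition rule, sketched for a table already serialized to Markdown (the 17-rows-per-chunk figure is just the example from above, not a fixed constant):

```python
def chunk_table(header_row, separator_row, body_rows, rows_per_chunk=17):
    """Split a Markdown table by whole rows, prepending the header
    and separator to every chunk."""
    chunks = []
    for i in range(0, len(body_rows), rows_per_chunk):
        block = [header_row, separator_row] + body_rows[i:i + rows_per_chunk]
        chunks.append("\n".join(block))
    return chunks
```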
Including parent-section context
Every chunk carries its parent heading hierarchy as a prefix. Example:
[Returns Policy > Damaged Items]
If your item arrived damaged, you have 30 days to file a claim. To start:
1. Take 3 photos of the damage from different angles.
2. Email photos@acme.co with your order number.
3. We respond within 2 business days with a return label.

This chunk's embedding incorporates "Returns Policy > Damaged Items" context. A query like "what do I do if my package is broken?" retrieves it correctly because the embedding knows it's about returns and damage, not just generic shipping.
Without the parent prefix, the same chunk reads as just "If your item arrived damaged..." which embeds with less context.
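The prefixing itself is trivial once your parser tracks the heading hierarchy. A sketch (the list-of-headings shape is an assumption about how that hierarchy is represented):

```python
def with_parent_prefix(chunk_text, heading_path):
    """Prepend the parent heading hierarchy so the prefix is embedded
    along with the chunk body."""
    return "[" + " > ".join(heading_path) + "]\n" + chunk_text
```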
Strategies that mostly waste time
A few approaches we've seen waste effort without measurable retrieval improvement:
Sentence-level chunking. Embed every sentence separately. Sounds clever; produces noisy retrieval because individual sentences don't carry enough context.
Hand-tuned chunk sizes per document. Spending an hour per document to pick the "perfect" chunk size doesn't move the needle. The 400-to-800 default works for almost every B2B document.
Multi-vector indexing. Indexing the same chunk multiple times with different prefixes or summaries. Doubles index cost, marginally improves retrieval, rarely worth it.
Domain-specific embedding models. Unless you're in a specialized field (legal, medical) with significant jargon, generic embedding models work fine. Domain-specific models add complexity for small gains.
The big wins come from: respecting document structure, right-sized chunks, parent-section prefixes, and proper table handling. Everything else is incremental.
Re-chunking when the strategy changes
If you change chunking strategy mid-deployment (different chunk size, different format handling), you need to re-embed all existing content. The new chunks produce different vectors; the old vectors are now wrong relative to your retrieval logic.
This is one reason to get chunking right at the start. Re-chunking a 100 MB knowledge base costs roughly 30 to 60 minutes of compute and refreshes every vector in the index.
AskVault's recurring crawl auto-re-chunks when a document changes. Strategy changes that affect all documents require an explicit reindex via Knowledge Hub > Reindex All.
How AskVault chunks
AskVault's chunker is structure-aware, format-specific, and table-preserving. The chunker:
- Detects document format (HTML, PDF, Markdown, DOCX) and applies format-specific rules.
- Splits on heading boundaries (<h2> for HTML, Heading 2 style for DOCX, ## for Markdown).
- Targets 400 to 800 tokens per chunk with about 75 tokens of overlap.
- Serializes tables to Markdown syntax and never splits mid-row.
- Prepends parent heading hierarchy to each chunk.
- Strips navigation, footer, and ads from HTML.
Customers don't configure any of this. The defaults work for almost every B2B SaaS workload.
How to evaluate your chunking
Three diagnostic tests if your bot's answer quality seems off:
- Sample retrievals. For 20 representative queries, inspect the retrieved chunks. Do they contain the right content? If chunks are off-topic, chunking is at fault.
- Chunk-size histogram. Plot the token count of every chunk in your index. Bimodal distribution (lots of tiny chunks + lots of huge chunks) usually indicates a chunker bug. Healthy distributions cluster around 600 tokens.
- Source-citation hit rate. If the bot retrieves 5 chunks per query and the answer cites 0 or 1, retrieval is dragging dead weight. If it cites all 5, your chunks are well-sized.
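The histogram check is easy to script. A sketch using whitespace word counts as a stand-in for real token counts:

```python
from collections import Counter

def size_histogram(chunks, bucket=100):
    """Count chunks per 100-token bucket. A healthy index clusters
    around the 400-800 range; two separate modes suggest a chunker bug."""
    return Counter(bucket * (len(c.split()) // bucket) for c in chunks)
```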
For AskVault users, all three are surfaced under Knowledge Hub > Diagnostics.
FAQ
Should I chunk my data before uploading to AskVault?
No. AskVault chunks for you. Upload raw documents (PDF, DOCX, HTML) and the platform handles chunking. Pre-chunking by hand usually hurts quality.
Can I see the chunks AskVault produced?
Yes. Knowledge Hub > [document] > View Chunks shows every chunk with its parent heading, token count, and surrounding context.
What if my document is mostly numerical tables?
AskVault preserves tables and serializes them to Markdown. For pure-numeric documents (spreadsheets, CSV), upload as CSV directly; the parser handles the row-as-record model natively.
How does the chunker handle code blocks?
Code blocks are preserved intact, never split mid-block. Each code block + surrounding explanation typically becomes one chunk.
What's the max chunk size I should use?
In practice, 1,200 tokens is a good ceiling. Beyond that, embedding quality starts to degrade because the chunk's average meaning gets too diluted.
Related guides
- What is Retrieval-Augmented Generation (RAG)?
- How vector databases work
- HTTP-first scraping with automatic browser escalation
- How AskVault prevents hallucinations