Chunking for RAG
9 min read
How to break documents into chunks ready for embedding, and how to tune your chunking strategy
Recap
What we're building
Our fictitious airline is looking to build a world-class service agent.
There are agents for both airline staff and customers. Depending on who is asking, different information is surfaced.
The agent(s) must be:
- Grounded in knowledge from the thousands of documents the airline holds: FAQs, PDFs, knowledge bases, and more.
- Connected to structured data stores including booking systems, scheduling, inventory, loyalty, CDP, and CRM.
The agent(s) should return accurate, helpful answers quickly, but must gate what is shown to whom. A passenger can only access their own data, not another passenger's, and never internal staff-only content.
Full scenario: The Running Example
Chunking is the step that makes large documents searchable. A 200-page airline manual isn't useful as one block. Broken into well-sized, well-labelled pieces, however, it becomes something an AI can retrieve from precisely.
Let's take a long-form document, our Irregular Ops (IRROPS) manual, and break it into smaller chunks, ready for embedding into vectors.
Getting Chunking Right
As we design a chunking strategy, there are a few things to think about:
- What are the natural chunking boundaries in our content? (Bullets, headers, sections, etc.)
- How large should our chunks be?
- What is the amount of overlap per chunk? (and why do we overlap chunks?)
- What metadata do our chunks need?
And we need to make these decisions because they impact:
- Recall: we retrieve the right material
- Precision: we don't retrieve a load of irrelevant text
- Answer quality: chunks are coherent enough to quote/use
- Cost/latency: fewer/larger chunks means fewer embeddings, but more noise.
What is a Token?
Put simply, a token is a unit a language model reads and generates. It's not exactly a word. Models use a tokenizer that splits text into pieces that can be:
- whole words (flight)
- parts of words (re, book, ing)
- punctuation (.)
- spaces/newlines (often part of tokens)
So 'rebooking' might be 1 token in one tokenizer, or multiple tokens in another.
This matters for chunking because:
- Models have context limits measured in tokens.
- Embedding models also take input in tokens.
- So you target chunk sizes in tokens because character counts are a crude proxy.
Rule of thumb (English language): 1 token ≈ 4 characters on average, or ~0.75 words. Not exact, but good enough for planning.
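The rule of thumb can be wrapped in a tiny helper for planning chunk sizes before a real tokenizer is wired up. This is a rough sketch of the heuristic above, not a substitute for your model's actual tokenizer:

```python
def estimate_tokens(text: str) -> int:
    """Rough token estimate using the ~4 characters/token rule of thumb."""
    return max(1, round(len(text) / 4))

# A 400-character section plans out at roughly 100 tokens:
print(estimate_tokens("x" * 400))  # 100
```

For exact counts, use the tokenizer that ships with your embedding or chat model.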
An Example Approach
- Decide on retrieval unit boundaries. Ideally we split on semantic structure (section breaks, bullet lists, subsection headings) rather than just a character count. Chunking by section often beats chunking by token length, as you don't split a single subject over multiple chunks or combine several subjects into one. The two approaches aren't mutually exclusive: tokens act as a stop condition to prevent excessively long chunks.
- Pick a target size in tokens. Start at 500 tokens per chunk and iterate based on what you observe; adjust up or down. Broadly, this is the impact of each direction:
| Symptom | Chunk size | Result |
| --- | --- | --- |
| Relevant content is missing from answers | Too small | Low recall: key passages aren't being retrieved |
| Answers are correct but vague or over-broad | Too large | Low precision: chunks cover too many topics |

- Decide on metadata. Metadata is 'data about data'. For our final application, it provides important information about where a chunk came from, whether the information is current, the intended audience, and more. Your chunking script will generate most of this from the source document.
"[[metadata]]": {
"policy_id": "LOYALTY-050",
"version": "1.7",
"effective": "2026-01-01",
"author": "Loyalty Ops Team",
"region": "AUS-NZ",
"owner": "Loyalty",
"doc_title": "Loyalty Tier Entitlements During Disruption",
"classification": "CUSTOMER-FACING"
}
Notice the classification field. This is what allows the retrieval layer to filter chunks by intended audience, ensuring internal-only content is never returned to a customer-facing agent. The LLM doesn't make this decision; the metadata does.
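As a sketch of what that filtering might look like in application code (the audience-to-classification mapping and the INTERNAL label are illustrative assumptions, not part of the example above):

```python
# Hypothetical mapping from agent audience to allowed classifications.
ALLOWED = {
    "customer": {"CUSTOMER-FACING"},
    "staff": {"CUSTOMER-FACING", "INTERNAL"},  # assumed internal-only label
}

def filter_by_audience(chunks: list[dict], audience: str) -> list[dict]:
    """Drop chunks the audience may not see, before the LLM sees anything."""
    allowed = ALLOWED[audience]
    return [c for c in chunks if c["metadata"]["classification"] in allowed]
```

In production this predicate is usually pushed down into the vector store's metadata filter, so disallowed chunks are never even retrieved.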
Boundaries & Token Length
In reality, most source documents are imperfect. Some sections are very long, others very short, and some have no clear structure at all. Here is how to handle each case:
- Well-structured documents: split primarily on boundaries (headings, sections, bullet lists). Use a token maximum as a guardrail for sections that run long.
- Poorly-structured documents: rely more heavily on fixed token splits. Always specify an overlap (e.g. 15%) so important context isn't lost at chunk boundaries. In essence, the last N tokens of chunk A become the first N tokens of chunk B.
- Very short sections: merge with the adjacent sibling rather than creating a near-empty chunk.
- Tables: treat as atomic units and never split mid-table. A token split that lands inside a table produces a fragment with orphaned rows and no headers. The embedding of that fragment is near-meaningless, and an LLM asked to reason from it may return confidently wrong answers. Keep the entire table in one chunk, even if it exceeds your token target. If a table is genuinely enormous, serialise it row-by-row as prose instead.
Note
The delay compensation table in 70_passenger_rights_and_compensation.md maps delay durations to compensation amounts across domestic and international routes. If our chunker splits that table mid-row, a chunk might contain the column headers and the first two rows, enough to look plausible, but missing the rows that answer the most common query: "what does a passenger get for a 3-hour domestic delay?" The retrieval step finds the chunk. The LLM reads it. The answer is wrong.
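Serialising an oversized table row-by-row, as suggested above, can be as simple as repeating the headers in every row's sentence so no fragment is ever orphaned. The compensation values here are invented for illustration:

```python
def table_to_prose(headers: list[str], rows: list[list[str]], context: str) -> list[str]:
    """Turn each table row into a standalone sentence that carries its headers."""
    return [
        f"{context}: " + "; ".join(f"{h}: {v}" for h, v in zip(headers, row))
        for row in rows
    ]

sentences = table_to_prose(
    headers=["Delay", "Route type", "Compensation"],
    rows=[["3+ hours", "domestic", "$200 voucher"]],  # illustrative values only
    context="Delay compensation",
)
# -> ["Delay compensation: Delay: 3+ hours; Route type: domestic; Compensation: $200 voucher"]
```

Each sentence now embeds and retrieves independently, with full context.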
- Images: for example, diagrams that contain text, or that are important as part of an answer. These may still need a chunk object, even though you are not chunking the image itself. This is a specialist area in its own right and goes beyond the scope of this article:
  - For text in images: OCR can extract the text, which is then chunked and embedded as if it were ordinary text.
  - For diagrams related to answers: e.g. "Map of baggage offices at LA Airport Terminal 7" will be important in certain retrieval scenarios:
    - A multimodal embedder embeds the image itself.
    - We create a chunk with metadata about the image, such as:
      - Where it came from
      - Is it customer-facing?
      - Is it up to date?
      - The surrounding context: what text came before or after the image? (We only want to return this image when talking about Terminal 7 at LA Airport.)
Impact on Embedding
Some embedding models have a maximum input of only around 512 tokens. If your chunk size plus overlap approaches that ceiling, you'll hit errors, and you'll have no headroom to increase chunk size if recall is poor. Check your embedder's token limit before settling on a chunk size.
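A cheap guard in your pipeline configuration catches this early. The limit below is a placeholder; substitute the figure from your embedder's documentation:

```python
EMBEDDER_MAX_TOKENS = 512  # placeholder: check your embedder's documentation
CHUNK_TOKENS = 400         # target chunk size
HEADROOM = 0.2             # keep 20% spare so you can tune chunk size upwards

# Fail fast at startup rather than partway through ingestion.
assert CHUNK_TOKENS <= EMBEDDER_MAX_TOKENS * (1 - HEADROOM), (
    "Chunk size leaves no headroom under the embedder's token limit"
)
```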
Remember
Chunking is typically structure-first: split by headings, sections, and bulleted lists. The token budget (e.g. 500) is the stop condition.
- If a section is under your target size, keep it as one chunk.
- If a section exceeds your max size, split it further by token count with overlap (e.g. 15%).
- If a section is very short, merge it with the next sibling.
- If a section contains a table, keep the table intact — never split across it.
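The four rules above can be sketched as a single pass over a parsed document. Everything here is illustrative: sections are plain dicts, and whitespace-split "tokens" stand in for a real tokenizer:

```python
def split_with_overlap(tokens, max_tokens, overlap):
    """Fixed-size splits: the tail of each chunk restarts the next one."""
    step = max(1, int(max_tokens * (1 - overlap)))
    return [tokens[i:i + max_tokens] for i in range(0, len(tokens), step)]

def chunk_sections(sections, max_tokens=500, overlap=0.15, min_tokens=50):
    """Structure-first chunking: one chunk per section, the token budget as
    stop condition, tiny sections merged forward, tables kept atomic."""
    chunks, carry = [], ""
    for sec in sections:
        text = f"{carry} {sec['text']}".strip()
        carry = ""
        tokens = text.split()  # crude word-level stand-in for real tokens
        if sec.get("is_table"):
            chunks.append(text)              # never split inside a table
        elif len(tokens) < min_tokens:
            carry = text                     # merge with the next sibling
        elif len(tokens) > max_tokens:
            chunks += [" ".join(t) for t in
                       split_with_overlap(tokens, max_tokens, overlap)]
        else:
            chunks.append(text)              # section fits: keep as one chunk
    if carry:
        chunks.append(carry)                 # flush a trailing short section
    return chunks
```

A real implementation would also attach the metadata discussed earlier to each chunk; it is omitted here to keep the control flow visible.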
Overlap is Your Friend (Mostly...)
Overlap helps us in RAG retrieval for a number of reasons:
- Preserves context across boundaries: a query may match language that sits right on a split. Overlap ensures at least one chunk contains enough context to be meaningful.
- Improves recall: you're less likely to miss relevant content because the "right" phrase was split awkwardly across chunks.
- Produces better embeddings: we create one embedding per chunk, and if a chunk starts mid-thought, its embedding can be weak or misleading. Overlap gives each chunk a more coherent semantic unit.
- Helps answer generation: even if retrieval finds a chunk, the LLM still needs surrounding context to answer correctly. Overlap increases the chance the retrieved chunk is self-contained enough to use.
However, there are some things to consider:
- Storage: more duplicated text. Usually not a concern unless you are ingesting huge volumes.
- Embedding cost: overlapping tokens are embedded more than once. Rarely significant at moderate scale.
- Index size: a larger index can slow retrieval.
Commercial Chunkers
For a real airline, you would almost certainly use an off-the-shelf chunker rather than building your own. These are purpose-built to handle the full range of document formats an enterprise produces (PDFs, Word files, HTML, scanned images) and already incorporate best practice around metadata, overlap, and boundary detection.
The main options worth knowing about:
- Unstructured.io: Strong choice for an airline. Purpose-built for messy enterprise documents across multiple formats (HTML, Markdown, PDF, DOC). Handles tables, headers, and mixed layouts well.
- LlamaIndex: Sophisticated node parsing with good metadata handling. Works well as part of a broader RAG pipeline.
- LangChain Text Splitters: Widely used, solid ecosystem, and includes a SemanticChunker for more advanced use cases.
- Cloud-native: AWS Bedrock Knowledge Bases and Azure AI Document Intelligence both offer managed chunking if you're already on those stacks.
Commercial chunkers also support more advanced chunking techniques.
Beyond Basic Chunking
Semantic Chunking
Rather than splitting on structure or token count, semantic chunking uses embeddings to find natural breakpoints in meaning. You embed sentences, find where semantic similarity drops sharply, and split there. It is more expensive, as you embed the document twice, but it produces more coherent chunks. LangChain's SemanticChunker implements this.
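A minimal sketch of the mechanism, with a toy letter-frequency "embedder" standing in for a real embedding model:

```python
import math

def toy_embed(sentence):
    """Stand-in embedder: letter frequencies. A real pipeline would call
    an embedding model here instead."""
    vec = [0.0] * 26
    for ch in sentence.lower():
        if "a" <= ch <= "z":
            vec[ord(ch) - ord("a")] += 1
    return vec

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def semantic_split(sentences, threshold=0.6):
    """Start a new chunk wherever similarity to the previous sentence drops."""
    if not sentences:
        return []
    chunks = [[sentences[0]]]
    embs = [toy_embed(s) for s in sentences]
    for prev, cur, sent in zip(embs, embs[1:], sentences[1:]):
        if cosine(prev, cur) < threshold:
            chunks.append([sent])   # similarity dropped sharply: split here
        else:
            chunks[-1].append(sent)
    return chunks
```

Production implementations refine this with windowed averages and percentile-based thresholds rather than comparing single adjacent sentences.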
Agentic / Propositional Chunking
An LLM reads the document, decides where to split, and rewrites each chunk as a self-contained proposition. This produces the highest-quality output, but at significant cost, and your chunks are no longer verbatim extracts from the source, which may or may not matter depending on your use case.
Both approaches are worth considering once you've validated your retrieval pipeline with a simpler strategy first.
Quick Summary: Chunking Strategy
Taking all the above into account, this is how we will chunk our IRROPS manual:
- Primarily by section
- Large sections will be split at the 500 token limit with 15% overlap
- Metadata will be generated and attached, including:
- Source document and section
- Intended audience
- Privacy classification
- Dates & versions
We build it from scratch so you can see exactly how each of these decisions plays out in code. In production, you'd reach for one of the commercial tools above — but understanding the mechanics first means you'll know what to configure, and why.
Ready to see these principles in code? Next in the Chunking mini-series: Chunking in Practice — a working chunker built from scratch, three progressively complete examples that show how each of these decisions plays out in real output.
If you have not set up a Neon database yet, follow pgvector Setup for Airline RAG