Chunking for RAG


How to break documents into chunks ready for embedding, and how to tune your chunking strategy

Recap

What we're building

Our fictitious airline is looking to build a world-class service agent.

There are agents for both airline staff and customers. Depending on who is asking, different information is surfaced.

The agent(s) must be:

  • Grounded in knowledge from the thousands of documents the airline holds: FAQs, PDFs, knowledge bases, and more.
  • Connected to structured data stores including booking systems, scheduling, inventory, loyalty, CDP, and CRM.

The agent(s) should return accurate, helpful answers quickly, but must gate what is shown to whom. A passenger can only access their own data, not another passenger's, and never internal staff-only content.

Full scenario: The Running Example

Let's take a long-form document, our Irregular Ops (IRROPS) manual, and break it into smaller chunks, ready for embedding into vectors.

Chunking a document

Getting Chunking Right

As we design a chunking strategy, there are a few things to think about:

  • What are the natural chunking boundaries in our content? (Bullets, headers, sections, etc.)
  • How large should our chunks be?
  • What is the amount of overlap per chunk? (and why do we overlap chunks?)
  • What metadata do our chunks need?

And we need to make these decisions because they impact:

  • Recall: we retrieve the right material
  • Precision: we don't retrieve a load of irrelevant text
  • Answer quality: chunks are coherent enough to quote/use
  • Cost/latency: fewer, larger chunks mean fewer embeddings, but more noise per chunk.

What is a Token?

Put simply, a token is a unit a language model reads and generates. It's not exactly a word. Models use a tokenizer that splits text into pieces that can be:

  • whole words (flight)
  • parts of words (re, book, ing)
  • punctuation (.)
  • spaces/newlines (often part of tokens)

So 'rebooking' might be 1 token in one tokenizer, or multiple tokens in another.

This matters for chunking because:

  • Models have context limits measured in tokens.
  • Embedding models also take input in tokens.
  • So target chunk sizes in tokens; character counts are only a crude proxy.

Rule of thumb (English language): 1 token ≈ 4 characters on average, or ~0.75 words. Not exact, but good enough for planning.
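As a quick sanity check, the heuristic is trivial to code up (the function name here is mine, not from the example repo):

```typescript
// Rough token estimate using the ~4 characters per token heuristic.
// Good enough for sizing chunks; swap in a real tokenizer for accuracy.
function estimateTokens(text: string): number {
  return Math.ceil(text.length / 4);
}

// A 500-token budget therefore maps to roughly 2,000 characters.
const budgetChars = 500 * 4;
```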

An Example Approach

  1. Decide on retrieval unit boundaries. Ideally we split on semantic structure (section breaks, bullet lists, subsection headings) rather than just a character count. Chunking by section often beats chunking by token length, as you don't split a single subject over multiple chunks or combine several subjects into one. The two approaches aren't mutually exclusive: tokens act as a stop condition to prevent excessively long chunks.

  2. Pick a target size in tokens. Start at 500 tokens per chunk and iterate based on what you observe; adjust up or down. Broadly, this is the impact of each direction:

    | Symptom | Chunk size | Result |
    |---|---|---|
    | Relevant content is missing from answers | Too small | Low recall: key passages aren't being retrieved |
    | Answers are correct but vague or over-broad | Too large | Low precision: chunks cover too many topics |
  3. Decide on metadata. Metadata is 'data about data'. For our final application, it provides important information about where a chunk came from, whether the information is current, the intended audience, and more. Your chunking script will generate most of this from the source document.

"[metadata](glossary:metadata)": {
  "policy_id": "LOYALTY-050",
  "version": "1.7",
  "effective": "2026-01-01",
  "author": "Loyalty Ops Team",
  "region": "AUS-NZ",
  "owner": "Loyalty",
  "doc_title": "Loyalty Tier Entitlements During Disruption",
  "classification": "CUSTOMER-FACING"
}

Notice the classification field. This is what allows the retrieval layer to filter chunks by intended audience, ensuring internal-only content is never returned to a customer-facing agent. The LLM doesn't make this decision; the metadata does.
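To make that concrete, here is a minimal sketch of metadata-based filtering (the Chunk shape and function name are illustrative, not code from the repo):

```typescript
// Hypothetical chunk shape; the field names mirror the metadata example above.
interface Chunk {
  text: string;
  metadata: { classification: string };
}

// Filter chunks BEFORE retrieval/generation, so internal-only content
// never reaches a customer-facing agent. The LLM is never involved.
function visibleTo(chunks: Chunk[], allowed: string[]): Chunk[] {
  return chunks.filter((c) => allowed.includes(c.metadata.classification));
}
```

A customer-facing agent would call something like `visibleTo(allChunks, ["CUSTOMER-FACING"])` and simply never see anything tagged for internal use.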

Boundaries & Token Length

In reality, most source documents are imperfect. Some sections are very long, others very short, and some have no clear structure at all. Here is how to handle each case:

  • Well-structured documents: split primarily on boundaries (headings, sections, bullet lists). Use a token maximum as a guardrail for sections that run long.

  • Poorly-structured documents: rely more heavily on fixed token splits. Always specify an overlap (e.g. 15%) so important context isn't lost at chunk boundaries. In essence, the last N tokens of chunk A become the first N tokens of chunk B.

  • Very short sections: merge with the adjacent sibling rather than creating a near-empty chunk.

  • Tables: treat as atomic units and never split mid-table. A token split that lands inside a table produces a fragment with orphaned rows and no headers. The embedding of that fragment is near-meaningless, and an LLM asked to reason from it may return confidently wrong answers. Keep the entire table in one chunk, even if it exceeds your token target. If a table is genuinely enormous, serialise it row-by-row as prose instead.

    Note

The delay compensation table in 70_passenger_rights_and_compensation.md maps delay durations to compensation amounts across domestic and international routes. If our chunker splits that table mid-row, a chunk might contain the column headers and the first two rows, enough to look plausible, but missing the rows that answer the most common query: "what does a passenger get for a 3-hour domestic delay?" The retrieval step finds the chunk. The LLM reads it. The answer is wrong.

  • Images: for example, diagrams that contain text, or diagrams that form part of an answer. These may still need a chunk object, even though you are not chunking the image itself. Handling images well is a specialism of its own and goes beyond the scope of this article:

    • For text in images: OCR could be used to extract the text and chunk/embed it as if it were ordinary text
    • For diagrams related to answers: e.g. "Map of baggage offices at LA Airport Terminal 7" will be important in certain retrieval scenarios:
      • A multimodal embedder embeds the image itself.
      • We create a chunk with metadata about the image, such as:
        • Where it came from
        • Is it customer-facing?
        • Is it up to date?
        • What is the surrounding context: what text came before or after the image? (We only want to return this image when talking about Terminal 7 at LA Airport.)
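As a concrete illustration of the table fallback described above, here is one way to serialise a Markdown table row-by-row as prose; the format assumptions and function name are mine:

```typescript
// Sketch: turn each data row of a Markdown table into a self-contained
// prose statement, so every row is independently embeddable.
function tableToProse(markdownTable: string): string[] {
  const lines = markdownTable.trim().split("\n");
  const parse = (line: string) =>
    line.split("|").map((c) => c.trim()).filter((c) => c.length > 0);
  const headers = parse(lines[0]);
  // Skip the |---|---| separator row, then pair each cell with its header.
  return lines.slice(2).map((row) => {
    const cells = parse(row);
    return headers.map((h, i) => `${h}: ${cells[i]}`).join("; ") + ".";
  });
}
```

Each resulting line carries its own column context, so a retrieved "row" never arrives as an orphaned fragment.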

Remember

Chunking is typically structure-first: split by headings, sections, and bulleted lists. The token budget (e.g. 500) is the stop condition.

  • If a section is under your target size, keep it as one chunk.
  • If a section exceeds your max size, split it further by token count with overlap (e.g. 15%).
  • If a section is very short, merge it with the next sibling.
  • If a section contains a table, keep the table intact — never split across it.
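Those rules can be sketched in a few lines of TypeScript. This is an illustrative outline, not the chunker from the example repo, and the names and thresholds are placeholders:

```typescript
// Sketch of the structure-first approach described above.
const MAX_TOKENS = 500;
const MIN_TOKENS = 50; // below this, merge with the adjacent section

const estTokens = (s: string) => Math.ceil(s.length / 4);

// Split a Markdown document on H2 headings, then merge tiny sections.
function chunkByStructure(markdown: string): string[] {
  // The lookahead keeps the "## " marker attached to its section.
  const sections = markdown
    .split(/^(?=## )/m)
    .map((s) => s.trim())
    .filter((s) => s.length > 0);

  const chunks: string[] = [];
  for (const section of sections) {
    const last = chunks[chunks.length - 1];
    if (last !== undefined && estTokens(last) < MIN_TOKENS) {
      // Previous section was too small: merge rather than emit a near-empty chunk.
      chunks[chunks.length - 1] = last + "\n\n" + section;
    } else {
      chunks.push(section);
    }
  }
  // Sections above MAX_TOKENS would be split further by token count with
  // overlap, keeping tables intact; omitted here for brevity.
  return chunks;
}
```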

Overlap is Your Friend (Mostly...)

Overlap helps us in RAG retrieval for a number of reasons:

  1. Preserves context across boundaries: A query may match language that sits right on a split. Overlap ensures at least one chunk contains enough context to be meaningful.

  2. Improves recall: You’re less likely to miss relevant content because the “right” phrase was split awkwardly across chunks.

  3. Produces better embeddings: As we know, we create one embedding per chunk. If a chunk starts mid-thought, its embedding can be weak or misleading. Overlap gives each chunk a more coherent semantic unit.

  4. Helps answer generation: Even if retrieval finds a chunk, the LLM still needs surrounding context to answer correctly. Overlap increases the chance the retrieved chunk is self-contained enough to use.
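Mechanically, overlap is just a tail copy. A sketch, approximating tokens with whitespace-separated words (the function name is mine):

```typescript
// Prefix each chunk with the last `overlapWords` words of the previous
// chunk, so content near a boundary appears in both chunks.
function withOverlap(chunks: string[], overlapWords: number): string[] {
  return chunks.map((chunk, i) => {
    if (i === 0) return chunk;
    const prevWords = chunks[i - 1].split(/\s+/);
    const tail = prevWords.slice(-overlapWords).join(" ");
    // The tail of chunk A becomes the head of chunk B.
    return tail + "\n" + chunk;
  });
}
```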

However, there are some things to consider:

  • Storage: more duplicated text. Rarely a concern unless you are ingesting huge volumes of it.
  • Embedding cost: overlapped tokens are embedded twice, though this is unlikely to matter until you run very large numbers of embeddings.
  • Index size: a larger index can slow retrieval.

Quick Summary: Chunking Strategy

Taking all the above into account, this is how we will chunk our IRROPS manual:

  • Primarily by section
  • Large sections will be split at the 500 token limit with 15% overlap
  • Metadata will be generated and attached, including:
    • Source document and section
    • Intended audience
    • Privacy classification
    • Dates & versions

Our First Chunker

Note

Python is the lingua franca of AI/ML - so why is most of this TypeScript?

  1. For those who want to try this themselves, getting Node running locally or on a hosted service is generally smoother than navigating Python environments, pip conflicts, and virtual environments. If you have lost hours to a Python setup issue, you know what I mean!

  2. The SDKs are first-class. Claude and ChatGPT have official TypeScript SDKs and MCP has native TypeScript support: TypeScript is a supported, idiomatic path.

  3. TypeScript's enforcement of types makes the data structures visible. This helps me show you the shape of what is flowing through a system. You can easily see what a vector looks like, what a tool call returns, and what metadata attaches to a chunk.

Throughout the series I have written the code in a way that is easy to understand rather than syntactically optimal - e.g.:

  • Avoiding certain TypeScript patterns, even where they could reduce code repetition, so each example stays self-contained and easier to read.
  • Minimal error handling.
  • Avoiding code shortcuts.

I wrote a lot about chunking above, but at its core, it can be as simple as a regular expression wrapped around some file operations! I would certainly not recommend using this, as we are ignoring token counts and metadata completely, but we'll start very very simple and build it up.
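To show just how small "simple" can be, here is a sketch of that regex-only idea (again: illustrative only, and not recommended for real use):

```typescript
// The simplest possible chunker: split wherever an H2 heading starts.
// No token counts, no metadata - just one regex.
function naiveChunk(markdown: string): string[] {
  // The lookahead splits *before* each "## " without consuming it.
  return markdown.split(/^(?=## )/m).filter((s) => s.trim().length > 0);
}

// In practice you would wrap this around a file read, e.g.
// naiveChunk(readFileSync("10_irrops_reaccommodation.md", "utf8"));
```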

  • The Code Explorer contains all the code for the chunking examples.
  • The Playground allows you to execute the examples and see the output - no need to install anything locally!
  • The Examples walk you through each chunking script and build on concepts

Note

You can find all the code on Github

The documents our airline uses can be found in the Code Playground under data/policies/simple_example. For this article, we are sticking to a simple set of Markdown documents. Of course, we will also have PDFs, Word documents, HTML on web pages, notes from customer service cases, and lots more sources to ingest. That goes outside the scope of this document, but we would typically build integrations to pull these types of data in and parse (read out) the text for our chunker.

Remember

Markdown is a very commonly used format for FAQs, manuals, help pages, etc., as it is easy for both humans and RAG pipelines to consume and understand.

Markdown uses simple strings of characters to mark sections, and so on - perfect for our chunker! Some quick examples:

  • A section header looks like this: ## Lost Bags Policy
  • A subsection header under 'Lost Bags Policy' would be ### Lost Bags Compensation
  • A bullet looks like this: - Send confirmation to the booking's main email address

Code Explorer

Remember

This series provides working examples to demonstrate the core concepts in a clear way. This is not production-ready code.

All the code used for our chunking examples

README.md

Explorer

# AI For the Curious: RAG Tools

Companion code for the **[AI For the Curious](https://sambessey.com/articles)** series on [sambessey.com](https://sambessey.com).

This repo covers the **Chunking for RAG** article: how to break long documents into smaller pieces ready for embedding into a vector database, and how the strategy you pick affects retrieval quality.

---

## What is chunking and why does it matter?

Before an LLM can answer questions about your documents, those documents need to be turned into vectors and stored in a database. The problem: you can't embed a 50-page PDF as one blob. You slice it into chunks first, embed each chunk, and store them individually.

The chunks you retrieve at query time are what the LLM actually reads. Get chunking wrong and you get:

- **Missed answers** (the relevant content was split across a boundary)
- **Noisy answers** (too much irrelevant text crammed into one chunk)
- **Broken context** (a sentence retrieved mid-thought, with no surrounding information to anchor it)

Chunking strategy directly controls recall, precision, and coherence.

---

## The running example

The code works on a synthetic airline policy corpus: eight internal documents covering topics like irregular operations, fee waivers, loyalty tier entitlements, and passenger rights. Each document includes structured metadata fields (Policy ID, Owner, Audience, Sensitivity, Topics) modelled on how real enterprise knowledge bases are tagged.

The corpus lives in `data/policies/simple_example/`.

---

## Three chunkers, iterated

The code is structured as three progressively more capable chunkers. Each one builds on the last.

### chunker-1: structure-first splitting

Splits the document at every H2 heading (`##`). One section, one chunk. No metadata. No size control.

This is the simplest possible approach and the right starting point for understanding the problem. It fails when sections are wildly different lengths.

### chunker-2: adding metadata

Same H2-based split, but now each chunk carries:

- **headingPath**: the chain of headings leading to this section (document title + H2 heading)
- **metadata**: structured fields extracted from the policy header block (policy ID, owner, classification, audience, sensitivity, topics, and more)

A chunk without metadata is just text. A chunk with metadata becomes addressable: you can filter by audience, sensitivity level, or policy owner before the LLM ever sees it.

### chunker-3: token budget and overlap

Extends chunker-2 with two important mechanics:

**Token budget**: if a section exceeds the target token count (default: 500), it gets split into smaller sub-chunks at paragraph boundaries. Splitting on blank lines rather than raw character positions keeps paragraphs, list items, and table rows intact.

**Overlap**: the tail of each chunk is copied to the start of the next. This prevents a sentence or key fact from being stranded at a boundary where neither the preceding nor the following chunk contains enough context to understand it.

Token estimation uses a simple 4-chars-per-token approximation. The comments in the code explain how to swap in a real tokenizer (tiktoken, `@anthropic-ai/tokenizer`) for production use.

---

## Running the code

**Prerequisites:** Node.js 18+ and a package manager that supports `tsx`.

```bash
npm install
```

Run a chunker against one of the sample documents:

```bash
npx tsx index.ts chunker-1 10_irrops_reaccommodation.md
npx tsx index.ts chunker-2 10_irrops_reaccommodation.md
npx tsx index.ts chunker-3 10_irrops_reaccommodation.md
```

The first argument is the chunker (`chunker-1`, `chunker-2`, `chunker-3`). The second is any file in `data/policies/simple_example/`.

Output is printed with `console.dir` so nested objects are shown in full.

---

## File structure

```
.
├── chunking/
│   ├── chunker-1.ts      # H2-based split, minimal output
│   ├── chunker-2.ts      # Adds heading path and metadata extraction
│   └── chunker-3.ts      # Adds token budget enforcement and overlap
├── data/
│   └── policies/
│       └── simple_example/
│           ├── 00_readme.md
│           ├── 10_irrops_reaccommodation.md
│           ├── 20_fee_waivers_and_fare_rules.md
│           ├── 30_partner_airline_guidelines.md
│           ├── 40_service_recovery_vouchers.md
│           ├── 50_loyalty_tier_entitlements.md
│           ├── 60_flight_disruption_faq.md
│           ├── 70_passenger_rights_and_compensation.md
│           └── contents.md
├── index.ts              # Entry point: picks the chunker and file from CLI args
└── package.json
```

---

## Part of a series

This repo sits inside the RAG track of the *AI For the Curious* series:

| Article | What it covers |
|---|---|
| Primer: Databases | The main database technologies and how they relate to LLMs |
| RAG and MCP 101 | A quick introduction to two key concepts underpinning the series |
| pgvector Setup for Airline RAG | Setting up pgvector locally and running similarity queries |
| **Chunking for RAG** | **This repo: breaking documents into chunks ready for embedding** |

Read the full article at [sambessey.com/articles](https://sambessey.com/articles).

Code Playground

See each example come to life in your browser

$ npx tsx index.ts chunker-1 10_irrops_reaccommodation.md

Examples Explained

Expand each section for a walkthrough of what is happening

Example 1- Simple Chunking

Files to Explore

In the Code Explorer above, review the three main files used in this example:

  • data/policies/simple_example contains all of our airline policy documents.
  • index.ts is the script we run and calls our chunker, logging the output. It takes in the file to be chunked.
  • chunking/chunker-1.ts is the actual chunking function we run to split the document into chunks.

Click the Execute button above and run example 1.

This command is run:

npx tsx index.ts chunker-1 10_irrops_reaccommodation.md
  1. It uses npx to run tsx, which executes index.ts with the chunker chunker-1 and the Markdown file you pass in

  2. This takes an input file and chunks it at each logical break (In Markdown this is indicated by ##).

This output gives us one chunk per section - but this is no good for embedding as there is no context about these chunks:

  • Where did it come from?
  • Is it safe to show the customer?
  • Does it apply to customers globally or by geo?

Example 2 deals with Metadata to help us solve this.

Example 2- Metadata

Files to Explore

In the Code Explorer above, review chunker-2 - everything else is the same. It is almost identical to chunker-1, but adds metadata to each chunk. The chunker might read this from the file headers or frontmatter, or pull the title by looking for # in the Markdown.

Here are some of the fields we'll create. This was discussed above, but fields like classification are vital here - they prevent us from returning internal-facing data to customers in our responses. The LLM doesn't make this decision; the metadata does.

{
  policy_id: 'OPS-PARTNER-030',
  version: '1.0',
  effective: '2026-01-01',
  region: 'Global',
  owner: 'Alliances & Partnerships',
  classification: 'INTERNAL-OPS',
  audience: ['agent'],
  sensitivity: 'INTERNAL',
  topics: ['partner rebooking', 'interline', 'alliances', 'IRROPS'],
  applies_to: ['all passengers']
}

Run chunker-2. Looks great, we can see metadata against our chunks!

Now run chunker-2-long-content... We have a problem!

This chunks 70_passenger_rights_and_compensation.md. It has six distinct sub-sections covering delay compensation thresholds, cancellation entitlements, denied boarding, tarmac delay procedures, force majeure, and escalation. Each covers completely different agent obligations, but they all sit under one ## heading, so chunker-2 treats them as a single chunk. It's about 1,500 tokens: a lot of text.

Consider the query "what compensation does a passenger get for a 3-hour domestic delay?".

  • We have a precision problem. The retrieval layer finds this chunk, and your LLM receives the delay threshold table it needed along with the IDB compensation scale, the four-step cancellation procedure, tarmac delay timelines, and a list of force majeure conditions. The answer might still be correct, but it arrived with enormous noise.

  • The recall problem is subtler. An over-large chunk competes poorly against more focused ones. A specific query about denied boarding might rank a tightly-scoped chunk on that topic higher than this catch-all section, and miss it entirely.

Example 3 fixes this: chunker-3 enforces a 500-token limit and adds 15% overlap, so that massive section is split into six coherent chunks instead of one bloated one.

Example 3- Token Boundaries

Files to Explore

In the Code Explorer above, review chunker-3 - everything else is the same. This file has changed quite significantly: it adds a token budget and overlap. Long sections are broken into shorter pieces to help recall, and the tail of each chunk is repeated at the start of the next.

In this example, we are passing in a custom token length (500), and overlapping 15% (75 tokens). You can see this on line 18 of index.ts: `console.dir(chunkPolicyMarkdown(fileContents, 500, 75), { depth: null });`. This will be important in a future article when we test precision and recall against our embedded chunks.

Remember

Good middle ground values to start testing are around:

  • 500 tokens per chunk
  • 15% (75 tokens) overlap

In the output of chunker-3 above (or in the screenshot below), you can see our overlaps and token length in action. Notice how the text from the end of chunk 5 appears in chunk 6.

Overlapping text between two chunks

I have also added code to ensure tables are chunked cleanly.

That's chunking covered! The chunker is ready to hand off to the embedding step. The next article will cover:

  • The major embedding algorithms
  • Embedding the chunker output into vectors
  • Retrieving the K nearest-neighbour chunks using a query vector
  • Ranking results with multiple distance and similarity functions
  • Precision and recall in practice