Chunking in Practice
A working chunker built from scratch. Three examples that progressively add metadata and token boundaries
Recap
What we're building
Our fictitious airline is looking to build a world-class service agent.
There are agents for both airline staff and customers. Depending on who is asking, different information is surfaced.
The agent(s) must be:
- Grounded in knowledge from the thousands of documents the airline holds: FAQs, PDFs, knowledge bases, and more.
- Connected to structured data stores including booking systems, scheduling, inventory, loyalty, CDP, and CRM.
The agent(s) should return accurate, helpful answers quickly, but must gate what is shown to whom. A passenger can only access their own data, not another passenger's, and never internal staff-only content.
Full scenario: The Running Example
Our First Chunker
Note
Python is the lingua franca of AI/ML - so why is most of this TypeScript?
- For those who want to try this themselves, getting Node running locally or on a hosted service is generally smoother than navigating Python environments, pip conflicts, and virtual environments. If you have lost hours to a Python setup issue, you know what I mean!
- The SDKs are first-class. Claude and ChatGPT have official TypeScript SDKs, and MCP has native TypeScript support: TypeScript is a supported, idiomatic path.
- TypeScript's enforcement of types makes the data structures visible. This helps me show you the shape of what is flowing through a system. You can easily see what a vector looks like, what a tool call returns, and what metadata attaches to a chunk.
- As we progress through the topics, we will develop a full application, and TypeScript (and Node) gives us a solid base.
Throughout the series I have written the code in a way that is easy to understand rather than syntactically optimal - e.g.:
- Avoiding certain TypeScript patterns, even where they could reduce code repetition, so each example stays self-contained and easier to read.
- Minimal error handling.
- Avoiding code shortcuts.
I wrote a lot about chunking previously, but at its core, it can be as simple as a regular expression wrapped around some file operations! I would certainly not recommend using this, as it ignores token counts and metadata completely, but we'll start very simple and build it up.
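To make that concrete, here is a minimal sketch of the idea (a hypothetical helper, not the code from the repo): split a Markdown string at every H2 heading and nothing more.

```typescript
// Hypothetical minimal chunker: one chunk per H2 section.
// No token counting, no metadata - just a regex split.
function naiveChunk(text: string): string[] {
  return text
    .split(/^(?=## )/m)              // split at each line starting with "## "
    .map((chunk) => chunk.trim())
    .filter((chunk) => chunk.length > 0);
}
```

This is everything the simplest chunker needs, which is exactly why it falls over later: nothing constrains chunk size, and nothing records where a chunk came from.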
- The Code Explorer contains all the code for the chunking examples.
- The Playground allows you to execute the examples and see the output - no need to install anything locally!
- The Examples walk you through each chunking script and build on the concepts.
Note
You can find all the code on Github
The documents our airline uses can be found in the Code Playground under data/policies/simple_example. For this article, we are sticking to a simple set of Markdown documents. Of course, we will also have PDFs, Word documents, HTML on web pages, notes from customer service cases, and many more sources to ingest. That goes outside the scope of this document, but we would typically build integrations to pull these types of data in and parse (read out) the text for our chunker.
Remember
Markdown is a very commonly used format for FAQs, manuals, help pages, etc. as it is easy for these systems to consume and understand.
Markdown uses simple strings of characters to mark sections, and so on - perfect for our chunker! Some quick examples:
- A subsection header looks like this: `## Lost Bags Policy`
- A sub-subsection header under 'Lost Bags Policy' looks like this: `### Lost Bags Compensation`
- A bullet looks like this: `- Send confirmation to the booking's main email address`
Code Explorer
Remember
This series provides working examples to demonstrate the core concepts in a clear way. This is not production-ready code.
All the code used for our chunking examples
# AI For the Curious: RAG Tools
Companion code for the **[AI For the Curious](https://sambessey.com/articles)** series on [sambessey.com](https://sambessey.com).
This repo covers the **Chunking for RAG** article: how to break long documents into smaller pieces ready for embedding into a vector database, and how the strategy you pick affects retrieval quality.
---
## What is chunking and why does it matter?
Before an LLM can answer questions about your documents, those documents need to be turned into vectors and stored in a database. The problem: you can't embed a 50-page PDF as one blob. You slice it into chunks first, embed each chunk, and store them individually.
The chunks you retrieve at query time are what the LLM actually reads. Get chunking wrong and you get:
- **Missed answers** (the relevant content was split across a boundary)
- **Noisy answers** (too much irrelevant text crammed into one chunk)
- **Broken context** (a sentence retrieved mid-thought, with no surrounding information to anchor it)
Chunking strategy directly controls recall, precision, and coherence.
---
## The running example
The code works on a synthetic airline policy corpus: eight internal documents covering topics like irregular operations, fee waivers, loyalty tier entitlements, and passenger rights. Each document includes structured metadata fields (Policy ID, Owner, Audience, Sensitivity, Topics) modelled on how real enterprise knowledge bases are tagged.
The corpus lives in `data/policies/simple_example/`.
---
## Three chunkers, iterated
The code is structured as three progressively more capable chunkers. Each one builds on the last.
### chunker-1: structure-first splitting
Splits the document at every H2 heading (`##`). One section, one chunk. No metadata. No size control.
This is the simplest possible approach and the right starting point for understanding the problem. It fails when sections are wildly different lengths.
### chunker-2: adding metadata
Same H2-based split, but now each chunk carries:
- **headingPath**: the chain of headings leading to this section (document title + H2 heading)
- **metadata**: structured fields extracted from the policy header block (policy ID, owner, classification, audience, sensitivity, topics, and more)
A chunk without metadata is just text. A chunk with metadata becomes addressable: you can filter by audience, sensitivity level, or policy owner before the LLM ever sees it.
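As a sketch, the shape flowing out of chunker-2 looks something like this (field names are illustrative, modelled on the description above rather than copied from the repo):

```typescript
// Illustrative shape of a chunk once heading path and metadata are attached.
interface PolicyChunk {
  text: string;            // the section body
  headingPath: string[];   // document title + H2 heading
  metadata: {
    policy_id: string;
    owner: string;
    audience: string[];    // e.g. ["agent"] or ["customer"]
    sensitivity: string;   // e.g. "INTERNAL" or "PUBLIC"
    topics: string[];
  };
}

// With metadata in place, chunks can be filtered before retrieval results
// ever reach the model, e.g. keeping only customer-visible content:
const customerSafe = (c: PolicyChunk) => c.metadata.audience.includes("customer");
```

The point is that the filter is ordinary data logic, not a judgment call delegated to the LLM.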
### chunker-3: token budget and overlap
Extends chunker-2 with two important mechanics:
**Token budget**: if a section exceeds the target token count (default: 500), it gets split into smaller sub-chunks at paragraph boundaries. Splitting on blank lines rather than raw character positions keeps paragraphs, list items, and table rows intact.
**Overlap**: the tail of each chunk is copied to the start of the next. This prevents a sentence or key fact from being stranded at a boundary where neither the preceding nor the following chunk contains enough context to understand it.
Token estimation uses a simple 4-chars-per-token approximation. The comments in the code explain how to swap in a real tokenizer (tiktoken, `@anthropic-ai/tokenizer`) for production use.
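That approximation is small enough to show inline (a sketch of the idea; the repo's actual helper may differ):

```typescript
// Rough token estimate: ~4 characters per token works reasonably for English.
// Swap in a real tokenizer (e.g. tiktoken) when accuracy matters.
function estimateTokens(text: string): number {
  return Math.ceil(text.length / 4);
}

// A 500-token budget therefore corresponds to roughly 2,000 characters.
```

The heuristic is fine for deciding where to split; it is not accurate enough for anything billing-sensitive.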
---
## Running the code
**Prerequisites:** Node.js 18+ and npm (the examples run TypeScript directly via `npx tsx`).
```bash
npm install
```
Run a chunker against one of the sample documents:
```bash
npx tsx index.ts chunker-1 10_irrops_reaccommodation.md
npx tsx index.ts chunker-2 10_irrops_reaccommodation.md
npx tsx index.ts chunker-3 10_irrops_reaccommodation.md
```
The first argument is the chunker (`chunker-1`, `chunker-2`, `chunker-3`). The second is any file in `data/policies/simple_example/`.
Output is printed with `console.dir` so nested objects are shown in full.
---
## File structure
```
.
├── chunking/
│   ├── chunker-1.ts   # H2-based split, minimal output
│   ├── chunker-2.ts   # Adds heading path and metadata extraction
│   └── chunker-3.ts   # Adds token budget enforcement and overlap
├── data/
│   └── policies/
│       └── simple_example/
│           ├── 00_readme.md
│           ├── 10_irrops_reaccommodation.md
│           ├── 20_fee_waivers_and_fare_rules.md
│           ├── 30_partner_airline_guidelines.md
│           ├── 40_service_recovery_vouchers.md
│           ├── 50_loyalty_tier_entitlements.md
│           ├── 60_flight_disruption_faq.md
│           ├── 70_passenger_rights_and_compensation.md
│           └── contents.md
├── index.ts           # Entry point: picks the chunker and file from CLI args
└── package.json
```
---
## Part of a series
This repo sits inside the RAG track of the *AI For the Curious* series:
| Article | What it covers |
|---|---|
| Primer: Databases | The main database technologies and how they relate to LLMs |
| RAG and MCP 101 | A quick introduction to two key concepts underpinning the series |
| pgvector Setup for Airline RAG | Setting up pgvector locally and running similarity queries |
| **Chunking for RAG** | **This repo: breaking documents into chunks ready for embedding** |
Read the full article at [sambessey.com/articles](https://sambessey.com/articles).
Code Playground
See each example come to life in your browser
Examples Explained
Expand each section for a walkthrough of what is happening
Example 1 - Simple Chunking
Files to Explore
In the Code Explorer above, review the three main files used in this example:
- `data/policies/simple_example` contains all of our airline policy documents.
- `index.ts` is the script we run; it calls our chunker and logs the output. It takes in the file to be chunked.
- `chunking/chunker-1.ts` is the actual chunking function we run to split the document into chunks.
Click the Execute button above and run example 1.
This command is run:
`npx tsx index.ts chunker-1 10_irrops_reaccommodation.md`
- It uses `npx` to execute the contents of `index.ts`, using the chunker `chunker-1`, and reads the file you pass in (a Markdown file).
- It takes the input file and chunks it at each logical break (in Markdown, indicated by `##`).
This output gives us one chunk per section - but this is no good for embedding as there is no context about these chunks:
- Where did it come from?
- Is it safe to show the customer?
- Does it apply to customers globally or by geo?
Example 2 deals with Metadata to help us solve this.
Example 2 - Metadata
Files to Explore
In the Code Explorer above, review chunker-2 - everything else is the same.
The chunker is almost the same, but now attaches metadata to each chunk. The chunker might read this from the file headers or frontmatter, or pull the title by looking for `#` in the Markdown.
Here are some of the fields we'll create. This was discussed above, but fields like classification are vital here - they prevent us from returning internal-facing data to customers in our responses. The LLM doesn't make this decision; the metadata does.
{
policy_id: 'OPS-PARTNER-030',
version: '1.0',
effective: '2026-01-01',
region: 'Global',
owner: 'Alliances & Partnerships',
classification: 'INTERNAL-OPS',
audience: ['agent'],
sensitivity: 'INTERNAL',
topics: ['partner rebooking', 'interline', 'alliances', 'IRROPS'],
applies_to: ['all passengers']
}
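That gating can be expressed as a plain filter over the metadata, run before retrieval results ever reach the model. A sketch, using field names from the example above (the predicate itself is hypothetical, not code from the repo):

```typescript
// Hypothetical pre-LLM gate: a customer-facing agent only ever sees chunks
// whose metadata explicitly allows the customer audience and is not
// internal-only.
type ChunkMeta = { classification: string; audience: string[] };

function allowedForCustomer(meta: ChunkMeta): boolean {
  return meta.audience.includes("customer") && meta.classification !== "INTERNAL-OPS";
}
```

The chunk above, with `classification: 'INTERNAL-OPS'` and an agent-only audience, would never pass this filter, regardless of how well it matches a customer's query.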
Run chunker-2. Looks great, we can see metadata against our chunks!
Now run chunker-2-long-content... We have a problem!
This chunks 70_passenger_rights_and_compensation.md.
It has six distinct sub-sections covering delay compensation thresholds, cancellation entitlements, denied boarding, tarmac delay procedures, force majeure, and escalation. Each covers completely different agent obligations, but they all sit under one ## heading, so chunker-2 treats them as a single chunk. It's about 1,500 tokens: a lot of text.
Consider the query "what compensation does a passenger get for a 3-hour domestic delay?".
- We have a precision problem. The retrieval layer finds this chunk, and your LLM receives the delay threshold table it needed along with the IDB compensation scale, the four-step cancellation procedure, tarmac delay timelines, and a list of force majeure conditions. The answer might still be correct, but it arrives with enormous noise.
- The recall problem is subtler. An over-large chunk competes poorly against more focused ones. A specific query about denied boarding might rank a tightly scoped chunk on that topic higher than this catch-all section, and miss it entirely.
Example 3 fixes this: chunker-3 enforces a 500-token limit and adds 15% overlap, so that massive section gets split into six coherent chunks instead of one bloated one.
Example 3 - Token Boundaries
Files to Explore
In the Code Explorer above, review chunker-3 - everything else is the same.
The chunker has changed quite significantly: it now adds a token budget and overlap. This breaks long sections into shorter pieces to help recall, and repeats part of each chunk at the start of the next.
In this example, we are passing in a custom token length (500), and overlapping 15% (75 tokens). You can see this on line 18 of `index.ts`: `console.dir(chunkPolicyMarkdown(fileContents, 500, 75), { depth: null });`. This will be important in a future article when we test precision and recall against our embedded chunks.
Remember
Good middle ground values to start testing are around:
- 500 tokens per chunk
- 15% (75 tokens) overlap
In the output of chunker-3 above (or in the screenshot below), you can see our overlaps and token length in action. Notice how the text from the end of chunk 5 appears in chunk 6.
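A simplified sketch of how that overlap can be produced (the real chunker-3 works on paragraph boundaries and token estimates; here one word stands in for one token to keep the idea visible):

```typescript
// Simplified overlap: copy the last `overlapWords` words of each chunk
// onto the start of the next, so boundary sentences keep their context.
function addOverlap(chunks: string[], overlapWords: number): string[] {
  return chunks.map((chunk, i) => {
    if (i === 0) return chunk; // nothing precedes the first chunk
    const tail = chunks[i - 1].split(/\s+/).slice(-overlapWords).join(" ");
    return `${tail} ${chunk}`;
  });
}
```

Running it over two chunks with a two-word overlap shows the tail of the first chunk repeated at the head of the second - the same effect you can see between chunks 5 and 6 in the output.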
I have also added code to ensure tables are chunked cleanly.
That wraps up the Chunking mini-series. The chunker output is ready to hand off to the embedding step.
Next week: The Embedding series: what embedding models do, how vectors are produced from your chunks, and how to choose the right model for the job.