Embedding in Practice

A working example with some basic retrieval tests

Code Examples · Article 11 of 11 · ~10 min read

Recap

What we're building

Our fictitious airline is looking to build a world-class service agent.

There are agents for both airline staff and customers. Depending on who is asking, different information is surfaced.

The agent(s) must be:

  • Grounded in knowledge from the thousands of documents the airline holds: FAQs, PDFs, knowledge bases, and more.
  • Connected to structured data stores including booking systems, scheduling, inventory, loyalty, CDP, and CRM.

The agent(s) should return accurate, helpful answers quickly, but must gate what is shown to whom. A passenger can only access their own data, not another passenger's, and never internal staff-only content.

Full scenario: The Running Example

Our First Embedding

We use OpenAI's text-embedding-3-small for this demonstration. It is a good fit for the small amount of data we have. For an enterprise deployment, we would evaluate a larger model, such as text-embedding-3-large.

  • The Code Explorer contains all the code for the embedding examples.
  • The Playground allows you to execute the examples and see the output. There is no need to install anything locally!
  • The Examples walk you through each embedding script, each one building on the concepts of the last.

Note

You can find all the code on GitHub.

The documents our airline uses can be found in the Code Playground under data/policies/simple_example. For this article, we are sticking to a simple set of Markdown documents. A real deployment would also ingest PDFs, Word documents, HTML from web pages, notes from customer service cases, and many other sources, but those are outside the scope of this article.

Note

If you have not yet set up a pgvector database, do it now! This article walks through how to do it in a few simple steps.

Setting up OpenAI

To use OpenAI's embedding model, you will need an API account loaded with a few dollars of credit.

  1. Create an account or log in at OpenAI API.
  2. Add some credit ($5 is enough to run several embeddings).
  3. Click 'API Keys' in the left menu, then 'Create New Secret Key' to create a new secret key.
  4. Note down your key for the next steps.
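
With a key in hand, a request to the embeddings endpoint can be sketched roughly as below. This is a minimal sketch rather than the code from the repository. It assumes Node 18+ (for the global fetch) and that the key is exported as the environment variable OPENAI_API_KEY. The helper names buildEmbeddingRequest and embedText are made up for illustration.

```typescript
// Minimal sketch, not the repository's code.
// Assumptions: Node 18+ (global fetch), the key is exported as OPENAI_API_KEY,
// and the helper names buildEmbeddingRequest / embedText are hypothetical.
const EMBEDDING_MODEL = "text-embedding-3-small";
const EMBEDDINGS_URL = "https://api.openai.com/v1/embeddings";

// The small part of fetch this sketch relies on, so a fake can be passed in tests.
type FetchLike = (
  url: string,
  init: { method: string; headers: Record<string, string>; body: string }
) => Promise<{ ok: boolean; status: number; json(): Promise<unknown> }>;

// Builds the HTTP request for a single piece of text.
function buildEmbeddingRequest(apiKey: string, text: string) {
  return {
    url: EMBEDDINGS_URL,
    init: {
      method: "POST",
      headers: {
        "Content-Type": "application/json",
        Authorization: `Bearer ${apiKey}`,
      },
      body: JSON.stringify({ model: EMBEDDING_MODEL, input: text }),
    },
  };
}

// Sends the request and returns the embedding for the text (one array of numbers).
async function embedText(
  text: string,
  apiKey: string,
  fetchImpl: FetchLike
): Promise<number[]> {
  const { url, init } = buildEmbeddingRequest(apiKey, text);
  const res = await fetchImpl(url, init);
  if (!res.ok) {
    throw new Error(`Embedding request failed with HTTP ${res.status}`);
  }
  const payload = (await res.json()) as { data: { embedding: number[] }[] };
  return payload.data[0].embedding;
}

// Usage (makes a real, billable API call):
//   const vector = await embedText("hello", process.env.OPENAI_API_KEY ?? "", fetch);
//   console.log(vector.length);
```

Reading the key from the environment keeps it out of source control. Passing fetch in as a parameter is a design choice for this sketch only, so the function can be exercised without touching the network.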

Code Explorer

Remember

This series provides working examples to demonstrate the core concepts in a clear way. This is not production-ready code.

All the code used for our embedding examples can be found below.

The Chunking Script

We will use chunker-3 from the last exercise to chunk each document in our corpus. We will then pass each chunk to an embedding script.

chunking/chunker-3.ts


// chunker-3 builds on chunker-2 by adding two things:
// 1. A token budget: sections that exceed the target are split into smaller chunks
// 2. Overlap: the end of each chunk is repeated at the start of the next, so context
//    isn't lost at a boundary
import type { Chunk } from "../types/chunk";

// Record<string, any> is a TypeScript generic type meaning "an object whose keys are strings
// and whose values can be anything". We use it here because metadata fields vary by document.

// --- Chunking parameters ---
// targetTokens: the maximum number of tokens we want in a single chunk.
//   We use 400 rather than the 500 mentioned in the article as our hard ceiling.
//   The overlap (60 tokens) is added on top, so an emitted chunk can reach ~460 tokens
//   in practice. Keeping the content budget at 400 leaves headroom without wasting space.
//
// overlapTokens: how many tokens of the previous chunk we copy to the start of the next.
//   60 tokens (~240 characters) is roughly 2-3 sentences — enough to preserve context
//   at a boundary without significantly inflating chunk size.
//
// tokenLength: our character-to-token approximation (4 chars per token).
//   This is a rough rule of thumb for English text. In production you would use the
//   actual tokenizer for your embedding model (e.g. tiktoken for OpenAI models) to
//   get an exact count. For this example the approximation is close enough.
const def_targetTokens = 400;
const def_overlapTokens = 60;
const def_tokenLength = 4; // avg chars per token for English text

function estimateTokens(s: string): number {
  // In production, replace this with a real tokenizer for your embedding model.
  // tiktoken (OpenAI) and @anthropic-ai/tokenizer are both good options.
  return Math.ceil(s.length / def_tokenLength);
}

// Reads the document-level metadata out of the policy header block.
// Each field is marked up in bold in the markdown, e.g. **Policy ID:** LOYALTY-050
function extractDocMetadata(md: string) {

  // Match the top-level # heading to get the document title.
  // ?.[1] is optional chaining — it safely accesses index [1] of the match result,
  // returning undefined instead of throwing if match() found nothing.
  // ?? "" is nullish coalescing — it means "use this value, or "" if it's null/undefined".
  const docTitle = (md.match(/^#\s+(.+)$/m)?.[1] ?? "").trim();

  // A helper that extracts a single metadata field by its label.
  function pick(label: string): string {
    const re = new RegExp(`^\\*\\*${label}:\\*\\*\\s*(.+)\\s*$`, "mi");
    return (md.match(re)?.[1] ?? "").trim();
  }

  // A helper for fields that contain comma-separated lists, like Topics or Audience.
  // .map(s => s.trim()) strips whitespace from each item in the array.
  // .filter(Boolean) removes any empty strings that might result from trailing commas.
  function pickArray(label: string): string[] {
    const raw = pick(label);
    return raw ? raw.split(",").map(s => s.trim()).filter(Boolean) : [];
  }

  return {
    doc_title: docTitle,
    policy_id: pick("Policy ID"),
    version: pick("Version"),
    effective: pick("Effective"),
    region: pick("Region"),
    owner: pick("Owner"),
    classification: pick("Classification"),
    audience: pickArray("Audience"),
    sensitivity: pick("Sensitivity"),
    topics: pickArray("Topics"),
    applies_to: pickArray("Applies To"),
  };
}

// Splits a single section into one or more chunks, respecting the token budget.
// If the section fits within targetTokens it is returned as-is.
// If it is too large we walk through it block by block (splitting on blank lines),
// emitting a new chunk whenever the next block would push us over the limit.
// The overlap from the previous chunk is prepended to each new chunk so that
// a sentence or table row that straddles a boundary appears in both chunks.
// Note: overlap is calculated from the fresh content only — it does not compound
// across multiple splits, which would cause the chunks to grow unboundedly.
function* splitIntoTokenBudgetChunks(
  content: string,
  headingPath: string[],
  metadata: Chunk["metadata"],
  targetTokens: number,
  overlapTokens: number
): Generator<Chunk> {
  const tokens = estimateTokens(content);

  // Fast path: section is already within budget
  if (tokens <= targetTokens) {
    yield { chunkIndex: 0, headingPath, content, metadata };
    return;
  }

  // Split on blank lines rather than at a raw character position.
  // This keeps paragraphs, list items, and table rows intact — cutting
  // mid-sentence would produce incoherent chunks and hurt embedding quality.
  const blocks = content.split(/\n\n+/).filter(b => b.trim());

  let chunkCount = 0;
  let currentChunk: string[] = [];
  let currentTokens = 0;
  let overlapContent = "";

  for (const block of blocks) {
    const blockTokens = estimateTokens(block);

    // This block would push us over budget — emit what we have and start fresh
    if (currentTokens + blockTokens > targetTokens && currentChunk.length > 0) {
      const chunkContent = currentChunk.join("\n\n");
      yield {
        chunkIndex: chunkCount++,
        headingPath,
        content: overlapContent + chunkContent,
        metadata,
      };

      // Carry the tail of this chunk forward as overlap for the next one.
      // We take the last ~60 tokens worth of characters from the fresh content
      // (not including the overlap we prepended) so that overlap doesn't compound.
      const overlapChars = Math.floor(overlapTokens * def_tokenLength);
      overlapContent = chunkContent.slice(-overlapChars);
      if (overlapContent && !overlapContent.endsWith("\n\n")) {
        overlapContent += "\n\n";
      }

      // A table row without its column headers is meaningless — an LLM reading
      // "| $300 | $400 |" with no header has no idea what those values represent
      // and may hallucinate column names. Tables are self-contained units, so
      // if the overlap slice landed inside a table, discard it rather than
      // carry misleading context into the next chunk.
      if (overlapContent.trimStart().startsWith("|")) {
        overlapContent = "";
      }

      // Start the new chunk's token count from the overlap we're carrying forward
      currentChunk = [];
      currentTokens = estimateTokens(overlapContent);
    }

    currentChunk.push(block);
    currentTokens += blockTokens;
  }

  // Emit whatever is left in the final chunk
  if (currentChunk.length > 0) {
    yield {
      chunkIndex: chunkCount,
      headingPath,
      content: overlapContent + currentChunk.join("\n\n"),
      metadata,
    };
  }
}

// chunkPolicyMarkdown is our main export.
// It splits the document on H2 headings (##), extracts metadata from the header block,
// and then passes each section through splitIntoTokenBudgetChunks.
// Sections that are already under the token budget come through unchanged.
// Sections that are over budget get split into multiple chunks with overlap.
export function* chunkPolicyMarkdown(
  md: string,
  targetTokens = def_targetTokens,
  overlapTokens = def_overlapTokens
): Generator<Chunk> {
  const meta = extractDocMetadata(md);
  const docTitle = meta.doc_title || "Untitled";

  // Split on H2 headings and drop anything before the first one (the metadata header block).
  // That content has already been captured in `meta` above.
  const parts = md.split(/\n(?=##\s+)/g).filter(s => s.trim().startsWith("## "));

  let globalIndex = 0;

  for (let i = 0; i < parts.length; i++) {
    const section = parts[i].trimEnd() + "\n";
    const h2 = section.match(/^##\s+(.+)$/m)?.[1]?.trim() ?? `Section ${i}`;
    const headingPath = [docTitle, h2];

    // Iterate the inner generator, forwarding each chunk as it is produced.
    // We use for...of rather than yield* so that we can re-index each chunk and
    // keep chunkIndex sequential across the whole document.
    for (const chunk of splitIntoTokenBudgetChunks(section, headingPath, meta, targetTokens, overlapTokens)) {
      yield { ...chunk, chunkIndex: globalIndex++ };
    }
  }
}

Code Playground

See each example come to life in your browser

$ npx tsx index.ts embed_one_fixed

Examples Explained

Expand each section for a walkthrough of what is happening

Example 1: Embed One Fixed

Files to Explore

In the Code Explorer above, review the three main files used in this example:

  • data/policies/simple_example contains all of our airline policy documents.
  • index.ts is the script we run. It takes the name of the file to process, calls our chunker, and logs the output.
  • chunking/chunker-1.ts is the actual chunking function we run to split the document into chunks.

Click the Execute button above and run example 1.

This command is run:

npx tsx index.ts embed_one_fixed 10_irrops_reaccommodation.md

  1. npx runs tsx, which executes index.ts with the embed_one_fixed embedder. The file argument does not matter here, because the content is hard-coded in the embedding file just to show the concept.

  2. The value of chunk in embed_one_fixed is a 'chunk' from a larger document. It contains metadata and the actual text to be embedded.

  3. embedOne is called. It invokes the OpenAI embedder and awaits the result, which is a single vector: an array of 1536 numbers.

  4. We log the results so you can see what happened.

  5. We write the result to our Neon database.
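
The five steps above can be sketched as follows. This is a hedged illustration, not the contents of embed_one_fixed. The FixedChunk shape, the placeholder chunk text, the chunks table and its column names, and the Embedder and QueryFn parameters are all assumptions made for the example. The one pgvector detail it relies on is that a vector can be passed as a bracketed text literal such as '[0.1,0.2]' and cast with ::vector.

```typescript
// Sketch only. The chunk, table name, and column names below are hypothetical.
type Embedder = (text: string) => Promise<number[]>;
type QueryFn = (sql: string, params: unknown[]) => Promise<unknown>;

interface FixedChunk {
  headingPath: string[];
  content: string;
  metadata: Record<string, unknown>;
}

// Step 2: a hard-coded chunk standing in for one piece of a larger document.
const chunk: FixedChunk = {
  headingPath: ["Example Policy", "Example Section"],
  content: "## Example Section\n\nPlaceholder policy text to embed.",
  metadata: { policy_id: "EXAMPLE-001" },
};

// pgvector accepts a vector as a bracketed, comma-separated text literal.
function toPgVector(vector: number[]): string {
  return `[${vector.join(",")}]`;
}

// Steps 3 to 5: embed the text, log what came back, and store it.
async function embedOne(
  c: FixedChunk,
  embed: Embedder,
  query: QueryFn
): Promise<number[]> {
  const vector = await embed(c.content);
  console.log(`Embedded "${c.headingPath.join(" > ")}": ${vector.length} dimensions`);
  await query(
    "INSERT INTO chunks (heading_path, content, metadata, embedding) " +
      "VALUES ($1, $2, $3, $4::vector)",
    [c.headingPath.join(" > "), c.content, JSON.stringify(c.metadata), toPgVector(vector)]
  );
  return vector;
}
```

In a real run, embed would wrap the OpenAI call and query would wrap a Postgres client's parameterized query method (for example node-postgres), pointed at the Neon database.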

Example 2 expands this to call our chunker as part of the loop and embed an entire document.

Example 2: Embed All

Files to Explore

This time we embed real chunks from the file given on the command line, rather than a fixed chunk. embed_all.ts consumes the chunks passed from index.ts and, in the same way as above, awaits the vectors from OpenAI before writing them to the Neon database.

Run npx tsx index.ts embed_all 10_irrops_reaccommodation.md to embed one file.

Run npx tsx index.ts embed_all to embed everything in the simple_example directory.
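
When embedding a whole directory, one option is to send chunks in batches, because the OpenAI embeddings endpoint accepts an array of inputs in a single request. The sketch below shows that idea only. Whether embed_all.ts batches its requests is not stated here, and the names toBatches and embedAll, the BatchEmbedder type, and the batch size of 64 are all assumptions.

```typescript
// Sketch only: hypothetical helpers, not the repository's embed_all.ts.
type BatchEmbedder = (texts: string[]) => Promise<number[][]>;

// Splits a list into consecutive groups of at most `size` items.
function toBatches<T>(items: T[], size: number): T[][] {
  if (size < 1) throw new Error("size must be at least 1");
  const batches: T[][] = [];
  for (let i = 0; i < items.length; i += size) {
    batches.push(items.slice(i, i + size));
  }
  return batches;
}

// Embeds every chunk, one batch at a time, and pairs each chunk with its vector.
async function embedAll<T extends { content: string }>(
  chunks: T[],
  embedBatch: BatchEmbedder,
  batchSize = 64 // arbitrary choice for illustration
): Promise<(T & { vector: number[] })[]> {
  const embedded: (T & { vector: number[] })[] = [];
  for (const batch of toBatches(chunks, batchSize)) {
    const vectors = await embedBatch(batch.map((c) => c.content));
    if (vectors.length !== batch.length) {
      throw new Error("Embedder returned a different number of vectors than inputs");
    }
    batch.forEach((c, i) => embedded.push({ ...c, vector: vectors[i] }));
  }
  return embedded;
}
```

Awaiting each batch before sending the next keeps the example simple and makes it easy to see which request failed if one does.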

Example 3: Retrieve Test

This will be the first test of our embedding.

Files to Explore

Explore retrieve/embed_query.ts and retrieve/retrieve_test.ts

  • retrieve/retrieve_test.ts takes a query from the command line and embeds it (using retrieve/embed_query.ts) with the same embedding model we used for the documents.
  • The query embedding (a single vector) is then used to retrieve the chunks whose vectors are most similar to it.
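
The retrieval step can be sketched as a pgvector query. pgvector's <=> operator returns cosine distance, so 1 minus that value gives cosine similarity. The sketch below only builds the query. The function name buildSimilarityQuery, the chunks table, and its column names are assumptions, not the contents of retrieve/retrieve_test.ts.

```typescript
// Sketch only: table and column names are hypothetical.
// pgvector's <=> operator is cosine distance; similarity = 1 - distance.
function buildSimilarityQuery(queryVector: number[], topK = 5) {
  return {
    text: [
      "SELECT heading_path, content, 1 - (embedding <=> $1::vector) AS similarity",
      "FROM chunks",
      "ORDER BY embedding <=> $1::vector",
      "LIMIT $2",
    ].join("\n"),
    // The query vector is passed as a bracketed text literal and cast in SQL.
    values: [`[${queryVector.join(",")}]`, topK] as [string, number],
  };
}
```

Ordering by distance ascending puts the most similar chunk first. The text and values pair can be handed to a Postgres client as a parameterized query, which avoids building SQL from user input.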

Note

The question I asked, "what happens if the airline cancels my trip", intentionally shares few words with any of the chunks we embedded. Remember, we are testing semantic similarity, not pattern matching.

The similarity score here is cosine similarity: the cosine of the angle between two vectors in 1536-dimensional space. A score of 1.0 (100%) means the vectors point in exactly the same direction, which we read as semantically identical. A score of 0.0 means the vectors are orthogonal, which we read as unrelated.
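
To make the score concrete, here is a small sketch of cosine similarity computed by hand: the dot product of the two vectors divided by the product of their lengths. In the examples the database computes this for us, so the function name cosineSimilarity is only for illustration.

```typescript
// Cosine similarity: dot(a, b) / (|a| * |b|).
// Illustrative helper; the retrieval examples let the database do this.
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) {
    throw new Error("Vectors must have the same number of dimensions");
  }
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  if (normA === 0 || normB === 0) {
    throw new Error("Cosine similarity is undefined for a zero vector");
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}
```

Two vectors pointing the same way score 1 regardless of their lengths, perpendicular vectors score 0, and vectors pointing in opposite directions score -1.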

You can run the code above to see what would be retrieved. Here are the first few lines of the top chunk:

1. [0] Flight Disruption FAQ > My Flight Was Cancelled — What Happens Next?

Similarity: 61.78%

Content: ## My Flight Was Cancelled — What Happens Next?

61.78% is actually a solid result for RAG, considering that the query and the top chunk use different words for the same meaning.

  • Cosine similarity in high-dimensional embedding space rarely approaches 100%, even for near-identical text, because the model encodes subtle differences in meaning across all 1536 dimensions.
  • The query is conversational ("what happens if the airline cancels my trip") and the chunk is formal policy text: same meaning, different register, which costs some similarity.
  • In practice, for text-embedding-3-small on domain-specific content, you would typically interpret scores roughly as:
      • 65%+: strong topical match, almost certainly relevant
      • 55-65%: related but possibly a tangential section
      • below 50%: likely noise, the model is reaching
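
If you want to apply these rough bands in code, a sketch might look like the following. The function name interpretScore and the band labels are made up for illustration. The bands above do not say how to treat scores between 50% and 55%, so this sketch labels that range "unclassified" as an assumption. Treat the cut-offs as rules of thumb to tune against your own data, not as fixed thresholds.

```typescript
// Sketch only: maps a cosine similarity (0 to 1) onto the rough bands above.
// The "unclassified" label for the 50-55% range is an assumption of this sketch.
type ScoreBand = "strong" | "related" | "unclassified" | "noise";

function interpretScore(similarity: number): ScoreBand {
  if (similarity >= 0.65) return "strong"; // strong topical match
  if (similarity >= 0.55) return "related"; // related, possibly tangential
  if (similarity >= 0.5) return "unclassified"; // range not covered by the bands
  return "noise"; // likely noise
}
```

For the retrieval result shown earlier, a similarity of 0.6178 falls into the "related" band.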

Our embedding worked, and the results look good, but there is a lot more we can do to fully test retrieval and, of course, much we can optimise. Part 3: Retrieval is coming soon.
