Embedding for RAG
11 min read
How we make our chunks searchable and usable by our LLM.
Recap
What we're building
Our fictitious airline is looking to build a world-class service agent.
There are agents for both airline staff and customers. Depending on who is asking, different information is surfaced.
The agent(s) must be:
- Grounded in knowledge from the thousands of documents the airline holds: FAQs, PDFs, knowledge bases, and more.
- Connected to structured data stores including booking systems, scheduling, inventory, loyalty, CDP, and CRM.
The agent(s) should return accurate, helpful answers quickly, but must gate what is shown to whom. A passenger can only access their own data, not another passenger's, and never internal staff-only content.
Full scenario: The Running Example
A chunk sitting in a database is just text. A keyword search for "delayed baggage" won't find a chunk titled "Lost Luggage Reimbursement". They share no words, even though they cover the same ground. Embedding converts each chunk into a vector that represents its meaning, so that semantically similar content ends up close together and can be found that way.
- Chunks are generated from our corpus (source content) by our chunker.
- Those chunks are passed into the embedder and a vector is produced.
- Each vector (and in most cases the associated chunk) is committed to a vector database.
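The three steps above can be sketched in a few lines of Python. Everything here is illustrative: `chunk_document` is a naive paragraph splitter, `embed` is a stub standing in for a real embedding API call, and the "vector database" is just a list.

```python
import hashlib

def chunk_document(text: str) -> list[str]:
    # Naive chunker: one chunk per paragraph (real chunkers are smarter).
    return [p.strip() for p in text.split("\n\n") if p.strip()]

def embed(chunk: str, dims: int = 8) -> list[float]:
    # Stub embedder: a deterministic fake vector derived from a hash.
    # In production this would be a call to OpenAI, Voyage, Gemini, etc.
    digest = hashlib.sha256(chunk.encode()).digest()
    return [b / 255 for b in digest[:dims]]

vector_db: list[dict] = []  # stand-in for pgvector, Pinecone, and friends

def index(text: str) -> None:
    # Chunk, embed, and commit vector + chunk together.
    for chunk in chunk_document(text):
        vector_db.append({"chunk": chunk, "vector": embed(chunk)})

index("Baggage allowance is 23kg per passenger.\n\nExcess luggage incurs a fee.")
```

The only structural point to take away is that the chunk and its vector are stored side by side, so a similarity hit can hand the original text straight to the LLM.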
Remember
The vector represents the chunk's meaning, so chunks about similar concepts will have vectors that sit only a small distance apart in the vector database. Judging which chunks are indeed "similar" is where embedding models come into their own, and is discussed in this article. An embedding model might consider chunks on the following topics to all be similar, and assign them similar vectors:
- Baggage allowances
- Excess luggage
- Oversized sports equipment
- Prohibited items in the hold
What Exactly is Embedding?
Put simply, it is the process of taking a chunk and producing a vector from it using an embedding model, and then writing this to a vector database. OpenAI, Voyage AI, and many others offer embedding models.
What is a Vector?
A vector has two properties: its magnitude (length) and its direction (where it points). Think of it as an arrow: the length of the arrow is its magnitude, and the way the arrowhead points is its direction.
For the purposes of embedding, a vector is a list of numbers that describe a chunk. Rather than having three dimensions like a vector in our physical world, a vector produced by an embedder might have hundreds, or thousands of dimensions.
The embedding model itself is trained to recognise semantic similarity and map it to similar vectors. For example, it understands that 'luggage' and 'bags' are almost the same thing, so chunks containing those words will have higher vector proximity, meaning a lower distance or a smaller angle. The reverse is also true: if our chunks around 'luggage' and 'bags' diverge to cover 'lost luggage compensation' and 'where to reclaim an oversized bag on arrival', the embedding model will assign those chunks more distant vectors, reducing their proximity.
Understanding Distance
Once everything is represented as vectors, closeness is a distance/similarity function. As a reminder, chunks with a similar semantic meaning (even if they do not share any common words) are likely to have similar vectors.
There are several different ways of defining distance. More detail can be found in the pgvector setup article; here is a quick summary.
Here is a very simplified visual model:
This might be the vectors for two chunks whose subject matter drifts into a related tangent ('Permanently lost baggage' vs 'Compensation for delayed baggage'). Depending on how you measure 'distance', you could say:

- 'The distance between these points is moderate.' (I only care about distance)
- 'They're moderately similar, but not a great match.' (I care more about angle, less about distance)

Now picture two arrows pointing in almost the same direction, where one is much longer than the other. Are they similar?

- 'These are not similar at all, they are very far apart.' (I only care about distance)
- 'Yes, they are very similar.' (I care more about angle, less about distance: this is cosine similarity)
Remember
The important detail is that the two methods, L2 (Euclidean) distance and cosine similarity, can rank results differently because they measure different things. Another way to think about this:

- L2 cares about the tip-to-tip distance, which depends on both angle and length.
- Cosine cares only about the angle.
- In text embeddings, we usually care more about direction than length (or magnitude), which is why cosine (or normalised vectors, which control for magnitude) is a common default.
- If vectors are L2-normalised (scaled to length 1), cosine distance and L2 distance produce the same ranking.
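A tiny sketch in plain Python makes the ranking disagreement concrete. The vectors here are invented for illustration: `short` and `long_` point in the same direction with very different magnitudes, while `other` is nearby but points elsewhere.

```python
import math

def l2_distance(a: list[float], b: list[float]) -> float:
    # Straight-line (Euclidean) distance between the two arrow tips.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine_similarity(a: list[float], b: list[float]) -> float:
    # 1.0 = identical direction, 0 = perpendicular, -1.0 = opposite.
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

short = [1.0, 1.0]    # a short arrow
long_ = [10.0, 10.0]  # a long arrow, same direction
other = [1.5, -0.5]   # close to `short`, but pointing a different way

# L2 ranks `other` as the nearer neighbour of `short`...
assert l2_distance(short, other) < l2_distance(short, long_)

# ...while cosine ranks `long_` first: identical direction, similarity ~1.0.
assert cosine_similarity(short, long_) > cosine_similarity(short, other)
```

The same query vector can therefore return different top results depending on which distance operator your vector database is asked to use.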
What Does an Embedding Look Like?
Quite literally, a long series of numbers, alongside the actual chunk to be embedded.
A. "compensation for delayed baggage" → [0.021, -0.134, 0.892, 0.045, -0.219, ...] (1,536 values)
B. "lost baggage reimbursement" → [0.019, -0.128, 0.887, 0.041, -0.211, ...]
C. "flight departure schedule" → [0.412, 0.731, 0.104, -0.382, 0.509, ...]
A and B are semantically similar, both relate to baggage delays and compensation. Notice how close their numbers are. C, about flight schedules, sits in a completely different part of the vector space.
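To see that closeness numerically, here is a toy cosine-similarity calculation over just the five values printed above. The real vectors have 1,536 dimensions, so treat this purely as an illustration of the arithmetic, not a faithful similarity score.

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

# First five values of each example embedding, as printed above.
a = [0.021, -0.134, 0.892, 0.045, -0.219]  # "compensation for delayed baggage"
b = [0.019, -0.128, 0.887, 0.041, -0.211]  # "lost baggage reimbursement"
c = [0.412, 0.731, 0.104, -0.382, 0.509]   # "flight departure schedule"

assert cosine_similarity(a, b) > 0.99  # near-identical direction: same meaning
assert cosine_similarity(a, c) < 0.2   # different region of the vector space
```

Even on this five-value prefix, A and B are nearly parallel while C points somewhere else entirely, which is exactly what a similarity search exploits.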
The next article in this series lets you get hands-on with embeddings.
As we are learning, there are many ways to chunk and many ways to embed. An important first step after embedding is to attempt a retrieval, and to judge that retrieval we need to understand how our chunking and embedding are performing. Two of the key metrics we use are precision and recall.
Precision & Recall: A Quick Guide
As we get into the detail of embedding, these terms come up a lot, so here is a quick guide as to what they are and why they matter.
Precision and recall are the two de facto benchmarks of how well retrieval performs. Both speak to the quality of the data returned in relation to a query.
Consider our three example chunks above, retrieved in response to the query: "What is your policy for misplaced luggage?" We might get some, all, or none of them back. We can quantify how well we did:
- Precision: Of the chunks returned, how many were actually relevant?
- Recall: Of all the relevant chunks that exist, how many did we retrieve?
| Which Chunks Were Returned | Conclusion |
|---|---|
| None, or C only | Total fail: precision 0, recall 0 |
| A, B | Precision 2/2, recall 2/2 (perfect) |
| A, C or B, C | Precision 1/2, recall 1/2 (noisy and incomplete) |
| A only or B only | Precision 1/1, recall 1/2 (clean but incomplete) |
| A, B, C | Precision 2/3, recall 2/2 (complete but noisy) |
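The table above falls straight out of the definitions. A minimal sketch, where A and B are the chunks genuinely relevant to the misplaced-luggage query:

```python
def precision_recall(returned: set[str], relevant: set[str]) -> tuple[float, float]:
    hits = returned & relevant  # returned chunks that are actually relevant
    precision = len(hits) / len(returned) if returned else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall

relevant = {"A", "B"}  # ground truth for "What is your policy for misplaced luggage?"

assert precision_recall({"A", "B"}, relevant) == (1.0, 1.0)         # perfect
assert precision_recall({"A", "C"}, relevant) == (0.5, 0.5)         # noise plus a miss
assert precision_recall({"A"}, relevant) == (1.0, 0.5)              # clean but incomplete
assert precision_recall({"A", "B", "C"}, relevant) == (2 / 3, 1.0)  # complete but noisy
```

In practice the "relevant" set comes from a labelled evaluation dataset; building one is part of the tuning work covered later in the series.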
This gives us a basis to tune multiple aspects of our pipeline, from chunking through to re-ranking and answer generation. There is a lot to unpack here and we will revisit this topic in great detail in the Retrieval mini-series.
Tools like RAGAS provide automated metrics for precisely these dimensions, including context precision, context recall, and faithfulness. Worth knowing about when you get to the tuning stage.
Now that we have the baseline knowledge, we can explore some commonly seen embedding solutions.
The Main Players
OpenAI, Google, and Voyage (Anthropic's partner for embedding) all offer pay-per-token embedders. Of course, there are many open source options too. Hugging Face maintains an MTEB leaderboard with links to ranked models; many of the stats in the table below are taken from that leaderboard.
Different embedders excel at different things: Some will be stronger in specific domains like legal text or fiction, or be suited to very large datasets. It is worth taking time to understand which is the best fit for your use case.
Remember
- MTEB: Massive Text Embedding Benchmark, a standardised framework designed to evaluate the performance of text embedding models across a wide range of tasks.
- Generation models: handle formulating the response after retrieval (covered in the Retrieval mini-series). Worth knowing now: you can mix and match embedding and generation models. OpenAI embedding with Claude generation is a valid combination.
| | Google | OpenAI | Voyage + Claude | Open Source |
|---|---|---|---|---|
| Model | gemini-embedding-001 | text-embedding-3-large | voyage-3.5 | Lots. e.g. Qwen3-embedding-8B |
| Dimensions | 768 | 3,072 | 1,024 | 4,096 max* |
| Max Tokens | 2,048 | 8,191 | 32,000 | 32,768 |
| Cost | Pay per token | Pay per token | Pay per token | Free (self-hosted) |
| MTEB ranking (Lower is better) | 4 | 26 | 31 | 3 |
| Best for | Cheaper to store and query, good quality results. | Default enterprise choice. | Overall performance, re-ranking. | Cost control at scale / experimentation, top MTEB ranking. |
* Up to 4,096 dimensions but, unlike the other models here, can be configured to a lower value.
Evaluating Options for our Airline RAG
There are a few things we should consider from the table above:
- Vector DBs store vectors alongside the chunk itself. Vectors with thousands of dimensions, multiplied by millions of chunks (and we can assume an airline has a lot of information to chunk), add up to significant storage costs. Gemini's 768-dimension vectors are a quarter of the size of OpenAI's 3,072.
- From our last series of articles, we decided to produce ~500-token chunks. All the models here comfortably exceed that, but if we picked Gemini, as an example, and later decided to increase our token length per chunk:
  - Retrieval metrics, particularly recall, could suffer if we are forced to truncate or split chunks at the 2,048-token ceiling.
  - We would have limited ability to tune our chunking strategy upward, since the embedder's token ceiling becomes a hard constraint.

  In reality this is unlikely to be an issue today, but it is worth mentioning. Embedder token limits have increased substantially in the last couple of years, and even the smallest token budget in our list, 2,048, gives us a good amount of breathing room.
- MTEB rankings are crucial to understanding how our chosen embedder performs against a standardised benchmark, so we should pay close attention to each model's score. I only considered models with under 10B parameters, of which there are about 350 indexed by Hugging Face, and all the models picked here rank highly.

  Also worth noting: Google, OpenAI, and Voyage all offer enterprise-grade infrastructure, no maintenance requirements, enterprise-level uptime, and so on. There are plenty of hosting options for Qwen3, but these require careful vetting, or our airline must manage self-hosting. Uptime is crucially important for retrievals, as outlined below.
- Some embedders offer features like re-ranking: OpenAI's text-embedding-3-small does not, Voyage does. Some are also better suited to certain types of text than others; for example, some excel at embedding research articles, or fiction. This may not be a dealbreaker, but it is worth consideration.
- Token costs are important. There are two things to consider here.
  - Indexing: we will embed all our chunks as we index them and commit them to our database. If we have huge volumes of text, these costs can multiply quickly. I have not included costs here as they are very subject to specific licensing deals, and so on, but it is well worth considering. Moreover, if we go open source, we need to either rent server time or use our own GPU power to run a large model like Qwen3.
  - Retrieval: every single retrieval query we make (that is, every time we send a query such as 'What is the compensation amount for lost bags?' to our vector database) must be embedded with the same embedder. This means that token costs (Google, OpenAI, Voyage) or compute costs (Qwen3) are ongoing, not just a one-time event during indexing.
- If we are embedding highly confidential information, models like Qwen3 can be run on privately hosted hardware without sending chunks back and forth to a hosted service like OpenAI.
- This is covered a little above, but worth calling out again: no matter which option you choose, indexing and retrieval must use the same embedder. If you choose gemini-embedding-001, you are locked into using it for both indexing and retrieving; you cannot mix embedders. Moreover, if Google releases gemini-embedding-002 and you wish to upgrade, you must re-index every chunk and point retrievals at gemini-embedding-002.
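One lightweight safeguard, sketched below as a hypothetical pattern rather than a prescribed one: record which model produced each vector and check it at query time, so a mismatched embedder fails loudly instead of silently returning poor neighbours.

```python
INDEX_MODEL = "gemini-embedding-001"  # the model this index was built with

# Toy index with made-up 2-D vectors; in reality these rows live in a
# vector database and the vectors have hundreds or thousands of dimensions.
vector_db = [
    {"chunk": "Lost luggage reimbursement policy", "vector": [0.1, 0.9], "model": INDEX_MODEL},
    {"chunk": "Flight departure schedule", "vector": [0.9, -0.2], "model": INDEX_MODEL},
]

def retrieve(query_vector: list[float], query_model: str, top_k: int = 1) -> list[str]:
    # Vectors from different embedders live in unrelated spaces, so refuse
    # to compare them rather than return meaningless "neighbours".
    if query_model != INDEX_MODEL:
        raise ValueError(f"index built with {INDEX_MODEL}, query embedded with {query_model}")
    scored = sorted(
        vector_db,
        key=lambda row: sum(q * v for q, v in zip(query_vector, row["vector"])),
        reverse=True,
    )
    return [row["chunk"] for row in scored[:top_k]]
```

An upgrade to a newer embedder then becomes an explicit migration (re-embed everything, flip INDEX_MODEL) rather than a source of quietly degraded results.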
Choosing an Embedder
If I were designing for a real enterprise, I would likely pick either gemini-embedding-001 or voyage-3.5 from the shortlist.
Both rank well on MTEB, and excel in enterprise use cases. Voyage AI in particular offers real flexibility in token length, and re-ranking is powerful for optimising retrieval (covered in the Retrieval mini-series).
Those are the models you would evaluate for a production pipeline. For this series, we drop down to OpenAI text-embedding-3-small: a fraction of the cost, fewer dimensions, but more than capable for a demo. It offers 1,536 dimensions, up to 8,191 tokens, and ranks a respectable 44 on MTEB for a model of its size.
Ready to see this in practice? Next in the Embedding mini-series: pgvector Setup for Airline RAG — get your vector database running on Neon and run your first similarity query.