pgvector Setup for Airline RAG
8 min read
A practical guide to setting up pgvector locally and running your first similarity query
The previous article explained that vector databases store meaning as numbers and retrieve by similarity. This article shows you what that actually looks like.
What is a Vector Database?
I gave a quick overview and analogy of a vector database in Primer: Databases. This article goes a few layers deeper and provides a practical guide to setting up your own vector database locally.
You don't need to do the practical exercises here: I have provided a playground that gives you an overview of how a vector database works without you having to build one yourself.
Why Vector Databases Matter to RAG
The entire point of RAG is to enable an LLM to provide information grounded in a source of truth. In our airline example, that source is an IRROPS (Irregular Ops) manual: what is the process for rebooking a passenger if a flight is cancelled?
Whilst you could store this data in an SQL or Document database (and sometimes this is the right approach for certain kinds of text-based data), retrieval becomes difficult because:
- Traditional queries are built for structured filters and keyword matching (the same words, or common patterns such as LIKE '%word%'), not for ranking text passages by similarity of meaning (i.e. luggage and bags are similar). It's worth noting PostgreSQL does offer Full Text Search, which is a relevance-ranked keyword search, but it's still not semantic. A quick sketch of the difference follows this list.
- Once you break that text up into chunks (which is important in optimising the retrieval of relevant passages of text), the challenge is selecting the most relevant chunks, because you are looking for chunks that all have a similar meaning.
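To make the contrast concrete, here is a hedged sketch of the two query styles. It assumes the docs table and cosine operator (<=>) introduced later in this article; $1 is a bind parameter standing in for the embedding of the question text.
-- Keyword matching: only finds rows containing the literal word 'luggage'
SELECT id, content
FROM docs
WHERE content ILIKE '%luggage%';

-- Semantic retrieval with pgvector: ranks rows by similarity of meaning,
-- so a chunk that only mentions 'bags' can still surface for a 'luggage' query.
SELECT id, content
FROM docs
ORDER BY embedding <=> $1   -- $1: the query embedding
LIMIT 5;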
Setting up a Vector Database Locally
This series uses pgvector. It is an extension of PostgreSQL that adds vector storage and similarity search to a relational database. It works like this:
- An embedding model embeds a chunk of text, producing a vector of N numbers.
- pgvector stores the embedding vector as a vector(N) column in the database, along with the chunk's metadata and (optionally) the actual text from that chunk as adjacent columns. A toy illustration follows this list.
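Here is a toy sketch of that shape (the table name and values are made up for illustration; the real schema we will use appears later in this article):
-- Toy illustration only: the chunk's text and metadata live in ordinary columns
-- right next to the vector column holding its embedding.
CREATE TABLE chunk_example (
  content   text,
  source    text,
  embedding vector(3)   -- real embeddings have hundreds or thousands of dimensions
);

INSERT INTO chunk_example (content, source, embedding)
VALUES ('Rebooking options after a cancellation', 'irrops-manual', '[0.12, 0.83, 0.05]');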
Local Database Install
These instructions assume you have access to a command line and are using Docker. They should work well on Mac and Linux systems. I will update this with Windows instructions at a later date.
Prerequisites
- You are comfortable with running bash/zsh commands
- You already have Docker (I recommend the Docker approach for this): brew install --cask docker (or your Linux equivalent)
- OR you have installed PostgreSQL and pgvector locally
Setup
- Create a local folder
mkdir -p ~/dev/airline-lab && cd ~/dev/airline-lab
- Create a docker-compose.yml (if using Docker)
services:
  db:
    image: pgvector/pgvector:pg17
    environment:
      POSTGRES_USER: app
      POSTGRES_PASSWORD: secretpassword
      POSTGRES_DB: airlineraglab
    ports:
      - "5432:5432"
    volumes:
      - pgdata:/var/lib/postgresql/data
volumes:
  pgdata:
- Start your shiny new database
docker compose up -d
docker compose ps
- Connect with psql
docker compose exec db psql -U app -d airlineraglab
- Enable pgvector in your new database
CREATE EXTENSION IF NOT EXISTS vector;
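If you want to double-check the extension is active (and see which pgvector version the image ships), you can query the Postgres catalog:
-- Optional sanity check: confirm the vector extension is installed and see its version
SELECT extname, extversion
FROM pg_extension
WHERE extname = 'vector';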
- Run a test
CREATE TABLE items (
  id bigserial PRIMARY KEY,
  embedding vector(3)
);
-- Output: CREATE TABLE

INSERT INTO items (embedding)
VALUES ('[1,2,3]'), ('[4,5,6]');
-- Output: INSERT 0 2

SELECT *
FROM items
ORDER BY embedding <-> '[3,1,2]'
LIMIT 2;
-- Output:
--  id | embedding
-- ----+-----------
--   1 | [1,2,3]
--   2 | [4,5,6]
-- (2 rows)
SQL Playground
If you want to skip the local install and just see how a vector database works, click Execute on the window below:
You can see [1,2,3] is the closest result, followed by [4,5,6].
What Just Happened?
- We created a single table with 2 columns: id and embedding
CREATE TABLE items (
  id bigserial PRIMARY KEY,
  embedding vector(3)
);
id is the primary key and is an auto-incrementing 64-bit value (bigserial). embedding is a vector with 3 'dimensions'.
- We inserted 2 rows. Remember our id will auto-increment.
INSERT INTO items (embedding)
VALUES ('[1,2,3]'), ('[4,5,6]');
- Finally, we ran a similarity query. We ordered the rows by the distance between each stored vector (the values we just inserted) and the vector in our query (a variant of this query that also returns the distances is shown after this list).
SELECT *
FROM items
ORDER BY embedding <-> '[3,1,2]'
LIMIT 2;
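If you want to see the numbers behind that ordering, you can return the distance as a column too. This is the same query with the distance exposed; the values in the comment are approximate.
-- Same similarity query, also returning the L2 distance used for the ordering.
-- [1,2,3] is about 2.45 from [3,1,2]; [4,5,6] is about 5.74 away.
SELECT id,
       embedding,
       embedding <-> '[3,1,2]' AS l2_distance
FROM items
ORDER BY l2_distance
LIMIT 2;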
Understanding Distance
I discussed returning the results in terms of the distance from each stored vector. Once everything is represented as vectors, closeness is a distance/similarity function. Here is a very simplified visual model:
These might be the vectors for two chunks whose subject matter drifts into a related tangent (Permanently lost baggage vs Compensation for delayed baggage).
- If answering using Euclidean (L2) distance, you would say 'The L2 distance is moderate: the endpoints aren’t extremely close, but they’re not maximally far apart either.' L2 measures the distance between the endpoints, which increases with both angle and length differences.
- If answering using Cosine Similarity, you would say 'They’re moderately similar but not a great match.' Cosine only cares about the angle between vectors, so a 65° gap means partial alignment but noticeable drift.
This image shows two arrows pointing in almost the same direction. One is much longer than the other: Are they similar?
- If answering using Euclidean (L2) distance, you would say 'No, they are not very similar at all.' They both point in almost the same direction, but the distance between the endpoints is large.
- If answering using Cosine Similarity, you would say 'Yes, they are very similar.' They both point in almost the same direction, regardless of their length.
Remember
The important detail is that both methods can rank results differently because they measure different things. Another way to think about this:
- L2 cares about the tip-to-tip distance, which depends on both angle and length
- Cosine cares about the angle
- In text embeddings, we usually care more about direction than length (or magnitude), which is why cosine (or normalised vectors, which control for magnitude) is a common default.
- If vectors are L2-normalised (for length), cosine distance and L2 distance produce the same ranking (see the sketch after this list).
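To see that last point in action, here is a small sketch using our toy vectors, L2-normalised by hand (values rounded). With unit-length vectors, the L2 and cosine orderings agree:
-- Hand-normalised versions of [1,2,3] and [4,5,6], compared against the
-- normalised query [3,1,2]. Both operators now produce the same ranking.
SELECT v AS unit_vector,
       v <-> '[0.802, 0.267, 0.535]' AS l2_distance,
       v <=> '[0.802, 0.267, 0.535]' AS cosine_distance
FROM (VALUES
  ('[0.267, 0.535, 0.802]'::vector),   -- [1,2,3] / sqrt(14)
  ('[0.456, 0.570, 0.684]'::vector)    -- [4,5,6] / sqrt(77)
) AS t(v)
ORDER BY cosine_distance;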
The <-> operator in our query means Euclidean (L2) distance - or straight-line distance - that is, how 'close' are two vectors to each other? Two chunks with identical text would produce identical vectors. Again, two chunks on the same topic, but with a big variance in length (or magnitude), may end up further 'apart' than they should be.
For text, the most common way of determining distance is Cosine Similarity. This ignores length (magnitude) and focuses on direction, so the two chunks mentioned above, which may not end up as K nearest neighbours using Euclidean (L2) distance, do end up as K nearest neighbours in this case.
In the example above, I used Euclidean (L2) distance because it's the easiest way to see and explain what is 'closest' in a test. When we run this against real data, we would use Cosine Similarity (via pgvector's cosine distance operator <=>) like this:
ORDER BY embedding <=> '[3,1,2]' LIMIT 2;
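For comparison, here is a sketch that runs both operators side by side on the toy items table from earlier. On this toy data the two metrics actually disagree: by L2, [1,2,3] is closest to [3,1,2], but by cosine, [4,5,6] is, because its direction is more similar even though its endpoint is further away.
-- <-> is Euclidean (L2) distance; <=> is cosine distance (1 - cosine similarity)
SELECT id,
       embedding,
       embedding <-> '[3,1,2]' AS l2_distance,
       embedding <=> '[3,1,2]' AS cosine_distance
FROM items
ORDER BY cosine_distance
LIMIT 2;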
Scaling the Embedding Vector
A vector of 3 'dimensions' (as above) isn't going to be sufficient for our embeddings: we need hundreds or thousands of dimensions. We also need to prep our database with some fields for our chunk and its metadata. Create a new table in your database:
CREATE TABLE IF NOT EXISTS docs (
id bigserial PRIMARY KEY,
source text NOT NULL,
chunk_index int NOT NULL,
content text NOT NULL,
metadata jsonb NOT NULL DEFAULT '{}',
embedding vector(1536) NOT NULL,
created_at timestamptz NOT NULL DEFAULT now()
);
- id is the primary key and is an auto-incrementing 64-bit value (bigserial).
- source and content are text fields.
- chunk_index is an integer.
- metadata is a JSON object (this is data about our chunk, like where it came from and when it was last updated; more on this later).
- created_at gets the current time and date by default.
- embedding is a vector with 1536 'dimensions'.
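Once this table is populated, a retrieval query can combine an ordinary metadata filter with the vector search. This is a hedged sketch: $1 is a bind parameter for the 1536-dimension query embedding, and the doc_type metadata key is purely illustrative, not something defined elsewhere in this article.
-- Hypothetical retrieval query: filter on metadata, then rank by cosine distance.
SELECT id, source, chunk_index, content
FROM docs
WHERE metadata @> '{"doc_type": "irrops"}'   -- illustrative metadata filter
ORDER BY embedding <=> $1                    -- $1: embedding of the user's question
LIMIT 5;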
Creating an Index
An index in a database works similarly to an index in a book. If you are looking for a subject starting with 'S', you can use the index to quickly locate your 'S' content without having to look through the entire book (or at least A-R) first.
We don't really need an index for a small dataset, but for any 'real life' use case, we certainly would. Here's a quick summary on indexes:
- Without an index, Postgres will scan all rows.
- We'll add an ANN index (Approximate Nearest Neighbour).
- Two common index types are:
- IVFFlat: partitions vectors into clusters (“lists”) and searches only the most relevant clusters.
- HNSW: builds a graph of vectors and walks the graph to find near neighbours quickly.
- You must match the operator class to the distance metric (cosine vs L2).
Remember
The section above on distance metrics is important here because when you build your index, you must specify an operator class that matches your distance metric. If you query with an L2 distance metric but specify an operator class of cosine on the index, you won't get the correct results (the correct approximate nearest neighbours).
Add the index to your table now:
CREATE INDEX IF NOT EXISTS docs_embedding_hnsw
ON docs
USING hnsw (embedding vector_cosine_ops);
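If you preferred IVFFlat instead of HNSW, it would look something like the sketch below. Note this is a sketch: the lists value is a tuning knob you would size for your data, and IVFFlat builds its clusters from the rows already in the table, so it is typically created after loading your data.
-- Alternative index type: IVFFlat with the cosine operator class.
-- (vector_l2_ops would be the matching class if you queried with <-> instead.)
CREATE INDEX IF NOT EXISTS docs_embedding_ivfflat
ON docs
USING ivfflat (embedding vector_cosine_ops)
WITH (lists = 100);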
Cleanup
You can manage your local Docker instance using these commands:
- docker compose stop / docker compose restart: stops and restarts the containers while keeping your data. (Recommended.)
- docker compose down: tears down the containers and network associated with your Compose project. Your data persists in the named volume.
- docker compose down -v: also deletes the volume (wipes the DB).