pgvector Setup for Airline RAG
8 min read
A practical guide to setting up pgvector locally and running your first similarity query
The previous article explained that vector databases store meaning as numbers and retrieve by similarity. This article shows you what that actually looks like.
What is a Vector Database?
I gave a quick overview and analogy of a vector database in Primer: Databases. This article goes a few layers deeper and provides a practical guide to setting up your own vector database locally.
You don't need to do the practical exercises here: I have provided a playground that gives you an overview of how a vector database works without you having to build one yourself.
Why Vector Databases Matter to RAG
The entire point of RAG is to enable an LLM to provide information grounded in a source of truth. In our airline example, that source is an IRROPS (Irregular Ops) manual: what is the process for rebooking a passenger if a flight is cancelled?
Whilst you could store this data in an SQL or Document database (and sometimes this is the right approach for certain kinds of text-based data), retrieval becomes difficult because:
- Traditional queries are built for structured filters and keyword matching (the same words, or common patterns such as LIKE '%word%'), not for ranking text passages by similarity of meaning (i.e. luggage and bags are similar). It's worth noting PostgreSQL does offer Full Text Search, which is a relevance-ranked keyword search, but it's still not semantic. A quick sketch of the difference follows this list.
- Once you break that text up into chunks (which is important in optimising the retrieval of relevant passages of text), the challenge is selecting the most relevant chunks, because you are looking for chunks that all have a similar meaning.
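To make the contrast concrete, here is a hedged sketch of the two query styles. It assumes the docs table and cosine operator (<=>) introduced later in this article; $1 is a bind parameter standing in for the embedding of the question text.
-- Keyword matching: only finds rows containing the literal word 'luggage'
SELECT id, content
FROM docs
WHERE content ILIKE '%luggage%';

-- Semantic retrieval with pgvector: ranks rows by similarity of meaning,
-- so a chunk that only mentions 'bags' can still surface for a 'luggage' query.
SELECT id, content
FROM docs
ORDER BY embedding <=> $1   -- $1: the query embedding
LIMIT 5;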
Setting up a Vector Database Locally
This series uses pgvector. It is an extension of PostgreSQL that adds vector storage and similarity search to a relational database. It works like this:
- An embedding model embeds a chunk of text, producing a vector of N numbers.
- pgvector stores the embedding vector as a vector(N) column in the database, along with the chunk's metadata and (optionally) the actual text from that chunk as adjacent columns. A toy illustration follows this list.
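Here is a toy sketch of that shape (the table name and values are made up for illustration; the real schema we will use appears later in this article):
-- Toy illustration only: the chunk's text and metadata live in ordinary columns
-- right next to the vector column holding its embedding.
CREATE TABLE chunk_example (
  content   text,
  source    text,
  embedding vector(3)   -- real embeddings have hundreds or thousands of dimensions
);

INSERT INTO chunk_example (content, source, embedding)
VALUES ('Rebooking options after a cancellation', 'irrops-manual', '[0.12, 0.83, 0.05]');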
Local Database Install
These instructions assume you have access to a command line and are using Docker. They should work well on Mac and Linux systems. I will update this with Windows instructions at a later date.
Prerequisites
- You are comfortable with running bash/zsh commands
- You already have Docker (I recommend the Docker approach for this): brew install --cask docker (or your Linux equivalent)
- OR you have installed PostgreSQL and pgvector locally
Setup
- Create a local folder
mkdir -p ~/dev/airline-lab && cd ~/dev/airline-lab
- Create a docker-compose.yml (if using Docker)
services:
  db:
    image: pgvector/pgvector:pg17
    environment:
      POSTGRES_USER: app
      POSTGRES_PASSWORD: secretpassword
      POSTGRES_DB: airlineraglab
    ports:
      - "5432:5432"
    volumes:
      - pgdata:/var/lib/postgresql/data
volumes:
  pgdata:
- Start your shiny new database
docker compose up -d
docker compose ps
- Connect with psql
docker compose exec db psql -U app -d airlineraglab
- Enable pgvector in your new database
CREATE EXTENSION IF NOT EXISTS vector;
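If you want to double-check the extension is active (and see which pgvector version the image ships), you can query the Postgres catalog:
-- Optional sanity check: confirm the vector extension is installed and see its version
SELECT extname, extversion
FROM pg_extension
WHERE extname = 'vector';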
- Run a test
CREATE TABLE items (
  id bigserial PRIMARY KEY,
  embedding vector(3)
);
-- Output: CREATE TABLE

INSERT INTO items (embedding)
VALUES ('[1,2,3]'), ('[4,5,6]');
-- Output: INSERT 0 2

SELECT *
FROM items
ORDER BY embedding <-> '[3,1,2]'
LIMIT 2;
-- Output:
--  id | embedding
-- ----+-----------
--   1 | [1,2,3]
--   2 | [4,5,6]
-- (2 rows)
SQL Playground
If you want to skip the local install and just see how a vector database works, click Execute on the window below:
You can see [1,2,3] is the closest result, followed by [4,5,6].
What Just Happened?
- We created a single table with 2 columns: id and embedding
CREATE TABLE items (
  id bigserial PRIMARY KEY,
  embedding vector(3)
);
id is the primary key and is an auto-incrementing 64-bit value (bigserial). embedding is a vector with 3 'dimensions'.
- We inserted 2 rows. Remember our id will auto-increment.
INSERT INTO items (embedding)
VALUES ('[1,2,3]'), ('[4,5,6]');
- Finally, we ran a similarity query. We ordered the rows by the distance between each stored vector (the values we just inserted) and the vector in our query (a variant of this query that also returns the distances is shown after this list).
SELECT *
FROM items
ORDER BY embedding <-> '[3,1,2]'
LIMIT 2;
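If you want to see the numbers behind that ordering, you can return the distance as a column too. This is the same query with the distance exposed; the values in the comment are approximate.
-- Same similarity query, also returning the L2 distance used for the ordering.
-- [1,2,3] is about 2.45 from [3,1,2]; [4,5,6] is about 5.74 away.
SELECT id,
       embedding,
       embedding <-> '[3,1,2]' AS l2_distance
FROM items
ORDER BY l2_distance
LIMIT 2;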
Understanding Distance
I discussed returning the results in terms of the distance from each stored vector. Once everything is represented as vectors, closeness is a distance/similarity function. Here is a very simplified visual model:
These might be the vectors for two chunks whose subject matter drifts into a related tangent (Permanently lost baggage vs Compensation for delayed baggage).
- If answering using Euclidean (L2) distance, you would say 'The L2 distance is moderate: the endpoints aren’t extremely close, but they’re not maximally far apart either.' L2 measures the distance between the endpoints, which increases with both angle and length differences.
- If answering using Cosine Similarity, you would say 'They’re moderately similar but not a great match.' Cosine only cares about the angle between vectors, so a 65° gap means partial alignment but noticeable drift.
This image shows two arrows pointing in almost the same direction. One is much longer than the other: Are they similar?
- If answering using Euclidean (L2) distance, you would say 'No, they are not very similar at all.' They both point in almost the same direction, but the distance between the endpoints is large.
- If answering using Cosine Similarity, you would say 'Yes, they are very similar.' They both point in almost the same direction, regardless of their length.
Remember
The important detail is that both methods can rank results differently because they measure different things. Another way to think about this:
- L2 cares about the tip-to-tip distance, which depends on both angle and length
- Cosine cares about the angle
- In text embeddings, we usually care more about direction than length (or magnitude), which is why cosine (or normalised vectors, which control for magnitude) is a common default.
- If vectors are L2-normalised (for length), cosine distance and L2 distance produce the same ranking (see the sketch after this list).
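To see that last point in action, here is a small sketch using our toy vectors, L2-normalised by hand (values rounded). With unit-length vectors, the L2 and cosine orderings agree:
-- Hand-normalised versions of [1,2,3] and [4,5,6], compared against the
-- normalised query [3,1,2]. Both operators now produce the same ranking.
SELECT v AS unit_vector,
       v <-> '[0.802, 0.267, 0.535]' AS l2_distance,
       v <=> '[0.802, 0.267, 0.535]' AS cosine_distance
FROM (VALUES
  ('[0.267, 0.535, 0.802]'::vector),   -- [1,2,3] / sqrt(14)
  ('[0.456, 0.570, 0.684]'::vector)    -- [4,5,6] / sqrt(77)
) AS t(v)
ORDER BY cosine_distance;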
The <-> operator in our query means Euclidean (L2) distance - or straight-line distance - that is, how 'close' are two vectors to each other? Two chunks with identical text would produce identical vectors. Again, two chunks on the same topic, but with a big variance in length (or magnitude), may end up further 'apart' than they should be.
For text, the most common way of determining distance is Cosine Similarity. This ignores length (magnitude) and focuses on direction, so the two chunks mentioned above, which may not end up as K nearest neighbours using Euclidean (L2) distance, do end up as K nearest neighbours in this case.
In the example above, I used Euclidean (L2) distance because it's the easiest way to see and explain what is 'closest' in a test. When we run this against real data, we would use Cosine Similarity (via pgvector's cosine distance operator <=>) like this:
ORDER BY embedding <=> '[3,1,2]' LIMIT 2;
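For comparison, here is a sketch that runs both operators side by side on the toy items table from earlier. On this toy data the two metrics actually disagree: by L2, [1,2,3] is closest to [3,1,2], but by cosine, [4,5,6] is, because its direction is more similar even though its endpoint is further away.
-- <-> is Euclidean (L2) distance; <=> is cosine distance (1 - cosine similarity)
SELECT id,
       embedding,
       embedding <-> '[3,1,2]' AS l2_distance,
       embedding <=> '[3,1,2]' AS cosine_distance
FROM items
ORDER BY cosine_distance
LIMIT 2;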
Scaling the Embedding Vector
A vector of 3 'dimensions' (as above) isn't going to be sufficient for our embeddings: we need hundreds or thousands of dimensions. We also need to prep our database with some fields for our chunk and its metadata. Create a new table in your database:
CREATE TABLE IF NOT EXISTS docs (
id bigserial PRIMARY KEY,
source text NOT NULL,
chunk_index int NOT NULL,
content text NOT NULL,
metadata jsonb NOT NULL DEFAULT '{}',
embedding vector(1536) NOT NULL,
created_at timestamptz NOT NULL DEFAULT now()
);
- id is the primary key and is an auto-incrementing 64-bit value (bigserial).
- source and content are text fields.
- chunk_index is an integer.
- metadata is a JSON object (this is data about our chunk, like where it came from and when it was last updated; more on this later).
- created_at gets the current time and date by default.
- embedding is a vector with 1536 'dimensions'.
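Once this table is populated, a retrieval query can combine an ordinary metadata filter with the vector search. This is a hedged sketch: $1 is a bind parameter for the 1536-dimension query embedding, and the doc_type metadata key is purely illustrative, not something defined elsewhere in this article.
-- Hypothetical retrieval query: filter on metadata, then rank by cosine distance.
SELECT id, source, chunk_index, content
FROM docs
WHERE metadata @> '{"doc_type": "irrops"}'   -- illustrative metadata filter
ORDER BY embedding <=> $1                    -- $1: embedding of the user's question
LIMIT 5;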
Creating an Index
An index in a database works similarly to an index in a book. If you are looking for a subject starting with 'S', you can use the index to quickly locate your 'S' content without having to look through the entire book (or at least A-R) first.
We don't really need an index for a small dataset, but for any 'real life' use case, we certainly would. Here's a quick summary on indexes:
- Without an index, Postgres will scan all rows.
- We'll add an ANN index (Approximate Nearest Neighbour).
- Two common index types are:
- IVFFlat: partitions vectors into clusters (“lists”) and searches only the most relevant clusters.
- HNSW: builds a graph of vectors and walks the graph to find near neighbours quickly.
- You must match the operator class to the distance metric (cosine vs L2).
Remember
The section above on distance metrics is important here because when you build your index, you must specify an operator class that matches your distance metric. If you query with an L2 distance metric but specify an operator class of cosine on the index, you won't get the correct results (the correct approximate nearest neighbours).
Add the index to your table now:
CREATE INDEX IF NOT EXISTS docs_embedding_hnsw
ON docs
USING hnsw (embedding vector_cosine_ops);
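If you preferred IVFFlat instead of HNSW, it would look something like the sketch below. Note this is a sketch: the lists value is a tuning knob you would size for your data, and IVFFlat builds its clusters from the rows already in the table, so it is typically created after loading your data.
-- Alternative index type: IVFFlat with the cosine operator class.
-- (vector_l2_ops would be the matching class if you queried with <-> instead.)
CREATE INDEX IF NOT EXISTS docs_embedding_ivfflat
ON docs
USING ivfflat (embedding vector_cosine_ops)
WITH (lists = 100);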
Cleanup
You can manage your local Docker instance using these commands:
- docker compose stop / docker compose restart: stops and restarts the containers while keeping your data. (Recommended.)
- docker compose down: tears down the containers and network associated with your Compose project. Your data persists in the named volume.
- docker compose down -v: also deletes the volume (wipes the DB).