pgvector Setup for Airline RAG

A practical guide to setting up pgvector on Neon and running your first similarity query

The previous article explained that vector databases store meaning as numbers and retrieve by similarity. This article shows you what that actually looks like.

What is a Vector Database?

I gave a quick overview and analogy of a Vector Database in Primer: Databases. This article goes a few layers deeper and provides a practical guide to setting up your own Vector database using Neon — a free, serverless Postgres service with pgvector built in.

You don't need to do the practical exercises here: I have provided a playground that shows how a vector database works without you having to build one yourself.

Why Vector Databases Matter to RAG

The entire point of RAG is to enable an LLM to provide information grounded in a source of truth. In our airline example, that source is an IRROPS (Irregular Operations) manual, and a typical question is: what is the process for rebooking a passenger if a flight is cancelled?

Whilst you could store this data in an SQL or Document database (and sometimes this is the right approach for certain kinds of text-based data), retrieval becomes difficult because:

  • Traditional queries are built for structured filters and keyword matching: the same words, or common patterns via LIKE '%word%'. They are not built for ranking text passages by similarity of meaning (e.g. luggage and bags are similar; see the sketch after this list). It's worth noting PostgreSQL does offer Full Text Search, which is a relevance-ranked keyword search, but it's still not semantic.

  • Once you break that text up into chunks (which is important for optimising the retrieval of relevant passages), the challenge is selecting the most relevant chunks: you are looking for chunks whose meaning is similar to the query, not just chunks that share its exact words.
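
To make the keyword problem concrete, here is a minimal sketch. The policy_chunks table and its rows are hypothetical, invented purely for illustration:

-- A hypothetical table of policy text
CREATE TABLE policy_chunks (
  id      serial PRIMARY KEY,
  content text
);

INSERT INTO policy_chunks (content) VALUES
  ('Compensation applies when checked bags are delayed more than 24 hours.'),
  ('Passengers must report lost luggage before leaving the airport.');

-- A keyword search for "luggage" misses the first row entirely,
-- even though "bags" and "luggage" mean much the same thing
SELECT * FROM policy_chunks WHERE content LIKE '%luggage%';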


Setting up a Vector Database with Neon

This series uses pgvector: an extension of PostgreSQL that adds vector storage and similarity search to a relational database. We'll use Neon: a free, serverless Postgres service that has pgvector ready to go. No local install required. A hosted database also suits this series: we code locally first, then move to services like Google Colab in later articles, and both can reach the same Neon instance.

It works like this:

  • An embedding model embeds a chunk of text, producing a vector(N).
  • pgvector stores the embedding as a vector(N) column in the database, along with the chunk's metadata and (optionally) the actual text of the chunk as adjacent columns.

Neon Setup (free account, takes 2 minutes)

Prerequisites

  • A free Neon account. Sign up with GitHub or Google if you like

Setup

  1. Create a new project in the Neon dashboard. Name it airline-rag and pick the region closest to you.

  2. Once created, copy your connection string from the dashboard. It looks like:

postgresql://user:password@ep-xxx.region.aws.neon.tech/neondb?sslmode=require
  3. Open the SQL Editor in the Neon dashboard and enable pgvector:
CREATE EXTENSION IF NOT EXISTS vector;
  4. Verify it worked:
SELECT extname, extversion FROM pg_extension WHERE extname = 'vector';
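
If the extension is enabled, you'll get back a single row. The version will vary with your Neon instance; 0.7.0 below is only an example:

 extname | extversion
---------+------------
 vector  | 0.7.0
(1 row)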
  5. Run a test:
CREATE TABLE items (
  id bigserial PRIMARY KEY,
  embedding vector(3)
);
-- CREATE TABLE

INSERT INTO items (embedding)
VALUES ('[1,2,3]'), ('[4,5,6]');
-- INSERT 0 2

SELECT *
FROM items
ORDER BY embedding <-> '[3,1,2]'
LIMIT 2;
--  id | embedding
-- ----+-----------
--   1 | [1,2,3]
--   2 | [4,5,6]
-- (2 rows)

SQL Playground

If you want to skip the setup and just see how a vector database works, click Execute on the window below:


You can see [1,2,3] is the closest result, followed by [4,5,6].

What Just Happened?

  1. We created a single table with 2 columns: id and embedding
CREATE TABLE items (
  id bigserial PRIMARY KEY,
  embedding vector(3)
);
  • id is the primary key and is an auto-incrementing 64-bit value (bigserial).
  • embedding is a vector with 3 'dimensions'.
  2. We inserted 2 rows. Remember our id will auto-increment.
INSERT INTO items (embedding)
VALUES ('[1,2,3]'), ('[4,5,6]');
  3. Finally, we ran a similarity query, ordering the rows by the distance between each stored vector (the values we just inserted) and the vector in our query:
SELECT *
FROM items
ORDER BY embedding <-> '[3,1,2]'
LIMIT 2;
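
If you want to see the distances the ordering is based on, select the distance expression directly (the alias distance is just for readability):

SELECT id, embedding, embedding <-> '[3,1,2]' AS distance
FROM items
ORDER BY distance;
--  id | embedding |     distance
-- ----+-----------+-------------------
--   1 | [1,2,3]   | 2.449489742783178
--   2 | [4,5,6]   | 5.744562646538029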

Understanding Distance

I discussed ordering results by the distance between the query vector and each stored vector. Once everything is represented as vectors, closeness becomes a distance/similarity function. Here is a very simplified visual model:

Picture two arrows of the same length separated by ~65°. Are they similar?

These might be the vectors for two chunks whose subject matter drifts into related tangents (permanently lost baggage vs compensation for delayed baggage).

  • If answering using Euclidean (L2) distance, you would say 'The L2 distance is moderate: the endpoints aren't extremely close, but they're not maximally far apart either.' L2 measures the distance between the endpoints, which increases with both angle and length differences.

  • If answering using Cosine Similarity, you would say 'They're moderately similar but not a great match.' Cosine only cares about the angle between vectors, so a 65° gap means partial alignment but noticeable drift.

Now picture two arrows pointing in almost the same direction (separated by ~1°), but with one much longer than the other. Are they similar?
  • If answering using Euclidean (L2) distance, you would say 'No, they are not very similar at all'. They both point in almost the same direction, but the distance between the endpoints is large.

  • If answering using Cosine Similarity, you would say 'Yes, they are very similar'. They both point in almost the same direction, regardless of their length (see the query below).
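
Here is that check in pgvector. [2,4,6] points in exactly the same direction as [1,2,3], just twice as long, so cosine calls them a perfect match while L2 puts them well apart:

SELECT '[1,2,3]'::vector <-> '[2,4,6]'::vector AS l2_distance,
       '[1,2,3]'::vector <=> '[2,4,6]'::vector AS cosine_distance;
-- l2_distance     ≈ 3.74  (the endpoints are far apart)
-- cosine_distance = 0     (identical direction, a perfect match)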

Remember

The important detail is that both methods can rank results differently because they measure different things. Another way to think about this:

  • L2 cares about the tip-to-tip distance, which depends on both angle and length

  • Cosine cares about the angle

  • In text embeddings, we usually care more about direction than length (or magnitude), which is why cosine (or normalised vectors, which control for magnitude) is a common default.

  • If vectors are L2-normalised (for length), cosine distance and L2 distance produce the same ranking (see the one-line derivation below).
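
For the curious, here is why that last point holds, written as a one-line derivation (in LaTeX notation) for unit vectors a and b, where a · b = cos θ:

\|a - b\|^2 = \|a\|^2 + \|b\|^2 - 2\,(a \cdot b) = 2 - 2\cos\theta

L2 distance between unit vectors is therefore a monotonically increasing function of the angle alone, so both metrics sort the same vectors in the same order.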

The <-> operator in our query means Euclidean (L2) distance, or straight-line distance: how 'close' are two vectors to each other? Two chunks with identical text would produce identical vectors. Again, two chunks on the same topic, but with a big variance in length (or magnitude), may end up further 'apart' than they should be. In the example above, I used Euclidean (L2) distance because it's the easiest way to see and explain what is 'closest' in a test.

For text, the most common way of determining distance is cosine similarity. It ignores length (magnitude) and focuses on direction, so the two chunks mentioned above, which may not end up as k nearest neighbours using Euclidean (L2) distance, do end up as k nearest neighbours in this case.

Again, when we run this against real data, we would use pgvector's cosine distance operator (<=>), which ranks by cosine similarity. Like this:

ORDER BY embedding <=> '[3,1,2]' LIMIT 2;
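
To see both metrics side by side on the toy items table, select each distance expression in one query. Note the ranking actually flips: under cosine, [4,5,6] is closer to [3,1,2] than [1,2,3] is, the opposite of the L2 ordering we saw earlier:

SELECT id,
       embedding,
       embedding <-> '[3,1,2]' AS l2_distance,
       embedding <=> '[3,1,2]' AS cosine_distance
FROM items
ORDER BY cosine_distance;
--  id | embedding | l2_distance | cosine_distance
-- ----+-----------+-------------+-----------------
--   2 | [4,5,6]   |      ≈ 5.74 |         ≈ 0.117
--   1 | [1,2,3]   |      ≈ 2.45 |         ≈ 0.214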

Scaling the Embedding Vector

A vector of 3 'dimensions' (as above) isn't going to be sufficient for our embeddings: we need hundreds or thousands of dimensions. We also need to prep our database with some fields for our chunk and its metadata. Create a new table in your database:

CREATE TABLE IF NOT EXISTS docs (
  id          bigserial PRIMARY KEY,
  source      text NOT NULL,
  chunk_index int  NOT NULL,
  content     text NOT NULL,
  metadata    jsonb NOT NULL DEFAULT '{}',
  embedding   vector(1536) NOT NULL,
  created_at  timestamptz NOT NULL DEFAULT now()
);
  • id is the primary key and is an auto-incrementing 64-bit value (bigserial).
  • source and content are text fields.
  • chunk_index is an integer.
  • metadata is a JSON object (this is data about our chunk, like where it came from and when it was last updated; more on this later).
  • created_at gets the current time and date by default.
  • embedding is a vector with 1536 'dimensions' (1536 matches common embedding models such as OpenAI's text-embedding-3-small; use whatever dimension your model outputs).
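
To make the shape concrete, here is a sketch of one row. The source name, content, and metadata are made up, and array_fill manufactures a uniform placeholder vector where a real embedding model's output would go:

-- A made-up row purely to show the table's shape
INSERT INTO docs (source, chunk_index, content, metadata, embedding)
VALUES (
  'irrops_manual',
  0,
  'If a flight is cancelled, rebook the passenger on the next available service.',
  '{"section": "rebooking"}',
  array_fill(0.1::real, ARRAY[1536])::vector  -- placeholder, not a real embedding
);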

Creating an Index

An index in a database works similarly to an index in a book. If you are looking for a subject starting with 'S', you can use the index to quickly locate your 'S' content without having to look through the entire book (or at least A-R) first.

We don't really need an index for a small dataset, but for any 'real life' use case, we certainly would. Here's a quick summary of indexes:

  • Without an index, Postgres will scan all rows.
  • We'll add an ANN index (Approximate Nearest Neighbour).
  • Two common index types are:
    • IVFFlat: partitions vectors into clusters ("lists") and searches only the most relevant clusters (an example follows this list).
    • HNSW: builds a graph of vectors and walks the graph to find near neighbours quickly.
  • You must match the operator class to the distance metric (cosine vs L2).
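
For comparison with the HNSW index we create below, an IVFFlat index on the same table would look like the following. lists is a tuning parameter (pgvector's docs suggest roughly rows/1000 as a starting point for smaller tables), and IVFFlat indexes are best created after the table has data, because the clusters are derived from it:

CREATE INDEX IF NOT EXISTS docs_embedding_ivfflat
ON docs
USING ivfflat (embedding vector_cosine_ops)
WITH (lists = 100);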

Remember

The section above on distance metrics is important here because when you build your index, you must specify an operator class that matches the distance operator you query with. If you build the index for cosine (vector_cosine_ops) but query with L2 distance (<->), Postgres cannot use the index for that query, so you lose the speed-up and fall back to scanning every row.

Add the index to your table now:

CREATE INDEX IF NOT EXISTS docs_embedding_hnsw
ON docs
USING hnsw (embedding vector_cosine_ops);
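
As with the extension itself, you can verify the index exists:

SELECT indexname, indexdef
FROM pg_indexes
WHERE tablename = 'docs';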

Your vector store is ready. Next in the Embedding mini-series: Embedding in Practice — generating real embeddings from your policy chunks and loading them into Neon.