Primer: Databases

A primer on the main database technologies and how they relate to LLMs

Introduction

We want to augment our LLM with our business-specific knowledge. We can break this into two types:

  • Unstructured Data: Think about what happens if someone loses a bag or if flights are cancelled due to weather - our airline has thousands of pages of documents covering every eventuality in many formats.

  • Structured Data: Our airline has many databases containing ticketing information, loyalty and transaction records, and more.

Both are critical to ensuring accurate, relevant, and helpful responses. Without access to this data, the LLM can at best provide only superficial guidance to a customer, and, at worst, provide incorrect information.

Our end goal is to have all that unstructured data in a central knowledge base. Even better if, rather than just searching for articles, you can ask an ‘AI Assistant’ questions and get factual answers grounded in all that data. Additionally, we want to give our LLM access to all the structured data on bookings, loyalty, flight schedules, and so on, so it can provide you with a relevant answer.

Before we get into the ‘how’, it is important to understand the different ways enterprise data can be stored. The following paragraphs are not a definitive guide to enterprise data storage; rather, they are intended as a grounding in some core database concepts we will revisit throughout this series.


Database Types

Relational Databases

A great fit for structured data, a poor fit for unstructured data

A relational database is like a room full of drawers of records, where an archivist strictly mandates that every record in a given drawer must contain the same fields (a schema). Relational databases can scale to huge sizes, contain many tables, and store many kinds of structured data. For our airline example, think:

  • Bookings
  • Flight schedules
  • Loyalty records and transactions
  • Inventory and revenue data

Relational databases are the default choice for many applications because they are both reliable and flexible in how you query the data. SQL lets you query and combine data across tables, and the database can enforce rules like “this value must exist” or “this relationship must be valid.”
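To make this concrete, here is a minimal sketch using SQLite from Python. The table and column names (`flights`, `bookings`, and so on) are illustrative, not a real airline schema:

```python
import sqlite3

# Toy airline schema (names are illustrative, not a real data model).
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # enforce relationships
conn.executescript("""
CREATE TABLE flights (
    flight_no TEXT PRIMARY KEY,
    origin    TEXT NOT NULL,
    dest      TEXT NOT NULL
);
CREATE TABLE bookings (
    booking_id INTEGER PRIMARY KEY,
    passenger  TEXT NOT NULL,
    flight_no  TEXT NOT NULL REFERENCES flights(flight_no)
);
""")
conn.execute("INSERT INTO flights VALUES ('QF1', 'SYD', 'LHR')")
conn.execute("INSERT INTO bookings VALUES (1, 'Sam', 'QF1')")

# SQL combines data across tables in a single query.
row = conn.execute("""
    SELECT b.passenger, f.origin, f.dest
    FROM bookings b JOIN flights f ON b.flight_no = f.flight_no
""").fetchone()
print(row)  # ('Sam', 'SYD', 'LHR')

# The database rejects a booking for a flight that does not exist:
# "this relationship must be valid".
try:
    conn.execute("INSERT INTO bookings VALUES (2, 'Alex', 'ZZ9')")
except sqlite3.IntegrityError:
    print("rejected: unknown flight")
```

The join and the rejected insert together show the two properties mentioned above: flexible querying across tables, and rules enforced by the database itself.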

They can be a good fit for MCP architectures because a lot of the enterprise data you want to act on is naturally stored as records like the examples above.

Relational databases are usually not the best default for long-form, unstructured knowledge, especially when your goal is search and retrieval. For example:

  • Fare policy documents
  • IRROPS guidelines
  • The free-text history of a customer service case, where you want to search the body of the case itself (for example, find all cases where the passenger complained about the food)

For these, teams often pair relational data with a document store, a search index, or a RAG layer, depending on what they are trying to do.


Document Databases

A great fit for structured data in specific use cases, a poor fit for searching and reasoning over unstructured data

A document database is like a room of filing cabinets. You store documents (each one can contain whatever you want) inside collections (folders). The archivist is not strict about consistency. Two documents in the same collection can have different fields and structures. That flexibility is powerful, but it also means things can get messy if you do not define some rules and put thought into how you organise the filing system.

Document databases are a strong choice when you need to:

  • Return data in a predictable shape that an application can use immediately, with minimal extra work

  • Serve data directly to an app or UI, and do it quickly

  • Handle data that changes often. Some document databases (like Firebase) make it easy for applications to subscribe to changes rather than constantly polling for updates

A simple example: our airline’s app needs to display the “most relevant content” for each customer. A document database can quickly return a single customer document, already shaped as the app expects. Relational databases can do this too, but you often end up stitching together rows from multiple tables and transforming the result into the format your application wants, which slows the data pipeline down and makes the app more complex to develop and maintain. I often use a document database (like Firestore) in my personal projects for this reason.
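As a sketch, here is what such a customer document might look like as plain JSON-style data. The field names, and the Firestore-style read mentioned in the comment, are assumptions for illustration:

```python
import json

# A single "customer" document, already shaped the way the app's home
# screen needs it (field names are made up). In a store like Firestore
# this would typically be a single document read, e.g.
# db.collection("customers").document("C123").get().
customer_doc = {
    "customer_id": "C123",
    "name": "Sam",
    "tier": "Gold",
    "next_flight": {"flight_no": "QF1", "departs": "2025-05-25T09:40"},
    "relevant_content": [
        {"title": "Lounge access on your next trip"},
        {"title": "Upgrade with points"},
    ],
}

# The app can render this directly: no joins, no reshaping.
print(json.dumps(customer_doc["next_flight"], indent=2))
```

One read returns everything the screen needs, which is exactly the “predictable shape, served quickly” property described above.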

Document databases can be very fast when the data you need is co-located in one document, plus any nested data inside it (think multiple pages stapled together). The tradeoff shows up when you want to query across lots of documents in different ways. For example:

  • Find me all the bookings Sam made in 2025

  • Find me all the passengers who travelled on QF4401 on 4 February

To make both of those fast, you need to design your data model and indexes around the questions you plan to ask. Do you store bookings under the passenger, under the flight, or in both places? In practice, document databases often push you toward designing for a few key access patterns, and sometimes duplicating data to keep reads fast.
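A minimal sketch of that duplication, using plain Python dictionaries to stand in for document collections (all names and values are made up):

```python
# One booking, written to two "collections" so that both access
# patterns become a single fast read.
booking = {
    "booking_id": 1,
    "passenger": "Sam",
    "flight_no": "QF4401",
    "date": "2025-02-04",
}

# Access pattern 1: "all of Sam's bookings" -- keyed by passenger.
bookings_by_passenger = {"Sam": [booking]}

# Access pattern 2: "everyone on QF4401 on 4 February" -- keyed by
# flight and date.
bookings_by_flight = {("QF4401", "2025-02-04"): [booking]}

# Each question is now one lookup, at the cost of writing (and keeping
# consistent) two copies of the same record.
print(bookings_by_passenger["Sam"][0]["booking_id"])
print(len(bookings_by_flight[("QF4401", "2025-02-04")]))
```

Both lookups are cheap, but every new booking must now be written in two places, which is the trade-off the paragraph above describes.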

Relational databases are built for a different approach. Because the data is normalised (each fact exists in only one place) and linked, it is usually easier to explore it dynamically with new questions later.


Vector Databases

A poor fit for structured data, a great fit for unstructured data, especially when searching and reasoning

This is a broad simplification of how vector databases and embedding models work. It's intended to give you a baseline understanding and a mental model to work from. We'll dive into some of the specifics in future articles.

This time, the room has no filing cabinets or record drawers. It also has many more dimensions than the three we can imagine - hundreds or thousands. Fragments of information about a business (e.g., our airline's operating procedures) are represented as vectors and stored in the ‘room’. This is done by ‘chunking’ the information: splitting it into fragments, whether manually or at natural boundaries such as subheadings.

Our archivist calls an embedding model to map text to numbers. These numbers are the embedding vector for that fragment. The vector can be thought of as the fragment's coordinates (but along hundreds or thousands of dimensions, rather than the three we can easily visualise). Fragments that cover similar subject matter - even if the actual words are different - tend to have vectors that are close together.

Example: fragments about the airline's rebooking policy and its missed-connection FAQ would end up with similar vectors, even if they share few or no words.

At retrieval time, your system will compute a vector from the words in your query (using the same embedding model) and retrieve the nearest neighbours (i.e., the closest fragments by distance/similarity).

Example: a query containing the words "Weather-based cancellations" would likely map to a vector close enough to our "Rebooking Policy" and "Flight Disruption FAQ" fragments for those to be returned in the search results. Conversely, vector databases are not a good choice when absolute precision or very high retrieval speed is needed (e.g., returning details of Sam’s flight booking from 25th May 2025).
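Here is a toy sketch of the retrieval step in pure Python. The four-dimensional "embeddings" below are hand-made numbers chosen purely to illustrate nearest-neighbour lookup; real embedding models produce the vectors for you, with hundreds or thousands of dimensions:

```python
import math

# Hand-made toy "embeddings" for three fragments. The two
# disruption-related fragments are deliberately given similar vectors.
fragments = {
    "Rebooking Policy":      [0.9, 0.8, 0.1, 0.0],
    "Flight Disruption FAQ": [0.8, 0.9, 0.2, 0.1],
    "Inflight Meals Guide":  [0.0, 0.1, 0.9, 0.8],
}

def cosine_similarity(a, b):
    """Similarity of two vectors by the angle between them (1.0 = identical direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# At query time, the query text is embedded with the same model; this
# vector stands in for the embedding of "Weather-based cancellations".
query = [0.85, 0.85, 0.15, 0.05]

# Rank fragments by similarity and keep the nearest neighbours.
ranked = sorted(fragments, key=lambda k: cosine_similarity(query, fragments[k]),
                reverse=True)
print(ranked[:2])  # the two disruption-related fragments rank highest
```

A production system would use an approximate nearest-neighbour index rather than a full scan, but the principle is the same: embed the query, then return the fragments whose vectors are closest.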

The power of vector databases is the ability to handle information across different structures and formats - like our flight disruption FAQ (a long-form document) and delay compensation tiers (a text representation of a spreadsheet) - and still retrieve relevant information by semantic similarity rather than exact keywords or rigid schema queries.