RAG & MCP 101

7 min read

A quick 101 on RAG & MCP, two key concepts for this series

RAG and MCP are two of the most misunderstood terms in modern LLM architecture. I’ve even heard people describe MCP as a replacement for RAG, which is usually the wrong framing. Both are important and are often used together to achieve optimal outcomes in a complex business environment.

Remember

RAG retrieves unstructured knowledge (documents); MCP invokes structured capabilities (tools like databases). Many real deployments use both.

So What Does a RAG & MCP Flow Look Like?

Here’s one we might all be able to relate to on some level!

Example

My flight is cancelled, and I’m Gold!! What are my options today?

  • RAG retrieves the disruption policy + Gold entitlements + partner rebooking rules (unstructured text, cited).

  • MCP tools fetch the customer’s tier, booking, and real-time inventory (structured systems).

  • The LLM combines both: it explains the policy and proposes specific next flights, with citations for the policy portion.
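To make that flow concrete, here is a minimal orchestration sketch. It is deliberately framework-free: retrieve_policy_chunks, call_tool, and the airline tools are hypothetical placeholders standing in for a RAG retriever and an MCP client, with the real systems stubbed out.

# Hypothetical orchestration of the "cancelled flight" request.
# retrieve_policy_chunks and call_tool are stand-ins for a RAG
# retriever and an MCP client; the underlying systems are stubbed.

def retrieve_policy_chunks(query: str) -> list[dict]:
    # RAG: return cited policy passages (stubbed).
    return [{"text": "Gold members may be rebooked on partner airlines...",
             "source": "disruption_policy.pdf", "section": "4.2"}]

def call_tool(name: str, **params) -> dict:
    # MCP: invoke a structured tool via the MCP server (stubbed).
    stub_results = {
        "get_customer": {"tier": "Gold", "booking_ref": "ABC123"},
        "search_flights": {"options": ["BA117 18:05", "BA179 20:30"]},
    }
    return stub_results[name]

def handle_disruption(user_query: str) -> dict:
    policy = retrieve_policy_chunks(user_query)                    # unstructured, cited
    customer = call_tool("get_customer", email="sam@example.com")  # structured lookup
    flights = call_tool("search_flights", booking_ref=customer["booking_ref"])
    # The LLM is then prompted with policy + customer + flights and asked
    # to draft a grounded answer, citing the policy chunks.
    return {"policy": policy, "customer": customer, "flights": flights}

print(handle_disruption("My flight is cancelled, and I'm Gold! What are my options today?"))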


RAG 101

Retrieval-Augmented Generation

RAG is a pattern in which the system retrieves relevant source passages and supplies them to the LLM so it can answer, grounded in that text (often with citations).

Example

Airline policies in PDFs (“What do I do if my flight is cancelled due to bad weather?”)

RAG is right for “I need to answer using text sources (policies/docs) and cite them.”

RAG is not sufficient for “I need exact computed values or to enforce rules deterministically.” (“I need Sam’s loyalty balance”)

Here are the main concepts of RAG:

Source Documents

These are typically unstructured documents (PDFs, knowledge bases, internal wikis) that contain information on various aspects of the business.

Chunking

A good chunking strategy is crucial to getting everything else here right. Chunking is the process of breaking these documents into logical, digestible parts, each of which carries its own meaning. The content of a chunk (excluding metadata) normally falls within a target length range, measured in tokens. Each chunk is stored as text plus metadata (generated or identified during the chunking process) to support retrieval, filtering/ranking, and citation. Chunks also overlap so that important context is not lost at the boundaries: a rebooking policy might be split over dozens of chunks, each overlapping its neighbours by a percentage so the LLM can construct a cohesive response.

Chunking a document
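As an illustration of the idea, here is a minimal fixed-size chunker with overlap. Splitting on whitespace stands in for a real tokenizer, and the chunk size and overlap values are arbitrary examples, not recommendations.

# Minimal sketch of fixed-size chunking with overlap.
# Whitespace splitting stands in for a real tokenizer; the chunk size
# and overlap below are illustrative values only.

def chunk_text(text: str, chunk_size: int = 200, overlap: int = 40) -> list[dict]:
    words = text.split()
    chunks, start, idx = [], 0, 0
    while start < len(words):
        window = words[start:start + chunk_size]
        chunks.append({
            "id": f"chunk-{idx}",
            "text": " ".join(window),
            # Metadata captured at chunking time supports filtering,
            # ranking, and citation later on.
            "metadata": {"source": "rebooking_policy.pdf", "start_word": start},
        })
        start += chunk_size - overlap   # step forward, keeping an overlapping tail
        idx += 1
    return chunks

policy_text = "If your flight is cancelled due to bad weather ... " * 500
print(len(chunk_text(policy_text)), "chunks")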

Embedding

The process of taking each chunk and producing a vector from it using an embedding model, and then writing this to a vector database. OpenAI, Voyage AI, and many others offer embedding models.

The vector produced by an embedding model is a very long series of numbers: it is high-dimensional (hundreds or thousands of values). It represents the chunk’s meaning in a way that makes similarity measurable: chunks about similar concepts land “near” each other in the vector database, or to put it another way, chunks with similar meaning or subject matter are assigned similar vectors.

Embedding a chunk (e.g. mapping it to a vector like [0.5, -0.0, -0.9])

An embedding model might consider chunks about:

  • Baggage allowances
  • Excess luggage
  • Oversized sports equipment
  • Prohibited items in the hold

to all be similar and assign them similar vectors.

The chunks (passages of text from the source documents) are sometimes kept in a separate datastore and can be looked up using data returned from the vector database query. This is explained below.
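To ground this, here is a sketch of the indexing step: embed each chunk, keep the vectors in a vector index, and keep the chunk text and metadata in a separate store keyed by chunk ID. It uses the sentence-transformers library and an open embedding model purely as an example; hosted embedding APIs (OpenAI, Voyage AI, etc.) follow the same shape, and plain dicts stand in for the vector database and the datastore.

# Indexing sketch: embed chunks and store vectors separately from chunk text.
# sentence-transformers is used as an example embedding model; plain dicts
# stand in for the vector database and the chunk datastore.

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")   # example open embedding model

chunks = [
    {"id": "c1", "text": "Each passenger may check one bag of up to 23 kg.",
     "metadata": {"doc": "baggage_policy.pdf", "section": "Allowances"}},
    {"id": "c2", "text": "Snowboards and skis travel as oversized sports equipment.",
     "metadata": {"doc": "baggage_policy.pdf", "section": "Sports equipment"}},
]

vectors = model.encode([c["text"] for c in chunks])   # one high-dimensional vector per chunk

vector_index = {c["id"]: v for c, v in zip(chunks, vectors)}   # "vector database"
chunk_store = {c["id"]: c for c in chunks}                     # separate chunk datastore

print(len(vectors[0]), "dimensions per embedding")   # 384 for this particular model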

Retrieval

Note

In the diagram below, "Application" means everything the user interacts with — the LLM, the connections to back-end services, and the orchestration between them. This is a simplification of how the LLM plays a role in this flow. In reality, there is more to unpack here.

Retrieval is how you use that index at question time.

  1. Take the user’s query (e.g. “Checking in snowboards”).

  2. Embed the query using the same embedding model.

  3. Run a nearest-neighbour search to fetch the top-K most similar chunk vectors.

  4. Return these to the Application.

  5. Query the Datastore and return the corresponding chunks.

  6. Return those chunks (often with metadata like source doc & section) to the LLM (part of our Application) to answer from.

    Retrieval: the nearest-neighbour search returns the vectors most similar to our query embedding (e.g. [0.5, -0.1, 0.1], [0.9, -0.8, -0.8], [0.2, -1.0, -0.3]); each corresponds to a chunk in the datastore.

In this example, retrieval would likely return chunks about oversized sports equipment and baggage allowances.
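Here is the same sketch at question time, assuming an index built as above. Brute-force cosine similarity over a dict stands in for a real vector database’s nearest-neighbour search, and the chunks are hypothetical.

# Retrieval sketch: embed the query with the same model, rank chunk
# vectors by cosine similarity, and look the winners up in the datastore.

import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

chunk_store = {
    "c1": {"text": "Each passenger may check one bag of up to 23 kg.", "doc": "baggage_policy.pdf"},
    "c2": {"text": "Snowboards and skis travel as oversized sports equipment.", "doc": "baggage_policy.pdf"},
    "c3": {"text": "Lounge access is available to Gold members.", "doc": "loyalty_terms.pdf"},
}
vector_index = {cid: model.encode(c["text"]) for cid, c in chunk_store.items()}

def retrieve(query: str, k: int = 2) -> list[dict]:
    q = model.encode(query)   # same embedding model as at indexing time
    def cosine(v):
        return float(np.dot(q, v) / (np.linalg.norm(q) * np.linalg.norm(v)))
    ranked = sorted(vector_index, key=lambda cid: cosine(vector_index[cid]), reverse=True)
    return [chunk_store[cid] for cid in ranked[:k]]   # top-K chunks go to the LLM with their metadata

print(retrieve("Checking in snowboards"))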

A first step in optimising RAG is tuning retrieval quality by running test questions, inspecting the top-K chunks returned, and measuring recall and precision.

Remember

  • Recall: Did we pull all the relevant chunks? Did we miss anything?
  • Precision: Did we avoid pulling in irrelevant chunks?
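As a sketch, given a test question with hand-labelled relevant chunk IDs, the two scores are straightforward to compute:

# Toy evaluation: compare the retriever's output against a hand-labelled
# set of relevant chunk ids for one test question.

def recall_precision(retrieved: set[str], relevant: set[str]) -> tuple[float, float]:
    hits = retrieved & relevant
    recall = len(hits) / len(relevant) if relevant else 0.0       # did we pull all the relevant chunks?
    precision = len(hits) / len(retrieved) if retrieved else 0.0  # did we avoid irrelevant ones?
    return recall, precision

# Labeller says c2 and c5 answer the question; the retriever returned c2, c7, c9.
print(recall_precision({"c2", "c7", "c9"}, {"c2", "c5"}))   # (0.5, 0.3333...)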

MCP 101

Model Context Protocol

MCP is a protocol (or a standard) for exposing tools (APIs, database queries, internal services) to an agent or application using an LLM. The model decides what to call and which parameters to send, but the underlying system does the work and returns a structured result.

MCP is right for “I need to fetch/compute/act via specific systems (DBs, APIs) in a standardised way.”
(“I need Sam’s loyalty balance”, “Search flights”, “Create a rebooking case.”)

MCP is not the best solution for “I need to construct an answer to a question using context held in natural-language business documents” (case notes, PDFs, wikis, etc.) and cite it. That is a RAG problem: the focus is on retrieving the right passages and grounding the answer in them, not on invoking (calling) a downstream system.

MCP Server

This is the core component of an MCP pattern. It avoids the traditional complexity of creating numerous bespoke integrations for every client or agent. It acts as a tool gateway and publishes a consistent catalogue of tools and their schemas to any MCP-capable client (e.g., an agent using an LLM). The MCP server will manage the calls to the underlying systems and normalise (translate) the inputs and outputs.

In practice, most teams also gate “write” actions (rebooking, refunds, case creation): the model can propose a call, but the application controls execution for safety and auditability.
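For a flavour of what this looks like in code, here is a minimal sketch using the FastMCP helper from the official MCP Python SDK. The airline tools, their fields, and the data they return are hypothetical placeholders; a real server would call the underlying loyalty and booking systems.

# Minimal MCP server sketch exposing two read-only airline tools.
# Uses the FastMCP helper from the official Python SDK; the tools and
# the data they return here are hypothetical placeholders.

from mcp.server.fastmcp import FastMCP

mcp = FastMCP("airline-tools")

@mcp.tool()
def get_loyalty_balance(customer_id: str) -> dict:
    """Return the customer's loyalty tier and points balance."""
    # A real implementation would query the loyalty system here.
    return {"customer_id": customer_id, "tier": "Gold", "points": 48200}

@mcp.tool()
def search_flights(origin: str, destination: str, date: str) -> list[dict]:
    """Search available flights for a route and date."""
    return [{"flight": "BA117", "departs": f"{date}T18:05", "seats_left": 4}]

if __name__ == "__main__":
    mcp.run()   # publishes the tool catalogue to any MCP-capable client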

Example

  • Sam says, “My flight is cancelled, and I am Gold - Book me on the next flight!!”
  • The model sees the request from Sam and proposes a tool call or calls: create_rebooking(...) against the booking tool and sends Sam’s details in the call.
  • The application/agent runtime checks the proposal before it is executed:
    • Is this user authorised?
    • Do the parameters look sane? (Is this the correct ‘Sam’? Does the policy allow this?)
    • Where should we audit/log it?
    • Should there be a human approval?
  • Only when these checks have passed do the real system(s) get called.
  • A result comes back {rebooking approved} (as structured data) to be explained to the user (in plain English) by the model.
  • This is how teams keep tool use safe, auditable, and deterministic, especially for write actions.
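A sketch of that application-side gate might look something like this; every name here (WRITE_TOOLS, call_mcp_tool, the individual checks) is hypothetical, and real policy logic would be far richer.

# Hypothetical gate: the model proposes a tool call, but the application
# runtime validates, audits, and (optionally) escalates before executing it.

WRITE_TOOLS = {"create_rebooking"}   # actions that change state

def execute_proposed_call(proposal: dict, user: dict) -> dict:
    tool, params = proposal["tool"], proposal["params"]

    if tool in WRITE_TOOLS:
        if not user.get("authenticated"):
            return {"status": "rejected", "reason": "user not authenticated"}
        if params.get("customer_id") != user.get("customer_id"):
            return {"status": "rejected", "reason": "booking belongs to a different customer"}
        audit_log(user, tool, params)                 # always record write attempts
        if requires_human_approval(tool, params):
            return {"status": "pending_human_approval"}

    return call_mcp_tool(tool, params)                # only now touch the real system

def audit_log(user: dict, tool: str, params: dict) -> None:
    print(f"AUDIT: {user['customer_id']} -> {tool}({params})")

def requires_human_approval(tool: str, params: dict) -> bool:
    return params.get("fare_difference", 0) > 0       # e.g. anything that costs money

def call_mcp_tool(tool: str, params: dict) -> dict:
    return {"status": "rebooking_approved"}           # stubbed MCP call

proposal = {"tool": "create_rebooking", "params": {"customer_id": "sam-42", "flight": "BA117"}}
print(execute_proposed_call(proposal, {"authenticated": True, "customer_id": "sam-42"}))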

Note

MCP servers are specifically for LLM-to-tool use cases. Integration layers, GraphQL, APIs, and more are still the right choice for many enterprise architecture use cases.

Without an MCP Server

You might have your bookings database, loyalty database, Customer Data Platform, CRM, and more, all directly integrated into each application or every client/agent. A change to a model, or the addition of a new feature to one of your client applications (like a booking system), requires complex and often painful decoupling and recoupling of systems.

With an MCP Server

You have a consistent way to describe and expose business systems to LLMs:

  • What tools exist (e.g., get_loyalty_balance, search_flights, create_rebooking_case)
  • What inputs they take (a typed schema)
  • What they return (structured outputs)
  • How a client should call them (a predictable request/response shape)
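In practice that catalogue is just data the server publishes. A rough, hypothetical sketch of a single entry, written here as a Python dict, might look like this:

# Hypothetical sketch of one entry in the published tool catalogue:
# a name, a typed input schema, and a structured result shape.

get_loyalty_balance_tool = {
    "name": "get_loyalty_balance",
    "description": "Return a customer's loyalty tier and points balance.",
    "inputSchema": {                                   # JSON Schema for the inputs
        "type": "object",
        "properties": {"customer_id": {"type": "string"}},
        "required": ["customer_id"],
    },
    # A client calls the tool with matching arguments and receives structured
    # JSON back, e.g. {"tier": "Gold", "points": 48200}.
}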

Now your integrations are portable. You can swap models or client applications without rewriting the entire ‘tool layer’ because the tool interface is standardised behind MCP.