Top Trumps: LLM Edition


Local models are printing impressive benchmark scores against models ten times their size. Let's read the card and see what happens when you actually play it.


Setting Up the Game

Qwen3.5 has reignited the question of whether local models can replace an OpenAI or Anthropic subscription - both at an individual and an enterprise level. Its models are posting impressive scores in LLM benchmarks, beating models many times their size - and, in some cases, frontier models. However, as we will discuss here, model benchmarks and attributes cannot be taken at face value.

In Build vs Buy: The House of Mirrors, I concluded that Build was probably not worth it unless you were planning to burn serious tokens over a sustained period. I ended the article by discussing some recursive patterns in model usage (combining local and Frontier models across a small selection of patterns, each with a slightly different endgame).

I performed some structured experiments to explore two things:

  1. Local model performance. Is it worth running locally? What are the constraints? What are the real costs?

  2. Worker-Judge Architecture. What does a simplistic agentic loop look like? How do you decompose a task, evaluate output, and identify failure?

These will be explored over two separate articles - this one focusing on the local model performance.

This article will talk about the Qwen family of models - mainly the new 3.5 generation.


TL;DR

Qwen3.5-9B benchmarks impressively against models ten times its size. How practical is it for real use cases today? As we'll find out, putting it to work in a real-world test tells a different story.

This article covers:

  • What the numbers in a model name actually tell you
  • How quantisation lets large models run on modest hardware
  • How attention works, and why context quality matters as much as context length

Part 2 covers the Worker-Judge architecture and the cost of getting the job done when the model can't do it alone.


What is Qwen?

A quick recap: Qwen is a family of models developed by Alibaba, a Chinese retail and computing giant. For open-weights models, they are quite impressive. They come in several flavours, ranging from 0.7 billion parameters to 397 billion.

Remember

In the model name (for example, Qwen3.5-27B), 27B denotes the number of parameters (or weights) it has.

We'll run tests with a model that can actually be run on consumer hardware: Qwen3.5-9B. The dense variants include:

  • Qwen3.5-27B
  • Qwen3.5-9B
  • Qwen3.5-4B
  • Qwen3.5-2B
  • Qwen3.5-0.8B

The larger models have an interesting feature known as 'Mixture of Experts' (or MoE): they do not need to activate all their parameters at once. For a given task, they deploy only a fraction of them. This is shown in model names like:

  • Qwen3.5-397B-A17B
  • Qwen3.5-122B-A10B
  • Qwen3.5-35B-A3B

Remember

A17B indicates that 17 billion parameters are activated at once. Activating only a small number of parameters per model call greatly reduces the time it takes to return a response.

Understanding Qwen Model Names

The model's name Qwen3.5-9B-q8_0 describes quite a few of its properties:

  • Qwen - The name of the model - developed by Alibaba, a Chinese retail and computing giant.
  • 3.5 - The generation of the model. The 3.5 generation was released in February 2026.
  • 9B - The number of parameters in the model. (9 Billion).
  • q8 - 8-bit quantised. This is the size of each model parameter in bits (eight 1's and 0's). The q is significant: it marks a quantised model, whose size has been reduced at the expense of accuracy in certain scenarios. This is explained below, but in short, quantisation allows us to fit large models onto modest hardware, and 8-bit quantisation is often a worthwhile trade-off with minimal impact on most tasks.
  • 0 - The kind of quantisation being utilised. (Known as the quantisation scheme).
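The naming convention above can be decoded mechanically. Here is a small sketch (the regex and field names are my own illustration, not an official Qwen specification):

```python
import re

def parse_model_name(name: str) -> dict:
    """Split a name like 'Qwen3.5-9B-q8_0' into its component parts."""
    m = re.match(
        r"(?P<family>[A-Za-z]+)"                   # model family, e.g. Qwen
        r"(?P<generation>[\d.]+)"                  # generation, e.g. 3.5
        r"-(?P<params>[\d.]+)B"                    # total parameters (billions)
        r"(?:-A(?P<active>[\d.]+)B)?"              # optional MoE active parameters
        r"(?:-q(?P<bits>\d+)_(?P<scheme>\d+))?",   # optional quantisation info
        name,
    )
    if not m:
        raise ValueError(f"Unrecognised model name: {name}")
    return m.groupdict()

print(parse_model_name("Qwen3.5-9B-q8_0"))
print(parse_model_name("Qwen3.5-397B-A17B"))
```

Running this shows how the dense and MoE naming styles map onto the same fields: the MoE name fills in `active`, the quantised name fills in `bits` and `scheme`.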

Models By the Numbers

Parameters define how much the model can represent. Quantisation affects how precisely that information is stored. Attention determines how effectively that information is used during inference. We'll cover all three in detail.

Parameters

How much data the model can represent

Using an analogy from another classic board game, Risk: each army piece is a parameter.

In Risk, a larger army gives you more potential, but it doesn’t guarantee victory. If those forces are poorly deployed or uncoordinated, a smaller, better-positioned opponent can win. In the same way, more parameters increase a model’s capacity, but performance depends just as much on how that capacity was trained, structured, and used at inference time.

Practically, parameters are:

  • Weights and biases learned during training. They encode the patterns the model has absorbed, such as language structure, relationships, and behaviour.

  • Represented as floating-point or integer values (More on this below).

  • Most models will have billions of parameters, with the largest commercial ones potentially having over a trillion.

Why isn't more always better?

Parameter count is only one part of the story, but it's the 'Top Trumps stat' that gets talked about. Over the last few years, models have trended larger in absolute parameter count. However, this creates its own problems: larger models tend to:

  • Require more data and more training.
  • Respond more slowly unless they activate only a fraction of their parameters at once.
  • Require more powerful hardware to operate.

If I have 16 GB of memory and a 20B parameter model at FP8, I will not be able to 'fit' it on my hardware without taking steps like quantisation. This can be a real problem even for enterprises, where a 397B model requires a whole stack of very expensive GPUs just to run.

With that said, the parameter count can influence several things that are key to model performance:

  • Diversity: Can a model solve a wide range of problems?
  • Reasoning: Can a model reason its way to solving a complex problem?
  • Interpretation: Can a model understand your prompt - pick up the subtleties of language, break several questions into individual units of 'work'?
  • Responses: Can a model answer you coherently? Know which tools to call to answer your query? Know when it's wrong or fact-check itself?

Thinking back to our Risk analogy, the real question is how you utilise your assets to make your model 'smart'.

So what makes a model 'smarter'?

  • Training data: the quality, diversity, and relevance of the data the model learns from.

  • Architecture: how the model processes and routes information (e.g. dense vs Mixture of Experts).

  • Training process: how the model is optimised (pre-training, fine-tuning, alignment).

  • Inference-time behaviour: how the model is used - prompt design, context size, and how much noise it has to deal with.

This is why a well-trained 9B model can outperform a much larger one on specific tasks. It’s not just about how many parameters you have, but how effectively they are used.

Quantisation

A trade off between precision and model size

Earlier we touched on how we might fit a model onto a memory-constrained GPU. Quantisation is the usual tool for trading precision against model size and speed. 'Bits' here refers to the 1's and 0's used to represent each parameter in the model.

Remember

Quantisation allows us to fit large models into a smaller memory footprint with some (usually minimal) loss of 'quality'.

A 4-bit quantised version of a 20B model needs 10 GB of memory for its weights, which may allow it to fit on a consumer-grade GPU with 16 GB of memory. A 4-bit quantised 100B model needs 50 GB of memory for its weights, which may fit on an enterprise-grade NVIDIA H100 GPU with 80 GB of memory.
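The arithmetic behind these figures is simple: bytes = parameters × bits ÷ 8. A quick helper (weights only, ignoring runtime overhead such as the KV cache):

```python
def weight_memory_gb(params_billions: float, bits_per_weight: int) -> float:
    """Memory needed for model weights alone, in GB (1 GB = 1e9 bytes)."""
    bytes_needed = params_billions * 1e9 * bits_per_weight / 8
    return bytes_needed / 1e9

print(weight_memory_gb(20, 4))   # 4-bit 20B model  -> 10.0 GB
print(weight_memory_gb(100, 4))  # 4-bit 100B model -> 50.0 GB
print(weight_memory_gb(9, 8))    # q8 9B model      -> 9.0 GB
```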

To better understand what happens during 8-bit quantisation, consider the table of values below. Each value starts life as a more precise number, such as BF16 or FP32 (a number with decimal places, represented by 16 or 32 1's and 0's), and becomes an INT8 (a whole number represented by 8 bits). Because you cannot express every 16-bit value using just 8 bits, some 'fitting' occurs. The process is:

  • Multiply by 100 (for FP16).
  • Rounding may occur to convert to an INT8 number (as these can only be whole values).
  • FP16 weight is now stored as INT8 and represented by 50% fewer bits (potentially with some error).
  • Model weights now consume half as much GPU memory at the cost of some accuracy.
| FP16 weight | ×100 | Stored as INT8 | Reconstructed (×0.01) | Error | Error % |
|---|---|---|---|---|---|
| 1.2700 | 127.0 | 127 | 1.2700 | 0.0000 | 0% |
| 0.8350 | 83.5 | 84 | 0.8400 | +0.0050 | 0.6% |
| 0.3210 | 32.1 | 32 | 0.3200 | 0.0010 | 0.31% |
| 0.0050 | 0.5 | 1 | 0.0100 | 0.0050 | 100% |
| 0.0000 | 0.0 | 0 | 0.0000 | 0.0000 | n/a (0÷0) |
| -0.1580 | -15.8 | -16 | -0.1600 | 0.0020 | 1.27% |
| -0.7730 | -77.3 | -77 | -0.7700 | 0.0030 | 0.39% |
| -1.1240 | -112.4 | -112 | -1.1200 | 0.0040 | 0.36% |

The table above is a large simplification of quantisation, but it is representative of what happens:

  • One of the values shows an error of 100%
  • However, near-zero values carry very little signal. The absolute error (0.005) is more important than the percentage.
  • Conversion to an integer (a whole number) actually favours large values, which carry more signal anyway.
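The round-trip in the table can be reproduced in a few lines. This uses the same simplified ×100 scale as above (real schemes compute per-block scale factors), with half-away-from-zero rounding to match the table's convention:

```python
import math

def round_half_away(x: float) -> int:
    """Round half away from zero (the convention the table above uses)."""
    return int(math.copysign(math.floor(abs(x) + 0.5), x))

def quantise_int8(weight: float, scale: float = 100.0):
    """Round-trip one FP16 weight through INT8 storage and back."""
    stored = max(-128, min(127, round_half_away(weight * scale)))  # quantise + clamp
    reconstructed = stored / scale                                  # dequantise
    return stored, reconstructed, reconstructed - weight

for w in (1.2700, 0.8350, 0.0050, -0.1580):
    stored, recon, err = quantise_int8(w)
    print(f"{w:+.4f} -> {stored:+4d} -> {recon:+.4f} (error {err:+.4f})")
```

Note the clamp to [-128, 127]: anything outside the INT8 range is pinned to the boundary, which is why quantisation schemes choose the scale factor carefully.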

Broadly, this means the model:

  • Was trained at a higher precision (at least 16 bits, so twice as many 1's and 0's per parameter - think more decimal places per number)
  • Its size was then halved or quartered (depending on whether it was originally 16 or 32 bits) to make it fit on modest hardware (like mine) using quantisation. Think of this as fitting each weight to the nearest available value in a smaller set.

The outcome is a slight loss in accuracy, as many parameters would have been rounded up or down (since they are now represented with fewer bits), but much of the capability is preserved.

Because each of these parameters is one of billions in the model, the impact of each individual error is tiny (and errors fall both above and below zero). If the majority of errors rounded only up or only down, you would see a compounding effect and the model's accuracy would drop off sharply. The quantisation scheme matters here: it is designed to minimise systematic bias (by design, some errors round up and others down), although the error distribution never cancels out perfectly.
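You can see this cancellation empirically. A quick simulation using synthetic uniform weights (not real model weights) and the same simplified nearest-1/100 scheme:

```python
import random

random.seed(42)
weights = [random.uniform(-1.0, 1.0) for _ in range(100_000)]

# Quantise each weight to the nearest 1/100 and record the error introduced
errors = [round(w * 100) / 100 - w for w in weights]

mean_error = sum(errors) / len(errors)
max_abs_error = max(abs(e) for e in errors)
print(f"mean error:  {mean_error:+.6f}")   # very close to zero: errors cancel
print(f"max |error|: {max_abs_error:.6f}") # bounded by half a step (0.005)
```

The mean error sits orders of magnitude below the worst single error: individual roundings are noisy, but in aggregate they largely cancel.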


Qwen Benchmarks: High Score

I picked Qwen3.5-9B for this exercise. I can run it locally without renting a GPU, and it has benchmarked well against much larger peers in recent tests:

  • It outperformed GPT-OSS-20B across many tests, including some math and reasoning tests, despite having less than half the number of parameters. GPT-OSS-20B was developed by a company at the forefront of LLMs (OpenAI).

  • It beat GPT-OSS-120B (another OpenAI model but with over thirteen times the parameters of our Qwen model) on knowledge benchmarks and only lost out on math and reasoning tests by narrow margins. It did fall away significantly in pure coding tests.

This model performed extremely well in certain benchmarks, and the test I devised (and will talk about later) was designed to align with these:

  • It outperformed both of the above models on long-context benchmarks - the ability to distil and use information from long prompts successfully. This suggests it suffers less from a problem seen in many other models: "lost in the middle", a phenomenon in which information at the start or end of a prompt is weighted more heavily than information in the middle.

  • On the agentic test across multiple tasks (Tau-2), it scored extremely well. This suggests it is able to complete multiple commands or steps successfully, particularly in a controlled setting. The GPT-OSS models were not benchmarked on it.

  • On conflicting or complex multi-constraint prompts (MultiChallenge) and instruction following (IFEval), it scored the highest of any model.

Where Benchmarks Fall Short

Benchmarks are a good indicator of performance, but it is vital to remember they are just that: an indicator - normally a single-shot, clean, stateless evaluation. This is not how we work in real life, particularly when you apply LLMs to real-world problems.

| Benchmarks | Ideal test |
|---|---|
| Sometimes known up-front | Unknown up-front |
| Clean | Noisy |
| Single pass | Iterative |
| Known answer | Open-ended |
| No error accumulation | Cascading failure |

Spoiler alert: We will look to devise something closer to an ideal test in part 2.


Introducing Attention

Using information effectively at inference time

As we've just seen in the benchmarks, Qwen excels at maintaining context and executing multiple commands. Attention is the mechanism behind all of this: the ability to focus on what matters in the context for each task.

Models have a finite number of tokens they can hold in context before they need to start clearing out their context window. This is the model's working memory: everything it can "see" at once.

The model's documentation tells us that Qwen3.5-9B has a native context length of 262,144 tokens, extensible up to 1,010,000. So out of the box, once the model is holding 262,144 tokens of context, the serving layer will start dropping old tokens or applying other methods to keep the context within the hard limit. For example, the serving layer of some models (like Claude) will compress older context into a summary.
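The simplest eviction strategy can be sketched as a sliding window over the token list. This is purely illustrative (not how any particular serving layer is implemented); the `pinned` prefix stands in for a system prompt that should never be evicted:

```python
def trim_context(tokens: list[int], limit: int, pinned: int = 0) -> list[int]:
    """Naive sliding-window eviction: keep the first `pinned` tokens
    (e.g. the system prompt) plus the most recent tokens that still fit."""
    if len(tokens) <= limit:
        return tokens
    return tokens[:pinned] + tokens[-(limit - pinned):]

history = list(range(300_000))  # stand-in token ids for a long session
window = trim_context(history, limit=262_144, pinned=1_024)
print(len(window), window[0], window[-1])
```

Everything between the pinned prefix and the recent window is simply gone, which is exactly why long sessions can "forget" their middle.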

A model will have a certain number of attention 'heads' to evaluate these tokens - think of these like a bunch of workers looking across the context, each with a slightly different perspective.

The same documentation tells us this about Qwen3.5-9B's attention heads: Number of Linear Attention Heads: 32 for V and 16 for QK

Models like Qwen have many layers of attention. Put simply, the above numbers explain:

  • How many heads are looking at the tokens (Query, Keys)
  • How many heads are evaluating and forwarding the output (Value) for the next layer.

This is significant because as the context fills up, those heads must evaluate relationships across an increasing number of tokens, making it harder to isolate the most relevant signals.

Imagine each head has a finite distribution of attention it can give. Some tokens will get less attention, and some more, with the head giving more attention to things seen as more 'important' for the current context. If the number of tokens doubles, the scope for additional noisy context increases, potentially making it harder for the head to distinguish signal from noise, and potentially less weight being given to important tokens. This is part of the reason models seem to 'forget' things or sometimes answer in a way that does not make sense in the context of previous messages in that session.

As context grows, the model must evaluate more tokens.
This increases noise, making it harder to consistently prioritise the most relevant information.
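The dilution effect is easy to demonstrate with a toy softmax: hold one 'important' token's score fixed and grow the number of competing tokens. (Illustrative only: real attention scores are learned per token, not constant.)

```python
import math

def attention_weight(key_score: float, noise_score: float, n_noise: int) -> float:
    """Softmax weight one important token receives when competing
    against n_noise identical distractor tokens."""
    important = math.exp(key_score)
    return important / (important + n_noise * math.exp(noise_score))

for n in (10, 100, 1_000, 10_000):
    print(f"{n:>6} distractors -> weight {attention_weight(4.0, 1.0, n):.4f}")
```

Even though the important token's score never changes, its share of attention collapses as the distractor count grows: the signal hasn't weakened, the competition has.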

There are different ways to optimise attention such as:

  • Grouped Query Attention (GQA): what Qwen3.5-9B's 32V/16QK arrangement already is - fewer routing heads (QK), more output heads (V).
  • Sliding Window Attention: Instead of attending to all prior tokens, each head only attends to a fixed window of recent tokens.
  • Flash Attention: A hardware-level optimisation that computes attention in blocks to avoid loading the full attention matrix into memory; this doesn't change the output, it just makes it faster and more memory-efficient.

Picking The Right Size Model

I mentioned earlier that I picked Qwen3.5-9B-q8_0. It is quite impressive for such a small model, and it is the 'right fit' for a 16 GB card. The maths works out like this:

  • A card with 16 GB of RAM has 16 billion bytes of memory

  • There are 8 bits in a byte

  • A 9B model at 8-bit quantisation needs 9 GB of memory just for the weights. Because 8 ÷ 8 = 1, we need 1 byte of RAM for each parameter.

  • If you took a 9B model that was BF16 (16 bits per parameter), you would need 18 GB of RAM on your GPU just for the weights. Because 16 ÷ 8 = 2, we need 2 bytes of RAM for each parameter.

  • You then need to reserve memory on your GPU for things other than model weights:

    • The GPU needs memory for its own operations and your operating system might reserve memory for its user interface (you can disable this).
    • You need working memory for your model - such as a K/V (Key/Value) cache and for the model's context. You can 'offload' working memory to the system's RAM, but it slows things down by an order of magnitude.
  • Taking this into account, 9B parameters (9 GB RAM) + ~7 GB for cache and runtime overhead makes it a reasonable fit on a 16 GB card.
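Those bullet points combine into a quick fit check. The overhead and KV-cache figures below are rough rules of thumb from my own setup, not measured values:

```python
def fits_on_gpu(params_b: float, bits: int, vram_gb: float,
                overhead_gb: float = 2.0, kv_cache_gb: float = 4.0) -> bool:
    """Rough check: weights + KV cache + runtime overhead vs available VRAM."""
    weights_gb = params_b * bits / 8   # billions of params x bytes per param = GB
    return weights_gb + kv_cache_gb + overhead_gb <= vram_gb

print(fits_on_gpu(9, 8, 16))    # q8 9B on a 16 GB card   -> True
print(fits_on_gpu(9, 16, 16))   # BF16 9B on a 16 GB card -> False
print(fits_on_gpu(397, 4, 256)) # FP4 397B in 256 GB unified memory -> True
```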

These rules apply equally to server-grade GPUs which might have hundreds of gigabytes of RAM.

Return of the Mac

If you're determined to set up a home lab and run your own models, Apple's Mac Mini and Mac Studio options can offer value.

The Apple 'M' series architecture (found in every laptop and desktop they sell now) offers Unified Memory - the idea that, rather than having separate system RAM and GPU RAM (like my PC), there is a single pool of memory that both the GPU and system can call upon. The tradeoff is that access to that memory is slower than for a dedicated GPU in a PC or an enterprise-grade system, so absolute throughput is lower, but it does mean you can run larger models at relatively low cost.

Apple offers models with up to 256 GB of unified memory, which theoretically allows for an enterprise-grade 397B parameter model at FP4 precision (2 parameters per byte of memory). However, for reasons discussed in my previous article, you'd have to do some very careful maths to work out whether this was really a saving over just using a Frontier model.

How To Run a Local Model

To run a model on your own hardware with reasonable performance, you will need a PC with a GPU (Graphics Processing Unit) or an Apple device with a unified memory architecture.

The fastest way (I have found) is to use Ollama. It allows you to serve models locally, and I used it to perform all tests in this article using the aforementioned Qwen3.5-9B-q8_0.

You can see an example of how to get Ollama running in the box below. However, your setup will vary depending on the combination of hardware and software you are running.

Running Ollama & Qwen on Linux/Mac

Create this file on your machine. Call it qwen-coding.modelfile

FROM qwen3.5:9b-q8_0

SYSTEM """You are a coding assistant. Output only working, complete code.
- No markdown fences unless the output is intended to be rendered
- No explanations unless explicitly asked
- No placeholders, TODOs, or ellipsis in code
- Match the style and conventions of existing code
- Do not use internal reasoning unless the problem is highly mathematical or specifically requested."""

PARAMETER num_ctx 32768
PARAMETER num_gpu 99
PARAMETER temperature 0.2
PARAMETER top_k 20
PARAMETER top_p 0.95
PARAMETER repeat_penalty 1.1
PARAMETER presence_penalty 0

  1. Linux or Mac: curl -fsSL https://ollama.com/install.sh | sh - this installs Ollama and (on Linux) sets it up as a systemd service.

  2. Verify Installation: ollama --version

  3. Start the Ollama Server: It should start automatically as a service, but you can also run it manually: ollama serve

  4. Check it's running (Linux): systemctl status ollama

  5. Pull the model: ollama pull qwen3.5:9b-q8_0

  6. Build your custom model from the Modelfile: ollama create qwen-coding -f [PATH]/qwen-coding.modelfile Replace [PATH] with the path to the modelfile above.

  7. Run it: ollama run qwen-coding
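Once the model is running, you can also call it programmatically over Ollama's local HTTP API (it listens on port 11434 by default). A minimal sketch using only the Python standard library; the prompt is just an example:

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default endpoint

def build_payload(model: str, prompt: str) -> bytes:
    """Build a non-streaming /api/generate request body."""
    return json.dumps(
        {"model": model, "prompt": prompt, "stream": False}
    ).encode("utf-8")

def ask(model: str, prompt: str) -> str:
    """Send one request to the local Ollama server and return the response text."""
    req = urllib.request.Request(
        OLLAMA_URL,
        data=build_payload(model, prompt),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]

if __name__ == "__main__":
    print(ask("qwen-coding", "Reverse a string in Python without using slicing."))
```

Setting "stream": False returns a single JSON object rather than a token-by-token stream, which keeps scripted use simple.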


This is a quick summary of each parameter we added to our modelfile.

  • num_ctx 32768: The context window size (in tokens); 32,768 tokens means the model can "see" ~25k words of conversation/code at once before older content is dropped.

  • num_gpu 99: How many GPU layers to offload; 99 is effectively "use all layers on GPU", maximising speed.

  • temperature 0.2: Controls randomness; low value (0–1 scale) makes the model more deterministic and focused: good for code generation where correctness matters more than creativity.

  • top_k 20: At each token, only consider the top 20 most probable next tokens; narrows the candidate pool to reduce wild outputs.

  • top_p 0.95: Nucleus sampling: consider only the smallest set of tokens whose cumulative probability reaches 95%; works with top_k to keep outputs sensible without being too rigid.

  • repeat_penalty 1.1: Penalises the model for repeating tokens it has already used; 1.1 lightly discourages repetition without being overly strict.

  • presence_penalty 0: No additional penalty for using tokens that have already appeared in the output; set to 0 so the model doesn't avoid re-using necessary keywords (e.g. variable names).
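The interplay of top_k and top_p is easiest to see by implementing the filtering step directly. This is a toy next-token distribution invented for illustration; in reality these probabilities come from the model:

```python
def filter_candidates(probs: dict[str, float], top_k: int, top_p: float) -> list[str]:
    """Apply top-k, then nucleus (top-p) filtering to a next-token distribution."""
    ranked = sorted(probs, key=probs.get, reverse=True)[:top_k]  # top-k cut
    kept, cumulative = [], 0.0
    for token in ranked:        # keep the smallest set reaching top_p mass
        kept.append(token)
        cumulative += probs[token]
        if cumulative >= top_p:
            break
    return kept

probs = {"return": 0.55, "print": 0.25, "yield": 0.12, "pass": 0.05, "del": 0.03}
print(filter_candidates(probs, top_k=20, top_p=0.95))
# -> ['return', 'print', 'yield', 'pass']  ('del' never makes the cut)
```

The model then samples (at the configured temperature) only from the surviving candidates, which is why these settings tame wild outputs without forcing pure greedy decoding.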

The model benchmarks at around 48 tokens/second on my AMD hardware. This is fast enough for it to 'feel like a conversation' when you are talking to it, and means I can run each of my 'Battleship' tests in just a couple of minutes.


Playing The Game

I devised a test that would probe both the strengths and weaknesses of the model: build the classic board game Battleship from a prompt. It tests several areas where the model should excel:

  • Long context (strong)
  • Instruction following (strong)
  • Coding (strong for its size, but not a match for bigger models)

Battleship is simple enough for us all to understand, but still has enough rules and complexity to make a low- to mid-size model sweat. In fact, for Qwen3.5-9B, we will need to break the game-building process into several small pieces: it is incapable of one-shotting the game, meaning it was unable to complete the task successfully from a single prompt with no additional context. Claude Sonnet managed this feat.

This goes against the benchmark narrative. On paper, Qwen3.5-9B should handle all three things Battleship requires:

  • The prompt was long
  • It required the model to follow a set of instructions that were clearly explained
  • And of course, write code.

The benchmarks showed it should handle long context and multi-step instructions well: two of the three things building Battleship requires. But benchmarks test a single shot at a well-defined task with a known answer. Building a game incrementally, where each step depends on what came before, is a different problem. The context gets both longer and noisier. Earlier code becomes distractor tokens for later steps: the model is not only solving the current step; it also has to re-evaluate the entire prior state every time.

In Part 2, we'll look at whether breaking the task into smaller pieces and adding a judge to evaluate each one can get the game built. And what that architecture costs.