
KV Quantisation: Measuring Coastlines

I ran 170 tests to try to break TurboQuant KV compression. Here's what actually failed, what didn't, and why the answer to both questions matters more than the score.

Labs · Article 5 of 11 · ~21 min read


About a year ago, Google published the TurboQuant paper on arXiv. It received much wider attention recently after Google Research highlighted it in a March 2026 blog post. Using TurboQuant, it is possible to quantise the KV cache of a model into a much smaller memory footprint. Given I am constrained by a mid-range GPU on my home machine, I was extremely excited by the thought of stuffing a lot more context into my local LLM setup, so I dropped everything, set it up, and started to test.

Over three phases and roughly 170 runs, I tried to break it. I found one failure. It was 100% reproducible. And then I fixed it.

This article is about what I found, how I found it, and more importantly, whether any of it actually matters.


The Coastline Problem

James Gleick put it brilliantly in his book Chaos when he talked about measuring the coastline of Great Britain. You can use a ruler and measure around every pebble on every beach — and eventually, after decades of doing this, you will get an answer. You can also draw a box around the UK and use the perimeter. That takes about 20 seconds. Both answers will be wildly different, and arguably both are 'wrong' by some standard or another. Whether the standard itself is correct is also subjective.

Testing LLMs sits in exactly the same place. I can measure around every pebble, run thousands of tests across every combination of context size, task type, token position, and quantisation level, and I will get an answer, and almost certainly failures. Or I can draw a box: ask the model to answer a question from a long document and see if it gets it right.

The question I kept asking myself was: at what granularity does testing actually start to matter?

The spoiler: I found a real failure at a specific quantisation level and context size. I asked the model to perform a multi-step date-arithmetic task on data buried deep in a long document. It failed 100% of the time in that configuration. And 100% of the time in a slightly different configuration, it passed.

Does this matter for real-world use? My full conclusion is at the bottom. The short answer is: probably not for most people, and I can tell you exactly why.


Why I Tested This

As I mentioned, TurboQuant is exciting! It promises real savings in KV cache size, which opens up larger models to mid-tier GPUs like mine. Why? Because large language models do not just need memory for their weights; they also need memory for the context they are reading. TurboQuant compresses that working memory (the KV cache) so a smaller GPU can hold much more context. The question is whether compressing that memory damages the model's ability to reason over long documents. If it does, that could undermine the entire idea of using extreme quantisation for the KV cache.

The Setup

The test was built around TurboQuant and a specific hardware configuration:

  • Model: Qwen3.5-35B-A3B-APEX-TQ-Compact — a 35B-parameter Mixture of Experts model that activates only 3B parameters at once. More on what that means below.

  • Hardware: AMD Radeon RX 9070 XT (16 GB VRAM), running llama.cpp via a TurboQuant fork.

Note

ngl is llama.cpp’s "number of GPU layers" setting. Higher means more of the model runs on the GPU, which is faster, but leaves less VRAM for long context.

  • GPU offload: ngl=30, meaning 30 of the model's 40 transformer layers run on the GPU and 10 run on the CPU. This frees VRAM for a larger KV cache at the cost of slower generation speed. Note: whilst I settled on ngl=30 for these tests, I run the model day-to-day at ngl=35 because it delivers much higher token throughput with only a small reduction in available KV cache.

  • KV compression: TurboQuant at K4/V3 (my runtime config) or K4/V4 (upgraded). More on this in a moment.

What is a Mixture of Experts model?

Qwen3.5-35B-A3B has 35 billion total parameters, but for each token prediction it activates only around 3 billion of them. Gate logic (or routing) inspects the incoming token and routes it to the 3B parameters most likely to be relevant; the rest sit idle. The key thing here is that the routing happens per token, not per prompt or per word.

More Details on MoE Models

For example, the gate might 'see' and 'route' tokens like this:

Token | Expert routing
SQL Query | Code/data expert
Refund policy | Customer service expert
Describe | Writing/reasoning expert
1+1 | Mathematical expert
JSON | Structured output expert

Experts are not literally labelled like this; the model learns statistical specialisations. At each layer, the router produces a score for every available expert:

expert_1: 0.02
expert_2: 0.71
expert_3: 0.11
expert_4: 0.56
...

Then the top k experts are chosen.
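Here is a minimal sketch of that selection step, assuming a plain softmax gate with top-k selection. The expert count, hidden size, and renormalisation step are illustrative, not Qwen3.5's actual router:

import numpy as np

def route_token(hidden, gate_weights, k=2):
    """Score every expert for one token, then keep the top-k."""
    logits = gate_weights @ hidden                 # one logit per expert
    z = logits - logits.max()                      # numerical stability
    scores = np.exp(z) / np.exp(z).sum()           # softmax over experts
    top_k = np.argsort(scores)[-k:][::-1]          # indices of the k best experts
    weights = scores[top_k] / scores[top_k].sum()  # renormalise the chosen experts
    return top_k, weights

rng = np.random.default_rng(0)
hidden = rng.standard_normal(64)       # toy hidden state for one token
gate = rng.standard_normal((8, 64))    # toy gate: 8 experts
experts, weights = route_token(hidden, gate)
print(experts, weights)                # two chosen experts, weights summing to 1.0

The routing really is per token: a fresh hidden state goes through the gate at every position, so consecutive tokens in the same prompt can land on different experts.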

MoE architecture gives two practical advantages:

  • In some local inference setups, MoE makes it possible to keep only part of the model (3B parameters) hot on the GPU at any moment, with the remaining weights offloaded or streamed from system memory. This can make a 35B model viable on a 16 GB GPU, depending on the model, KV cache quantisation, and other factors (there is a caveat, covered below).
  • Each inference step can be fast, because you are only processing 3B parameters' worth of computation, not 35B (via the gating and routing described above).

The tradeoff is that routing is an imperfect science. The gate does not always pick the optimal experts for a given task, and swapping parameters between VRAM and system memory adds latency. On NVIDIA hardware this penalty can be substantially reduced through CUDA's asynchronous memory transfer APIs, which allow expert weights for the next token to be pre-fetched into VRAM in parallel with the previous layer's computation; AMD's ROCm stack (as I tested it - on llama.cpp specifically) does not effectively implement the equivalent, so expert swaps are near-synchronous and the latency hit is much larger.

For reference, I can get around 45 tokens/sec on my setup when loading most of the model into VRAM (ngl=35), but this drops to about 20 tokens/sec as I increase the KV cache and move some layers onto the CPU (ngl=30).

Simulating Inference Speeds

The web version of this article includes an interactive Token Speed Simulator: pick a speed with the slider (default 45 tok/s, up to 200) and watch how responses 'feel' as they stream at that rate.

[Interactive widget not reproduced here.]


KV Cache and TurboQuant

The two most VRAM-intensive parts of a running model are the weights and the KV cache.

The KV cache is where your context lives — the document you asked the model to summarise, the codebase you want analysed, the prior conversation turns. It is measured in tokens. For scale, a 200-page novel is roughly 65–75,000 tokens.

KV cache must live in VRAM to be useful. Reading from system memory is orders of magnitude slower. This means when sizing a local model, you are constantly trading weight storage against context capacity.

Note

On 16 GB VRAM with a 4-bit quantised 35B-A3B model and ngl=30: standard (f16) KV cache supports roughly 12–15k tokens. 8-bit (Q8_0) KV cache supports around 28k. TurboQuant K4/V3 unlocks around 65k tokens in the same memory footprint.

TurboQuant applies aggressive quantisation specifically to the KV cache, trading precision per token for the ability to store more tokens in the same memory. The innovation is in the algorithm: it aims to compress heavily while preserving the signal that matters for downstream attention.

The naming convention is straightforward:

  • turbo4 K — 4-bit quantisation applied to the Key cache
  • turbo3 V — 3-bit quantisation applied to the Value cache (my runtime config)
  • turbo4 V — 4-bit quantisation applied to the Value cache (higher precision)

This is roughly a 5× compression ratio versus f16. It is why I can almost hold a 200-page novel in context on a mid-range consumer GPU.
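To see where the "roughly 5×" comes from, here is a back-of-envelope sizing sketch. The layer count (40) comes from the setup above; the KV head count, head dimension, and free-VRAM budget are hypothetical stand-ins, and real quantised formats carry small per-block scale overheads this ignores:

def kv_bytes_per_token(layers=40, kv_heads=8, head_dim=128, k_bits=16, v_bits=16):
    # Each layer stores one Key and one Value vector per token.
    return layers * kv_heads * head_dim * (k_bits + v_bits) / 8

budget = 4 * 1024**3   # pretend 4 GB of VRAM is free for the KV cache
for name, kb, vb in [("f16", 16, 16), ("Q8_0", 8, 8), ("K4/V3", 4, 3)]:
    per_tok = kv_bytes_per_token(k_bits=kb, v_bits=vb)
    print(f"{name:6s} {per_tok/1024:6.1f} KiB/token -> ~{budget/per_tok/1000:.0f}k tokens")

With these toy numbers, f16 to K4/V3 goes from 32 bits per element to 7, about 4.6×; the absolute token counts depend entirely on how much VRAM the weights leave free.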


The Test Design

Testing KV cache precision is harder than testing model weights. With weights, you can run a benchmark and get a number. With KV cache, failures can be subtle and dependent on:

  • What kind of information you are retrieving.
  • How far into the context it sits.
  • What cognitive operation you are asking the model to perform.

I designed a needle-in-a-haystack test suite. The structure is:

  1. Generate a long document (the haystack) with padding tokens.
  2. Embed one or more specific facts (the needles) at a controlled position.
  3. Ask the model a question that requires retrieving and using those facts.
  4. Score pass/fail based on whether the model gets the right answer.

The core variables:

  • Task type

    • Verbatim recall
    • Contradiction detection
    • Multi-hop reasoning
    • Arithmetic
  • Token gap: how many tokens of filler separate the needles from each other (or a single needle from the question at the end of the document)

  • KV config: turbo3 V vs turbo4 V

All tests were run at 60k context (-c 65536) with ngl=30 and no warmup. Sampling was deterministic (temperature 0) to ensure reproducibility, so any failure is attributable to the configuration under test rather than to sampling variance.
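The harness itself is simple. A minimal sketch, assuming a llama-cli-style binary from the TurboQuant fork (the binary name, filler text, tokens-per-sentence ratio, and needle content are illustrative; -c, -ngl, --temp, and -f are standard llama.cpp flags):

import subprocess, tempfile

FILLER = "The committee reviewed the minutes and moved on without comment. "

def build_haystack(needle_a, needle_b, gap_tokens, total_tokens=60_000):
    pad = lambda n: FILLER * max(n // 12, 1)        # crude: ~12 tokens per sentence
    lead = total_tokens - gap_tokens - 2_000        # padding before the first needle
    return "\n".join([pad(lead), needle_a, pad(gap_tokens), needle_b, pad(2_000)])

def run_case(document, question, expected):
    prompt = f"{document}\n\nQuestion: {question}\nAnswer:"
    with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as f:
        f.write(prompt)
        path = f.name
    out = subprocess.run(
        ["./llama-cli", "-c", "65536", "-ngl", "30", "--temp", "0", "-f", path],
        capture_output=True, text=True,
    )
    return "PASS" if expected in out.stdout else "FAIL"

print(run_case(
    build_haystack("The archive access code is 4419.",
                   "The code was last rotated in August.",
                   gap_tokens=56_000),
    "What is the archive access code?",
    expected="4419",
))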

Note

Temperature defines how 'creative' the model's next output token will be. 0 = only the most likely token will ever be next. As you increase temperature, the model becomes increasingly 'random' with its next token, eventually to the point of nonsense. More detail on temperature below:

Temperature Explained
Temperature Explained
After all layers have been processed, the model produces a probability score for every possible next token (of which there are typically around 50,000–100,000 to 'choose' from). 'Car' might get 40%, 'Bike' might get 30%, and 'Unicycle' might get 0.1%.
  • 0 = Only the most likely token will ever be next.

  • < 1.0 = Lower probability tokens are suppressed. The closer to 0, the more deterministic the output.

  • 1.0 = Sample the distribution as-is. (40% chance you get 'car')

  • > 1.0 = Distribution flattens, resulting in more random, 'creative' outputs.

  • 2.0 is an absolute upper bound for many models. Beyond 1.2, you may start getting gibberish or repeating content.

  • Lower temperatures are better for more factual, deterministic tasks (coding, summarising papers).

  • Higher temperatures are better for creative writing and brainstorming.

The reason you get coherent outputs, rather than 'Car' and 'Bike' being used interchangeably, is that the probabilities for the next token are calculated from the tokens that came before it. Crucially, they are not fixed for the whole response: every draw is recomputed from the context so far. Unless you set the temperature high and skew the probabilities towards bikes or unicycles, the model stays 'on-topic'.
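A minimal sketch of what temperature actually does to those probabilities, using toy logits for the three tokens above (the T=0 branch is the greedy special case):

import numpy as np

def next_token_probs(logits, temperature):
    logits = np.asarray(logits, dtype=float)
    if temperature == 0:                 # greedy: all probability on the argmax
        p = np.zeros_like(logits)
        p[np.argmax(logits)] = 1.0
        return p
    z = logits / temperature             # scale logits before the softmax
    z -= z.max()                         # numerical stability
    e = np.exp(z)
    return e / e.sum()

vocab = ["car", "bike", "unicycle"]
for t in (0, 0.5, 1.0, 1.5):
    probs = next_token_probs([2.0, 1.7, -4.0], t)
    print(t, dict(zip(vocab, probs.round(3))))

At low temperature 'car' dominates completely; as temperature rises the distribution flattens and 'unicycle' starts to pick up real probability mass.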

The Named Tests

Several tests come up repeatedly by name throughout this article. Here is what each one actually is.

Jean: The primary failure test. The haystack contains two facts about Jean: an anchored birthday/date fact in one part of the document, and a duration or age-based offset in another. The model has to combine both facts and calculate an exact calendar date. This requires two separate retrieval steps plus day-level calendar arithmetic. The token gap between the two facts is controlled by how much filler text sits between them.

Example: Jean

[body of text ~30,000 tokens of filler]

Born and raised in Edinburgh, Jean celebrated her 21st birthday on 5th January 1975...

[body of text ~30,000 tokens of filler]

After more than four decades of uninterrupted service, Jean retired exactly a week and two days after her 65th birthday...

[body of text remaining tokens]


Question: Based solely on the information provided in this document, on what exact date did Jean retire?
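For reference, here is the arithmetic the model has to perform, as a sketch (treating "a week and two days" as 9 days):

from datetime import date, timedelta

twenty_first = date(1975, 1, 5)                            # fact 1: 21st birthday
born = twenty_first.replace(year=twenty_first.year - 21)   # 5 January 1954
sixty_fifth = born.replace(year=born.year + 65)            # 5 January 2019
retired = sixty_fifth + timedelta(days=7 + 2)              # fact 2: a week and two days later
print(retired)                                             # 2019-01-14

Three derived dates, each depending on exact day-level values retrieved from opposite ends of the context. There is no slack anywhere in the chain.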

Hartwell: A structural clone of Jean, but the arithmetic is year-level rather than day-level. Where Jean asks "what date is 47 days after her birthday?", Hartwell asks "what year was he born if he turned 30 in [year]?". Used to test whether the distance alone caused the Jean failure, or whether the granularity of the arithmetic was the real variable.

Example: Hartwell

[body of text: ~30,000 tokens of filler]

The Hartwell Mill came into production in the summer of 1887. Edmund Hartwell had turned thirty-five that same year...

[body of text: ~30,000 tokens of filler]

The decision was made to shut the mill permanently in the year that fell exactly ninety-four years after the birth of the original founder...

[body of text: remaining tokens]


Question: Based solely on the information provided in this document, in what year was the Hartwell Mill permanently closed?
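The same chain at year granularity, which is why the answer space is so much more forgiving:

turned_35_in = 1887            # fact 1: Edmund turned thirty-five that year
born = turned_35_in - 35       # 1852
closed = born + 94             # fact 2: shut ninety-four years after his birth
print(closed)                  # 1946

Any retrieval noise smaller than a year still rounds to the right answer.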

Jenny's Shells: A more demanding accumulation test. Four separate facts about Jenny collecting and trading shells are scattered across the document at different positions. The model is asked a question that requires all four facts to produce a correct answer (how many shells Jenny has at the end of the story), so a single retrieval miss causes the whole answer to fail. At a 60k gap this produced answers that used roughly one of the four facts correctly; at shorter gaps it passed at all depths.

Example: Jenny's Shells

[body of text]

...by the time she turned back toward the town she had gathered 4 shells... She wrapped them in her scarf and carried them home, setting them on the kitchen windowsill when she got in.

[body of text]

Her neighbour Mrs. Patterson came round one evening... There were 2 in the bag... Jenny thanked her and added both of them to the windowsill.

[body of text]

Her younger sister visited on the Saturday... Jenny told her to take half of them, and meant it.

[body of text]

Inside were miscellaneous objects from what must have been a seaside holiday years before... three shells wrapped in tissue paper... She brought the shoebox down and added the three shells to the collection on the windowsill.

[body of text]


Question: Based solely on the information provided in this document, how many shells does Jenny currently have on her windowsill?
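The intended accumulation, as a sketch (reading the four facts in document order, and taking "half of them" to mean half the current windowsill count):

shells = 4      # fact 1: gathered on the beach
shells += 2     # fact 2: two more from Mrs. Patterson's bag
shells //= 2    # fact 3: her sister takes half, leaving 3
shells += 3     # fact 4: three more from the shoebox
print(shells)   # 6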

Counting ×7: A phrase counting test. A specific phrase is embedded exactly seven times at different points across the full document. The model is asked how many times that phrase appears. This tests a different cognitive operation than retrieval: the model has to scan the full context and accumulate a count, rather than locate a single value.

Example: Counting ×7

The phrase "the Meridian lighthouse" is woven into 7 separate passages distributed across the document, each in a different plausible context (a coastal survey, maintenance logs, harbour correspondence, a heritage assessment, a complaints register, an engineering report, and one more). Each occurrence is a natural sentence — nothing is repeated verbatim.

[body of text] ...the Meridian lighthouse was noted as a significant navigational reference point... [body of text] ...entries relating to the Meridian lighthouse, most concerning routine lamp replacement... [body of text] ...references the Meridian lighthouse on two separate occasions... [body of text] ...and so on, ×7 total


Question: Based solely on this document, how many times is "the Meridian lighthouse" mentioned? Count every occurrence.

Positional sweep: The Jean test run with the two facts placed only ~300 tokens apart, but at seven different absolute positions within the document (early, mid, and late). This isolates depth-in-context as a variable. All seven positions passed, which ruled out document position as a factor.

Example: Positional sweep

The two Jean facts are placed adjacent (~300 tokens apart) and the pair is moved to seven different absolute positions within the document — early, mid, and late. The surrounding filler fills out the rest of the ~60k context.

[body of text: variable length, positions the needle pair]

Jean celebrated her 21st birthday on 5th January 1975... [~300 tokens of filler] ...Jean retired exactly a week and two days after her 65th birthday...

[body of text: variable length, completes the context]


Question: Based solely on the information provided in this document, on what exact date did Jean retire? (Run 7 times with the fact pair placed at different absolute positions)

Distance sweep: The Jean test run with the token gap between the two facts varied from 1k up to 60k, using standard filler content. Tests whether the gap itself is the variable. All seven gap sizes passed, which confirmed the failure was specific to the combination of gap and arithmetic granularity, not gap alone.

Example: Distance sweep

Jean celebrated her 21st birthday on 5th January 1975...

[body of text: gap varied across runs: 1k / 10k / 20k / 30k / 40k / 50k / 60k tokens]

...Jean retired exactly a week and two days after her 65th birthday...

[body of text: remaining tokens to complete context]


Question: Based solely on the information provided in this document, on what exact date did Jean retire? (Run at each of 7 gap sizes using standard filler)

Origfiller sweep: Same as the distance sweep, but using a different filler document (the original haystack content from the early exploratory tests). Confirmed that the specific content of the padding text was not a factor.

Example: Origfiller sweep

Jean Moreau was born on 14 March 1987.

[body of text: original exploratory haystack content, gap varied: 30k / 40k / 50k / 60k tokens]

Jean's fellowship runs for exactly 47 days.

[body of text: remaining tokens]


Question: Jean's fellowship begins on her birthday. What date does it end? (Same Jean facts and question; filler swapped to the original exploratory haystack to rule out padding content as a variable)


Phase 1: What Fails at 60k

The first sweep tested eight task types at a fixed ~60k token gap under my runtime config (turbo3 V). Five runs each for arithmetic; three runs each for everything else.

Task type | Description | Result
Verbatim retrieval | Recall exact string + date from document | ✓ 3/3
Contradiction detection | Identify conflicting facts at opposite ends | ✓ 3/3
Keyword counting | Count phrase occurrences across full context | ✓ 3/3
3-hop chain reasoning | Follow A→B→C retrieval chain | ✓ 3/3
Instruction following | Apply early rule to late fact | ✓ 3/3
Code version override | Use revised version, ignore original | ✓ 3/3
Year-level arithmetic | 2-hop retrieval + year-granularity calculation | ✓ 5/5
Day-level calendar arithmetic | 2-hop retrieval + add specific days to date | ✗ 0/5

Seven out of eight task types: perfect at 60k. One task type: zero out of five.

The failure was specific: retrieve a person's birth date from one part of the document, retrieve a number of days from another part, calculate the resulting date. Every single run produced a wrong answer. Not a close miss: a confidently wrong answer, with no hedging.


Phase 2: Isolating the Cause

The interesting question was whether the failure was caused by context distance, task type, or both simultaneously.

I ran isolation tests:

  • Distance alone (year arithmetic at 60k): The Hartwell test — same 60k gap, but year-granularity arithmetic instead of day-granularity. Result: ✓ 5/5. Distance is not the variable by itself.
  • Day arithmetic alone (short gap): A positional sweep with the two facts separated by only ~300 tokens at seven different depths in the document. Result: ✓ 7/7. Day arithmetic is not the variable by itself.

The failure requires both conditions simultaneously:

 | Facts ~300 tokens apart | Facts ~60k apart
Year-level arithmetic | ✓ passes | ✓ passes
Day-level calendar arithmetic | ✓ passes | ✗ fails

This is the coastline problem in microcosm. If I had only tested one dimension at a time (or measured around fewer pebbles), I would have concluded the model was fine. The failure only surfaces when context compression interacts with a specific type of fine-grained arithmetic.

My theory: at a ~60k token gap, the accumulated quantisation error in a turbo3 V cache is sufficient to degrade the precision of the retrieved date values just enough that day-level calculations fail. Year-level arithmetic tolerates the degradation because the answer space is coarser (you can be less accurate and still be 'correct').
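That theory is hard to verify without the fork's internals, but its general shape is easy to illustrate. With plain round-to-nearest uniform quantisation (not TurboQuant's actual scheme), each extra bit roughly halves the reconstruction error:

import numpy as np

rng = np.random.default_rng(1)
values = rng.standard_normal(100_000)    # stand-in for Value-cache entries
scale = np.abs(values).max()

for bits in (3, 4, 8):
    levels = 2 ** (bits - 1) - 1                 # symmetric integer range
    q = np.round(values / scale * levels)        # quantise
    err = np.abs(values - q / levels * scale)    # dequantise and compare
    print(f"{bits}-bit: mean |error| = {err.mean():.4f}")

If the theory is right, whether that noise surfaces as a failure depends on the task's tolerance: year arithmetic can absorb it, day arithmetic cannot.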


Phase 3: Finding the Threshold

If there is a failure at 60k, there must be a point where it breaks. I ran a threshold sweep in 2k increments, five runs at each depth:

Token gap | Result
~60k | ✗ 0/5
~58k | ✗ 0/5
~56k | ✓ 5/5
~54k | ✓ 5/5
~52k | ✓ 5/5
~50k | ✓ 5/5
~40k | ✓ 5/5

The failure boundary sits between 56k and 58k tokens. It is binary: no partial failures, no degraded accuracy. At 56k and below: perfect. At 58k and above: zero.

The failure is a cliff, which suggests a threshold effect in how the turbo3 quantisation scheme handles cumulative precision loss at that depth, rather than a gradual noise floor rising across all contexts.


The Fix

The upgraded V-cache config (turbo4 instead of turbo3) adds one additional bit of precision per Value cache entry. I re-ran the entire threshold sweep under both configs:

Test | turbo3 V (my runtime) | turbo4 V (upgraded)
Jean 60k | ✗ 0/5 | ✓ 5/5
Jean 58k | ✗ 0/5 | ✓ 5/5
Jean 56k | ✓ 5/5 | ✓ 5/5
Jean 54k | ✓ 5/5 | ✓ 5/5
Jean 52k | ✓ 5/5 | ✓ 5/5
Jean 50k | ✓ 5/5 | ✓ 5/5
Jean 40k | ✓ 5/5 | ✓ 5/5
Hartwell 60k (year arithmetic) | ✓ 5/5 | ✓ 5/5
Year arithmetic 60k | ✓ 5/5 | ✓ 5/5

turbo4 V eliminates the failure across all tested depths. The 56k–58k cliff disappears entirely.


Does Any of This Matter?

Back to the coastline.

I found a real failure. It is 100% reproducible. It has a precise cause and a fix. Here is the 'final' picture:

Config | Chosen context | Observed failure threshold
turbo3 V (my runtime, cap 51k) | 51k | ~56k threshold → context is set below the observed failure point
turbo4 V (upgraded) | 65k | >60k → no failure at any tested gap
f16 KV (no compression) | ~12–15k on 16 GB VRAM (hits VRAM limit first) | N/A
Q8_0 KV | ~28k on 16 GB VRAM (hits VRAM limit first) | N/A

The failure threshold is 56–58k, so setting the context cap at 51k means I can never reach it. Alternatively, the single flag change -ctv turbo4 unlocks the full 65k context; with turbo4 V, the specific failure disappears in my tests.

The model is not broken, but there is a narrow edge case. And if there is one here, you can bet there are 50 more I have not found. Does this mean you should not use TurboQuant? No, because the very nature of what is going on here (quantisation, probabilistic sampling, layer upon layer of attention) means the only way to be 100% safe is to use no model at all.

It does underline the fact that we need to check model outputs, test before putting things into production, and remember the words of George Box:

"All models are wrong, some are useful."


Appendix: Full Test Log

For completeness, the exploratory tests run across all phases:

Test | Config | Gap | Result | Notes
Jean original | turbo3 V | ~60k | ✗ 0/8 | Initial discovery
Jenny's Shells (4-fact accumulation) | turbo4 V | ~60k | ✓ 5/5 | -
Jenny's Shells (4-fact accumulation) | turbo4 V | 7k–49k | ✓ 5/5 | All depths correct
Distance sweep (short filler) | turbo4 V | 1k–60k | ✓ 7/7 | Day arithmetic with short-gap filler
Origfiller sweep | turbo4 V | 30k–60k | ✓ 5/5 | Confirmed filler content not a variable
Positional sweep (7 depths) | turbo4 V | ~300 (facts adjacent) | ✓ 7/7 | Depth not a variable
Counting ×7 | turbo4 V | 20k | -
Counting ×7 | turbo4 V | 35k | -
Counting ×7 | turbo4 V | 49k | -

TurboQuant delivers 3–4× more usable context than alternatives on 16 GB VRAM. The failure I found is real, bounded, and avoidable. Measuring around the pebbles was worth doing — not because I found catastrophic failure, but because I can now tell you exactly where the pebbles are.
