The Bottleneck Was Never the Model
There is a default assumption in how most people evaluate AI tools: bigger is better. More parameters, more capability. A 32-billion-parameter model must outperform a 14-billion-parameter model. The arithmetic feels obvious.
Running a local LLM optimization session on a dual-GPU workstation produced a result that contradicted this assumption entirely.
The Setup
Two local models running via Ollama: Qwen 2.5 in 14B and 32B configurations. The 32B model had more than double the parameters. It also had a problem: the primary GPU had 12 GB of VRAM, and the quantized 32B model weighs 19 GB. The math does not work.
Ollama's response to this mismatch is partial CPU offload. It loads as much of the model as VRAM allows and routes the remainder to system RAM, processing it through the CPU. On this workstation, that meant 68% of the 32B model's inference ran through CPU rather than GPU. CPU inference is 10 to 50 times slower than GPU inference for this class of workload.
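As a rough sketch of where that split comes from: the share of weights forced off the GPU follows from the model's footprint and the VRAM actually available to it. The figures below are illustrative assumptions, not measurements; Ollama's real placement also reserves VRAM for the KV cache and runtime overhead, which is why the reported 68% CPU share runs higher than weights-only arithmetic suggests.

```python
# Rough estimate of how much of a model spills into system RAM.
# Illustrative only: Ollama also reserves VRAM for the KV cache and
# runtime overhead, so the reported CPU share (68% here) runs higher
# than this weights-only arithmetic.

def offload_fraction(model_size_gb: float, usable_vram_gb: float) -> float:
    """Fraction of model weights that cannot fit in VRAM."""
    spill = max(0.0, model_size_gb - usable_vram_gb)
    return spill / model_size_gb

# Assumed figures: ~19 GB quantized 32B, ~9 GB quantized 14B,
# ~10 GB of the 12 GB card usable after desktop and CUDA overhead.
print(f"32B spill: {offload_fraction(19.0, 10.0):.0%}")  # ~47% of weights off-GPU
print(f"14B spill: {offload_fraction(9.0, 10.0):.0%}")   # 0%, fully resident
```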
Generation ran at roughly 5 to 15 tokens per second. Usable, but slow enough to break the rhythm of interactive use.
The Diagnosis
The symptom was slow generation. The tempting interpretation was that a faster GPU or a more capable model was needed. The actual root cause was VRAM headroom: the model chosen did not fit the hardware available.
Diagnosing this required looking at the right metric. Not just GPU utilization (which read 99% and appeared to indicate the GPU was working hard), but the CPU/GPU inference split reported by ollama ps. That single number told the real story: 68% of inference work was happening on the slower processor, not the faster one.
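The same number is also available programmatically if command output is awkward to eyeball. The sketch below assumes Ollama's local REST API at its default port and the /api/ps endpoint, which in recent versions reports each loaded model's total size and the portion resident in VRAM; treat the field names as an assumption to verify against your Ollama version.

```python
# Flag any loaded model that is only partially resident in VRAM.
# Assumes Ollama's default local API and that /api/ps reports "size"
# and "size_vram" per model (verify against your Ollama version).
import json
import urllib.request

with urllib.request.urlopen("http://localhost:11434/api/ps") as resp:
    running = json.load(resp)

for model in running.get("models", []):
    total = model["size"]
    in_vram = model.get("size_vram", 0)
    gpu_share = in_vram / total if total else 0.0
    status = "fully on GPU" if gpu_share > 0.99 else "PARTIAL CPU OFFLOAD"
    print(f"{model['name']}: {gpu_share:.0%} of weights in VRAM ({status})")
```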
The 14B model fit within 12 GB VRAM with capacity to spare. Running it with full GPU offload, Flash Attention enabled, and appropriate context window sizing produced 46.8 tokens per second. That is a 4.7x improvement over the larger model, freeing 7 GB of VRAM and dropping GPU temperature from 55°C to 48°C.
The smaller model was faster, cooler, and left enough VRAM headroom that the GPU was no longer saturated by the inference workload.
Qwen 2.5 32B: 5–15 tok/s · 55°C · 3,598 MiB VRAM free (GPU saturated)
Qwen 2.5 14B: 46.8 tok/s · 48°C · 10,676 MiB VRAM free (headroom restored)
(VRAM/inference split measured on the RTX 4070 SUPER, 12 GB.)
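For reference, here is a minimal client-side sketch of that configuration and measurement. It assumes the standard Ollama generate endpoint; the context window is passed per request via num_ctx, while Flash Attention is a server-side setting (the OLLAMA_FLASH_ATTENTION environment variable in current releases) and appears here only as a comment. The throughput figure comes from the eval_count and eval_duration fields the API returns.

```python
# Measure generation throughput for the 14B model with an explicitly
# sized context window. Flash Attention is enabled on the server
# (OLLAMA_FLASH_ATTENTION=1 before starting Ollama), not per request.
import json
import urllib.request

payload = {
    "model": "qwen2.5:14b",
    "prompt": "Summarize the tradeoff between model size and VRAM fit.",
    "stream": False,
    "options": {"num_ctx": 8192},  # illustrative context size; match your workload
}

req = urllib.request.Request(
    "http://localhost:11434/api/generate",
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    result = json.load(resp)

# eval_duration is reported in nanoseconds.
tok_per_s = result["eval_count"] / (result["eval_duration"] / 1e9)
print(f"{tok_per_s:.1f} tok/s")
```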
Why This Happens
The constraint is not intelligence. It is memory bandwidth.
A model that fits entirely on GPU operates on data moving through the GPU's memory bus, which in a modern card runs at hundreds of gigabytes per second. A model that spills into CPU RAM operates on data moving through the system memory bus, which runs at a fraction of that speed. When a large portion of the model's weights live in system RAM, every forward pass requires moving data across that slower bus, and the generation speed degrades accordingly.
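A back-of-the-envelope roofline makes the gap concrete. For a memory-bound decode step, every generated token requires streaming roughly all of the active weights once, so throughput is bounded by bandwidth divided by the bytes read per token. The bandwidth and size figures below are typical round numbers, not measurements from this workstation.

```python
# Ceiling on decode speed when generation is memory-bound: each token
# requires streaming roughly all resident weights once, so
# tok/s <= bandwidth / bytes_per_token. All figures are illustrative.

def ceiling_tok_per_s(weights_gb: float, bandwidth_gb_s: float) -> float:
    return bandwidth_gb_s / weights_gb

WEIGHTS_GB = 19.0   # assumed quantized 32B footprint
GPU_BW = 500.0      # typical GDDR6X bandwidth, GB/s
RAM_BW = 60.0       # typical dual-channel DDR5 bandwidth, GB/s

print(f"All weights on GPU: ~{ceiling_tok_per_s(WEIGHTS_GB, GPU_BW):.0f} tok/s ceiling")
print(f"All weights in RAM: ~{ceiling_tok_per_s(WEIGHTS_GB, RAM_BW):.0f} tok/s ceiling")

# With a split, the slow portion dominates: the forward pass cannot finish
# until the RAM-resident weights have been streamed over the system bus.
CPU_SHARE = 0.68  # share reported by ollama ps, treated here as a weight share
print(f"68% of weights in RAM: ~{ceiling_tok_per_s(WEIGHTS_GB * CPU_SHARE, RAM_BW):.0f} tok/s ceiling")
```

Those round numbers land in the same neighborhood as the observed 5 to 15 tokens per second, which is the point: the slower bus sets the ceiling.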
The GPU utilization reading of 99% was not evidence of peak performance. It was evidence of the GPU waiting on data from a slower source. High utilization without high throughput is a bottleneck signal, not a capability signal.
The Broader Pattern
This pattern generalizes beyond local LLM infrastructure. When evaluating any AI tool, the specification of the model is only part of the picture. The other part is whether the runtime environment can actually deliver what the specification promises.
A 32B model running 68% on CPU is not a 32B model in practice. It is a 32B model bottlenecked by a slower I/O path between processor and memory. The label on the product does not describe the experience in production.
The same logic applies to cloud AI tools: a model with a 1-million-token context window is only as useful as the latency its infrastructure can deliver at that context length. A RAG pipeline with 95% retrieval precision at benchmark scale may perform differently under the query distribution of actual users. The specification and the deployment are two separate things, and the gap between them is often where performance surprises live.
Implications for Analytical Tooling
For analysts building local AI workflows, the model sizing decision is not purely a capability question. It is a fit question. A smaller model that runs fast, responds immediately, and stays resident in VRAM between requests is a more useful daily tool than a larger model that is theoretically more capable but operationally sluggish.
The right question when selecting a local model is not which model has the highest benchmark score. It is which model fits the available hardware, and what failing to fit costs in latency and responsiveness.
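One way to operationalize that question is a simple fit rule: among candidate models, take the largest whose quantized weights fit in VRAM with headroom left for the KV cache and runtime overhead. The model sizes and headroom figure below are illustrative assumptions.

```python
# Sketch of the fit question as a selection rule: largest candidate whose
# quantized weights fit in VRAM with headroom reserved for the KV cache
# and runtime overhead. Sizes and headroom are illustrative assumptions.

CANDIDATE_SIZES_GB = {
    "qwen2.5:32b": 19.0,
    "qwen2.5:14b": 9.0,
    "qwen2.5:7b": 4.7,
}

def pick_model(vram_gb: float, headroom_gb: float = 2.0) -> str | None:
    """Largest candidate that fits within VRAM minus reserved headroom."""
    budget = vram_gb - headroom_gb
    fitting = {name: size for name, size in CANDIDATE_SIZES_GB.items() if size <= budget}
    return max(fitting, key=fitting.get) if fitting else None

print(pick_model(12.0))  # -> qwen2.5:14b on a 12 GB card
```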
Getting that question right produced a 4.7x performance improvement without purchasing new hardware, changing the underlying model architecture, or upgrading cloud services. The bottleneck was never the model's intelligence. It was a configuration decision that could be measured, diagnosed, and corrected in an afternoon.
That is a useful reminder when any AI tool underperforms expectations: look at the constraint before assuming the model is the limit.