Ollama vs llama.cpp: why the same model felt twice as fast

TL;DR
  • Ollama is built on top of llama.cpp, so the inference engine is identical — the speed gap comes from the binary, the defaults and version drift, not the model.
  • On a CPU-only laptop a freshly compiled binary (AVX2/FMA), flash attention, KV-cache quantisation and the right thread count can change the experience substantially.

I've been running local LLMs on my laptop for about a year. Nothing fancy — a Huawei machine with an Intel i5, 16GB of RAM, and Ubuntu. For most of that year, my tool of choice was Ollama. It's friendly, it just works, and the speed was "good enough" once I had a GPU around to help.

Then the other day I tried llama.cpp directly, with Gemma4, the new Google model released in April. Same machine, same model. It was noticeably faster. I know the first reaction you might have: it's a new model, so of course it feels fast. But that wasn't the explanation, and the gap held regardless of whether I disabled thinking mode (which on llama.cpp you can do by passing --reasoning off at launch time).

That sent me down a rabbit hole. Here's what I learned.

The first surprise: they're the same engine

Ollama isn't a competitor to llama.cpp. Ollama is built on top of llama.cpp. Same inference code underneath. Ollama wraps it in a friendly CLI, an HTTP server, a model registry, and a bunch of sensible defaults.

So if the underlying engine is the same, why the speed difference? It comes down to three things: the binary itself, the defaults, and a bit of version drift.

The binary matters more than I thought

When I ran brew install llama.cpp, Linuxbrew compiled it from source on my machine. (Homebrew prefers a prebuilt "bottle" when one is available, so this isn't guaranteed; brew install --build-from-source llama.cpp forces a local build.) A local build can look at your specific CPU and turn on every instruction set it supports: AVX2, FMA, possibly more depending on the generation. It also typically links against a fast maths library like OpenBLAS, which makes a real difference for matrix multiplication (which is most of what an LLM does).

Ollama, by contrast, ships a prebuilt binary. To work on as many machines as possible, a prebuilt binary is compiled for a conservative baseline, sometimes with a handful of generic CPU variants selected at runtime. Either way it isn't tuned to my exact chip, so my CPU may have features that binary can't use because they weren't enabled at compile time.

That alone can account for a meaningful chunk of the speed gap.

A short detour: what AVX2 and FMA actually are

I kept seeing those acronyms and decided to actually understand them.

A CPU understands a fixed vocabulary of instructions: "add these two numbers," "load this from memory," and so on. That vocabulary is called an instruction set. Over the years, Intel and AMD have added new instructions to make certain tasks faster — but software has to be compiled to use them. If your CPU supports them and the program wasn't built to use them, they sit idle.

Most of the modern speed-ups belong to a family called SIMD — Single Instruction, Multiple Data. Normally a CPU does maths one number at a time. SIMD lets it process many at once in a single instruction. Which is exactly what a neural network needs: it's all "multiply this list of numbers by that list of numbers," over and over.

The names worth knowing:

  • AVX2 (2013): processes 256 bits at a time, which is 8 single-precision (32-bit) floating-point numbers per instruction. Standard on essentially any i5 from the last decade.
  • FMA (Fused Multiply-Add): does (a × b) + c in a single step instead of a separate multiply and add. Neural networks are that operation repeated billions of times, so FMA can roughly double peak throughput on the core maths.
  • AVX-512: the bigger sibling — 512 bits, 16 floats at a time. Hit-or-miss on consumer laptops because Intel removed it from many chips for a stretch.
  • AMX: even newer, on recent high-end Intel chips, designed specifically for AI matrix maths.

On ARM hardware (Apple Silicon, Raspberry Pi) the equivalents are called NEON and SVE.
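
To make the lane counts and the FMA claim concrete, here is a rough back-of-the-envelope sketch in Python. It is a simplified counting model rather than a benchmark: it assumes 32-bit floats, ignores loads, stores and the final horizontal sum, and the function names are made up for illustration.

```python
import math


def lanes(register_bits: int, element_bits: int = 32) -> int:
    """How many elements of a given width fit in one SIMD register."""
    return register_bits // element_bits


def dot_product_instructions(n: int, register_bits: int, fused: bool) -> int:
    """Count arithmetic instructions for a length-n dot product.

    Simplified model: one multiply plus one add per vector of lanes,
    or a single fused multiply-add when FMA is available.
    """
    vectors = math.ceil(n / lanes(register_bits))
    per_vector = 1 if fused else 2
    return vectors * per_vector


if __name__ == "__main__":
    n = 4096  # illustrative vector length, not taken from any real model
    print("scalar, no FMA :", dot_product_instructions(n, 32, fused=False))
    print("AVX2, no FMA   :", dot_product_instructions(n, 256, fused=False))
    print("AVX2 + FMA     :", dot_product_instructions(n, 256, fused=True))
    print("AVX-512 + FMA  :", dot_product_instructions(n, 512, fused=True))
```

Real speed-ups are smaller than these counts suggest, because memory bandwidth and instruction scheduling also play a part, but the ratios show why a build without these instruction sets leaves so much on the table.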

How to check what your CPU has

On Linux, one line:

grep -m1 '^flags' /proc/cpuinfo

You'll get a long string of abbreviations. Look for avx2, fma, avx512f, etc. Those are the features your chip supports — and the ones a freshly compiled binary will gladly use, while a generic prebuilt one might leave on the table.
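
If scanning that string by eye gets tedious, a small script can do the lookup. This is a minimal sketch that assumes an x86 Linux machine, where /proc/cpuinfo has a "flags" line (ARM kernels label it "Features" instead); the helper names are my own.

```python
def parse_cpu_flags(cpuinfo_text: str) -> set:
    """Return the feature flags from the first 'flags' line of /proc/cpuinfo text."""
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "flags":
            return set(value.split())
    return set()


def simd_report(flags: set, wanted=("avx2", "fma", "avx512f")) -> dict:
    """Map each flag of interest to whether the CPU advertises it."""
    return {name: name in flags for name in wanted}


if __name__ == "__main__":
    try:
        with open("/proc/cpuinfo") as f:
            text = f.read()
    except OSError:
        text = ""  # not Linux, or /proc is unavailable
    for name, present in simd_report(parse_cpu_flags(text)).items():
        print(f"{name:8s} {'yes' if present else 'no'}")
```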

Defaults that quietly cost you performance

A few knobs that ship "off" or "conservative" in Ollama:

Flash attention. A newer, faster algorithm for the attention maths. Same output, less time and memory. Ollama defaults to off; you turn it on with OLLAMA_FLASH_ATTENTION=1.

KV cache quantisation. As a model generates, it keeps "notes" about everything it's said so far so it doesn't redo work. Those notes are stored in 16-bit precision by default. You can compress them to 8-bit (OLLAMA_KV_CACHE_TYPE=q8_0) for roughly half the memory and barely any quality loss. Ollama only applies this setting when flash attention is also enabled, so set both. Smaller notes = faster reads = faster generation.
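
The "roughly half" figure is easy to check with some arithmetic. The sketch below assumes a plain transformer where every layer caches keys and values for the full context; models that use sliding-window or other cache-saving attention will come out smaller. The model shape is hypothetical, picked for round numbers, and is not Gemma's real configuration.

```python
# Bytes per stored element. f16 is 2 bytes; q8_0 packs 32 int8 values
# plus one 16-bit scale into 34 bytes, i.e. 34/32 bytes per element.
BYTES_PER_ELEMENT = {"f16": 2.0, "q8_0": 34 / 32}


def kv_cache_bytes(n_layers, n_kv_heads, head_dim, n_ctx, cache_type="f16"):
    """Approximate KV cache size: keys and values for every layer and position."""
    elements = 2 * n_layers * n_kv_heads * head_dim * n_ctx  # 2 = K and V
    return int(elements * BYTES_PER_ELEMENT[cache_type])


if __name__ == "__main__":
    # Hypothetical model shape (assumption, for illustration only).
    shape = dict(n_layers=32, n_kv_heads=8, head_dim=128, n_ctx=8192)
    for cache_type in ("f16", "q8_0"):
        gib = kv_cache_bytes(**shape, cache_type=cache_type) / 2**30
        print(f"{cache_type:5s} {gib:.2f} GiB")
```

With this made-up shape the cache drops from 1 GiB to a little over half that, and the size scales linearly with the context length.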

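Threads. llama.cpp from brew defaults to using your physical cores. Ollama is more cautious. On a 4-core i5, going from 2 threads to 4 can approach 2× on the compute-heavy parts such as prompt processing; token generation is often limited by memory bandwidth, so the gain there is usually smaller.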
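
Physical cores and logical CPUs are not the same number on a hyperthreaded chip, so it helps to know which one you have. Here is a small sketch that counts physical cores from /proc/cpuinfo, assuming an x86 Linux layout with "physical id" and "core id" fields (other platforms may not expose them); the function name is my own. The result is the kind of value you would pass to llama.cpp's -t / --threads option.

```python
import os


def count_physical_cores(cpuinfo_text: str) -> int:
    """Count unique (physical id, core id) pairs in /proc/cpuinfo text."""
    cores = set()
    physical_id = core_id = None
    # The trailing "" makes sure the last logical-CPU block is flushed.
    for line in cpuinfo_text.splitlines() + [""]:
        if not line.strip():
            # A blank line ends one logical-CPU block.
            if core_id is not None:
                cores.add((physical_id, core_id))
            physical_id = core_id = None
            continue
        key, _, value = line.partition(":")
        key = key.strip()
        if key == "physical id":
            physical_id = value.strip()
        elif key == "core id":
            core_id = value.strip()
    return len(cores)


if __name__ == "__main__":
    try:
        with open("/proc/cpuinfo") as f:
            physical = count_physical_cores(f.read())
    except OSError:
        physical = 0  # not Linux, or /proc is unavailable
    print("logical CPUs  :", os.cpu_count())
    print("physical cores:", physical or "unknown")
```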

Context size. Recent Ollama versions scale context with available memory, which is great for capability but can quietly inflate the KV cache and push a tight machine towards swap. On 16GB of RAM with a desktop environment running, that matters.
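
If you would rather pin the context size than let it scale, Ollama's HTTP API accepts per-request options. The sketch below builds (but does not send) a request to the default local endpoint with num_ctx and, optionally, num_thread set. The model name and prompt in the usage comment are placeholders, and the helper name is my own.

```python
import json
import urllib.request
from typing import Optional

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default local address


def build_generate_request(model: str, prompt: str, num_ctx: int,
                           num_thread: Optional[int] = None) -> urllib.request.Request:
    """Build a non-streaming /api/generate request with an explicit context size."""
    options = {"num_ctx": num_ctx}
    if num_thread is not None:
        options["num_thread"] = num_thread
    body = {"model": model, "prompt": prompt, "stream": False, "options": options}
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Usage (needs a running Ollama server; "my-model" is a placeholder name):
#   req = build_generate_request("my-model", "Hello", num_ctx=4096, num_thread=4)
#   with urllib.request.urlopen(req) as resp:
#       result = json.load(resp)
```

On a 16GB machine, starting with a small num_ctx and raising it only when a task needs it keeps the KV cache from crowding out everything else.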

Version drift

llama.cpp is updated almost daily. There's a real community race to make it faster, especially for new model architectures. Ollama bundles a snapshot from some weeks ago. So when I built llama.cpp fresh, I wasn't just running the same code with different settings — I was running newer code with optimisations that hadn't reached Ollama yet.

For Gemma specifically, the model I tested right away, this gap was real. New model support tends to land in llama.cpp first and get tuned over the following weeks.

Why CPU-only made the gap obvious

On a GPU, the hardware is so fast that wrapper overhead and a few missed optimisations get hidden. On a CPU, every inefficiency shows up. The maths is the bottleneck, and any factor that slows the maths — fewer threads, an older binary, a bigger KV cache eating cache lines — shows up directly in tokens per second.

That's why my CPU-only test felt so dramatic. It wasn't that llama.cpp was magic. It was that on CPU, the conditions around the engine matter more than the engine itself.
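
"Felt faster" is worth turning into a number. A non-streaming Ollama response reports token counts and durations in nanoseconds, so tokens per second is a single division. This is a minimal sketch with helper names of my own, and the sample values in the usage comment are invented. On the llama.cpp side, the bundled llama-bench tool reports comparable figures.

```python
def tokens_per_second(count: int, duration_ns: int) -> float:
    """Convert a token count and a duration in nanoseconds to tokens/second."""
    return count / (duration_ns / 1e9) if duration_ns else 0.0


def summarize(response: dict) -> dict:
    """Pull prompt and generation speed out of an Ollama /api/generate response."""
    return {
        "prompt_tps": tokens_per_second(
            response.get("prompt_eval_count", 0),
            response.get("prompt_eval_duration", 0),
        ),
        "generation_tps": tokens_per_second(
            response.get("eval_count", 0),
            response.get("eval_duration", 0),
        ),
    }


# Usage with invented numbers (100 tokens generated in 5 seconds):
#   summarize({"eval_count": 100, "eval_duration": 5_000_000_000})
```

Comparing the two speeds separately is useful, since prompt processing and token generation stress the CPU differently.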

The takeaway

Same model file. Same machine. Genuinely different machinery around it.

If you're using Ollama for convenience, that's a fine choice — it's a lovely tool and the defaults are getting better with each release. But if you've been wondering why your local LLM "feels slow," it's worth knowing where the slack is. A few environment variables and a freshly compiled binary can change the experience substantially.

What to try, in order of impact

  1. Build llama.cpp from source (or brew install on Linux/macOS) and compare against Ollama on the same .gguf file.
  2. In Ollama, set OLLAMA_FLASH_ATTENTION=1 and OLLAMA_KV_CACHE_TYPE=q8_0 before starting the server.
  3. Check your context size. If you don't need 32K, don't allocate it — especially on tight RAM.
  4. Watch your thread count. On CPU, more (up to physical cores) is better.
  5. Update Ollama regularly. The bundled llama.cpp moves forward, just slower than upstream.
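
For step 2, the variables have to be in the environment of the server process, not just of the client. Here is one way to do that from Python, as a sketch: it assumes the ollama binary is on your PATH and that you start the server yourself rather than through a system service. The helper names are my own.

```python
import os
import subprocess


def tuned_ollama_env(base_env=None, flash_attention=True, kv_cache_type="q8_0"):
    """Return a copy of the environment with the Ollama tuning variables set."""
    env = dict(os.environ if base_env is None else base_env)
    if flash_attention:
        env["OLLAMA_FLASH_ATTENTION"] = "1"
        # KV cache quantisation only takes effect with flash attention on,
        # so the cache type is set only in that case.
        env["OLLAMA_KV_CACHE_TYPE"] = kv_cache_type
    return env


def start_server(env):
    """Launch `ollama serve` with the given environment (not called here)."""
    return subprocess.Popen(["ollama", "serve"], env=env)


# Usage:
#   server = start_server(tuned_ollama_env())
#   ...run your comparison...
#   server.terminate()
```

If Ollama runs as a background service on your machine, the same two variables need to go into that service's configuration instead.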

The most useful thing I got out of this whole exercise wasn't the speed boost. It was finally understanding what was actually under the hood.