Do LLM Inference Engines Actually Get Faster With Updates?

OpenClaw shovels snow

Three weeks of skiing in Utah, and the snow was excellent even if the temperatures weren’t — it was hitting the 50s most days, which is not what you want when you’re chasing powder. I came home to Virginia to find the opposite problem: snow on the ground, frost on the windows, and a house that had been sitting empty for three weeks. And OpenClaw doesn’t shovel snow for you while you are on vacation.

Time to fire up the basement space heater. The one made by NVIDIA.

The GPU has been idle since mid-January, which in this ecosystem is an eternity. SGLang, vLLM, and Ollama have all shipped updates. The question is: what’s actually improved in the past month, besides my skiing?

In January, we benchmarked vLLM, SGLang, and Ollama running Llama 3.1 8B on an NVIDIA RTX PRO 6000 Blackwell (96GB VRAM). A month later, all three projects have shipped updates. Do newer versions deliver measurable improvements, or is it all changelog theater? We re-ran the same benchmarks to find out.

Version Changes

Backend   January Test                        February Test
SGLang    lmsysorg/sglang:blackwell           lmsysorg/sglang:v0.5.8.post1-cu130
vLLM      vllm/vllm-openai:latest (v0.13.0)   vllm/vllm-openai:latest (v0.15.1)
Ollama    ollama/ollama:latest (v0.13.5)      ollama/ollama:latest (v0.16.1)

Same GPU, same model (GPTQ-INT4 for SGLang/vLLM, llama3.1:8b for Ollama), same benchmark script, same concurrency levels (1 through 128).
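
For context, here is a minimal sketch of the kind of sweep the script performs (not our actual benchmark code): fire a batch of concurrent completion requests at an OpenAI-compatible endpoint (all three engines expose one) and report aggregate tokens per second at each concurrency level. The base URL, model name, and prompt below are placeholder assumptions; point them at your own server.

```python
"""Sketch of a concurrency sweep against an OpenAI-compatible completion
endpoint (vLLM, SGLang, and Ollama all expose one). Not the exact benchmark
script; URL, model name, and prompt are placeholders."""
import time
from concurrent.futures import ThreadPoolExecutor

import requests

BASE_URL = "http://localhost:8000/v1"          # assumption: local vLLM-style server
MODEL = "meta-llama/Llama-3.1-8B-Instruct"     # assumption: name the server reports
PROMPT = "Explain how a GPU schedules warps."
MAX_TOKENS = 256

def one_request() -> int:
    """Send one completion request and return the generated token count."""
    resp = requests.post(
        f"{BASE_URL}/completions",
        json={"model": MODEL, "prompt": PROMPT, "max_tokens": MAX_TOKENS},
        timeout=300,
    )
    resp.raise_for_status()
    return resp.json()["usage"]["completion_tokens"]

def sweep(levels=(1, 2, 4, 8, 16, 32, 64, 128), requests_per_level=4):
    """Measure aggregate throughput at each concurrency level."""
    for concurrency in levels:
        total_requests = concurrency * requests_per_level
        start = time.perf_counter()
        with ThreadPoolExecutor(max_workers=concurrency) as pool:
            tokens = sum(pool.map(lambda _: one_request(), range(total_requests)))
        elapsed = time.perf_counter() - start
        print(f"concurrency={concurrency:>3}  {tokens / elapsed:>8.1f} tok/s")

if __name__ == "__main__":
    sweep()
```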

Results: Modest But Real Gains

Backend          Jan Peak       Feb Peak       Change
vLLM (NVFP4)     8,033 tok/s    8,430 tok/s    +4.9%
SGLang (GPTQ)    6,395 tok/s    6,460 tok/s    +1.0%
vLLM (GPTQ)      5,478 tok/s    5,637 tok/s    +2.9%
Ollama           484 tok/s      504 tok/s      +4.2%

Every backend improved. vLLM NVFP4 gained the most in absolute terms, pushing from 8,033 to 8,430 tokens/second. vLLM GPTQ showed a consistent 3-8% improvement across all concurrency levels. SGLang held steady with a marginal gain at peak. The ranking is unchanged: vLLM NVFP4 leads, SGLang beats vLLM on the same GPTQ model by ~15%, and Ollama trails far behind at roughly 1/13th of SGLang's throughput.

What We Learned

Incremental updates deliver incremental gains. This month's 1-5% improvements may not sound exciting, but they compound: a steady 3% monthly gain works out to over 40% in a year, without changing hardware.
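
The compounding arithmetic, as a quick sanity check:

```python
# Back-of-the-envelope: what a steady monthly throughput gain compounds to over a year.
for monthly in (0.01, 0.03, 0.05):
    annual = (1 + monthly) ** 12 - 1
    print(f"{monthly:.0%}/month -> {annual:.0%}/year")
# 1%/month -> 13%/year, 3%/month -> 43%/year, 5%/month -> 80%/year
```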

Configuration matters more than version bumps. The single biggest performance delta in our testing wasn’t a software update—it was fixing Ollama’s parallel request setting. Before chasing the latest release, audit your configuration.
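
For reference, the knob in question is Ollama's parallel request limit, which is controlled by the OLLAMA_NUM_PARALLEL environment variable. A sketch of setting it when launching the container follows; the value 32 is illustrative, not the exact figure from the January post.

```python
# Launch the Ollama container with an explicit parallel request limit.
# OLLAMA_NUM_PARALLEL is Ollama's documented setting for concurrent requests;
# the value 32 here is an example, not our exact January configuration.
import subprocess

subprocess.run(
    [
        "docker", "run", "-d", "--gpus=all",
        "-e", "OLLAMA_NUM_PARALLEL=32",
        "-p", "11434:11434",
        "-v", "ollama:/root/.ollama",
        "--name", "ollama",
        "ollama/ollama:latest",
    ],
    check=True,
)
```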

vLLM's CUDA 13.1 compatibility note: the latest-cu130 tag failed on our updated NVIDIA driver 590.48.01 (CUDA 13.1) with CUDA error 803. Switching to the plain latest tag resolved it. Worth checking if you're on recent Blackwell drivers; the cu130 image may simply not have been published yet for the newest vLLM release, in which case this should resolve itself shortly.
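
If you hit the same error, a quick way to check whether a container's bundled CUDA runtime can initialize against the host driver, independent of vLLM itself, is a snippet like this run inside the container (error 803 indicates a driver/runtime mismatch):

```python
# Minimal driver/runtime sanity check to run inside the container before
# starting the server. On a mismatch (CUDA error 803), is_available() typically
# returns False and device_count() reports zero GPUs.
import torch

print("CUDA runtime bundled with torch:", torch.version.cuda)
print("CUDA available:", torch.cuda.is_available())
print("Visible GPUs:", torch.cuda.device_count())
```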

Our recommendation from January stands: vLLM with NVFP4 for production throughput, SGLang for broad GPTQ model support, Ollama for development. We’ll retest when SGLang ships CUDA graph support for Blackwell—that could shake up the ranking.


Benchmarks conducted February 2026. Same hardware and methodology as our January comparison.