200 Billion Parameters for $1,000: Running 4-Bit Quants on eBay Hardware


TL;DR: You can run frontier-size mixture-of-experts (MoE) models, such as Qwen3.5-122B, MiniMax-M2.7 (~229B), and DeepSeek-V4-Flash (284B), on a modest GPU backed by standard system RAM using llama.cpp’s --cpu-moe flag. This works because these models activate only a fraction of their parameters per token (around 10–13B), so the massive “expert” weights can live in cheap system RAM while the GPU handles attention and the KV cache.

While a $12.5K RTX PRO 6000 setup runs Qwen3.5-122B at a blistering 128 tokens per second (tok/s) for a single user—and an aggregate 780 tok/s for a concurrency of 8 users—a $1,000 eBay-scavenged Xeon workstation with 2016-era Pascal cards runs the same model at ~10 tok/s. It’s slower, yes, but still faster than most people read—and entirely usable for a single person. You sacrifice raw concurrency, not capability.


The prevailing advice for running massive MoE models locally usually demands a massive budget: drop $25,000 on two RTX PRO 6000 Blackwells, buy two DGX Sparks (which run about $4,699 each), or pick up two Strix Halos with 128GB of unified memory (priced around $3,999 each). The logic assumes that because 4-bit model weights take up 70–160GB, you need a matching mountain of ultra-fast VRAM.

You don’t.

Over the past two weeks, I’ve been running Qwen3.5-122B-A10B, MiniMax-M2.7 (~229B), and DeepSeek-V4-Flash (284B) on setups that look nothing like a datacenter. One is a machine with a single RTX 5090, and the other is a ~$1,000 workstation built from eBay parts featuring a pair of $250 Quadro P5000s from 2016. The secret lies in a llama.cpp feature called --cpu-moe, which exploits a massive architectural asymmetry that standard hardware advice ignores.

The Fact That Changes the Math

Running a dense 122B model on a CPU would be agonizing. Every generated token requires touching all 122 billion parameters. Because CPU memory bandwidth (tens to low hundreds of GB/s on DDR4/DDR5) is roughly an order of magnitude lower than a GPU’s, you could wait a second or more for every single token.

But new MoE models aren’t dense. Consider the active parameters:

  • Qwen3.5-122B-A10B: 122B total, 10B active per token.
  • MiniMax-M2.7: 229B total, ~10B active per token.
  • DeepSeek-V4-Flash: 284B total, 13B active per token.

This asymmetry changes everything. The bulk of the weights—tens of gigabytes of “experts”—sit idle for any given token. Using the --cpu-moe (or -cmoe) flag stores these mostly dormant experts in affordable system RAM. Only the components that touch every token—the attention layers, dense projections, and KV cache—remain on the GPU.

The result is a clean division of labor: the modest GPU handles the compute-intensive attention math, while cheap system RAM holds the massive pile of sleeping experts.
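To make the asymmetry concrete, here is a rough back-of-envelope sketch of the decode ceiling that memory bandwidth alone would impose. It uses figures quoted elsewhere in this post (about 73GB of quantized weights for Qwen3.5-122B-A10B and about 150 GB/s for the dual Xeon) and assumes, as a simplification, that every active parameter is streamed from system RAM once per token. The function name is hypothetical, and real throughput will land below this ceiling.

```python
# Back-of-envelope decode ceiling: each generated token must read the
# active weights once, so tok/s <= bandwidth / bytes_read_per_token.
# Simplifying assumption: all active parameters stream from system RAM.

def decode_ceiling_tok_s(total_params_b, active_params_b, quant_size_gb, bandwidth_gb_s):
    """Upper bound on decode speed if memory bandwidth were the only limit."""
    gb_per_billion_params = quant_size_gb / total_params_b
    gb_read_per_token = active_params_b * gb_per_billion_params
    return bandwidth_gb_s / gb_read_per_token

# Figures taken from this post: 122B total, 10B active, ~73 GB quantized,
# ~150 GB/s of host memory bandwidth on the dual Xeon.
dense_ceiling = decode_ceiling_tok_s(122, 122, 73, 150)  # as if every weight were read
moe_ceiling = decode_ceiling_tok_s(122, 10, 73, 150)     # only ~10B active per token

print(f"dense ceiling: {dense_ceiling:.1f} tok/s")
print(f"MoE ceiling:   {moe_ceiling:.1f} tok/s")
```

Under these assumptions the dense case tops out near 2 tok/s while the MoE case has a ceiling around 25 tok/s, which is why reading only the active experts turns an unusable setup into a usable one.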

The Two Budget Boxes

  • The Ryzen Box: A more conventional build pairing an RTX 5090 (32GB, Blackwell) with an AMD Ryzen 9 9950X (16 cores / 32 threads) and 192GB of DDR5 RAM. The total build costs about $5,000, which includes the ~$3,200 GPU.
  • The Xeon Box: Built purely from older eBay parts, featuring dual Intel Xeon E5-2698 v4 CPUs (2016 Broadwell, 40 physical cores total), 128GB of DDR4 RAM, and two Quadro P5000s (16GB VRAM each). The GPUs were $250 each, the CPUs were $100 for the pair, and the chassis/board/RAM cost $425. Total cost: ~$1,000 for 32GB of aggregate VRAM and enough system memory to host a 108GB model.

(Note: These prices are roughly a year old, sourced before recent DRAM spikes. Replicating the $1,000 box today might cost closer to $1,400–$1,800 due to memory pricing.)

The Performance Numbers

Here is the single-stream decode throughput for Qwen3.5-122B-A10B, comparing our budget boxes against a “money-no-object” $12.5K RTX PRO 6000 reference machine:

| Machine | GPU(s) | Total Box Cost | Decode Speed |
|---|---|---|---|
| RTX PRO 6000 | 1× Blackwell 96GB | ~$12.5K | 128 tok/s |
| Ryzen Box | 1× RTX 5090 32GB | ~$5,000 | 22.8 tok/s |
| Xeon Box | 2× Pascal P5000 16GB | ~$1,000 | ~10 tok/s |

The $12.5K setup (a $10K card and a $2.5K box) is instantly responsive because it holds all 73GB of quantized weights in VRAM. But a $1,000 box with nine-year-old GPUs running a 122-billion-parameter model at 10 tokens per second is highly usable for interactive chat or coding assistance.
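One crude way to read these numbers is dollars per token-per-second of single-stream decode. The sketch below only divides the approximate costs and speeds quoted above; the metric and the helper name are my own illustration, and it deliberately ignores concurrency, prefill time, and power.

```python
# Single-stream cost efficiency, using the approximate figures quoted above.
machines = [
    ("RTX PRO 6000", 12_500, 128.0),
    ("Ryzen Box", 5_000, 22.8),
    ("Xeon Box", 1_000, 10.0),
]

def dollars_per_tok_s(cost_usd, tok_s):
    """Hardware cost divided by single-user decode speed."""
    return cost_usd / tok_s

for name, cost, speed in machines:
    print(f"{name:<13} ${dollars_per_tok_s(cost, speed):,.0f} per tok/s")
```

By this measure the Xeon box and the RTX PRO 6000 land within a few dollars of each other for a single user; the gap opens up once concurrency and prefill enter the picture.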

The Ryzen box is roughly twice as fast as the Xeon for two reasons—and the first one is easy to get backwards. With --cpu-moe, the experts that dominate decode are streamed from system RAM, not from the GPU. The Ryzen’s DDR5 is considerably faster than the Xeon’s 2016-era DDR4, so the host memory feeding those experts is the real difference. (The 5090’s own GDDR7 VRAM is faster still, but the offloaded experts never live there—only the attention layers and KV cache do.) Second, the modern 5090 chews through the attention math far quicker than a pair of Pascal P5000s.

The Trade-offs of CPU Offloading

This approach isn’t free. Here are the very real bottlenecks:

  • Decode is memory-bandwidth-bound: The speed limit (10–23 tok/s) is set by how fast system RAM can feed the active experts to the CPU cores. Relying on hyperthreading can actually hurt performance; sticking to physical cores yields better results.
  • Prefill is compute-bound and agonizingly slow: Digesting a long prompt requires raw FLOPS and TOPS, which old CPUs severely lack.
  • Context is surprisingly cheap: Because the massive experts live in RAM, your GPU effortlessly handles the KV cache. The RTX 5090 fits Qwen3.5-122B’s entire 256K context into just 12GB of VRAM.
  • You must pin the experts in RAM: Using --no-mmap and --mlock together loads the weights into locked memory, which prevents the kernel from paging them out to swap mid-generation.
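Before pinning, it is worth checking that the locked weights actually leave room for the OS and everything else. A minimal sketch of that check, using the pin sizes quoted in the flags section of this post; the function names and the 8GB reserve are hypothetical choices for illustration, not a llama.cpp requirement.

```python
# Sanity check before using --no-mmap with --mlock: pinned weights cannot be
# paged out, so whatever remains must cover the OS and other processes.

def pin_headroom_gb(ram_gb, pinned_gb):
    """System RAM left over once the model is locked in memory."""
    return ram_gb - pinned_gb

def safe_to_pin(ram_gb, pinned_gb, reserve_gb=8.0):
    """True if pinning still leaves at least reserve_gb (an assumed margin)."""
    return pin_headroom_gb(ram_gb, pinned_gb) >= reserve_gb

# Pin sizes quoted in this post.
ryzen_free = pin_headroom_gb(192, 131)  # MiniMax pin on the Ryzen box
xeon_free = pin_headroom_gb(128, 108)   # IQ4_XS pin on the Xeon box

print(f"Ryzen box: {ryzen_free} GB free, safe={safe_to_pin(192, 131)}")
print(f"Xeon box:  {xeon_free} GB free, safe={safe_to_pin(128, 108)}")
```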

The Asymmetry: Prefill vs. Generation

An LLM request has two phases with entirely different bottlenecks. Generation is memory-bandwidth-bound, but prefill (digesting the prompt) is compute-bound.

The dual Xeon’s 150 GB/s memory bandwidth is about 8% of the RTX PRO 6000’s 1.8 TB/s, a gap of roughly 12×. The compute gap is far larger: the Blackwell card delivers 125 TFLOPS of FP32 and 4,000 AI TOPS via tensor cores, whereas the 2016 Xeons manage only ~2.8 TFLOPS of FP32 with zero tensor cores, about 45× less FP32 throughput before tensor cores are even counted.

The rule of thumb: decode speed barely moves with prompt length, but prefill cost explodes. A 44.5K-token prefill for DeepSeek-V4-Flash took roughly 10 minutes on the Ryzen box. Scaling that linearly, a 256K-token prompt would take close to an hour, and because attention cost grows faster than linearly with prompt length, the real wait for the first token is likely over an hour. Pick your hardware by your expected prompt length, not just your model size.
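The two hardware gaps and the prefill extrapolation can be reproduced with a few lines of arithmetic. This is a sketch built only from the figures in this section; the helper names are hypothetical, 256K is taken as 256,000 tokens, and the linear extrapolation is a lower bound because attention cost grows faster than linearly with prompt length.

```python
# Hardware gaps, using the figures quoted in this section.
xeon_bw_gb_s, pro_bw_gb_s = 150, 1800      # memory bandwidth
xeon_fp32_tflops, pro_fp32_tflops = 2.8, 125

bandwidth_gap = pro_bw_gb_s / xeon_bw_gb_s         # how much faster decode can be fed
compute_gap = pro_fp32_tflops / xeon_fp32_tflops   # how much faster prefill can run

def prefill_rate_tok_s(prompt_tokens, minutes):
    """Average prompt-processing speed observed for one run."""
    return prompt_tokens / (minutes * 60)

def linear_prefill_minutes(prompt_tokens, rate_tok_s):
    """Optimistic estimate: assumes prefill speed stays constant with length."""
    return prompt_tokens / rate_tok_s / 60

# Observed on the Ryzen box: ~44.5K tokens in ~10 minutes.
observed_rate = prefill_rate_tok_s(44_500, 10)
estimate_256k = linear_prefill_minutes(256_000, observed_rate)

print(f"bandwidth gap: {bandwidth_gap:.0f}x, FP32 gap: {compute_gap:.0f}x")
print(f"prefill rate: {observed_rate:.0f} tok/s")
print(f"256K prompt, linear lower bound: {estimate_256k:.1f} min")
```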

Concurrency: What the $12.5K Box Actually Buys

When serving Qwen3.5-122B via vLLM on the RTX PRO 6000, a single user gets 128 tok/s, but pushing a concurrency of 8 users yields a massive 780 tok/s aggregate throughput.

CPU-offloading cannot match this. On the 5090, pushing multiple requests causes aggregate throughput to rise sub-linearly, while the per-request rate collapses from 23 tok/s to roughly 8.5 tok/s as the CPU experts saturate. If you need to serve a team, buy the big card.
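A quick way to compare the two behaviors is to ask how much of the single-user speed each request keeps under load. The sketch below uses only the numbers quoted above; the post does not state the request count behind the 5090 figure, so the code computes retention rather than an aggregate for that box. The helper names are hypothetical.

```python
def per_user_tok_s(aggregate_tok_s, users):
    """Average decode speed each user sees when sharing the box."""
    return aggregate_tok_s / users

def retained_fraction(loaded_tok_s, single_tok_s):
    """Share of single-user speed that survives under concurrent load."""
    return loaded_tok_s / single_tok_s

# RTX PRO 6000 via vLLM: 128 tok/s alone, 780 tok/s aggregate at 8 users.
pro_per_user = per_user_tok_s(780, 8)
pro_retained = retained_fraction(pro_per_user, 128)

# RTX 5090 with CPU-offloaded experts: 23 tok/s alone, ~8.5 tok/s under load.
ryzen_retained = retained_fraction(8.5, 23)

print(f"PRO 6000: {pro_per_user:.1f} tok/s per user ({pro_retained:.0%} retained)")
print(f"5090 + CPU experts: {ryzen_retained:.0%} retained")
```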

Important Caveats & Setup

  • Old GPUs mean old CUDA: The P5000s are Pascal architecture, meaning they are pinned to a CUDA-12.8 llama.cpp image since CUDA 13 dropped Pascal support.
  • DeepSeek-V4-Flash: This architecture requires a community fork to run properly, so verified numbers are only available for the Ryzen box (hitting 11–12 tok/s).
  • Heat and Power Draw: Old servers are power hungry. The small server room holding the boxes for these tests rose to 83°F once I started running these evals, just a bit too warm for an office. Expect about 800W of power draw on the dual Xeons, and upwards of 1000W on the RTX 5090 setup.

The Flags You Need:

  • --cpu-moe (or --n-cpu-moe N to shift layers manually).
  • --no-mmap and --mlock to pin weights in anonymous RAM. On the Ryzen box, MiniMax’s 131GB pins cleanly into the 192GB of RAM; on the Xeon, an IQ4_XS quant’s ~108GB pin leaves about 20GB free on the 128GB box.
  • --flash-attn on alongside a quantized KV cache (like q8_0) to save VRAM.
  • --threads set strictly to physical cores (plus --numa distribute for dual-socket setups).
  • --parallel 1 to dedicate the entire context window to a single user.
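Put together, the flags above might be assembled as in the sketch below. It only builds and prints a command line; nothing is launched. The model path is a placeholder, the --n-gpu-layers value and the --cache-type-k / --cache-type-v spellings for the quantized KV cache are my assumptions about a recent llama.cpp build, and the helper itself is hypothetical. Check `llama-server --help` on your build before relying on any of it.

```python
import shlex

def build_llama_server_cmd(model_path, physical_cores, dual_socket=False,
                           ctx_size=None, kv_quant="q8_0"):
    """Assemble a llama-server command line from the flags listed above."""
    cmd = [
        "llama-server",
        "-m", model_path,
        "--n-gpu-layers", "99",       # assumption: keep all non-expert layers on the GPU
        "--cpu-moe",                  # expert weights stay in system RAM
        "--no-mmap", "--mlock",       # pin weights so they cannot be swapped out
        "--flash-attn", "on",
        "--cache-type-k", kv_quant,   # quantized KV cache to save VRAM
        "--cache-type-v", kv_quant,
        "--threads", str(physical_cores),  # physical cores only, no hyperthreads
        "--parallel", "1",            # whole context window for a single user
    ]
    if dual_socket:
        cmd += ["--numa", "distribute"]
    if ctx_size is not None:
        cmd += ["--ctx-size", str(ctx_size)]
    return cmd

# Placeholder path; 40 physical cores across two sockets as on the Xeon box.
xeon_cmd = build_llama_server_cmd("/models/your-moe-model.gguf",
                                  physical_cores=40, dual_socket=True)
print(shlex.join(xeon_cmd))
```

Passing the core count explicitly is a deliberate choice here: Python's os.cpu_count() reports logical CPUs, which would include hyperthreads.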

Conclusion

This was never about speed; it’s about access. You are giving up the headroom for concurrency and the instant first token. But you are gaining the ability to run 122-billion, 229-billion, or 284-billion-parameter models on hardware that costs as much as a used laptop.