Alibaba's Qwen team dropped Qwen3-Coder-Next on February 3. It's an 80B-parameter model that activates only 3B parameters per token, making it the first serious coding agent model designed to run locally on unified memory hardware.
The timing isn't coincidental. Strix Halo mini PCs are shipping, NVIDIA's DGX Spark is arriving, and M4 Max Mac Studios have been out for months. All of them have 128GB of unified memory. Qwen3-Coder-Next is the model that makes that hardware useful for real coding work.
The Architecture
| Spec | Value |
|---|---|
| Total Parameters | 80B |
| Active Parameters | 3B (per token) |
| Experts | 512 total, 10 active + 1 shared |
| Layers | 48 |
| Context Window | 262K native, 1M with YaRN |
| Attention | Hybrid: Gated DeltaNet + Gated Attention |
| License | Apache 2.0 |
The key innovation is the hybrid attention mechanism. Instead of standard self-attention everywhere, Qwen3-Next alternates between Gated DeltaNet (a linear attention variant) and standard Gated Attention in a repeating 3:1 pattern. This is what makes the 262K native context practical on consumer hardware—linear attention doesn't have the quadratic memory scaling problem.
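To make the 3:1 mix concrete, here is a toy Python sketch of how 48 layers split between the two attention types. Only the ratio comes from the published spec above; the exact position of the full-attention layer within each group of four is an assumption for illustration.

```python
# Toy illustration of the 3:1 layer mix described above, not the real config.
# Assumes the full-attention layer is the last in each group of four; only the
# ratio (3 DeltaNet : 1 gated attention) comes from the published spec.
NUM_LAYERS = 48

def layer_type(i: int) -> str:
    return "gated_attention" if i % 4 == 3 else "gated_deltanet"

layers = [layer_type(i) for i in range(NUM_LAYERS)]
print(layers.count("gated_deltanet"), "linear-attention (DeltaNet) layers")  # 36
print(layers.count("gated_attention"), "full softmax-attention layers")      # 12
```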
Multi-Token Prediction (MTP) further accelerates inference by predicting multiple tokens at once, reducing the number of forward passes needed.
Benchmarks
For a model with only 3B active parameters, the numbers are hard to believe:
| Benchmark | Qwen3-Coder-Next (80B/3B) | DeepSeek-V3.2 (671B) | Qwen3-235B |
|---|---|---|---|
| SWE-Bench Verified | 70.6 | 70.2 | — |
| SWE-Bench Multilingual | 62.8 | — | — |
| LiveCodeBench | 56.6 | — | 51.8 |
| MMLU-Pro | 80.6 | — | 83.0 |
| Arena-Hard v2 | 82.7 | — | 79.2 |
It matches DeepSeek-V3.2 on SWE-Bench Verified while activating only 3B parameters per token against DeepSeek's 671B total. On LiveCodeBench and Arena-Hard it outperforms Qwen3-235B. Training reportedly cost about 10% of Qwen3-32B's, with roughly 10x higher inference throughput on contexts beyond 32K tokens.
Why Unified Memory Matters
Traditional discrete GPUs cap out at 24GB VRAM (consumer) or 48-80GB (professional). You can't fit an 80B model in 24GB regardless of quantization. Unified memory changes the equation entirely—CPU and GPU share a single memory pool, and these new systems have 128GB of it.
At Q4 quantization, Qwen3-Coder-Next needs roughly 46GB of memory. That leaves 80GB free for context, OS overhead, and other processes on a 128GB system. At Q8, you need about 85GB—still feasible.
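A quick sanity check of those figures as a sketch. The bytes-per-parameter values are rough averages; real GGUF files carry extra overhead for embeddings and metadata, which is where the ~46GB and ~85GB numbers above come from.

```python
# Rough memory footprint of the 80B weights at common quantization levels.
# Byte-per-parameter values are approximations; actual files add overhead.
TOTAL_PARAMS = 80e9
SYSTEM_RAM_GB = 128

for quant, bytes_per_param in {"Q4": 0.5, "Q8": 1.0}.items():
    weights_gb = TOTAL_PARAMS * bytes_per_param / 1e9
    print(f"{quant}: ~{weights_gb:.0f} GB of weights, "
          f"~{SYSTEM_RAM_GB - weights_gb:.0f} GB left for context, OS, and other processes")
# Q4: ~40 GB of weights (plus overhead -> the ~46 GB figure above)
# Q8: ~80 GB of weights (plus overhead -> the ~85 GB figure above)
```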
The Hardware
Three platforms make this practical today:
AMD Strix Halo (Ryzen AI Max+ 395)
Run llama-server with the -fa 1 and --no-mmap flags or performance tanks (see the launch sketch below).
NVIDIA DGX Spark (Grace Blackwell GB10)
Apple Mac Studio (M4 Max)
Runs via llama.cpp's Metal support.
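A minimal launch sketch, written in Python via subprocess to keep all examples in one language. The GGUF filename, context size, and port are placeholders, and the flash-attention flag spelling follows the Strix Halo note above; check your llama.cpp build's help output for the exact syntax.

```python
# Minimal sketch of launching llama-server locally. Filename, port, and context
# size are placeholders; -fa and --no-mmap are the flags noted for Strix Halo.
import subprocess

subprocess.run([
    "llama-server",
    "-m", "Qwen3-Coder-Next-Q4_K_M.gguf",  # hypothetical local quant file
    "-c", "262144",                        # native 262K context window
    "-fa", "1",                            # flash attention (as noted above)
    "--no-mmap",                           # load weights fully into memory
    "--port", "8080",
], check=True)
```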
Bandwidth Is the Bottleneck
For autoregressive token generation, memory bandwidth is the limiting factor, not compute. Each token requires reading the active model weights from memory. With 3B active parameters at Q4, that's roughly 1.5GB per token. At 256 GB/s (Strix Halo), theoretical peak is ~170 tok/s. At 546 GB/s (M4 Max), it's ~364 tok/s.
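The same arithmetic as a sketch. The Q4 bytes-per-parameter figure is an approximation, and real decode speed also depends on KV-cache reads and kernel efficiency.

```python
# Theoretical decode ceiling: bandwidth divided by bytes read per token.
# Only the ~3B active parameters are read per token because the model is MoE.
ACTIVE_PARAMS = 3e9
BYTES_PER_PARAM_Q4 = 0.5                               # rough average for Q4
bytes_per_token = ACTIVE_PARAMS * BYTES_PER_PARAM_Q4   # ~1.5 GB per token

for name, bandwidth_gb_s in [("Strix Halo", 256), ("M4 Max", 546)]:
    peak_tps = bandwidth_gb_s * 1e9 / bytes_per_token
    print(f"{name}: ~{peak_tps:.0f} tok/s theoretical peak")
# Strix Halo: ~171 tok/s, M4 Max: ~364 tok/s
```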
In practice, you won't hit these peaks, but the M4 Max's 2x bandwidth advantage over Strix Halo and DGX Spark translates directly to faster token generation for MoE models.
Use Cases
Always-On Coding Agent
The real value of running Qwen3-Coder-Next locally isn't saving API costs—it's latency and availability. A local model has zero network round-trip, no rate limits, and no downtime. For agentic coding workflows that make hundreds of tool calls per session, eliminating API latency compounds into massive time savings.
With 262K native context and 3B active parameters, you can keep an entire codebase in context while maintaining fast token generation. This is the "Claude Code but local" use case.
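llama-server exposes an OpenAI-compatible endpoint, so wiring an agent or editor against the local model takes a few lines. The base URL, port, and model name below are assumptions that must match however you launched the server.

```python
# Minimal sketch of talking to a local llama-server over its OpenAI-compatible
# API. base_url, port, and model are assumptions; match your launch command.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="unused-locally")

resp = client.chat.completions.create(
    model="qwen3-coder-next",  # mostly informational for a single-model server
    messages=[
        {"role": "system", "content": "You are a concise coding assistant."},
        {"role": "user", "content": "Explain what this regex matches: ^\\d{4}-\\d{2}-\\d{2}$"},
    ],
)
print(resp.choices[0].message.content)
```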
Private Codebases
If you're working on proprietary code that can't leave your machine, local inference is the only option. Qwen3-Coder-Next's SWE-Bench scores make it the first local model that's genuinely competitive with cloud API models for real-world coding tasks.
Fine-Tuning for Your Stack
Apache 2.0 license means you can fine-tune it. LoRA fine-tuning on Strix Halo or Mac Studio is practical with 128GB unified memory. Train it on your company's codebase, coding standards, and internal APIs. This creates a coding assistant that understands your stack in ways no general model can.
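A minimal adapter-config sketch with Hugging Face PEFT. The model id and target module names are assumptions (inspect the checkpoint's actual hybrid-attention layer names before training), and fitting the base model in 128GB in practice means loading a quantized checkpoint, since the bf16 weights alone are around 160GB.

```python
# LoRA adapter sketch (Hugging Face PEFT). Model id and target_modules are
# assumptions; inspect the checkpoint's actual layer names before training.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

model_id = "Qwen/Qwen3-Next-80B-A3B-Instruct"  # assumed Hugging Face repo id

# In practice, load a quantized base (QLoRA-style) so the 80B weights fit in
# 128GB of unified memory; full bf16 weights alone would be ~160GB.
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

config = LoraConfig(
    r=16,                                   # adapter rank
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # assumed names
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, config)
model.print_trainable_parameters()  # only the adapters train; base stays frozen
```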
Offline Development
Flights, trains, coffee shops with bad WiFi. With a Strix Halo mini PC or a MacBook Pro (M4 Max, 128GB), you have a SWE-Bench-competitive coding agent that works anywhere.
The Catch
Qwen3-Coder-Next runs in non-thinking mode only—no chain-of-thought reasoning blocks. This makes it fast but means it won't deliberate on complex architectural decisions the way a thinking model would. For straightforward coding tasks, completions, refactoring, and bug fixes, this is fine. For novel algorithm design or complex system architecture, you'll still want a thinking model.
The 3B active parameter count also means it trades some raw capability for efficiency. On MMLU-Pro (80.6 vs 83.0), it lags behind the full Qwen3-235B. For coding-specific tasks, the gap disappears or reverses.
Bottom Line
Qwen3-Coder-Next is the first model where "run it locally" isn't a compromise—it's a genuine advantage. The MoE architecture was designed for exactly the kind of hardware that's now available in desktop form factors. If you have a 128GB unified memory system, this is the coding model to run on it.
Model: Qwen3-Next-80B-A3B-Instruct / Qwen3-Coder-Next. Hardware comparisons sourced from StarryHope and Remio.