Local AI is booming. Millions of people run large language models on their laptops, their phones, even their Raspberry Pis. It feels like magic. But when people talk about how this became possible, they usually credit the wrong projects.
The real heroes don't have slick landing pages or VC funding announcements. They work in the shadows — writing C++, inventing memory management tricks, and quantizing models at 3 AM so that you can run a 70-billion-parameter model on hardware that costs less than a used car.
This is their story.
The cast, in order of appearance:
Georgi Gerganov (@ggerganov): llama.cpp
TheBloke (Tom Jobbins): quantization pioneer (retired)
bartowski: quantization torch-bearer
mradermacher: quantization at scale
UC Berkeley: vLLM and PagedAttention
Hugging Face: the model ecosystem
Daniel and Michael Han: Unsloth and the fine-tuning revolution
llama.cpp: The Engine That Started Everything
In March 2023, a Bulgarian engineer named Georgi Gerganov did something that changed the trajectory of AI. He took Meta's freshly leaked LLaMA weights and got them running in pure C/C++ — no Python, no PyTorch, no CUDA toolkit, no dependencies. Just a single binary that you could compile and run on practically anything.
That project, llama.cpp, single-handedly kickstarted the local LLM revolution. It runs on your MacBook via Metal, on your gaming PC via CUDA and Vulkan, on your phone via ARM NEON, and even in your browser via WebAssembly. Gerganov also created the GGUF format, which became the de facto standard for distributing quantized models for local inference.
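One reason GGUF caught on is that it's a simple, self-describing binary container. As a minimal sketch, here is how you might read the fixed preamble of a version 3 GGUF file (layout per the published GGUF spec: 4 magic bytes, then a little-endian uint32 version and uint64 tensor and metadata counts; this is an illustration, not llama.cpp's loader):

```python
import struct

def read_gguf_header(path):
    """Read the fixed-size GGUF preamble: magic, version, counts."""
    with open(path, "rb") as f:
        magic = f.read(4)
        if magic != b"GGUF":
            raise ValueError("not a GGUF file")
        # little-endian: uint32 version, uint64 tensor count, uint64 metadata KV count
        version, n_tensors, n_kv = struct.unpack("<IQQ", f.read(20))
    return {"version": version, "tensors": n_tensors, "metadata_kv": n_kv}
```

After the preamble come the metadata key-value pairs (architecture, tokenizer, quantization type) and the tensor index, which is why a single GGUF file is all a runtime needs.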
The optimization work is staggering. Every major hardware acceleration path — Metal, CUDA, Vulkan, AVX-512, ARM NEON — has been hand-tuned. The result is inference performance that commercial products struggle to match, from a project run by open-source contributors.
The Elephant in the Room
Let's talk about Ollama. It's a fantastic product. Beautiful CLI, dead-simple model management, one-line installs. Millions of people use it as their gateway to local AI.
But underneath that polish? It's llama.cpp. The inference engine, the model loading, the quantization support — that's all Gerganov's work and the llama.cpp community's contributions. Ollama has built a great UX layer on top, but the engine that actually runs your models was written by someone else. And you'd be hard-pressed to find that acknowledged prominently on Ollama's marketing or homepage.
This isn't to diminish Ollama. Good UX matters enormously — it's why normal people can run local models at all. But the engineering foundation deserves equal billing, and it rarely gets it.
The Quantization Wizards
Here's a number that explains why quantization matters: a 70B parameter model at FP16 precision requires 140 GB of VRAM. That's more than any consumer GPU. At 4-bit quantization, it needs roughly 35 GB. That's the difference between "physically impossible on consumer hardware" and "runs on a Mac Studio."
Without quantization, local AI simply does not exist.
[Chart: 70B model memory requirements at FP16 vs. 4-bit]
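The arithmetic behind those numbers is just parameter count times bits per weight (ignoring KV cache and activation overhead, which add more on top):

```python
def model_memory_gb(n_params, bits_per_param):
    """Approximate weight memory: parameters x bits, converted to gigabytes."""
    return n_params * bits_per_param / 8 / 1e9

fp16 = model_memory_gb(70e9, 16)  # 140.0 GB -- beyond any consumer GPU
int4 = model_memory_gb(70e9, 4)   # 35.0 GB -- fits on a Mac Studio
```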
TheBloke (Tom Jobbins) was the unsung distribution hero of the local AI movement. He took hundreds of models and quantized them into every format imaginable — GGUF, GPTQ, AWQ — so that normal people could actually download and run them. Before TheBloke, getting a quantized model meant doing it yourself, which required significant technical knowledge and compute. He turned it into "click download."
In early 2024, TheBloke went quiet. No announcement, no farewell — his HuggingFace uploads simply stopped around January 2024, and his presence in the llama.cpp community faded. The LocalLLaMA community noticed almost immediately. Whatever his reasons, the gap he left was enormous — and it spoke volumes about how much one person had been carrying the ecosystem on their back.
But the community didn't collapse. Others picked up the torch. bartowski became the new go-to name for high-quality GGUF quantizations, quickly earning a reputation for being first to quantize popular new releases with careful attention to quality settings. mradermacher took a different approach — building automated pipelines that quantize models at massive scale, including importance-matrix (imatrix) quants that squeeze out better quality. He started precisely because TheBloke had vanished and he needed quants for himself. That's the open-source spirit in a nutshell.
TheBloke was a legend. Full stop. The people who followed him would be the first to say so. But the fact that the community found new heroes when it needed them — that's the real strength of open source. It doesn't depend on any single person.
turboderp pushed the boundaries further with ExLlamaV2, a GPTQ/EXL2 inference engine that squeezed every last drop of performance out of NVIDIA GPUs. The EXL2 format introduced per-layer quantization — different parts of the model quantized at different bit rates based on their sensitivity — producing better quality at the same file size.
And behind all of these are the research teams who developed the quantization methods themselves: GPTQ, AWQ, and the GGUF quantization schemes built into llama.cpp. Each with different tradeoffs between speed, quality, and compatibility — but all serving the same mission: making large models fit on small hardware.
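To see why 4-bit works at all, here is a toy symmetric absmax scheme. It is far cruder than llama.cpp's block-wise k-quants or GPTQ's error-compensating updates, but it shows the core trade: store small integers plus one scale factor, and accept a bounded rounding error.

```python
import numpy as np

def quantize_4bit(weights):
    """Toy symmetric absmax quantization to 4-bit integers in [-7, 7]."""
    scale = np.abs(weights).max() / 7.0
    q = np.clip(np.round(weights / scale), -7, 7).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

w = np.array([0.12, -0.53, 0.98, -0.07], dtype=np.float32)
q, s = quantize_4bit(w)
w_hat = dequantize(q, s)  # close to w, with at most scale/2 rounding error
```

Real schemes quantize in small blocks with per-block scales, which is what keeps the quality loss tolerable at 4 bits and below.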
vLLM and PagedAttention: Making Serving Work
Running a model for yourself is one thing. Serving it to hundreds of concurrent users is another problem entirely. The vLLM project from UC Berkeley solved this by applying a concept from operating systems to AI inference.
Their key insight, called PagedAttention, treats the key-value cache (the memory that grows as the model generates tokens) like virtual memory pages in an OS. Instead of pre-allocating huge contiguous memory blocks for each request, vLLM allocates small pages on demand and manages them dynamically. This eliminates massive memory waste and enables far more concurrent requests on the same hardware.
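The idea can be sketched with a toy allocator. Everything here (block size, bookkeeping) is illustrative rather than vLLM's actual implementation, but the on-demand, page-at-a-time allocation is the essence of PagedAttention:

```python
class PagedKVCache:
    """Toy page allocator in the spirit of PagedAttention: each sequence
    gets a table of small fixed-size pages, allocated on demand instead of
    one large contiguous region reserved up front."""

    def __init__(self, num_blocks, block_size=16):
        self.block_size = block_size  # tokens per page
        self.free = list(range(num_blocks))  # pool of free physical pages
        self.tables = {}   # seq_id -> list of physical page ids
        self.lengths = {}  # seq_id -> tokens cached so far

    def append_token(self, seq_id):
        """Reserve cache space for one new token; grab a page only when
        the current page is full."""
        n = self.lengths.get(seq_id, 0)
        if n % self.block_size == 0:  # current page full, or none yet
            if not self.free:
                raise MemoryError("KV cache exhausted")
            self.tables.setdefault(seq_id, []).append(self.free.pop())
        self.lengths[seq_id] = n + 1

    def release(self, seq_id):
        """Return a finished sequence's pages to the pool immediately."""
        self.free.extend(self.tables.pop(seq_id, []))
        self.lengths.pop(seq_id, None)

cache = PagedKVCache(num_blocks=64, block_size=16)
for _ in range(20):
    cache.append_token("request-1")  # 20 tokens occupy just 2 pages
```

Because a sequence never holds more pages than it has actually filled, short requests no longer hoard memory sized for the worst case, which is where the concurrency gains come from.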
vLLM was open source from day one, production-grade from early on, and is now the backbone of countless self-hosted AI deployments. The team published their research, released the code, and built a community around it — the right way to do open-source AI infrastructure.
The Format and Ecosystem Builders
Hugging Face became the GitHub of AI models. Their model hub hosts tens of thousands of models in open formats, and their safetensors format solved real security issues with pickle-based model files. Without a centralized, open platform for model distribution, the local AI ecosystem would be fragmented beyond usefulness.
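Part of what makes safetensors safer is structural: the file starts with an 8-byte little-endian length prefix followed by a plain JSON header, so inspecting a model's tensors never executes code, unlike unpickling. A sketch of reading that header (per the published safetensors spec; not the library's own implementation):

```python
import json
import struct

def read_safetensors_header(path):
    """Read a safetensors header: 8-byte little-endian length prefix,
    then that many bytes of plain JSON describing the tensors."""
    with open(path, "rb") as f:
        (header_len,) = struct.unpack("<Q", f.read(8))
        return json.loads(f.read(header_len))
```

The header maps each tensor name to its dtype, shape, and byte offsets into the raw data that follows, so loaders can memory-map exactly the tensors they need.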
Simon Willison built the llm CLI tool and tirelessly documented the entire local AI space. His blog posts and tools turned complex technical concepts into accessible knowledge. In an ecosystem that moves at breakneck speed, having someone who methodically explains what's happening and why it matters is invaluable.
Andrej Karpathy — through projects like nanoGPT and llm.c — made transformer architectures genuinely understandable. His educational work created an entire generation of engineers who actually understand how these models work under the hood, rather than treating them as black boxes.
Unsloth: Fine-Tuning for the Rest of Us
Running models locally is one thing. Training them — or fine-tuning them to do what you actually need — has traditionally been something else entirely. Fine-tuning a 7B model used to require 40+ GB of VRAM and hours of patience. For most people with a single consumer GPU, the answer was simply: you can't.
Then two brothers from Australia changed the equation. Daniel and Michael Han built Unsloth, an open-source library that makes LLM fine-tuning 2–5x faster while using up to 80% less memory. They achieved this by manually rewriting backpropagation kernels in Triton — not wrapping existing libraries, not slapping a UI on top of someone else's work, but going deep into the math and rewriting the actual gradient computations to be more memory-efficient.
The impact has been enormous. Suddenly, fine-tuning a 7B model fits on a single RTX 3060. A 70B model can be fine-tuned on hardware that previously couldn't even load it for inference. Unsloth supports LoRA, QLoRA, and full fine-tuning across virtually every popular architecture — Llama, Mistral, Gemma, Qwen, DeepSeek — and has expanded into reinforcement learning with GRPO, 500K-context fine-tuning, and even text-to-speech model training.
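A big share of those memory savings comes from LoRA-style adapters, which Unsloth accelerates: instead of updating a weight matrix W directly, you train a low-rank pair of matrices B and A and use W + BA. A back-of-the-envelope count for a single 4096x4096 projection (rank 16 is a typical choice here, used purely for illustration):

```python
def lora_params(d_in, d_out, rank):
    """Trainable parameters in a LoRA adapter W + B @ A,
    where A is (rank, d_in) and B is (d_out, rank)."""
    return rank * d_in + d_out * rank

full = 4096 * 4096                 # 16,777,216 weights to update per matrix
lora = lora_params(4096, 4096, 16) # 131,072 -- under 1% of the full matrix
```

Multiply that saving across every attention and MLP projection in the model, keep the frozen base weights in 4-bit (QLoRA), and a 7B fine-tune suddenly fits in consumer VRAM.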
By late 2025, NVIDIA partnered with Unsloth for their RTX AI Garage program, showing developers how to fine-tune models locally on RTX PCs and the new DGX Spark. The project has tens of thousands of GitHub stars and has become the default recommendation whenever someone asks "how do I fine-tune a model on my own GPU?"
What makes Unsloth special isn't just the performance — it's the philosophy. The Hans kept the core library open source and focused on making fine-tuning accessible rather than just possible. They didn't build a platform that locks you in. They built a tool that gives you superpowers and gets out of your way. That's the open-source ethos at its best.
The Attribution Problem
This brings us to the uncomfortable part.
Several projects have built beautiful, polished interfaces on top of llama.cpp and market themselves as "the easy way to run AI locally." They have landing pages with testimonials, download counters, and funding rounds. And if you look at their homepages, you'd think they built the whole thing from scratch.
This isn't unique to AI. The entire history of open source is littered with companies that build commercial products on top of volunteer labor without adequate credit. It's the tragedy of the commons, applied to software.
But it matters here because of the stakes. These engineers — Gerganov, TheBloke, turboderp, the vLLM team — many of them work for free or on minimal funding. They do the hard, unglamorous work of making bits move faster through silicon. When a wrapper project raises millions and gets all the attention, while the engine underneath gets a footnote in a LICENSE file, something is wrong.
The fix isn't complicated. Cite your dependencies. Put "Powered by llama.cpp" on your homepage. Link to the projects that make yours possible. Credit the engine, not just the paint job.
The Craft Over the Credit
Next time you marvel at running a 70-billion-parameter model on your laptop — at watching it generate coherent text, write code, answer questions, all without sending a single byte to the cloud — remember this:
That magic was built by a handful of engineers who cared more about the craft than the credit. Who spent their nights optimizing matrix multiplications, inventing memory management schemes, and quantizing models so that the rest of us could experience something that felt impossible just two years ago.
They are the real heroes of local AI. And they deserve to be known.
This is an opinion piece. I have no affiliation with any of the projects mentioned. Links go to the actual repositories and profiles — go give them a star.