Transcript
00:00:00Our local model setup works great, until we need a different model.
00:00:04Now we're killing llama server, changing ports, updating our OpenAI base URL, waiting
00:00:10for reloads, and just hoping nothing freaks out.
00:00:13All because our coding model is too big for quick chat, and your small model is too dumb
00:00:18for real code.
00:00:19LlamaSwap fixes that.
00:00:21One endpoint, multiple models, automatic swapping, and your tools don't know anything changed.
00:00:26I'll show you how to set this up in the next couple minutes.
00:00:34Most local LLM devs eventually hit the same wall.
00:00:37At first you use something convenient, llama, lmstudio, something that just works.
00:00:44Because it does.
00:00:45And honestly that's great, because they've gotten a lot better.
00:00:48But then we start wanting more control.
00:00:51You want exact llama CPP flags, GPU layer placement, maybe context size, custom backends, maybe
00:00:59even experimental models.
00:01:01So you move closer to raw llama server, and that feels awesome.
00:01:06Until you realize you just traded one problem for another.
00:01:09Now you're doing this.
00:01:11You're killing your llama server, then you start QuinCoder, then five minutes later, what
00:01:16are you doing?
00:01:17You're killing your llama server.
00:01:18You're bouncing between these models.
00:01:20And every time you do that, something waits, reconnects, fails, or silently uses the wrong
00:01:26model.
00:01:27So what you're really trying to do is keep one endpoint in front, swap whatever models
00:01:31you want behind it.
00:01:33That is the gap that llama swap fills.
00:01:36If you enjoy coding tools that speed up your workflow, be sure to subscribe.
00:01:39We have videos coming out all the time.
00:01:41Now let me show you before we talk about it, how all this works.
00:01:44Right now llama swap is running locally on one port.
00:01:48My client only knows this base URL, not one URL for Quin, not another for small LM, another
00:01:55URL for embeddings, just one front door.
00:01:58Here's a tiny config with two models.
00:02:02So one is QuinCoder, the other is small LM2.
00:02:06And each one has its own command.
00:02:09Each one has its own model file.
00:02:11Each one has its own context size.
00:02:14And the difference between these two is each one of these also has its own TTL.
00:02:19Now I'll ask the coding model for something.
00:02:22I send one normal OpenAI style chat request.
00:02:25The model field says QuinCoder, okay, great.
00:02:30Let's watch the logs.
00:02:32It waits until the backend is healthy, then it sends the request through.
00:02:36Now here's the thing that is not happening.
00:02:39I'm not changing the URL.
00:02:41I'm not restarting open web UI.
00:02:43I'm not editing this in cursor.
00:02:46I'm changing one field.
00:02:48So the model goes from QuinCoder to small LM2, same endpoint, same client, different model.
00:02:55And when the model sits idle past this TTL, llama swap can unload it so your VRAM comes
00:02:59back.
00:03:00That's the whole trick.
00:03:02Your tools think they are talking to one API.
00:03:04Llama swap handles the messy part behind all the curtains to really control how things are
00:03:09going.
00:03:10Okay, great.
00:03:11So what is llama swap?
00:03:12I kind of demoed it here, right?
00:03:13Think of it like a hub for your local models.
00:03:16Your apps don't talk directly to every model server.
00:03:19They talk to llama swap.
00:03:21Then llama swap looks at the model field and decides what should happen.
00:03:25If the model is already running, it's going to forward the request.
00:03:28If the model is not running, well then it's going to start the request.
00:03:31If another model needs to get out of the way, it's going to stop.
00:03:35Then your client gets a normal response.
00:03:38So there's no changing base URLs every 10 minutes.
00:03:41There's one binary, one config file, one stable API endpoint.
00:03:45It's built in Go and it uses YAML config.
00:03:48It works as a proxy for OpenAI compatible and anthropic compatible APIs and it can sit in
00:03:53front of backends like llama cpp, vllm, tabby API, and more.
00:03:59If you're lucky, you might have 10 or 20 models on disk, but only enough VRAM to keep one or
00:04:05two loaded.
00:04:06TTL helps with that.
00:04:08If a model is idle long enough, llama swap can unload it.
00:04:11So instead of your GPU being stuck holding a model that we're not actually using, it can
00:04:17free that memory for the next request.
00:04:20Before you had to remember what is running.
00:04:23Now the config remembers for you.
00:04:25At this point, the obvious question is why not just use llama or LM Studios or plain llama
00:04:31server?
00:04:32And the answer is, well, you might.
00:04:35Llama swap doesn't replace these all the time.
00:04:37It solves a very specific problem.
00:04:40Compared to llama, llama swap is not a model store, downloader, or a beginner friendly CLI.
00:04:47That is not the point here.
00:04:49The point is control.
00:04:50You bring your own llama cpp builds, you bring your own flags, you decide exactly how each
00:04:55model launches.
00:04:57Compared to LM Studio llama swap is more server first, no GUI required.
00:05:02It fits better into a dev box, a home lab server, Docker, or a shared machine where tools just
00:05:07need a stable API.
00:05:09This is not as easy as a llama run llama 3.
00:05:13You need your model files.
00:05:15You need to understand your backend.
00:05:17You need to write YAML.
00:05:19You need to know which flags fit your GPU.
00:05:22There is no built in model gallery that just downloads and configures everything for you.
00:05:26So honestly, the setup is a huge pain.
00:05:29But for some devs, this solves a very specific pain.
00:05:32The pain of knowing exactly what model you want, but wasting time wiring and rewiring
00:05:38everything around it.
00:05:39It's worth trying if you use tools like cursor, continue, custom agents, or local scripts.
00:05:44This is going to be useful, but the setup is more intensive.
00:05:47So that's llama swap.
00:05:49One stable API endpoint, multiple local models behind it, automatic swapping, idle unloads,
00:05:54full backend control.
00:05:56The main idea here is simple.
00:05:58Your clients stop caring which model server is actually running.
00:06:02Llama swap handles all that for them.
00:06:04If you enjoy coding tools like this, be sure to subscribe.
00:06:06We'll see you in another video.