Llama-Swap: This Fixes The Most Annoying Local LLM Problem

BBetter Stack
Computing/SoftwareConsumer ElectronicsInternet Technology

Transcript

00:00:00Our local model setup works great, until we need a different model.
00:00:04Now we're killing llama server, changing ports, updating our OpenAI base URL, waiting
00:00:10for reloads, and just hoping nothing freaks out.
00:00:13All because our coding model is too big for quick chat, and your small model is too dumb
00:00:18for real code.
00:00:19LlamaSwap fixes that.
00:00:21One endpoint, multiple models, automatic swapping, and your tools don't know anything changed.
00:00:26I'll show you how to set this up in the next couple minutes.
00:00:34Most local LLM devs eventually hit the same wall.
00:00:37At first you use something convenient, llama, lmstudio, something that just works.
00:00:44Because it does.
00:00:45And honestly that's great, because they've gotten a lot better.
00:00:48But then we start wanting more control.
00:00:51You want exact llama CPP flags, GPU layer placement, maybe context size, custom backends, maybe
00:00:59even experimental models.
00:01:01So you move closer to raw llama server, and that feels awesome.
00:01:06Until you realize you just traded one problem for another.
00:01:09Now you're doing this.
00:01:11You're killing your llama server, then you start QuinCoder, then five minutes later, what
00:01:16are you doing?
00:01:17You're killing your llama server.
00:01:18You're bouncing between these models.
00:01:20And every time you do that, something waits, reconnects, fails, or silently uses the wrong
00:01:26model.
00:01:27So what you're really trying to do is keep one endpoint in front, swap whatever models
00:01:31you want behind it.
00:01:33That is the gap that llama swap fills.
00:01:36If you enjoy coding tools that speed up your workflow, be sure to subscribe.
00:01:39We have videos coming out all the time.
00:01:41Now let me show you before we talk about it, how all this works.
00:01:44Right now llama swap is running locally on one port.
00:01:48My client only knows this base URL, not one URL for Quin, not another for small LM, another
00:01:55URL for embeddings, just one front door.
00:01:58Here's a tiny config with two models.
00:02:02So one is QuinCoder, the other is small LM2.
00:02:06And each one has its own command.
00:02:09Each one has its own model file.
00:02:11Each one has its own context size.
00:02:14And the difference between these two is each one of these also has its own TTL.
00:02:19Now I'll ask the coding model for something.
00:02:22I send one normal OpenAI style chat request.
00:02:25The model field says QuinCoder, okay, great.
00:02:30Let's watch the logs.
00:02:32It waits until the backend is healthy, then it sends the request through.
00:02:36Now here's the thing that is not happening.
00:02:39I'm not changing the URL.
00:02:41I'm not restarting open web UI.
00:02:43I'm not editing this in cursor.
00:02:46I'm changing one field.
00:02:48So the model goes from QuinCoder to small LM2, same endpoint, same client, different model.
00:02:55And when the model sits idle past this TTL, llama swap can unload it so your VRAM comes
00:02:59back.
00:03:00That's the whole trick.
00:03:02Your tools think they are talking to one API.
00:03:04Llama swap handles the messy part behind all the curtains to really control how things are
00:03:09going.
00:03:10Okay, great.
00:03:11So what is llama swap?
00:03:12I kind of demoed it here, right?
00:03:13Think of it like a hub for your local models.
00:03:16Your apps don't talk directly to every model server.
00:03:19They talk to llama swap.
00:03:21Then llama swap looks at the model field and decides what should happen.
00:03:25If the model is already running, it's going to forward the request.
00:03:28If the model is not running, well then it's going to start the request.
00:03:31If another model needs to get out of the way, it's going to stop.
00:03:35Then your client gets a normal response.
00:03:38So there's no changing base URLs every 10 minutes.
00:03:41There's one binary, one config file, one stable API endpoint.
00:03:45It's built in Go and it uses YAML config.
00:03:48It works as a proxy for OpenAI compatible and anthropic compatible APIs and it can sit in
00:03:53front of backends like llama cpp, vllm, tabby API, and more.
00:03:59If you're lucky, you might have 10 or 20 models on disk, but only enough VRAM to keep one or
00:04:05two loaded.
00:04:06TTL helps with that.
00:04:08If a model is idle long enough, llama swap can unload it.
00:04:11So instead of your GPU being stuck holding a model that we're not actually using, it can
00:04:17free that memory for the next request.
00:04:20Before you had to remember what is running.
00:04:23Now the config remembers for you.
00:04:25At this point, the obvious question is why not just use llama or LM Studios or plain llama
00:04:31server?
00:04:32And the answer is, well, you might.
00:04:35Llama swap doesn't replace these all the time.
00:04:37It solves a very specific problem.
00:04:40Compared to llama, llama swap is not a model store, downloader, or a beginner friendly CLI.
00:04:47That is not the point here.
00:04:49The point is control.
00:04:50You bring your own llama cpp builds, you bring your own flags, you decide exactly how each
00:04:55model launches.
00:04:57Compared to LM Studio llama swap is more server first, no GUI required.
00:05:02It fits better into a dev box, a home lab server, Docker, or a shared machine where tools just
00:05:07need a stable API.
00:05:09This is not as easy as a llama run llama 3.
00:05:13You need your model files.
00:05:15You need to understand your backend.
00:05:17You need to write YAML.
00:05:19You need to know which flags fit your GPU.
00:05:22There is no built in model gallery that just downloads and configures everything for you.
00:05:26So honestly, the setup is a huge pain.
00:05:29But for some devs, this solves a very specific pain.
00:05:32The pain of knowing exactly what model you want, but wasting time wiring and rewiring
00:05:38everything around it.
00:05:39It's worth trying if you use tools like cursor, continue, custom agents, or local scripts.
00:05:44This is going to be useful, but the setup is more intensive.
00:05:47So that's llama swap.
00:05:49One stable API endpoint, multiple local models behind it, automatic swapping, idle unloads,
00:05:54full backend control.
00:05:56The main idea here is simple.
00:05:58Your clients stop caring which model server is actually running.
00:06:02Llama swap handles all that for them.
00:06:04If you enjoy coding tools like this, be sure to subscribe.
00:06:06We'll see you in another video.

Key Takeaway

Llama-Swap eliminates the need to manually restart servers or change base URLs by acting as a proxy that automatically swaps local LLMs based on the model field in API requests.

Highlights

  • Llama-Swap provides a single stable API endpoint that acts as a front door for multiple local Large Language Models.

  • The system automatically unloads idle models based on a user-defined Time-to-Live (TTL) setting to reclaim VRAM.

  • Each model in the YAML configuration can have its own unique llama.cpp flags, GPU layer placement, and context size.

  • Llama-Swap functions as a proxy for both OpenAI and Anthropic compatible APIs, sitting in front of backends like llama.cpp and vLLM.

  • The tool is written in Go and requires manual configuration of model files and backend flags rather than automated downloads.

Timeline

Limitations of standard local LLM workflows

  • Switching between coding and chat models typically requires manual server restarts and port updates.
  • Constant manual swapping leads to connection failures or the accidental use of incorrect models.
  • Advanced users often trade convenience for control when moving from tools like LM Studio to raw llama servers.

The friction in local LLM development arises when a small model lacks the intelligence for code and a large model is too slow for chat. Developers find themselves killing processes and editing base URLs in tools like Cursor or Open WebUI every time they need a different capability. Llama-Swap addresses this gap by keeping a single endpoint active while handling the backend model rotations automatically.

Unified API endpoint and YAML configuration

  • A single local port serves as the universal base URL for all client requests.
  • YAML configuration files define specific commands, model paths, and context sizes for each LLM.
  • Model swapping occurs instantly when the 'model' field in a standard OpenAI-style chat request changes.

In a live environment, a client only needs to know one base URL rather than separate addresses for various models or embeddings. The configuration maps model names like QwenCoder or SmallLM2 to specific launch commands. When a request arrives, Llama-Swap checks if the requested backend is healthy and running before forwarding the traffic, ensuring the client never sees a disruption.

VRAM management and backend compatibility

  • Llama-Swap unloads models from the GPU once they sit idle past their designated TTL duration.
  • The proxy supports various backends including llama.cpp, vLLM, and TabbyAPI.
  • Automated logic stops running models to make room for new ones if hardware resources are limited.

Users often have dozens of models on disk but only enough VRAM to support one or two at a time. The TTL feature prevents the GPU from being stuck holding an unused model, freeing up memory for the next incoming request automatically. This shifts the burden of memory management from the user's memory to the configuration file's logic.

Comparative advantages and setup requirements

  • Llama-Swap differs from Ollama by focusing on precise backend control rather than being a model store or downloader.
  • The tool lacks a GUI and is designed for server-first environments like Docker or dev boxes.
  • Successful deployment requires manual knowledge of GPU flags and YAML syntax.

While beginner-friendly tools exist, they often hide the low-level flags necessary for specialized workflows. Llama-Swap is intended for developers using custom scripts or agents who need a stable API that doesn't require constant rewiring. It does not provide a built-in model gallery, meaning the user is responsible for sourcing their own GGUF files and understanding their specific hardware constraints.

Community Posts

View all posts