llama-swap Settings to Eliminate Model Switching Delays on GPUs with 12GB or Less

Manually Calculating idle_timeout by VRAM Capacity

For mid-range GPU users, VRAM is a perennially scarce resource. Run several models on an RTX 3060 or 4060 and you hit the limit quickly. A Llama 3.1 8B model using 4-bit quantization (Q4_K_M) consumes 5.2GB for weights alone. On an 8GB card, after subtracting the ~1GB baseline Windows overhead, that leaves just under 2GB for the KV cache and everything else. Load more models recklessly and the overflow spills into system RAM. Seeing your generation speed crawl from 15 tokens per second to 1 token per second makes you want to kill the process immediately.

To prevent this bottleneck, set an idle timeout for each model in your config.yaml (the ttl key, in seconds) to dictate when it is evicted.

8GB VRAM (RTX 3070/4060): Set globalTTL to 300 (5 minutes). Append --ctx-size 8192 to your model execution command (cmd) to ensure the KV cache doesn't swallow all remaining memory, effectively avoiding OOM (Out of Memory) errors.
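12GB VRAM (RTX 3060 12G): Give the frequently used Phi-3 Mini ttl: 0 so it is never unloaded for being idle, and set the heavier Qwen 2.5 Coder 7B to ttl: 60 so that its VRAM is cleared shortly after your coding session ends. Because llama-swap runs one model at a time by default, the two models also need to share a group with swap disabled if Phi-3 Mini is to stay loaded while Qwen runs.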
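As a rough sketch, the 12GB setup might look like this in config.yaml. The model names, executable path, and GGUF paths are hypothetical, and the groups block is an assumption about how you would let both models stay loaded together, since ttl only controls idle unloading.

```yaml
# Sketch for a 12GB card. Model names and paths are hypothetical.
models:
  "phi-3-mini":
    cmd: >
      C:/llama.cpp/llama-server.exe
      --port ${PORT}
      --model C:/models/phi-3-mini-q4_k_m.gguf
      --ctx-size 8192
    ttl: 0    # never unload for being idle

  "qwen2.5-coder-7b":
    cmd: >
      C:/llama.cpp/llama-server.exe
      --port ${PORT}
      --model C:/models/qwen2.5-coder-7b-q4_k_m.gguf
      --ctx-size 8192
    ttl: 60   # unload after 60 seconds without requests

# Assumption: a group with swap disabled lets its members run side by side.
groups:
  "coding":
    swap: false
    members:
      - "phi-3-mini"
      - "qwen2.5-coder-7b"
```

On an 8GB card the same layout applies, but it is probably safer to drop the group and let the models swap, since two sets of weights plus their KV caches may not fit.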

Configuring it this way saves at least 20 minutes a day that would otherwise be wasted manually starting and stopping models.

Consolidating Multiple Inference Engines into One Port via YAML

Switching from Ollama to llama.cpp often leads to port conflicts and resource ownership battles. llama-swap acts as the traffic cop to organize this chaos. Written in Go, this lightweight proxy watches incoming requests: when a call names a model that isn't loaded, it sends a SIGTERM signal to the running process to shut it down safely, then spins up the requested model.

The YAML structure for stable integration is straightforward:

Define common flags like --flash-attn and --mlock, along with executable paths, in the macros section. This keeps your configuration file much cleaner.
Under the models entry, give each model its own cmd and use the ${PORT} macro so llama-swap assigns the port for each process.
For an existing Ollama service, set the proxy field to http://localhost:11434 to link it.
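The macro and ${PORT} steps above can be sketched roughly as follows. The macro name "server", the model names, and all paths are hypothetical placeholders. Some newer llama.cpp builds expect a value after the flag (for example --flash-attn on), so adjust that line to match your build.

```yaml
# Shared launch line, reused by every model below.
# "server" is a hypothetical macro name; paths are placeholders.
macros:
  "server": >
    C:/llama.cpp/llama-server.exe
    --port ${PORT}
    --mlock
    --flash-attn

models:
  "llama-3.1-8b":
    cmd: >
      ${server}
      --model C:/models/llama-3.1-8b-q4_k_m.gguf
      --ctx-size 8192

  "qwen2.5-coder-7b":
    cmd: >
      ${server}
      --model C:/models/qwen2.5-coder-7b-q4_k_m.gguf
      --ctx-size 8192
```

Each model entry then only carries what is unique to it: the weights file and the context size.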

As a result, your applications only need to look at a single address: http://localhost:8080/v1. You no longer need to worry about whether the engine or model has changed under the hood.

Connecting Cursor and Obsidian to Local Endpoints to Save on Subscriptions

The real reason to use local LLMs is to protect your privacy while saving money. Cursor is a paid service by default, but you can point its OpenAI Compatible setting at your local llama-swap instead. This saves you $20 a month, or $240 a year.

The connection method is simple:

In Cursor Settings > Models, enable OpenAI API Compatible and enter http://localhost:8080/v1 in the Base URL.
You can enter any text for the API Key. If you set an alias like gpt-4o for your actual model name in the llama-swap config, Cursor will recognize it and work immediately.
In the Obsidian Copilot plugin, set the embedding model to nomic-embed-text and lock it in llama-swap with ttl: 0.
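On the llama-swap side, the alias and the pinned embedding model might be configured along these lines. This is a sketch under assumptions: the model names, executable path, and GGUF file names are hypothetical, and the embedding model is assumed to be served by llama-server with its --embeddings flag.

```yaml
models:
  "qwen2.5-coder-7b":
    cmd: >
      C:/llama.cpp/llama-server.exe
      --port ${PORT}
      --model C:/models/qwen2.5-coder-7b-q4_k_m.gguf
      --ctx-size 8192
    # Requests for "gpt-4o" are routed to this local model.
    aliases:
      - "gpt-4o"
    ttl: 60

  "nomic-embed-text":
    cmd: >
      C:/llama.cpp/llama-server.exe
      --port ${PORT}
      --model C:/models/nomic-embed-text-v1.5.f16.gguf
      --embeddings
    ttl: 0    # never unload for being idle
```

With this in place, a client that asks for gpt-4o at http://localhost:8080/v1 is served by the local coder model without any change on the client side.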

Models will swap automatically in the background as you move from note-taking to your coding window. Since all data stays on your machine, there's no need to worry about privacy.

Registering as a Background Service using NSSM

Opening a terminal to turn on your proxy every time is a chore. To use AI as a true tool, it should run silently like the air around you. For Windows users, registering llama-swap as a service using NSSM (Non-Sucking Service Manager) is the cleanest approach.

Here is how:

Install it via terminal with winget install NSSM, then run nssm install LlamaSwap with administrator privileges.
In the setup window, enter the path to llama-swap.exe in Path and --config config.yaml --watch-config in Arguments. Use the full path to config.yaml if it is not in the same folder as the executable.
In the Process tab, set the priority to High. This ensures inference speed isn't throttled by other background tasks.

Now, your API endpoint comes alive as soon as you turn on your computer. Thanks to the --watch-config option, any changes you save to the YAML file are reflected immediately without needing to restart the service.

Preventing Crashes with Flash Attention and Context Limits

Most crashes or chat interruptions during model swaps come down to how memory is budgeted. Inference engines pre-allocate memory based on the context window when they start. If you don't control this, a model that loaded fine yesterday can fail with an out-of-memory error today.

Here are three mechanisms to ensure stability:

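Explicitly set --ctx-size to around 8192 in the cmd field. For Llama 3.1 8B, an fp16 KV cache costs about 128 KiB per token, so 8192 tokens take roughly 1GB, while the model's full 128K context would need about 16GB. Leaving the context unbounded will cause your VRAM to explode.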
Large models take longer to load. Set healthCheckTimeout generously to about 300 seconds so the proxy doesn't drop the connection during loading.
Always enable the --flash-attn option. It reduces the memory overhead of attention, which lets you fit roughly 20% more context within the same VRAM footprint.
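Taken together, the three settings above might be combined as in the sketch below. The model name and paths are hypothetical, and some newer llama.cpp builds expect a value after the flag (for example --flash-attn on), so check what your build accepts.

```yaml
# Top-level setting: seconds to wait for a model to report ready
# before the proxy gives up on it.
healthCheckTimeout: 300

models:
  "llama-3.1-8b":
    cmd: >
      C:/llama.cpp/llama-server.exe
      --port ${PORT}
      --model C:/models/llama-3.1-8b-q4_k_m.gguf
      --ctx-size 8192
      --flash-attn
```

The context cap is set per model in cmd, so a smaller model on the same card can be given a larger --ctx-size if its weights leave more room.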

For an 8B model, the swap completes in about 5 seconds. This is fast enough not to disrupt your workflow. You don't need a high-end workstation; with just a few adjusted settings, you can enjoy a smooth AI environment right at your desk.
