This New Engine Runs Local AI Using 10x Less RAM! (Cactus)

Better Stack

Transcript

00:00:00 This is Cactus. It's a low-latency inference engine designed to treat mobile and edge devices
00:00:06 as first-class citizens. So usually when we try to run AI models on edge devices, they
00:00:12 feel heavy and battery draining and prone to getting killed by the mobile operating system's
00:00:18 memory manager. But Cactus is trying to solve this problem because it's built specifically
00:00:23 for the constraints of the neural processing units and limited RAM. So today we're going
00:00:28 to look at Cactus, see how it works and test it out on an edge device to see how it performs.
00:00:34 So let's dive into it. The biggest bottleneck for local AI isn't actually compute, it's
00:00:44 the memory overhead. On a standard mobile device, the operating system is extremely aggressive
00:00:50 about killing apps that spike in RAM usage. But Cactus solves this by using a zero-copy
00:00:57 memory mapping. Instead of the usual approach where you load everything to RAM, Cactus maps
00:01:02 model weights directly from storage. It's a zero-copy system that only pulls specific
00:01:08 tensors into the active compute cycle as they are needed. You get the reasoning power of
00:01:13 a large model without the risk of the operating system shutting your app down. And to achieve
00:01:19 this, they've even transitioned away from the traditional GGUF format and have their
00:01:24 own proprietary .CACT format that allows this mapping to be effective on edge devices. But
00:01:31 the real heavy lifting happens in the NPU or the neural processing unit. While most local
00:01:37 engines default to the GPU, Cactus is built to be NPU-first. If you've looked at modern
00:01:43 chips from Apple, Qualcomm or MediaTek, they all have dedicated silicon just for neural
00:01:50 networks. Cactus communicates with these units directly, bypassing the usual translation layers
00:01:55 that slow down your inference. And they've actually optimized specific models to take
00:02:00 full advantage of these matrix multiplication units. If you head over to the Cactus dashboard,
00:02:07 you'll see a list of NPU-optimized models ready for download. And another cool feature Cactus
00:02:12 has is the hybrid router. Now, the reality is that on edge devices, local models, no matter
00:02:18 how optimized, eventually hit that reasoning ceiling. And this is where the hybrid router
00:02:23 comes in. Instead of forcing you to choose between a fast but limited local model and
00:02:29 a smart but expensive cloud model, Cactus can handle both and swap between them. It uses
00:02:35 a confidence-based routing system. And if you ask it a simple question, it stays on the
00:02:40 NPU because it's fast, private and costs you nothing. But if the local model senses that
00:02:45 the task is too complex or requires a massive context window, it automatically hands the
00:02:51 specific request off to a frontier model on the cloud. Your code stays the same. The engine
00:02:57 just manages the failover in the background. So it's a production-ready way to keep costs
00:03:03 low without sacrificing the user experience when things get complicated. Now, all that
00:03:08 sounds cool, but I want to try it out for myself. So on their landing page, they have
00:03:13 this demo where they show how you can do a real-time transcription with about 100-millisecond
00:03:19 latency on an edge device. So I went ahead and vibe coded a little Swift app using their
00:03:25 Swift Cactus package that supports running a real-time transcription using their Parakeet
00:03:30 speech model locally and a Gemini model on the cloud. So let's try it out. As you can
00:03:36 see, locally, we are averaging about 260 milliseconds of latency with live streaming. And mind you,
00:03:44 I'm running this on an older iPhone model, the 12 Pro. So for an older model like this
00:03:50 one, I think this performance on edge is pretty good. And if we switch to cloud, Cactus switches
00:03:55 to Gemini 2.5 Flash as the cloud alternative. And for some reason, they don't have the same
00:04:01 Parakeet model on their cloud side. So I was forced to use Gemini. And we can see here that
00:04:06 this is averaging at about 2000 milliseconds for a three-second batch transcription. And
00:04:12 I guess this is to be expected because it is doing a round trip to the data server. But
00:04:17 realistically, most of the time you would end up using the on-edge transcription anyway,
00:04:23 but the cloud option is useful for other tasks like heavy image analysis or something else
00:04:27 that would be a heavier task. So there you have it, folks, that is the Cactus engine in
00:04:33 a nutshell. I think they are doing something really interesting here. I like how they are
00:04:37 thinking about on-edge optimization using a custom NPU-friendly architecture. And I like
00:04:43 the fact that they offer so many SDKs and so many models for all sorts of multimodal tasks.
00:04:50 And I'm really curious to see how their product evolves. So I'll be keeping an eye on their
00:04:54 progress for sure. But what do you folks think about Cactus? Have you tried it? Let us know
00:04:59 in the comment section down below. And folks, if you like these types of breakdowns, please
00:05:03 let me know by smashing that like button underneath the video. And also don't forget to subscribe
00:05:08 to our channel. This has been Andris from Better Stack and I will see you in the next
00:05:13 videos.

Key Takeaway

Cactus reduces RAM pressure on mobile devices, and with it the risk of the operating system killing the app, by memory-mapping model weights directly from storage in its custom .CACT format and running inference NPU-first.

Highlights

  • Cactus is NPU-first: rather than defaulting to the GPU, it targets the neural processing units (NPUs) in chips from Apple, Qualcomm, and MediaTek.

  • The engine uses zero-copy memory mapping to stream model weights directly from storage, eliminating the need to load full models into RAM.

  • Cactus utilizes a proprietary .CACT file format to replace standard GGUF files for efficient edge device memory mapping.

  • A confidence-based hybrid router automatically hands requests from the local NPU model to a cloud frontier model (Gemini 2.5 Flash in the demo) when a task is too complex or needs a very large context window.

  • Real-time local transcription using the Parakeet speech model achieves an average latency of 260 milliseconds on an iPhone 12 Pro.

Timeline

Memory Bottlenecks and Zero-Copy Mapping on Edge Devices

  • Mobile operating systems aggressively terminate applications that cause sudden spikes in RAM usage.
  • The primary bottleneck for local AI deployment on mobile hardware is memory overhead rather than compute power.
  • Zero-copy memory mapping streams specific tensors into the active compute cycle from storage only when needed.

Mobile operating systems readily kill apps whose RAM usage spikes, which makes conventionally loaded local models fragile on phones. Cactus addresses this by moving away from the traditional GGUF format to a proprietary .CACT format. This file structure lets the engine map model weights directly from storage and pull in individual tensors as they are needed, instead of loading the entire model into RAM.
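
The mapping idea can be sketched with Python's standard `mmap` module. This is a minimal illustration, not Cactus code: the file layout below (float32 tensors stored back to back, with a name-to-offset index) is a made-up stand-in, since the video does not describe the internals of .CACT. What it shows is that opening a mapping does not read the whole file, and that a single tensor can be decoded from its byte range on demand.

```python
import mmap
import os
import struct
import tempfile

# Hypothetical toy layout (NOT the real .CACT format): tensors stored
# back to back as little-endian float32 values.
TENSORS = {
    "embed": [0.5, 1.5, 2.5, 3.5],
    "attn.q": [10.0, 20.0],
    "mlp.up": [7.0, 8.0, 9.0],
}


def write_weights(path):
    """Write the toy tensors to disk; return name -> (byte offset, count)."""
    index = {}
    offset = 0
    with open(path, "wb") as f:
        for name, values in TENSORS.items():
            f.write(struct.pack(f"<{len(values)}f", *values))
            index[name] = (offset, len(values))
            offset += 4 * len(values)  # 4 bytes per float32
    return index


def read_tensor(mapped, index, name):
    """Decode one tensor from its byte range inside the mapping."""
    offset, count = index[name]
    # unpack_from reads from the mapped buffer in place; the OS pages in
    # only the regions of the file that are actually touched.
    return list(struct.unpack_from(f"<{count}f", mapped, offset))


def demo():
    fd, path = tempfile.mkstemp(suffix=".bin")
    os.close(fd)
    try:
        index = write_weights(path)
        with open(path, "rb") as f:
            # Length 0 maps the whole file, read-only, without loading it.
            with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mapped:
                return read_tensor(mapped, index, "attn.q")
    finally:
        os.remove(path)
```

Calling `demo()` returns the two values of `attn.q` while the other tensors are never decoded. Decoding into a Python list does copy that one slice; a real engine would presumably hand the mapped bytes to its compute kernels directly.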

NPU Optimization and Direct Silicon Communication

  • Cactus prioritizes the neural processing unit over the graphics processing unit for local inference.
  • Direct communication with dedicated silicon bypasses traditional translation layers to accelerate performance.
  • The Cactus dashboard provides specific models pre-optimized for mobile matrix multiplication units.

Most local inference engines default to the GPU. Cactus instead targets the dedicated neural silicon in modern chips from Apple, Qualcomm, and MediaTek. According to the video, it communicates with these units directly, skipping the translation layers that add software overhead, and it ships models optimized to take full advantage of their matrix multiplication units.

Hybrid Routing and Cloud Failover Mechanics

  • Local edge models encounter a definitive reasoning ceiling when handling complex tasks or massive context windows.
  • A confidence-based routing system automatically shifts execution between local hardware and cloud servers.
  • The background failover mechanism manages model switching without requiring changes to the underlying application code.

Simple queries remain on the local NPU to maintain user privacy, eliminate hosting costs, and ensure low latency. When the engine detects a task that exceeds local capabilities, it seamlessly hands the request off to a frontier model in the cloud. This dual-layer approach maintains the user experience during complex operations while keeping overall production costs low.
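
A confidence-based router of this kind could look roughly like the sketch below. Everything in it is hypothetical: `make_router`, the 0.7 threshold, the word-count limit, and the stub models are illustrative names and values rather than the Cactus SDK, and the video does not say how Cactus actually estimates confidence.

```python
from dataclasses import dataclass


@dataclass
class RoutedReply:
    text: str
    source: str  # "local" or "cloud"


def make_router(local_model, cloud_model, min_confidence=0.7, max_local_words=2048):
    """Build a route(prompt) function.

    local_model(prompt) -> (text, confidence in [0, 1])
    cloud_model(prompt) -> text
    Both callables and both thresholds are assumptions for illustration.
    """

    def route(prompt):
        # Prompts too long for the local context go straight to the cloud.
        if len(prompt.split()) > max_local_words:
            return RoutedReply(cloud_model(prompt), "cloud")
        text, confidence = local_model(prompt)
        if confidence >= min_confidence:
            return RoutedReply(text, "local")
        # Low confidence: fall back to the cloud model for this request only.
        return RoutedReply(cloud_model(prompt), "cloud")

    return route


def fake_local(prompt):
    # Stand-in model: pretends that short prompts are easy.
    confidence = 0.9 if len(prompt.split()) <= 5 else 0.3
    return f"local:{prompt}", confidence


def fake_cloud(prompt):
    # Stand-in for a network call to a hosted model.
    return f"cloud:{prompt}"


route = make_router(fake_local, fake_cloud)
```

The caller only ever uses `route(prompt)`, which mirrors the video's point that application code stays the same while the engine decides where each request runs.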

Local and Cloud Latency Benchmarks on Mobile Hardware

  • Local real-time transcription averages 260 milliseconds of latency using the Parakeet speech model on an iPhone 12 Pro.
  • Cloud-based transcription via Gemini 2.5 Flash averages 2000 milliseconds for a three-second batch process.
  • The presenter attributes the higher cloud latency to the network round trip to the data server.

In the presenter's test app, built with the Swift Cactus package, on-device processing is clearly faster than the cloud path. The local Parakeet model streams transcriptions at about 260 milliseconds of latency on an iPhone 12 Pro, while the cloud path takes about 2000 milliseconds per three-second batch. The cloud side used Gemini 2.5 Flash because Parakeet was not available there. The presenter expects most transcription to stay on the device and sees the cloud option as useful for heavier tasks such as image analysis.
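
The two figures are not directly comparable, because one is a streaming latency and the other a per-batch processing time. The arithmetic below is a rough sketch of what they imply for a user, assuming that "three-second batch" means the app collects 3 seconds of audio before sending it; the video does not spell this out, and the function names are illustrative.

```python
# Figures quoted in the video.
LOCAL_STREAM_MS = 260       # average local streaming latency
CLOUD_PROCESSING_MS = 2000  # average cloud time per batch
CLOUD_BATCH_MS = 3000       # assumed audio length of one batch


def realtime_factor(processing_ms, audio_ms):
    """Processing time divided by audio duration; below 1.0 keeps up with speech."""
    return processing_ms / audio_ms


def worst_case_delay_ms(processing_ms, batch_ms):
    """Delay for a word spoken at the very start of a batch:
    it waits for the batch to fill, then for the batch to be processed."""
    return batch_ms + processing_ms


cloud_rtf = realtime_factor(CLOUD_PROCESSING_MS, CLOUD_BATCH_MS)
cloud_worst = worst_case_delay_ms(CLOUD_PROCESSING_MS, CLOUD_BATCH_MS)
```

Under that assumption the cloud path still keeps up with speech (a real-time factor of about 0.67), but a word can take up to 5000 milliseconds to appear, against roughly 260 milliseconds locally.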
