Transcript
00:00:00 This is Cactus. It's a low-latency inference engine designed to treat mobile and edge devices
00:00:06 as first-class citizens. So usually when we try to run AI models on edge devices, they
00:00:12 feel heavy, battery-draining, and prone to getting killed by the mobile operating system's
00:00:18 memory manager. But Cactus is trying to solve this problem because it's built specifically
00:00:23 for the constraints of neural processing units and limited RAM. So today we're going
00:00:28 to look at Cactus, see how it works, and test it out on an edge device to see how it performs.
00:00:34 So let's dive into it. The biggest bottleneck for local AI isn't actually compute, it's
00:00:44 the memory overhead. On a standard mobile device, the operating system is extremely aggressive
00:00:50 about killing apps that spike in RAM usage. But Cactus solves this by using zero-copy
00:00:57 memory mapping. Instead of the usual approach where you load everything into RAM, Cactus maps
00:01:02 model weights directly from storage. It's a zero-copy system that only pulls specific
00:01:08 tensors into the active compute cycle as they are needed. You get the reasoning power of
00:01:13 a large model without the risk of the operating system shutting your app down. And to achieve
00:01:19 this, they've even transitioned away from the traditional GGUF format and have their
00:01:24 own proprietary .CACT format that allows this mapping to be effective on edge devices. But
00:01:31 the real heavy lifting happens in the NPU, or neural processing unit. While most local
00:01:37 engines default to the GPU, Cactus is built to be NPU-first. If you've looked at modern
00:01:43 chips from Apple, Qualcomm, or MediaTek, they all have dedicated silicon just for neural
00:01:50 networks. Cactus communicates with these units directly, bypassing the usual translation layers
00:01:55 that slow down your inference. And they've actually optimized specific models to take
00:02:00 full advantage of these matrix multiplication units. If you head over to the Cactus dashboard,
00:02:07 you'll see a list of NPU-optimized models ready for download. And another cool feature Cactus
00:02:12 has is the hybrid router. Now, the reality is that on edge devices, local models, no matter
00:02:18 how optimized, eventually hit that reasoning ceiling. And this is where the hybrid router
00:02:23 comes in. Instead of forcing you to choose between a fast but limited local model and
00:02:29 a smart but expensive cloud model, Cactus can handle both and swap between them. It uses
00:02:35 a confidence-based routing system. If you ask it a simple question, it stays on the
00:02:40 NPU because it's fast, private, and costs you nothing. But if the local model senses that
00:02:45 the task is too complex or requires a massive context window, it automatically hands the
00:02:51 specific request off to a frontier model in the cloud. Your code stays the same. The engine
00:02:57 just manages the failover in the background. So it's a production-ready way to keep costs
00:03:03 low without sacrificing the user experience when things get complicated. Now, all that
00:03:08 sounds cool, but I want to try it out for myself. So on their landing page, they have
00:03:13 this demo where they show how you can do real-time transcription with about 100-millisecond
00:03:19 latency on an edge device. So I went ahead and vibe coded a little Swift app using their
00:03:25 Swift Cactus package that supports running real-time transcription using their Parakeet
00:03:30 speech model locally and a Gemini model in the cloud. So let's try it out. As you can
00:03:36 see, locally, we are averaging about 260 milliseconds of latency with live streaming. And mind you,
00:03:44 I'm running this on an older iPhone model, the 12 Pro. So for an older model like this
00:03:50 one, I think this performance on edge is pretty good. And if we switch to cloud, Cactus switches
00:03:55 to Gemini 2.5 Flash as the cloud alternative. And for some reason, they don't have the same
00:04:01 Parakeet model on their cloud side. So I was forced to use Gemini. And we can see here that
00:04:06 this is averaging about 2,000 milliseconds for a three-second batch transcription. And
00:04:12 I guess this is to be expected because it is doing a round trip to the data server. But
00:04:17 realistically, most of the time you would end up using the on-edge transcription anyway,
00:04:23 but the cloud option is useful for other tasks like heavy image analysis or something else
00:04:27 that would be a heavier task. So there you have it, folks, that is the Cactus engine in
00:04:33 a nutshell. I think they are doing something really interesting here. I like how they are
00:04:37 thinking about on-edge optimization using a custom NPU-friendly architecture. And I like
00:04:43 the fact that they offer so many SDKs and so many models for all sorts of multimodal tasks.
00:04:50 And I'm really curious to see how their product evolves. So I'll be keeping an eye on their
00:04:54 progress for sure. But what do you folks think about Cactus? Have you tried it? Let us know
00:04:59 in the comment section down below. And folks, if you like these types of breakdowns, please
00:05:03 let me know by smashing that like button underneath the video. And also don't forget to subscribe
00:05:08 to our channel. This has been Andris from Better Stack, and I will see you in the next
00:05:13 videos.
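
Two mechanisms described in the transcript, the zero-copy weight mapping (around 00:00:57) and the confidence-based hybrid router (around 00:02:35), can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not how Cactus itself is implemented: the names write_toy_weights, MappedWeights and route, the 0.7 confidence threshold, and the word-count limit standing in for a context window are all hypothetical, and the toy file layout is not the .CACT format. The mapping part relies on Python's standard mmap module and memoryview.

```python
import mmap
import os
import struct
import tempfile

FLOAT_SIZE = struct.calcsize("f")  # 4 bytes on common platforms


# --- Zero-copy weight access (toy layout, NOT the real .CACT format) ---

def write_toy_weights(path, tensors):
    """Write float32 tensors back to back; return {name: (byte_offset, count)}."""
    index, offset = {}, 0
    with open(path, "wb") as f:
        for name, values in tensors.items():
            f.write(struct.pack(f"{len(values)}f", *values))
            index[name] = (offset, len(values))
            offset += FLOAT_SIZE * len(values)
    return index


class MappedWeights:
    """Maps the file read-only and hands out views into the mapping, not copies."""

    def __init__(self, path, index):
        self._file = open(path, "rb")
        self._map = mmap.mmap(self._file.fileno(), 0, access=mmap.ACCESS_READ)
        self._index = index

    def tensor(self, name):
        offset, count = self._index[name]
        end = offset + FLOAT_SIZE * count
        # Slicing and casting a memoryview creates no copy; the OS pages
        # the bytes in only when they are first touched.
        return memoryview(self._map)[offset:end].cast("f")

    def close(self):
        # Callers must release() any tensor views before closing the mapping.
        self._map.close()
        self._file.close()


# --- Confidence-based routing (names and thresholds are illustrative) ---

def route(prompt, local_model, cloud_model, min_confidence=0.7, max_local_words=512):
    """Answer locally when possible; hand off on long prompts or low confidence."""
    if len(prompt.split()) > max_local_words:
        return "cloud", cloud_model(prompt)
    answer, confidence = local_model(prompt)
    if confidence >= min_confidence:
        return "local", answer
    return "cloud", cloud_model(prompt)


if __name__ == "__main__":
    path = os.path.join(tempfile.mkdtemp(), "toy_weights.bin")
    index = write_toy_weights(path, {"embed": [0.5, 1.5], "head": [2.0, 3.0, 4.0]})
    weights = MappedWeights(path, index)
    head = weights.tensor("head")  # only this tensor's bytes are touched
    print(list(head))
    head.release()
    weights.close()

    # Stand-in models: the local one returns (answer, confidence).
    local = lambda prompt: ("4", 0.95) if "2 + 2" in prompt else ("not sure", 0.30)
    cloud = lambda prompt: "answer from the cloud model"
    print(route("What is 2 + 2?", local, cloud))
    print(route("Summarize this legal contract", local, cloud))
```

The caller only ever invokes route, so which backend answers is decided inside the function; this mirrors the "your code stays the same" point in the transcript. A real engine would presumably derive confidence from the model's own output rather than from a hard-coded stub.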