Transcript
00:00:00This is OMLX. It's a very exciting project, which is essentially a specialized inference
00:00:06engine designed to squeeze every last drop of performance out of your Apple Silicon.
00:00:11If you're a Mac user, you're going to be very excited about this one. OMLX is essentially
00:00:16attempting to solve the biggest bottleneck we have on local hardware, which is the memory tax.
00:00:21In this video, we'll take a look at OMLX, see how it works, and do a little test run to compare
00:00:27it with one of the heavyweights, LM Studio, to see if this new tool can really be the future
00:00:33of running local AI models on your Mac. It's going to be a lot of fun, so let's dive into it.
00:00:39So what exactly is OMLX? At its core, it's a runtime built specifically on top of Apple's
00:00:49MLX framework. Unlike generalist tools that try to support every GPU under the sun,
00:00:55MLX is purpose-built by Apple's machine learning research team to exploit the unified memory architecture that
00:01:02powers Macs specifically. In a traditional PC, your CPU and your GPU have separate memory pools,
00:01:09meaning data like your model's weights has to be copied back and forth over the PCIe bus.
00:01:16But MLX eliminates that copying entirely. Because the CPU and GPU share the exact same physical
00:01:22memory, MLX uses zero copy arrays. When the GPU finishes a calculation, the CPU can read the
00:01:29results instantly without moving a single byte. It also uses lazy computation, meaning it doesn't
00:01:36actually execute a math operation until the absolute last second when the output is needed,
00:01:41which allows it to optimize the entire calculation graph on the fly. But where OMLX differs from your
00:01:47standard LM Studio setup is how it manages the KV cache. In a typical LLM session, every word of your
00:01:54conversation history has to be remembered in your expensive RAM. But OMLX introduces a two-tier
00:02:01system. It keeps the immediate context in your unified memory for speed, but it freezes the
00:02:07older parts of your conversation, those massive system prompts and tool definitions, and swaps
00:02:12them onto your SSD. And when you compare this to LM Studio, the difference is immediate. Yes,
00:02:19LM Studio is incredibly stable and compatible, but the problem is that it wants to hold on to the entire
00:02:23memory history in a hot state. OMLX is more like a modern operating system. It's smart enough to know
00:02:30what data needs to be in your brain right now and what can be paged to disk. So let's spin up OMLX
00:02:36and try it out for ourselves. The interface is quite intuitive. Right off the bat, we get this
00:02:41window where we can specify our desired location for our server and launch it right away. After
00:02:47that, we get prompted to provide an API key. So let's do that. And finally, we land on this
00:02:53dashboard, which is the main entry point for your OMLX server. And from here, I went ahead and
00:03:00downloaded the Qwen 3.6 35-billion-parameter 4-bit model, which we will use for our tests.
00:03:07I have also set up this empty repository with an AGENTS.md file where I will ask the model
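As a rough back-of-the-envelope check on why a 4-bit build matters on a laptop, the sketch below estimates the size of the weights alone. It is an approximation under stated assumptions: it counts only parameters times bits per weight, uses decimal gigabytes, and ignores quantization metadata, the KV cache, and activations. The function name is hypothetical.

```python
def weight_footprint_gb(params: float, bits_per_weight: float) -> float:
    """Approximate size of the model weights alone, in decimal gigabytes."""
    bytes_total = params * bits_per_weight / 8  # 8 bits per byte
    return bytes_total / 1e9


# 35 billion parameters at 4 bits per weight (the model used in this video)
print(weight_footprint_gb(35e9, 4))   # 17.5

# The same parameter count at 16 bits per weight, for comparison
print(weight_footprint_gb(35e9, 16))  # 70.0
```

Whatever memory is left after the weights is what the KV cache and the rest of the system have to share, which is why cache management becomes the deciding factor.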
00:03:13to create a simple web app where you can search for different movies, wishlist them and rate them
00:03:19using your movie DB API key. Nothing too fancy for this demonstration, just a simple coding test
00:03:24to see how it might handle a real-world coding task. And on the dashboard page,
00:03:31we get a section which provides us with ready-to-use code snippets for different AI agent harnesses
00:03:37that we can run. And for this demo, I will be using the Codex CLI to conduct these tests.
00:03:42Now, you might be wondering why I'm not just using the official Claude Code CLI for this. Well,
00:03:47the reality is that on a MacBook M2, every token counts. And if you look at Claude Code's context stats
00:03:54right out of the gate on a totally blank slate, Claude Code eats up about 16.2K tokens just for its own
00:04:02system prompts and tool definitions. And in a 32K window, this leaves us with only 16K tokens for
00:04:09the actual project, which is tiny when you're building a full stack application. But on the
00:04:14other hand, I found that Codex is much leaner. It doesn't bloat the base weight of the conversation,
00:04:20which gives us a more generous runway to actually write code before we hit that context ceiling.
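The context budget described here is simple subtraction, sketched below. One assumption is labeled in the code: "32K" is taken to mean 32,768 tokens. The 16.2K overhead figure is the one quoted for Claude Code; no overhead figure for Codex is given in the video, so none is used. The function name is hypothetical.

```python
CONTEXT_WINDOW = 32 * 1024  # assumption: "32K" means 32,768 tokens


def remaining_tokens(window: int, harness_overhead: int) -> int:
    """Tokens left for the project after the harness's own prompt and tools."""
    return max(window - harness_overhead, 0)


claude_code_overhead = 16_200  # about 16.2K tokens, as quoted in the video
left = remaining_tokens(CONTEXT_WINDOW, claude_code_overhead)

print(left)                            # 16568
print(f"{left / CONTEXT_WINDOW:.0%}")  # 51%
```

Roughly half of the window is gone before the first line of project code is read, which is the "16K left" figure mentioned above.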
00:04:26All right, so now I'm going to launch Codex with this simple command provided here.
00:04:31And then I'm going to give it a simple startup prompt explaining our task and get it going.
00:04:36And as it's cooking here on the right, you can see in real time how this session is performing,
00:04:42how many tokens are being produced, how many of them are being cached,
00:04:46and the overall cache efficiency percentage. And it's also very handy to see how many tokens on
00:04:51average are processed in a second. Now, overall, it took roughly 20 minutes for this 35 billion
00:04:57parameter Qwen 3.6 model running on my M2 MacBook Pro to get through this task. And this is to be
00:05:04expected because this is a very heavy undertaking for this model. Now, there were two or three
00:05:10instances where I hit a 400 error because the prompt exceeded the 32K context limit on my
00:05:17M2 MacBook. In any other tool, that would be a total project killer. Normally, if I ran
00:05:24/clear, it would wipe the AI's short-term memory, often leading to hallucinations because the model
00:05:29forgets the code it literally just wrote. But this is where OMLX's persistent SSD caching blew me away.
00:05:37Even though I cleared the session in Codex, the actual computational state of my project
00:05:42was still sitting on my SSD. So the moment I gave Codex a new prompt to continue where it left off,
00:05:48OMLX recognized the prefix and instantly hydrated the model's brain from the disk. And instead of
00:05:56hallucinating or starting from scratch, it picked up right where it left off. So the cache efficiency
00:06:02really helps in this case. And by the end of this task, we can see here that Qwen 3.6 with the help of
00:06:08OMLX was able to get through the task by churning out 1.78 million tokens, and roughly 1.59 million
00:06:16of them were cached. So we ended up with an 89% cache efficiency, which is pretty massive. And for
00:06:22the app itself, it looks quite decent. We're able to search for movies, add them to our watch list,
00:06:28and rate them. But once you refresh the page, the watch list resets. So I'm guessing it didn't
00:06:33implement the database storage solution properly, but solid effort overall nonetheless. Now this
00:06:40all looks impressive, but I wanted to find out how this performance stacks up against a heavyweight
00:06:46model runner like LM Studio. So I decided to run the same task with the same Qwen 3.6 model
00:06:52using the same context window and constraints and see how it performs. And honestly, I wasn't
00:06:58expecting this, but I actually got worse performance on LM Studio. The task itself
00:07:04took roughly 35 minutes to finish. That's already 15 minutes more than on OMLX. And I also noticed
00:07:11that while running this task, LM Studio was using every last drop of my MacBook's resources. So much so that I
00:07:17couldn't even watch a video on a second monitor because it was lagging due to severe RAM shortage.
00:07:23Now I did not have the same problem with OMLX because when running this on OMLX, I was easily
00:07:30able to browse the web, watch videos, or do any other task while Codex was still running in the
00:07:35background. But this was nearly impossible to do on LM Studio. And look at these stats. What shocked
00:07:41me even more is that the average generation speed on LM Studio was 16 tokens per second. And on
00:07:47OMLX, it was roughly 47. So that actually explains why the task took 15 minutes longer to finish.
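The figures quoted in these two runs can be sanity-checked with a few divisions. This is only arithmetic on the numbers read out in the video (token counts, tokens per second, and wall-clock minutes); the variable names are made up for illustration.

```python
# OMLX run: total tokens and how many were served from cache
total_tokens = 1.78e6
cached_tokens = 1.59e6
cache_efficiency = cached_tokens / total_tokens
print(f"{cache_efficiency:.1%}")           # 89.3%

# Average generation speed in tokens per second
omlx_tps, lmstudio_tps = 47, 16
print(f"{omlx_tps / lmstudio_tps:.2f}x")   # 2.94x

# Wall-clock time for the whole task, in minutes
omlx_min, lmstudio_min = 20, 35
print(f"{lmstudio_min / omlx_min:.2f}x")   # 1.75x
```

The throughput gap (about 2.9x) is larger than the wall-clock gap (1.75x). The video does not break this down, but total task time also includes things other than raw generation, such as prompt processing and the restarts after the 400 errors, so the two ratios need not match.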
00:07:55However, I do have to give credit where credit is due. LM Studio did not throw a single 400 error
00:08:01due to context limit bottlenecks the way OMLX did. So the context management on LM Studio is very stable and
00:08:08running perfectly. And if we look at the final result, it was very similar. I didn't have any
00:08:13fancy animations this time, but honestly, this feels like comparing the same output with different
00:08:18seed values for the same task on the same model. So I'm not going to jump to any conclusions here.
00:08:25It's the same Qwen 3.6 model. You can judge Qwen's output here for yourselves. So what is the
00:08:33final verdict? Well, I must say I am very, very impressed with OMLX's performance. If you're on a
00:08:39MacBook with limited RAM and you want to actually use your computer while running a local AI agent in
00:08:45the background, then OMLX is a perfect tool for that. It effectively gives you a RAM extension by
00:08:52utilizing your high speed SSD combined with that sweet MLX framework that lets us run models more
00:08:58smoothly on Apple Silicon. But yes, the occasional 400 error means that you will have to be more
00:09:05hands-on with it and maybe do a clear command once in a while. But that is the trade-off you get for a
00:09:10three times faster generation speed. But I think it is well worth it in this case. So these kinds
00:09:16of projects like OMLX are proving that we don't necessarily need 128 gigabytes of RAM to run
00:09:23powerful agents. We just need a smarter way to manage the memory we already have on our MacBooks.
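The "smarter way to manage memory" idea from this video, a hot tier in RAM plus a cold tier on SSD that is looked up by prompt prefix, can be pictured with a toy sketch. This is not OMLX's actual implementation: the class and method names are hypothetical, plain dictionaries stand in for real KV tensors, and `pickle` files in a temporary directory stand in for the SSD cache.

```python
import hashlib
import pickle
import tempfile
from collections import OrderedDict
from pathlib import Path
from typing import Sequence


class TwoTierCache:
    """Toy hot/cold cache: a small in-memory LRU that spills to disk."""

    def __init__(self, cold_dir: Path, hot_capacity: int = 2) -> None:
        self.hot = OrderedDict()          # key -> state, oldest first
        self.hot_capacity = hot_capacity
        self.cold_dir = cold_dir
        self.cold_dir.mkdir(parents=True, exist_ok=True)

    @staticmethod
    def key_for(prefix_tokens: Sequence[int]) -> str:
        # Identical token prefixes hash to the same key, so a repeated
        # system prompt maps back to the same cache entry.
        data = ",".join(map(str, prefix_tokens)).encode()
        return hashlib.sha256(data).hexdigest()

    def put(self, prefix_tokens: Sequence[int], state) -> None:
        key = self.key_for(prefix_tokens)
        self.hot[key] = state
        self.hot.move_to_end(key)
        # Evict least recently used entries to disk instead of dropping them.
        while len(self.hot) > self.hot_capacity:
            old_key, old_state = self.hot.popitem(last=False)
            (self.cold_dir / old_key).write_bytes(pickle.dumps(old_state))

    def get(self, prefix_tokens: Sequence[int]):
        key = self.key_for(prefix_tokens)
        if key in self.hot:
            self.hot.move_to_end(key)
            return self.hot[key], "hot"
        path = self.cold_dir / key
        if path.exists():
            state = pickle.loads(path.read_bytes())
            self.put(prefix_tokens, state)  # promote back into the hot tier
            return state, "cold"
        return None, "miss"


with tempfile.TemporaryDirectory() as tmp:
    cache = TwoTierCache(Path(tmp), hot_capacity=1)
    system_prompt = [1, 2, 3]
    cache.put(system_prompt, {"kv": "state for system prompt"})
    cache.put([1, 2, 3, 4], {"kv": "state for longer prefix"})  # evicts the first
    print(cache.get(system_prompt)[1])  # cold
    print(cache.get(system_prompt)[1])  # hot
    print(cache.get([9, 9])[1])         # miss
```

The first lookup finds the state on disk and promotes it, the second is served from memory, and an unseen prefix misses entirely. A real engine would also need partial-prefix matching and size-based eviction, which this sketch leaves out.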
00:09:29And we actually ran a survey a few months ago and found out that most of our viewers are Mac users.
00:09:34So I'm actually curious to find out. Have you tried OMLX on your own machines? What has been the
00:09:40experience so far? Let us know in the comments section down below. So there you have it folks.
00:09:45That is OMLX in a nutshell. And folks, if you like these types of technical breakdowns, please let me
00:09:50know by smashing that like button underneath the video. And also don't forget to subscribe to our
00:09:55channel. This has been Andris from Better Stack and I will see you in the next video.