Why Every Mac User Needs This New AI Model Runner (oMLX)

Better Stack
Computers/Software, Consumer Electronics/Cameras, AI/Future Technology

Transcript

00:00:00This is OMLX. It's a very exciting project, which is essentially a specialized inference
00:00:06engine designed to squeeze every last drop of performance out of your Apple Silicon.
00:00:11If you're a Mac user, you're going to be very excited about this one. OMLX is essentially
00:00:16attempting to solve the biggest bottleneck we have on local hardware, which is the memory tax.
00:00:21In this video, we'll take a look at OMLX, see how it works and we'll do a little test run and compare
00:00:27it with one of the heavyweights, LM Studio, to see if this new tool can really be the future
00:00:33of running local AI models on your Mac. It's going to be a lot of fun, so let's dive into it.
00:00:39So what exactly is OMLX? At its core, it's a runtime built specifically on top of Apple's
00:00:49MLX framework and unlike generalist tools that try to support every GPU under the sun,
00:00:55MLX is purpose-built by Apple's machine learning research team to exploit the unified memory architecture that
00:01:02powers Macs specifically. In a traditional PC, your CPU and your GPU have separate memory pools,
00:01:09meaning data like your model's weights have to be constantly copied back and forth over the PCIe bus.
00:01:16But MLX eliminates that copying entirely. Because the CPU and GPU share the exact same physical
00:01:22memory, MLX uses zero-copy arrays. When the GPU finishes a calculation, the CPU can read the
00:01:29results instantly without moving a single byte. It also uses lazy computation, meaning it doesn't
00:01:36actually execute a math operation until the absolute last second when the output is needed,
00:01:41which allows it to optimize the entire calculation graph on the fly. But where OMLX differs from your
00:01:47standard LM Studio setup is how it manages the KV cache. In a typical LLM session, every word of your
00:01:54conversation history has to be remembered in your expensive RAM. But OMLX introduces a two-tier
00:02:01system. It keeps the immediate context in your unified memory for speed, but it freezes the
00:02:07older parts of your conversation, those massive system prompts and tool definitions, and swaps
00:02:12them onto your SSD. And when you compare this to LM Studio, the difference is immediate. Yes,
00:02:19LM Studio is incredibly stable and compatible, but the problem is that it wants to hold on to the entire
00:02:23memory history in a hot state. OMLX is more like a modern operating system. It's smart enough to know
00:02:30what data needs to be in your brain right now and what can be paged to disk. So let's spin up OMLX
00:02:36and try it out for ourselves. The interface is quite intuitive. Right off the bat, we get this
00:02:41window where we can specify our desired location for our server and launch it right away. After
00:02:47that, we get prompted to provide an API key. So let's do that. And finally, we land on this
00:02:53dashboard, which is the main entry point for your OMLX server. And from here, I went ahead and
00:03:00downloaded the Qwen 3.6 35-billion-parameter 4-bit model, which we will use for our tests.
00:03:07I have also set up this empty repository with an AGENTS.md file where I will ask the model
00:03:13to create a simple web app where you can search for different movies, wishlist them and rate them
00:03:19using your Movie DB API key. Nothing too fancy for this demonstration, just a simple coding test
00:03:24to see how it might perform on a real-world coding task. And on the dashboard page,
00:03:31we get a section that provides us with ready-to-use code snippets for different AI agent harnesses
00:03:37that we can run. And for this demo, I will be using the Codex CLI to conduct these tests.
00:03:42Now, you might be wondering why I'm not just using the official Claude Code CLI for this. Well,
00:03:47the reality is that on a MacBook M2, every token counts. And if you look at Claude Code's context stats
00:03:54right out of the gate on a totally blank slate, Claude Code eats up about 16.2K tokens just for its own
00:04:02system prompts and tool definitions. And in a 32K window, this leaves us with only 16K tokens for
00:04:09the actual project, which is tiny when you're building a full stack application. But on the
00:04:14other hand, I found that Codex is much leaner. It doesn't bloat the base weight of the conversation,
00:04:20which gives us a more generous runway to actually write code before we hit that context ceiling.
00:04:26All right, so now I'm going to launch Codex with this simple command provided here.
00:04:31And then I'm going to give it a simple startup prompt explaining our task and get it going.
00:04:36And as it's cooking here on the right, you can see in real time how this session is performing,
00:04:42how many tokens are being produced, how many of them are being cached,
00:04:46and the overall cache efficiency percentage. And it's also very handy to see how many tokens on
00:04:51average are processed in a second. Now, overall, it took roughly 20 minutes for this 35-billion-parameter
00:04:57Qwen 3.6 model running on my M2 MacBook Pro to get through this task. And this is to be
00:05:04expected because this is a very heavy undertaking for this model. Now, there were two or three
00:05:10instances where I hit a 400 error because the prompt exceeded the 32K context limit on my
00:05:17M2 MacBook. In any other tool, it would be a total project killer. And normally, if I ran
00:05:24/clear, it would wipe the AI's short-term memory, often leading to hallucinations because the model
00:05:29forgets the code it literally just wrote. But this is where OMLX's persistent SSD caching blew me away.
00:05:37Even though I cleared the session in Codex, the actual computational state of my project
00:05:42was still sitting on my SSD. So the moment I gave Codex a new prompt to continue where it left off,
00:05:48OMLX recognized the prefix and instantly hydrated the model's brain from the disk. And instead of
00:05:56hallucinating or starting from scratch, it picked up right where it left off. So the cache efficiency
00:06:02really helps in this case. And by the end of this task, we can see here that Qwen 3.6 with the help of
00:06:08OMLX was able to get through the task by churning out 1.78 million tokens, and roughly 1.59 million
00:06:16of them were cached. So we ended up with an 89% cache efficiency, which is pretty massive. And for
00:06:22the app itself, it looks quite decent. We're able to search for movies, add them to our watch list,
00:06:28and rate them. But once you refresh the page, the watch list resets. So I'm guessing it didn't
00:06:33implement the database storage solution properly, but solid effort overall nonetheless. Now this
00:06:40all looks impressive, but I wanted to find out how this performance stacks up against a heavyweight
00:06:46model runner like LM Studio. So I decided to run the same task with the same Qwen 3.6 model
00:06:52using the same context window and constraints and see how it performs. And honestly, I wasn't
00:06:58expecting this, but I actually got worse performance on LM Studio. So the task itself
00:07:04took roughly 35 minutes to finish. That's already 15 minutes more than on OMLX. And I also noticed
00:07:11that while running this task, LM Studio was using every last drop of juice in my MacBook. So much so that I
00:07:17couldn't even watch a video on a second monitor because it was lagging due to severe RAM shortage.
00:07:23Now I did not have the same problem with OMLX because when running this on OMLX, I was easily
00:07:30able to browse the web, watch videos, or do any other task while Codex was still running in the
00:07:35background. But this was nearly impossible to do on LM Studio. And look at these stats. What shocked
00:07:41me even more is that the average tokens-per-second speed on LM Studio was 16 tokens per second. And on
00:07:47OMLX, it was roughly 47. So that actually explains why the task took 15 minutes longer to finish.
00:07:55However, I do have to give credit where credit is due. LM Studio did not throw a single 400 error
00:08:01due to context limit bottlenecks like OMLX did. So the context management on LM Studio is very stable and
00:08:08running perfectly. And if we look at the final result, it was very similar. I didn't have any
00:08:13fancy animations this time, but honestly, this feels like comparing the same output with different
00:08:18seed values for the same task on the same model. So I'm not going to jump to any conclusions here.
00:08:25It's the same Qwen 3.6 model. You can judge the Qwen model's output here for yourselves. So what is the
00:08:33final verdict? Well, I must say I am very, very impressed with OMLX's performance. If you're on a
00:08:39MacBook with limited RAM and you want to actually use your computer while running a local AI agent in
00:08:45the background, then OMLX is a perfect tool for that. It effectively gives you a RAM extension by
00:08:52utilizing your high-speed SSD combined with that sweet MLX framework that lets us run models more
00:08:58smoothly on Apple Silicon. But yes, the occasional 400 error means that you will have to be more
00:09:05hands-on with it and maybe do a clear command once in a while. But that is the trade-off you get for
00:09:10roughly three times faster generation speed. But I think it is well worth it in this case. So these kinds
00:09:16of projects like OMLX are proving that we don't necessarily need 128 gigabytes of RAM to run
00:09:23powerful agents. We just need a smarter way to manage the memory we already have on our MacBooks.
00:09:29And we actually ran a survey a few months ago and found out that most of our viewers are Mac users.
00:09:34So I'm actually curious to find out. Have you tried OMLX on your own machines? What has been the
00:09:40experience so far? Let us know in the comments section down below. So there you have it folks.
00:09:45That is OMLX in a nutshell. And folks, if you like these types of technical breakdowns, please let me
00:09:50know by smashing that like button underneath the video. And also don't forget to subscribe to our
00:09:55channel. This has been Andris from Better Stack and I will see you in the next videos.

Key Takeaway

oMLX outperforms generalist runners on Apple Silicon by leveraging the MLX framework and SSD-based context swapping to deliver three times faster token generation while maintaining system responsiveness on low-RAM MacBooks.

Highlights

  • oMLX utilizes a two-tier KV cache system that keeps immediate context in unified memory while swapping older conversation history to the SSD.

  • The MLX framework achieves high performance on Apple Silicon by using zero-copy arrays that allow the CPU and GPU to share physical memory without data transfer overhead.

  • In a direct coding test using a 35B parameter model on an M2 MacBook Pro, oMLX finished the task in 20 minutes compared to 35 minutes for LM Studio.

  • oMLX reached an average generation speed of 47 tokens per second, which is nearly triple the 16 tokens per second observed in LM Studio.

  • An 89% cache efficiency was achieved by oMLX during a full-stack web application development task, processing 1.78 million tokens with 1.59 million cached.

  • Claude Code consumes 16.2K tokens for system prompts and tool definitions, leaving only 16K of a 32K context window for the actual project code.

Timeline

Native Inference Optimization via Apple MLX

  • oMLX is a specialized inference engine built specifically for Apple Silicon hardware.
  • The underlying MLX framework eliminates data copying between CPU and GPU through zero-copy arrays.
  • Lazy computation optimizes the calculation graph by delaying math operations until the output is required.

Traditional PC architectures are hindered by the memory tax, where data must move across the PCIe bus between separate CPU and GPU memory pools. Apple Silicon's unified memory architecture allows oMLX to treat memory as a single resource. This integration enables the system to handle model weights more efficiently than generalist tools designed for diverse hardware.
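The lazy computation idea can be illustrated with a toy graph in plain Python. This is only a sketch of the general technique, not MLX's implementation or API: the `Lazy` class, the `const` helper, and the `calls` log are hypothetical names invented for this example.

```python
from dataclasses import dataclass, field
from typing import Callable, Tuple


@dataclass
class Lazy:
    """A node in a toy computation graph; nothing runs until eval() is called."""
    op: Callable[..., float]
    inputs: Tuple["Lazy", ...] = ()
    _cache: list = field(default_factory=list)  # holds the value once computed

    def eval(self) -> float:
        # Compute inputs first, then this node, and remember the result.
        if not self._cache:
            args = [node.eval() for node in self.inputs]
            self._cache.append(self.op(*args))
        return self._cache[0]

    def __add__(self, other: "Lazy") -> "Lazy":
        return Lazy(lambda a, b: a + b, (self, other))

    def __mul__(self, other: "Lazy") -> "Lazy":
        return Lazy(lambda a, b: a * b, (self, other))


calls = []  # records which leaf values were actually materialized


def const(value: float) -> Lazy:
    """Hypothetical helper: a leaf node that logs when it is evaluated."""
    def produce() -> float:
        calls.append(value)
        return value
    return Lazy(produce)


x, y = const(2.0), const(3.0)
z = x * y + x            # only builds the graph; no arithmetic happens here
calls_before_eval = list(calls)
result = z.eval()        # the whole graph is evaluated on demand
print(calls_before_eval, result, calls)
```

Because the full graph exists before anything runs, a real framework has the chance to reorder or fuse operations before execution; the toy version only demonstrates the deferral and the reuse of an already computed node.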

Two-Tier Memory and SSD Context Caching

  • The system keeps active context in unified memory while paging older conversation history to the SSD.
  • SSD caching allows for a 'RAM extension' effect that prevents the software from locking up the entire machine.
  • Older system prompts and tool definitions are frozen to free up high-speed memory for immediate generation.

Heavyweight runners like LM Studio maintain the entire conversation history in a 'hot' state, which quickly exhausts available RAM. oMLX functions like a modern operating system by intelligently managing which data stays in the brain and what gets stored on disk. This approach prevents the memory bottlenecks that typically crash local LLM sessions on consumer-grade hardware.
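The paging behaviour described above can be sketched as a small two-tier store. This is an illustrative assumption of how such a design might look, not oMLX's actual code: the `TwoTierCache` class, its least-recently-used eviction policy, and the pickle-on-disk format are all hypothetical choices made for the example.

```python
import pickle
import tempfile
from collections import OrderedDict
from pathlib import Path


class TwoTierCache:
    """Toy cache: a bounded in-memory 'hot' tier backed by a 'cold' tier on disk."""

    def __init__(self, hot_capacity: int, cold_dir: Path):
        self.hot = OrderedDict()          # least recently used entry comes first
        self.hot_capacity = hot_capacity
        self.cold_dir = cold_dir

    def _cold_path(self, key: str) -> Path:
        return self.cold_dir / f"{key}.pkl"

    def put(self, key: str, block) -> None:
        self.hot[key] = block
        self.hot.move_to_end(key)
        # Page the least recently used blocks out to disk when over capacity.
        while len(self.hot) > self.hot_capacity:
            old_key, old_block = self.hot.popitem(last=False)
            self._cold_path(old_key).write_bytes(pickle.dumps(old_block))

    def get(self, key: str):
        if key in self.hot:
            self.hot.move_to_end(key)
            return self.hot[key]
        path = self._cold_path(key)
        if path.exists():
            # Cold hit: load from disk and promote back into the hot tier.
            block = pickle.loads(path.read_bytes())
            path.unlink()
            self.put(key, block)
            return block
        return None


cache = TwoTierCache(hot_capacity=2, cold_dir=Path(tempfile.mkdtemp()))
for name in ["system_prompt", "tool_defs", "turn_1"]:
    cache.put(name, [name] * 3)   # stand-in for a block of cached state

hot_after_puts = list(cache.hot)          # the oldest block was paged out
restored = cache.get("system_prompt")     # served from disk, then promoted
print(hot_after_puts, list(cache.hot))
```

The point of the sketch is the trade-off: memory use stays bounded by `hot_capacity`, at the cost of a disk read whenever an older block is needed again.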

Comparative Performance and Agent Efficiency

  • The Codex CLI offers a leaner alternative to Claude Code by reducing the base weight of system prompts.
  • Claude Code utilizes approximately 50% of a 32K context window just for its system prompts and tool definitions.
  • oMLX provides ready-to-use code snippets for different AI agent harnesses through an intuitive server dashboard.

Token conservation is critical on devices like the MacBook M2 where context windows are limited. Using a leaner agent harness like Codex ensures there is more room for actual project code before hitting context ceilings. The dashboard facilitates the setup of these agents, allowing users to launch local servers with specific API keys and model parameters quickly.
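The context budget arithmetic behind this comparison is easy to check. The sketch below assumes that a "32K" window means 32 × 1024 = 32,768 tokens, which the video does not state explicitly; the 16.2K overhead figure is the one quoted for Claude Code, and `context_budget` is a hypothetical helper name.

```python
# Assumption: "32K" is taken to mean 32 * 1024 tokens; the source does not say.
CONTEXT_WINDOW = 32 * 1024
HARNESS_OVERHEAD = 16_200   # tokens quoted for system prompts and tool definitions


def context_budget(window: int, overhead: int) -> tuple[int, float]:
    """Return the tokens left for project work and the share used by the harness."""
    remaining = window - overhead
    share = overhead / window
    return remaining, share


remaining, overhead_share = context_budget(CONTEXT_WINDOW, HARNESS_OVERHEAD)
print(f"Tokens left for the project: {remaining}")
print(f"Share of the window used by the harness: {overhead_share:.1%}")
```

Under that assumption, roughly half of the window is spent before the first line of project code is read, which matches the "only 16K tokens" figure given in the video.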

Real-World Coding Test Results

  • A 35B parameter Qwen 3.6 model successfully generated a movie search and rating web app using oMLX.
  • Persistent SSD caching allows the model to resume progress without hallucination even after a manual session clear.
  • Hydrating the model's state from the SSD saves the computational progress of the project during context overflows.

During the development of a movie wishlist application, the prompt occasionally exceeded the 32K context limit configured on the M2 MacBook. While this usually kills a project or forces a memory-wiping 'clear' command, oMLX recognized the prefix cached on the SSD and instantly reloaded the state. The run finished with an 89% cache efficiency, suggesting that local agents can handle massive token throughput without losing the thread of the conversation.
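Prefix reuse and the reported cache efficiency can be illustrated with a few lines of Python. The video does not describe how oMLX matches prefixes internally, so the token-by-token comparison below is an assumed, simplified stand-in; `shared_prefix_len` and `cache_efficiency` are hypothetical helper names, and the token IDs are made up.

```python
from typing import Sequence


def shared_prefix_len(cached: Sequence[int], incoming: Sequence[int]) -> int:
    """Count how many leading tokens of the new prompt match the cached one."""
    count = 0
    for old, new in zip(cached, incoming):
        if old != new:
            break
        count += 1
    return count


def cache_efficiency(cached_tokens: int, total_tokens: int) -> float:
    """Fraction of processed tokens that were served from cache."""
    return cached_tokens / total_tokens


# Made-up token IDs: the new prompt repeats the old one, then diverges.
cached_prompt = [101, 7, 7, 42, 9]
new_prompt = [101, 7, 7, 42, 13, 5]

reused = shared_prefix_len(cached_prompt, new_prompt)
to_compute = len(new_prompt) - reused

# Figures quoted in the video: 1.59M cached out of 1.78M total tokens.
efficiency = cache_efficiency(1_590_000, 1_780_000)
print(reused, to_compute, f"{efficiency:.0%}")
```

Only the tokens after the shared prefix need fresh computation, which is why a new session that starts with the same system prompt and tool definitions can resume quickly.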

Head-to-Head Comparison with LM Studio

  • LM Studio required 35 minutes to complete the same task that oMLX finished in 20 minutes.
  • Multitasking was nearly impossible during the LM Studio test as it consumed all available RAM and caused video playback lag.
  • oMLX generated text at 47 tokens per second compared to 16 tokens per second on LM Studio.

While LM Studio demonstrated superior stability by avoiding 400-level context errors, its resource consumption was significantly higher. oMLX allowed for background browsing and video watching while the AI agent worked, whereas LM Studio utilized every available resource of the MacBook. The three-fold increase in generation speed makes the manual management of occasional context errors a practical trade-off for productivity.
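The quoted figures can be turned into ratios to see what "three-fold" refers to. The numbers are the ones reported in the video; the variable names are illustrative only.

```python
# Figures reported in the video for the same task on the same model.
omlx_tokens_per_sec = 47
lm_studio_tokens_per_sec = 16
omlx_minutes = 20
lm_studio_minutes = 35

# Ratio of average generation speed.
throughput_ratio = omlx_tokens_per_sec / lm_studio_tokens_per_sec

# Ratio of total time to finish the task.
wall_clock_ratio = lm_studio_minutes / omlx_minutes

print(f"Generation speed: {throughput_ratio:.2f}x")
print(f"Total task time:  {wall_clock_ratio:.2f}x")
```

The generation speed ratio is close to three, while the end-to-end task finished a little under twice as fast. The video does not break down where the remaining time went, so the two figures are best read as separate measurements.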
