I Ran a Local LLM on 12-Year-Old Raspberry Pi (It Actually Worked!)
Better Stack
Computing/Software, Consumer Electronics
Transcript
00:00:00This is the first-generation Raspberry Pi, which came out back in 2014.
00:00:05It has a 700 MHz single-core processor and 512 MB of RAM.
00:00:12By modern standards, this is basically a calculator.
00:00:16But today, we're going to see if we can push this 12-year-old hardware
00:00:21to its absolute limit by running a large language model on it locally.
00:00:26In this video, I'll show you a tiny model that you can run on a Raspberry Pi,
00:00:30and we'll see how it performs, and I'll show you how to install all the necessary dependencies
00:00:35so you can try it out for yourself.
00:00:37It's going to be a lot of fun, so let's dive into it.
00:00:40Honestly, I didn't think it was possible to find a model
00:00:47lean enough for this architecture.
00:00:49But after some digging, I actually found a candidate.
00:00:52Meet the Falcon H1 Tiny.
00:00:54It's an incredibly compact model with just 90 million parameters.
00:00:59It was developed by the Technology Innovation Institute in Abu Dhabi
00:01:03specifically to explore the extreme lower bounds of language modeling.
00:01:08But how could they make a model that small?
00:01:10Is there some kind of technical magic sauce behind it?
00:01:13Well, not really.
00:01:14They're basically using the same hybrid transformer plus Mamba architecture
00:01:19that companies like IBM used for their tiny Granite 4 models.
00:01:24Which I also did a video on if you want to check it out.
00:01:27But here's the thing.
00:01:28To successfully squeeze this model into memory, we have to talk about quantization.
00:01:33Now, the Falcon models are available in 2-bit, 4-bit, and 8-bit versions.
00:01:38Now, you might be tempted to try the ultra-lean IQ or importance quantization.
00:01:43But here's the catch.
00:01:45Those newer methods rely on complex bit manipulation
00:01:49that requires modern CPU instructions to be efficient.
00:01:52On our vintage ARMv6 chip in our Raspberry Pi, this won't cut it.
00:01:57So instead, we have to go with the old-school Q4 model,
00:02:01which is the gold standard for our case.
00:02:04It uses a medium-sized legacy quantization method
00:02:07that the Pi's processor can actually handle without choking.
00:02:11It gives us the best intelligence-per-megabyte ratio while keeping the logic intact.
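
As a rough back-of-the-envelope check on why the bit width matters so much on a 512 MB board, the sketch below estimates the raw weight size of a 90-million-parameter model at each quantization level. It assumes a nominal bit count per weight and ignores the per-block scale factors, file metadata, and runtime buffers that a real model file adds, so actual sizes will be somewhat larger. The function name is purely illustrative.

```python
def estimate_weight_megabytes(num_parameters: int, bits_per_weight: float) -> float:
    """Nominal size of the weights alone, in megabytes (1 MB = 1,000,000 bytes)."""
    total_bits = num_parameters * bits_per_weight
    return total_bits / 8 / 1_000_000


# Parameter count quoted in the video for the Falcon H1 Tiny model.
FALCON_TINY_PARAMETERS = 90_000_000

for bits in (2, 4, 8):
    size_mb = estimate_weight_megabytes(FALCON_TINY_PARAMETERS, bits)
    print(f"{bits}-bit: about {size_mb:.1f} MB of weights")
```

Even the 8-bit estimate fits comfortably within 512 MB, which suggests the choice between the variants is more about output quality and CPU cost than about whether the weights fit at all.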
00:02:17But to get this model running on this first-gen Raspberry Pi is not a trivial task.
00:02:22Since the Pi uses the ARMv6 architecture,
00:02:26it lacks the modern NEON instructions that almost all of the AI libraries depend on.
00:02:31But luckily, there's llama.cpp which we can use to run our inference.
00:02:36But in order to do that, we have to compile its binary specifically for our ARMv6.
00:02:42And if you tried to compile it directly on the Pi,
00:02:45it would probably take 18 hours for the compiler to finish.
00:02:49That is, if it doesn't crash from out-of-memory errors first.
00:02:53So to bypass that, we have to get a little bit creative.
00:02:56We need to cross-compile these binaries on our laptop beforehand by using dockcross,
00:03:02specifically targeting the ARMv6 instruction set with the VFP math unit enabled
00:03:08and then copy them over via SSH so we can get straight into inference.
00:03:13So that's exactly what we're going to do now.
00:03:15So first, we need to flash the leanest OS possible onto our Pi using the Raspberry Pi Imager.
00:03:22And for a board with only 512 megabytes of RAM, every megabyte counts.
00:03:28So I'm going to go with Raspberry Pi OS Lite, the 32-bit version,
00:03:32since it doesn't have a desktop interface and it idles at a tiny fraction of the memory
00:03:38that the standard OS uses, leaving almost all of our RAM available for the model to run.
00:03:44And here's another important note, make sure to use the advanced settings
00:03:47to pre-configure your Wi-Fi and enable SSH.
00:03:51Because on these older boards, it's much easier to manage everything remotely,
00:03:55so you don't have to fight with that sluggish local terminal.
00:03:58Now, once the Pi is booted and we've SSH'd into it, we need to address the ARMv6 problem.
00:04:05So if we tried to compile llama.cpp right here,
00:04:08the Pi would literally spend the next day and a half just chugging through headers.
00:04:13So instead, we're going to do it on a regular laptop to speed up compute and save time.
00:04:18So let's clone the source code of llama.cpp and create a dedicated build
00:04:23directory where we will store our build that we'll be using on our Raspberry Pi.
00:04:28Now, here's another problem.
00:04:29My Mac is using ARMv8, which is the 64-bit version, not the 32-bit ARMv6.
00:04:37And they have different instruction sets.
00:04:40So to compile binaries specifically for the Pi, we need to use dockcross,
00:04:45which is a cross-compiler toolchain that runs on my Mac
00:04:48but generates binaries specifically for the Pi's legacy architecture.
00:04:53Next, we need to configure the build.
00:04:55And this is where we need to be extremely precise.
00:04:58So we need to pass some very specific flags.
00:05:00First of all, we need to turn off shared libs to create a single portable binary.
00:05:05And then we have to turn off NEON because our Pi lacks those modern math instructions.
00:05:10And we need to disable OpenMP to keep our memory footprint as lean as possible.
00:05:15We're essentially stripping out every modern luxury
00:05:18to ensure the binary is compatible with our old school Pi board.
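
The exact command is not spelled out in the transcript, so the following is only a sketch of what a configure step along these lines might look like. It assembles a CMake argument list that turns off shared libraries and OpenMP and passes standard GCC flags for ARMv6 with the VFP unit, which leaves NEON disabled. The directory names, the `GGML_NATIVE=OFF` setting, and the specific compiler flags are assumptions rather than what was typed in the video, and the command would be run inside the cross-compilation container, not on the host directly.

```python
import shlex


def build_cmake_configure_args(source_dir: str, build_dir: str) -> list[str]:
    """Assemble a CMake configure command for a static ARMv6 build (illustrative)."""
    # Standard GCC options for ARMv6 with the VFP floating-point unit and no NEON.
    # Whether the video used exactly these flags is an assumption.
    armv6_flags = "-march=armv6 -mfpu=vfp -mfloat-abi=hard"
    return [
        "cmake",
        "-S", source_dir,
        "-B", build_dir,
        "-DCMAKE_BUILD_TYPE=Release",
        "-DBUILD_SHARED_LIBS=OFF",  # single portable binary, no shared libs
        "-DGGML_OPENMP=OFF",        # no OpenMP runtime on a single-core board
        "-DGGML_NATIVE=OFF",        # assumption: do not tune for the build host's CPU
        f"-DCMAKE_C_FLAGS={armv6_flags}",
        f"-DCMAKE_CXX_FLAGS={armv6_flags}",
    ]


# Hypothetical directory names, chosen only for this example.
configure_args = build_cmake_configure_args("llama.cpp", "build-armv6")
print(shlex.join(configure_args))
```

Keeping the flags in a list rather than one long string makes it harder to mistype a single option, which matters here because a wrong architecture flag produces a binary that will not run on the Pi.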
00:05:22And now if we hit build, in about two minutes we should have a fully
00:05:26compiled, optimized llama-completion binary ready to be copied onto our Pi board.
00:05:31And now I'll use SSH to connect directly to my Pi via the network
00:05:35and create a fresh directory on the Pi, and then use SCP to copy our custom-built binary onto it.
00:05:42And one last thing we need to do here.
00:05:44Let's go ahead and download the 2-bit, 4-bit and 8-bit legacy quantized models of the Falcon,
00:05:50because we'll be testing all of them sequentially.
00:05:53And then copy them to our Pi one by one over the network into the models folder.
00:05:58Now here comes the fun part.
00:05:59Let's move over to our Pi and execute our first inference test.
00:06:03We'll start with the most aggressive compression, the 2-bit quantized model.
00:06:07And here we need to run this long command.
00:06:10And basically what I'm doing here is prompting it with a simple
00:06:13"Hello, how are you?" and capping the output at 32 tokens.
00:06:18And we're specifying exactly one thread because, well, that's all we have.
00:06:22And we're also keeping the context size tiny at 128 tokens to save every possible byte of RAM.
00:06:29But the most important flag here is --no-mmap.
00:06:32Typically, llama.cpp uses memory mapping to load models, which is great for high-end GPUs,
00:06:38but it's a nightmare for our Pi board.
00:06:41On a 32-bit system with only 512 megabytes of RAM,
00:06:45mmap can fail if it can't find a contiguous block of address space.
00:06:50So by disabling it, we force the model to load directly into the heap,
00:06:55giving us much more stable control over our limited memory.
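
The long command itself is not reproduced in the transcript, so here is a hedged sketch of how the settings described above might be assembled: the prompt, a 32-token cap, one thread, a 128-token context, and memory mapping disabled. The flags `-m`, `-p`, `-n`, `-t`, `-c`, and `--no-mmap` are standard llama.cpp command-line options; the binary path and the model file name are hypothetical placeholders.

```python
import shlex


def build_inference_command(
    model_path: str,
    prompt: str,
    max_tokens: int = 32,
    threads: int = 1,
    context_size: int = 128,
    binary: str = "./llama-completion",  # assumed location of the copied binary
) -> list[str]:
    """Assemble the llama.cpp invocation described in the video (illustrative)."""
    return [
        binary,
        "-m", model_path,         # model file to load
        "-p", prompt,             # prompt text
        "-n", str(max_tokens),    # cap on generated tokens
        "-t", str(threads),       # the first-gen Pi has a single core
        "-c", str(context_size),  # small context to save RAM
        "--no-mmap",              # load weights into the heap instead of mapping the file
    ]


# The model file name below is a placeholder, not the real file name.
command = build_inference_command("models/falcon-tiny-q2.gguf", "Hello, how are you?")
print(shlex.join(command))
```

Swapping in the 4-bit or 8-bit model then only means changing the `model_path` argument, while the memory-saving flags stay the same.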
00:06:58And with that said, let's run the command.
00:07:00And there it is, our first tokens.
00:07:03As we can see here, the 2-bit version is struggling a lot.
00:07:08First of all, we can see that it's producing roughly one token every three seconds,
00:07:14which is to be expected on an old Raspberry Pi board.
00:07:18But more importantly, the answer is just complete nonsense.
00:07:21On a 90 million parameter model, the weights are so compressed
00:07:25that the linguistic logic has basically collapsed.
00:07:28It's barely coherent, but technically it is working.
00:07:32So now let's see what happens if we swap it for the 4-bit model.
00:07:35And look at that, now we get a coherent greeting back.
00:07:40So that is a success.
00:07:42We now have an actual AI model running locally on the Pi
00:07:47and responding logically to our prompts.
00:07:49So yay!
00:07:50Now let's push it even further.
00:07:53Let's see if the Pi can handle an 8-bit model.
00:07:56And this time I'm going to ask it something more intelligent,
00:07:59like what is the capital of Albania?
00:08:02And well, that is just wrong: the capital of Albania is Tirana,
00:08:08so the model's answer is clearly not factually correct.
00:08:10But if I ask what is the capital of Belgium, it responds correctly.
00:08:15So this is showing us something very interesting.
00:08:17It seems that the 90 million parameter crunch comes with its own cost.
00:08:22It might have accurate knowledge about larger, more popular countries,
00:08:26but lacks knowledge about lesser known countries and probably lesser known topics.
00:08:31And that is just the nature of knowledge.
00:08:33There is a finite amount of knowledge you can fit within those 90 million parameters.
00:08:38But nonetheless, the result is super cool.
00:08:41And this is a confirmation that yes, there are indeed AI models small enough
00:08:46and lean enough to run on a 12-year-old Raspberry Pi.
00:08:50Is it fast?
00:08:51Hell no.
00:08:52Is it precise?
00:08:53It might not be.
00:08:54Should you use it in production?
00:08:55Probably not.
00:08:57Unless you want to build a very, very, very, very slow robot.
00:09:02But most importantly, now we know that it is theoretically possible.
00:09:06So basically, that's all I wanted to prove in this video.
00:09:09And to be honest, this experiment was a lot of fun.
00:09:13So there you have it, folks.
00:09:14Those are the Falcon H1 Tiny models.
00:09:17Probably the smallest AI models currently out there.
00:09:20And now we know that they are in fact small enough to run on a first-gen Raspberry Pi,
00:09:25which is super cool.
00:09:27I can't stop celebrating how cool this fact is.
00:09:30Although the practical implementation of it is useless, it's still cool.
00:09:35So let me know, folks, if you have any funny thoughts,
00:09:37comments or remarks about what you just witnessed.
00:09:40Post them in the comment section down below.
00:09:42And folks, if you like these types of technical breakdowns,
00:09:45please let me know by smashing that like button underneath the video.
00:09:49And also don't forget to subscribe to our channel.
00:09:51This has been Andris from Better Stack, and I will see you in the next videos.