00:00:00Most people assume that running a powerful vision language model requires a massive GPU
00:00:05or a paid subscription to a cloud service.
00:00:08However, Liquid AI recently released a demo of their newest LFM model running entirely
00:00:14within a web browser.
00:00:16Using WebGPU and the ONNX Runtime, this model can process images and videos locally.
00:00:23This means your data never leaves your computer and you don't even need an internet connection
00:00:28once the model is cached on your device.
00:00:30I honestly think that is super cool, so in this video, we're going to take a look at
00:00:34this model, see how it performs, run a little test, and figure out if it's actually as powerful
00:00:40as advertised.
00:00:41It's going to be a lot of fun, so let's dive into it.
00:00:48So LFM stands for Liquid Foundation Model.
00:00:52And instead of relying solely on the transformer architecture, Liquid AI uses a hybrid design.
00:00:58It combines convolutional blocks with something called grouped query attention.
00:01:03The 1.6 billion parameter model is specifically tuned for vision and language.
00:01:09It is trained on a massive 28 trillion token dataset, which helps it punch above its weight
00:01:15class.
00:01:16In the benchmarks, it often matches the performance of models twice its size, while also being
00:01:21significantly faster on edge devices like laptops and phones.
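To give you an idea of what grouped query attention actually does: several query heads share a single key/value head, which shrinks the cached keys and values without giving up multi-head queries. Here's a minimal NumPy sketch of that idea; all shapes and numbers are illustrative, not Liquid AI's actual configuration:

```python
import numpy as np

def grouped_query_attention(x, Wq, Wk, Wv, n_q_heads, n_kv_heads):
    """Minimal grouped-query attention: each group of query heads
    shares one key/value head, shrinking the K/V projections."""
    seq, d_model = x.shape
    head_dim = d_model // n_q_heads
    group = n_q_heads // n_kv_heads  # query heads per shared K/V head

    # Queries get the full head count; keys/values only n_kv_heads.
    q = (x @ Wq).reshape(seq, n_q_heads, head_dim)
    k = (x @ Wk).reshape(seq, n_kv_heads, head_dim)
    v = (x @ Wv).reshape(seq, n_kv_heads, head_dim)

    out = np.empty_like(q)
    for h in range(n_q_heads):
        kv = h // group  # which shared K/V head this query head uses
        scores = q[:, h] @ k[:, kv].T / np.sqrt(head_dim)
        weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
        weights /= weights.sum(axis=-1, keepdims=True)
        out[:, h] = weights @ v[:, kv]
    return out.reshape(seq, d_model)

# Example: 8 query heads sharing 2 K/V heads (4 queries per group),
# so the K/V projections are 4x smaller than the query projection.
rng = np.random.default_rng(0)
x = rng.standard_normal((4, 64))
out = grouped_query_attention(
    x,
    Wq=rng.standard_normal((64, 64)),
    Wk=rng.standard_normal((64, 16)),
    Wv=rng.standard_normal((64, 16)),
    n_q_heads=8, n_kv_heads=2)
```

The memory saving is the point: whatever gets cached during generation is the keys and values, and those are now a quarter of the size in this toy setup.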
00:01:26Now you might wonder, how did they manage to shrink this level of intelligence into a package
00:01:31that fits under one gigabyte of RAM?
00:01:34Unlike other tiny models that use pruned or compressed versions of giant cloud models,
00:01:40Liquid AI uses a philosophy called efficiency by design.
00:01:44The liquid in their name refers to their linear input-varying architecture, or LIV.
00:01:51While traditional transformers have a memory that grows larger the more you talk to them,
00:01:56the Liquid model uses a hybrid system of adaptive convolutional blocks.
00:02:01These blocks basically act like smart filters that process only the most relevant local
00:02:07information, effectively compressing the data as it flows through the model.
00:02:11This allows LFM to maintain its massive 32,000 token context window without the usual quadratic
00:02:18slowdown or ever-growing memory cache that you see in traditional transformers.
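To make that memory argument concrete, here's a back-of-the-envelope sketch. The layer counts and dimensions below are made-up illustrative numbers, not LFM's real configuration; the shape of the comparison is what matters:

```python
def kv_cache_bytes(tokens, layers=16, kv_heads=8, head_dim=64, bytes_per=2):
    """Memory a standard transformer needs to cache keys and values
    in fp16: it grows linearly with every token in the context."""
    return 2 * tokens * layers * kv_heads * head_dim * bytes_per  # 2 = K and V

def conv_state_bytes(layers=16, channels=1024, kernel=4, bytes_per=2):
    """A short-convolution block only keeps a fixed sliding window of
    state, so its memory does not depend on the context length at all."""
    return layers * channels * kernel * bytes_per

print(kv_cache_bytes(1_000))   # 32,768,000 bytes at 1k tokens
print(kv_cache_bytes(32_000))  # ~1 GB at the full 32k context window
print(conv_state_bytes())      # 131,072 bytes, constant for any context
```

With these toy numbers the transformer-style cache alone eats about a gigabyte at the full 32,000 token window, while the convolutional state stays at a fixed 128 KB, which is exactly the kind of gap that decides whether a model fits in a browser tab.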
00:02:23And there are specific technical details that make this model stand out from the rest.
00:02:28First of all, it has a native resolution.
00:02:30It handles images up to 512 by 512 pixels without distortion or upscaling.
00:02:37And for larger images, it uses a tiling strategy, which basically splits the image into patches
00:02:42while keeping a thumbnail for the global context.
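That tiling strategy is easy to sketch in a few lines of Python. The exact tile geometry and thumbnail handling in LFM may differ; this just illustrates the split-plus-thumbnail idea:

```python
import math

def tile_image(width, height, tile=512):
    """Split an oversized image into tile x tile patches, and keep a
    downscaled thumbnail of the whole image for global context."""
    if width <= tile and height <= tile:
        return [(0, 0, width, height)], None  # fits natively: no tiling
    cols, rows = math.ceil(width / tile), math.ceil(height / tile)
    tiles = [(c * tile, r * tile,
              min((c + 1) * tile, width), min((r + 1) * tile, height))
             for r in range(rows) for c in range(cols)]
    return tiles, (tile, tile)  # patch boxes + thumbnail size

print(tile_image(512, 512))           # -> ([(0, 0, 512, 512)], None)
print(len(tile_image(1024, 768)[0]))  # -> 4 patches, plus a thumbnail
```

So a 1024 by 768 photo becomes four 512-pixel patches for local detail plus one thumbnail for the big picture, instead of being squashed into a single low-resolution input.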
00:02:46And secondly, it's very efficient.
00:02:47Because of its hybrid architecture, it offers a very low memory footprint, often running
00:02:52under one gigabyte of RAM.
00:02:54But I think the most impressive part is the WebGPU integration.
00:02:58The Hugging Face Space demo shows how you can use it for real time webcam captioning.
00:03:04So let's try it out for ourselves and see how well it performs.
00:03:08All right, let's see how this thing actually works.
00:03:11I guess we should choose which vision model we want to load.
00:03:15Let's try the most powerful one with FP16.
00:03:18And let's load that in.
00:03:20Now this model takes a considerable amount of time to download.
00:03:23And this is all being downloaded on your device.
00:03:25So next time you open the application, everything will be cached.
00:03:28All right.
00:03:29So now we have downloaded the model with FP16 quantization.
00:03:34And let's click on start and see how it works.
00:03:36Oh, look at that.
00:03:38A man with a beard and a hoodie is looking at the camera.
00:03:40Okay, so it's able to detect what kinds of objects are present in the video, which is
00:03:45pretty cool.
00:03:46So we can do like object detection.
00:03:50Let's see if it can detect a phone.
00:03:51Yep, it detects that I'm holding an iPhone with a black case.
00:03:57That's pretty cool.
00:03:58Look at that.
00:04:00It's really doing it in real time.
00:04:02I am impressed.
00:04:04So what if I do this?
00:04:05Does it recognize the gesture? Yep, a man holding a peace sign in his hand.
00:04:10That is pretty cool.
00:04:12What if I do a thumbs up?
00:04:13Yes, I'm getting a thumbs up.
00:04:15The model does detect everything that I'm doing in real time.
00:04:18Let's see if it can detect my microphone.
00:04:21Oh, it even detects that there's writing on it.
00:04:24Wow, it can even read text off the case, which is pretty, pretty cool.
00:04:29The fact that we're getting these captions in real time really shows that this model
00:04:33is very powerful.
00:04:35Let me try to turn off internet connection and see if it still works.
00:04:40So now I have turned off Wi-Fi and yeah, we're still getting the same outputs, which is pretty
00:04:50awesome.
00:04:51So there you have it folks.
00:04:52That is the newest liquid foundation model in a nutshell.
00:04:56I think it's super impressive how far these AI models have evolved in terms of quantization
00:05:01and the ability to run them on edge devices like my laptop over here.
00:05:05I think just two years ago, we couldn't believe that this could actually be reality, but now
00:05:10it's becoming more and more common to run these models in the browser via WebGPU.
00:05:14So what do you think about the liquid foundation model?
00:05:16Have you tried it?
00:05:17Will you use it?
00:05:18What are the best use cases for such a model?
00:05:21Let us know your thoughts in the comment section down below.
00:05:23And folks, if you like these types of technical breakdowns, please let me know by smashing
00:05:27that like button underneath the video, and also don't forget to subscribe to our channel.
00:05:32This has been Andris from Better Stack, and I will see you in the next videos.