00:00:00This is DeepSpeed, Microsoft's open-source library that plugs straight into PyTorch and fixes the real problem: memory.
00:00:07It lets you fit models that normally crash instantly all on one GPU without overloading it. Big models don't fail because they're slow.
00:00:14They fail because optimizer states, gradients, and parameters end up blowing up your VRAM. DeepSpeed's secret is ZeRO, which shards training states
00:00:23so you're not storing all the same stuff everywhere. We have videos coming out all the time, so be sure to subscribe.
00:00:30Now let's just jump right in and get this running. I'm going to run all this on Google Colab since I'm on a Mac with an M4 Pro,
00:00:40so I don't have the NVIDIA GPUs, which kind of makes it more difficult, but I can still do this on Colab instead. First,
00:00:46I'm going to run a quick check on my GPUs, then we can pip install all our packages.
00:00:51I'm going to install things like PyTorch, Hugging Face, and DeepSpeed, then run ds_report to sanity-check your CUDA and compiler setup.
00:00:59A few more installs to make sure this runs smoothly.
00:01:02Then we're going to create our config JSON file so we can configure DeepSpeed. This config file is the whole game.
00:01:09We'll start with ZeRO stage 2, which shards optimizer states and gradients across
00:01:14GPUs to significantly reduce memory usage, all while model parameters stay replicated. Don't overthink this, because this drove me nuts.
00:01:22Just start from the official docs, change one thing at a time, and resist that urge to add random
00:01:28configs. The config you can find on both the Hugging Face and the DeepSpeed docs,
00:01:34and I got most of my Python script from these docs as well,
00:01:37but I did make some tweaks just to handle it better on my Mac system.
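A minimal sketch of the kind of ZeRO stage 2 config the video writes out. The field names follow the DeepSpeed config schema; the batch sizes, learning rate, and optimizer choice here are illustrative assumptions, not the video's exact values:

```python
import json

# Minimal DeepSpeed config sketch: ZeRO stage 2 shards optimizer states
# and gradients across GPUs, while model parameters stay replicated.
# Batch sizes and optimizer values below are illustrative placeholders.
ds_config = {
    "train_micro_batch_size_per_gpu": 4,
    "gradient_accumulation_steps": 1,
    "fp16": {"enabled": True},          # mixed precision halves weight/grad memory
    "zero_optimization": {
        "stage": 2,                     # 1 = optimizer states; 2 = + gradients
        "overlap_comm": True,           # overlap gradient reduction with backward pass
        "contiguous_gradients": True,   # reduce memory fragmentation
    },
    "optimizer": {
        "name": "AdamW",
        "params": {"lr": 3e-5},
    },
}

with open("ds_config.json", "w") as f:
    json.dump(ds_config, f, indent=2)
```

You then hand this file to your training script (for example via the Hugging Face `deepspeed` argument or the `deepspeed` launcher), and change one field at a time, as the video suggests.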
00:01:42If this step fails, stop here, because most DeepSpeed issues are CUDA mismatches. It's not actually your model.
00:01:48We're going to run all this now and just watch it work.
00:01:51I'm going to use a small imported dataset for this example, just to get it running faster.
00:01:58And there we go. After a few minutes, we can see the steps it took and the peak GPU memory, too.
00:02:03Yes, training loss on this run didn't actually change or go down much,
00:02:08but we could optimize this or use a larger dataset for better loss.
00:02:13Now, here is what people think they get, but then they still go out of memory.
00:02:16ZeRO comes in stages, and each one of these stages answers one question: what am I allowed to stop storing on a single GPU?
00:02:24Well, stage 1 shards optimizer states. Stage 2 does the same, as well as gradients, so now
00:02:30you're cutting even deeper into the stuff that quietly eats away your memory. Then we have ZeRO stage 3.
00:02:36This is the big one: it shards optimizer states, gradients, and parameters.
00:02:40This is the biggest memory win, but even that might not be enough.
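The stages above can be sketched with back-of-envelope arithmetic. For mixed-precision Adam training, a common estimate is 2 bytes per parameter for fp16 weights, 2 for fp16 gradients, and 12 for fp32 optimizer states, and each ZeRO stage divides one more of those across your GPUs (activations and buffers are ignored here, so treat the numbers as rough):

```python
def per_gpu_gb(params_billion: float, n_gpus: int, stage: int) -> float:
    """Rough per-GPU training memory (GB) for mixed-precision Adam.

    Per parameter: 2 bytes fp16 weights, 2 bytes fp16 gradients,
    12 bytes fp32 optimizer states (master weights + momentum + variance).
    Each ZeRO stage shards one more of these across n_gpus GPUs.
    Activations and temporary buffers are not counted.
    """
    p = params_billion * 1e9
    weights, grads, opt = 2 * p, 2 * p, 12 * p
    if stage >= 1:
        opt /= n_gpus        # stage 1: shard optimizer states
    if stage >= 2:
        grads /= n_gpus      # stage 2: also shard gradients
    if stage >= 3:
        weights /= n_gpus    # stage 3: also shard parameters
    return (weights + grads + opt) / 1e9

# A hypothetical 7B-parameter model on 8 GPUs:
for s in range(4):
    print(f"stage {s}: {per_gpu_gb(7, 8, s):.1f} GB per GPU")
```

With no sharding, 7B parameters need roughly 112 GB per GPU; stage 3 drops that to roughly 14 GB on 8 GPUs, which is why it's "the big one."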
00:02:45If you still can't fit the model, ZeRO-Infinity can offload to CPU or even NVMe.
00:02:50So yes, you're trading speed for scale. But sometimes the actual win is just fitting the model in the first place.
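As a sketch, offloading is enabled through the same config file. These `zero_optimization` sub-keys come from the DeepSpeed config schema; the NVMe path is a placeholder you'd point at your own fast local disk:

```json
{
  "zero_optimization": {
    "stage": 3,
    "offload_optimizer": { "device": "cpu", "pin_memory": true },
    "offload_param": { "device": "nvme", "nvme_path": "/local_nvme" }
  }
}
```

This is the speed-for-scale trade the video describes: optimizer states move to CPU RAM and parameters to NVMe, so watch your CPU RAM once offload is on.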
00:02:56Now, if you're thinking "cool, but memory isn't my only problem," you're right. DeepSpeed supports 3D parallelism: data,
00:03:04pipeline, and tensor parallelism. And it has built-in support for mixture-of-experts models,
00:03:09so sparse models don't eat you on compute.
00:03:12So now we're getting some real options. DeepSpeed integrates really nicely with Hugging Face and Accelerate,
00:03:19so you don't need to build everything from scratch.
00:03:21You basically take what you need and ignore all the rest of it. Now, benchmarks depend heavily on your setup,
00:03:27so don't always trust the big numbers. I tried running this a few times,
00:03:30but again, since I'm on an M4 Pro, I couldn't get this optimized any further with just this basic model.
00:03:36So it's tough to say, but other DeepSpeed projects have shown major throughput wins,
00:03:41especially when memory was a limiting factor. If you're on Windows or Linux, these could be huge wins.
00:03:46So the best move, honestly, is to just try it out. Start with the official configs,
00:03:51which is mostly what I did here (I changed it a bit for Mac), then fix CUDA issues,
00:03:56then watch CPU RAM if you enable offload. And if you decide to go multi-GPU later,
00:04:01Accelerate is going to help you with that. DeepSpeed is basically an "I refuse to be out of memory today" button,
00:04:07but once you understand ZeRO and how offloading works in this context, it makes larger models practical on limited hardware.
00:04:14But getting set up can definitely be confusing at first.
00:04:17Subscribe if this saved you GPU time, or if you just love these types of dev tools. We'll see you in another video.