How Big Models Fit on Small GPUs (DeepSpeed)
Better Stack
Transcript
00:00:00This is DeepSpeed, Microsoft's open-source library that plugs straight into PyTorch and fixes the real problem: memory.
00:00:07It lets you fit models that normally crash instantly, all on one GPU, without overloading it. Big models don't fail because they're slow.
00:00:14They fail because optimizer states, gradients, and parameters end up blowing up your VRAM. DeepSpeed's secret is ZeRO, which shards training states
00:00:23so you're not storing all the same stuff everywhere. We have videos coming out all the time. Be sure to subscribe.
00:00:30Now let's just jump right in and get this running. I'm gonna run all this on Google Colab since I'm on a Mac M4 Pro,
00:00:40so I don't have the NVIDIA GPUs, which kind of makes it more difficult, but I can still do this on Colab instead. First,
00:00:46I'm gonna run a quick check on my GPUs, then we can pip install all our packages.
00:00:51I'm gonna install things like PyTorch, Hugging Face, and DeepSpeed, then run ds_report to sanity-check your CUDA and compiler setup. A
00:00:59few more installs to make sure this runs smoothly.
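The setup check described above can be sketched with the standard library alone. This is a minimal sketch, not the script from the video: the function name `check_environment` is hypothetical, and the package names `torch`, `transformers`, and `deepspeed` are an assumption about what "PyTorch, Hugging Face, and DeepSpeed" were installed as. It only reports whether the packages are importable and whether the `nvidia-smi` and `ds_report` commands are on the PATH.

```python
import importlib.util
import shutil


def check_environment(packages=("torch", "transformers", "deepspeed")):
    """Report which packages are importable and which CLI tools are on PATH."""
    report = {}
    for name in packages:
        # find_spec returns None when a top-level package is not installed
        report[name] = importlib.util.find_spec(name) is not None
    for tool in ("nvidia-smi", "ds_report"):
        # shutil.which returns None when the executable is not on PATH
        report[tool] = shutil.which(tool) is not None
    return report


if __name__ == "__main__":
    for item, found in check_environment().items():
        print(f"{item:12s} {'found' if found else 'MISSING'}")
```

If everything is reported as found, running `ds_report` itself is still the step that checks the CUDA and compiler setup; this sketch only confirms the tools are present.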
00:01:02Then we're gonna create our config JSON file so we can configure DeepSpeed. This config file is the whole game.
00:01:09We'll start with ZeRO stage 2, which shards optimizer states and gradients across
00:01:14GPUs to significantly reduce memory usage, all while model parameters stay replicated. Don't overthink this, because this drove me nuts.
00:01:22Just start from the official docs, change one thing at a time, and resist that urge to add random
00:01:28configs. The config you can find on both the Hugging Face and the DeepSpeed docs,
00:01:34and I got most of my Python script from these docs as well.
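A ZeRO stage 2 config of the kind described above might look like the sketch below. The exact file used in the video is not shown in the transcript, so the batch size, the fp16 setting, and the helper name `write_config` are assumptions for illustration; the keys themselves (`train_micro_batch_size_per_gpu`, `gradient_accumulation_steps`, `fp16`, `zero_optimization`) are standard DeepSpeed config fields.

```python
import json


def write_config(path="ds_config.json"):
    """Write a small ZeRO stage 2 config to disk and return it as a dict."""
    ds_config = {
        # Illustrative values only; tune them for your model and GPU
        "train_micro_batch_size_per_gpu": 4,
        "gradient_accumulation_steps": 1,
        "fp16": {"enabled": True},
        "zero_optimization": {
            # Stage 2 shards optimizer states and gradients;
            # model parameters stay replicated on every GPU
            "stage": 2,
            "overlap_comm": True,
            "contiguous_gradients": True,
        },
    }
    with open(path, "w") as f:
        json.dump(ds_config, f, indent=2)
    return ds_config


if __name__ == "__main__":
    write_config()
```

Keeping the config this small follows the advice in the video: start minimal and change one field at a time.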
00:01:37But I did make some tweaks for my system just to handle it better on that Mac system.
00:01:42If this step fails, stop here, because most DeepSpeed issues are CUDA mismatches. It's not actually your model.
00:01:48We're gonna run all this now and just watch it work.
00:01:51I'm gonna use a small imported dataset for this example just to get it running faster.
00:01:58And there we go. After a few minutes, we can see the steps it took and the peak GPU memory, too.
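The transcript does not show how the peak GPU memory figure was produced. One common way to get such a number in PyTorch is `torch.cuda.max_memory_allocated()`; the sketch below assumes that approach and uses a hypothetical helper name, `peak_gpu_memory_gib`. It returns `None` when PyTorch or a CUDA device is not available, so it is safe to run on a machine without a GPU.

```python
def peak_gpu_memory_gib():
    """Return peak allocated CUDA memory in GiB, or None without torch/CUDA."""
    try:
        import torch  # imported lazily so the sketch runs without PyTorch
    except ImportError:
        return None
    if not torch.cuda.is_available():
        return None
    # Peak bytes held by tensors on the current device in this process
    return torch.cuda.max_memory_allocated() / 1024**3


if __name__ == "__main__":
    peak = peak_gpu_memory_gib()
    print("no CUDA device" if peak is None else f"peak GPU memory: {peak:.2f} GiB")
```

Calling `torch.cuda.reset_peak_memory_stats()` before training starts makes the reading reflect only that run.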
00:02:03Yes, training loss on this run didn't actually change or go down much,
00:02:08but we could optimize this or use a larger dataset for better loss. Now,
00:02:13here is what people think they get, but then they still go out of memory.
00:02:16ZeRO comes in stages. Each one of these stages answers one question: what am I allowed to stop storing on a single GPU?
00:02:24Well, stage 1 shards optimizer states. Stage 2 does the same, as well as gradients. Now
00:02:30you're cutting even deeper into this stuff that quietly eats away your memory. Then we have ZeRO stage 3.
00:02:36This is the big one. It shards optimizer states, gradients, and parameters.
00:02:40This is the biggest memory win, but even that might not be enough.
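The three stages can be turned into rough arithmetic. The sketch below assumes mixed-precision training with Adam, where each parameter costs about 2 bytes for fp16 weights, 2 bytes for fp16 gradients, and 12 bytes for optimizer states (the accounting used in the ZeRO paper). Activations and temporary buffers are ignored, and the function name `zero_memory_gb` is hypothetical.

```python
def zero_memory_gb(num_params, num_gpus, stage):
    """Approximate per-GPU memory (GB) for model states under each ZeRO stage."""
    params = 2 * num_params   # fp16 weights
    grads = 2 * num_params    # fp16 gradients
    optim = 12 * num_params   # fp32 weights, momentum, variance (Adam)
    if stage >= 1:
        optim /= num_gpus     # stage 1: shard optimizer states
    if stage >= 2:
        grads /= num_gpus     # stage 2: also shard gradients
    if stage >= 3:
        params /= num_gpus    # stage 3: also shard parameters
    return (params + grads + optim) / 1e9


if __name__ == "__main__":
    for stage in range(4):
        print(stage, round(zero_memory_gb(7.5e9, 64, stage), 1))
```

For a 7.5B-parameter model on 64 GPUs this gives roughly 120 GB with no sharding, 31.4 GB at stage 1, 16.6 GB at stage 2, and 1.9 GB at stage 3. With `num_gpus=1` every stage returns the same 120 GB, because there is nothing to shard across; on a single GPU the savings have to come from somewhere other than sharding.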
00:02:45If you still can't fit the model, ZeRO-Infinity can offload to CPU or even NVMe.
00:02:50So yes, you're trading speed for scale. But sometimes the actual win is just fitting the model in the first place.
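Offloading is switched on in the same config file. The sketch below builds a ZeRO stage 3 config with the `offload_optimizer` and `offload_param` sections set to CPU or NVMe. The helper name `zero3_offload_config`, the batch size, and the example NVMe path are assumptions for illustration; a real NVMe setup usually needs further tuning that is not covered here.

```python
import json


def zero3_offload_config(device="cpu", nvme_path=None):
    """Build a ZeRO stage 3 config that offloads states to CPU or NVMe."""
    if device not in ("cpu", "nvme"):
        raise ValueError("device must be 'cpu' or 'nvme'")
    offload = {"device": device, "pin_memory": True}
    if device == "nvme":
        if nvme_path is None:
            raise ValueError("nvme_path is required when device is 'nvme'")
        offload["nvme_path"] = nvme_path
    return {
        "train_micro_batch_size_per_gpu": 1,  # illustrative value
        "fp16": {"enabled": True},
        "zero_optimization": {
            "stage": 3,
            # Separate copies so each section can be edited independently
            "offload_optimizer": dict(offload),
            "offload_param": dict(offload),
        },
    }


if __name__ == "__main__":
    print(json.dumps(zero3_offload_config("cpu"), indent=2))
```

CPU offload moves the pressure from VRAM to host RAM, which is the speed-for-scale trade described above.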
00:02:56Now if you're thinking, "Cool, but memory isn't my only problem," you're right. DeepSpeed can support 3D parallelism, too: data,
00:03:04pipeline, and tensor parallelism. And it has built-in support for mixture-of-experts models,
00:03:09so sparse models don't eat you on compute.
00:03:12So now we're getting some real options. DeepSpeed integrates really nicely with Hugging Face and Accelerate,
00:03:19so you don't need to build everything from scratch.
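With the Hugging Face `Trainer`, the integration mentioned above mostly comes down to passing the config path through the `deepspeed` argument of `TrainingArguments`. The sketch below is an assumption about how such a script might be wired, not the script from the video; the helper name `build_training_args`, the output directory, and the batch size are hypothetical. It needs `transformers`, `accelerate`, and `deepspeed` installed, a CUDA GPU for `fp16=True`, and an existing config file, so the import is deferred until the function is called.

```python
def build_training_args(config_path="ds_config.json", output_dir="out"):
    """Create TrainingArguments that hand training over to DeepSpeed."""
    # Imported lazily so this file can be loaded without transformers installed
    from transformers import TrainingArguments

    return TrainingArguments(
        output_dir=output_dir,
        # These values should agree with the DeepSpeed config
        # (or the config can use "auto" for the matching fields)
        per_device_train_batch_size=4,
        fp16=True,
        deepspeed=config_path,  # path to the DeepSpeed JSON config
    )
```

The returned object is then passed to `Trainer(model=..., args=..., train_dataset=...)` as usual; the rest of the training script stays the same.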
00:03:21You basically take what you need and we can ignore all the rest of it. Now, benchmarks depend heavily on your setup,
00:03:27so don't always trust the big numbers. I tried running this a few times,
00:03:30but again, since I'm on an M4 Pro, I couldn't get this optimized any more just with this basic model.
00:03:36So it's tough to say, but other DeepSpeed projects have shown major throughput wins,
00:03:41especially when memory was a limiting factor. If you're on Windows or Linux, these could be huge wins.
00:03:46So the best move, honestly: just try it out. Start with the official configs.
00:03:51That's mostly what I did here, changed a bit for Mac. Then fix CUDA issues.
00:03:56Then watch CPU RAM if you enable offload. And if you decide to go multi-GPU later,
00:04:01Accelerate is going to help you with that. DeepSpeed is basically an "I refuse to be out of memory today" button.
00:04:07But once you understand ZeRO and how offloading works in this context, it makes larger models practical on limited hardware.
00:04:14But getting set up can definitely be confusing at first.
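For the advice above about watching CPU RAM when offload is enabled, one low-tech option on a Linux machine (Colab runtimes are Linux) is to read `/proc/meminfo`. This is a minimal sketch with hypothetical helper names, `parse_meminfo` and `available_ram_gib`; it is not something shown in the video and it will not work on macOS or Windows.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of integer kB values."""
    info = {}
    for line in text.splitlines():
        key, _, rest = line.partition(":")
        fields = rest.split()
        # Lines look like "MemAvailable:    8388608 kB"
        if fields and fields[0].isdigit():
            info[key.strip()] = int(fields[0])
    return info


def available_ram_gib(path="/proc/meminfo"):
    """Return MemAvailable in GiB (the file reports values in kB)."""
    with open(path) as f:
        return parse_meminfo(f.read())["MemAvailable"] / 1024**2
```

Printing `available_ram_gib()` every few training steps is usually enough to see whether offloaded optimizer states are about to exhaust host memory.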
00:04:17Subscribe if this saved you GPU time or if you just love these types of dev tools. We'll see you in another video.