How Big Models Fit on Small GPUs (DeepSpeed)

Better Stack

Transcript

00:00:00This is DeepSpeed, Microsoft's open-source library that plugs straight into PyTorch and fixes the real problem: memory.
00:00:07It lets you fit models that would normally crash instantly on one GPU without overloading it. Big models don't fail because they're slow.
00:00:14They fail because optimizer states, gradients, and parameters end up blowing up your VRAM. DeepSpeed's secret is ZeRO, which shards training states
00:00:23So you're not storing all the same stuff everywhere. We have videos coming out all the time. Be sure to subscribe
00:00:30Now let's jump right in and get this running. I'm gonna run all this on Google Colab since I'm on a Mac with an M4 Pro,
00:00:40so I don't have Nvidia GPUs, which kind of makes it more difficult, but I can still do this on Colab instead. First,
00:00:46I'm gonna run a quick check on my GPUs, then we can pip install all our packages.
00:00:51I'm gonna install things like PyTorch, Hugging Face, and DeepSpeed, then run ds_report to sanity-check your CUDA and compiler setup, plus a
00:00:59few more installs to make sure this runs smoothly.
00:01:02Then we're gonna create our config JSON file so we can configure DeepSpeed. This config file is the whole game.
00:01:09We'll start with ZeRO stage 2, which shards optimizer states and gradients across
00:01:14GPUs to significantly reduce memory usage, while model parameters stay replicated. Don't overthink this, because this drove me nuts.
00:01:22Just start from the official docs, change one thing at a time, and resist that urge to add random
00:01:28configs. The config you can find on both the Hugging Face and the DeepSpeed docs,
00:01:34and I got most of my Python script from those docs as well,
00:01:37but I did make some tweaks just to handle my Mac setup better.
00:01:42If this step fails, stop here, because most DeepSpeed issues are CUDA mismatches. It's not actually your model.
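A minimal ZeRO stage 2 config along the lines described here might look like the following. This is a sketch rather than the exact config from the video: the batch size and learning rate are placeholder values you would tune, and the keys follow DeepSpeed's documented config schema.

```python
import json

# Minimal ZeRO stage 2 DeepSpeed config -- a sketch; tune the batch
# size and learning rate for your own model and hardware.
ds_config = {
    "train_micro_batch_size_per_gpu": 4,
    "gradient_accumulation_steps": 1,
    "fp16": {"enabled": True},          # mixed precision halves weight/activation memory
    "optimizer": {
        "type": "AdamW",
        "params": {"lr": 5e-5},
    },
    "zero_optimization": {
        "stage": 2,                     # shard optimizer states and gradients
        "overlap_comm": True,           # overlap gradient communication with backward pass
        "contiguous_gradients": True,   # reduce memory fragmentation
    },
}

with open("ds_config.json", "w") as f:
    json.dump(ds_config, f, indent=2)
```

With Hugging Face, this file can be handed to the Trainer via `TrainingArguments(deepspeed="ds_config.json")`, or to the `deepspeed` launcher with `--deepspeed ds_config.json`.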
00:01:48We're gonna run all this now and just watch it work
00:01:51I'm gonna use a small imported dataset for this example just to get it running faster.
00:01:58And there we go after a few minutes. We can see the steps it took and the peak GPU memory, too.
00:02:03Yes, training loss on this run didn't actually change or go down much,
00:02:08but we could optimize this or use a larger dataset for better loss. Now,
00:02:13here is where people think they're covered, but then they still go out of memory.
00:02:16ZeRO comes in stages, and each one of these stages answers one question: what am I allowed to stop storing on a single GPU?
00:02:24Well, stage 1 shards optimizer states; stage 2 does the same as well as gradients. Now
00:02:30you're cutting even deeper into the stuff that quietly eats away your memory. Then we have ZeRO stage 3.
00:02:36This is the big one. It shards optimizer states, gradients, and parameters.
00:02:40This is the biggest memory win, but even that might not be enough.
00:02:45If you still can't fit the model, ZeRO-Infinity can offload to CPU or even NVMe.
00:02:50So yes, you're trading speed for scale. But sometimes the actual win is just fitting the model in the first place.
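The stage-by-stage savings can be sketched with the ZeRO paper's back-of-the-envelope accounting for mixed-precision Adam training: 2 bytes per parameter for fp16 weights, 2 for fp16 gradients, and 12 for fp32 optimizer states. The model size and GPU count below are illustrative, and activation memory is ignored.

```python
def zero_memory_per_gpu(params_billion, n_gpus, stage):
    """Rough per-GPU memory (GB) for mixed-precision Adam training,
    following the ZeRO paper's accounting: 2 bytes/param for fp16
    weights, 2 for fp16 gradients, 12 for fp32 optimizer states
    (master weights, momentum, variance). Activations excluded."""
    p = params_billion * 1e9
    weights, grads, optim = 2 * p, 2 * p, 12 * p
    if stage == 0:       # no sharding: everything replicated
        total = weights + grads + optim
    elif stage == 1:     # shard optimizer states only
        total = weights + grads + optim / n_gpus
    elif stage == 2:     # shard optimizer states + gradients
        total = weights + (grads + optim) / n_gpus
    else:                # stage 3: shard everything
        total = (weights + grads + optim) / n_gpus
    return total / 1e9   # bytes -> GB (decimal)

# Illustrative: a 7B-parameter model spread across 8 GPUs.
for s in range(4):
    print(f"stage {s}: {zero_memory_per_gpu(7, 8, s):.1f} GB per GPU")
```

The pattern to notice: stage 1 already removes most of the optimizer bloat, and stage 3 divides everything by the GPU count, which is why it's the biggest win.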
00:02:56Now, if you're thinking, cool, but memory isn't my only problem: you're right. DeepSpeed can support 3D parallelism, too, combining data,
00:03:04pipeline, and tensor parallelism, and it has built-in support for Mixture of Experts models,
00:03:09so sparse models don't eat you on compute.
00:03:12So now we're getting some real options. DeepSpeed integrates really nicely with Hugging Face and Accelerate,
00:03:19So you don't need to build everything from scratch
00:03:21You basically take what you need and can ignore all the rest of it. Now, benchmarks depend heavily on your setup,
00:03:27so don't always trust the big numbers. I tried running this a few times,
00:03:30but again, since I'm on an M4 Pro, I couldn't get this optimized any further with just this basic model.
00:03:36So it's tough to say, but other DeepSpeed projects have shown major throughput wins,
00:03:41especially when memory was a limiting factor. If you're on Windows or Linux, these could be huge wins.
00:03:46So the best move, honestly: just try it out. Start with the official configs.
00:03:51That's mostly what I did here, changed it a bit for Mac, then fixed CUDA issues.
00:03:56Then watch CPU RAM if you enable offload, and if you decide to go multi-GPU later,
00:04:01Accelerate is going to help you with that. DeepSpeed is basically an "I refuse to be out of memory today" button.
00:04:07But once you understand ZeRO and how offloading works in this context, it makes larger models practical on limited hardware.
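Offload is switched on in the same config file rather than in your training script. Here is a hedged sketch of a stage 3 config with CPU offload, using keys from DeepSpeed's documented config schema; the values are placeholders, not the video's exact settings.

```python
import json

# Sketch of a ZeRO stage 3 config with CPU offload (ZeRO-Infinity style).
# Keys follow the DeepSpeed config schema; values are placeholders.
ds_config = {
    "train_micro_batch_size_per_gpu": 1,
    "fp16": {"enabled": True},
    "zero_optimization": {
        "stage": 3,  # shard optimizer states, gradients, and parameters
        "offload_optimizer": {"device": "cpu", "pin_memory": True},
        "offload_param": {"device": "cpu", "pin_memory": True},
        # For NVMe instead of CPU, set "device": "nvme" and add an
        # "nvme_path" pointing at fast local storage.
    },
}

with open("ds_config_offload.json", "w") as f:
    json.dump(ds_config, f, indent=2)
```

This is where the "watch CPU RAM" advice applies: with `offload_optimizer` and `offload_param` set to `cpu`, host memory becomes the new ceiling.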
00:04:14But getting set up can definitely be confusing at first
00:04:17Subscribe if this saved you GPU time or if you just love these types of dev tools. We'll see you in another video.

Key Takeaway

DeepSpeed enables the training of massive AI models on limited hardware by sharding memory states and offloading data, effectively serving as an “I refuse to be out of memory” solution for developers.

Highlights

DeepSpeed is a Microsoft open-source library that integrates with PyTorch to solve VRAM limitations.

The core technology is ZeRO (Zero Redundancy Optimizer), which shards optimizer states, gradients, and parameters across GPUs.

ZeRO is organized into stages (1, 2, and 3) that progressively reduce the memory footprint on individual GPUs.

ZeRO-Infinity allows for offloading data to CPU memory or NVMe storage, trading processing speed for the ability to handle massive models.

Configuration is managed via a JSON file, and the speaker recommends starting with official documentation to avoid CUDA mismatch errors.

Timeline

Introduction to DeepSpeed and ZeRO

The video introduces DeepSpeed as Microsoft's solution for fitting large models onto small GPUs by addressing memory crashes. The speaker explains that models typically fail not because they are slow, but due to the explosion of optimizer states, gradients, and parameters in VRAM. The secret weapon mentioned is ZeRO, which shards training states so that redundant data isn't stored everywhere. This section establishes the fundamental problem of memory bottlenecks in deep learning. It sets the stage for a technical walkthrough on how to implement these optimizations.

Setup and Configuration on Google Colab

The speaker demonstrates setting up the environment using Google Colab because their local Mac M4 Pro lacks Nvidia GPUs. Essential packages like PyTorch, Hugging Face, and DeepSpeed are installed, followed by a 'ds_report' sanity check for CUDA and compiler compatibility. A critical part of the process is creating the config JSON file, which defines the ZeRO stage and memory sharding behavior. The speaker emphasizes starting with stage two, which shards optimizer states and gradients while keeping parameters replicated. This part of the video focuses on the practical 'first steps' to ensure the environment is stable before training begins.

Debugging and Initial Training Run

Advice is given to resist the urge to add random configurations and instead stick to official documentation from Hugging Face or DeepSpeed. The speaker warns that most failures at this stage are due to CUDA mismatches rather than issues with the model itself. A small dataset is imported for a test run to demonstrate the library in action and monitor peak GPU memory usage. Although the training loss might not drop significantly in this quick test, the primary goal is to prove the model can run without crashing. This section highlights the importance of methodical debugging and baseline testing in machine learning workflows.

Deep Dive into ZeRO Stages and Offloading

The core technical breakdown explains the three stages of ZeRO and the concept of ZeRO-Infinity. Stage one shards optimizer states, stage two adds gradient sharding, and stage three shards parameters for the maximum memory win. If these stages are still insufficient, ZeRO-Infinity can offload tasks to the CPU or NVMe, though this trades speed for the ability to scale. The speaker also notes that DeepSpeed supports 3D parallelism and Mixture of Experts (MoE) models to manage compute efficiency. This technical deep dive explains the 'how' behind the library's ability to handle massive scale on limited hardware.

Ecosystem Integration and Final Recommendations

The final section covers how DeepSpeed integrates with the Hugging Face and Accelerate ecosystems to simplify multi-GPU setups. The speaker shares their experience testing on an M4 Pro, noting that while benchmarks vary, DeepSpeed often provides massive throughput wins on Windows and Linux. Viewers are encouraged to monitor CPU RAM when using offload features and to use Accelerate if they decide to scale to multiple GPUs later. The video concludes by framing DeepSpeed as a practical tool that makes large-scale modeling accessible to those without unlimited hardware. This wrap-up provides a roadmap for users to transition from local testing to high-performance training.
