Netflix’s AI Model Removes Any Actor From Any Scene (VOID Model Breakdown)
Better Stack
Computers/Software, Film, Photography/Art, AI/Future Technology
Transcript
00:00:00Oh wow, that looks kinda sad, poor Kate Winslet, oh my god, just standing there alone, with
00:00:09no Jack.
00:00:11Netflix just released a very interesting open source AI tool called Video Object and Interaction
00:00:17Deletion or VOID.
00:00:19So most AI video tools are already great at erasing objects, that's nothing new.
00:00:24But they are terrible at erasing the consequences of those objects in the scene.
00:00:29So for example, if you're removing a bowling ball hitting pins, most models leave the pins
00:00:34falling over for no reason, but VOID tries to fix this issue.
00:00:39It's a new framework from Netflix and INSAIT that understands cause and effect and modifies
00:00:44the video content based on the removed objects.
00:00:47So in this video, we'll take a closer look at this model, see how it works, and I actually
00:00:52built a web app to test this model in all of its glory, so we'll do a few video tests on
00:00:57our own.
00:00:58It's gonna be a lot of fun, so let's dive into it.
00:01:05So VOID stands for Video Object and Interaction Deletion.
00:01:09To understand why this is such a big deal, you have to look at how video inpainting usually
00:01:15works.
00:01:16So standard AI erasers are basically content-aware fill on steroids.
00:01:20They look at the pixels around the hole and try to guess what should be there.
00:01:24This works for a watermark or a person standing still, but it falls apart the moment there
00:01:29is a physical interaction.
00:01:31If you remove a girl making a smoothie in a blender, a normal AI will erase the person,
00:01:36but it will leave the blender spinning and churning for no reason.
00:01:40It fixes the appearance, but it ignores the physics of other objects around it.
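To make the "content-aware fill" idea concrete, here is a minimal Python sketch of a neighbour-based hole filler on a tiny grayscale frame. It is a toy illustration of the general approach described above, not how any particular eraser (or VOID) is implemented; the function name `fill_hole` and the sample frame are made up for the example.

```python
from statistics import mean

def fill_hole(frame, hole):
    """Fill hole pixels from known 4-neighbours, working inward ring by ring."""
    height, width = len(frame), len(frame[0])
    out = [row[:] for row in frame]
    unknown = set(hole)
    while unknown:
        filled = {}
        for r, c in unknown:
            neighbours = ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1))
            known = [out[nr][nc] for nr, nc in neighbours
                     if 0 <= nr < height and 0 <= nc < width
                     and (nr, nc) not in unknown]
            if known:
                filled[(r, c)] = mean(known)
        if not filled:  # the hole covers the whole frame; nothing to borrow from
            break
        for (r, c), value in filled.items():
            out[r][c] = value
        unknown.difference_update(filled)
    return out

# The two 99s are the "person" to erase; the 50s stand in for the blender.
frame = [
    [10, 10, 10, 10],
    [10, 99, 99, 50],
    [10, 10, 10, 50],
]
patched = fill_hole(frame, hole=[(1, 1), (1, 2)])
```

Only the pixels inside the hole change. The 50s next to it are left exactly as they were, which is the limitation being described: the fill repairs appearance but never touches what the removed object was doing to its surroundings.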
00:01:46VOID is designed to solve that ghost interaction problem by reimagining a counterfactual reality.
00:01:53Basically a version of the video where that object or person never existed in the first
00:01:57place.
00:01:58And the way it pulls this off is actually pretty clever.
00:02:01It doesn't just start painting immediately.
00:02:03Instead, it uses a two-pass system.
00:02:06In the first step, they do a reasoning phase.
00:02:08First, VOID uses a vision language model and SAM2 or Segment Anything Model 2 to look at
00:02:15the scene.
00:02:16I actually did a whole separate video on how SAM2 works, so check that out if you're interested.
00:02:22So while SAM2 creates a pixel-perfect track of the object you want to remove, the AI asks
00:02:28itself the question, "If I remove this, what else changes?"
00:02:32If you remove one domino from a stack of dominoes, the AI identifies that other dominoes are causally
00:02:38affected.
00:02:39It then creates what the researchers call a "Quad Mask", a specific map that tells the
00:02:44diffusion model not just where to erase, but where to rewrite the physics of the surrounding
00:02:50area.
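One way to picture such a map is as a per-pixel label with four classes, built from the tracked object mask and a second mask of causally affected regions. The Python sketch below assumes that four-way split from the name "quad"; the label values, the constant names, and `build_quad_mask` are illustrative and are not taken from the VOID code.

```python
# Illustrative labels only; the values VOID actually writes are not given in the video.
KEEP, REMOVE, AFFECTED, OVERLAP = 0, 1, 2, 3

def build_quad_mask(object_mask, affected_mask):
    """Combine two binary masks into one four-class mask, pixel by pixel."""
    quad = []
    for obj_row, aff_row in zip(object_mask, affected_mask):
        row = []
        for obj, aff in zip(obj_row, aff_row):
            if obj and aff:
                row.append(OVERLAP)    # the object covers an affected region
            elif obj:
                row.append(REMOVE)     # erase the object here
            elif aff:
                row.append(AFFECTED)   # rewrite the physics here
            else:
                row.append(KEEP)       # leave untouched
        quad.append(row)
    return quad

object_mask = [[1, 1, 0, 0]]    # e.g. the tracked object
affected_mask = [[0, 1, 1, 0]]  # e.g. the dominoes it would knock over
quad = build_quad_mask(object_mask, affected_mask)
```

A plain binary mask would only carry the first distinction (erase or keep); the extra classes are what let the generator treat "things the object touched" differently from ordinary background.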
00:02:51And then step two is the generation and refinement.
00:02:54Once it has generated that map, a video diffusion model generates the new footage.
00:03:00Now sometimes these models can be a bit dreamy, like objects might morph or lose their shape.
00:03:05So to fix this, VOID has an optional second pass.
00:03:08It uses something called flow-warped noise to lock those shapes into place, making sure that
00:03:14while the physics change, the remaining objects stay solid and consistent.
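The video does not explain how this noise is built, so the following Python sketch only shows the general idea behind warping noise along motion: instead of sampling fresh random noise for every frame, each noise value is carried along its pixel's motion vector, so the same object keeps the same noise from frame to frame. The function `warp_noise`, the integer `(dy, dx)` flow format, and the fallback of fresh Gaussian noise for uncovered pixels are all assumptions made for the illustration.

```python
import random

def warp_noise(noise, flow, rng):
    """Carry each noise value along its (dy, dx) flow vector.

    Pixels that nothing moves into receive fresh Gaussian noise.
    """
    height, width = len(noise), len(noise[0])
    warped = [[None] * width for _ in range(height)]
    for r in range(height):
        for c in range(width):
            dy, dx = flow[r][c]
            nr, nc = r + dy, c + dx
            if 0 <= nr < height and 0 <= nc < width:
                warped[nr][nc] = noise[r][c]
    return [[v if v is not None else rng.gauss(0.0, 1.0) for v in row]
            for row in warped]

rng = random.Random(0)
noise = [[1.0, 2.0, 3.0],
         [4.0, 5.0, 6.0]]
# Every pixel moves one step to the right between the two frames.
flow = [[(0, 1)] * 3 for _ in range(2)]
next_noise = warp_noise(noise, flow, rng)
```

Because the noise follows the motion, a diffusion model that denoises both frames has a consistent starting point for the same object, which is the intuition behind shapes staying solid.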
00:03:19But you might be wondering, how do you teach an AI what didn't happen?
00:03:23So the team at Netflix and INSAIT couldn't just film a car crash and then uncrash it in
00:03:28real life to get the training data.
00:03:30Instead, they used synthetic environments like Kubric.
00:03:34They ran thousands of physics simulations where they had a before and an after version.
00:03:40One version with a collision and one version where the object was never there.
00:03:44By showing the AI both versions, it learned the relationship between an object's presence and
00:03:49its impact on the environment.
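The paired-simulation idea can be sketched with a toy world in Python. This is not the actual data pipeline; `simulate`, the 1-D setup, and the pin position are invented for the example. The same scene is simulated twice, once with the ball and once without, and the two runs together form one training pair.

```python
def simulate(with_ball, steps=6, pin_x=3):
    """Toy 1-D world: a ball rolls right one unit per step and knocks over a pin."""
    ball_x = 0 if with_ball else None
    pin_standing = True
    frames = []
    for _ in range(steps):
        if ball_x is not None and ball_x >= pin_x:
            pin_standing = False  # the collision changes the rest of the scene
        frames.append({"ball_x": ball_x, "pin_standing": pin_standing})
        if ball_x is not None:
            ball_x += 1
    return frames

def make_training_pair():
    """Return (factual, counterfactual): same scene with and without the ball."""
    return simulate(with_ball=True), simulate(with_ball=False)

factual, counterfactual = make_training_pair()
```

In the factual run the pin ends up on the floor; in the counterfactual run it stays standing for every frame. A model trained on many such pairs sees what "the object was never there" looks like, rather than just the object painted out.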
00:03:51So all of that sounds super fascinating, but let's actually test out this tool for ourselves.
00:03:57So the best way to run it would be to use a cloud GPU like a RunPod instance running on
00:04:02an H100 GPU or something equivalent.
00:04:05But I'm going to tell you right off the bat, setting it up is not straightforward at all.
00:04:10The GitHub documentation has a lot of holes and misleading information.
00:04:14So to get it working correctly, there are a few things you have to watch out for.
00:04:18For example, this command will likely fail because they never specified that you need
00:04:23the SAM3 model for this procedure.
00:04:25And this command might fail because they never specified that quad masks must be strictly
00:04:30named quadmask_0.mp4 to work properly.
00:04:35So there are a lot of these little issues that are not documented here.
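A quick pre-flight check can catch the naming issue before a long GPU run. The Python sketch below takes the spoken file name to mean `quadmask_0.mp4`; the helper `find_mask_problems` and the idea of one folder per sample are assumptions for illustration, not part of the VOID repository.

```python
from pathlib import Path

# File name as called out in the video (read here as "quadmask_0.mp4").
REQUIRED_MASK_NAME = "quadmask_0.mp4"

def find_mask_problems(sample_dir):
    """Return a list of human-readable problems; an empty list means the name is fine."""
    sample_dir = Path(sample_dir)
    if (sample_dir / REQUIRED_MASK_NAME).is_file():
        return []
    # Look for near misses such as "quad_mask.mp4" to suggest a rename.
    candidates = []
    if sample_dir.is_dir():
        candidates = sorted(p.name for p in sample_dir.glob("*.mp4")
                            if "mask" in p.name.lower())
    hint = f" (found {candidates}; rename one of them?)" if candidates else ""
    return [f"missing {REQUIRED_MASK_NAME} in {sample_dir}{hint}"]
```

Calling `find_mask_problems("path/to/sample")` before inference returns an empty list when the file is in place, and otherwise a message that names any similarly titled mask files it found.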
00:04:38And their Gradio demo is nice if you already have a mask segmented with SAM2, but they
00:04:44don't provide the graphical user interface to actually create that mask.
00:04:48So what I did is I built a custom web app that fixes all these issues and provides you
00:04:54with a ready to use UI that goes through the segmentation step, the inference step, and
00:05:00even the two pass system.
00:05:02So you can just upload your video, segment the mask and render out the final output.
00:05:07And that's exactly what we're going to do now.
00:05:09So first, you have to spin up a RunPod instance with a beefy GPU.
00:05:14I'm going to be using an H100 for this test.
00:05:17And in the template section, make sure you increase the container size to 100 gigabytes.
00:05:22And in the port section, add the port 8998 because this is where we will be exposing
00:05:27our web app.
00:05:29Then all you have to do is SSH into the pod, clone my repo, cd into it and run the run.sh
00:05:36script.
00:05:38And it will also ask you to provide a Hugging Face token so you can actually download the
00:05:42models and also make sure you have access to the SAM3 repository because this is a gated
00:05:48model and you need to request permission to use it.
00:05:51But usually the process is pretty quick and you get approved in a few minutes.
00:05:55And then you will also need a Gemini API key because in the segmentation step, the model
00:06:00uses Gemini to handle pose estimation for precise quad mask generation.
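Since the setup needs two separate credentials, it can help to confirm both are present before starting a long install. This Python sketch assumes they are supplied as environment variables named `HF_TOKEN` and `GEMINI_API_KEY`; the video does not say how the setup script actually reads them, so treat both names and the helper `missing_credentials` as hypothetical.

```python
import os

# Hypothetical variable names; the video does not say how the credentials are read.
REQUIRED = {
    "HF_TOKEN": "Hugging Face access token",
    "GEMINI_API_KEY": "Gemini API key",
}

def missing_credentials(env=None):
    """Return a description of every required credential that is unset or blank."""
    env = os.environ if env is None else env
    return [f"{name} ({description})"
            for name, description in REQUIRED.items()
            if not env.get(name, "").strip()]

if __name__ == "__main__":
    for item in missing_credentials():
        print("missing:", item)
```

Note that a valid Hugging Face token is not enough on its own for a gated model: access to the repository still has to be requested and approved on the model page, which this check cannot verify.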
00:06:06All right.
00:06:07And if you have both of those credentials, then let the run.sh script install everything.
00:06:13And once that is done, we can now launch the web app with the following command outlined
00:06:18here.
00:06:19And now on the RunPod page, you have to click on this port and that will open up our web
00:06:24app.
00:06:25And now we can finally start testing the model.
00:06:28So for my first test, I will use this famous scene from The Matrix and I will try to remove
00:06:32Neo from the scene and see what happens.
00:06:35So the very first thing you have to do is specify the removal instruction prompt.
00:06:41In this case, we can specify something like "remove the fighter in the white kimono from
00:06:45the scene."
00:06:46And after that, we get to the section where you just segment a bunch of points around the
00:06:51object or person you want to remove so that the SAM2 model knows which shape to focus on
00:06:57and then specify the output folder where we will store our result files.
00:07:02And you have to remember the name of this folder because this will be the unique identifier
00:07:06that we will be using in other tabs to identify which video we are working with.
00:07:11After that, we can proceed to the second tab, which will run our segmentation step and run
00:07:16the process.
00:07:17And once that is done, we can move to tab three, which is the inference step, which is where
00:07:22the model will actually try to remove the desired object or person.
00:07:26And here we need to type in that folder name again.
00:07:29And here we need to specify a prompt that describes what the video should look like without the
00:07:34existence of our removed object or person.
00:07:37So in our case, that would be something like "a fighter in a dark kimono standing inside
00:07:42a gym."
00:07:43And they also recommend not mentioning the removed object or person, just focusing on
00:07:48what needs to be in the video and run the inference step.
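The inputs collected across the tabs (folder name, removal instruction, and the prompt describing the scene without the removed subject) can be summarised as a small record, along with a check for the recommendation not to mention the removed subject. This is a Python sketch only: `RemovalJob`, `leaked_terms`, and the folder name `matrix_dojo` are hypothetical and do not come from the web app or the VOID repository.

```python
import re
from dataclasses import dataclass

@dataclass
class RemovalJob:
    folder: str               # reused in every tab to identify the video
    removal_instruction: str  # what to remove (segmentation step)
    background_prompt: str    # what the scene looks like afterwards (inference step)

def leaked_terms(job, removed_terms):
    """Return the removed-subject terms that appear as words in the background prompt."""
    words = set(re.findall(r"[a-z']+", job.background_prompt.lower()))
    return sorted(term for term in removed_terms if term.lower() in words)

job = RemovalJob(
    folder="matrix_dojo",  # hypothetical folder name
    removal_instruction="Remove the fighter in the white kimono from the scene",
    background_prompt="A fighter in a dark kimono standing inside a gym",
)
clean = leaked_terms(job, ["neo", "white"])

bad_job = RemovalJob("matrix_dojo", job.removal_instruction, "A gym without Neo in it")
leaks = leaked_terms(bad_job, ["neo", "white"])
```

Here `clean` is empty because the prompt only describes what remains, while `leaks` flags `neo`. The list of terms to watch for has to be supplied by hand, since the sketch does no language understanding of its own.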
00:07:52And once that is done, we can now head to the results tab and see our final video.
00:07:58And once again, we need to specify the video folder.
00:08:01And there you go.
00:08:03Look at that.
00:08:04Yeah, it looks like Morpheus is fighting a ghost.
00:08:07We can see that there are some inconsistencies with the removal of the hands and other things.
00:08:12So it's not perfect, but there is another thing we can do to try to improve it.
00:08:18We can now run it through the second pass filter, which is tab four, to try to achieve better results.
00:08:24And so after running the second pass, we now get this additional window where we see the
00:08:29result of the second pass.
00:08:32And once again, it still looks kind of weird.
00:08:34It still feels like Morpheus is fighting a ghost or dancing or something.
00:08:39So as you can see, it does not work for every scene.
00:08:42Some scenes are just going to be very weird, but it does do a good job of removing Neo from
00:08:48the scene completely.
00:08:49That being said, let's try two more fun examples.
00:08:53So here is the famous dancing scene from La La Land.
00:08:56And here I'm going to try to remove Emma Stone from the scene and see what happens.
00:09:01Wow, look at that.
00:09:03This looks almost flawless.
00:09:05I can really believe that Ryan Gosling is just dancing by himself here.
00:09:09And you see the moment where Emma Stone goes in front of Ryan Gosling.
00:09:13This transition is almost seamless.
00:09:15We can see some minor artifacts, but for the most part, wow, this is a stunning result.
00:09:21So of all the results I tested, this one was the best.
00:09:24And for some reason, I thought this was going to be the hardest example to run.
00:09:28But surprisingly, this yielded the best results of all the tests I did.
00:09:33All right.
00:09:34I want to try one more example.
00:09:35For this one, I want to try to remove Leonardo DiCaprio from the famous Titanic scene and
00:09:41see what happens.
00:09:42Oh, wow, that looks kind of sad.
00:09:48Poor Kate Winslet.
00:09:49Oh my God.
00:09:50Just standing there alone with no Jack.
00:09:53That looks interesting.
00:09:55We can see that this model did a great job of removing Leo from the scene.
00:09:59Although we can see some leftover artifacts on Kate Winslet's arm.
00:10:03And oh my God, this is so creepy.
00:10:06There is still a creepy leftover hand holding Kate's arm on the other side.
00:10:10Oh no.
00:10:11I can't unsee it now.
00:10:14Honestly, this is my bad because I did not segment those specific points for removal
00:10:19in the segmentation step.
00:10:21So that's on me.
00:10:23And we also see that Kate Winslet's face morphs a bit.
00:10:26So there's a bit of uncanny valley going on here for sure.
00:10:30So overall, I think this tool does what it advertises.
00:10:33It's just a matter of the specific video and the nature of it.
00:10:37Obviously, we can't force Morpheus to be standing still in this scene.
00:10:41But if we look at some other examples on their project page, they are absolutely incredible.
00:10:46So I think this model does have some solid capabilities and maybe with extra training,
00:10:51it might get even better.
00:10:52So there you have it folks.
00:10:53That is the VOID model in a nutshell.
00:10:55Honestly, I had so much fun testing this.
00:10:58And since it's developed by Netflix, I'm actually super curious to know what they will be using
00:11:03this for.
00:11:04Could it be used to alter some video narratives based on user preferences or choices,
00:11:09similar to how Netflix added that choose-your-own-adventure type of interactive experience
00:11:15in Black Mirror: Bandersnatch?
00:11:17You remember that?
00:11:18Who knows?
00:11:19But in any case, it's going to be very interesting to see how the use of this tool evolves going
00:11:23forward.
00:11:24Well, what do you think about this framework?
00:11:27What kind of use cases would this tool be useful for?
00:11:30Let us know your thoughts in the comment section down below.
00:11:33And folks, if you like these types of technical breakdowns, please let me know by smashing
00:11:37that like button underneath the video.
00:11:39And also don't forget to subscribe to our channel.
00:11:42This has been Andres from Better Stack, and I will see you in the next video.