This Netflix’s AI Model Removes Any Actor From Any Scene (VOID Model Breakdown)

Better Stack
Computing/Software, Movies, Photography/Art, Internet Technology

Transcript

00:00:00Oh wow, that looks kinda sad, poor Kate Winslet, oh my god, just standing there alone, with
00:00:09no Jack.
00:00:11Netflix just released a very interesting open source AI tool called Video Object and Interaction
00:00:17Deletion or VOID.
00:00:19So most AI video tools are already great at erasing objects, that's nothing new.
00:00:24But they are terrible at erasing the consequences of those objects in the scene.
00:00:29So for example, if you're removing a bowling ball hitting pins, most models leave the pins
00:00:34falling over for no reason, but VOID tries to fix this issue.
00:00:39It's a new framework from Netflix and INSAIT that understands cause and effect and modifies
00:00:44the video content based on the removed objects.
00:00:47So in this video, we'll take a closer look at this model, see how it works, and I actually
00:00:52built a web app to test this model in all of its glory, so we'll do a few video tests on
00:00:57our own.
00:00:58It's gonna be a lot of fun, so let's dive into it.
00:01:05So VOID stands for Video Object and Interaction Deletion.
00:01:09To understand why this is such a big deal, you have to look at how video inpainting usually
00:01:15works.
00:01:16So standard AI erasers are basically content-aware fill on steroids.
00:01:20They look at the pixels around the hole and try to guess what should be there.
00:01:24This works for a watermark or a person standing still, but it falls apart the moment there
00:01:29is a physical interaction.
00:01:31If you remove a girl making a smoothie in a blender, a normal AI will erase the person,
00:01:36but it will leave the blender spinning and churning for no reason.
00:01:40It fixes the appearance, but it ignores the physics of other objects around it.
00:01:46VOID is designed to solve that ghost interaction problem by reimagining a counterfactual reality.
00:01:53Basically a version of the video where that object or person never existed in the first
00:01:57place.
00:01:58And the way it pulls this off is actually pretty clever.
00:02:01It doesn't just start painting immediately.
00:02:03Instead, it uses a two-pass system.
00:02:06In the first step, they do a reasoning phase.
00:02:08First, VOID uses a vision language model and SAM2 or Segment Anything Model 2 to look at
00:02:15the scene.
00:02:16I actually did a whole separate video on how SAM2 works, so check that out if you're interested.
00:02:22So while SAM2 creates a pixel-perfect track of the object you want to remove, the AI asks
00:02:28itself the question, "If I remove this, what else changes?"
00:02:32If you remove one domino from a stack of dominoes, the AI identifies that other dominoes are causally
00:02:38affected.
00:02:39It then creates what the researchers call a "Quad Mask", a specific map that tells the
00:02:44diffusion model not just where to erase, but where to rewrite the physics of the surrounding
00:02:50area.
00:02:51And then step two is the generation and refinement.
00:02:54Once it has generated that map, a video diffusion model generates the new footage.
00:03:00Now sometimes these models can be a bit dreamy, like objects might morph or lose their shape.
00:03:05So to fix this, VOID has an optional second pass.
00:03:08It uses something called flow warp noise to lock those shapes into place, making sure that
00:03:14while the physics change, the remaining objects stay solid and consistent.
00:03:19But you might be wondering, how do you teach an AI what didn't happen?
00:03:23So the team at Netflix and INSAIT couldn't just film a car crash and then uncrash it in
00:03:28real life to get the training data.
00:03:30Instead, they used synthetic environments like Kubric.
00:03:34They ran thousands of physics simulations where they had a before and an after version.
00:03:40One version with a collision and one version where the object was never there.
00:03:44By showing AI both versions, it learned the relationship between an object's presence and
00:03:49its impact on the environment.
00:03:51So all of that sounds super fascinating, but let's actually test out this tool for ourselves.
00:03:57So the best way to run it would be to use a cloud GPU like a RunPod module running on
00:04:02an H100 GPU or something equivalent.
00:04:05But I'm going to tell you right off the bat, setting it up is not straightforward at all.
00:04:10The GitHub documentation has a lot of holes and misleading information.
00:04:14So to get it working correctly, there are a few things you have to watch out for.
00:04:18For example, this command will likely fail because they never specified that you need
00:04:23the SAM3 model for this procedure.
00:04:25And this command might fail because they never specified that quad masks must be strictly
00:04:30named quad mask underscore zero dot MP4 to work properly.
00:04:35So there are a lot of these little issues that are not documented here.
00:04:38And their Gradio demo is nice if you already have a mask segmented with SAM2, but they
00:04:44don't provide the graphical user interface to actually create that mask.
00:04:48So what I did is I built a custom web app that fixes all these issues and provides you
00:04:54with a ready to use UI that goes through the segmentation step, the inference step, and
00:05:00even the two pass system.
00:05:02So you can just upload your video, segment the mask and render out the final output.
00:05:07And that's exactly what we're going to do now.
00:05:09So first, you have to spin up a RunPod instance with a beefy GPU.
00:05:14I'm going to be using an H100 for this test.
00:05:17And in the template section, make sure you increase the container size to 100 gigabytes.
00:05:22And in the port section, add the port 8998 because this is where we will be exposing
00:05:27our web app.
00:05:29Then all you have to do is SSH into the pod, clone my repo, cd into it and run the run.sh
00:05:36script.
00:05:38And it will also ask you to provide a Hugging Face token so you can actually download the
00:05:42models and also make sure you have access to the SAM3 repository because this is a gated
00:05:48model and you need to request permission to use it.
00:05:51But usually the process is pretty quick and you get approved in a few minutes.
00:05:55And then you will also need a Gemini API key because in the segmentation step, the model
00:06:00uses Gemini to determine pose estimation for a precise quad mask generation.
00:06:06All right.
00:06:07And if you have both of those credentials, then let the run.sh script install everything.
00:06:13And once that is done, we can now launch the web app with the following command outlined
00:06:18here.
00:06:19And now on the RunPod page, you have to click on this port and that will open up our web
00:06:24app.
00:06:25And now we can finally start testing the model.
00:06:28So for my first test, I will use this famous scene from The Matrix and I will try to remove
00:06:32Neo from the scene and see what happens.
00:06:35So the very first thing you have to do is specify the removal instruction prompt.
00:06:41In this case, we can specify something like remove the fighter in the white kimono from
00:06:45the scene.
00:06:46And after that, we get to the section where you just segment a bunch of points around the
00:06:51object or person you want to remove so that the SAM2 model knows which shape to focus on
00:06:57and then specify the output folder where we will store our result files.
00:07:02And you have to remember the name of this folder because this will be the unique identifier
00:07:06that we will be using in other tabs to identify which video we are working with.
00:07:11After that, we can proceed to the second tab, which will run our segmentation step and run
00:07:16the process.
00:07:17And once that is done, we can move to tab three, which is the inference step, which is where
00:07:22the model will actually try to remove the desired object or person.
00:07:26And here we need to type in that folder name again.
00:07:29And here we need to specify a prompt that describes what the video should look like without the
00:07:34existence of our removed object or person.
00:07:37So in our case, that would be something like a fighter in a dark kimono standing inside
00:07:42a gym.
00:07:43And they also recommend not mentioning the removed object or person, just focusing on
00:07:48what needs to be in the video and run the inference step.
00:07:52And once that is done, we can now head to the results tab and see our final video.
00:07:58And once again, we need to specify the video folder.
00:08:01And there you go.
00:08:03Look at that.
00:08:04Yeah, it looks like Morpheus is fighting a ghost.
00:08:07We can see that there are some inconsistencies with the removal of the hands and other things.
00:08:12So it's not perfect, but there is another thing we can do to try to improve it.
00:08:18We can now run it through the second pass filter, which is tab four, to try to achieve better results.
00:08:24And so after running the second pass, we now get this additional window where we see the
00:08:29result of the second pass.
00:08:32And once again, it still looks kind of weird.
00:08:34It still feels like Morpheus is fighting a ghost or dancing or something.
00:08:39So as you can see, it does not work for every scene.
00:08:42Some scenes are just going to be very weird, but it does do a good job of removing Neo from
00:08:48the scene completely.
00:08:49That being said, let's try two more fun examples.
00:08:53So here is the famous dancing scene from La La Land.
00:08:56And here I'm going to try to remove Emma Stone from the scene and see what happens.
00:09:01Wow, look at that.
00:09:03This looks almost flawless.
00:09:05I can really believe that Ryan Gosling is just dancing by himself here.
00:09:09And you see the moment where Emma Stone goes in front of Ryan Gosling.
00:09:13This transition is almost seamless.
00:09:15We can see some minor artifacts, but for the most part, wow, this is a stunning result.
00:09:21So from all of the results I tested, this one was the best.
00:09:24And for some reason, I thought this is going to be the hardest example to run.
00:09:28But surprisingly, this yielded the best results from all the tests I did.
00:09:33All right.
00:09:34I want to try one more example.
00:09:35And this one, I want to try to remove Leonardo DiCaprio from the famous Titanic scene and
00:09:41see what happens.
00:09:42Oh, wow, that looks kind of sad.
00:09:48Poor Kate Winslet.
00:09:49Oh my God.
00:09:50Just standing there alone with no Jack.
00:09:53That looks interesting.
00:09:55We can see that this model did a great job of removing Leo from the scene.
00:09:59Although we can see some leftover artifacts on Kate Winslet's arm.
00:10:03And oh my God, this is so creepy.
00:10:06There is still a creepy leftover hand holding Kate's arm on the other side.
00:10:10Oh no.
00:10:11I can't unsee it now.
00:10:14Honestly, this is my bad because I did not segment those specific points for removal
00:10:19in the segmentation step.
00:10:21So that's on me.
00:10:23And we also see that Kate Winslet's face morphs a bit.
00:10:26So there's a bit of uncanny valley going on here for sure.
00:10:30So overall, I think this tool does what it advertises.
00:10:33It's just a matter of the specific video and the nature of it.
00:10:37Obviously, we can't force Morpheus to be standing still in this scene.
00:10:41But if we look at some other examples on their project page, they are absolutely incredible.
00:10:46So I think this model does have some solid capabilities and maybe with extra training,
00:10:51it might get even better.
00:10:52So there you have it folks.
00:10:53That is the VOID model in a nutshell.
00:10:55Honestly, I had so much fun testing this.
00:10:58And since it's developed by Netflix, I'm actually super curious to know what will they be using
00:11:03this for?
00:11:04Could it be used to alter some video narratives based on user preferences or choices?
00:11:09Similar to how Netflix added that choose your own adventure type of interactive experience
00:11:15on the Black Mirror Bandersnatch show?
00:11:17You remember that?
00:11:18Who knows?
00:11:19But in any case, it's going to be very interesting to see how the use of this tool evolves going
00:11:23forward.
00:11:24Well, what do you think about this framework?
00:11:27What kind of use cases would this tool be useful for?
00:11:30Let us know your thoughts in the comment section down below.
00:11:33And folks, if you like these types of technical breakdowns, please let me know by smashing
00:11:37that like button underneath the video.
00:11:39And also don't forget to subscribe to our channel.
00:11:42This has been Andres from Better Stack and I will see you in the next videos.

Key Takeaway

Netflix's VOID model solves the ghost interaction problem in video inpainting by using a two-pass diffusion system and synthetic physics training to ensure surrounding objects react naturally when a primary object is removed.

Highlights

  • The VOID model uses a two-pass system to remove objects while rewriting the physics of the surrounding scene to avoid ghost interactions.

  • Training data for the model relies on synthetic environments like Kubric, featuring thousands of physics simulations comparing collision and non-collision versions of the same event.

  • A vision language model combined with Segment Anything Model 2 (SAM2) creates pixel-perfect tracking and identifies causal effects of removed objects.

  • The generation phase utilizes a Quad Mask to instruct the diffusion model on where to erase pixels and where to rewrite physical interactions.

  • Flow warp noise is applied in an optional second pass to stabilize object shapes and maintain visual consistency during movement.

  • Running the model on a cloud GPU requires 100 GB of container storage, access to the gated SAM3 model on Hugging Face, and a Gemini API key for pose estimation.
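
The setup requirements listed here, together with the strict mask file name mentioned in the video, lend themselves to a quick check before starting a long GPU job. The following is only a minimal sketch: the environment variable names `HF_TOKEN` and `GEMINI_API_KEY`, the demo file names, and the `preflight` helper are assumptions for illustration, and the mask name is left as a placeholder because the exact spelling is not confirmed here.

```python
def preflight(env, sample_files, required_env, mask_name):
    """Return a list of human-readable problems; an empty list means ready.

    env:          mapping of environment variables (e.g. os.environ)
    sample_files: file names found in the video's working folder
    required_env: credential variable names to check (assumed names)
    mask_name:    the mask file name the pipeline expects (placeholder)
    """
    problems = []
    for key in required_env:
        if not env.get(key):  # missing or empty counts as absent
            problems.append(f"missing credential: {key}")
    if mask_name not in sample_files:
        problems.append(f"missing mask file: {mask_name}")
    return problems


# Demo with made-up values: one credential is empty and the mask is misnamed.
demo_env = {"HF_TOKEN": "hf_example", "GEMINI_API_KEY": ""}
demo_files = ["input_video.mp4", "mask.mp4"]
problems = preflight(
    demo_env,
    demo_files,
    required_env=["HF_TOKEN", "GEMINI_API_KEY"],
    mask_name="EXPECTED_MASK_NAME.mp4",  # placeholder, not the real name
)
for problem in problems:
    print(problem)
```

In real use one would pass `os.environ` and `os.listdir(folder)` instead of the demo values, and replace the placeholder with whatever name the repository actually requires.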

Timeline

Limitations of standard AI video erasers

  • Standard video inpainting tools function as content-aware fill by guessing pixels based on surrounding data.
  • Traditional models fail to remove the physical consequences of an object, such as bowling pins falling after a ball is erased.
  • With standard AI erasers, a blender keeps spinning even after the person operating it is removed from the scene.

Most AI tools focus on visual appearance rather than the physics of a scene. They effectively remove static objects or watermarks but struggle with dynamic physical interactions. VOID addresses this by creating a counterfactual reality where the removed object never existed.
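
To make the "guess from the surrounding pixels" idea concrete, here is a toy sketch of appearance-only filling on a single grayscale frame. It is not the algorithm of any real eraser: the `fill_hole` helper and the tiny example frame are assumptions for illustration. The point is that only the masked pixel changes, while a pixel that shows a consequence of the removed object (the value 5.0 below) is left exactly as it was.

```python
def fill_hole(frame, hole):
    """Fill masked pixels with the mean of their already-known 4-neighbours.

    frame: 2D list of floats (grayscale values)
    hole:  2D list of bools, True where the object should be erased
    Toy stand-in for appearance-only inpainting.
    """
    h, w = len(frame), len(frame[0])
    out = [row[:] for row in frame]
    unknown = {(y, x) for y in range(h) for x in range(w) if hole[y][x]}
    while unknown:
        updates = {}
        for (y, x) in unknown:
            vals = [
                out[ny][nx]
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1))
                if 0 <= ny < h and 0 <= nx < w and (ny, nx) not in unknown
            ]
            if vals:
                updates[(y, x)] = sum(vals) / len(vals)
        if not updates:
            break  # the whole frame is hole, nothing to copy from
        for (y, x), value in updates.items():
            out[y][x] = value
        unknown -= set(updates)
    return out


# 9.0 is the object to erase; 5.0 stands for a side effect elsewhere in the frame.
frame = [
    [1.0, 1.0, 1.0, 1.0],
    [1.0, 9.0, 1.0, 5.0],
    [1.0, 1.0, 1.0, 1.0],
]
hole = [
    [False, False, False, False],
    [False, True, False, False],
    [False, False, False, False],
]
filled = fill_hole(frame, hole)
```

Because the fill never looks outside the mask, anything the removed object caused elsewhere in the frame survives untouched, which is the failure mode described above.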

Two-pass framework and Quad Mask generation

  • The reasoning phase uses vision language models to determine how removing one item affects the rest of the scene.
  • The Quad Mask identifies specific areas where the model must rewrite physics rather than just filling in background pixels.
  • Flow warp noise in the second pass prevents objects from morphing or losing solid shapes during the rendering process.

The process begins with SAM2 creating a precise track of the target object. The AI then calculates causal changes, such as how removing one domino affects a falling stack. This data guides a video diffusion model to generate new footage while the second pass ensures remaining objects stay solid and consistent.
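
The name "Quad Mask" suggests a map with four possible values per pixel. As a rough sketch under that assumption, the snippet below combines two binary masks, one for the object to remove and one for the region it causally affects, into a single four-valued map. The label names, their numeric values, and the `build_quad_mask` helper are hypothetical; the video does not state the encoding VOID actually uses.

```python
# Hypothetical labels; the real encoding is not given in the video.
KEEP, ERASE, REWRITE, OVERLAP = 0, 1, 2, 3


def build_quad_mask(object_mask, affected_mask):
    """Combine two binary masks (2D lists of 0/1) into one four-valued map.

    object_mask:   1 where the object to remove is visible
    affected_mask: 1 where the scene changes if the object is removed
    """
    quad = []
    for obj_row, aff_row in zip(object_mask, affected_mask):
        row = []
        for obj, aff in zip(obj_row, aff_row):
            if obj and aff:
                row.append(OVERLAP)   # object sits on top of an affected area
            elif obj:
                row.append(ERASE)     # plain removal, background fill
            elif aff:
                row.append(REWRITE)   # physics must be regenerated here
            else:
                row.append(KEEP)      # leave the pixel alone
        quad.append(row)
    return quad


# One-row example: the object covers columns 0-1, its effect covers columns 1-2.
quad = build_quad_mask([[1, 1, 0, 0]], [[0, 1, 1, 0]])
```

Keeping "erase" and "rewrite" as separate labels is what lets a downstream generator treat the two regions differently instead of inpainting everything the same way.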

Synthetic training and technical setup

  • The AI learns cause and effect through synthetic physics simulations in the Kubric environment.
  • Deployment requires high-performance hardware like an H100 GPU and specific gated model permissions for SAM3.
  • Gemini API keys provide the pose estimation necessary for precise quad mask generation during the segmentation step.

Because real-world footage of collisions being 'undone' does not exist, researchers use simulated environments to show the AI both versions of an event. Setting up the model manually involves navigating undocumented requirements, such as a strictly enforced quad mask file name (read out in the video as "quad mask underscore zero dot MP4"). Custom web apps can streamline these steps by integrating segmentation and inference into a single interface.
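
The paired-simulation idea can be illustrated with a deliberately tiny example. This is a toy one-dimensional "bowling" world written for this summary, not the Kubric API and not the researchers' pipeline; the `simulate` and `make_training_pair` helpers and their parameters are assumptions. Each run produces a factual sequence where a ball knocks over a pin and a counterfactual sequence where the ball never existed.

```python
def simulate(num_steps, pin_pos, ball_start=None):
    """Return a list of (ball_position, pin_standing) states, one per step.

    ball_start=None means the ball never existed in this version of the scene.
    The ball moves one unit to the right per step and knocks the pin over
    once it reaches the pin's position.
    """
    states = []
    ball = ball_start
    standing = True
    for _ in range(num_steps):
        if ball is not None and ball >= pin_pos:
            standing = False
        states.append((ball, standing))
        if ball is not None:
            ball += 1
    return states


def make_training_pair(num_steps, pin_pos, ball_start):
    """Run the same scene twice: with the ball and without it."""
    factual = simulate(num_steps, pin_pos, ball_start)
    counterfactual = simulate(num_steps, pin_pos, None)
    return factual, counterfactual


factual, counterfactual = make_training_pair(num_steps=4, pin_pos=2, ball_start=0)
```

A model trained on many such pairs sees, for the same scene, that the pin falls only when the ball is present, which is the relationship between an object's presence and its effects described above.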

Performance results and cinematic tests

  • Removing Emma Stone from a La La Land dance sequence produced nearly flawless results with seamless transitions.
  • Complex interactions, such as Morpheus fighting Neo in The Matrix, can result in 'ghost-like' movements where the remaining actor appears to be fighting thin air.
  • Partial segmentation in a Titanic scene left behind a 'creepy' hand artifact on the remaining actor's arm.

Testing across different film scenes shows varying degrees of success based on the nature of the interaction. While some scenes look stunningly realistic, others suffer from the 'uncanny valley' effect where faces morph or limbs remain partially visible. The model has potential for interactive media, such as altering narratives based on user choices in shows like Black Mirror.
