Google’s Latest Genie 3 AI Hype Deserves a Closer Look

Better Stack

Transcript

00:00:00So last week, Google unveiled Genie 3, their flagship infinite world model, where you get
00:00:05to simulate an environment and interact with it like in a real video game.
00:00:10And suddenly all the video game stocks absolutely plummeted out of fear that this might be the
00:00:16beginning of the end of the video game industry.
00:00:20And then something even more interesting happened.
00:00:22A Chinese tech company called Robiant released their own open source Genie competitor, which
00:00:28appears to have even better graphics than its Google counterpart.
00:00:32And now all of a sudden the floodgates are open for the race to determine which company
00:00:37will be the first one to replace traditional video games with this new kind of gaming tech.
00:00:43But while everyone is hyping up this new infinite world model craze, I'm here to tell you this
00:00:49might just be a hyped up promise with no actual substance.
00:00:54What makes me so sure of it?
00:00:55Well, that's what we're going to talk about in today's video.
00:01:02So as soon as Genie 3 came out, I rushed to the site to try it for myself.
00:01:07But as soon as I clicked the explore button, I was presented with a disappointing 404 page.
00:01:14And that's because I live in Canada.
00:01:16And for the time being, Google has only allowed the citizens of the United States to try out
00:01:20this state of the art technological wonder.
00:01:23So obviously I turned on my VPN and tried again from a US location.
00:01:27And this time I was met with another disappointing rejection, stating that I need to be an UltraPlan
00:01:33member to access this revolutionary piece of software.
00:01:37And if you're wondering how much the UltraPlan costs, well, let's just say it's a bit more
00:01:41than I would be comfortable paying just to try out this overhyped AI tool.
00:01:46But this raises the question, why is it so hard to get your hands on Genie 3 in the first place?
00:01:51And the answer to this question will be very important to our story, but I'll get back to
00:01:56that later in this video.
00:01:57So although I had neither the luck nor the disposable funds to try out Genie 3, luckily,
00:02:04on the other side of the globe, a Chinese company called Robiant, which appears to be
00:02:09a subsidiary of Ant Group, which in turn is an affiliate of Alibaba Group, the
00:02:15same company behind Qwen, came out with their own infinite world model
00:02:20called Lingbot World, which surprisingly is open source.
00:02:25So that means we can actually test it out and see what it's capable of.
00:02:29And looking at their examples, it looked absolutely stunning.
00:02:32But once I started inspecting the project page, I was met with another huge disappointment.
00:02:38Although their project page is filled with example videos where you can freely walk around
00:02:43the space with your arrow keys, in reality, this model version that involves full character
00:02:48controls is still under development.
00:02:51They are planning to release Lingbot Fast, which would be a full Genie 3 equivalent, but
00:02:56we don't know when that is coming yet.
00:02:57For the time being, we get access to their 14 billion parameter base model, which offers, quote,
00:03:03"high fidelity, controllable, and logically consistent simulations."
00:03:08But basically the only thing this model is capable of doing as of now is generating a video.
00:03:14Yep, just the video.
00:03:16So I was kinda confused, where does the control factor come in?
00:03:20Well, they do have the option to provide your own camera intrinsics and pose values, so you
00:03:25can in a sense control the camera movement, which I guess offers an alternative to navigation
00:03:31using the arrow keys, but you would have to pre-record that.
00:03:35How is it different from any other video generator out there that also offers the ability to control
00:03:40camera movements?
00:03:41Well, here's the key distinction.
00:03:44In a regular AI video generator, the AI model tries to always predict the next frame as the
00:03:50reference video progresses, and we've seen in many internet meme videos how terribly wrong
00:03:55this gets if the video just keeps on going, and that is because the model doesn't retain
00:04:00information about what's going on outside of the frame.
00:04:04So if a camera pans away from the object and then pans back, the object might not be there
00:04:09anymore because the whole scene is generated on the fly.
00:04:13This is where the 14 billion parameter geometric brain of the Lingbot World model comes into
00:04:18play.
00:04:19Unlike a standard video generator that simply guesses the next set of pixels, Lingbot World
00:04:24uses camera intrinsics data and 6 degrees of freedom poses to match every pixel to a specific
00:04:31point in 3D space.
00:04:33It creates what researchers call "object permanence" because it understands the mathematical relationship
00:04:39between the camera's lens and the environment.
00:04:42So basically it remembers that a specific object exists at specific coordinates.
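The pixel-to-3D mapping described here can be sketched with standard pinhole-camera math. This is a minimal illustration of the general idea, not Lingbot World's actual code; the function name and numbers below are mine:

```python
import numpy as np

def pixel_to_world(u, v, depth, K, R, t):
    """Back-project a pixel (u, v) with known depth into world coordinates.

    K is the 3x3 camera intrinsics matrix; (R, t) is the camera's
    6-DoF pose (rotation + translation) in the world frame.
    """
    # Undo the intrinsics: pixel -> ray direction in camera coordinates
    ray_cam = np.linalg.inv(K) @ np.array([u, v, 1.0])
    point_cam = ray_cam * depth          # scale the ray by depth
    # Apply the pose: camera coordinates -> world coordinates
    return R @ point_cam + t

# A simple pinhole camera at the origin, looking down the world z-axis
K = np.array([[500.0,   0.0, 320.0],
              [  0.0, 500.0, 240.0],
              [  0.0,   0.0,   1.0]])
R = np.eye(3)     # no rotation
t = np.zeros(3)   # camera at the origin

# The image-center pixel at depth 2.0 lands on the optical axis
p = pixel_to_world(320, 240, 2.0, K, R, t)
print(p)  # → [0. 0. 2.]
```

Because every pixel resolves to fixed world coordinates, the model can check a revisited region against what it generated before, instead of guessing from scratch.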
00:04:47And this structural integrity is why this model is so massive and computationally hungry.
00:04:52How hungry?
00:04:53Oh boy, let me tell you.
00:04:55I tried deploying the Lingbot World model on an instance with a single RTX 5090 GPU and
00:05:02I tried running the basic sample demo they provided and it just crashed immediately.
00:05:07It was kind of naive of me to think that a single 5090 would be able to handle that load.
00:05:13Then I tried running it with dual 5090s and nope, it still crashed.
00:05:18Then I tried it with 4 5090s and once again, it still crashed.
00:05:23Then I spun up a container with 8 RTX 5090s and tried running the basic demo example and
00:05:31it still crashed.
00:05:32See, the reason is that when running this infinite world model for a prolonged period of time,
00:05:38the amount of scene information the model has to keep in memory grows bigger and bigger,
00:05:44up to a point where you just get an out-of-memory error because you have run out of
00:05:49RAM.
00:05:50But I did manage to successfully run the sample demo on an 8 GPU setup by lowering the sample
00:05:55size from the default 70 to just 20.
00:05:59And honestly, the difference between 70 and 20 samples was not that noticeable.
00:06:03But this just shows how insanely computationally expensive running this infinite world model
00:06:09becomes.
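The ballooning can be sketched with a toy calculation. Every number here is made up for illustration; the point is only that accumulating per-frame scene state against a fixed memory budget guarantees an eventual OOM:

```python
# Toy model of why long rollouts run out of memory: each generated frame
# adds scene state (e.g. cached features) that the model must keep around
# so revisited regions stay consistent. All numbers are illustrative.

GPU_MEMORY_GB = 8 * 80      # e.g. eight 80 GB cards pooled
MODEL_WEIGHTS_GB = 28       # ~14B params at 16-bit (14e9 * 2 bytes)
STATE_PER_FRAME_GB = 0.5    # hypothetical per-frame scene cache

def frames_until_oom(total_gb, weights_gb, per_frame_gb):
    """How many frames fit before accumulated state exhausts memory."""
    budget = total_gb - weights_gb
    return int(budget // per_frame_gb)

n = frames_until_oom(GPU_MEMORY_GB, MODEL_WEIGHTS_GB, STATE_PER_FRAME_GB)
print(f"Frames before OOM: {n}")  # → Frames before OOM: 1224
```

Whatever the real per-frame cost is, the growth is monotonic, which is why every run has a hard ceiling no matter how many GPUs you stack.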
00:06:10And getting back to Genie 3, this is exactly why they allow access to it for Ultra members
00:06:16only: they need to somehow recoup the GPU costs of running this thing.
00:06:21And this is also why you only get a certain amount of seconds for one demo because at some
00:06:27point the memory just balloons to a point that the whole system just comes crashing down.
00:06:32And to give you an idea of how insanely expensive it would be to run such a model on consumer
00:06:37grade hardware, a single RTX 5090 costs up to $5,000.
00:06:43Now take 8 of those, which is the minimum required for running this thing.
00:06:48Man, even saying that out loud sounds ridiculous.
00:06:51But anyway, 8 of those will cost you up to $40,000, not to mention all the other parts
00:06:57and RAM which is also exploding in price right now.
00:07:01And when you take that into account, this price tag, plus the 60-second runtime limit at
00:07:06which Genie is capping its runs, plus the ballooning RAM issue, are exactly the
00:07:12reasons why this whole infinite world model thing is just hype and is not remotely
00:07:18achievable on consumer hardware with the current architecture.
00:07:24And even the authors of both of these tools are admitting these problems.
00:07:28The high inference cost currently necessitates enterprise grade GPUs making the technology
00:07:34inaccessible on consumer hardware.
00:07:37The simulation lacks long term stability.
00:07:39This often leads to environmental drifting where the scene gradually loses structural
00:07:44integrity over extended durations.
00:07:46Exactly.
00:07:48And at least the Lingbot team is being open about it.
00:07:51Let's see what Google has to say about it.
00:07:53The model can support a few minutes of continuous interaction rather than extended hours.
00:07:59I mean, they're not openly admitting it, but at this point we all know why that is.
00:08:04So that's why I'm telling you folks, traditional video games are not disappearing anytime soon.
00:08:09This just seems like a pipe dream at this point and maybe, just maybe, in the future, if they
00:08:15figure out how to solve these computational problems, we might start thinking about this.
00:08:20But right now, bruh, come on.
00:08:23I'm also super curious to try out Lingbot Fast when it finally arrives.
00:08:27But until then, I don't think this technology is going mainstream anytime soon.
00:08:32But if you're curious about trying out Lingbot World for yourself, here's my advice.
00:08:37Don't do what I did.
00:08:38Don't stack up eight RTX 5090s together because such a configuration on a platform like RunPod
00:08:45will drain $7 every hour of its runtime.
00:08:48Instead, spin up a single H200 container, which only costs $3.50 per hour, set the
00:08:55nproc_per_node flag to 1, and maybe lower the sample count to 50 or even 20, and you'll be
00:09:01good to go.
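As a sketch, assuming the demo is launched with torchrun (the entry-point script name and the demo's flag spellings below are my guesses, so check the project README for the real ones), the single-GPU invocation would look something like this:

```python
import shlex

def build_launch_cmd(nproc=1, sample_count=20, prompt_image="viking.png"):
    """Assemble a hypothetical single-GPU launch command for the demo."""
    cmd = [
        "torchrun",
        f"--nproc_per_node={nproc}",  # 1 process = 1 GPU (a single H200)
        "inference.py",               # hypothetical demo entry point
        "--image", prompt_image,
        "--num-samples", str(sample_count),  # default 70; 20 is much cheaper
    ]
    return shlex.join(cmd)

print(build_launch_cmd())
# → torchrun --nproc_per_node=1 inference.py --image viking.png --num-samples 20
```

The two knobs that matter are the process count (one process per GPU) and the sample count, which trades a little visual quality for a lot of memory and runtime.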
00:09:02You could also use the 4-bit quantized version of this model, created by the user Caelan Humphries,
00:09:08which significantly reduces GPU memory consumption while maintaining comparable visual quality
00:09:13for inference.
00:09:15So you technically could try to run that on a single RTX 5090.
00:09:19And if you do so, let me know how it goes.
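The memory savings from quantization follow directly from the parameter count. A back-of-the-envelope calculation for the weights alone (activations and scene cache come on top of this):

```python
# Weight memory for a 14B-parameter model at different precisions —
# the reason a 4-bit quantization can fit where 16-bit weights cannot.

PARAMS = 14e9  # 14 billion parameters

def weight_gb(bits_per_param):
    """Weight memory in GB at the given precision."""
    return PARAMS * bits_per_param / 8 / 1e9

print(f"16-bit: {weight_gb(16):.0f} GB")  # 28 GB — over a 24 GB consumer card
print(f" 4-bit: {weight_gb(4):.0f} GB")   # 7 GB — leaves headroom for the rest
```

A 4x reduction in weight memory is what moves the model from "enterprise only" into the range of a single high-end consumer GPU, at least for short runs.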
00:09:21So as for myself, I ran the basic demo on an H200 container and yeah, basically got the
00:09:28same result as their demo page.
00:09:30And then I generated an AI image of this Viking fighting against Loki and fed this image to
00:09:36the same command.
00:09:37And this is the result I got.
00:09:39I guess you can see how the model maintains the integrity of the environment and the castle
00:09:44throughout the video, but it still generates some weird artifacts.
00:09:48So honestly, I don't know what to think of it.
00:09:52I'm pretty sure I could generate a better gameplay video using a standard ComfyUI pipeline, which
00:09:59by the way, if you're interested in learning how to make your own video generator like Sora
00:10:04without the heavy compute cost, check out this video I did a while ago on that topic.
00:10:09So there you have it, folks, that is my take on Genie 3 and all the hype and the future
00:10:15of video games.
00:10:16I really appreciate the team behind Lingbot open sourcing their models so we can get a
00:10:20better insight as to how a Genie like model works.
00:10:25But those are just my two cents on the topic.
00:10:27More importantly, what do you think about these infinite world models?
00:10:30I'm curious to know what you think, so drop your thoughts in the comments section down
00:10:35below.
00:10:36And folks, if you found this video useful, let me know by smashing that like button underneath
00:10:40the video.
00:10:41And also don't forget to subscribe to our channel for more videos like this one.
00:10:45This has been Andris from Better Stack and I will see you in the next video.
00:11:00(upbeat music)

Key Takeaway

While infinite world models like Genie 3 and Lingbot World offer a revolutionary glimpse into AI-generated gaming, extreme hardware requirements and technical instability currently make them more of a high-priced 'hype' than a viable replacement for traditional video games.

Highlights

Google's Genie 3 and Robiant's Lingbot World are 'infinite world models' designed to simulate interactive, game-like environments.

The technology faces massive computational barriers, requiring enterprise-grade GPUs like the H200 or multiple RTX 5090s to run.

Memory management is a critical flaw, as these models often crash due to ballooning RAM usage during extended sessions.

Lingbot World uses 14 billion parameters and 3D geometric data to achieve 'object permanence,' distinguishing it from standard video generators.

Current limitations such as 'environmental drifting' and high inference costs mean traditional video games are not at risk of being replaced yet.

Timeline

The Rise of Infinite World Models

The speaker introduces Google's Genie 3, a flagship model designed to simulate interactive environments similar to video games. This announcement caused a temporary panic in the stock market regarding the future of the traditional gaming industry. Shortly after, a Chinese company named Robiant released an open-source competitor that claims to offer even better graphics. The race to determine which company will successfully replace traditional game engines has officially begun. However, the speaker expresses skepticism, suggesting that the current craze may lack actual substance.

Accessibility Barriers and Google's Paywall

The analyst details his frustrating attempt to access Genie 3, which was met with regional 404 errors in Canada. Even after using a VPN to appear in the United States, the software was locked behind an expensive 'Ultra Plan' membership. This high cost of entry raises questions about why the technology is kept out of reach of the general public. The speaker hints that the answer lies in the massive infrastructure costs required to sustain the model. This section highlights the disparity between AI marketing hype and actual consumer availability.

Lingbot World: The Open Source Alternative

The video shifts focus to Robiant's Lingbot World, a subsidiary of Ant Group and Alibaba, which released an open-source 14 billion parameter model. While the project page shows impressive demos of character movement, the currently available version is limited to generating videos rather than full interactive control. The speaker explains that the 'Lingbot Fast' version, which would allow for real-time navigation, is still under development. Currently, users can only manipulate camera intrinsics to simulate movement within a generated scene. This highlights the gap between promotional videos and the current state of open-source AI tools.

The Science of Object Permanence in AI

The speaker explains the technical distinction between standard AI video generators and world models like Lingbot. Unlike regular models that simply guess pixels and lose track of objects off-screen, Lingbot uses 6 degrees of freedom (6DoF) and camera intrinsics to map pixels to 3D space. This creates 'object permanence,' allowing the AI to remember that an object exists even if the camera pans away and returns. This geometric understanding is what makes the model so massive and computationally demanding compared to simpler architectures. It represents a significant leap in how AI perceives and maintains structural integrity in a virtual environment.

The Hardware Nightmare: GPU Crashing and Costs

The analyst recounts his attempts to run the Lingbot model on consumer-grade hardware, starting with a single RTX 5090 GPU. Even after scaling up to a setup with eight RTX 5090s—costing roughly $40,000—the system continued to crash due to out-of-memory errors. The model's memory requirements balloon as the simulation continues, eventually exceeding available RAM and causing a total system failure. This explains why Google limits Genie 3 sessions to 60 seconds and restricts access to paying members. It proves that the current architecture is fundamentally unsuited for standard household computers.

Structural Integrity and Environmental Drifting

This section addresses the technical shortcomings admitted by the researchers themselves, such as the lack of long-term stability. Over extended durations, these models suffer from 'environmental drifting,' where the scene loses its logical structure and begins to warp. Google acknowledges that their model supports only a few minutes of interaction rather than hours of gameplay. For these reasons, the speaker asserts that traditional video games are not disappearing anytime soon. The technology remains a 'pipe dream' until the industry solves the massive computational and stability hurdles.

Practical Advice for Testing and Final Verdict

In the concluding segment, the speaker offers advice for developers interested in testing Lingbot World without spending a fortune on hardware. He recommends using a single H200 container on platforms like RunPod and lowering the sample count to manage costs. He also mentions a 4-bit quantized version of the model that might actually run on a single high-end consumer GPU. Despite the impressive tech, the speaker concludes that standard tools like ComfyUI can often produce better results with less effort. He ends by thanking the developers for open-sourcing the model and asking the audience for their thoughts on the future of AI gaming.
