GLM 4.7 is INSANE For Software Dev...

AI LABS

Transcript

00:00:00The guys over at ZAI just dropped GLM 4.7, and at $29 a year, this is absurdly cheap for a model they claim hits 73% on SWE-bench, right up there with Sonnet 4.5.
00:00:11The timing isn't random. They're going public and need to show Western traction.
00:00:15They even did a live Q&A on Reddit, which I've never seen a Chinese AI lab do.
00:00:19But 4.6 had real problems. Is 4.7 actually fixed?
00:00:23Hey everyone, if you're new here, this is AI Labs, and welcome to another episode of Debunked,
00:00:27a series where we actually take AI tools and AI models, strip away the marketing hype, and show what they can actually do with real testing and honest results.
00:00:35The new model is mainly improved through post-training, not architectural changes.
00:00:40It's heavily optimized for Claude Code, and the ZAI team explicitly said this is their priority framework.
00:00:46Currently, it's actually beating a lot of the top tier models, including GPT-5, especially on coding benchmarks.
00:00:52One addition across all of their coding plans is a set of new MCP tools, which are not integrated directly into the model.
00:00:58They're separate MCP servers. They've listed three right now.
00:01:01All of them work with just an API key. That's why they're included with the plan, but separate from the model.
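As a rough sketch of what wiring up a remote MCP server with an API key can look like in a Claude Code `.mcp.json` file (the server name and URL below are placeholders, not ZAI's actual endpoints; Claude Code expands `${VAR}` references from your environment):

```json
{
  "mcpServers": {
    "example-zai-tool": {
      "type": "http",
      "url": "https://example.com/mcp",
      "headers": {
        "Authorization": "Bearer ${ZAI_API_KEY}"
      }
    }
  }
}
```

Keeping the key in an environment variable rather than in the file itself means the config can be committed without leaking credentials.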
00:01:07As far as the usage limits go, they're pretty much the same as they were on 4.6.
00:01:11But if you don't know what they were before, I actually generated a report on that.
00:01:15What's funny is I first tried to generate it with Gemini 3, and for some reason it wasn't able to give me a proper comparison of the plans.
00:01:21So I went back to Claude, and it researched it nicely.
00:01:24Basically, all you need to know is that on the entry-level plan you get 10 to 40 prompts in Claude Code,
00:01:29while on the GLM Coding plan you're getting 120 prompts for just $3, which is a huge difference.
00:01:34This only increases as you go into the higher tiers, where the $200 plan gets you up to 800 prompts in that 5-hour window with Claude,
00:01:42while $30 gets you 2,400.
00:01:44All of these rates are discounted for the first month, then they double.
00:01:47But if you're on the yearly plan, it's much more affordable.
00:01:50Another significant benchmark was Humanity's Last Exam.
00:01:53For those who don't know, it's one of those unsaturated benchmarks,
00:01:56and most newer models still score low on it because it's genuinely difficult.
00:02:00To actually test the UI, we do have this prompt, which doesn't really focus on the architecture.
00:02:05It mainly focuses on the design logic the model is supposed to implement, while also providing some design options.
00:02:11We can then see what it builds for the company I'm proposing, which in this case is an AI-powered code review platform.
00:02:18We also subscribed to the MAX plan, and there are two ways you can actually connect it with Claude Code.
00:02:22In both cases you edit a settings.json, but one lives in your home directory (~/.claude/settings.json), which changes the global settings.
00:02:29If you edit the one inside your project (.claude/settings.json), it only changes things for that project.
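For reference, the env-override approach looks roughly like this in either settings.json (a minimal sketch: the base URL is the Anthropic-compatible endpoint ZAI's docs list at the time of writing, so verify it against their setup guide, and never commit a real key):

```json
{
  "env": {
    "ANTHROPIC_BASE_URL": "https://api.z.ai/api/anthropic",
    "ANTHROPIC_AUTH_TOKEN": "your-zai-api-key"
  }
}
```

Claude Code reads these environment variables at startup, which is what lets it talk to a GLM backend without any code changes.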
00:02:33We did this so we could actually compare it with Sonnet 4.5.
00:02:36This is what Sonnet 4.5 came up with.
00:02:38The prompt is actually pretty good, and we've been using it to really identify which of these models build UI and how creative they are in doing that.
00:02:45It's simple vanilla JS, so we're not looking at the architecture right now, just the design.
00:02:49This is what GLM 4.7 came up with.
00:02:52In terms of design it's pretty good, but it did make an error here: it didn't account for content length, which is why some elements are breaking up a little bit.
00:02:59Other than that, the design is solid, but I do not like these emojis at all.
00:03:02Sonnet did not use any emojis, which is good and does match the design language.
00:03:06To actually test them both, I have this premade Next.js project, with context files telling it that it needs to build a scalable, backend-ready UI.
00:03:15This part is important because, as I'm going to evaluate the reasons why GLM surprisingly performed better, it's going to come back to this point.
00:03:22Framer Motion and ShadCN components have been pre-installed for it to build the UI.
00:03:27Both of them have been asked to build the main browser page for a Netflix-like streaming platform.
00:03:31They've been told exactly what to build and what needs to be on the page.
00:03:35As for the usability of the GLM model with Claude Code: one problem with GLM 4.6 was that it was extremely slow at code generation.
00:03:43Here, that issue, in my experience, has not been solved. It's still extremely slow.
00:03:48But there is one change. With GLM 4.6, the model's reasoning never appeared inside Claude Code.
00:03:54The detailed transcript you get here clearly shows thinking, which 4.6 never displayed.
00:03:59You can clearly see that the 4.7 model does think, so that's been fixed.
00:04:04Other than that, there are some quirks you need to know. GLM 4.7 is not that autonomous.
00:04:09I found this during my testing. As you can see here, this GLM folder already has a UI benchmark folder in which it needs to implement the app, but it chose to ignore that.
00:04:18Although it was clearly written inside the context, it went ahead and made another Next.js app on its own.
00:04:22It didn't even initialize it, it just started writing code. Sometimes it does act really dumb.
00:04:27But after I corrected it and steered it in the right direction, this is the implementation Claude created.
00:04:32Again, being the higher-end model, it's pretty good at UI.
00:04:35This is what GLM 4.7 created. Claude obviously created a better UI because, in our opinion, it's still better at design.
00:04:42For the price, that is okay. But things changed after I looked at the code and dug into it. Both models were told the app was supposed to be backend-ready and that, for now, they needed to use mock data,
00:04:50and the GLM model actually implemented the better architecture by placing all the mock data in one file.
00:04:56When we need to swap it out, we just change that one file, because all the imports point there, as opposed to Claude's implementation, where every component has its own import.
00:05:05When we actually implement the backend, we'll have to change all of those files one by one.
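To make the difference concrete, here's a minimal sketch of the centralized pattern being described, in the spirit of the Next.js test project (the file name, interface, and titles are illustrative, not taken from the actual generated code):

```typescript
// data/mock.ts (illustrative): all mock data lives in one module.
// Components import from here, so replacing the mocks with a real
// backend means editing only this file -- no component changes needed.

interface Title {
  id: number;
  name: string;
  genre: string;
}

// Mock catalogue used by every component until the backend exists.
const MOCK_TITLES: Title[] = [
  { id: 1, name: "Example Heist", genre: "Thriller" },
  { id: 2, name: "Sample Drama", genre: "Drama" },
];

// Single access point: later this can be swapped for a fetch("/api/titles")
// call without touching any component that consumes it.
export function getTitles(): Title[] {
  return MOCK_TITLES;
}
```

Components then call `getTitles()` instead of hard-coding their own arrays, which is exactly the property that makes the eventual backend swap a one-file change.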
00:05:09In terms of basic architecture and code quality, GLM actually did pretty well, and it surprised me because 4.6 wasn't this good in my testing.
00:05:17The previous model's pricing wasn't really justified, given how much I had to steer it and how many mistakes it made, but this one is definitely a huge leap.
00:05:24Those benchmarks are definitely justified by the testing I've done.
00:05:27I've also looked at a few other small things in the code, and GLM 4.7 is actually a good model.
00:05:32Given these unexpected results, we honestly recommend that everyone get the $29-per-year plan.
00:05:38If you already have the $20 Claude plan, this costs basically nothing in comparison.
00:05:42That said, it's still not a model you'd use for completely autonomous coding.
00:05:46Even though Claude really messed up the architecture here, it's good enough that it can correct and improve on that later.
00:05:52But with the small quirks GLM still has, we don't think it's a good idea to solely depend on it.
00:05:57That brings us to the end of this video.
00:05:58If you'd like to support the channel and help us keep making videos like this, you can do so by using the super thanks button below.
00:06:05As always, thank you for watching and I'll see you in the next one.

Key Takeaway

GLM 4.7 delivers impressive coding performance at an unbeatable price point, making it a strong value-add for software developers despite limitations in autonomy and some design trade-offs compared to premium models.

Highlights

GLM 4.7 achieves 73% on SWE-bench at only $29/year, offering exceptional value compared to premium alternatives like Sonnet 4.5

GLM 4.7 shows significant improvements over 4.6, with explicit optimization for Claude Code and new MCP tool integrations

Pricing is dramatically cheaper than competitors: $3 for 120 prompts on the GLM Coding plan versus 10-40 prompts on Claude's entry-level plan

GLM 4.7 implements superior backend architecture with centralized mock data management, outperforming Claude in code organization despite slightly weaker UI design

The model suffers from poor autonomy and still has slow code generation speeds, requiring user steering and corrections

GLM 4.7 now displays visible thinking process in Claude Code, fixing a critical issue from version 4.6

Not recommended for fully autonomous coding projects, but excellent as a supplementary tool at its price point

Timeline

Introduction: GLM 4.7 Release and Market Context

The video opens with context about ZAI's latest GLM 4.7 release, priced at $29 annually, claiming 73% on the SWE-bench benchmark, comparable to Sonnet 4.5. The speaker highlights the strategic timing of this release coinciding with ZAI's public offering and their unprecedented Reddit Q&A engagement targeting Western audiences. The model represents primarily a post-training improvement rather than architectural changes, with heavy optimization specifically for Claude Code as the priority framework. This section establishes why the release is significant: aggressive pricing combined with competitive benchmarks marks a notable shift in the AI model market landscape.

Technical Features and Pricing Comparison

This section details GLM 4.7's technical specifications, including new MCP tool integrations that operate as separate servers requiring only API keys for functionality. The speaker provides a detailed pricing breakdown comparing GLM to Claude's plans, revealing substantial cost advantages: entry-level GLM pricing offers 120 prompts for $3 versus Claude's 10-40 prompts, with the gap widening at higher tiers ($30 gets 2,400 prompts versus Claude's 800 for $200). Monthly rates are doubled after the first month, though yearly plans offer significantly better value. These concrete numbers demonstrate why the pricing strategy is generating interest despite the model's architectural similarities to predecessors.

Benchmark Testing and UI Evaluation

The speaker evaluates GLM 4.7's performance on the Humanity's Last Exam benchmark and tests both GLM and Claude's UI generation capabilities using a prompt for an AI-powered code review platform. While Claude produces cleaner design without emojis (aligning better with design language), GLM's output demonstrates acceptable design quality but with minor rendering issues where content length wasn't properly accounted for. The detailed comparison reveals both models' strengths: Claude excels in visual polish and consistency, while GLM shows competent but less refined design execution. This hands-on testing approach validates whether the claimed benchmark improvements translate to practical real-world performance.

Autonomous Coding Performance and Limitations

Testing with a pre-configured Next.js project for a Netflix-like streaming platform exposes critical limitations in GLM 4.7's autonomy and code generation speed. The model exhibits slow processing times (unresolved from version 4.6) and poor contextual awareness, ignoring explicit instructions to use existing project folders and instead creating redundant applications. However, a significant fix from 4.6 is now visible: detailed thinking processes are displayed within Claude Code, indicating internal reasoning was previously hidden. The speaker notes these behavioral inconsistencies—sometimes performing capably, other times acting 'really dumb'—requiring constant user steering and correction, which contradicts claims of improved autonomy.

Code Architecture and Backend Implementation Quality

After steering the model correctly, the analysis reveals surprising architectural strengths in GLM 4.7's code structure compared to Claude's implementation. GLM centralizes mock data into a single file with connected imports, enabling efficient backend migration when needed, whereas Claude scatters imports across multiple components requiring individual file updates. This architectural superiority demonstrates that GLM prioritizes scalability and maintainability despite weaker UI design capabilities. The speaker acknowledges this unexpected finding validates the benchmark claims, representing a dramatic leap from the problematic version 4.6 performance. This discovery is crucial for developers prioritizing backend-ready, scalable code over visual polish.

Final Recommendation and Use Case Assessment

The speaker concludes by recommending the $29 yearly plan as an exceptional value proposition, particularly for developers already using Claude's $20 plan. However, the recommendation comes with clear caveats: GLM 4.7 is unsuitable for completely autonomous coding tasks due to persistent autonomy limitations and quirks requiring frequent user intervention. The appropriate use case is supplementary development assistance rather than primary code generation, where human oversight remains essential. The video ends by reiterating that while benchmarks are justified by real-world testing, the model's practical limitations prevent total reliance, making it an excellent value-add tool for hybrid development workflows.
