Gemini 3.5 Flash is just... fine

Better Stack

Transcript

00:00:00Google just released Gemini 3.5 Flash and they're making some pretty bold claims. Frontier
00:00:04performance at four times the speed, often at less than half the cost. Which all sounds
00:00:09incredible, but the reality is a lot worse than Google is advertising.
00:00:12And that was only half of what they released. They also released Antigravity 2, which is
00:00:16their new standalone agent app, basically their answer to Codex, as well as the Antigravity
00:00:20CLI, which actually replaces the Gemini CLI, so that's another one for Killed by Google.
00:00:30Let's start with the headline stats. This has a million token context window, 64,000
00:00:34output tokens and it takes in text, images, video, audio and PDFs as input. Google has
00:00:39always been pretty good at these multimodal models.
00:00:42As for actual performance, Google's own benchmarks have this model being in line with GPT 5.5
00:00:46when it comes to coding, being only a few percent behind on SWE-Bench Pro and Terminal
00:00:50Bench and in fact it's actually beating Opus 4.7 on Terminal Bench by around 10%, but Claude
00:00:56Opus does get its own back on SWE-Bench Pro by beating Gemini by around 10% as well.
00:01:01For agentic workflows, this model is actually winning on both the MCP and Toolathlon benchmarks
00:01:06and overall these benchmarks are not bad results, but all of this is according to Google.
00:01:11If instead we take a look at third-party benchmarks, like Artificial Analysis, it's not doing
00:01:15too great. Their Coding Index has Gemini 3.5 Flash scoring 45, which is actually below models
00:01:21like Kimi K2.6, and it's not even beating Gemini 3.1 Pro, even though on all of Google's own benchmarks
00:01:27it was ahead in everything. It's actually only a few points ahead of Gemini 3 Flash as
00:01:31well.
00:01:32The story does get a little bit better when you look at agentic performance. It's made
00:01:35a nice jump over Gemini 3.1 Pro and yes, technically it is up there competing with the frontier
00:01:41models.
00:01:42Looking at our benchmarks, it appears that 75% of you watching this aren't subscribed
00:01:45so I'm going to ask you nicely to do so. Please subscribe.
00:01:48The one key highlight of this model is definitely its speed. They actually got 278 tokens per
00:01:53second out of this model, which massively outperforms Opus 4.7 and GPT 5.5 and even models
00:01:59like Haiku and the open-source OpenAI ones. So when it comes to intelligence vs speed,
00:02:04this model definitely is the best.
00:02:06Overall it's just a mixed bag of results. It's not the best model and it's not the worst,
00:02:10but it is really really fast and I wouldn't mind these results if it was actually half
00:02:14the cost of the other models, but this is where things start to fall apart.
00:02:18The price of this model is $1.50 for a million input tokens and $9 for a million output tokens,
00:02:23which is actually 3 times more than Gemini 3 Flash was, but it is still way cheaper than
00:02:27the likes of Opus 4.7 and GPT 5.5, at least on paper that is.
00:02:32When actually running their benchmarks though, Artificial Analysis found that Gemini 3.5 Flash
00:02:36cost $1,552 to run the Intelligence Index, which is actually 5.5 times more expensive
00:02:42than Gemini 3 Flash and 75% more expensive than Gemini 3.1 Pro. What's even worse though
00:02:48is this is more expensive than GPT 5.5 when on high reasoning, which massively beats Flash
00:02:54when it comes to coding performance, and in fact I'll just highlight every model on this
00:02:57chart that is cheaper and outperforms Flash when it comes to coding. It just does not look
00:03:02good at all and it's certainly not at half the cost like their marketing claimed.
00:03:06Digging a bit deeper into this, it seems like the problem with this model is that, while fast,
00:03:10it is token hungry. On agentic evaluations it averaged 49 turns per task, which is one
00:03:15of the highest of any model they've tested. It just really likes to burn through your
00:03:19input tokens. So overall I'm just not really sure where this actually leaves us. This model
00:03:23just feels meh. The speed is super cool, so if you value that over everything else, perhaps
00:03:28this is the model to use. The same if you want great multimodal capabilities, but the
00:03:33coding performance is just not enough for me to even consider testing this for a longer
00:03:37period of time than I have in this video. So let's just move on to talk about the other
00:03:41big announcement, which was Antigravity 2 and the new CLI.
00:03:44This is Antigravity 2? Wait, no, sorry, that's T3 Code. Maybe this one then? Wait, nope, that's
00:03:50Codex. What about this one? Nope, that's Cursor. This one is actually Antigravity 2, and I think
00:03:55you can see my point. Basically all of these apps have started to look the same. A funny
00:03:59part of one of their demos is when the developer tries to create a new project and you can just
00:04:03see the Codex folder right there. So to be honest I won't spend much time going through
00:04:07this app. It's exactly the same as all of the other ones. We have our conversations on the
00:04:11left, we have our projects, we have scheduled tasks and in here you can click into any of
00:04:15these files if you want to see the diff view. The only thing to note is that this is not
00:04:18the Antigravity IDE anymore. This is just a completely standalone app. What you're seeing
00:04:22is what you get. Now I did actually try out a couple of test prompts in here. One of them
00:04:26was to create a full stack personal finance dashboard and the other one was much simpler
00:04:30just testing out the UI of how it would build me out a cafe website in a single index.html.
00:04:35This is the result of the very simple cafe prompt and I've got to say I do really like
00:04:39the website that it's built here, so it does seem like 3.5 Flash is pretty good at UI design.
00:04:44I'd say this is overall just a very nice site. It does still have a little bit of an AI feel
00:04:48to it. I think it's mostly that card and gradient style that AI seems to like at the moment but
00:04:53the site is pretty functional and does look how I would expect it to. For context this
00:04:58is what Opus 4.7 gave me when I gave it the exact same prompt, and I do think Gemini 3.5
00:05:03Flash wins on this one, but obviously this is just a one-off test. As for the more complicated
00:05:07finance dashboard prompt, that's a full stack application. It's done well to actually make
00:05:11the application work, but I definitely don't like the UI design. It's not bad, but it just
00:05:16has that "I've been designed by AI" look and feel, and also minus points for calling this
00:05:20Aura Wealth. When you compare that to what Opus 4.7 gave me, it's just a world of difference.
00:05:25Opus 4.7 here looks really nice and to be honest I don't have that many notes on how
00:05:29I would change this UI. Opus actually spent 20 minutes on that prompt whereas Gemini took
00:05:33five minutes so yes it's definitely quicker but it also could have used the extra 15 to
00:05:38make it look better. Moving on from that though, we also got the Antigravity CLI, and this one's
00:05:42probably gonna anger some people because they're actually shutting down the Gemini CLI. You won't
00:05:46be able to use it after June 18th this year, and the new CLI is basically the same at the
00:05:51moment, except it's been rewritten in Go and it's also closed source now, which does suck.
00:05:56I didn't actually install this one as, again, it's just Claude Code but for Gemini, so
00:06:00there is nothing new to show you. To summarise all of my thoughts on this then: right now 3.5
00:06:05Flash is good for agents, but it's expensive and too weak on coding to be the whole package,
00:06:10so I do hope we see a bit more from Gemini 3.5 Pro, which is apparently coming next month.
00:06:15But for now it just seems like Google is not going to be the leader for coding, and to be
00:06:19honest with you, I don't really think they need to be. It seems that Google's market is more
00:06:23the everyday person, building this into all of your experiences like Gmail, Search, Workspace,
00:06:28Android and everything else, so maybe developers just aren't going to be the focus. Let me
00:06:33know what you think in the comments down below, while you're there subscribe, and as always
00:06:36see you in the next one.

Key Takeaway

While Gemini 3.5 Flash offers class-leading speed, broad multimodal input support, and good results on simple UI design, its high token consumption and lackluster coding performance compared to rivals make it a niche tool rather than a comprehensive development solution.

Highlights

  • Gemini 3.5 Flash achieves a throughput of 278 tokens per second, exceeding the speed of GPT 5.5 and Opus 4.7.

  • Artificial Analysis found that running its Intelligence Index on Gemini 3.5 Flash cost $1,552, which is 5.5 times more than on its predecessor, Gemini 3 Flash.

  • Gemini 3.5 Flash averages 49 turns per task in agentic evaluations, one of the highest turn counts of any model tested, which drives up input token consumption.

  • The model produced an attractive single-page cafe website, but its full-stack finance dashboard, while functional, had a noticeably weaker UI than the one Opus 4.7 produced from the same prompt.

  • Google is replacing the open-source Gemini CLI with the closed-source Antigravity CLI, rewritten in Go; the Gemini CLI stops working after June 18th.

Timeline

Gemini 3.5 Flash Capabilities and Benchmarks

  • The model supports a one-million token context window and 64,000 output tokens.
  • Google's own benchmarks show the model within a few percent of GPT 5.5 in coding and ahead of Opus 4.7 on Terminal Bench by around 10%, though Opus 4.7 leads by a similar margin on SWE-Bench Pro.
  • The Artificial Analysis Coding Index scores the model at 45, below Kimi K2.6 and Gemini 3.1 Pro.
  • The model was measured at 278 tokens per second.

Google claims frontier performance for the new model, highlighting multimodal inputs including video and audio. While internal data shows strong results, third-party testing suggests lower coding proficiency. Speed remains the standout feature, significantly outpacing established frontier models.
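
To put the 278 tokens per second figure in perspective, here is a rough sketch of how long a response would take to stream at that rate. It assumes the rate holds for the whole response and ignores time to first token. The function name and the 150 tokens per second comparison value are hypothetical, chosen only for illustration.

```python
# Throughput and output limit quoted in the video.
TOKENS_PER_SECOND = 278
MAX_OUTPUT_TOKENS = 64_000


def generation_seconds(output_tokens: int, tokens_per_second: float) -> float:
    """Seconds needed to stream output_tokens at a constant rate.

    Assumption: the rate is constant and time to first token is ignored.
    """
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    return output_tokens / tokens_per_second


if __name__ == "__main__":
    full = generation_seconds(MAX_OUTPUT_TOKENS, TOKENS_PER_SECOND)
    print(f"Full 64,000-token response: about {full / 60:.1f} minutes")
    # Hypothetical slower model, for comparison only.
    slow = generation_seconds(MAX_OUTPUT_TOKENS, 150)
    print(f"Same response at 150 tokens/s: about {slow / 60:.1f} minutes")
```

Real latency also depends on reasoning tokens and network overhead, so treat this as a lower bound on wall-clock time rather than a prediction.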

Cost and Agentic Efficiency

  • The model is priced at $1.50 per million input tokens and $9 per million output tokens.
  • Running the Artificial Analysis Intelligence Index cost $1,552, which is 75% more than Gemini 3.1 Pro and 5.5 times more than Gemini 3 Flash.
  • High token consumption stems from an average of 49 turns per task in agentic workflows.
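
The transcript gives one absolute figure, $1,552 for Gemini 3.5 Flash on the Intelligence Index, plus two ratios. As a rough sketch, the other two run costs can be backed out from those ratios. The results are implied values derived from the video's rounded numbers, not figures reported by Artificial Analysis, and the helper name is hypothetical. It also assumes "5.5 times more expensive" means 5.5 times the baseline cost.

```python
FLASH_35_RUN_COST = 1552.0  # USD, quoted in the video


def implied_baseline(cost: float, multiple_of_baseline: float) -> float:
    """Return the baseline cost, given that `cost` is that multiple of it."""
    return cost / multiple_of_baseline


# "5.5 times more expensive than Gemini 3 Flash", read as 5.5x the baseline.
flash_3_implied = implied_baseline(FLASH_35_RUN_COST, 5.5)

# "75% more expensive than Gemini 3.1 Pro" means 1.75x the Pro cost.
pro_31_implied = implied_baseline(FLASH_35_RUN_COST, 1.75)

print(f"Implied Gemini 3 Flash run cost: ${flash_3_implied:,.0f}")
print(f"Implied Gemini 3.1 Pro run cost: ${pro_31_implied:,.0f}")
```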

Marketing claims suggest cost-effectiveness, but third-party benchmark runs show that the model's high turn count per agentic task inflates overall expenses. On the Intelligence Index it was more expensive to run than GPT 5.5 on high reasoning, which delivers much stronger coding results.
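
A minimal sketch of why turn count matters so much: in a typical agent loop the whole conversation history is sent again as input on every turn, so billed input tokens grow roughly with the square of the number of turns. The prices are the list prices quoted in the video; every token count below is a made-up assumption, the function name is hypothetical, and prompt caching or other discounts are ignored.

```python
# List prices quoted in the video (USD per million tokens).
INPUT_PRICE_PER_M = 1.50
OUTPUT_PRICE_PER_M = 9.00


def agentic_task_cost(turns, tool_result_tokens=1_500,
                      output_per_turn=500, base_context=2_000):
    """Estimate billed tokens and cost for one agentic task.

    Assumptions (all hypothetical): the full history is resent as input
    on every turn, each turn adds one model reply and one tool result to
    the history, and no prompt caching discount applies.
    """
    input_tokens = 0
    output_tokens = 0
    context = base_context
    for _ in range(turns):
        input_tokens += context            # whole history billed again
        output_tokens += output_per_turn   # model reply for this turn
        context += output_per_turn + tool_result_tokens
    cost = (input_tokens * INPUT_PRICE_PER_M
            + output_tokens * OUTPUT_PRICE_PER_M) / 1_000_000
    return input_tokens, output_tokens, cost


if __name__ == "__main__":
    for n in (10, 49):
        tokens_in, tokens_out, usd = agentic_task_cost(n)
        print(f"{n} turns: {tokens_in:,} input, {tokens_out:,} output, ${usd:.2f}")
```

With these made-up sizes, a 49-turn task bills roughly 22 times the input tokens of a 10-turn task, even though it has fewer than 5 times the turns.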

Antigravity 2 App and CLI Changes

  • Antigravity 2 functions as a standalone agent app similar to existing tools like Codex and Cursor.
  • In one-off tests the model built an attractive single-page cafe website and a working full-stack finance dashboard, though the dashboard's UI looked generic next to the Opus 4.7 result.
  • Google is discontinuing the Gemini CLI in favor of the closed-source Antigravity CLI, rewritten in Go.

The new application interface mirrors the layout common to AI-assisted coding tools: conversations, projects, scheduled tasks, and a diff view. Gemini 3.5 Flash finished the finance dashboard in about five minutes against roughly twenty for Opus 4.7, but the Opus 4.7 result had the more polished UI. The shift to a closed-source CLI signals a change in the developer experience strategy for Google.
