I Gave 7 AI Agents the Same Swift Challenge. It Was BRUTAL!

Better Stack

Transcript

00:00:00Most AI coding models have one huge problem - they just can't handle Swift.
00:00:06We've all seen the flashy demos of agents building web apps and JavaScript tools in seconds,
00:00:11but as soon as you ask them to touch Swift code, things fall apart fast.
00:00:16Now why are the world's smartest models failing at iOS development?
00:00:22So that's what we're going to find out in today's video.
00:00:25Today I'm putting the top coding agents through the same Swift app coding challenge to see
00:00:30which models can actually handle this task and which ones are just web-dev one-trick
00:00:35ponies.
00:00:36I'll give you a little spoiler - one of these models actually aced the test completely.
00:00:40Which one that is, you'll see that later in this video.
00:00:43It's going to be a lot of fun, so let's dive into it.
00:00:50So first of all, let's address the key issue.
00:00:52Why are AI coding models bad at Swift development?
00:00:56And just to be clear, this is not just my observation.
00:00:59A study titled "Evaluating Large Language Models for Code Generation - A Comparative Study"
00:01:05on Python, Java and Swift found that across all models tested, including GPT and Claude,
00:01:12performance in Swift was consistently lower than in Python or Java.
00:01:17And the reason boils down to three main bottlenecks that effectively handicap AI when it touches
00:01:22Apple's ecosystem.
00:01:24First there's the data gap.
00:01:25While the web is flooded with open source JavaScript and Python code, a huge portion of professional
00:01:31Swift code lives behind closed doors of private or commercial repositories.
00:01:36Second we have API drift.
00:01:38Apple is famous for moving fast and breaking things.
00:01:42SwiftUI and Swift's concurrency model have changed more in the last three years than some
00:01:47web standards have in a decade.
00:01:49And because most AI models have a knowledge cut-off, they are often trying to write Swift
00:01:54code using outdated rules that simply don't work in the latest version of Xcode.
00:01:59And finally there's the benchmarking bias.
00:02:02Most of the AI models we're testing today, like Qwen or Grok, are trained to the test.
00:02:08They are optimized to pass massive benchmarks like HumanEval, which are almost entirely focused
00:02:13on Python and web-based logic.
00:02:16Since there aren't many major benchmarks for complex iOS UI, these models simply haven't
00:02:21been graded on their ability to build a functional app.
00:02:25So I chose some of the most popular AI coding models out there and I gave each one of them
00:02:30the exact same prompt.
00:02:32I tasked each of them to build a simple Tinder-like app clone in Swift called DogTinder, where
00:02:38you are presented with different dogs using the Dog CEO API.
00:02:43And you can swipe left or right to choose which ones you like and if there is a match,
00:02:47you can open up a chat interface to exchange funny messages with the matched dog.
00:02:52So it's supposed to be cute and simple enough for an agent to complete and it also involves
00:02:58some interesting challenges like building a swipe animation functionality in native Swift.
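As an aside, the swipe mechanic this challenge asks for can be sketched in a few dozen lines of SwiftUI. This is a minimal illustrative sketch, not code from any of the tested models; the names (`SwipeCard`, `swipeDecision`) and the 120-point threshold are my own assumptions:

```swift
import SwiftUI

// Possible outcomes of releasing a dragged card.
enum SwipeDecision { case like, nope, undecided }

// Pure helper: crossing the horizontal threshold on release
// counts as a like (right) or a nope (left).
func swipeDecision(forDragWidth width: CGFloat, threshold: CGFloat = 120) -> SwipeDecision {
    if width > threshold { return .like }
    if width < -threshold { return .nope }
    return .undecided
}

struct SwipeCard: View {
    let imageURL: URL
    var onDecision: (SwipeDecision) -> Void

    @State private var offset: CGSize = .zero

    var body: some View {
        AsyncImage(url: imageURL) { image in
            image.resizable().scaledToFill()
        } placeholder: {
            ProgressView()
        }
        .frame(width: 320, height: 420)
        .clipShape(RoundedRectangle(cornerRadius: 16))
        .offset(offset)
        // Tilt slightly in the drag direction, Tinder-style.
        .rotationEffect(.degrees(Double(offset.width / 20)))
        .gesture(
            DragGesture()
                .onChanged { offset = $0.translation }
                .onEnded { value in
                    let decision = swipeDecision(forDragWidth: value.translation.width)
                    if decision == .undecided {
                        // Not far enough: snap the card back to center.
                        withAnimation(.spring()) { offset = .zero }
                    } else {
                        // Fling the card off screen and report the decision.
                        withAnimation(.easeOut) {
                            offset.width = decision == .like ? 600 : -600
                        }
                        onDecision(decision)
                    }
                }
        )
    }
}
```

Keeping the threshold logic in the pure `swipeDecision` helper, outside the gesture closure, makes it trivial to unit test.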
00:03:03So for the tests themselves, we are going to start from the worst performer going all the
00:03:07way to the best one.
00:03:09And in the worst performing place, we unfortunately have the new Qwen 3 Coder Next model.
00:03:15Qwen has been advertising this new model as an open source alternative to heavyweights
00:03:20like Kimi or Claude, with a smaller model size but higher performance.
00:03:25And while that may be true for web apps, it unfortunately did not hold up for the Swift challenge.
00:03:32So whenever possible, I tried using the model's own CLI tool if one was available,
00:03:37and in this case, I used the Qwen CLI tool to conduct this challenge.
00:03:42And once it was done generating the code, I could not open the project file that Qwen
00:03:46had produced.
00:03:48So then I prompted it to fix the error that was presented when I tried to open the file.
00:03:53But even then, Qwen could not fix the error and instead provided me with a long README
00:03:58file on how to build this project on my own from scratch and then copy the files over to
00:04:03the project folder, which is not something I want to do manually for this challenge because
00:04:08that would defeat the purpose.
00:04:09And as you will see later, I noticed that some of the models had a very hard time producing
00:04:14the final collection of files for this project, which we could open successfully on a first
00:04:19try.
00:04:20So for these cases, like Qwen here, I decided to give it an easier challenge instead.
00:04:26So I created a new iOS app project in Xcode manually, and I decided that this could be
00:04:31a good time to try out the new coding intelligence functionality that is now packaged with the
00:04:37newest version of Xcode.
00:04:38And this is pretty cool because finally Xcode has their own AI assistant feature.
00:04:43So I hooked it up to my OpenRouter account, chose the Qwen 3 Coder Next model from
00:04:49the dropdown and tried the challenge again.
00:04:52Even with all this handholding, Qwen still could not produce a successful project on the
00:04:57first try, because here we got some issues with accurately setting up the Swift models.
00:05:02And now with the new AI assistant feature, we can highlight all of these issues and then
00:05:07task the assistant to generate the fix for all the selected issues at once.
00:05:12So finally, after a few rounds of prompting Qwen to fix the remaining issues, we
00:05:16got a working version of the DogTinder app, but honestly the result was pretty bad.
00:05:23It could not even load the images from the Dog CEO API, and the whole UI was also very
00:05:29primitive and not exciting at all.
00:05:32Not to mention that there was a bug in the matches section where none of the matches were
00:05:36actually appearing.
00:05:37So unfortunately, Qwen totally failed the Xcode app challenge.
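For reference, the network call that tripped the model up here is not a complicated one. The Dog CEO random-image endpoint returns a tiny JSON payload, and a hedged sketch of fetching one image URL in Swift (my own illustrative code, not any model's output) looks like this:

```swift
import Foundation

// The Dog CEO endpoint https://dog.ceo/api/breeds/image/random returns JSON like:
// {"message": "https://images.dog.ceo/breeds/.../dog.jpg", "status": "success"}
struct DogImageResponse: Codable {
    let message: String  // URL of a random dog image
    let status: String   // "success" on a good response
}

// Fetch one random dog image URL (async/await sketch).
func fetchRandomDogURL() async throws -> URL {
    let endpoint = URL(string: "https://dog.ceo/api/breeds/image/random")!
    let (data, _) = try await URLSession.shared.data(from: endpoint)
    let decoded = try JSONDecoder().decode(DogImageResponse.self, from: data)
    guard decoded.status == "success", let url = URL(string: decoded.message) else {
        throw URLError(.badServerResponse)
    }
    return url
}
```

The resulting URL can be handed straight to SwiftUI's `AsyncImage` to render the card.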
00:05:42So moving on to our second to last place, we have Grok with its Grok code fast model.
00:05:48For this one, I tried to use it through the Copilot extension in VS Code, and once again,
00:05:53I ran into the same issue where Grok was not able to produce all the project files needed
00:05:59for the complete Swift project package.
00:06:02And instead it provided me with instructions on how to copy the files manually.
00:06:06So once again, I had to fall back to using the AI assistant in Xcode by calling the Grok
00:06:12model from OpenRouter.
00:06:14And Grok also ran into a couple of issues, so I had to prompt it twice to fix the remaining
00:06:19errors.
00:06:20But after all of that, it was able to successfully complete the app.
00:06:23And at first look, Grok did a terrible job with the design.
00:06:27It was not exciting at all, and there wasn't even a section where we could see
00:06:32the matches.
00:06:33The only reason why I put Grok higher than Qwen is because, at least from a functionality
00:06:38standpoint, everything is working, including the chat functionality, but to be honest, they
00:06:44were both very close in terms of poor performance.
00:06:48And nothing about this app seems exciting or visually pleasing.
00:06:51So I wouldn't say Grok failed the challenge, but it does get the lowest passing grade it
00:06:57can possibly get.
00:06:58Next up on our leaderboard is Kimi with their newest Kimi K2.5 model.
00:07:04And Kimi had the same issue as Qwen: when using their native CLI, it produced the
00:07:08project file, but I could not open it.
00:07:11Even prompting it through the CLI to fix the error did not resolve the issue.
00:07:15So once again, for Kimi's test, I had to use the built-in Xcode AI assistant feature
00:07:20with Kimi K2.5 provided by OpenRouter.
00:07:23And Kimi's performance was similar to Qwen's and Grok's, because it did not complete
00:07:29the challenge on the first try.
00:07:31So I had to prompt it again to fix the remaining issues.
00:07:34But after just one round of issue fixing, Kimi was able to produce the final result.
00:07:39And this version was actually a step up from Qwen and Grok, because at least now we got an
00:07:44app that actually looks like a Tinder-like app.
00:07:47And we now got this nice left and right swipe animation along with the like and nope stickers
00:07:53on the sides and a fancy pop-up when we got a match.
00:07:57But the animation was very buggy and very finicky.
00:08:00At times I couldn't even see the image at all because it was floating somewhere off screen.
00:08:05But at least Kimi was able to store the matches properly.
00:08:08And we actually got a section where we could see our matches and open any of them and start
00:08:12chatting with the specific dog.
00:08:14So this is already a big step up from Qwen and Grok.
00:08:18But if I have to compare it with other examples you'll see later in this video, I would say
00:08:22it is still a subpar result.
00:08:25And that's why I give Kimi a lower place on the leaderboard.
00:08:29And next up we have Gemini 3 Pro.
00:08:31And this one is interesting because I got totally different results when testing the same model
00:08:36from their own CLI versus from the Xcode's AI assistant.
00:08:41So first let's see what we got when we used the Gemini CLI.
00:08:45It does say that the model is still in preview mode on the CLI.
00:08:49So maybe that was the core issue.
00:08:50But once again, when I prompted it with the same prompt I used for every model in this
00:08:55challenge, it could not give me a project file at the end.
00:08:59And this is because in order to create an Xcode project file, you first need to create a YAML
00:09:04file with the project details and then use the XcodeGen CLI command to generate it.
00:09:09But for some reason, some models refuse to do it or don't know how to do it.
00:09:14But nonetheless, once I prompted Gemini to specifically create the file, it did so.
00:09:18And I just needed to give it access to execute the XcodeGen command.
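For context, the manifest in question is quite small. Assuming the generator being described here is XcodeGen, a minimal `project.yml` for an app like this might look as follows (the project name, bundle prefix, and deployment target are all illustrative, not taken from the video):

```yaml
# Illustrative XcodeGen manifest; names and versions are made up.
name: DogTinder
options:
  bundleIdPrefix: com.example
targets:
  DogTinder:
    type: application
    platform: iOS
    deploymentTarget: "17.0"
    # Folder containing the generated Swift sources
    sources: [DogTinder]
```

Running `xcodegen generate` next to this file emits the `.xcodeproj` that several of the agents struggled to produce directly.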
00:09:22And once we opened the project, we got an asset error.
00:09:25But that was quick for Gemini to fix.
00:09:28And once that was solved, the app was finally compiling.
00:09:31But the result was bad, surprisingly bad.
00:09:35It was broken.
00:09:37The matches system was not working properly and everything was buggy.
00:09:41So at this point, I was willing to give Gemini a failing grade.
00:09:45But just out of curiosity, I decided to give Gemini another shot and conduct the challenge
00:09:50using Xcode's native AI assistant by running Gemini 3 Pro through OpenRouter.
00:09:56And once I did that, this time it got it right on the first try.
00:10:01And not only that, but the app was amazingly good.
00:10:04I mean, the design was great.
00:10:06The functionality was in place.
00:10:08It even added a nice little logo on top.
00:10:10Honestly, there was nothing to fault in this version of the app.
00:10:14So I'm a bit baffled as to why running the same prompt through the same model, but through
00:10:20different AI coding tools, produced two such distinct results.
00:10:24But nonetheless, I was very impressed with the version that Gemini finally gave me through
00:10:29Xcode's tooling and on the first try, might I add.
00:10:32So that's why I put Gemini a bit higher on the leaderboard, because the end result was
00:10:37actually pretty great.
00:10:38OK, so next up on the leaderboard, we have GPT 5.3 Codex.
00:10:43And since OpenAI has their own Codex app, I decided to conduct the challenge from their
00:10:48own app.
00:10:49And unlike the previous models we have seen so far, GPT 5.3 was actually able to produce
00:10:55the final working product on the first try.
00:10:58So this is already a big step up.
00:11:00But I got to say, the app itself was not very exciting.
00:11:03It had a very monotone blue color theme.
00:11:06And the biggest issue that bugged me is that it couldn't fit the width of the image within
00:11:11the frame of the app.
00:11:13So for some dogs, you ended up with a very stretched out container that goes outside of
00:11:18the bounds of the app.
00:11:20So this is a big design flaw that Codex was not able to handle properly.
00:11:25But the app itself is functional with all the necessary UI elements.
00:11:29And we also got the matches section working properly where we could chat with the dogs.
00:11:34So the reason why I give GPT 5.3 such a high place on the leaderboard is that this is the
00:11:40first model that was actually able to produce the entire Swift project package without any
00:11:46handholding or without setting up the Xcode project beforehand.
00:11:50So overall, not too bad, but also not too exciting.
00:11:53All right.
00:11:54And finally, we get to the first place on the leaderboard.
00:11:57And I'm just going to give you a moment to guess which model that could be.
00:12:01And yes, I think we all know which model that is.
00:12:04It is, of course, Opus 4.6, which absolutely aced this challenge right off the bat.
00:12:11I prompted it with the same prompt as the other models, but I used their own Claude Code CLI
00:12:17tool, and I just needed to provide the necessary permissions.
00:12:20And the model did everything on its own, including creating a fully functional Xcode project file
00:12:27without me having to set it up beforehand.
00:12:29And not only that, but the app itself was absolutely beautiful.
00:12:34The design was there.
00:12:35The animations were nice and fluid.
00:12:37The matches section was working correctly as well as the chat window.
00:12:41The only thing we didn't get in this version was a fancier logo like the one Gemini produced
00:12:46in its version.
00:12:48But other than that, this was the best looking version of them all.
00:12:52And it even managed to produce this on the first try.
00:12:55So I would say Opus's performance is absolutely incredible compared to all the other models.
00:13:01So it definitely deserves the first place on the leaderboard.
00:13:05But wait, there's more.
00:13:07Here's a little bonus for you folks.
00:13:09There is still one more model that we need to review that hasn't been shown on the leaderboard
00:13:13yet.
00:13:14You see, while I was making this video, there was an announcement that GLM just released
00:13:18their latest model, version 5, and they are bold enough to claim that this model scores
00:13:23even higher in coding than Opus 4.6.
00:13:26So obviously I had to test it out on the same Swift challenge.
00:13:31And since GLM does not have their own CLI tool, I once again used Xcode's AI assistant tool
00:13:37by hooking it up to OpenRouter and using GLM 5 from there.
00:13:41And first of all, GLM did not complete this challenge on the first try.
00:13:45So that already shows a worse performance than Opus 4.6.
00:13:49But secondly, I had to go through three rounds of bug fixes to finally get it to compile successfully.
00:13:56So let's see what the final result is for GLM 5.
00:13:59As you can see, it already looks like a failing grade to me.
00:14:03It cannot seem to load up any of the dog images.
00:14:06It does not have the swipe functionality.
00:14:08And what's even worse, it only cycles through three dogs and then shows a message that there
00:14:13are no more dogs available.
00:14:15And furthermore, if we go to the matches section, we cannot tap on any of the matches to open
00:14:20the chat interface with any of the dogs.
00:14:23So this section is clearly not finished.
00:14:25So judging by this result, where do we put GLM on the leaderboard?
00:14:29Well, I'm afraid we have to put it in the second to last place, just above Qwen, because
00:14:36this performance was just not acceptable and not nearly as good as any of the other models.
00:14:42So stating that GLM 5 is stronger than Opus 4.6 is a pretty bold claim.
00:14:47Now, I haven't tested this model on any other coding tasks, and it might just be the case
00:14:52that maybe for simpler web dev projects, it works just as well or maybe even better than
00:14:57Opus 4.6.
00:14:59But this is definitely not a good model for coding in Swift.
00:15:02So what did we learn today?
00:15:04Clearly, while the AI revolution is moving at light speed, the Swift problem for these models
00:15:10is still very real. Opus 4.6 and GPT 5.3 proved that if the model is large enough and the reasoning
00:15:18is strong enough, they can overcome the lack of open source Swift code data.
00:15:23But for models like Qwen and Grok, the data gap and API drift we talked about earlier are
00:15:29clearly hitting them hard.
00:15:31And I was also surprised how helpful Xcode's new AI assistant actually is for Swift apps.
00:15:36We could clearly see this in the difference between the two Gemini app versions.
00:15:40So if you're an iOS developer, it's probably helpful to use their internal AI tooling to
00:15:46get better results.
00:15:47So there you have it folks, I hope you enjoyed this leaderboard breakdown.
00:15:51I think this opens up a wider conversation about the fact that maybe we should start having
00:15:55language specific models.
00:15:57Because clearly a lot of these models are more heavily biased towards web apps, JavaScript
00:16:03or Python projects.
00:16:04But for some bespoke coding solutions, we might need some custom coding models.
00:16:09But what is your take on all of this?
00:16:11Let us know in the comment section down below.
00:16:13And folks, if you enjoyed this video, please let me know by smashing that like button underneath
00:16:18the video.
00:16:19And also don't forget to subscribe to our channel.
00:16:22This has been Andris from Better Stack, and I will see you in the next videos.

Key Takeaway

While general AI coding models struggle with Swift due to outdated training data and benchmark bias, top-tier models like Opus 4.6 and GPT 5.3 prove that high-level reasoning can overcome these specialized ecosystem hurdles.

Highlights

Most AI coding models consistently underperform in Swift development compared to Python or Java due to a data gap in private repositories and frequent API changes.

Benchmark bias is a significant issue, as models are often optimized for Python-centric tests like HumanEval rather than complex iOS UI tasks.

Opus 4.6 emerged as the clear winner, successfully building a functional and aesthetically pleasing 'DogTinder' app on the first attempt.

The new Xcode AI assistant feature significantly improved Gemini 3 Pro's performance, highlighting the value of environment-specific integration.

Lower-tier models like Qwen 3 Coder Next and Grok Code Fast struggled heavily with Swift project structure and UI design.

GLM 5, despite claims of superior coding ability, failed the Swift challenge with broken imagery and incomplete features.

Timeline

The Problem with AI and Swift Development

The speaker introduces the central thesis that most AI models fail at iOS development despite their success with web-based languages. He cites a comparative study showing that models like GPT and Claude perform significantly worse in Swift than in Python or Java. Three main bottlenecks are identified: the data gap from private commercial repositories, the rapid 'API drift' in Apple's ecosystem, and the benchmarking bias of existing tests. These factors combine to leave AI models with outdated knowledge and a lack of training on complex UI logic. This section sets the stage for the 'DogTinder' challenge meant to test these specific weaknesses.

Lower Tier Performers: Qwen, Grok, and Kimi

The coding challenge begins with the lowest-ranked models, starting with Qwen 3 Coder Next, which failed to produce an openable project file. Grok Code Fast performed slightly better by providing functional code, though the design was uninspired and required manual file management. Kimi K2.5 showed a marked improvement by creating a recognizable Tinder-like interface with swipe animations, though it remained buggy and finicky. Each of these models required significant 'handholding' through Xcode's AI assistant to reach a semi-functional state. The speaker notes that while these models might excel at web development, they are currently 'one-trick ponies' when it comes to Swift.

Mid-to-High Tier Performers: Gemini and GPT 5.3

Gemini 3 Pro initially failed through its CLI but achieved impressive results when utilized through the native Xcode AI assistant, producing a high-quality UI on the first try. GPT 5.3 Codex followed, distinguishing itself as the first model to generate a complete project package without any manual setup. However, GPT's version suffered from design flaws, such as images exceeding the bounds of the app frame. This section highlights the importance of the tool used to access the model, as Gemini's performance varied wildly depending on the interface. Both models represent a tier where professional-grade code generation becomes more realistic.

The Champion: Opus 4.6 and the GLM 5 Bonus

Opus 4.6 is crowned the winner for acing the challenge on the first attempt using the Claude Code CLI, delivering a beautiful and fluid user experience. The speaker then introduces a surprise test for GLM 5, which recently claimed to outperform Opus in coding tasks. Contrary to these claims, GLM 5 failed the Swift test miserably, requiring three rounds of bug fixes only to produce an app that couldn't load images or handle navigation. This comparison reinforces that specialized benchmarks are often misleading when applied to specific ecosystems like iOS. The speaker concludes that Opus remains the most reliable model for this particular use case.

Final Verdict and the Future of Coding Models

The video concludes by reflecting on the lessons learned from the competition between these seven AI agents. The speaker emphasizes that large model size and strong reasoning capabilities currently provide the only bridge over the Swift 'data gap.' He suggests that the industry may need to shift toward language-specific models to handle the unique demands of bespoke ecosystems. Using internal AI tooling, like the Xcode assistant, is recommended for developers who want to maximize the performance of existing models. The session ends with a call for viewer feedback on whether specialized or general models will win in the long run.
