00:00:00Most AI coding models have one huge problem - they just can't handle Swift.
00:00:06We've all seen the flashy demos of agents building web apps and JavaScript tools in seconds,
00:00:11but as soon as you ask them to touch Swift code, things fall apart fast.
00:00:16Now why are the world's smartest models failing at iOS development?
00:00:22So that's what we're going to find out in today's video.
00:00:25Today I'm putting the top coding agents through the same Swift app coding challenge to see
00:00:30which models can actually handle this task and which ones are just a web-dev one-trick
00:00:35pony.
00:00:36I'll give you a little spoiler - one of these models actually aced the test completely.
00:00:40Which one that is, you'll see later in this video.
00:00:43It's going to be a lot of fun, so let's dive into it.
00:00:50So first of all, let's address the key issue.
00:00:52Why are AI coding models bad at Swift development?
00:00:56And just to be clear, this is not just my observation.
00:00:59A study titled "Evaluating Large Language Models for Code Generation - A Comparative Study"
00:01:05on Python, Java and Swift found that across all models tested, including GPT and Claude,
00:01:12performance on Swift was consistently lower than on Python or Java.
00:01:17And the reason boils down to three main bottlenecks that effectively handicap AI when it touches
00:01:22Apple's ecosystem.
00:01:24First there's the data gap.
00:01:25While the web is flooded with open source JavaScript and Python code, a huge portion of professional
00:01:31Swift code lives behind closed doors of private or commercial repositories.
00:01:36Second we have API drift.
00:01:38Apple is famous for moving fast and breaking things.
00:01:42SwiftUI and Swift's concurrency model have changed more in the last three years than some
00:01:47web standards have in a decade.
00:01:49And because most AI models have a knowledge cut-off, they are often trying to write Swift
00:01:54code using outdated rules that simply don't work in the latest version of Xcode.
00:01:59And finally there's the benchmarking bias.
00:02:02Most of the AI models we're testing today, like Qwen or Grok, are trained to the test.
00:02:08They are optimized to pass massive benchmarks like HumanEval, which are almost entirely focused
00:02:13on Python and web-based logic.
00:02:16Since there aren't many major benchmarks for complex iOS UI, these models simply haven't
00:02:21been graded on their ability to build a functional app.
00:02:25So I chose some of the most popular AI coding models out there and I gave each one of them
00:02:30the exact same prompt.
00:02:32I tasked each of them to build a simple Tinder-like app clone in Swift called DogTinder, where
00:02:38you are presented with different dogs using the Dog CEO API.
00:02:43And you can swipe left or right to choose which ones you like and if there is a match,
00:02:47you can open up a chat interface to exchange funny messages with the matched dog.
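For reference, the Dog CEO part of the task is small: the API returns a JSON object with a `message` field containing the image URL. A minimal Swift sketch of fetching a random dog image might look like this (the `DogResponse` type and function name are my own illustration, not code from any of the tested models):

```swift
import Foundation

// The Dog CEO random-image endpoint returns JSON like:
// {"message": "https://images.dog.ceo/breeds/.../dog.jpg", "status": "success"}
struct DogResponse: Decodable {
    let message: String
    let status: String
}

// Fetch one random dog image URL (illustrative helper, error handling kept minimal).
func fetchRandomDogImageURL() async throws -> URL? {
    let endpoint = URL(string: "https://dog.ceo/api/breeds/image/random")!
    let (data, _) = try await URLSession.shared.data(from: endpoint)
    let decoded = try JSONDecoder().decode(DogResponse.self, from: data)
    return URL(string: decoded.message)
}
```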
00:02:52So it's supposed to be cute and simple enough for an agent to complete and it also involves
00:02:58some interesting challenges like building a swipe animation functionality in native Swift.
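To give a sense of that challenge, a bare-bones swipeable card in SwiftUI might look like the following sketch (the view name, `swipeThreshold`, and callback are illustrative, not taken from any model's output):

```swift
import SwiftUI

// A minimal draggable card: the image follows the drag, tilts slightly,
// and a swipe past the threshold reports "like" (right) or "nope" (left).
struct SwipeCardView: View {
    let imageURL: URL
    let onSwipe: (_ liked: Bool) -> Void

    @State private var dragOffset: CGSize = .zero
    private let swipeThreshold: CGFloat = 120  // illustrative value

    var body: some View {
        AsyncImage(url: imageURL) { image in
            image.resizable().scaledToFit()
        } placeholder: {
            ProgressView()
        }
        .offset(x: dragOffset.width)
        .rotationEffect(.degrees(Double(dragOffset.width / 20)))
        .gesture(
            DragGesture()
                .onChanged { dragOffset = $0.translation }
                .onEnded { value in
                    if abs(value.translation.width) > swipeThreshold {
                        // Right swipe = like, left swipe = nope.
                        onSwipe(value.translation.width > 0)
                    }
                    withAnimation(.spring()) { dragOffset = .zero }
                }
        )
    }
}
```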
00:03:03So for the tests themselves, we are going to start from the worst performer going all the
00:03:07way to the best one.
00:03:09And in last place, we unfortunately have the new Qwen3 Coder Next model.
00:03:15Qwen has been advertising this new model as an open-source alternative to heavyweights
00:03:20like Kimi or Claude, with a smaller model size but higher performance.
00:03:25And while that may be true for web apps, it did not hold up for the Swift challenge unfortunately.
00:03:32So whenever possible, I tried using each model's own CLI tool where one was available,
00:03:37and in this case, I used the Qwen CLI tool to conduct this challenge.
00:03:42And once it was done generating the code, I could not open the project file that Qwen
00:03:46had produced.
00:03:48So then I prompted it to fix the error that was presented when I tried to open the file.
00:03:53But even then, Qwen could not fix the error and instead provided me with a long README
00:03:58file on how to build this project on my own from scratch and then copy the files over to
00:04:03the project folder, which is not something I want to do manually for this challenge because
00:04:08that would defeat the purpose.
00:04:09And as you will see later, I noticed that some of the models had a very hard time producing
00:04:14the final collection of files for this project that I could open successfully on the first
00:04:19try.
00:04:20So for these cases, like Qwen here, I decided to give it an easier challenge instead.
00:04:26So I created a new iOS app project in Xcode manually, and I decided this could be
00:04:31a good time to try out the new coding intelligence functionality that is now packaged with the
00:04:37newest version of Xcode.
00:04:38And this is pretty cool, because Xcode finally has its own AI assistant feature.
00:04:43So I hooked it up to my OpenRouter account and chose the Qwen3 Coder Next model from
00:04:49the dropdown and tried the challenge again.
00:04:52Even with all this handholding, Qwen still could not produce a working project on the
00:04:57first try, because we ran into some issues with correctly setting up the Swift models.
00:05:02And now with the new AI assistant feature, we can highlight all of these issues and then
00:05:07task the assistant to generate the fix for all the selected issues at once.
00:05:12So finally, after a few rounds of prompting Qwen to fix the remaining issues, we
00:05:16got a working version of the DogTinder app, but honestly, the result was pretty bad.
00:05:23It could not even load the images from the Dog CEO API, and the whole UI was also very
00:05:29primitive and not exciting at all.
00:05:32Not to mention that there was a bug in the matches section where none of the matches were
00:05:36actually appearing.
00:05:37So unfortunately, Qwen totally failed the Xcode app challenge.
00:05:42So moving on to our second-to-last place, we have Grok with its Grok Code Fast model.
00:05:48For this one, I tried to use it through the Copilot extension in VS Code, and once again,
00:05:53I ran into the same issue where Grok was not able to produce all the project files needed
00:05:59for the complete Swift project package.
00:06:02And instead, it provided me with instructions on how to copy the files manually.
00:06:06So once again, I had to fall back to using the AI assistant in Xcode by calling the Grok
00:06:12model from OpenRouter.
00:06:14And Grok also ran into a couple of issues, so I had to prompt it twice to fix the remaining
00:06:19errors.
00:06:20But after all of that, it was able to successfully complete the app.
00:06:23And at first glance, Grok did a terrible job with the design.
00:06:27It was not exciting at all, and there wasn't even a section where we could see
00:06:32the matches.
00:06:33The only reason why I put Grok higher than Qwen is because, at least from a functionality
00:06:38standpoint, everything is working, including the chat functionality. But to be honest, they
00:06:44were both very close in terms of poor performance.
00:06:48And nothing about this app seems exciting or visually pleasing.
00:06:51So I wouldn't say Grok failed the challenge, but it does get the lowest passing grade it
00:06:57can possibly get.
00:06:58Next up on our leaderboard is Kimi with their newest Kimi K2.5 model.
00:07:04And Kimi had the same issue as Qwen: when using its native CLI, it produced the
00:07:08project file, but I could not open it.
00:07:11Even prompting for a fix through the CLI did not resolve the issue.
00:07:15So once again, for Kimi's test, I had to use the built-in Xcode AI assistant feature
00:07:20with Kimi K2.5 provided by OpenRouter.
00:07:23And Kimi's performance was similar to Qwen's and Grok's, because it did not complete
00:07:29the challenge on the first try.
00:07:31So I had to prompt it again to fix the remaining issues.
00:07:34But after just one round of issue fixing, Kimi was able to produce the final result.
00:07:39And this version was actually a step up from Qwen and Grok, because at least now we got an
00:07:44app that actually looks like a Tinder-like app.
00:07:47And we now got this nice left and right swipe animation along with the like and nope stickers
00:07:53on the sides and a fancy pop-up when we got a match.
00:07:57But the animation was very buggy and very finicky.
00:08:00At times I couldn't even see the image at all because it was floating somewhere off screen.
00:08:05But at least Kimi was able to store the matches properly.
00:08:08And we actually got a section where we could see our matches and open any of them and start
00:08:12chatting with the specific dog.
00:08:14So this is already a big step up from Qwen and Grok.
00:08:18But if I have to compare it with other examples you'll see later in this video, I would say
00:08:22it is still a subpar result.
00:08:25And that's why I give Kimi a lower place on the leaderboard.
00:08:29And next up we have Gemini 3 Pro.
00:08:31And this one is interesting, because I got totally different results when testing the same model
00:08:36from its own CLI versus from Xcode's AI assistant.
00:08:41So first let's see what we got when we used the Gemini CLI.
00:08:45It does say that the model is still in preview mode on the CLI.
00:08:49So maybe that was the core issue.
00:08:50But once again, when I prompted it with the same prompt I used for every model in this
00:08:55challenge, it could not give me a project file at the end.
00:08:59And this is because, in order to create an Xcode project file, the agent first needs to create a YAML
00:09:04file with the project details and then run the XcodeGen CLI to generate it.
00:09:09But for some reason, some models refuse to do this or don't know how to do it.
00:09:14But nonetheless, once I prompted Gemini to specifically create the file, it did so.
00:09:18And I just needed to give it access to execute the XcodeGen command.
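For context, that YAML-plus-XcodeGen workflow looks roughly like the following sketch; the project name and settings here are illustrative, not the exact files any of the models produced:

```yaml
# project.yml - hypothetical XcodeGen spec; values are illustrative
name: DogTinder
options:
  bundleIdPrefix: com.example
targets:
  DogTinder:
    type: application
    platform: iOS
    deploymentTarget: "17.0"
    sources: [DogTinder]
```

Running `xcodegen generate` next to this file produces `DogTinder.xcodeproj`, which is exactly the artifact several of the models failed to deliver.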
00:09:22And once we opened the project, we got an asset error.
00:09:25But that was quick for Gemini to fix.
00:09:28And once that was solved, the app was finally compiling.
00:09:31But the result was bad, surprisingly bad.
00:09:35It was broken.
00:09:37The matches system was not working properly and everything was buggy.
00:09:41So at this point, I was willing to give Gemini a failing grade.
00:09:45But just out of curiosity, I decided to give Gemini another shot and conduct the challenge
00:09:50using Xcode's native AI assistant by running Gemini 3 Pro through OpenRouter.
00:09:56And once I did that, this time it got it right on the first try.
00:10:01And not only that, but the app was amazingly good.
00:10:04I mean, the design was great.
00:10:06The functionality was in place.
00:10:08It even added a nice little logo on top.
00:10:10Honestly, there was nothing to fault in this version of the app.
00:10:14So I'm a bit baffled as to why running the same prompt through the same model, but through
00:10:20different AI coding tools, produced two such different results.
00:10:24But nonetheless, I was very impressed with the version that Gemini finally gave me through
00:10:29Xcode's tooling and on the first try, might I add.
00:10:32So that's why I put Gemini a bit higher on the leaderboard, because the end result was
00:10:37actually pretty great.
00:10:38OK, so next up on the leaderboard, we have GPT-5.3 Codex.
00:10:43And since OpenAI has their own Codex app, I decided to conduct the challenge from their
00:10:48own app.
00:10:49And unlike the previous models we have seen so far, GPT-5.3 was actually able to produce
00:10:55the final working product on the first try.
00:10:58So this is already a big step up.
00:11:00But I've got to say, the app itself was not particularly exciting.
00:11:03It had a very monotone blue color theme.
00:11:06And the biggest issue that bugged me is that it couldn't fit the width of the image within
00:11:11the frame of the app.
00:11:13So for some dogs, you ended up with a very stretched-out container that extends outside
00:11:18the bounds of the app.
00:11:20So this is a big design flaw that Codex was not able to handle properly.
00:11:25But the app itself is functional with all the necessary UI elements.
00:11:29And we also got the matches section working properly where we could chat with the dogs.
00:11:34So the reason why I give GPT-5.3 such a high place on the leaderboard is that this is the
00:11:40first model that was actually able to produce the entire Swift project package without any
00:11:46handholding or without setting up the Xcode project beforehand.
00:11:50So overall, not too bad, but also not too exciting.
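As an aside, the stretched-image problem described above is usually avoided in SwiftUI by filling a fixed frame and clipping the overflow. A minimal sketch (the view name and card size are illustrative):

```swift
import SwiftUI

// Keep a remote image inside a fixed-size card:
// fill the frame (which may overflow), then clip the excess.
struct DogImageCard: View {
    let url: URL
    private let cardSize = CGSize(width: 320, height: 420)  // illustrative size

    var body: some View {
        AsyncImage(url: url) { image in
            image
                .resizable()
                .scaledToFill()  // fill the card; wide images overflow horizontally
        } placeholder: {
            ProgressView()
        }
        .frame(width: cardSize.width, height: cardSize.height)
        .clipped()           // cut off the overflow so nothing escapes the card
        .cornerRadius(16)
    }
}
```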
00:11:53All right.
00:11:54And finally, we get to the first place on the leaderboard.
00:11:57And I'm just going to give you a moment to guess which model that could be.
00:12:01And yes, I think we all know which model that is.
00:12:04It is, of course, Opus 4.6, which absolutely aced this challenge right off the bat.
00:12:11I prompted it with the same prompt as the other models, but I used Anthropic's own Claude
00:12:17Code CLI tool, and I just needed to provide the necessary permissions.
00:12:20And the model did everything on its own, including creating a fully functional Xcode project file
00:12:27without me having to set it up beforehand.
00:12:29And not only that, but the app itself was absolutely beautiful.
00:12:34The design was there.
00:12:35The animations were nice and fluid.
00:12:37The matches section was working correctly as well as the chat window.
00:12:41The only thing we didn't get in this version was a fancier logo like the one Gemini produced
00:12:46earlier.
00:12:48But other than that, this was the best looking version of them all.
00:12:52And it even managed to produce this on the first try.
00:12:55So I would say Opus's performance is absolutely incredible compared to all the other models.
00:13:01So it definitely deserves the first place on the leaderboard.
00:13:05But wait, there's more.
00:13:07Here's a little bonus for you folks.
00:13:09There is still one more model that we need to review that hasn't been shown on the leaderboard
00:13:13yet.
00:13:14You see, while I was making this video, there was an announcement that GLM just released
00:13:18their latest model, GLM-5, and they are bold enough to claim that this model scores
00:13:23even higher in coding than Opus 4.6.
00:13:26So obviously I had to test it out on the same Swift challenge.
00:13:31And since GLM does not have their own CLI tool, I once again used Xcode's AI assistant
00:13:37by hooking it up to OpenRouter and using GLM-5 from there.
00:13:41And first of all, GLM did not complete this challenge on the first try.
00:13:45So that already shows a worse performance than Opus 4.6.
00:13:49But secondly, I had to go through three rounds of bug fixes to finally get it to compile successfully.
00:13:56So let's see what the final result is for GLM-5.
00:13:59As you can see, it already looks like a failing grade to me.
00:14:03It cannot seem to load up any of the dog images.
00:14:06It does not have the swipe functionality.
00:14:08And what's even worse, it only cycles through three dogs and then shows a message that there
00:14:13are no more dogs available.
00:14:15And furthermore, if we go to the matches section, we cannot click on any of the matches to open
00:14:20the chat interface with any of the dogs.
00:14:23So this section is clearly not finished.
00:14:25So judging by this result, where do we put GLM on the leaderboard?
00:14:29Well, I'm afraid we have to put it in second-to-last place, just above Qwen, because
00:14:36this performance was just not acceptable and not nearly as good as any of the other models.
00:14:42So stating that GLM-5 is stronger than Opus 4.6 is a pretty bold claim.
00:14:47Now, I haven't tested this model on any other coding tasks, and it might just be the case
00:14:52that maybe for simpler web dev projects, it works just as well or maybe even better than
00:14:57Opus 4.6.
00:14:59But this is definitely not a good model for coding in Swift.
00:15:02So what did we learn today?
00:15:04Clearly, while the AI revolution is moving at light speed, the Swift problem for these models
00:15:10is still very real. Opus 4.6 and GPT-5.3 proved that if the model is large enough and the reasoning
00:15:18is strong enough, they can overcome the lack of open source Swift code data.
00:15:23But for models like Qwen and Grok, the data gap and API drift we talked about earlier are
00:15:29clearly hitting them hard.
00:15:31And I was also surprised how helpful Xcode's new AI assistant actually is for Swift apps.
00:15:36We could clearly see this in the difference between the two Gemini app versions.
00:15:40So if you're an iOS developer, it's probably helpful to use Xcode's built-in AI tooling to
00:15:46get better results.
00:15:47So there you have it folks, I hope you enjoyed this leaderboard breakdown.
00:15:51I think this opens up a wider conversation about the fact that maybe we should start having
00:15:55language specific models.
00:15:57Because clearly a lot of these models are more heavily biased towards web apps, JavaScript
00:16:03or Python projects.
00:16:04But for some bespoke coding solutions, we might need some custom coding models.
00:16:09But what is your take on all of this?
00:16:11Let us know in the comment section down below.
00:16:13And folks, if you enjoyed this video, please let me know by smashing that like button underneath
00:16:18the video.
00:16:19And also don't forget to subscribe to our channel.
00:16:22This has been Andris from Better Stack, and I will see you in the next video.