Bye, Bye OpenAI & Anthropic?
Maximilian Schwarzmüller
Computing/Software, Business News, Internet Technology
Transcript
00:00:00A couple of hours ago there was a pretty big announcement. Or some pretty big hype. We don't
00:00:06know yet and I definitely wouldn't rule out the hype part. The pointless hype part. But if it's
00:00:13true, it's indeed a big announcement. Because Alexander Wedin, whom I didn't know, and you probably
00:00:20didn't know either, announced sub-q, which stands for sub-quadratic, a major breakthrough in LLM
00:00:28intelligence. And what he announced here is a brand new type of large language model that excels at
00:00:36long-context tasks without losing — at least that is what he claims — without losing the "intelligence"
00:00:45— in quotes, the models are generating tokens but that gives them their intelligence in the end — so
00:00:52without losing the intelligence that you're used to from current frontier models like Opus 4.7,
00:00:59GPT 5.5 and so on. Now what he mentions in the announcement post on X — and then there
00:01:04also is an announcement blog post with more technical details at which we'll have a look
00:01:08because we'll dive in deep in this episode and video here — what he announces here is a model that is
00:01:16way faster when doing inference on one million token context tasks and costs way less. Five percent
00:01:26of what Opus costs. He also promises that their initial model will have a 12 million token context
00:01:35window which, just to put that number into perspective, means you can fit entire code bases,
00:01:42huge code bases into that context window. You can fit multiple large legal documents in there and
00:01:49that's of course why models like this, if they exist and work, could be super useful and totally
00:01:57game-changing. No other way of putting it. We don't have a lot of details yet, and
00:02:02I'll get back to that, but if they work, that of course means that all these workarounds that we're
00:02:08using right now, like sub-agents, RAG and so on, could go away. These are all workarounds around the problem that
00:02:15the model only sees a small part of the thing it should see. So if you're working on a code base,
00:02:22existing frontier models, depending on the size of your code base, can't see the entire code base.
00:02:28They can't load the entire code base. So if you're asking it to change something, you have to hope
00:02:33that the model finds the right parts in your code base to make the change you're asking it for.
00:02:40And that of course becomes more and more of a problem the bigger the code base or the bigger
00:02:45the amount of documents you want the model to work on. So if you have a model that can reliably
00:02:52use a 12 million token context window with good quality, that naturally would be a game changer.
00:02:59Speaking of game-changing, we'll dive deep in this video and I will dive deep in all my courses. So
00:03:06if you're interested in learning how to practically use tools like Claude Code, Codex, other AI tasks,
00:03:13or coding, or the combination of all that, then my courses may be worth a look. They're practical,
00:03:19they're hands-on, they're in-depth, and you can get the individual courses or the membership,
00:03:24which gives you access to all the courses for one monthly or annual price. Links below.
00:03:31So let's dive in a bit deeper now. And as mentioned, there is an announcement blog post with
00:03:36some technical details, but not a lot to be very clear here. There's a lot of information missing,
00:03:43and we also don't have a lot of benchmarks. Specifically, they only published three
00:03:49benchmarks. The first is the RULER benchmark, which tests retrieval and reasoning behaviors beyond simple
00:03:56needle lookup, including multi-hop retrieval, aggregation, variable tracking, and selective
00:04:01filtering. So that is a benchmark, which in the end is all about a model finding multiple pieces
00:04:06of relevant information from a relatively big context window. 128,000 tokens. So not a super large
00:04:15context window, nowhere near the 12 million they promised, but also not just 5K or so.
00:04:22So this is a benchmark that tests how well a model can find and piece together different parts from a
00:04:28more or less large context window or document base. And here their model is on the same level as
00:04:36Opus 4.6. In that post, they also mention another benchmark, the MRCRv2 benchmark, which is also about long
00:04:45context retrieval tasks where their model is in the range, as they stated, of Opus 4.6. Though it's,
00:04:53yeah, it's in the range if you look at all the other results here, but it's definitely worse.
00:05:00Which of course is interesting since their entire thing is the long context retrieval here. But then
00:05:07again, of course, you could also argue that for super long context window use cases, the other
00:05:15models aren't usable at all, whilst theirs might still give you very good results, which may be
00:05:22better than nothing. And of course, their models also can definitely improve over time. So I wouldn't
00:05:29take this as a super bad sign for the initial model. It's just something worth noting. And of
00:05:35course, it's also worth noting that it's far better than Gemini 3.1 Pro, for example, or Opus 4.7 in
00:05:43that table. And they also released one benchmark, which I found interesting, which is about coding
00:05:49related tasks. Now, I will say that all these benchmarks, I'm not a huge fan of them. We all know
00:05:56that they can kind of be gamed, many of them can at least. Models can deliberately or inadvertently
00:06:05be fine-tuned or optimized to perform well in benchmarks. We had plenty such cases in the past,
00:06:12but still, they give us something to look at. And I find this software engineering benchmark here
00:06:20interesting, because here we can see that their model is pretty much in the range of the Opus
00:06:27models. And that, of course, shows that it's not just able to find information in long context
00:06:36windows, in lots of documents, big code bases, but that it's also able to do something useful with it,
00:06:42that it's able to generate meaningful, good code as a result of its intelligence and of the data it is
00:06:50able to retrieve in these long context windows, so to say. So it's not just about retrieving,
00:06:54it's also about doing useful stuff. And it seems to be good there. But as mentioned, that is about
00:07:00it. We got no other deep dives or technical details. There is no model card yet. And therefore,
00:07:09all we have is a description, essentially, of how their model uses sparse attention instead of dense
00:07:16attention to make these long context tasks work or to make the model work efficiently
00:07:22in long context window scenarios, and how the model achieves its speed up and its cost efficiency,
00:07:29because it is faster and cheaper, right? That is what they announced. So let's take a look at
00:07:37dense versus sparse attention to understand what is going on here. Now, dense attention is
00:07:45what you have in the current frontier models. So your GPT 5.5, Opus 4.7, all the other models,
00:07:52these are all dense models, which essentially means that for every new token, let's say token D,
00:07:58in order to generate that token, all other tokens have to be evaluated and the connections between
00:08:08these tokens have to be evaluated because the entire idea in large language models is that you
00:08:13derive a future token, which could be an entire word or a part of a word based on what came before
00:08:20that token. So if you have, for example, a sentence like "a contract can be terminated at any ...",
00:08:28then the next word thereafter is what you want to predict. You may have asked a model, "Hey,
00:08:35when can I terminate my contract?" And you may have fed that contract as a PDF document or as plain
00:08:42text into your prompt as well. So the prompt in front of this sentence, which the model is
00:08:48generating as an output is your question and then maybe some other context. So the contract, for
00:08:57example, right? That is how we currently use models. And in order to produce this token here,
00:09:03and in order to produce each token that came in front of it, the model basically had a look at the
00:09:10entire conversation, all the tokens in there. So that's your question and any additional context
00:09:16you put in there. And it split that into multiple tokens and then combined all these tokens or
00:09:23calculated weights in the end based on all the combinations of the prior tokens. So for example,
00:09:30if that were our entire conversation, obviously deliberately short, it's an example, then this is
00:09:38how it would have been split up into tokens for the GPT-5 models, for example. So some tokens are
00:09:46just a word or a word with a blank in front of it. Some tokens are just special characters.
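The "look at every token that came before" idea can be sketched in a few lines of plain Python. This is a minimal illustration of dense causal attention, not code from any of the models mentioned. The function names and the toy vectors are made up for the example, and real models run this on learned, high-dimensional vectors across many layers and heads.

```python
import math

def softmax(scores):
    """Turn raw scores into weights that sum to 1."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dense_causal_attention(queries, keys, values):
    """For each position i, attend to every position j <= i.

    queries, keys, values: lists of equal-length float vectors, one per token.
    Returns the output vectors and the number of (query, key) pairs scored.
    """
    dim = len(queries[0])
    outputs, pairs_scored = [], 0
    for i, q in enumerate(queries):
        # Score the current token against itself and every earlier token.
        scores = [sum(a * b for a, b in zip(q, keys[j])) / math.sqrt(dim)
                  for j in range(i + 1)]
        pairs_scored += len(scores)
        weights = softmax(scores)
        # Weighted average of the value vectors of all visible tokens.
        out = [sum(w * values[j][d] for j, w in enumerate(weights))
               for d in range(dim)]
        outputs.append(out)
    return outputs, pairs_scored

# Toy vectors standing in for 4 tokens (made-up numbers, not real embeddings).
toy = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, -0.5]]
outputs, pairs = dense_causal_attention(toy, toy, toy)
print(pairs)  # 4 tokens -> 1 + 2 + 3 + 4 = 10 pairs, i.e. n * (n + 1) / 2
```

The pair count is the part to watch: it grows with the square of the number of tokens, which is where the cost of long contexts comes from.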
00:09:51And in order to generate that next token, all previous tokens are in the end combined with
00:09:58each other to understand the meaning in the end. Because of course, a question mark has a very
00:10:05different meaning and implication for a future token, depending on what came in front of that
00:10:11question mark. So that question mark is combined with all previous tokens. And it's the combination
00:10:17of all these combinations in the end, that's then used to derive that final token. That's on a
00:10:22very high level, how you can think of dense attention and how it works. Now, naturally,
00:10:29that is very inefficient, but it's kind of the best we have right now, at least when it comes to the
00:10:36intelligence and the quality of the output. But it is quadratic because it's n times n:
00:10:44each of the n tokens has to be combined with all the tokens before it. There are
00:10:49optimization mechanisms like KV caching, which in the end caches the key and value vectors
00:10:56that have been calculated for earlier tokens. So for a new token, you don't have to recalculate
00:11:01all previous combinations, but you still have to calculate that new token by comparing it to all
00:11:08the previous cached entries. So over the whole sequence you still end up in that quadratic situation. And that of course
00:11:16is inefficient and slow, which is why these frontier models we have right now are very compute hungry,
00:11:24slow, especially when you do get into the higher context window areas and why there are pretty
00:11:31strict context window size limits. Since it's quadratic, of course, a 12 million token context
00:11:38window is pretty much impossible to compute. It would take forever and compute time is just one
00:11:46dimension, memory that must be reserved is another one. So that's how dense models work in a nutshell
00:11:54and what their limitations are. Now, the opposite or an alternative approach that is used by that
00:12:00new model, the sub-q model that was announced yesterday, is to use sparse attention. Now,
00:12:06how does sparse attention work? The idea with sparse attention is that in order to calculate a new
00:12:14token, you don't look at all the previous tokens, you don't have the combinations of all the previous
00:12:20tokens, but just of a few selected tokens. So for example, if you want to derive the token D here,
00:12:28you may just be looking at B and C, but not at A. Now, of course, the big question then is,
00:12:33how do you decide at which previous tokens to look or which previous tokens are interesting for
00:12:40producing that new token. And there are different approaches that have been used in the past because
00:12:46this new model is not the first sparse attention model. But the reason why they haven't really
00:12:52taken off here is that they have serious limitations. For example, one way is to use a
00:12:59local window approach. Now, what does that mean? That means that in order to produce a new token,
00:13:06let's say the token number five, the fifth token in a sequence, we take a look at, let's say,
00:13:13just the two tokens before it. So tokens three and four, for example. So you have a sliding window of tokens
00:13:22and you always just take a look at the tokens in front of the token you're about to generate. Now,
00:13:27as you can imagine, this has some serious limitations because if I'm only looking at the last
00:13:33few tokens, if I, for example, wonder when a contract can be terminated, the information
00:13:39may be here in the extra context I passed into the prompt, but it's not part of that local window
00:13:45if the local window is just the last few tokens, for example. So that next token that's about to be
00:13:50predicted has no idea of what was before in that context. So that's not useful. You can have an
00:13:55unlimited context window size with this approach, but all the context doesn't matter. So that's an
00:14:01obvious limitation. Another approach is a so-called global token approach. Here, the idea is that you
00:14:09have a global summary token. So on a high level, you can think of this as a special token that comes
00:14:16at the beginning of the token sequence that's inserted at the beginning of the token sequence
00:14:20by the model, so to say, which summarizes the tokens after it. That's kind of how you can think of it.
00:14:27And then for predicting the next token, that global token is taken into account. Now, that may work
00:14:34very well if we go back to this example here with the legal text that you may have passed to a model
00:14:40in your prompt. If that summary that was generated here for your conversation, if that includes the
00:14:46contract termination terms, for example, then of course this next token can be predicted very well
00:14:53based on that summary. But if you're unlucky and the summary does not include these details,
00:15:00well then you're out of luck and you're back to the state where the information is totally missing.
00:15:04So a global token approach can work, but of course the longer your context window gets,
00:15:12the more generic the summary gets. I mean, that's easy to imagine. If you have like a
00:15:16hundred page PDF document and you were to summarize that in a sentence or two, it would be very
00:15:22unspecific, right? So of course, predicting the next token based on that summary won't really work.
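To make the trade-off of these two approaches concrete, here is a small sketch that lists which positions a token may attend to under a sliding window, optionally with one global token at position 0, and counts how many token pairs get scored compared to dense attention. The function names and the window size of 4096 are assumptions chosen for illustration. The window here counts the current position as well, and the sketch only models which positions are visible, not how a summary token would actually be learned.

```python
def dense_pairs(n):
    """Pairs scored by dense causal attention over n tokens."""
    return n * (n + 1) // 2

def visible_positions(i, window, use_global_token=False):
    """Positions that token i may attend to under a sliding window.

    window: how many of the most recent positions (including i) are visible.
    use_global_token: if True, position 0 is always visible as well.
    """
    start = max(0, i - window + 1)
    visible = list(range(start, i + 1))
    if use_global_token and start > 0:
        visible.insert(0, 0)
    return visible

def sparse_pairs(n, window, use_global_token=False):
    """Closed-form count of pairs scored under the pattern above."""
    w = min(window, n)
    # The first w tokens see 1, 2, ..., w positions; the rest see w each.
    pairs = w * (w + 1) // 2 + (n - w) * w
    if use_global_token:
        pairs += n - w  # one extra pair per token whose window misses position 0
    return pairs

print(visible_positions(9, window=3))                         # [7, 8, 9]
print(visible_positions(9, window=3, use_global_token=True))  # [0, 7, 8, 9]

for n in (128_000, 12_000_000):
    print(f"{n:,} tokens: dense {dense_pairs(n):,} pairs, "
          f"windowed {sparse_pairs(n, 4096):,} pairs")
```

The windowed count grows linearly with the context length, which is why the context can be almost arbitrarily long. The price is visible in the first two lines of output: anything outside the window, apart from the single global position, simply is not there for the token being predicted.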
00:15:29Now, another approach would be to use a router, which is that you have like an extra neural
00:15:37network. So you have two models, essentially your large language model, and then you have an extra
00:15:43routing model. And that routing model takes a look at the prompt by the user or at the context of the
00:15:51next token to be generated and then routes that token, so to say, to the other tokens it deems
00:15:59relevant. But now that of course means that you now have a routing model, which somehow needs to
00:16:04keep track of all the other tokens that come after it. So that probably goes back into the quadratic
00:16:10attention area or is very unspecific and you're relying on that. So you're again either back going
00:16:17to the quadratic complexity and you're not gaining that much compared to a dense model or you don't
00:16:23do that and you'll probably have some loss because the router is not very good. So just as with the
00:16:30summary, you would be hoping that the router does a good job and activates the right tokens for
00:16:37predicting the next token. And that is why sparse attention is interesting but hasn't really taken
00:16:46off thus far because all these different approaches have meaningful trade-offs and to this point,
00:16:54to my knowledge, there hasn't been a sparse attention model that would have produced
00:17:00quality comparable to the current frontier dense models and would be able to act over a big
00:17:07context window. And they promise to change this with their new model. In that announcement blog post,
00:17:14they mention that their model does content-dependent selection. For each query, the model selects which
00:17:22parts of the sequence are worth attending to and computes attention exactly over those positions. So
00:17:28in the end, we're back to this routing approach but they kind of promise here, mention here,
00:17:35that their mechanism seems to be very efficient for activating the right tokens for predicting
00:17:43the next token. They mention that dense attention assumes every pair might matter, so it evaluates
00:17:49all of them. In practice, almost none do. SSA, which stands for sub-quadratic selective attention,
00:17:55which is their approach, removes that assumption. It does not approximate attention. It restricts
00:18:01attention to the positions that actually carry signal and skips the rest. That is their approach.
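Since the blog post does not say how the selection actually works, the following is only a generic sketch of content-dependent selection, not their SSA mechanism. It uses a plain top-k rule as a stand-in: score the keys against the query, keep the k best, and compute exact softmax attention over just those positions. The function name, the top-k rule and the toy vectors are all assumptions made for this example.

```python
import math

def topk_selective_attention(query, keys, values, k):
    """Attend only to the k keys that score highest against the query.

    Illustrative only: scoring every key to find the top k is itself
    linear work per query, which is the router problem described above.
    """
    dim = len(query)
    scores = [sum(a * b for a, b in zip(query, key)) / math.sqrt(dim)
              for key in keys]
    # Keep the indices of the k best-scoring positions, in original order.
    selected = sorted(sorted(range(len(keys)), key=lambda j: scores[j],
                             reverse=True)[:k])
    # Exact softmax attention, restricted to the selected positions.
    m = max(scores[j] for j in selected)
    exps = [math.exp(scores[j] - m) for j in selected]
    total = sum(exps)
    output = [sum(e / total * values[j][d] for e, j in zip(exps, selected))
              for d in range(dim)]
    return selected, output

# Made-up toy data: five keys, and values that simply encode their position.
keys = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [-1.0, 0.0], [0.0, -1.0]]
values = [[float(j), 0.0] for j in range(len(keys))]
selected, output = topk_selective_attention([1.0, 0.0], keys, values, k=2)
print(selected)  # [0, 2]: the two keys most aligned with the query
```

Note that this naive version still looks at every key once to pick the winners, so on its own it does not get below quadratic over a full sequence. How the selection is made cheap is exactly the detail that is missing from the announcement.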
00:18:08They're doing content-dependent routing to activate the right tokens or to use the right tokens for
00:18:14predicting the next token and that is what gives them their efficiency boost. And we have yet to see
00:18:21how well this actually works because, as mentioned, we have a very limited subset of benchmarks here.
00:18:30Not a lot of other or no other benchmarks. We have no model card. We have no details on how exactly
00:18:36their content-dependent selection works and therefore we have a lot of question marks here.
00:18:42And if there's one thing we definitely learned over the last months and years, it is that
00:18:49AI is obviously a useful tool and I use it every day. You probably use it every day and
00:18:57tools like Codex or Claude Code are very useful. I have no doubt about that and, well, that is my
00:19:04experience with them but we also learned that we're in an industry with a lot of hype. We're in a
00:19:10transition period. Everything is changing or a lot is changing right now and therefore of course there
00:19:16are a lot of promises everywhere and not all promises materialize into something
00:19:26actually useful. I mean, take the models by Meta for example which were dense models. The Llama 4
00:19:35models had amazing benchmark numbers but weren't that great. So there are a lot of hyped up examples
00:19:42and that's just one example of course. There are many examples out there. It's definitely worth
00:19:49being cautious but if they publish these models and you can apply for early access right now,
00:19:56I did but I didn't get access yet. If these models do live up to their promises, if they are useful,
00:20:05intelligent across large context window sizes, that of course will change a lot. That will help with
00:20:13the compute constraints we have right now because there is not even close to enough compute out there
00:20:19in the world. We need way more data centers, chips, electricity and everything. So having a model that
00:20:25is way more efficient would help with that. Well, maybe we would use it that much more that the
00:20:33problem stays the same but still it would definitely enable more use right now. And of course it would
00:20:40unlock brand new use cases. It would make it possible to simply shove an entire code base in
00:20:45there and act on that. So all these workarounds we're using right now would go away. We wouldn't
00:20:52need sub-agents necessarily. We wouldn't need RAG systems if that would work. But that's a "would",
00:21:00of course, and we have yet to see if that lives up to the big promises they're making. If it does,
00:21:07they definitely founded a billion, multi-billion or trillion dollar company there.