Bye, Bye OpenAI & Anthropic?

Maximilian Schwarzmüller
Computing/Software, Business News, Internet Technology

Transcript

00:00:00A couple of hours ago there was a pretty big announcement. Or some pretty big hype. We don't
00:00:06know yet and I definitely wouldn't rule out the hype part. The pointless hype part. But if it's
00:00:13true, it's indeed a big announcement. Because Alexander Wedin, whom I didn't know and you probably
00:00:20didn't know either, announced sub-q, which stands for sub-quadratic, a major breakthrough in LLM
00:00:28intelligence. And what he announced here is a brand new type of large language model that excels at
00:00:36long-context tasks without losing — at least that is what he claims — without losing the "intelligence"
00:00:45— in quotes, the models are generating tokens but that gives them their intelligence in the end — so
00:00:52without losing the intelligence that you're used to from current frontier models like Opus 4.7,
00:00:59GPT 5.5 and so on. Now what he mentions in the announcement post on X — and then there
00:01:04also is an announcement blog post with more technical details at which we'll have a look
00:01:08because we'll dive in deep in this episode and video here — what he announces here is a model that is
00:01:16way faster when doing inference on one million token context tasks and costs way less. Five percent
00:01:26of what Opus costs. He also promises that their initial model will have a 12 million token context
00:01:35window which, just to put that number into perspective, means you can fit entire code bases,
00:01:42huge code bases into that context window. You can fit multiple large legal documents in there and
00:01:49that's of course why models like this, if they exist and work, could be super useful and totally
00:01:57game-changing. No other way of putting it. If they work — we don't have a lot of details yet,
00:02:02I'll get back to that — but if they work, that of course means that all these workarounds that we're
00:02:08using right now, like sub-agents, RAG and so on, would go away. They are all workarounds for the problem that
00:02:15the model only sees a small part of the thing it should see. So if you're working on a code base,
00:02:22existing frontier models, depending on the size of your code base, can't see the entire code base.
00:02:28They can't load the entire code base. So if you're asking it to change something, you have to hope
00:02:33that the model finds the right parts in your code base to make the change you're asking it for.
00:02:40And that of course becomes more and more of a problem the bigger the code base or the bigger
00:02:45the amount of documents you want the model to work on. So if you have a model that can reliably
00:02:52use a 12 million token context window with good quality, that naturally would be a game changer.
00:02:59Speaking of game-changing, we'll dive deep in this video and I will dive deep in all my courses. So
00:03:06if you're interested in learning how to practically use tools like Claude Code, Codex, other AI tasks,
00:03:13or coding, or the combination of all that, then my courses may be worth a look. They're practical,
00:03:19they're hands-on, they're in-depth, and you can get the individual courses or the membership,
00:03:24which gives you access to all the courses for one monthly or annual price. Links below.
00:03:31So let's dive in a bit deeper now. And as mentioned, there is an announcement blog post with
00:03:36some technical details, but not a lot to be very clear here. There's a lot of information missing,
00:03:43and we also don't have a lot of benchmarks. Specifically, they only published three
00:03:49benchmarks. The RULER benchmark that tests retrieval and reasoning behaviors beyond simple
00:03:56needle lookup, including multi-hop retrieval, aggregation, variable tracking, and selective
00:04:01filtering. So that is a benchmark, which in the end is all about a model finding multiple pieces
00:04:06of relevant information from a relatively big context window. 128,000 tokens. So not a super large
00:04:15context window, nowhere near the 12 million they promised, but also not just 5K or so.
00:04:22So this is a benchmark that tests how well a model can find and piece together different parts from a
00:04:28more or less large context window or document base. And here their model is on the same level as
00:04:36Opus 4.6. In that post, they also mentioned another benchmark, the MRCRv2 benchmark, which is also about long
00:04:45context retrieval tasks where their model is in the range, as they stated, of Opus 4.6. Though it's,
00:04:53yeah, it's in the range if you look at all the other results here, but it's definitely worse.
00:05:00Which of course is interesting since their entire thing is the long context retrieval here. But then
00:05:07again, of course, you could also argue that for super long context window use cases, the other
00:05:15models aren't usable at all, whilst theirs might still give you very good results, which may be
00:05:22better than nothing. And of course, their models also can definitely improve over time. So I wouldn't
00:05:29take this as a super bad sign for the initial model. It's just something worth noting. And of
00:05:35course, it's also worth noting that it's far better than Gemini 3.1 Pro, for example, or Opus 4.7 in
00:05:43that table. And they also released one benchmark, which I found interesting, which is about coding
00:05:49related tasks. Now, I will say that all these benchmarks, I'm not a huge fan of them. We all know
00:05:56that they can kind of be gamed, many of them can at least, models can deliberately or unintentionally
00:06:05be fine-tuned or optimized to perform well in benchmarks. We had plenty such cases in the past,
00:06:12but still, they give us something to look at. And I find this software engineering benchmark here
00:06:20interesting, because here we can see that their model is pretty much in the range of the Opus
00:06:27models. And that, of course, shows that it's not just able to find information in long context
00:06:36windows, in lots of documents, big code bases, but that it's also able to do something useful with it,
00:06:42that it's able to generate meaningful, good code as a result of its intelligence and of the data it is
00:06:50able to retrieve in these long context windows, so to say. So it's not just about retrieving,
00:06:54it's also about doing useful stuff. And it seems to be good there. But as mentioned, that is about
00:07:00it. We got no other deep dives or technical details. There is no model card yet. And therefore,
00:07:09all we have is a description, essentially, how their model uses sparse attention instead of dense
00:07:16attention to make these long context tasks work or to make the model work efficiently
00:07:22in long context window scenarios, and how the model achieves its speed up and its cost efficiency,
00:07:29because it is faster and cheaper, right? That is what they announced. So let's take a look at
00:07:37dense versus sparse attention to understand what is going on here. Now, dense attention is
00:07:45what you have in the current frontier models. So your GPT 5.5, Opus 4.7, all the other models,
00:07:52these are all dense models, which essentially means that for every new token, let's say token D,
00:07:58in order to generate that token, all other tokens have to be evaluated and the connections between
00:08:08these tokens have to be evaluated because the entire idea in large language models is that you
00:08:13derive a future token, which could be an entire word or a part of a word based on what came before
00:08:20that token. So if you have, for example, a sentence like a contract can be terminated at any dot dot
00:08:28dot, then the next word thereafter is what you want to predict. You may have asked a model, "Hey,
00:08:35when can I terminate my contract?" And you may have fed that contract as a PDF document or as plain
00:08:42text into your prompt as well. So the prompt in front of this sentence, which the model is
00:08:48generating as an output is your question and then maybe some other context. So the contract, for
00:08:57example, right? That is how we currently use models. And in order to produce this token here,
00:09:03and in order to produce each token that came in front of it, the model basically had a look at the
00:09:10entire conversation, all the tokens in there. So that's your question and any additional context
00:09:16you put in there. And it split that into multiple tokens and then combined all these tokens or
00:09:23calculated weights in the end based on all the combinations of the prior tokens. So for example,
00:09:30if that were our entire conversation, obviously deliberately short, it's an example, then this is
00:09:38how it would have been split up into tokens for the GPT-5 models, for example. So some tokens are
00:09:46just a word or a word with a blank in front of it. Some tokens are just special characters.
00:09:51And in order to generate that next token, all previous tokens are in the end combined with
00:09:58each other to understand the meaning in the end. Because of course, a question mark has a very
00:10:05different meaning and implication for a future token, depending on what came in front of that
00:10:11question mark. So that question mark is combined with all previous tokens. And it's the combination
00:10:17of all these combinations in the end, that's then used to derive that final token. That's on a
00:10:22very high level, how you can think of dense attention and how it works. Now, naturally,
00:10:29that is very inefficient, but it's kind of the best we have right now, at least when it comes to the
00:10:36intelligence and the quality of the output. But it is quadratic because it's n times n,
00:10:44which means in order to derive a new token, we have to combine all previous tokens. There are
00:10:49optimization mechanisms like KV caching, which in the end caches the results of calculated weights
00:10:56that have been calculated in the past. So that for a new token, you don't have to recalculate
00:11:01all previous combinations, but you still have to calculate that new token by comparing it to all
00:11:08the previous cached weights. So you still end up in that quadratic situation here. And that of course
00:11:16is inefficient and slow, which is why these frontier models we have right now are very compute hungry,
00:11:24slow, especially when you do get into the higher context window areas and why there are pretty
00:11:31strict context window size limits. Because since it's quadratic, of course, a 12 million context
00:11:38window size is pretty much impossible to compute. It would take forever and compute time is just one
00:11:46dimension, memory that must be reserved is another one. So that's how dense models work in a nutshell
00:11:54and what their limitations are. Now, the opposite or an alternative approach that is used by that
00:12:00new model, the sub-q model that was announced yesterday, is to use sparse attention. Now,
00:12:06how does sparse attention work? The idea with sparse attention is that in order to calculate a new
00:12:14token, you don't look at all the previous tokens, you don't have the combinations of all the previous
00:12:20tokens, but just of a few selected tokens. So for example, if you want to derive the token D here,
00:12:28you may just be looking at B and C, but not at A. Now, of course, the big question then is,
00:12:33how do you decide at which previous tokens to look or which previous tokens are interesting for
00:12:40producing that new token. And there are different approaches that have been used in the past because
00:12:46this new model is not the first sparse attention model. But the reason why they haven't really
00:12:52taken off here is that they have serious limitations. For example, one way is to use a
00:12:59local window approach. Now, what does that mean? That means that in order to produce a new token,
00:13:06let's say the token number five, the fifth token in a sequence, we take a look at, let's say,
00:13:13just the two tokens before it. So three plus four, for example. So you have a sliding window of tokens
00:13:22and you always just take a look at the tokens in front of the token you're about to generate. Now,
00:13:27as you can imagine, this has some serious limitations because if I'm only looking at the last
00:13:33few tokens, if I, for example, wonder when a contract can be terminated, the information
00:13:39may be here in the extra context I passed into the prompt, but it's not part of that local window
00:13:45if the local window is just the last few tokens, for example. So that next token that's about to be
00:13:50predicted has no idea of what was before in that context. So that's not useful. You can have an
00:13:55unlimited context window size with this approach, but all the context doesn't matter. So that's an
00:14:01obvious limitation. Another approach is a so-called global token approach. Here, the idea is that you
00:14:09have a global summary token. So on a high level, you can think of this as a special token that comes
00:14:16at the beginning of the token sequence that's inserted at the beginning of the token sequence
00:14:20by the model, so to say, which summarizes the tokens after it. That's kind of how you can think of it.
00:14:27And then for predicting the next token, that global token is taken into account. Now, that may work
00:14:34very well if we go back to this example here with the legal text that you may have passed to a model
00:14:40in your prompt. If that summary that was generated here for your conversation, if that includes the
00:14:46contract termination terms, for example, then of course this next token can be predicted very well
00:14:53based on that summary. But if you're unlucky and the summary does not include these details,
00:15:00well then you're out of luck and you're back to the state where the information is totally missing.
00:15:04So a global token approach can work, but of course the longer your context window gets,
00:15:12the more generic the summary gets. I mean, that's easy to imagine. If you have like a
00:15:16hundred page PDF document and you were to summarize that in a sentence or two, it would be very
00:15:22unspecific, right? So of course, predicting the next token based on that summary won't really work.
00:15:29Now, another approach would be to use a router, which is that you have like an extra neural
00:15:37network. So you have two models, essentially your large language model, and then you have an extra
00:15:43routing model. And that routing model takes a look at the prompt by the user or at the context of the
00:15:51next token to be generated and then routes that token, so to say, to the other tokens it deems
00:15:59relevant. But now that of course means that you now have a routing model, which somehow needs to
00:16:04keep track of all the other tokens that come after it. So that probably goes back into the quadratic
00:16:10attention area or is very unspecific and you're relying on that. So you're again either going back
00:16:17to the quadratic complexity and you're not gaining that much compared to a dense model or you don't
00:16:23do that and you'll probably have some loss because the router is not very good. So just as with the
00:16:30summary, you would be hoping that the router does a good job and activates the right tokens for
00:16:37predicting the next token. And that is why sparse attention is interesting but hasn't really taken
00:16:46off thus far because all these different approaches have meaningful trade-offs and to this point,
00:16:54to my knowledge, there hasn't been a sparse attention model that would have produced
00:17:00quality comparable to the current frontier dense models and would be able to act over a big
00:17:07context window. And they promise to change this with their new model. In that announcement blog post,
00:17:14they mention that their model does content-dependent selection. For each query, the model selects which
00:17:22parts of the sequence are worth attending to and computes attention exactly over those positions. So
00:17:28in the end, we're back to this routing approach but they kind of promise here, mention here,
00:17:35that their mechanism seems to be very efficient for activating the right tokens for predicting
00:17:43the next token. They mention that dense attention assumes every pair might matter, so it evaluates
00:17:49all of them. In practice, almost none do. SSA, which stands for sub-quadratic selective attention,
00:17:55which is their approach, removes that assumption. It does not approximate attention. It restricts
00:18:01attention to the positions that actually carry signal and skips the rest. That is their approach.
00:18:08They're doing content-dependent routing to activate the right tokens or to use the right tokens for
00:18:14predicting the next token and that is what gives them their efficiency boost. And we have yet to see
00:18:21how well this actually works because, as mentioned, we have a very limited subset of benchmarks here.
00:18:30Not a lot of other or no other benchmarks. We have no model card. We have no details on how exactly
00:18:36their content-dependent selection works and therefore we have a lot of question marks here.
00:18:42And if there's one thing we definitely learned over the last months and years, it's that
00:18:49AI is obviously a useful tool and I use it every day. You probably use it every day and
00:18:57tools like Codex or Claude Code are very useful. I have no doubt about that and, well, that is my
00:19:04experience with them but we also learned that we're in an industry with a lot of hype. We're in a
00:19:10transition period. Everything is changing or a lot is changing right now and therefore of course there
00:19:16are a lot of promises everywhere and not all promises materialize into
00:19:26something actually useful. I mean, take the models by Meta for example which were dense models. The Llama 4
00:19:35models had amazing benchmark numbers but weren't that great. So there are a lot of hyped up examples
00:19:42and that's just one example of course. There are many examples out there. It's definitely worth
00:19:49being cautious but if they publish these models and you can apply for early access right now,
00:19:56I did but I didn't get access yet. If these models do live up to their promises, if they are useful,
00:20:05intelligent across large context window sizes, that of course will change a lot. That will help with
00:20:13the compute constraints we have right now because there is not even close to enough compute out there
00:20:19in the world. We need way more data centers, chips, electricity and everything. So having a model that
00:20:25is way more efficient would help with that. Well, maybe we would use it so much more that the
00:20:33problem stays the same but still it would definitely enable more use right now. And of course it would
00:20:40unlock brand new use cases. It would make it possible to simply shove an entire code base in
00:20:45there and act on that. So all these workarounds we're using right now would go away. We wouldn't
00:20:52need sub-agents necessarily. We wouldn't need RAG systems if that would work. But that's a would
00:21:00of course and we have yet to see if that lives up to the big promises they're making. If it does,
00:21:07they definitely founded a billion, multi-billion or trillion dollar company there.

Key Takeaway

By replacing traditional dense, quadratic attention with Sub-quadratic Selective Attention, the sub-q model aims to enable 12 million token context windows at 5% of the inference cost of current frontier models.

Highlights

  • The sub-q (sub-quadratic) model targets a 12 million token context window, enough to ingest entire large codebases.

  • Initial benchmarks put sub-q on par with Opus 4.6 on RULER (128,000-token context), behind it on MRCRv2, and within the range of the Opus models on a software engineering benchmark.

  • Inference costs for sub-q are reportedly 5% of those for current Opus models.

  • Sub-q utilizes Sub-quadratic Selective Attention (SSA), a content-dependent routing method that selectively attends to relevant tokens instead of evaluating every pair.

  • Current frontier models rely on dense attention, which exhibits quadratic complexity, making massive context windows computationally prohibitive.
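
To put the quadratic claim into numbers, here is a rough back-of-the-envelope sketch. It only counts causal query-key pairs for a single attention head in a single layer, and the per-token budget of 4,096 selected positions is a made-up illustrative value, not a figure from the announcement.

```python
def dense_pairs(n: int) -> int:
    """Query-key pairs scored by causal dense attention over n tokens.

    Token i looks at itself and every earlier token, so the total is
    1 + 2 + ... + n = n * (n + 1) / 2, which grows like n squared.
    """
    return n * (n + 1) // 2


def sparse_pairs(n: int, selected_per_query: int) -> int:
    """Pairs scored when each token attends to at most a fixed number of positions.

    The first few tokens have fewer predecessors than the budget, so they
    contribute a small dense triangle; every later token contributes exactly
    `selected_per_query` pairs. The total grows linearly in n.
    """
    k = selected_per_query
    if n <= k:
        return dense_pairs(n)
    return dense_pairs(k) + (n - k) * k


BUDGET = 4096  # hypothetical selection budget, chosen only for illustration

for tokens in (128_000, 1_000_000, 12_000_000):
    dense = dense_pairs(tokens)
    sparse = sparse_pairs(tokens, BUDGET)
    print(f"{tokens:>10,} tokens: dense {dense:.3e} pairs, "
          f"fixed-budget sparse {sparse:.3e} pairs, ratio {dense / sparse:,.0f}x")
```

Under these assumptions, going from the 128,000-token benchmark context to 12 million tokens multiplies the dense pair count by roughly 8,800, while the fixed-budget count grows by a bit under 100. The sketch ignores the cost of deciding which positions to select, which is exactly the part the announcement does not explain.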

Timeline

Sub-q Announcement and Scope

  • Alexander Wedin announced sub-q, a new large language model architecture designed for long-context tasks.
  • The model promises a 12 million token context window, significantly exceeding current industry standards.
  • Inference costs are projected to be 5% of current frontier models like Opus 4.7.
  • Successful implementation could eliminate the need for current workarounds like RAG and sub-agents for handling large datasets or codebases.

The announcement introduces a major potential shift in LLM architecture. By enabling a 12 million token window, the model aims to directly load massive codebases or multiple large legal documents, bypassing the constraints that currently force developers to use inefficient retrieval-augmented generation (RAG) systems. If functional, this architecture directly addresses the limitation where models only see small portions of data at a time.
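
Whether a given codebase would fit into such a window can be estimated roughly. The sketch below is only an approximation: the four-characters-per-token ratio is a common rule of thumb rather than a property of any specific tokenizer, and the function names and the file extension list are hypothetical choices made for this example.

```python
import os

# Assumption: about 4 characters per token. Real tokenizers vary with
# language, code style, and content, so treat the result as a ballpark.
CHARS_PER_TOKEN = 4


def estimate_tokens(root: str, extensions=(".py", ".ts", ".js", ".md")) -> int:
    """Walk a directory tree and estimate the token count of matching files."""
    total_chars = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if name.endswith(extensions):
                path = os.path.join(dirpath, name)
                # Ignore undecodable bytes instead of failing on odd files.
                with open(path, encoding="utf-8", errors="ignore") as handle:
                    total_chars += len(handle.read())
    return total_chars // CHARS_PER_TOKEN


def fits_in_context(root: str, context_window: int = 12_000_000) -> bool:
    """Return True if the estimated token count is within the context window."""
    return estimate_tokens(root) <= context_window
```

Calling `fits_in_context("path/to/repo")` with the default 12 million token window, or with a smaller value such as `context_window=128_000`, gives a quick sense of which side of the limit a project falls on.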

Benchmark Performance

  • The model was tested against RULER, MRCRv2, and software engineering benchmarks.
  • Performance matches Opus 4.6 on RULER but trails it on MRCRv2, where sub-q still clearly beats Gemini 3.1 Pro and Opus 4.7.
  • Software engineering benchmarks show the model generating code at a level comparable to Opus models.
  • Benchmark data remains limited, and the absence of a detailed model card makes comprehensive evaluation difficult.

The current technical evidence consists of three benchmarks. While the model demonstrates proficiency in retrieving information and generating code, the limited scope of testing and potential for benchmark gaming warrant caution. Despite these limitations, the results suggest the model's ability to act upon retrieved information in long-context scenarios is at least competitive with established frontier models.

Dense vs. Sparse Attention Mechanisms

  • Current frontier models use dense attention, which exhibits quadratic complexity by comparing every token pair.
  • Quadratic complexity makes massive context windows, such as 12 million tokens, prohibitively expensive in both compute time and memory.
  • Sub-q uses Sub-quadratic Selective Attention (SSA) to dynamically route and attend only to tokens carrying actual signal.
  • Previous sparse attention approaches, such as sliding windows or global summary tokens, suffered from significant loss of information and context.

Dense attention requires comparing all prior tokens to generate a new one, so total compute grows quadratically with context length, and memory becomes a second constraint. Sparse attention attempts to mitigate this by attending only to a subset of tokens. Earlier approaches such as sliding windows, global summary tokens, and separate routing models either lost relevant context or drifted back toward quadratic cost. The proposed Sub-quadratic Selective Attention (SSA) claims to avoid both by using content-dependent selection to attend only to the positions that carry signal, which would in theory retain model quality while cutting inference time and cost. How that selection works has not been published.
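
The difference between these attention patterns can be sketched for a single query in plain Python. This is a toy illustration, not sub-q's SSA. The `topk_positions` selector is a hypothetical stand-in for content-dependent selection, and all vectors are invented so that the one relevant token sits early in the context.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def softmax(scores):
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, keys, values, positions):
    """Exact softmax attention, restricted to the given positions."""
    scale = math.sqrt(len(query))
    weights = softmax([dot(query, keys[p]) / scale for p in positions])
    dim = len(values[0])
    return [sum(w * values[p][d] for w, p in zip(weights, positions))
            for d in range(dim)]


def dense_positions(n):
    """Dense attention: every previous token is considered."""
    return list(range(n))


def window_positions(n, window):
    """Sliding window: only the last `window` tokens are considered."""
    return list(range(max(0, n - window), n))


def topk_positions(query, keys, k):
    """Hypothetical selector: keep the k keys that score highest for this query.

    Note that this naive version still scores every key, so the selection
    step alone costs linear work per query. A real sub-quadratic method
    would need a cheaper way to find candidates.
    """
    ranked = sorted(range(len(keys)), key=lambda p: dot(query, keys[p]),
                    reverse=True)
    return sorted(ranked[:k])


# Six context tokens; only position 1 matches the query and carries the answer.
keys = [[0.0, 1.0], [4.0, 0.0], [0.0, 1.0], [0.0, 1.0], [0.0, 1.0], [0.0, 1.0]]
values = [[0.0], [1.0], [0.0], [0.0], [0.0], [0.0]]
query = [1.0, 0.0]

dense_out = attend(query, keys, values, dense_positions(len(keys)))
window_out = attend(query, keys, values, window_positions(len(keys), 2))
topk_out = attend(query, keys, values, topk_positions(query, keys, 2))
print("dense:", dense_out, "window:", window_out, "top-k:", topk_out)
```

In this toy setup the sliding window never sees position 1, so its output contains none of the relevant value, while the selected-position variant recovers it using two scored positions instead of six. The remaining catch is finding those positions cheaply, which is the part the announcement leaves open.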
