Anecdotally, as someone who works on non-AGI-targeting AI research, I find pop-sci articles on AI research to be horribly misrepresentative.
A paper that introduces a new algorithm that guides drones around a simulator by creating sub-tasks might be presented as "AI researchers create a new kind of digital brain - and it has its own goals". That's obviously a clickbait headline, but the article itself usually does little to clear things up.
However, I would imagine that AI is currently among the worst fields for this kind of thing due to manufactured hype, culture wars, and the age-old anthropomorphization of AI algorithms.
This was the classical intuition, but it turned out to be untrue in the regime of large NNs.
The modern view is double descent (https://en.wikipedia.org/wiki/Double_descent): smaller models generalize better as the parameter count grows toward the number of training examples, but once the parameter count exceeds the number of training examples, larger models generalize better again, even with the same amount of data.
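For intuition, here is a toy sketch of my own (not from the wiki article) that usually reproduces the effect: random-feature regression solved with minimum-norm least squares, where test error tends to peak near the interpolation threshold (number of features ≈ number of training examples) and then fall again as the model keeps growing.

```python
# Toy double-descent sketch: ReLU random-feature regression with
# minimum-norm least squares. Test MSE typically spikes near
# n_features ~ n_train, then decreases again past the threshold.
import numpy as np

rng = np.random.default_rng(0)
n_train, n_test, d = 100, 1000, 20
w_true = rng.normal(size=d)

X_train = rng.normal(size=(n_train, d))
X_test = rng.normal(size=(n_test, d))
y_train = X_train @ w_true + 0.5 * rng.normal(size=n_train)
y_test = X_test @ w_true

def random_features(X, V):
    return np.maximum(X @ V, 0.0)  # ReLU random features

for n_feat in [10, 50, 90, 100, 110, 200, 500, 2000]:
    V = rng.normal(size=(d, n_feat)) / np.sqrt(d)
    Phi_tr, Phi_te = random_features(X_train, V), random_features(X_test, V)
    # lstsq returns the minimum-norm solution when the system is underdetermined
    beta, *_ = np.linalg.lstsq(Phi_tr, y_train, rcond=None)
    test_mse = np.mean((Phi_te @ beta - y_test) ** 2)
    print(f"n_features={n_feat:5d}  test MSE={test_mse:.3f}")
```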
But why would this error accumulation be a problem in recurrent forward passes and not in one long forward pass?
I think the question is whether applying quantization to hidden states in the middle of the forward pass during both training and inference would improve performance, which your argument would seem to imply.
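To make the question concrete, here is a rough sketch (my own toy example, not anyone's actual setup) of what "quantization applied to hidden states in the middle of the forward pass" could look like: a quantize-dequantize step on the recurrent state between steps, with a straight-through estimator so it can be used during training as well as inference.

```python
# Toy example: fake-quantize the recurrent hidden state between steps.
import torch

def fake_quantize(x, bits=8):
    # symmetric per-tensor quantize -> dequantize
    scale = x.abs().max().clamp(min=1e-8) / (2 ** (bits - 1) - 1)
    x_q = torch.round(x / scale) * scale
    # straight-through estimator: forward uses x_q, backward treats it as identity
    return x + (x_q - x).detach()

class QuantizedRNNCell(torch.nn.Module):
    def __init__(self, d_in, d_hidden, bits=8):
        super().__init__()
        self.cell = torch.nn.GRUCell(d_in, d_hidden)
        self.bits = bits

    def forward(self, xs, h):
        # xs: (seq_len, batch, d_in); quantize h after every recurrent step
        for x in xs:
            h = fake_quantize(self.cell(x, h), self.bits)
        return h
```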
In this post you seem to imply that the slow training is due to a lack of parallelization, but don't MP-SAEs also require more total FLOPs?
At each iteration you need to recompute the encoder dot products via a matmul with the encoder matrix (a look at your code confirms this), so I would think the total FLOPs scale almost linearly with the number of iterations.
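For concreteness, this is roughly how I picture the encode loop (my paraphrase, your implementation surely differs in the details):

```python
# Sketch of a matching-pursuit-style SAE encode loop. The point is that the
# encoder matmul inside the loop repeats every iteration, so total encoder
# FLOPs grow roughly linearly with k, even with perfect parallelization.
import torch

def mp_sae_encode(x, W_enc, W_dec, k):
    # x: (n, d_model), W_enc: (d_model, d_dict), W_dec: (d_dict, d_model)
    n, d_model = x.shape
    d_dict = W_enc.shape[1]
    residual = x.clone()
    codes = torch.zeros(n, d_dict, dtype=x.dtype, device=x.device)
    for _ in range(k):
        scores = residual @ W_enc            # ~2*n*d_model*d_dict FLOPs per iteration
        idx = scores.argmax(dim=-1)          # greedily pick one atom per sample
        coef = scores.gather(1, idx[:, None])
        codes.scatter_add_(1, idx[:, None], coef)
        residual = residual - coef * W_dec[idx]   # remove that atom's contribution
    return codes
# A vanilla SAE pays the encoder matmul once; here it is paid k times.
```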
The problem I see with this claim is that in the academic realm, good value maximization is what researchers get excited about, even to a fault. It is a lot easier to get a paper published by saying "our method gets a higher reward than previous methods" than "our method does xyz interesting thing". If researchers could publish a better paperclip maximizer, they almost certainly would.
If you instead looked at curiosity algorithms or reward-free (self-supervised) RL, where "success" is a bit more ambiguous, then I would agree that the inductive biases of deep NNs probably play a bigger role than usually acknowledged. In fact, a paper about the role of NN depth in self-supervised RL recently won best paper at NeurIPS: https://wang-kevin3290.github.io/scaling-crl/
I'm sure you already know that your method of asking Opus 4.5 for time estimates is the weakest part of your methodology. The thing that really throws me for a loop about it is that Claude models are consistently on the Pareto frontier of performance, while Anthropic has never seemed the strongest in math.
I would be interested to see how the data changes when you use different models for time estimates, or even average their estimates.
I don't have a causal reason to expect it, but it would really be fascinating if there were a continued trend of the time-estimator family dominating the Pareto frontier (e.g. ChatGPT models form the Pareto frontier when ChatGPT is the time estimator).
I would also be curious to see where Kimi K2 stands, since it is among the strongest non-reasoning models and reportedly has a different "feel" than other models (possibly due to the Muon optimizer).
The Searchlight Institute recently released a survey of Americans' views and usage of AI:
There is a lot of information, but the clearest takeaway is that the majority of those surveyed support AI regulation.
Another result that surprises (and concerns) me is this side note:
A question that was interesting, but didn’t lead to a larger conclusion, was asking what actually happens when you ask a tool like ChatGPT a question. 45% think it looks up an exact answer in a database, and 21% think it follows a script of prewritten responses.
Yes, that's what I'm implying. We don't have super concrete evidence for it, but a large contingent of researchers (including myself) believes things like that are happening.
To understand why neural networks might do this, you can view them as performing a sort of Solomonoff induction. The neural network is a "program" (similar to a Python program written in code) that has to model its data as well as possible, but the model is only so big, which means the program is limited in length. Thinking about how you might write a program to generate images, it would be much more code-efficient to write abstractions and rules for things like geometry than to enumerate every possibility. The training/optimization algorithm figures that out too.
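To make the analogy slightly more concrete, Solomonoff induction weights each program that reproduces the data by its length:

$$
M(x) \;=\; \sum_{p \,:\, U(p) = x} 2^{-\ell(p)}
$$

where $U$ is a universal Turing machine and $\ell(p)$ is the length of program $p$ in bits, so short programs dominate the prior. The loose (and very informal) analogy is that a size-limited network is similarly pushed toward compact abstractions rather than memorizing every training example.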
You might also be interested in emergent world representations [1] (models simulate the underlying processes or "world" that generates their data in order to predict it) and the platonic representation hypothesis [2] (different models trained on separate modalities like text and images form similar representations, meaning that an image model will represent a picture of a dog in a similar way to how a text model represents the word "dog").
I believe that the biggest bottleneck for continual learning is data.
First, I am defining continual learning (CL) as extreme long-context modelling with particularly good in-context supervised and reinforcement learning. You seem to have a similar implied definition, given that Titans and Hope are technically sequence modelling architectures more than classical continual learning architectures.
Titans might already be capable of crudely performing CL as I defined it, but we wouldn't know, because we haven't trained it on data that looks like CL. The long-context data we currently use looks like PDFs, books, and synthetically concatenated snippets. If you saw a model producing any of that data, you wouldn't consider it to be CL: the data doesn't contain failures, feedback, and an entity learning from them. If we just trained the architecture on data that looks like CL (data which currently doesn't exist, at least publicly), then I think we would have CL.
The obvious solution to this problem is to collect better data. This would be expensive, but the big players could probably afford it.
Another solution that I see is to bake a strong inductive bias into the architecture. If CL is an out-of-distribution behavior relative to the training data, then the best option is an architecture that "wants" to exhibit CL-like behavior. Taken to the extreme, such an architecture would exhibit CL-like behavior without any prior training at all. One example would be an "architecture" that just fine-tunes a sliding-window transformer on the stream of context. Of the current weight-based architectures, I think E2E-TTT is the closest to this vision, since it is essentially meta-learned fine-tuning.
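As a caricature of that extreme, the whole "architecture" could just be an outer loop like the one below. This is my own toy sketch of the inductive bias I have in mind, not how E2E-TTT actually works, and the model/parameters are placeholders.

```python
# Toy sketch: continually fine-tune a sliding-window LM on the incoming
# token stream at test time (test-time training taken to the extreme).
import torch

def stream_finetune(model, token_stream, window=2048, update_every=256, lr=1e-4):
    opt = torch.optim.AdamW(model.parameters(), lr=lr)
    buffer = []
    for t, token in enumerate(token_stream):
        buffer.append(token)
        buffer = buffer[-window:]                      # sliding window of recent context
        if len(buffer) > 1 and (t + 1) % update_every == 0:
            ids = torch.tensor(buffer).unsqueeze(0)    # (1, len(buffer))
            logits = model(ids[:, :-1])                # assumes model returns next-token logits
            loss = torch.nn.functional.cross_entropy(
                logits.reshape(-1, logits.size(-1)), ids[:, 1:].reshape(-1))
            opt.zero_grad()
            loss.backward()
            opt.step()
    return model
```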
The final solution is to use reinforcement learning instead of pretraining to get CL abilities. If getting high rewards necessitates CL, then we would expect RL to eventually bake in continual learning. The problem is that RL is just so costly and inefficient, and we lack open-ended environments with unhackable rewards.