For me, the OP brought to mind another kind of "not really math, not really science": string theory. My criticisms of agent foundations research are analogous to Sabine Hossenfelder's criticisms of string theory, in that string theory and agent foundations both screen themselves off from the possibility of experimental testing in their choice of subject matter: the Planck scale and the very early universe for the former, and idealized superintelligent systems for the latter. In both fields, the real-world counterparts of the objects under study (known elementary particles and fundamental forces in one case; humans and existing AI systems in the other) serve mainly as targets to which theoretical models are overfit. Neither makes testable predictions about current or near-future systems. And unlike early computer science, agent foundations doesn't come with an expectation of being able to perform experiments in the future, or even to perform rigorous observational studies.
Building on what you said, pre-LLM agent foundations research appears to have made the following assumptions about what advanced AI systems would be like:

- The world-model and the decision process would be separate components.
- Decision-related concepts like "hypothesis" and "causality" would be built-in primitives rather than learned.
- The decision process would be cleanly separable, so it could be plugged into a different world-model.
- As systems were scaled up, their decision-making would converge toward a simple, human-comprehensible decision theory.
None of these assumptions ended up being true of LLMs. In an LLM, the world-model and decision process are mixed together in a single neural network instead of being separate entities. LLMs don't come with decision-related concepts like "hypothesis" and "causality" pre-loaded; those concepts are learned over the course of training and are represented in the same messy, polysemantic way as any other learned concept. There's no way to separate out the reasoning-related features to get a decision process you could plug into a different world-model. In addition, when LLMs are scaled up, their decision-making becomes more complex and inscrutable due to being distributed across the neural network. The LLM's decision-making process doesn't converge into a simple and human-comprehensible decision theory.
More useful. It would save us the step of having to check for hallucinations when doing research.
Another example of this pattern that's entered mainstream awareness is tilt. When I'm playing chess and get tilted, I might think things like "all my opponents are cheating," "I'm terrible at this game and therefore stupid," or "I know I'm going to win this time, how could I not win against such a low-rated opponent." But if I take a step back, notice that I'm tilted, and ask myself what information I'm getting from the feeling of being tilted, I notice that it's telling me to take a break until I can stop obsessing over the result of the previous game.
Tilt is common, but also easy to fix once you notice the pattern of what it's telling you and start taking breaks when you experience it. The word "tilt" is another instance of a hangriness-type stance that's caught on because of its strong practical benefits--having a word for the state makes it easier to notice when you're in it.
It's working now. I think the problem was on my end.
This strategy suggests that decreasing ML model sycophancy should be a priority for technical researchers. It's probably the biggest current barrier to the usefulness of ML models as personal decision-making assistants. Hallucinations are probably the second-biggest barrier.
The new feed doesn't load at all for me.
There's another way in which pessimism can serve as a coping mechanism: it can be an excuse to avoid addressing personal-scale problems. A belief that one is doomed to fail, or that the world is inexorably getting worse, can justify giving up, on the grounds that comparatively small-scale problems will be swamped by uncontrollable societal forces. Compared to confronting those personal-scale problems, giving up can seem very appealing, and a large-scale but abstract problem makes for a convenient pretext for surrender. You probably know someone who spends substantial amounts of their free time watching videos, reading articles, and listening to podcasts that blame all of the world's problems on "capitalism," "systemic racism," "civilizational decline," or something similar, all while their bills are overdue and dishes pile up in their sink.
This use of pessimism as a coping mechanism is especially pronounced in the case of apocalypticism. If the world is about to end, every other problem becomes much less relevant in comparison, including all those small-scale problems that are actionable but unpleasant to work on. Apocalypticism can become a blanket pretext for giving in to your ugh fields. And while you're giving in to them, you end up thinking you're doing a great job of utilizing the skill of staring into the abyss (you're confronting the possibility of the end of the world, right?) when you're actually doing the exact opposite. Rather than something related to preverbal trauma, this usefulness as a coping mechanism is the more likely source of the psychological appeal of AI apocalypticism for many people who encounter it.
Another experiment idea: testing whether the reduction in hallucinations that Yao et al. achieved with unlearning can be made robust.
I was describing reasoning about idealized superintelligent systems as the method used in agent foundations research, rather than its goal. In the same way that string theory is trying to figure out "what is up with elementary particles at all," and tries to answer that question by doing not-really-math about extreme energy levels, agent foundations is trying to figure out "what is up with agency at all" by doing not-really-math about extreme intelligence levels.
If you've made enough progress in your research that it can make testable predictions about current or near-future systems, I'd like to see them. But the persistent failure of agent foundations research to come up with any such bridge between idealized models and real-world systems has made me doubtful that the former are relevant to the latter.