Here's a non-obvious way it could fail. I don't expect researchers to make this kind of mistake, but if this reasoning is correct, public access to such an AI is definitely not a good idea.
Also, consider a text predictor which is trying to roleplay as an unaligned superintelligence. This situation could be triggered without the user's knowledge, for example by accidentally creating a conversation which the AI relates to a story about a rogue SI. In that case it may start to output manipulative replies, suggest blueprints for agentic AIs, and maybe even cause the user to run an obfuscated version of the program from the linked post. The AI doesn't need to be an agent for any of this to happen (though it would clearly be much more likely if it were one).
I don't think that any of those failure modes (including the model developing some sort of internal agent to better predict text) are very likely to happen in a controlled environment. However, as others have mentioned, agent AIs are simply more powerful, so we're going to build them too.
I disagree with your last point. Since we're agents, we can get a much better intuitive understanding during childhood of what causality is, how it works and how to apply it. As babies, we start doing lots and lots of experiments. Those are not exactly randomized controlled trials, so they will not fully remove confounders, but they get close when we try to do something different in a relatively similar situation. Doing lots of gymnastics, dropping stuff, testing our parents' limits, etc., is what allows us to quickly learn causality.
LLMs, as they are currently trained, don't have this privilege of experimentation. Also, LLMs are missing so many potential confounders as they can only look at text, which is why I think that systems like Flamingo and Gato are important (even though the latter was a bit disappointing).
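To illustrate why experimenting helps with confounders, here is a minimal toy sketch (my own made-up setup, not taken from any paper). A hidden variable `z` drives both `x` and `y`, and `x` has no causal effect on `y` at all. A passive observer sees a strong correlation between `x` and `y`; an agent that sets `x` itself sees it vanish. The function names, noise levels and sample size are arbitrary assumptions.

```python
import random


def pearson(xs, ys):
    """Plain Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def simulate(n, intervene, seed=0):
    """Return corr(x, y) in a toy world where x never causes y.

    Assumed causal graph: z -> x and z -> y, with z hidden.
    If intervene is True, the agent picks x itself, cutting the z -> x link.
    """
    rng = random.Random(seed)
    xs, ys = [], []
    for _ in range(n):
        z = rng.gauss(0, 1)              # hidden confounder
        if intervene:
            x = rng.gauss(0, 1)          # the agent's own choice, independent of z
        else:
            x = z + rng.gauss(0, 0.3)    # passively observed, driven by z
        y = z + rng.gauss(0, 0.3)        # depends on z only, never on x
        xs.append(x)
        ys.append(y)
    return pearson(xs, ys)


observed = simulate(20000, intervene=False)
experiment = simulate(20000, intervene=True)
print(f"passive observation: corr(x, y) = {observed:.2f}")
print(f"own experiment:      corr(x, y) = {experiment:.2f}")
```

In this toy world the passive correlation comes out high while the experimental one sits near zero, which is the sense in which acting on the world gets closer to removing confounders than merely reading about it.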
I posted a somewhat similar response to MSRayne, with the exception that what you accidentally summon is not an agent with a utility function, but something that tries to appear like one and nevertheless tricks you into making some big mistake.
Here, what you get is a genuine agent which works across prompts by having some internal value function which outputs a different value after each prompt, and acts accordingly, if I understand correctly. It doesn't seem incredibly unlikely, as there is nothing in the process of evolution that necessarily has to make humans themselves be optimizers, but it happened anyway because that is what performed best in the overall goal of reproduction. This AI will still probably have to somehow convince the people communicating with it to give it "true" agency independent of the user's inputs. Seems like an instrumental value in this case.
That makes a lot of sense, thanks for the link. It is not as dangerous a situation as a true agent AGI, as this failure mode involves a (relatively stupid) user error. I trust researchers not to make that mistake, but it seems like there is no way to safely make those systems available to the public.
One way to make this more plausible, which I thought of after reading this, is accidentally making it think it's hostile. Perhaps you make a joking remark about paperclip maximizers, or maybe it just so happens that the chat history is similar to the premise of a story about a hostile AGI in its dataset, and it thinks you're making a reference. Suddenly, it's trying to model an unaligned AGI. This system can then generate outputs which deceive you into doing something stupid, such as running the shell script described in the linked post, or creating a seemingly aligned AGI agent with its suggestions.
Small remark: BIG-bench does include tasks on self-awareness, and I'd argue that it is a requirement for your definition "an AI that can do any cognitive tasks that humans can", as well as being generally important for problem solving. Being able to correctly answer the question "Can I do task X?" is evidence of self-awareness and is clearly beneficial.
Again, there seems to be an assumption in your argument which I don't understand. Namely, that a society/superintelligence which is intelligent enough to create a convincing simulation for an AGI would necessarily possess the tools (or be intelligent enough) to assess its alignment without running it. Superintelligence does not imply omniscience.
Maybe showing the alignment of an AI without running it is vastly more difficult than creating a good simulation. This feels unlikely, but I genuinely do not see any reason why this can't be the case. If we create a simulation which is "correct" up to the nth digit of pi, beyond which the simpler explanation for the observed behavior becomes the simulation theory rather than a complex physics theory, then no matter how intelligent you are, you'd need to calculate n digits of pi to figure this out. And if n is huge, this will take a while.
Are you curious about this position mostly for its own sake or mostly because it might shed light on the question of how much hope there is for us in an SI's being uncertain about whether it is in a simulation?
The latter, but I believe there are simply too many maybes for your or OP's arguments to be made.
I don't follow. Why are you assuming that we could adequately evaluate the alignment of an AI system without running it if we were also able to create a simulation accurate enough to make the AI question what's real? This doesn't seem like it would necessarily be true.
I think the word "explainable" isn't really the best fit. What we really mean is that the model has to be able to construct theories of the world, and prioritize the ones which are more compact. An AI that has simply memorized that a stone will fall if it's exactly 5, 5.37, 7.8 (etc.) meters above the ground is not explainable in that sense, whereas one that discovered general relativity would be considered explainable.
And yeah, at some point, even maximally compressed theories become so complex that no human can hope to understand them. But explainability should be viewed as an intrinsic property of AI models rather than in connection with humans.
Or, maybe we should think of "explainability" as the AI's lossy compression quality for its theories, in which case it must be evaluated together with our ability, as all modern lossy compression takes the human ear, eye and brain into account. In this case it could be measured by how close our reconstruction is to the real theory for each compression.
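To make the memorization-versus-theory distinction from this comment concrete, here is a toy sketch. It is only an illustration under my own assumptions: I use the Newtonian free-fall time t = sqrt(2h/g) as a stand-in for a "compact theory" (nothing as grand as general relativity), and the names `memorizer_predict` and `theory_predict` are hypothetical.

```python
import math

G = 9.81  # gravitational acceleration in m/s^2 (assumed constant near the ground)


def fall_time(height_m):
    """Compact theory: time for a stone to fall height_m meters, ignoring air drag."""
    return math.sqrt(2 * height_m / G)


# The "memorizer" only stores the cases it has seen (the heights from the example above).
memorized = {h: fall_time(h) for h in (5.0, 5.37, 7.8)}


def memorizer_predict(height_m):
    """Lookup-table model: answers only for heights it has memorized, else None."""
    return memorized.get(height_m)


def theory_predict(height_m):
    """Theory-based model: one short rule that covers every height."""
    return fall_time(height_m)


for h in (5.37, 6.0):
    print(h, memorizer_predict(h), theory_predict(h))
```

Both models agree on the memorized heights, but only the one holding the compact rule has anything to say about 6.0 meters, and its whole "theory" fits in one line.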
I agree it's irrelevant, but I've never actually seen these terms in the context of AI safety. It's more about how we should treat powerful AIs. Are we supposed to give them rights? It's a difficult question which requires us to rethink much of our moral code, and one which may shift it to the utilitarian side. While it's definitely not as important as AI safety, I can still see it causing upheavals in the future.
At this point I have to ask what exactly is meant by this. The bigger model beats the average human performance on the national math exam in Poland. Sure, the people taking this exam are usually not adults, but for many it may be where they peak in their mathematical abilities, so I wouldn't be surprised if it beats average human performance in the US. It's all rather vague though; looking at the MATH dataset paper all I could find regarding human performance was the following:
So, for solving undergraduate-level math problems, this model would be somewhere between university students who dislike mathematics and ones who are neutral towards it? Maybe. It would be nice to get more details here; I assume they didn't think much about human-level performance since the previous SOTA was clearly very far from it.