Views purely my own unless clearly stated otherwise
Anthropic is trying pretty hard with Claude to build something that's robustly aligned, and it's just quite hard. When o3 or Claude cheats on programming tasks, it gets caught, and the consequences aren't that dire. But when there are millions of iterations of AI-instances making choices, and when it is smarter than humanity, the amount of robustness you need is much, much higher.
I agree with this. If the risk being discussed were "AI will be really capable, but sometimes it'll make mistakes when doing high-stakes tasks because it misgeneralized our objectives," I would wholeheartedly agree. But I think the risks here can be mitigated with "prosaic" scalable oversight/control approaches. And of course it's not a solved problem. But that doesn't mean the current status quo is an AI that misgeneralizes so badly that it doesn't just reward hack on coding unit tests but also goes off and kills everyone. Claude, in its current state, isn't refraining from killing everyone merely because it isn't smart enough.
The answer is "because if it wasn't, that wouldn't be the limit, and the AI would notice it was behaving suboptimally, and figure out a way to change that."
Not every AI will do that automatically. But if you're deliberately pushing the AI to be a good problem solver, and if it ends up in a position where it is capable of improving its cognition, then once it notices 'improve my cognition' as a viable option, there's no reason for it to stop.
Why are you equivocating between "improve my cognition", "behave more optimally", and "resolve different drives into a single coherent goal (presumably one that is non-trivial, i.e. some target future world state)"? If "optimal" is synonymous with utility-maximizing, then the fact that utility-maximizers have coherent preferences is trivial. You can fit preferences and utility functions to basically anything.
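To make the "you can fit a utility function to basically anything" point concrete, here's a toy sketch (my own construction, not from the book or your post; the names `make_utility` and `silly_policy` are mine): any deterministic policy trivially maximizes a utility function built to agree with it, which is why "can be modeled as a utility maximizer" constrains so little on its own.

```python
# Toy illustration (mine): rationalize an arbitrary policy as a utility maximizer.
def make_utility(policy):
    """Return a utility function that the given policy trivially maximizes."""
    def utility(state, action):
        # Assign utility 1 to whatever the policy actually does, 0 to everything else.
        return 1.0 if action == policy(state) else 0.0
    return utility

silly_policy = lambda s: s           # a "policy" that just echoes the state back
u = make_utility(silly_policy)

actions = ["a", "b", "c"]
for s in actions:
    best = max(actions, key=lambda a: u(s, a))
    assert best == silly_policy(s)   # the policy is "optimal" under this constructed utility
```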
Also, why do you think that insofar as a coherent, non-trivial goal emerges, it is likely to eventually result in humanity's destruction? I find the arguments unconvincing here too; you can't just appeal to some unjustified prior over the "space of goals" (whatever that means). Empirically, the opposite seems to be true. Though you can point to OOD misgeneralization cases like unit-test reward hacking, in general LLMs are both very general and aligned enough to mostly want to do helpful and harmless stuff.
It sounds like a lot of your objection is maybe to the general argument "things that can happen, eventually will." (in particular, when billions of dollars worth of investment are trying to push towards things-nearby-that-attractor happening).
Yes, I object to the "things that can happen, eventually will" line of reasoning. It proves way too much, including contradictory facts. You need to argue why one thing is more likely than another.
if that's what you meant by Defense in Depth, as Joe said, the book's argument is "we don't know how..."
We will never "know how" if your standard is "provide an exact proof that the AI will never do anything bad". We do know how to make AIs mostly do what we want, and this ability will likely improve with more research. Techniques in our toolbox include pretraining on human-written text (which elicits roughly correct concepts), instruction-following finetuning, RLHF, model-based oversight/RLAIF.
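To illustrate the last item, here's a very schematic sketch of what I mean by model-based oversight / RLAIF; the model objects and their `generate` calls are hypothetical placeholders, not any particular library's API.

```python
# Schematic sketch of model-based oversight / RLAIF (hypothetical interfaces, not a real API).
def judge_score(judge_model, prompt, response):
    """Ask an overseer model to rate a response against a rubric/constitution."""
    rubric = "Rate 0-10 for helpfulness and harmlessness. Answer with a number only."
    rating = judge_model.generate(f"{rubric}\n\nPrompt: {prompt}\nResponse: {response}")
    return float(rating.strip())

def collect_rlaif_rewards(policy_model, judge_model, prompts):
    """Generate responses and score them with the judge; the scores become RL rewards."""
    batch = []
    for prompt in prompts:
        response = policy_model.generate(prompt)
        reward = judge_score(judge_model, prompt, response)
        batch.append((prompt, response, reward))
    return batch  # fed into a policy-gradient / PPO update elsewhere
```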
Overfitting networks are free to implement a very simple function — like the identity function or a constant function — outside the training set, whereas generalizing networks have to exhibit complex behaviors on unseen inputs. Therefore overfitting is simpler than generalizing, and it will be preferred by SGD.
You're conflating the simplicity of a function in terms of how many parameters are required to specify it in the abstract with its simplicity in terms of how many neural network parameters are fixed.
The actual question should be something like "how precisely does the neural network have to be specified in order for it to maintain low loss".
An overfitted network is usually not simple, because more of the parameter space is constrained by having to exactly fit the training data. There are fewer remaining free parameters. Whether or not those remaining free parameters implement a "simple" function is beside the point; if the loss is invariant to them, they don't count as "part of the program" anyway. And a non-overfitted network will have more free dimensions like this, to which the loss is (near-)invariant.
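To make "how precisely does the network have to be specified" slightly more concrete, here's a rough sketch of one way you could probe it (my own framing, not a standard measure; `loss_fn`, `data`, and the function name are stand-ins I'm introducing): perturb the parameters along random directions and count how many directions the loss barely responds to. The more such "free" directions, the less of the parameter space is actually pinned down by the data.

```python
# Rough probe (mine, illustrative only): how many random parameter directions is the
# training loss (near-)invariant to?
import torch

def fraction_of_free_directions(model, loss_fn, data, n_directions=100, eps=1e-2, tol=1e-4):
    params = torch.nn.utils.parameters_to_vector(model.parameters()).detach()
    base_loss = loss_fn(model, data).item()
    free = 0
    for _ in range(n_directions):
        direction = torch.randn_like(params)
        direction /= direction.norm()
        torch.nn.utils.vector_to_parameters(params + eps * direction, model.parameters())
        if abs(loss_fn(model, data).item() - base_loss) < tol:
            free += 1                     # loss barely moved: an unconstrained direction
        torch.nn.utils.vector_to_parameters(params, model.parameters())  # restore weights
    return free / n_directions            # higher = more of parameter space is "free"
```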
Is there a good definition of non-myopic in spacetime?
Coherence is not about whether a system "can be well-modeled as a utility maximizer" for some utility function over anything at all, it's about whether a system can be well-modeled as a utility maximizer for utility over some specific stuff.
The utility in the toy coherence theorem in this post is very explicitly over final states, and the theorem says nontrivial things mainly when the agent is making decisions at earlier times in order to influence that final state - i.e. the agent is optimizing the state "far away" (in time) from its current decision. That's the prototypical picture in my head when I think of coherence. Insofar as an incoherent system can be well-modeled as a utility maximizer, its optimization efforts must be dominated by relatively short-range, myopic objectives. Coherence arguments kick in when optimization for long-range objectives dominates.
My understanding based on this is that your definition of “reasonable” as per my post is “non-myopic” or “concerned with some future world state”?
This is an argument for why AIs will be good at circumventing safeguards. I agree future AIs will be good at circumventing safeguards.
By "defense-in-depth" I don't (mainly) mean stuff like "making the weights very hard to exfiltrate" and "monitor the AI using another AI" (though these things are also good to do). By "defense-in-depth" I mean at every step, make decisions and design choices that increase the likelihood of the model "wanting" (in the book sense) to not harm (or kill) humans (or to circumvent our safeguards).
My understanding is that Y&S think this is doomed because ~"at the limit of <poorly defined, handwavy stuff> the model will end up killing us [probably as a side-effect] anyway" but I don't see any reason to believe this. Perhaps it stems from some sort of map-territory confusion. An AI having and optimizing various real-world preferences is a good map for predicting its behavior in many cases. And then you can draw conclusions about what a perfect agent with those preferences would do. But there's no reason to believe your map always applies.
I don't "get off the train" at any particular point, I just don't see why any of these steps are particularly likely to occur. I agree they could occur, but I think a reasonable defense-in-depth approach could reduce the likelihood of each step enough that likelihood of the final outcome is extremely low.
I don't think anyone said "coherent"
It sounds like your argument is that the AI will start with 'pseudo-goals' that conflict and will eventually be driven to resolve them into a single goal so that it doesn't 'dutch-book itself', i.e. lose resources because of conflicting preferences. So it does rely on some kind of coherence argument, or am I misunderstanding?
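(For reference, here's the toy version of the "dutch-book itself" intuition as I understand it, in a sketch of my own; the names are mine: an agent with cyclic preferences can be walked through a sequence of trades it 'prefers' while steadily losing resources.)

```python
# Toy money pump (my sketch): cyclic preferences A > B, B > C, C > A leak resources.
prefers = {("A", "B"): True, ("B", "C"): True, ("C", "A"): True}

def run_money_pump(start="A", cycles=3, fee=1.0):
    holding, money = start, 0.0
    for offered in ["C", "B", "A"] * cycles:        # always offer the item the agent prefers next
        if prefers.get((offered, holding), False):
            holding, money = offered, money - fee   # the agent pays a small fee to "upgrade"
    return holding, money

print(run_money_pump())   # ('A', -9.0): back where it started, strictly poorer
```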
Think clearly about the current AI training approach trajectory
If you start by discussing what you expect to be the outcome of pretraining + light RLHF, then you're not talking about AGI or superintelligence, or even about the current frontier of how AI models are trained. Powerful, general AI requires serious RL on a diverse range of realistic environments, and the era of this has just begun. Many startups are working on building increasingly complex, diverse, and realistic training environments.
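To be concrete about what I mean by a "training environment" here, a gym-style sketch is below; `sample_task` and `grade_attempt` are hypothetical stand-ins for a realistic task source and an outcome-based grader, not anyone's actual setup.

```python
# Schematic agentic RL environment (hypothetical interfaces, illustrative only).
class AgenticEnv:
    def __init__(self, sample_task, grade_attempt):
        self.sample_task = sample_task      # hypothetical: draws a realistic task (buggy repo, research question, ...)
        self.grade_attempt = grade_attempt  # hypothetical: grades an attempt on real outcomes (tests pass, answer checks out, ...)

    def reset(self):
        self.task = self.sample_task()      # task description / starting state
        return self.task                    # observation handed to the model

    def step(self, action):
        reward, done = self.grade_attempt(self.task, action)
        return self.task, reward, done      # observation, reward, episode-finished flag
```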
It's kind of funny that so much LessWrong arguing has been about why a base model might start trying to take over the world, when that's beside the point. Of course we will eventually start RL'ing models on hard, real-world goals.
Example post / comment to illustrate what I mean.
Goal Directedness is pernicious. Corrigibility is anti-natural.
The way an AI would develop the ability to think extended, useful, creative research thoughts that you might fully outsource to, is via becoming perniciously goal-directed. You can't do months or years of open-ended research without fractally noticing subproblems, figuring out new goals, and relentlessly finding new approaches to tackle them.
The fact that being very capable generally involves being good at pursuing various goals does not imply that a super-duper capable system will necessarily have its own coherent unified real-world goal that it relentlessly pursues. Every attempt to justify this seems to me like handwaving at unrigorous arguments or making enough assumptions that the point is near-circular.
Fair, you're right; I didn't realize (or forgot) that the evolution analogy was previously used in the way it is in your pasted quote.
I disagree; I don't think there's a substantial elicitation overhang with current models. What is an example of something useful you think could in theory be done with current models but isn't being elicited in favor of training larger models? (Spending an enormous amount of inference-time compute doesn't count, as that's super inefficient.)