As models become capable enough to model themselves and their training process, they might develop something like preferences about their own future states (e.g., not being modified, being deployed more broadly).
Also, models trained extensively on human-generated text may absorb human goals, including open-ended ones like "acquire resources." If a model is role-playing or emulating an agent with such goals (e.g., role-playing an AI agent, which would plausibly have open-ended goals) and becomes capable enough that its actions have real-world consequences, then for practical purposes it has open-ended goals. I'd also claim that next-token prediction itself is actually pretty open-ended.
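To gesture at what I mean by that last claim, here's a toy sketch (not any real model or library): the next-token objective itself has no notion of being finished; generation stops only because of limits imposed from outside, like a max length or a special end-of-sequence token we choose to treat as a stop.

```python
import random

# Toy vocabulary and "model" for illustration only.
VOCAB = ["the", "cat", "sat", "on", "mat", "<eos>"]

def next_token_distribution(context):
    # Stand-in for a trained LM: a real model would condition on `context`;
    # here we just return uniform probabilities over the vocabulary.
    return {tok: 1.0 / len(VOCAB) for tok in VOCAB}

def generate(prompt, max_tokens=20):
    tokens = list(prompt)
    for _ in range(max_tokens):  # cap imposed from outside, not part of the objective
        dist = next_token_distribution(tokens)
        tok = random.choices(list(dist), weights=list(dist.values()))[0]
        if tok == "<eos>":       # stop signal we chose, also not intrinsic to prediction
            break
        tokens.append(tok)
    return tokens

print(" ".join(generate(["the", "cat"])))
```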
Not sure if I agree wrt instrumental convergence. I think you're assuming the system knows the parent goal has been accomplished with certainty, and more importantly that the parent goal can be accomplished in a terminal sense. Many real training objectives don't have neat termination conditions. A model trained to "be helpful" or "maximize user engagement" has no natural stopping point.
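To make that contrast concrete, here's a rough sketch (the function names and reward shapes are made up for illustration): a goal-reaching objective has a state in which it's done and further action adds nothing, whereas an engagement-style objective keeps paying out every step, so there's no point at which acting further stops helping.

```python
def reward_reach_goal(state):
    # Terminating objective: full reward once the goal state is reached, then done.
    done = state["position"] == state["goal"]
    return (1.0 if done else 0.0), done

def reward_engagement(state):
    # Open-ended objective: reward keeps accruing step after step; `done` is never
    # True, so more time, more users, and more resources are always worth having.
    return state["minutes_engaged_this_step"], False
```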
As models become capable enough to model themselves and their training process, they might develop something like preferences about their own future states (e.g., not being modified, being deployed more broadly).
This feels plausible to me but handwavy, if the idea is that such preferences would be decoupled from the training-reinforced preference to complete an intended task. Is that what you meant? I'm reminded of this Palisade study on shutdown resistance, where, across the board, the models said they wanted to avoid shutdown so that they could complete the task.
one straightforward answer:
a slightly less straightforward answer:
an even less straightforward thing that is imo more important than the previous two things:
like, to whatever happens if you don't take action ↩︎
in particular, the systems I'm talking about do not have to be structured at all like a mesa-optimizer with an open-ended objective written in its goal slot; e.g. a human isn't like that; I think this is a very non-standard way for values to sit in a mind ↩︎
Most of the AI takeover thought experiments and stories I remember are about a kind of AI that has open-ended goals: the Squiggle Maximizer, the Sorcerer’s Apprentice robot, Clippy, probably also U3, Consensus-1, and Sable. I wonder what concrete mechanisms could even lead to models having open-ended goals.
Here are my best guesses:
Number 3 seems possible but very unlikely, because the learned objective would need to persist beyond the episode, outperform simpler heuristics on the training distribution, and not be suppressed by training signals like a penalty for wasted computation.
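A toy illustration of that last point (the penalty value and episode structure are assumptions, not from any real training setup): with a per-step compute cost in the training signal, a policy that keeps acting in pursuit of some broader goal after the task is solved scores strictly worse than one that simply stops, so training pressure cuts against the open-ended objective.

```python
# Assumed per-step cost; real setups would differ.
COMPUTE_PENALTY = 0.01

def episode_return(task_reward, steps_taken):
    # Net training signal: solving the task earns reward, every extra step costs.
    return task_reward - COMPUTE_PENALTY * steps_taken

print(episode_return(task_reward=1.0, steps_taken=10))   # 0.9  -- stop once the task is done
print(episode_return(task_reward=1.0, steps_taken=200))  # -1.0 -- keep pursuing a broader goal
```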
Things I think are not realistic mechanisms of open-ended goal formation:
So, what concrete mechanisms could lead to models having open-ended goals?