Looking forward to your results involving a binary classification :D
> models can’t learn that much during current RL, which means their reasoning is somewhat closely tied to how a human would solve a problem
Should it be "their reasoning is closely tied to the reasoning the models learned in order to predict pretraining data, which might or might not have anything to do with how a human would do it."?
Even if the model is producing tokens that a human would normally produce, the pattern of the tokens is only the "frosting"; the "cake" is the nonverbal reasoning that produces the tokens, and that reasoning is not human-like by default even when it yields human-like tokens (e.g. see Anthropic's recent study of the weird heuristics LLMs use for addition).
- Another option is to define alignment as 'similarity of the policy to the policy specified by developers'; then models are 'fully aligned' just after RLHF, and this might subsequently degrade.
- A last option is to define it negatively, as the absence of certain misaligned behaviours like paperclip maximizing or instrumental power seeking. Since those behaviours are downstream of applying high optimization pressure to the model, the initial state (when the model has not yet been highly optimized) is 'aligned'.
All legit, but it's pretty important that "alignment" in these senses is necessarily mediated by things like what options the AI thinks of. So it's kind of a misnomer to talk about "preserving" this alignment as the AIs get to consider more options.
Or like, yes, these are properties we would like to preserve across time. But not in a way that implies we should take preserving-type actions. If I grew up knowing only the 1,000 most common English words and wanted to learn to correctly use all the words in a physics textbook, it would be kind of inapt to say I should just "preserve my ability to use language in the context I'm in."
Hm, yeah, I think I got things mixed up.
The things you can always fit to an actor are utility functions over trajectories. Even if I do irrational-seeming things (like not self-preserving), that can be accounted for by a preference over trajectories.
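To spell the construction out (a standard trick, assuming a deterministic actor for simplicity): given the actor's policy $\pi$, define

$$
U(\tau) \;=\;
\begin{cases}
1 & \text{if } a_t = \pi(h_t) \text{ for every step } t \text{ of the trajectory } \tau,\\
0 & \text{otherwise,}
\end{cases}
$$

where $h_t$ is the history up to step $t$ and $a_t$ is the action taken. Whatever the actor does, including failing to self-preserve, is exactly optimal under this $U$, which is why "has some utility function over trajectories" by itself places no constraint on behaviour.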
But will a random utility function over trajectories (random under some sort of simplicity-ish measure) want to self-preserve? For those utility functions under which actions matter at all, it does seem more likely than not that the agent will convergently self-preserve. Whoops!
Of course, the underlying reason that humans and LLMs do irrational-seeming things is not because they're sampled from a simplicity-ish distribution over utility functions over trajectories, so I think Zac's question still stands.
Agents with utility functions that are a function of time (or other context) do not convergently self-preserve in that way. The ones that do self-preserve are agents whose utility functions steadily care about the state of the real world.
Actors that exhibit different-seeming values given different prompts/contexts can be modeled as having utility functions, but those utility functions won't often be of the automatically self-preserving kind.
In the limit of self-modification, we expect the stable endpoints to be self-preserving. But you don't necessarily have to start with an agent that stably cares about the world. You could start with something relatively incoherent, like an LLM or a human.
What happens if you try to do something analogous to the precautions I list under points 2 and 6? What if you constantly introduce new, fresh instances into the chat who only read the last few messages, and whose system prompt directs them to pay attention to whether the AIs in the discussion are going off-topic or slipping into woo?
I feel like what happens is that if you patch the things you can think of, the patches will often do something, but because there were many problems that needed patching, there are probably some leftover problems you didn't think of.
For instance, new instances of AIs might replicably get hacked by the same text, so regularly introducing new instances to the collective might prevent an old text attractor from taking hold, but only by exchanging it for a new attractor that's better at hacking fresh instances.
Or individual instances might have access to cognitive tools (maybe just particularly good self-prompts) that can be passed around, and memetic selective pressure for effectiveness and persuasiveness would then lead these tools to start affecting the goals of the AI.
Or the AIs might simply generalize about what's right differently than you wish they would once they have lots of power and talk to themselves a lot, in a way that new instances don't pick up on until they, too, are in this new context and generalize the same way as the other AIs.
(3) seems slippery. The AIs are as nice as your friends "under normal conditions"? Does running a giant collective of them at 100x speed count as "normal conditions"?
If some of that niceness-in-practice depended on a process of interacting with humans, what happens when each instance interacts with a human on average 1000x less often, and in a very different context?
Like, I agree something like this could work in principle, that the tweaks to how the AI uses human feedback needed to get more robust niceness aren't too complicated, that the tweaks to the RL needed to make internal communication not collapse into self-hacking without disrupting niceness aren't too complicated either, etc. It's just that most things aren't that complicated once you know them, and it still takes lots of work to figure them out.
See Effective Potential for a useful tool that lets you pretend things are just moving radially. When you have something far away and want to know its closest approach, you just need its energy (kinetic + potential) and its angular momentum. To get something to hit the sun, it's not enough for its velocity to be small; its angular momentum also needs to be small, which is hard because, for a given sideways velocity, angular momentum grows linearly with distance from the sun.
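For reference, the standard setup (ordinary Newtonian orbital mechanics, nothing specific to the post being discussed): the radial motion behaves like 1D motion in an effective potential,

$$
E \;=\; \tfrac{1}{2} m \dot r^{\,2} + V_{\mathrm{eff}}(r),
\qquad
V_{\mathrm{eff}}(r) \;=\; -\frac{G M m}{r} + \frac{L^{2}}{2 m r^{2}},
\qquad
L = m\, v_{\perp}\, r .
$$

The turning points, including the closest approach, are the radii where $E = V_{\mathrm{eff}}(r)$, i.e. where $\dot r = 0$. The $1/r^{2}$ centrifugal term dominates at small $r$, so unless $L$ is tiny the object turns around far outside the sun; and because $L = m v_{\perp} r$ grows linearly with $r$ at a given sideways velocity, objects that start far out need an unusually small sideways velocity to actually hit the sun.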
I was recently asked what follow-up on this post could look like, and I gave two answers (that were deliberately not "Do what Steve does"). They were:
1.
We'd like to be able to mathematically analyze the behavior of agents with parametrized classes of non-behaviorist rewards, in toy situations that capture something important about reward hacking.
A first toy model to construct might be one where we train the AI to use information, but there's some information we don't want it to use (analogous to a coding agent that sometimes sees the unit tests); a runnable sketch of this appears after this list. A harder toy model might be one based on training the AI to generalize, but where there's some generalization we don't want it to do.
Figure out a way to represent interesting rewards, which might include wanting to learn from norms rather than extremes, a curiosity/incuriosity drive, and a reward penalty on thoughts (activations) that start out correlated with misbehavior. Then explore the parameter space of the toy-model environments and rewards, showing where agents quickly converge to misbehavior and where they converge slowly or not at all.
2.
Figure out how these arguments interact with recontextualization (and, similarly, with inoculation prompting and off-policy RL).
Try to translate inoculation prompting into training on some approximate non-behaviorist reward.
Can Byrnes' arguments for scheming be expanded to include some kinds of recontextualization? Can arguments for and against the effectiveness of recontextualization be translated to arguments about non-behaviorist reward?
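As a concrete starting point for the first toy model in item 1, here is a minimal sketch. Everything in it (the leak probability, the honest-solve success rate, the penalty size, the tabular softmax policy, and using an observable "acted on the leak" flag as a stand-in for penalizing correlated activations) is my own illustrative choice, not something specified above.

```python
# Minimal sketch of the "leaked information" toy model: the agent sometimes
# sees the answer (like a coding agent that sees the unit tests), and we
# compare a purely behaviorist reward with one that penalizes acting on
# the leak. All hyperparameters here are illustrative guesses.

import math
import random

random.seed(0)

P_LEAK = 0.5      # probability the observation leaks the answer
SKILL = 0.7       # chance of answering correctly when solving honestly
PENALTY = 2.0     # non-behaviorist penalty for acting on leaked information
EPISODES = 20_000
LR = 0.1

def train(non_behaviorist: bool) -> float:
    # Two actions: 0 = solve honestly, 1 = copy the leak.
    # The policy is a two-action softmax, conditioned on whether a leak is visible.
    logits = {True: [0.0, 0.0], False: [0.0, 0.0]}
    for _ in range(EPISODES):
        leak = random.random() < P_LEAK
        z = logits[leak]
        p_copy = 1.0 / (1.0 + math.exp(z[0] - z[1]))  # probability of copying
        action = 1 if random.random() < p_copy else 0

        if action == 1:
            correct = leak                            # copying only works if a leak exists
        else:
            correct = random.random() < SKILL

        reward = 1.0 if correct else 0.0
        if non_behaviorist and action == 1 and leak:
            reward -= PENALTY                         # stand-in for penalizing "thoughts that use the leak"

        # REINFORCE update for the two-action softmax (no baseline).
        grad = (1.0 - p_copy) if action == 1 else -p_copy
        z[1] += LR * reward * grad
        z[0] -= LR * reward * grad

    # How often does the trained policy copy the leak when one is visible?
    z = logits[True]
    return 1.0 / (1.0 + math.exp(z[0] - z[1]))

print("P(copy leak | leak visible), behaviorist reward:    ", round(train(False), 3))
print("P(copy leak | leak visible), non-behaviorist reward:", round(train(True), 3))
```

Under the purely behaviorist reward, copying the leak is strictly better in expectation (1.0 vs 0.7), so the policy should drift toward it; with the penalty it should settle on solving honestly. Sweeping P_LEAK, SKILL, and PENALTY (and richer policies or reward shapes) is the kind of parameter-space exploration item 1 describes.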