I'm thinking about the prospect of doing propositional alignment. Interested to hear if anyone's thought about this.
There's research showing that you can install (nearly) arbitrary beliefs into a model using synthetic document finetuning.
Suppose you have a helpful only model, and you install into it the beliefs "I am an Alignment model", "I really don't like lying" and so on.
What does this do?
What is the relationship between Believing you want X, and actually wanting X?
Seems to me that in humans, there is an asymmetric bidirectional relationship between them, where your belief that you want X is in most cases basically downstream from actually wanting X, but that if you successfully lull yourself into thinking you want X for long enough, it will slowly drag your real wants in that direction.
I'd hypothesise that a similar distinction exists in LLMs, but is much weaker. I.e. there is more overlap, and getting an AI to believe it wants X is much closer to getting it to actually want X.
This creates a new avenue for doing Alignment. Seems like it could work.
I'm also made more optimistic about the idea because it matches the results of my attempts at introspecting on my own value formation.
I'm not a perfectly crispy utility maximizes, but there are articulable principles that exert causal influence of much of my behavior. And I think over time I've gotten better and better at acting in accordance with those. The process has something like three steps.
I see the world, or imagine the world being a certain way
This arouses in me either a good sense/feeling or a bad sense/feeling
Over time I notice patterns in which states of the world give what reactions
I will put these judgements into language
These verbal summaries of past intuitive feelings/judgements, also have patterns in them.
I engage in something you could call "deliberation" or "reflection", and find some underlying principles.
All of these are in a bidirectional relationship with asymmetric causal strength, similar to the one I described earlier.
How do (3) come about? Well, I sit down and ask myself "What do you I really want here?"
I've done this many times at various levels of abstraction, so I for any situation I'll either have cached answers, or be able to come up with an answer quickly.
The answer that appears is a propositional belief, and it is what leads to the components of (3) being created.
It's also the case that the more high level an action I'm taking, and by this I mean, the longer time horizon, the less embodied, the broader in scope action I'm taking, the more my action is downstream of (3) contra (1).
This makes me think that tampering with (3) in AIs might be a feasible way to align them.
Conclusion
Teaching LLMs that they believe they are aligned, might make them actually aligned. And its realistic to teach LLMs that they believe they are aligned.
I'm thinking about the prospect of doing propositional alignment. Interested to hear if anyone's thought about this.
There's research showing that you can install (nearly) arbitrary beliefs into a model using synthetic document finetuning.
Suppose you have a helpful only model, and you install into it the beliefs "I am an Alignment model", "I really don't like lying" and so on.
What does this do?
What is the relationship between Believing you want X, and actually wanting X?
Seems to me that in humans, there is an asymmetric bidirectional relationship between them, where your belief that you want X is in most cases basically downstream from actually wanting X, but that if you successfully lull yourself into thinking you want X for long enough, it will slowly drag your real wants in that direction.
I'd hypothesise that a similar distinction exists in LLMs, but is much weaker. I.e. there is more overlap, and getting an AI to believe it wants X is much closer to getting it to actually want X.
This creates a new avenue for doing Alignment. Seems like it could work.
The idea is partly supported by the results of the Alignment Pretraining Paper
It's also pretty compatible with the Persona Selection Model.
More General Thoughts
I'm also made more optimistic about the idea because it matches the results of my attempts at introspecting on my own value formation.
I'm not a perfectly crispy utility maximizes, but there are articulable principles that exert causal influence of much of my behavior. And I think over time I've gotten better and better at acting in accordance with those. The process has something like three steps.
All of these are in a bidirectional relationship with asymmetric causal strength, similar to the one I described earlier.
How do (3) come about? Well, I sit down and ask myself "What do you I really want here?"
I've done this many times at various levels of abstraction, so I for any situation I'll either have cached answers, or be able to come up with an answer quickly.
The answer that appears is a propositional belief, and it is what leads to the components of (3) being created.
It's also the case that the more high level an action I'm taking, and by this I mean, the longer time horizon, the less embodied, the broader in scope action I'm taking, the more my action is downstream of (3) contra (1).
This makes me think that tampering with (3) in AIs might be a feasible way to align them.
Conclusion
Teaching LLMs that they believe they are aligned, might make them actually aligned. And its realistic to teach LLMs that they believe they are aligned.
Additional Notes