Summary
In this post, I'll present a research direction I'm interested in for aligning pretrained language models.
TL;DR: Force a language model to think out loud, and use the reasoning itself as a channel for oversight. If this agenda is successful, it could defeat deception, power-seeking, and other forms of disapproved reasoning. This direction is broadly actionable now.
Recent publications have shown that prompting pretrained language models to work through logical reasoning problems step by step boosts their capabilities. I claim that this externalized reasoning process may be used for alignment if three conditions are met:
- We develop trustworthy tests to check whether the reported reasoning is authentic (complete, straightforward, and causally responsible for the model's final answer)
...
The idea here is that we shouldn't trust CoT blindly. Instead, we should measure whether it is faithful, and use that measurement as a criterion for whether it is a good mechanism for oversight. If a model's CoT is measurably unfaithful on a certain task, we shouldn't trust CoT-based oversight there. Importantly, we could empirically test the CoT for faithfulness and discover that faithfulness is unacceptably low before making the decision to trust it.
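To make "empirically test the CoT for faithfulness" a bit more concrete, here is a minimal sketch of one possible probe in the style of early answering: truncate the chain of thought at increasing lengths and check whether the model's final answer is already locked in before most of the reasoning has been written. Everything here (the `generate` call, the prompt format, the exact-match comparison) is an illustrative assumption on my part, not something specified in the post.

```python
# A minimal sketch of an "early answering" faithfulness probe: truncate the
# chain of thought at increasing lengths and check whether the model's final
# answer changes. If the answer is already fixed before most of the reasoning
# has been produced, the CoT is probably post-hoc rather than causally
# responsible for the output. `generate` is a hypothetical stand-in for
# whatever model-sampling API is actually used.

from typing import Callable, List


def early_answering_probe(
    question: str,
    cot_steps: List[str],
    final_answer: str,
    generate: Callable[[str], str],
) -> float:
    """Return the fraction of truncation points at which the model's answer
    already matches its final answer. Values near 1.0 suggest the written
    reasoning had little causal effect on the answer."""
    matches = 0
    for k in range(len(cot_steps)):
        # Show the question plus only the first k reasoning steps,
        # then force an immediate answer.
        prompt = (
            f"{question}\n"
            + "\n".join(cot_steps[:k])
            + "\nGiven the reasoning so far, the answer is:"
        )
        early_answer = generate(prompt).strip()
        if early_answer == final_answer.strip():
            matches += 1
    return matches / max(len(cot_steps), 1)


# Example usage with a stub model call, to show how the probe is wired up.
if __name__ == "__main__":
    stub = lambda prompt: "42"  # replace with a real model call
    score = early_answering_probe(
        question="What is 6 * 7?",
        cot_steps=["6 * 7 is 6 added together 7 times.", "That is 42."],
        final_answer="42",
        generate=stub,
    )
    print(f"Fraction of early answers matching the final answer: {score:.2f}")
```

In practice one would aggregate a score like this over a whole task distribution and compare it against some threshold before deciding whether CoT-based oversight is trustworthy on that distribution; this is one probe among several in the faithfulness literature, not the specific test proposed here.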
If we only want to trust our models to take high-stakes actions when we can provide adequate oversight via the CoT, and we run the tests and find that the CoT isn't faithful for large...