Putting up Bumpers
draganover · 5mo

I agree with the sentiment here and believe something like this will necessarily happen (iterated adjustments, etc.). However, I disagree with the post's conclusion that "this process is fairly likely to converge." That conclusion relies on the assumption that alignment is a stationary target we are converging toward, and I am not convinced that this is true.

As model capabilities improve (exponentially quickly!), the alignment objectives of 2025 will not necessarily apply by 2028. As an example of these moving goalposts, consider that AI models will be trained on the sum of all AI alignment research and will therefore be aware of the audit strategies being implemented. Given this oracle knowledge of all AI safety research, misaligned AIs may be able to overcome alignment "bumpers" that were previously considered foolproof. Put simply, alignment techniques must change as the models change.

Extending the metaphor: the post suggests iterating via "bumpers" that held for old models, while new model paradigms are playing on entirely new bowling lanes.
