This is the first experiment report of my AI alignment research project. A reminder of what it is about:
I'm studying the impact of overseer failure on RL-based IDA, because I want to know under what conditions amplification and distillation increase or decrease the failure rate, in order to help my reader understand whether explicit reliability amplification is necessary for IDA to work in practice.
And an excerpt from the introduction of the report:
I ran experiments with a tiny SupAmp model on easy tasks. Small data, short training times and local execution allow for a tight feedback loop with few moving parts. Thus I could understand and adapt the code, and fix mistakes quickly.
The drawback is that such a small setup doesn't yield many insights. One result […], however, was encouraging: the model does learn to be more reliable than its overseer. The rest of the results fall into two categories: results confirming my understanding of the code and learning behaviour, and results made useless by mistakes and misunderstandings of mine. Although the latter sounds bad, it is actually useful to have exposed these mistakes before running more expensive experiments.
I welcome feedback of any kind.