This is the first experiment report of my AI alignment research project. A reminder of what it is about:
I'm studying the impact of overseer failure on RL-based IDA,
because I want to know under what conditions amplification and distillation increase or decrease the failure rate,
in order to help my reader understand whether explicit reliability amplification is necessary for IDA to work in practice.
And an excerpt from the introduction of the report:
I ran experiments with a tiny SupAmp model on easy tasks. Small data, short
training times and local execution allow for a tight feedback loop with few
moving parts. Thus I could understand and adapt the code, and fix mistakes
quickly.
The drawback is that it doesn't yield many insights. One result […], however, was encouraging: X_PA does learn to be more reliable than its overseer Amplify^H′(X_PA).
The rest of the results fall into two categories: results that confirm my understanding of the code and of the learning behaviour, and results made useless by my mistakes and misunderstandings. Although the latter sounds bad, it is actually useful to have exposed these mistakes before running more expensive experiments.
Link: https://github.com/rmoehn/farlamp/blob/master/build/tiny-supfail.pdf
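To make the comparison in the excerpt concrete: "more reliable" here means a lower failure rate on the same evaluation tasks. The sketch below is not code from the report; the `failure_rate` and `reliability_gain` helpers and the `x_pa`/`amplify_h_prime` callables are hypothetical stand-ins for the distilled agent X_PA and its faulty amplified overseer Amplify^H′(X_PA).

```python
from typing import Callable, Sequence, Tuple

# A task is a (question, ground-truth answer) pair.
Task = Tuple[str, str]
Solver = Callable[[str], str]


def failure_rate(solver: Solver, tasks: Sequence[Task]) -> float:
    """Fraction of tasks on which the solver's answer differs from the ground truth."""
    failures = sum(1 for question, truth in tasks if solver(question) != truth)
    return failures / len(tasks)


def reliability_gain(
    x_pa: Solver,              # hypothetical: the distilled agent X_PA
    amplify_h_prime: Solver,   # hypothetical: the overseer Amplify^H'(X_PA)
    eval_tasks: Sequence[Task],
) -> float:
    """Positive iff X_PA fails less often than its overseer on the evaluation tasks."""
    return failure_rate(amplify_h_prime, eval_tasks) - failure_rate(x_pa, eval_tasks)
```

Under these assumptions, a positive `reliability_gain` corresponds to the encouraging result quoted above: the distilled agent failing less often than the overseer it was trained against.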
I welcome feedback of any kind.