I have a few questions about a video on the paper "Ctrl-Z: Controlling AI Agents via Resampling".
I would love it if someone experienced could educate me on some thoughts I have on the issue.
I apologize if these thoughts are basic and have been answered before; I am just very interested in this area of AI safety. If the quality of my thoughts isn't what this platform is meant for, I would appreciate a recommendation for some reading or resources so I can learn more. I am specifically interested in the architecture of AI safety mechanisms, the best methods we have today, and the open questions and problems.
Thanks for the questions; these are all very reasonable. You might also find this video informative. I agree with Mis-Understandings's answers. Re 3, you might enjoy this post.
For 1: resampling is not asking the AI to try again; it is forcibly rewriting the AI's history so that it took a different one of its plausible actions. Every resampled continuation could plausibly have been the original continuation under a different RNG seed, and there is no non-shared place to put a scratchpad that would pass along the information that a resampling event occurred. Resampling probably does increase the risk that you resample into a less suspicious bad path, but this is outweighed by the fact that the AI has to mostly bide its time in order to avoid getting caught during the well-monitored fraction of early requests (this is a previous important result), so on net it is a gain.
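To make the mechanism concrete, here is a minimal Python sketch (the function names, action labels, and scores are all hypothetical stand-ins, not the paper's implementation): a trusted monitor scores each sampled action, and a suspicious action is discarded and resampled from the same shared history, so the resampled branch has no way to tell that a resample happened.

```python
import random

# Toy sketch of a resample step (hypothetical names; not the paper's code).

def untrusted_policy(history, seed):
    # Stand-in for sampling an action from the untrusted model: the action
    # depends only on the shared history and the RNG seed, so every
    # resample is a plausible original continuation.
    rng = random.Random(seed * 1000 + len(history))
    return rng.choice(["benign_a", "benign_b", "suspicious"])

def monitor(action):
    # Stand-in for a trusted monitor's suspicion score in [0, 1].
    return 1.0 if action == "suspicious" else 0.0

def resample_step(history, threshold=0.5, max_samples=4):
    """Sample an action; if it looks suspicious, rewind and try again.

    Each attempt sees only `history` -- the shared prefix -- never the
    discarded branch, so it cannot tell that a resample occurred.
    """
    for attempt in range(max_samples):
        action = untrusted_policy(history, seed=attempt)
        if monitor(action) < threshold:
            return action
    # Every sample looked suspicious: escalate (e.g. to a human audit).
    return "audit"
```

The point of the sketch is the information flow: `resample_step` only ever passes the shared prefix back into the policy, never the discarded branch.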
For 2: the model used in this paper assumes the untrusted model behaves optimally given its actual ability to do sneaky bad stuff. Computing that optimum requires knowing the monitoring protocol, so the untrusted AI is assumed to know about the monitor. The model does not cover the case where the untrusted AI is unaware of the monitoring setup, because if the model genuinely believes there is no monitoring, it just attempts an attack immediately and then loses. And if it believes there is monitoring, it is very hard to trick it about the monitoring strategy.
For 3: no, you want to use the same model, since you can prevent the resampled branches from knowing about each other; they share only the common history (the history before the resampling point) and the weights. This is basically everything you would want from another "instance". The modeling assumption is that there is exactly one untrusted AI, which wins if any path breaks out, so there is no meaningfully different untrusted AI to ask to try from that starting point.
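The isolation between branches can be illustrated with a toy rollout function (again hypothetical, just assuming the only shared state is the prefix plus the fixed weights baked into the function):

```python
import random

def continue_rollout(prefix, seed, steps=3):
    """Toy stand-in for rolling the untrusted model out from a shared prefix.

    The function receives only `prefix` and its own seed -- there is no
    channel through which one branch could observe another branch.
    """
    rng = random.Random(seed + 31 * len(prefix))
    return prefix + [f"step_{rng.randrange(100)}" for _ in range(steps)]

shared_prefix = ["obs_1", "act_1", "obs_2"]
branch_a = continue_rollout(shared_prefix, seed=0)
branch_b = continue_rollout(shared_prefix, seed=1)

# Both branches agree on the shared history but may diverge afterwards,
# and from each branch's point of view it *is* the original rollout.
assert branch_a[:3] == branch_b[:3] == shared_prefix
```

In this sense resampling the same model already gives you the independence you would hope to get from a "different instance".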