Update: the reading list has now been posted.
It sounds like you have a number of ideas as to why robustness was not achieved and how to correct those issues. Why is the project over now, rather than continuing having made those updates?
Yeah, the full reading list will be posted publicly once it's finalized.
Thanks for the recommendation! I was planning on including something from yourself/Vince/out of FOCAL, but wasn't sure which option to go with.
I was thinking RL systems for the case where an agent learns the correct outcome to optimize for but in the wrong environment, but the same issue applies for mesa-optimizers within any neural net.
As for why it tries to restart the training environment, it needs a similar environment to meet a goal that is only defined within that environment. If the part that's unclear is what a training environment means for something like a neural net trained with supervised learning, the analogy would be that the AI can somehow differentiate between training data (or a subset of it) and deployment data and wants to produce its outputs from inputs with the training qualities.
Re-reading your prior comment, I think I misunderstood it initially.
Training a proposal head on a given reporter seems inefficient, since we want the proposals to change as the reporter changes. I am not entirely certain how to efficiently generate proposals, but some search process conditional on the reporter seems feasible.
Human simulators will need larger changes to the predictor state to answer certain questions, as the answer to the question must be visible to a human observer. The predictor is then trained with a penalization term on how large of a change has to be made to the predictor to have it answer a certain way to specific questions given an initial scenario.
This proposal also works as an "audit" at the end, checking a variety of counterfactuals in order to catch human simulators, but this does not suggest a change to the reporter. Instead, it is a sign to scrap everything and start over.
I think some generality is necessary, otherwise we'd have to retrain the reporter every time the predictor is updated. That would rule out a lot of desirable uses for a reporter, like using its output in the training process.
I think of the proposed changes as coming from the reporter, or at least dependent on the reporter. Then, if the reporter does not have a good model of what is going on in the predictor beyond what a human could guess, it will be unable to propose a counterfactual predictor state.The issue with the training process as you describe it is part 3. It would require a direct translator to train on the difference between the desired and given answer. Instead, we want to train the reporter to do two functions, answer questions and propose changes. We could also just use the question answering functionality to do search over predictor state space without understanding it until we find a state that gives the desired answers to a set of questions.
I don't necessarily think we'd get an incoherent output, since it needs to be able to generalize to new questions, I expect a direct translator to answer questions by using computations to understanding a predictor (plus a model of natural language), rather than a function that maps the state of a particular predictor to answers for each question.
One reporter might only be able to understand the predictor up to a human level. If it gets a predictor with a human level understanding of the world, it can act as a direct translator, but if it gets a more complex predictor it would act as a human translator.
Or more generally increasing intelligence, for example through smart drugs or brain-computer interfaces.