Thanks for the comment. I agree that, ideally, we would find a way not to have two wholly separate models and instead somehow train a model against itself. I think a potential issue with your proposal is that small perturbations could have discontinuous effects, the anticipation of which distorts predictions. However, it would be interesting to think about further to see if there's some way to avoid that issue.
Thanks Caspar, your comments here and on earlier drafts are appreciated. We'll expand more on the positioning within the related literature as we develop this into a paper.
As for your work on Decision Scoring Rules and the proposal in your comment, the biggest distinction is that this post's proposal does not require specifying the decision maker's utility function in order to reward one of the predictors and shape their behavior into maximizing it. That seems very useful to me, since if we were able to properly specify the desired utility function, we could skip using predictive models and just train an AI to maximize that instead (modulo inner alignment).
For the first point, I agree that SGD pushes towards closing any gaps. My concern is that at the moment, we don't know how small the gaps need to be to get the desired behavior (and this is what we are working on modelling now). On top of that, depending on how the models are initialized, the starting gap may be quite large, so the dynamics of how gaps close throughout the training process seem important to study further.
For the second point, I think we are also in agreement. If the training process leads the AI to learning "If I predict that this act...
In the first part of this sequence, we clarify that we are focusing on the case where the model is a predictive model of the world. The fourth part, on making inner alignment as easy as possible, outlines some reasons why we think this kind of predictive model is a possible (even likely) outcome of the training process. Of course, it is also possible that the model is not precisely a predictive model, but is still close enough to one that the content of "Conditioning Predictive Models" is still relevant.
Yes, you are correct that RL with KL penalties only approximates a Bayesian update in the limit, after enough steps to converge. Determining the speed of this convergence, especially for LLMs, remains an area for future work.
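To make the limiting case concrete, here is a minimal sketch for a toy discrete setting. It assumes the standard closed form for KL-regularized reward maximization, where the converged policy is the prior reweighted by exp(reward / beta), which has the shape of a Bayesian update with exp(reward / beta) playing the role of the likelihood. The function names, the three-outcome prior, and the reward values are all hypothetical, chosen only for illustration, and nothing here says anything about how fast an actual training run approaches this limit.

```python
import math


def kl_regularized_optimum(prior, rewards, beta):
    """Closed-form maximizer of E_q[r] - beta * KL(q || prior) over distributions q.

    The result is prior[i] * exp(rewards[i] / beta), renormalized, i.e. the
    prior updated by a "likelihood" of exp(reward / beta).
    """
    weights = [p * math.exp(r / beta) for p, r in zip(prior, rewards)]
    z = sum(weights)
    return [w / z for w in weights]


def objective(policy, prior, rewards, beta):
    """KL-penalized objective: expected reward minus beta * KL(policy || prior)."""
    expected_reward = sum(q * r for q, r in zip(policy, rewards))
    kl = sum(q * math.log(q / p) for q, p in zip(policy, prior) if q > 0)
    return expected_reward - beta * kl


# Hypothetical toy values, not taken from any experiment.
prior = [0.5, 0.3, 0.2]
rewards = [0.0, 1.0, 2.0]
beta = 1.0

posterior = kl_regularized_optimum(prior, rewards, beta)

# The converged policy should score at least as well as the unmodified prior.
print(objective(posterior, prior, rewards, beta) >= objective(prior, prior, rewards, beta))
```

A policy partway through training would sit somewhere between `prior` and `posterior`, which is the gap the convergence-speed question is about.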
While I personally believe that myopia is more likely than not to arrive by default under the specified training procedure, there is no gradient pushing towards it and, as noted in the post, there is currently no way to guarantee or test for it. Given that uncertainty, a discussion of non-myopic oracles seems worthwhile.
Additionally, a major point of this post is that myopia alone is not sufficient for safety: a myopic agent with an acausal decision theory can behave in dangerous ways to influence the world over time. Even if we were guaranteed myopia by default, it would still be necessary to discuss decision rules.
I don't believe we considered logical counterfactuals as such, but it seems to me that those would be quite comparable to the counterfactual of replacing an oracle with a simpler system.
Not yet! We're now meeting on a monthly schedule, and there has only been one meeting since completing the list here. I'll look into finding a relevant paper on the subject, but if you have any recommendations please let me know.
My impression is that the majority of the benefit from having professors working on AI safety is in mentorship to students who are already interested in AI safety, rather than recruitment. For example, I have heard that David Krueger's lab is mostly people who went to Cambridge specifically to work on AI safety under him. If that's the case, there's less value in working at a school with generally talented students but more value in schools with a supportive environment.
In general it's good to recognize that what matters to AI safety professors is di...
It sounds like you have a number of ideas as to why robustness was not achieved and how to correct those issues. Why is the project ending now, rather than continuing with those updates made?
Yeah, the full reading list will be posted publicly once it's finalized.
Thanks for the recommendation! I was planning on including something from yourself/Vince/out of FOCAL, but wasn't sure which option to go with.
I was thinking RL systems for the case where an agent learns the correct outcome to optimize for but in the wrong environment, but the same issue applies for mesa-optimizers within any neural net.
As for why it tries to restart the training environment, it needs a similar environment to meet a goal that is only defined within that environment. If the part that's unclear is what a training environment means for something like a neural net trained with supervised learning, the analogy would be that the AI can somehow differentiate between training data (or a subset of it) and deployment data and wants to produce its outputs from inputs with the training qualities.
Re-reading your prior comment, I think I misunderstood it initially.
Training a proposal head on a given reporter seems inefficient, since we want the proposals to change as the reporter changes. I am not entirely certain how to efficiently generate proposals, but some search process conditional on the reporter seems feasible.
Human simulators will need larger changes to the predictor state to answer certain questions, as the answer to the question must be visible to a human observer. The predictor is then trained with a penalization term on how large of a c...
I think some generality is necessary, otherwise we'd have to retrain the reporter every time the predictor is updated. That would rule out a lot of desirable uses for a reporter, like using its output in the training process.
I think of the proposed changes as coming from the reporter, or at least dependent on the reporter. Then, if the reporter does not have a good model of what is going on in the predictor beyond what a human could guess, it will be unable to propose a counterfactual predictor state.
The issue with the training process as you describe it is part 3. It would require a direct translator to train on the difference between the desired and given answer. Instead, we want to train the reporter to perform two functions: answering questions and proposing changes. We could also j...
I don't necessarily think we'd get an incoherent output. Since it needs to be able to generalize to new questions, I expect a direct translator to answer questions by using computations to understand a predictor (plus a model of natural language), rather than a function that maps the state of a particular predictor to answers for each question.
One reporter might only be able to understand the predictor up to a human level. If it gets a predictor with a human-level understanding of the world, it can act as a direct translator, but if it gets a more complex predictor it would act as a human simulator.
Or more generally increasing intelligence, for example through smart drugs or brain-computer interfaces.
I think, from an alignment perspective, having a human choose their action while being aware of the distribution over outcomes it induces is much safer than having it effectively chosen for them by their specification of a utility function. This is especially true because probability distributions are large objects. A human choosing between them isn't pushing in any particular direction that can make it likely to overlook negative outcomes, while choosing based on the utility function they specify leads to exactly that. This is all modulo ELK, of course.
I'm not sure I understand the variant you proposed. How is that different from the Othman and Sandholm MAX rule?