I agree it’s worth studying. Sometimes some of these categories are dependent on exactly how we define bad behavior - for example do we define it as “answered at most 5 questions correctly” or do we define it as “answered at most 5 questions correctly because it wanted to avoid non deployment”. However this is worth digging into more and we do plan to do that. Just to be clear we do believe more study is required- and are currently working on this - before actually deploying confessions.
I agree that confessions right now are better at eliciting what the model did than why it did it. Generally the reason for why a model did something is hard to determine without doing a casual intervention (eg running the same prompt without telling the model that good performance will lead to non deployment).
Indeed, while in this case the model might know the reason, it seems that in some other cases models can have “implicit bias” and unclear if they know all the factors that impacted the output.
I don’t think it’s a barrier to training for confessions on all RL environments but it could be a barrier for extracting certain types of information from confessions.
The main reason we didn’t give the judge the COTs are:
If AIs are 100% good at predicting the statements they can get away with respect to the judge, then may well learn the strategy of making such statements. But if they are not 100% at it, the safer approach is to just come clean in the confession when you did something wrong. Since you can give the evidence that you did something wrong, it’s highly unlikely the judge won’t believe you if you confess. And since st least sometimes you might not be able to do it (especially if the judge might get more test time compute than the policy), the path of least resistance could well be that the model learns to just be honest.
It's hard to say since I am in the habit of reading COTs all the time, so I don't have the counterfactual of what I could have learned without them..
Students in CS 2881r: AI safety did a number of fantastic projects, see https://boazbk.github.io/mltheoryseminar/student_projects.html
Also I finally updated the youtube channel with the final lecture as well as the oral presentations for projects, see
In what way this is a call to quietism?
I do not mean that you should be complacent! And as I said, this does not mean you should let governments, and companies, including my own, off the hook!
Thank you. I am not an economist, but I think that it is unlikely for the entire economy to operate on the model of an AI lab whereby every year you keep just pumping all gains back into AI.
Both investors and the general public have a limited patience, and they will want to see some benefits. While our democracy is not perfect, public opinion has much more impact today than the opinions of factory workers in England in the 1700's, and so I do hope that we won't see the pattern where things became worse before they were better. But I agree that it is not sure thing by any means.
However, if AI does indeed still grow in capability and economic growth is significantly above the 2% per capita it has been stuck on for the last ~120 years it would be a very big deal and would open up new options for increasing the social safety net. Many of the dillemas - e.g. how do we reduce the deficit without slashing benefits, etc. - will just disappear with that level of growth. So at least economically, it would be possible for the U.S. to have Scandinavian levels of social services. (Whether the U.S. political system will deliver that is another matter, but at least from the last few years it seems that even the Republican party is not shy about big spending.)
This actually goes to the bottom line, is that I think how AI ends up playing out will end up depending not so much on the economic but on the political factors, which is part of what I wrote about in "Machines of Faithful Obedience". If AI enables authoritarian government then we could have a scenario with very few winners and a vast majority of losers. But if we keep (and hopefully strenghten) our democracy then I am much more optimistic about how the benefits from AI will be spread.
I don't think there is something fundamental about AI that makes it obvious in which way it will shift the balance of power between governments and individuals. Sometimes the same technology could have either impact. For example the printing press had the impact of reducing state power in Europe and increasing it in China. So I think it's still up in the air how it will play out. Actually this is one of the reasons I am happy that so far AI's development has been in the private sector, and aimed at making money and marketing to consumers, than in developed in government, and focused on military applications, as it well could have been in another timeline.
Yes, in my addendum I said that a more accurate title would have been "I believe that you will most likely will be OK, and in any case should spend most of your time acting under this assumption."
For confessions, we are exploring ways to give the judge more power, e.g. by more inference time compute than policy. We are also shaping the reward to make sure that there is a heavy penalty for violating the invariant that any form of tampering or judgement calls is in the confession. One of the nice aspects about confessions is that you can force the policy model to surface an issue, even if it is not sure of what is the correct resolution. For example, imagine a very tough edge case in a safety policy, where it is unclear if the right approach is to refuse or comply.
Figuring out the correct final answer is tricky and subtle, but there is no question whether you should surface this in the confession, so you can get models in the habit of always doing it.
I don't think it's so much about AGI/ASI, but in general you are right that ultimately models may become better and better at telling whether they are not in training. If they have long term goals and they know their confessions / COT are monitored, you can imagine them trying to shape them. Indeed, they may be able to shape even their activations (I mentioned a nice example of a mouse doing this in my AI safety lecture, see below )
But I think that it is possible to train the models so that being truthful in confessions is just a reflect. Just like you can train a model that, for example, it will always output an exclamation mark after the word "apple", and even if it was ASI, it could not stop itself from doing that.
What I think could happen is that in such cases, the model will be aware that it has this reflex to be honest, and would essentially attempt to "jailbreak itself" - adding material into its context that is intended to get its confession to be inaccurate. (The same could happen with COTs.)
This is not something we know how to fight yet, but I think that if we got to the point where models have to jump through these kinds of hoops to cheat, that would be significant progress.