I agree that confessions right now are better at eliciting what the model did than why it did it. Generally the reason for why a model did something is hard to determine without doing a casual intervention (eg running the same prompt without telling the model that good performance will lead to non deployment).
Indeed, while in this case the model might know the reason, it seems that in some other cases models can have “implicit bias” and unclear if they know all the factors that impacted the output.
I don’t think it’s a barrier to training for confessions on all RL environments but it could be a barrier for extracting certain types of information from confessions.
The main reason we didn’t give the judge the COTs are:
If AIs are 100% good at predicting the statements they can get away with respect to the judge, then may well learn the strategy of making such statements. But if they are not 100% at it, the safer approach is to just come clean in the confession when you did something wrong. Since you can give the evidence that you did something wrong, it’s highly unlikely the judge won’t believe you if you confess. And since st least sometimes you might not be able to do it (especially if the judge might get more test time compute than the policy), the path of least resistance could well be that the model learns to just be honest.
It's hard to say since I am in the habit of reading COTs all the time, so I don't have the counterfactual of what I could have learned without them..
Students in CS 2881r: AI safety did a number of fantastic projects, see https://boazbk.github.io/mltheoryseminar/student_projects.html
Also I finally updated the youtube channel with the final lecture as well as the oral presentations for projects, see
In what way this is a call to quietism?
I do not mean that you should be complacent! And as I said, this does not mean you should let governments, and companies, including my own, off the hook!
Thank you. I am not an economist, but I think that it is unlikely for the entire economy to operate on the model of an AI lab whereby every year you keep just pumping all gains back into AI.
Both investors and the general public have a limited patience, and they will want to see some benefits. While our democracy is not perfect, public opinion has much more impact today than the opinions of factory workers in England in the 1700's, and so I do hope that we won't see the pattern where things became worse before they were better. But I agree that it is not sure thing by any means.
However, if AI does indeed still grow in capability and economic growth is significantly above the 2% per capita it has been stuck on for the last ~120 years it would be a very big deal and would open up new options for increasing the social safety net. Many of the dillemas - e.g. how do we reduce the deficit without slashing benefits, etc. - will just disappear with that level of growth. So at least economically, it would be possible for the U.S. to have Scandinavian levels of social services. (Whether the U.S. political system will deliver that is another matter, but at least from the last few years it seems that even the Republican party is not shy about big spending.)
This actually goes to the bottom line, is that I think how AI ends up playing out will end up depending not so much on the economic but on the political factors, which is part of what I wrote about in "Machines of Faithful Obedience". If AI enables authoritarian government then we could have a scenario with very few winners and a vast majority of losers. But if we keep (and hopefully strenghten) our democracy then I am much more optimistic about how the benefits from AI will be spread.
I don't think there is something fundamental about AI that makes it obvious in which way it will shift the balance of power between governments and individuals. Sometimes the same technology could have either impact. For example the printing press had the impact of reducing state power in Europe and increasing it in China. So I think it's still up in the air how it will play out. Actually this is one of the reasons I am happy that so far AI's development has been in the private sector, and aimed at making money and marketing to consumers, than in developed in government, and focused on military applications, as it well could have been in another timeline.
Yes, in my addendum I said that a more accurate title would have been "I believe that you will most likely will be OK, and in any case should spend most of your time acting under this assumption."
Comparing nuclear risks to AI is a bit unfair - the reason we can give such details calculations of kinetic force etc.. is because nuclear warheads are real, actually deployed, and can be launched at a moment's notice. With ASI you cannot do calculations of exactly how many people it would kill precisely because it does not exist.
I am not advocating that policy makers should have taken an "either it doesn't happen or we all die" mentality for nuclear policy. (While this is not my field, I did do some work in the nuclear disarmament space.)
But I would say that this was (and is) the mindset for the typical person living in an American urban center. (If you live in such an area, you can go to nukemap and see what would be the impact of one or more ~500kt warheads- of the type carried by Russian R-36 missiles - in your vicinity.)
People have been living their lives under the threat that it is possible that they and everyone they know could be extinguished in a moment's notice. I think the ordinary U.S. and Russian citizen probably should have done more and care more to promote nuclear disarmament. But I don't think they (we) should live in constant state of fear either.
I agree it’s worth studying. Sometimes some of these categories are dependent on exactly how we define bad behavior - for example do we define it as “answered at most 5 questions correctly” or do we define it as “answered at most 5 questions correctly because it wanted to avoid non deployment”. However this is worth digging into more and we do plan to do that. Just to be clear we do believe more study is required- and are currently working on this - before actually deploying confessions.