Researcher at MIRI
Were current models (e.g., Opus 4.5) trained using this updated constitution?
Here's a slide from a talk I gave a couple of weeks ago. The point of the talk was "you should be concerned with the whole situation and the current plan is bad", where AI takeover risk is just one part of this (IMO the biggest part). So this slide was my quickest way to describe the misalignment story, but I think there are a bunch of important subtleties that it doesn't include.
Recognizable values are not the same as good values, but also I'm not at all convinced that the phenomena in this post will be impactful enough to outweigh all the somewhat random and contingent pressures that will shape a superintelligence's values. I think a superintelligence's values might be "recognizable" if we squint, don't look/think too hard, and the superintelligence hasn't had time to really reshape the universe.
Maybe I'm dense, but was the BART map the intended diagram?
The inability to copy/download is pretty weird. Anthropic seems to have deliberately disabled downloading, and rather than uploading a PDF, the webpage appears to be a bunch of PNG files.
I am very concerned about breakthroughs in continual/online/autonomous learning, because this is obviously a necessary capability for an AI to be superhuman. At the same time, I think this might make a bunch of alignment problems more obvious, since these problems only really arise once the AI is able to learn new things. This might at least serve as a wake-up call for some AI researchers.
Or, this could just be wishful thinking, and continual learning might allow an AI to autonomously improve without human intervention and then kill everyone.
I like the sentiment and much of the advice in this post, but unfortunately I don’t think we can honestly and confidently say “You will be OK”.
It may be useful to note that all the Google authors on the "Chain of Thought Monitorability" paper are from Google DeepMind, while Hope and Titans are from Google Research.
It might be good to have you talk about more research directions in AI safety that you think are not worth pursuing or are over-invested in.
Also, I think it would be good to talk about what the plan for automating AI alignment work would look like in practice (we've talked about this a little in person, but it would be good for it to be public).