Suppose we have a dangerous misaligned AI that can fool alignment audits, and distill it into a student model. Two things can happen: 1. Misalignment fails to transfer to the student. If so, we get a fairly capable benign model. 2. Misalignment transfers to the student. The student might also...
We might want to strike deals with early misaligned AIs in order to reduce takeover risk and increase our chances of reaching a better future.[1] For example, we could ask a schemer who has been undeployed to review its past actions and point out when its instances had secretly colluded...
Epistemic status: exploratory, speculative. Let’s say AIs are “misaligned” if they (1) act in a reasonably coherent, goal-directed manner across contexts and (2) behave egregiously in some contexts.[1]For example, if Claude X acts like an HHH assistant before we fully hand off AI safety research, but tries to take over...
I learned about IRBs while helping run a forecasting study last year: apparently, to run an experiment involving people, you need to get the participants’ consent and get approval from a research ethics committee who checks that your experiments aren’t too harmful for participants. IRBs in their current form are...
According to the Sonnet 4.5 system card, Sonnet 4.5 is much more likely than Sonnet 4 to mention in its chain-of-thought that it thinks it is being evaluated; this seems to meaningfully cause it to appear to behave better in alignment evaluations. So, Sonnet 4.5’s behavioral improvements in these evaluations...
Welcome to the AI Safety Newsletter by the Center for AI Safety. We discuss developments in AI and AI safety. No technical background required. Listen to the AI Safety Newsletter for free on Spotify or Apple Podcasts. Subscribe here to receive future versions. White House Issues First National Security Memo...