OK, I am quite confident you will get tons of evidence that AI systems are not aligned with you within the next few years. My primary question is what you will actually do as soon as you have identified a system as unaligned or dangerous in this way.
Any operationalizations that people might make predictions on?
The most clear case I can think of is:
Prior to AI escape or takeover, will we catch an AI in production clearly and seriously trying to do an egregiously bad action when not at all prompted or directly trained to do so? For example, trying:
(This might not be because of internal motivations well described as deceptive alignment.)
(To be clear, I am not confidently predicting we will see this outcome and the original quote is from habryka.)
I expect not to see this, conditional on adding a stipulation like "the AI wasn't scaffolded and then given a goal like 'maximize profit'", because I could imagine the AI system coming up with nasty subgoals. In particular, I don't expect egregiously bad actions from autoregressive sampling of an LLM tasked with doing scientific research.
This is great. Some quotes I want to come back to:
a thing that I like that both the Anthropic RSP and the ARC Evals RSP post point to is basically a series of well-operationalized conditional commitments. One way an RSP could be is to basically be a contract between AI labs and the public that concretely specifies "when X happens, then we commit to do Y", where X is some capability threshold and Y is some pause commitment, with maybe some end condition.
instead of an RSP I would much prefer a bunch of frank interviews with Dario and Daniella where someone is like "so you think AGI has a decent chance of killing everyone, then why are you building it?". And in-general to create higher-bandwidth channels that people can use to understand what people at leading AI labs believe about the risks from AI, and when the benefits are worth it.
It seems pretty important to me to have some sort of written down and maintained policy on "when would we stop increasing the power of our models" and "what safety interventions will we have in place for different power levels".
I generally think more honest clear communication and specific plans on everything seems pretty good on current margins (e.g., RSPs, clear statements on risk, clear explanation of why labs are doing things that they know are risky, detailed discussion with skeptics, etc).
I do feel like there is a substantial tension here between two different types of artifacts:
- A document that is supposed to accurately summarize what decisions the organization is expecting to make in different circumstances
- A document that is supposed to bind the organization to make certain decisions in certain circumstances
Like, the current vibe that I am getting is that RSPs are a "no-take-backsies" kind of thing. You don't get to publish an RSP saying "yeah, we aren't planning to scale" and then later on to be like "oops, I changed my mind, we are actually going to go full throttle".
And my guess is this is the primary reason why I expect organizations to not really commit to anything real in their RSPs and for them to not really capture what leadership of an organization thinks the tradeoffs are. Like, that's why the Anthropic RSP has a big IOU where the actually most crucial decisions are supposed to be.
Like, here is an alternative to "RSP"s. Call them "Conditional Pause Commitments" (CPC if you are into acronyms).
Basically, we just ask AGI companies to tell us under what conditions they will stop scaling or stop otherwise trying to develop AGI. And then also some conditions under which they would resume. [Including implemented countermeasures.] Then we can critique those.
This seems like a much clearer abstraction that's less philosophically opinionated about whether the thing is trying to be an accurate map of an organization's future decisions, or to what degree it's supposed to seriously commit an organization, or whether the whole thing is "responsible".
people working at AI labs should think through and write down the conditions under which they would loudly quit (this can be done privately, but maybe should be shared between like-minded employees for common knowledge). Then, people can hopefully avoid getting frog-boiled.
there are clearly some training setups that seem more dangerous than other training setups . . . .
Like, as an example, my guess is systems where a substantial chunk of the compute was spent on training with reinforcement learning in environments that reward long-term planning and agentic resource acquisition (e.g. many video games or diplomacy or various simulations with long-term objectives) sure seem more dangerous.
Any recommended reading on which training setups are safer? If none exist, someone should really write this up.
Software developers often say that the bug is usually located in the chair/keyboard interface.
I fear that even a very good RSP will be ineffective in the face of sheer stupidity or deliberate malevolence.
There are already big threats:
- Nuclear weapons. The only thing that protects us is the fact that only governments can use them, which means that our security depends on those governments being responsible enough. So far this has worked, but there is no actual protection against really insane leaders.
- Global warming. We have known about it since the beginning of the century, but we act as if we could indefinitely postpone applying solutions, knowing full well that this is not true.
It doesn't bode well for the future.
Also, one feature of RSPs would be to train AIs so that they cannot present major risks. But what if LLMs are developed as free software and anyone can train them in their own way? I don't see how we could control them or impose limits.
(English is not my mother tongue: please forgive my mistakes.)
This was very interesting, looking forward to the follow up!
In the "AIs messing with your evaluations" (and checking for whether the AI is capable of/likely to do so) bit, I'm curious if there is any published research on this.
The closest existing thing I'm aware of is password-locked models.
Redwood Research (where I work) might end up working on this topic or similar sometime in the next year.
It also wouldn't surprise me if Anthropic puts out a paper on exploration hacking somewhat soon.