I am working on empirical AI safety.
Book a call with me if you want advice on a concrete empirical safety project.
Cool work!
I am curious how much these results are about modifying internal representations vs "prompting, but more". If it's "prompting, but more" this makes me worried this sort of technique would have very bad false negatives for instances of real misalignment where the model has more "self-control" about its outputs.
I think your "Llama Nemotron 49B is gullible and easy to trick." experiments point at the latter having at least some effect.
An experiment which would update me more because it feels most analogous to activation steering is running contrastive sampling / classifier free guidance which, like activation steering, supports contrastive prompts, using large weights, etc.
I agree overt and covert encoded reasoning are different and you should flag overt encoded reasoning, and covert seems harder than overt. This is what I tried to point at by saying
I relax the “plausible benign” constraint and try to find math reasoning that doesn’t look like math reasoning to a monitor."
An intuition here is that different RL environments could ~randomly change the "scratchpad personality" through indirect effects, with some changes pushing towards more or less blabbing. And so if you search over a large space of safety-training environments and pick all the environments that don't result in verbalized situational awareness you might be selecting all the environments that push towards a less blabby scratchpad personality.
(In this specific case of Sonnet 4.5, the indirect optimization effects of removing the RL envs the model card describes are probably very weak, so I think it should not be the primary consideration in the case of Sonnet 4.5 absent some strong reason to expect these particular RL envs to encourage a very blabby personality.)
Do you think it's similar to how different expectations and priors about what AI trajectories and capability profiles will look like often cause people to make different predictions on e.g. P(doom), P(scheming), etc. (like the Paul vs Eliezer debates)? Or do you think in this case there is enough empirical evidence that people ought to converge more? (I'd guess the former, but low confidence.)
it re-reads the whole conversation from scratch and then generates an entirely new response
You can think of what is being done here as the experiment done here interfering with the "re-reading" of the response.
A simple example: when the LLM sees the word "brot" in German, it probably "translates" it internally into "bread" at the position where the "brot" token is, and so if you tamper with activations at the "brot" position (on all forward passes - though in practice you only need to do it the first time "brot" enters the context if you do KV caching) it will have effects many tokens later. In the Transformer architecture the process that computes the next token happens in part "at" previous tokens so it makes sense to tamper with activations "at" previous tokens.
Maybe this diagram from this post is helpful (though it's missing some arrows to reduce clutter).
it's apparently easy to twist it to fit several worldviews.
What do you mean by this?
Like Daniel wrote in a comment, it’s good to think of Agent-5 as distributed and able to nudge things all over the internet. The nudges could be highly personalized to demographics and individuals, so responsive to the kind of subtle emotional triggers the superintelligence learns about each individual.
It seems many people’s opinions today are already significantly shaped by social media and disinformation. So this makes me think a similar process that’s much more agentic, personalized, and superintelligence-optimized could be very potent.
Answered to these ideas in my answer to Daniel.
There’s the possibility of mind-hacking too, though I chose to leave that out of the blogpost.
In an earlier iteration of this scenario I had a military coup rather than this gradual political ascension via persuasion. But then I decided that a superintelligence capable of controlling the robots well enough to disempower the human military would probably also be powerful enough to do something less heavy-handed like what’s in the scenario.
I agree that if you have a secret loyalty in wildly superhuman AIs, many things are possible and the success at CEO-takeover seems extremely likely.
The CEO is probably well-positioned to take credit for a lot of the benefits Agent-5 seems to bring to the world (some of these benefits are genuine, some illusory).
I think this is a very different story. If the claim is that even without secret loyalty the CEO would have a 10% chance of taking over, then the secret loyalty part doesn't seem that important and is distracting imo. A 10% of no-secret-loyalty CEO takeover is a big enough deal on its own?
I think this seems more dicey for the AI if it's not vastly superhuman at politics. There are forces pushing for many different people to be in power quite hard, so I think it takes somewhat forceful actions to move probabilities by a Bayes factor above 10, which makes me skeptical that something like the subtle actions in the essay would have the intended effect.
I would guess the AI's subtle policy actions would not be much more coordinated and high impact position than what TikTok could have done a few years ago, and I don't think TikTok was ever in a position where it could have put a much more pro-CCP candidate in power if it wanted to (if the pro-CCP candidate started from a similar position of strength as the CEO) - my guess is that it's hard to massively shift the balance of power without it being obvious to the people losing that something nefarious is going on.
My guess is that the AI could try to have a massive cover-up operation to discredit people who criticize the somewhat obvious media manipulation - but then the thing the AI strategy is more like "somewhat blatantly do bad things and then do a giant cover-up" rather than "take actions so subtle nobody would ever figure it out even if they had access to all the logs".
My bad! Will edit.
When I read AI2027 the "November 2027: Superhuman Politicking" felt much less extreme than "you can make someone that had <0.01 chance of winning be the clear favorite". I guess AI2027 didn't want to make a very strong statement about what is possible with wildly superhuman skills and so they used a relatively mild example (make the race continue despite Xi being willing to make large sacrifices - which seem to me >0.1 even without AI manipulations).
I am still unsure how much can be done with Agent-5. I know some people who don't buy you will get "magical" political abilities during the first few years of the intelligence explosion (for a combination of not believing in very fast takeoff taking you to extremely advanced political skills, and not believing extremely advanced political skills would be that useful) but I am not very sympathetic to their views and I agree that if you get wildly superhuman political skills, the sort of manipulation you describe seems >0.5 likely to succeed.
This feels slightly incorrect to me. I think that one of the main ways in which your results are surprising is that saying "this is deployment" and "this is training" generates internal activations related to training and deployment despite the fact that the model does not believe you.
I think it would have been plausible that "this is deployment" generates activations that are entirely about "the user says it's deployment even when it's not" and completely orthogonal to "this is actually deployment" - so I don't think you should in general expect activation steering to be free of other concepts being mixed in.
I agree the risk of "harmfulness leakage" is lower than with SAE, so I agree with the conclusion that an AI auditor trying any white-box steering should definitely try contrastive-pair activation steering.