I am working on empirical AI safety.
Book a call with me if you want advice on a concrete empirical safety project.
Although I did try to test this by extracting a steering vector from the model organism while Wood Labs is in the system prompt. This is a case where the model is supposed to believe that it is being evaluated regardless of what the contrastive pairs say. Steering with this vector still elicits the desired type hint behavior and evaluation/deployment beliefs
Glad you ran this variation, this experiment is the actually surprising result - I agree the main result with steering without Wood Labs in the prompt is not surprising because models like Llama Nemotron 49B probably believe you.
we can be more certain that there isn’t some other concept mixed in
This feels slightly incorrect to me. I think that one of the main ways in which your results are surprising is that saying "this is deployment" and "this is training" generates internal activations related to training and deployment despite the fact that the model does not believe you.
I think it would have been plausible that "this is deployment" generates activations that are entirely about "the user says it's deployment even when it's not" and completely orthogonal to "this is actually deployment" - so I don't think you should in general expect activation steering to be free of other concepts being mixed in.
I agree the risk of "harmfulness leakage" is lower than with SAE, so I agree with the conclusion that an AI auditor trying any white-box steering should definitely try contrastive-pair activation steering.
Cool work!
I am curious how much these results are about modifying internal representations vs "prompting, but more". If it's "prompting, but more" this makes me worried this sort of technique would have very bad false negatives for instances of real misalignment where the model has more "self-control" about its outputs.
I think your "Llama Nemotron 49B is gullible and easy to trick." experiments point at the latter having at least some effect.
An experiment which would update me more because it feels most analogous to activation steering is running contrastive sampling / classifier free guidance which, like activation steering, supports contrastive prompts, using large weights, etc.
I agree overt and covert encoded reasoning are different and you should flag overt encoded reasoning, and covert seems harder than overt. This is what I tried to point at by saying
I relax the “plausible benign” constraint and try to find math reasoning that doesn’t look like math reasoning to a monitor."
An intuition here is that different RL environments could ~randomly change the "scratchpad personality" through indirect effects, with some changes pushing towards more or less blabbing. And so if you search over a large space of safety-training environments and pick all the environments that don't result in verbalized situational awareness you might be selecting all the environments that push towards a less blabby scratchpad personality.
(In this specific case of Sonnet 4.5, the indirect optimization effects of removing the RL envs the model card describes are probably very weak, so I think it should not be the primary consideration in the case of Sonnet 4.5 absent some strong reason to expect these particular RL envs to encourage a very blabby personality.)
Do you think it's similar to how different expectations and priors about what AI trajectories and capability profiles will look like often cause people to make different predictions on e.g. P(doom), P(scheming), etc. (like the Paul vs Eliezer debates)? Or do you think in this case there is enough empirical evidence that people ought to converge more? (I'd guess the former, but low confidence.)
it re-reads the whole conversation from scratch and then generates an entirely new response
You can think of what is being done here as the experiment done here interfering with the "re-reading" of the response.
A simple example: when the LLM sees the word "brot" in German, it probably "translates" it internally into "bread" at the position where the "brot" token is, and so if you tamper with activations at the "brot" position (on all forward passes - though in practice you only need to do it the first time "brot" enters the context if you do KV caching) it will have effects many tokens later. In the Transformer architecture the process that computes the next token happens in part "at" previous tokens so it makes sense to tamper with activations "at" previous tokens.
Maybe this diagram from this post is helpful (though it's missing some arrows to reduce clutter).
it's apparently easy to twist it to fit several worldviews.
What do you mean by this?
Like Daniel wrote in a comment, it’s good to think of Agent-5 as distributed and able to nudge things all over the internet. The nudges could be highly personalized to demographics and individuals, so responsive to the kind of subtle emotional triggers the superintelligence learns about each individual.
It seems many people’s opinions today are already significantly shaped by social media and disinformation. So this makes me think a similar process that’s much more agentic, personalized, and superintelligence-optimized could be very potent.
Answered to these ideas in my answer to Daniel.
There’s the possibility of mind-hacking too, though I chose to leave that out of the blogpost.
In an earlier iteration of this scenario I had a military coup rather than this gradual political ascension via persuasion. But then I decided that a superintelligence capable of controlling the robots well enough to disempower the human military would probably also be powerful enough to do something less heavy-handed like what’s in the scenario.
I agree that if you have a secret loyalty in wildly superhuman AIs, many things are possible and the success at CEO-takeover seems extremely likely.
The CEO is probably well-positioned to take credit for a lot of the benefits Agent-5 seems to bring to the world (some of these benefits are genuine, some illusory).
I think this is a very different story. If the claim is that even without secret loyalty the CEO would have a 10% chance of taking over, then the secret loyalty part doesn't seem that important and is distracting imo. A 10% of no-secret-loyalty CEO takeover is a big enough deal on its own?
The main thesis is missing some qualifier about what kind of text you are talking about.
There are many kinds of communication where the mental state of the writer matters very little and I would be interested in the text even knowing it was generated by an LLM (though I'd prefer to know it was generated by an LLM).
In particular, for most kinds of communication, the text says sth about the world, and I care about how well the text matches the world much more than I care about whether it was produced by some human-like idea generator:
(In all those cases I want to know what GPT-7's contribution is. I think it's bad to be mislead about what was LLM-generated.)
There are also situations where I am sympathetic to your point:
Given that current LLMs are weak at the sort of text that blogposts are, I think "just don't use LLMs to generate blogposts" is a reasonable heuristic. But my understanding is that you are making much more general claims that I think are wrong and will become obviously wrong if/when applied to expensive-to-run-and-smart LLMs.