LESSWRONG
LW

1170
Fabien Roger
6459Ω1948393340
Message
Dialogue
Subscribe

I am working on empirical AI safety. 

Book a call with me if you want advice on a concrete empirical safety project.

Anonymous feedback form.

Sequences

Posts

Sorted by New

Wikitag Contributions

Comments

Sorted by
Newest
No wikitag contributions to display.
AI Control
7Fabien's Shortform
Ω
2y
Ω
128
Steering Evaluation-Aware Models to Act Like They Are Deployed
Fabien Roger7hΩ260

we can be more certain that there isn’t some other concept mixed in

This feels slightly incorrect to me. I think that one of the main ways in which your results are surprising is that saying "this is deployment" and "this is training" generates internal activations related to training and deployment despite the fact that the model does not believe you.

I think it would have been plausible that "this is deployment" generates activations that are entirely about "the user says it's deployment even when it's not" and completely orthogonal to "this is actually deployment" - so I don't think you should in general expect activation steering to be free of other concepts being mixed in.

I agree the risk of "harmfulness leakage" is lower than with SAE, so I agree with the conclusion that an AI auditor trying any white-box steering should definitely try contrastive-pair activation steering.

Reply
Steering Evaluation-Aware Models to Act Like They Are Deployed
Fabien Roger7hΩ220

Cool work!

I am curious how much these results are about modifying internal representations vs "prompting, but more". If it's "prompting, but more" this makes me worried this sort of technique would have very bad false negatives for instances of real misalignment where the model has more "self-control" about its outputs.

I think your "Llama Nemotron 49B is gullible and easy to trick." experiments point at the latter having at least some effect.

An experiment which would update me more because it feels most analogous to activation steering is running contrastive sampling / classifier free guidance which, like activation steering, supports contrastive prompts, using large weights, etc.

Reply
Training Qwen-1.5B with a CoT legibility penalty
Fabien Roger8h20

I agree overt and covert encoded reasoning are different and you should flag overt encoded reasoning, and covert seems harder than overt. This is what I tried to point at by saying

I relax the “plausible benign” constraint and try to find math reasoning that doesn’t look like math reasoning to a monitor."

Reply
Sonnet 4.5's eval gaming seriously undermines alignment evals, and this seems caused by training on alignment evals
Fabien Roger18h106

An intuition here is that different RL environments could ~randomly change the "scratchpad personality" through indirect effects, with some changes pushing towards more or less blabbing. And so if you search over a large space of safety-training environments and pick all the environments that don't result in verbalized situational awareness you might be selecting all the environments that push towards a less blabby scratchpad personality.

(In this specific case of Sonnet 4.5, the indirect optimization effects of removing the RL envs the model card describes are probably very weak, so I think it should not be the primary consideration in the case of Sonnet 4.5 absent some strong reason to expect these particular RL envs to encourage a very blabby personality.)

Reply
Some data from LeelaPieceOdds
Fabien Roger22h40

Do you think it's similar to how different expectations and priors about what AI trajectories and capability profiles will look like often cause people to make different predictions on e.g. P(doom), P(scheming), etc. (like the Paul vs Eliezer debates)? Or do you think in this case there is enough empirical evidence that people ought to converge more? (I'd guess the former, but low confidence.)

Reply
Emergent Introspective Awareness in Large Language Models
Fabien Roger1d110

it re-reads the whole conversation from scratch and then generates an entirely new response

You can think of what is being done here as the experiment done here interfering with the "re-reading" of the response.

A simple example: when the LLM sees the word "brot" in German, it probably "translates" it internally into "bread" at the position where the "brot" token is, and so if you tamper with activations at the "brot" position (on all forward passes - though in practice you only need to do it the first time "brot" enters the context if you do KV caching) it will have effects many tokens later. In the Transformer architecture the process that computes the next token happens in part "at" previous tokens so it makes sense to tamper with activations "at" previous tokens.

Maybe this diagram from this post is helpful (though it's missing some arrows to reduce clutter).

Testing which LLM architectures can do hidden serial reasoning — LessWrong
Reply1
Some data from LeelaPieceOdds
Fabien Roger2d40

it's apparently easy to twist it to fit several worldviews. 

What do you mean by this?

Reply
How an AI company CEO could quietly take over the world
Fabien Roger6d20

Like Daniel wrote in a comment, it’s good to think of Agent-5 as distributed and able to nudge things all over the internet. The nudges could be highly personalized to demographics and individuals, so responsive to the kind of subtle emotional triggers the superintelligence learns about each individual.

It seems many people’s opinions today are already significantly shaped by social media and disinformation. So this makes me think a similar process that’s much more agentic, personalized, and superintelligence-optimized could be very potent.

Answered to these ideas in my answer to Daniel.

There’s the possibility of mind-hacking too, though I chose to leave that out of the blogpost.

In an earlier iteration of this scenario I had a military coup rather than this gradual political ascension via persuasion. But then I decided that a superintelligence capable of controlling the robots well enough to disempower the human military would probably also be powerful enough to do something less heavy-handed like what’s in the scenario.

I agree that if you have a secret loyalty in wildly superhuman AIs, many things are possible and the success at CEO-takeover seems extremely likely.

The CEO is probably well-positioned to take credit for a lot of the benefits Agent-5 seems to bring to the world (some of these benefits are genuine, some illusory).

I think this is a very different story. If the claim is that even without secret loyalty the CEO would have a 10% chance of taking over, then the secret loyalty part doesn't seem that important and is distracting imo. A 10% of no-secret-loyalty CEO takeover is a big enough deal on its own?

Reply
How an AI company CEO could quietly take over the world
Fabien Roger6d4-2

I think this seems more dicey for the AI if it's not vastly superhuman at politics. There are forces pushing for many different people to be in power quite hard, so I think it takes somewhat forceful actions to move probabilities by a Bayes factor above 10, which makes me skeptical that something like the subtle actions in the essay would have the intended effect.

I would guess the AI's subtle policy actions would not be much more coordinated and high impact position than what TikTok could have done a few years ago, and I don't think TikTok was ever in a position where it could have put a much more pro-CCP candidate in power if it wanted to (if the pro-CCP candidate started from a similar position of strength as the CEO) - my guess is that it's hard to massively shift the balance of power without it being obvious to the people losing that something nefarious is going on.

My guess is that the AI could try to have a massive cover-up operation to discredit people who criticize the somewhat obvious media manipulation - but then the thing the AI strategy is more like "somewhat blatantly do bad things and then do a giant cover-up" rather than "take actions so subtle nobody would ever figure it out even if they had access to all the logs".

Reply
How an AI company CEO could quietly take over the world
Fabien Roger7d40

My bad! Will edit.

When I read AI2027 the "November 2027: Superhuman Politicking" felt much less extreme than "you can make someone that had <0.01 chance of winning be the clear favorite". I guess AI2027 didn't want to make a very strong statement about what is possible with wildly superhuman skills and so they used a relatively mild example (make the race continue despite Xi being willing to make large sacrifices - which seem to me >0.1 even without AI manipulations).

I am still unsure how much can be done with Agent-5. I know some people who don't buy you will get "magical" political abilities during the first few years of the intelligence explosion (for a combination of not believing in very fast takeoff taking you to extremely advanced political skills, and not believing extremely advanced political skills would be that useful) but I am not very sympathetic to their views and I agree that if you get wildly superhuman political skills, the sort of manipulation you describe seems >0.5 likely to succeed.

Reply
Load More
34Rogue internal deployments via external APIs
Ω
16d
Ω
4
77Current Language Models Struggle to Reason in Ciphered Language
Ω
18d
Ω
6
66Training Qwen-1.5B with a CoT legibility penalty
22d
7
49Training fails to elicit subtle reasoning in current language models
22d
3
152Inoculation prompting: Instructing models to misbehave at train-time can improve run-time behavior
Ω
23d
Ω
35
48Four places where you can put LLM monitoring
Ω
3mo
Ω
0
158Why Do Some Language Models Fake Alignment While Others Don't?
Ω
4mo
Ω
14
22What can be learned from scary demos? A snitching case study
Ω
4mo
Ω
1
70Modifying LLM Beliefs with Synthetic Document Finetuning
Ω
6mo
Ω
12
28Reasoning models don't always say what they think
Ω
7mo
Ω
4
Load More