LESSWRONG
LW

1893
Fabien Roger
6474Ω1949393360
Message
Dialogue
Subscribe

I am working on empirical AI safety. 

Book a call with me if you want advice on a concrete empirical safety project.

Anonymous feedback form.

Sequences

Posts

Sorted by New

Wikitag Contributions

Comments

Sorted by
Newest
No wikitag contributions to display.
AI Control
7Fabien's Shortform
Ω
2y
Ω
128
LLM-generated text is not testimony
Fabien Roger9h133

LLM text categorically does not serve the role for communication that is served by real text.

The main thesis is missing some qualifier about what kind of text you are talking about.

There are many kinds of communication where the mental state of the writer matters very little and I would be interested in the text even knowing it was generated by an LLM (though I'd prefer to know it was generated by an LLM).

In particular, for most kinds of communication, the text says sth about the world, and I care about how well the text matches the world much more than I care about whether it was produced by some human-like idea generator:

  • If GPT-7 writes a lean proof that is shown to prove something I care about, and then it translates that proof into natural text (and I have reasons to expect translation to be reliable), I probably want to read the proof.
  • If GPT-7 writes code that generates a cool-looking fractal that I've never seen before, I'd be interested in reading the code.
  • If GPT-7 is shown to produce better predictions/advice than human experts on some distribution after following some costly computations and given access to costly tools, I'd want to read GPT-7 predictions/advice on the distribution where it performs well.
  • If GPT-7 is much better at writing up ideas/experiment results than you (not just in the shallow sense of "raters prefer one over the other" but in the stronger sense of "everyone interested in the raw idea/experiment would learn faster and more accurately by reading the result of GPT-7 prompted with the raw idea/experiment dump"), and I care about the raw ideas / experiment, then I want to read GPT-7's output.
  • If I care about evaluating GPT-7, then I want to read GPT-7's outputs on topics which I know a lot about.

(In all those cases I want to know what GPT-7's contribution is. I think it's bad to be mislead about what was LLM-generated.)

There are also situations where I am sympathetic to your point:

  • If the LLM in question is so weak (like current LLMs on almost all topics) that I don't expect it to produce content that is very accurate to reality (but note that this is just a skill issue);
  • If I want to evaluate a human (e.g. for status reasons or because I am evaluating a candidate);
  • If I care about it being written by a human intrinsically regardless of what the text is (e.g. because you want to have a "true" connection with the writer for emotional reasons).

Given that current LLMs are weak at the sort of text that blogposts are, I think "just don't use LLMs to generate blogposts" is a reasonable heuristic. But my understanding is that you are making much more general claims that I think are wrong and will become obviously wrong if/when applied to expensive-to-run-and-smart LLMs.

Reply1
Steering Evaluation-Aware Models to Act Like They Are Deployed
Fabien Roger1dΩ232

Although I did try to test this by extracting a steering vector from the model organism while Wood Labs is in the system prompt. This is a case where the model is supposed to believe that it is being evaluated regardless of what the contrastive pairs say. Steering with this vector still elicits the desired type hint behavior and evaluation/deployment beliefs

Glad you ran this variation, this experiment is the actually surprising result - I agree the main result with steering without Wood Labs in the prompt is not surprising because models like Llama Nemotron 49B probably believe you.

Reply
Steering Evaluation-Aware Models to Act Like They Are Deployed
Fabien Roger2dΩ260

we can be more certain that there isn’t some other concept mixed in

This feels slightly incorrect to me. I think that one of the main ways in which your results are surprising is that saying "this is deployment" and "this is training" generates internal activations related to training and deployment despite the fact that the model does not believe you.

I think it would have been plausible that "this is deployment" generates activations that are entirely about "the user says it's deployment even when it's not" and completely orthogonal to "this is actually deployment" - so I don't think you should in general expect activation steering to be free of other concepts being mixed in.

I agree the risk of "harmfulness leakage" is lower than with SAE, so I agree with the conclusion that an AI auditor trying any white-box steering should definitely try contrastive-pair activation steering.

Reply11
Steering Evaluation-Aware Models to Act Like They Are Deployed
Fabien Roger2dΩ340

Cool work!

I am curious how much these results are about modifying internal representations vs "prompting, but more". If it's "prompting, but more" this makes me worried this sort of technique would have very bad false negatives for instances of real misalignment where the model has more "self-control" about its outputs.

I think your "Llama Nemotron 49B is gullible and easy to trick." experiments point at the latter having at least some effect.

An experiment which would update me more because it feels most analogous to activation steering is running contrastive sampling / classifier free guidance which, like activation steering, supports contrastive prompts, using large weights, etc.

Reply
Training Qwen-1.5B with a CoT legibility penalty
Fabien Roger2d20

I agree overt and covert encoded reasoning are different and you should flag overt encoded reasoning, and covert seems harder than overt. This is what I tried to point at by saying

I relax the “plausible benign” constraint and try to find math reasoning that doesn’t look like math reasoning to a monitor."

Reply
Sonnet 4.5's eval gaming seriously undermines alignment evals, and this seems caused by training on alignment evals
Fabien Roger3d117

An intuition here is that different RL environments could ~randomly change the "scratchpad personality" through indirect effects, with some changes pushing towards more or less blabbing. And so if you search over a large space of safety-training environments and pick all the environments that don't result in verbalized situational awareness you might be selecting all the environments that push towards a less blabby scratchpad personality.

(In this specific case of Sonnet 4.5, the indirect optimization effects of removing the RL envs the model card describes are probably very weak, so I think it should not be the primary consideration in the case of Sonnet 4.5 absent some strong reason to expect these particular RL envs to encourage a very blabby personality.)

Reply
Some data from LeelaPieceOdds
Fabien Roger3d40

Do you think it's similar to how different expectations and priors about what AI trajectories and capability profiles will look like often cause people to make different predictions on e.g. P(doom), P(scheming), etc. (like the Paul vs Eliezer debates)? Or do you think in this case there is enough empirical evidence that people ought to converge more? (I'd guess the former, but low confidence.)

Reply
Emergent Introspective Awareness in Large Language Models
Fabien Roger3d110

it re-reads the whole conversation from scratch and then generates an entirely new response

You can think of what is being done here as the experiment done here interfering with the "re-reading" of the response.

A simple example: when the LLM sees the word "brot" in German, it probably "translates" it internally into "bread" at the position where the "brot" token is, and so if you tamper with activations at the "brot" position (on all forward passes - though in practice you only need to do it the first time "brot" enters the context if you do KV caching) it will have effects many tokens later. In the Transformer architecture the process that computes the next token happens in part "at" previous tokens so it makes sense to tamper with activations "at" previous tokens.

Maybe this diagram from this post is helpful (though it's missing some arrows to reduce clutter).

Testing which LLM architectures can do hidden serial reasoning — LessWrong
Reply1
Some data from LeelaPieceOdds
Fabien Roger4d40

it's apparently easy to twist it to fit several worldviews. 

What do you mean by this?

Reply
How an AI company CEO could quietly take over the world
Fabien Roger8d20

Like Daniel wrote in a comment, it’s good to think of Agent-5 as distributed and able to nudge things all over the internet. The nudges could be highly personalized to demographics and individuals, so responsive to the kind of subtle emotional triggers the superintelligence learns about each individual.

It seems many people’s opinions today are already significantly shaped by social media and disinformation. So this makes me think a similar process that’s much more agentic, personalized, and superintelligence-optimized could be very potent.

Answered to these ideas in my answer to Daniel.

There’s the possibility of mind-hacking too, though I chose to leave that out of the blogpost.

In an earlier iteration of this scenario I had a military coup rather than this gradual political ascension via persuasion. But then I decided that a superintelligence capable of controlling the robots well enough to disempower the human military would probably also be powerful enough to do something less heavy-handed like what’s in the scenario.

I agree that if you have a secret loyalty in wildly superhuman AIs, many things are possible and the success at CEO-takeover seems extremely likely.

The CEO is probably well-positioned to take credit for a lot of the benefits Agent-5 seems to bring to the world (some of these benefits are genuine, some illusory).

I think this is a very different story. If the claim is that even without secret loyalty the CEO would have a 10% chance of taking over, then the secret loyalty part doesn't seem that important and is distracting imo. A 10% of no-secret-loyalty CEO takeover is a big enough deal on its own?

Reply
Load More
34Rogue internal deployments via external APIs
Ω
18d
Ω
4
77Current Language Models Struggle to Reason in Ciphered Language
Ω
19d
Ω
6
66Training Qwen-1.5B with a CoT legibility penalty
24d
7
49Training fails to elicit subtle reasoning in current language models
24d
3
152Inoculation prompting: Instructing models to misbehave at train-time can improve run-time behavior
Ω
25d
Ω
37
48Four places where you can put LLM monitoring
Ω
3mo
Ω
0
158Why Do Some Language Models Fake Alignment While Others Don't?
Ω
4mo
Ω
14
22What can be learned from scary demos? A snitching case study
Ω
4mo
Ω
1
70Modifying LLM Beliefs with Synthetic Document Finetuning
Ω
6mo
Ω
12
28Reasoning models don't always say what they think
Ω
7mo
Ω
4
Load More