Fabien Roger

I am working on empirical AI safety. 

Book a call with me if you want advice on a concrete empirical safety project.

Anonymous feedback form.

Comments

Tips and Code for Empirical Research Workflows
Fabien Roger · 11h

Some additional opinionated tips:

  • Cache LLM responses! It's only a few lines of code to write your own caching wrapper if you use one file per generation, and it makes your life much better because it lets you kill/rerun scripts without being afraid of losing anything (see the caching sketch after this list).
  • Always have at least one experiment script that is slow and runs in the background, and an explore/plot script that you use for dataset exploration and plotting.
    • If you have n slow steps (e.g. n=2: one slow data-generation process and one slow training run), have n Python entry points, so that you can inspect the result of each step before starting the next. Once the pipeline works, you can always write a script that runs them one after another.
  • When fine-tuning a model or creating a dataset, save and look at some of your transcripts to understand what the model is learning / should learn.
    • If wandb makes this mildly annoying, don't use wandb; just store json/jsonl locally.
  • In general, use plotting / printing of results as the main debugging tool: the evolution of the train/eval loss during training should look fine, the transcripts should look reasonable, the distribution of probabilities should look reasonable, etc. If anything looks surprising, investigate by plotting / printing more stuff and looking at the places in the code/data that could have caused the issue.
    • Also make guesses about API cost / training speed, and check that your expectations are roughly correct near the start of the run (e.g. use tqdm on any slow for loop / gather). This lets you notice issues like putting things on the wrong device, not hitting the API cache, using a dataset that is too big for a de-risking experiment, etc.
  • To the extent it's possible, try to make the script reach the place where it could crash as fast as possible. For example, if a script crashes on the first backward pass after 2 minutes of tokenization, avoid always spending 2 minutes waiting for tokenization during debugging (e.g. tokenize on the fly or cache the tokenized dataset).
  • Do simple fast experiments on simple infra before doing big expensive experiments on complex infra
    • For fine-tuning / probing projects, I often get a lot of value out of doing a single-GPU run on a 1B model using a vanilla pytorch loop (see the sketch after this list). That can often get signal in a few minutes and forces me to think through what different results could mean / whether there are boring reasons for surprising results.
    • If things go wrong, simplify even further: e.g. if I train a password-locked model and the with-password performance is low, I would try training a model just on the strong trajectories.
  • Use #%% python notebooks; they play nicely with Copilot, code editors, and git. And you need notebooks to do printing / plotting (see the notebook sketch after this list).
  • Use prompting experiments before doing SL. Do SL before doing RL. And you often don't need SL or RL (e.g. use SL on high-reward behavior to study generalization of training on bad rewards, use best-of-n to get a sense of how Goodhart-able a reward is, use few-shot prompting to get a sense of how training on some datapoints helps the model learn the right behavior, etc.).
    • Prompting experiments can happen in your notebook, they don't always need to be clean and large-n.
  • Use as few libraries you don't understand as possible. I like just using torch+huggingface+tqdm+matplotlib. For model-internals experiments, I often use either huggingface's output_hidden_states or a pytorch hook. For saving data, I often just use json.
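A minimal sketch of the file-per-generation caching wrapper, assuming an OpenAI-style async chat client (the directory and helper names are illustrative, not from the original post):

```python
# Minimal file-per-generation cache sketch (assumes an OpenAI-style async client;
# CACHE_DIR and cached_completion are illustrative names).
import hashlib
import json
from pathlib import Path

CACHE_DIR = Path("llm_cache")
CACHE_DIR.mkdir(exist_ok=True)


def request_key(model: str, messages: list[dict], **params) -> str:
    # md5 of the full request, so the key is stable across runs
    # (python's built-in hash is randomized between runs)
    payload = json.dumps({"model": model, "messages": messages, **params}, sort_keys=True)
    return hashlib.md5(payload.encode()).hexdigest()


async def cached_completion(client, model: str, messages: list[dict], **params) -> str:
    path = CACHE_DIR / f"{request_key(model, messages, **params)}.json"
    if path.exists():
        # one file per generation: killing and rerunning the script loses nothing
        return json.loads(path.read_text())["completion"]
    response = await client.chat.completions.create(model=model, messages=messages, **params)
    completion = response.choices[0].message.content
    path.write_text(json.dumps({"messages": messages, "completion": completion}))
    return completion
```

A nice side effect of one file per generation is that the cache doubles as a dump of transcripts you can open and read when debugging.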
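A minimal sketch of the kind of single-GPU vanilla pytorch de-risking run mentioned above (the model name, data, and hyperparameters are illustrative):

```python
# Single-GPU vanilla fine-tuning loop on a small model, for fast de-risking runs
# (model name, texts, and hyperparameters are illustrative).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

device = "cuda"
model_name = "Qwen/Qwen2.5-1.5B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16).to(device)
model.train()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

texts = ["Example training document."] * 8  # replace with your data

for step, text in enumerate(texts):
    batch = tokenizer(text, return_tensors="pt", truncation=True, max_length=1024).to(device)
    out = model(**batch, labels=batch["input_ids"])  # HF shifts labels internally
    out.loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    if step % 2 == 0:
        print(step, out.loss.item())  # the loss curve is the first sanity check
```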
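And a minimal sketch of an explore/plot #%% notebook (the file name and the prompt/completion/score fields are illustrative):

```python
# explore.py - a #%% notebook for exploring and plotting experiment outputs
# (the transcripts.jsonl file and its fields are illustrative).
# %%
import json

import matplotlib.pyplot as plt

with open("transcripts.jsonl") as f:
    rows = [json.loads(line) for line in f]
print(len(rows), "transcripts; fields:", sorted(rows[0].keys()))

# %%
# Read a few raw transcripts to sanity-check what the model is learning
for row in rows[:3]:
    print(row["prompt"][:200])
    print("---")
    print(row["completion"][:500])
    print("=" * 80)

# %%
plt.hist([row["score"] for row in rows], bins=30)
plt.xlabel("score")
plt.ylabel("count")
plt.show()
```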

Mistakes to avoid:

  • Not sweeping learning rates when fine-tuning a model (I would usually sweep around the expected lr. For example, Adam often expects something like lr=1e-4, so I would at least try something like 1e-5, 3e-5, 1e-4, and 3e-4, and then adjust during the next experiment based on train loss and val accuracy, using smaller gaps if I want better performance and not just a reasonable lr; see the sweep sketch after this list)
  • When distilling one model into another, not sampling at temperature 1 (sampling at temperature 1 and training on it is what should copy the original model's probabilities)
  • When evaluating a model, only sampling at temperature 1 (sampling at a lower temperature often works better)
  • Using an instruction-tuned model without using the official prompt template (usually tokenizer.apply_chat_template; see the template sketch after this list)
  • Not including the end-of-sequence token when fine-tuning on an instruction-tuning dataset
  • Not sweeping in logspace when sweeping hyperparams like temperature, lr, etc.
  • Not training for long enough / on enough datapoints - fine-tune on at least 1k trajectories before giving up. Runs with fewer than 1k trajectories are not that informative.
  • Using a huggingface model forward pass with left padding without prepare_inputs_for_generation
  • Using await in a for loop (in particular when awaiting LLM calls, always use gather; see the async sketch after this list)
  • If your LLM requests benefit from prefix caching, not using a semaphore to make sure the queries with a shared prefix happen within <5 minutes of each other
  • Using python's built-in hash instead of md5 when you want a hash that is consistent across runs (see the reproducibility sketch after this list)
  • Using random.Random(some object that is not a str or an int) (it sometimes uses the object's id as seed, which varies between runs)
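A minimal sketch of the log-spaced learning-rate sweep (train_and_eval is a hypothetical stand-in for whatever fine-tuning entry point you use):

```python
# Log-spaced lr sweep around the expected value.
def train_and_eval(lr: float) -> float:
    raise NotImplementedError  # hypothetical: fine-tune with this lr, return val accuracy


expected_lr = 1e-4  # e.g. a typical Adam fine-tuning lr
lrs = [expected_lr * m for m in (0.1, 0.3, 1.0, 3.0)]  # 1e-5, 3e-5, 1e-4, 3e-4
results = {lr: train_and_eval(lr) for lr in lrs}
best_lr = max(results, key=results.get)
```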
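A minimal sketch of using the official prompt template with huggingface's apply_chat_template, and of keeping the end-of-sequence token in mind (the model name and messages are illustrative):

```python
# Use the official chat template for both generation and fine-tuning text
# (the model name and the example conversation are illustrative).
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-1.5B-Instruct")

messages = [
    {"role": "user", "content": "What is 2 + 2?"},
    {"role": "assistant", "content": "4"},
]

# Generation: render the prompt with the official template, ending at the assistant header
prompt = tokenizer.apply_chat_template(
    messages[:1], tokenize=False, add_generation_prompt=True
)

# Fine-tuning: render the full conversation; most chat templates already close the
# assistant turn with an end-of-turn/EOS token. If you build training text manually
# instead, remember to append tokenizer.eos_token so the model learns to stop.
train_text = tokenizer.apply_chat_template(messages, tokenize=False)
train_ids = tokenizer(train_text, add_special_tokens=False).input_ids
```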
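A minimal sketch of concurrent LLM calls with gather and a semaphore (it reuses the hypothetical cached_completion helper sketched earlier):

```python
# Fire off LLM calls concurrently with gather instead of awaiting them one by one in a
# for loop; the semaphore bounds concurrency, which also keeps batches of shared-prefix
# queries close together in time (cached_completion is the hypothetical helper above).
import asyncio


async def generate_all(client, model: str, prompts: list[str], max_concurrency: int = 20) -> list[str]:
    semaphore = asyncio.Semaphore(max_concurrency)

    async def one(prompt: str) -> str:
        async with semaphore:
            return await cached_completion(
                client, model, [{"role": "user", "content": prompt}]
            )

    # a plain `for prompt in prompts: await ...` would run the requests serially
    return await asyncio.gather(*(one(p) for p in prompts))
```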
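And a small sketch of the run-to-run consistency points about md5 and random.Random:

```python
# Hashes and seeds that are stable across runs.
import hashlib
import random


def stable_hash(text: str) -> str:
    return hashlib.md5(text.encode()).hexdigest()  # same value in every run


print(stable_hash("some prompt"))  # consistent across runs
print(hash("some prompt"))         # randomized between runs (PYTHONHASHSEED)

rng = random.Random("my-experiment-seed")  # str or int seeds are deterministic
# random.Random(some_config_object) can fall back to the object's run-dependent
# hash/id, so the "seeded" runs would not actually be reproducible
print(rng.random())
```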
Steering Evaluation-Aware Models to Act Like They Are Deployed
Fabien Roger · 8d

Great, thanks for running it!

Why did you stop at gamma=2 (while the activation steering used a steering coefficient of 5, if I recall correctly)?

Steering Evaluation-Aware Models to Act Like They Are Deployed
Fabien Roger · 8d

I agree it's not very surprising and might not work at all in the future because AIs seeing "this is deploy" might update towards thinking it's an eval / not update at all (while Llama Nemotron 49B updates somewhat towards "this is deploy").

But Idk, it still feels like an encouraging sign that even when the update is small, the amplification works as you would hope it would. I agree it was imprecise to call it "the model does not believe you", but the fact that the technique works even if "the model only updates 10% towards what you told it and still mostly does not believe you" is a non-trivial finding - and the results could have plausibly been different.

LLM-generated text is not testimony
Fabien Roger · 9d

LLM text categorically does not serve the role for communication that is served by real text.

The main thesis is missing some qualifier about what kind of text you are talking about.

There are many kinds of communication where the mental state of the writer matters very little and I would be interested in the text even knowing it was generated by an LLM (though I'd prefer to know it was generated by an LLM).

In particular, for most kinds of communication, the text says something about the world, and I care about how well the text matches the world much more than I care about whether it was produced by some human-like idea generator:

  • If GPT-7 writes a lean proof that is shown to prove something I care about, and then it translates that proof into natural text (and I have reasons to expect translation to be reliable), I probably want to read the proof.
  • If GPT-7 writes code that generates a cool-looking fractal that I've never seen before, I'd be interested in reading the code.
  • If GPT-7 is shown to produce better predictions/advice than human experts on some distribution after following some costly computations and given access to costly tools, I'd want to read GPT-7 predictions/advice on the distribution where it performs well.
  • If GPT-7 is much better at writing up ideas/experiment results than you (not just in the shallow sense of "raters prefer one over the other" but in the stronger sense of "everyone interested in the raw idea/experiment would learn faster and more accurately by reading the result of GPT-7 prompted with the raw idea/experiment dump"), and I care about the raw ideas / experiment, then I want to read GPT-7's output.
  • If I care about evaluating GPT-7, then I want to read GPT-7's outputs on topics which I know a lot about.

(In all those cases I want to know what GPT-7's contribution is. I think it's bad to be misled about what was LLM-generated.)

There are also situations where I am sympathetic to your point:

  • If the LLM in question is so weak (like current LLMs on almost all topics) that I don't expect it to produce content that is very accurate to reality (but note that this is just a skill issue);
  • If I want to evaluate a human (e.g. for status reasons or because I am evaluating a candidate);
  • If I care about it being written by a human intrinsically regardless of what the text is (e.g. because you want to have a "true" connection with the writer for emotional reasons).

Given that current LLMs are weak at the sort of text that blogposts are, I think "just don't use LLMs to generate blogposts" is a reasonable heuristic. But my understanding is that you are making much more general claims that I think are wrong and will become obviously wrong if/when applied to expensive-to-run-and-smart LLMs.

Steering Evaluation-Aware Models to Act Like They Are Deployed
Fabien Roger · 10d

Although I did try to test this by extracting a steering vector from the model organism while Wood Labs is in the system prompt. This is a case where the model is supposed to believe that it is being evaluated regardless of what the contrastive pairs say. Steering with this vector still elicits the desired type hint behavior and evaluation/deployment beliefs

Glad you ran this variation; this experiment is the actually surprising result. I agree the main result with steering without Wood Labs in the prompt is not surprising, because models like Llama Nemotron 49B probably believe you.

Steering Evaluation-Aware Models to Act Like They Are Deployed
Fabien Roger · 11d

we can be more certain that there isn’t some other concept mixed in

This feels slightly incorrect to me. I think that one of the main ways in which your results are surprising is that saying "this is deployment" and "this is training" generates internal activations related to training and deployment despite the fact that the model does not believe you.

I think it would have been plausible that "this is deployment" generates activations that are entirely about "the user says it's deployment even when it's not" and completely orthogonal to "this is actually deployment" - so I don't think you should in general expect activation steering to be free of other concepts being mixed in.

I agree the risk of "harmfulness leakage" is lower than with SAE, so I agree with the conclusion that an AI auditor trying any white-box steering should definitely try contrastive-pair activation steering.

Steering Evaluation-Aware Models to Act Like They Are Deployed
Fabien Roger · 11d

Cool work!

I am curious how much these results are about modifying internal representations vs "prompting, but more". If it's "prompting, but more", this makes me worry that this sort of technique would have very bad false negatives for instances of real misalignment where the model has more "self-control" over its outputs.

I think your "Llama Nemotron 49B is gullible and easy to trick." experiments point at the latter having at least some effect.

An experiment which would update me more, because it feels most analogous to activation steering, is running contrastive sampling / classifier-free guidance, which, like activation steering, supports contrastive prompts, large weights, etc.

Training Qwen-1.5B with a CoT legibility penalty
Fabien Roger · 11d

I agree overt and covert encoded reasoning are different and you should flag overt encoded reasoning, and covert seems harder than overt. This is what I tried to point at by saying

I relax the “plausible benign” constraint and try to find math reasoning that doesn’t look like math reasoning to a monitor.

Sonnet 4.5's eval gaming seriously undermines alignment evals, and this seems caused by training on alignment evals
Fabien Roger · 11d

An intuition here is that different RL environments could ~randomly change the "scratchpad personality" through indirect effects, with some changes pushing towards more or less blabbing. And so if you search over a large space of safety-training environments and pick all the environments that don't result in verbalized situational awareness you might be selecting all the environments that push towards a less blabby scratchpad personality.

(In this specific case of Sonnet 4.5, the indirect optimization effects of removing the RL envs the model card describes are probably very weak, so I think it should not be the primary consideration in the case of Sonnet 4.5 absent some strong reason to expect these particular RL envs to encourage a very blabby personality.)

Some data from LeelaPieceOdds
Fabien Roger · 11d

Do you think it's similar to how different expectations and priors about what AI trajectories and capability profiles will look like often cause people to make different predictions on e.g. P(doom), P(scheming), etc. (like the Paul vs Eliezer debates)? Or do you think in this case there is enough empirical evidence that people ought to converge more? (I'd guess the former, but low confidence.)

Posts

  • Rogue internal deployments via external APIs (1mo)
  • Current Language Models Struggle to Reason in Ciphered Language (1mo)
  • Training Qwen-1.5B with a CoT legibility penalty (1mo)
  • Training fails to elicit subtle reasoning in current language models (1mo)
  • Inoculation prompting: Instructing models to misbehave at train-time can improve run-time behavior (1mo)
  • Four places where you can put LLM monitoring (3mo)
  • Why Do Some Language Models Fake Alignment While Others Don't? (4mo)
  • What can be learned from scary demos? A snitching case study (5mo)
  • Modifying LLM Beliefs with Synthetic Document Finetuning (7mo)
  • Reasoning models don't always say what they think (7mo)