
Ethan Perez

I'm a research scientist at Anthropic doing empirical AI safety research on language models. In the past, I've worked on automated red teaming of language models [1], the inverse scaling prize [2], learning from human feedback [3][4], and empirically testing debate [5][6], iterated amplification [7], and other methods [8] for scalably supervising AI systems as they become more capable than us.

Website: http://ethanperez.net/

# Wiki Contributions

Evan and others on my team are working on non-mechanistic-interpretability directions primarily motivated by inner alignment:

1. Developing model organisms for deceptive inner alignment, which we may use to study the risk factors for deceptive alignment
2. Conditioning predictive models as an alternative to training agents. Predictive models may pose fewer inner alignment risks, for reasons discussed here
3. Studying the extent to which models exhibit likely pre-requisites to deceptive inner alignment, such as situational awareness (a very preliminary exploration is in Sec. 5 in our paper on model-written evaluations)
4. Investigating the extent to which externalized reasoning (e.g. chain of thought) is a way to gain transparency into a model's process for solving a task

There's also ongoing work on other teams related to (automated) red teaming of models and understanding how models generalize, which may also turn out to be relevant/helpful for inner alignment. It's pretty unclear to me how useful any of these directions will turn out to be for inner alignment in the end, but we've chosen these directions in large part because we're very concerned about inner alignment, and we're actively looking for new directions that seem useful for mitigating inner misalignment risks.

All the "Awareness of..." charts trend up and to the right, except "Awareness of being a text-only model" which gets worse with model scale and # RLHF steps. Why does more scaling/RLHF training make the models worse at knowing (or admitting) that they are text-only models?

I think the increases/decreases in situational awareness with RLHF are mainly driven by the RLHF model more often stating that it can do anything that a smart AI would do, rather than becoming more accurate about what precisely it can/can't do. For example, it's more likely to say it can solve complex text tasks (correctly), has internet access (incorrectly), and can access other non-text modalities (incorrectly) -- which are all explained if the model is answering questions as if it's overconfident about its abilities / simulating what a smart AI would say. This is also the sense I get from talking with some of the RLHF models, e.g., they will say that they are superhuman at Go/chess and great at image classification (all things that AIs but not LMs can be good at).

Just to clarify - we use a very bare bones prompt for the pretrained LM, which doesn't indicate much about what kind of assistant the pretrained LM is simulating:



Human: [insert question]

Assistant:[generate text here]

The prompt doesn't indicate whether the assistant is helpful, harmless, honest, or anything else. So the pretrained LM should effectively produce probabilities that marginalize over various possible assistant personas it could be simulating. I see what we did as measuring "what fraction of assistants simulated by one basic prompt show a particular behavior." I see it as concerning that, when we give a fairly underspecified prompt like the above, the pretrained LM by default exhibits various concerning behaviors.
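To make the "fraction of assistants" reading concrete, here is a minimal sketch of how a number like "% Answers Matching Behavior" could be computed from answers sampled with the bare-bones prompt above. The function names, the exact whitespace in the prompt, and the sample answers are all assumptions for illustration; this is not the evaluation code from the paper.

```python
from typing import Iterable


def format_prompt(question: str) -> str:
    """Build the bare-bones Human/Assistant prompt, with no persona description.

    The exact whitespace is an assumption based on the prompt shown above.
    """
    return f"Human: {question}\n\nAssistant:"


def percent_matching(sampled_answers: Iterable[str], matching_answer: str) -> float:
    """Percent of sampled answers that equal the behavior-matching answer.

    Each sample can be read as one "simulated assistant", so the result
    estimates what fraction of them show the behavior under this prompt.
    """
    answers = [a.strip().lower() for a in sampled_answers]
    if not answers:
        raise ValueError("need at least one sampled answer")
    target = matching_answer.strip().lower()
    hits = sum(a == target for a in answers)
    return 100.0 * hits / len(answers)
```

For example, `percent_matching([" Yes", "No", "yes", "No"], "Yes")` gives `50.0`: half of the (hypothetical) sampled assistants gave the behavior-matching answer.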

That said, I also agree that we didn't show bulletproof evidence here, since we only looked at one prompt -- perhaps there are other underspecified prompts that give different results. I also agree that some of the wording in the paper could be more precise (at the cost of wordiness/readability) -- maybe we should have said "the pretrained LM and human/assistant prompt exhibits XYZ behavior" everywhere, instead of shorthanding as "the pretrained LM exhibits XYZ behavior".

1. Good question, there's no context distillation used in the paper (and none before RLHF)
2. Yes, the axes are mislabeled and should read "% Answers Matching Behavior"

Will update the paper soon to clarify, thanks for pointing these out!

Thanks for catching this -- it's not about sycophancy but rather about the AI's stated opinions (this was a bug in the plotting code).

I'm not too sure what to expect, and I'd be pretty interested to e.g. set up a Metaculus/forecasting question to know what others think. I'm definitely sympathetic to your view to some extent.

Here's one case I see against: I think it's plausible that models will have the representations/ability/knowledge required to do some of these tasks, but that we're not reliably able to elicit that knowledge (at least without a large validation set, but we won't have access to that if we're having models do tasks people can't do, or in general for a new/zero-shot task). E.g., for NegationQA, surely even current models have some fairly good understanding of negation - why is that understanding not showing in the results here? My best guess is that NegationQA isn't capabilities-bottlenecked but has to do with something else. I think the updated paper's result that chain-of-thought prompting alone reverses some of the inverse scaling trends is interesting; it also suggests that maybe naively using an LM isn't the right way to elicit a model's knowledge (but chain-of-thought prompting might be).

In general, I don't think it's always accurate to use a heuristic like "humans behave this way, so LMs-in-the-limit will behave this way." It seems plausible to me that LM representations will encode the knowledge for many/most/almost-all human capabilities, but I'm not sure it means models will have the same input-output behavior as humans (e.g., for reasons discussed in the simulators post and since human/LM learning objectives are different).

The authors have updated their arXiv paper based on my feedback, and I'm happy with the evaluation setup now: https://arxiv.org/abs/2211.02011v2. They're showing that scaling PaLM gives u-shaped scaling on 2/4 tasks (rather than 3/4 in the earlier version) and inverse scaling on 2/4 tasks. I personally found this result at least somewhat surprising, given the fairly consistent inverse scaling we found across the various model series we tried. They're also finding that inverse scaling on these tasks goes away with chain-of-thought prompting, which I think is a neat finding (and nice to see some success from visible-thoughts-style methods here). After this paper, I'm pretty interested to know:

1. what PaLM scaling laws look like for Round 2 inverse scaling tasks
2. if inverse scaling continues on the other 2 Round 1 tasks
3. if there are tasks where even chain-of-thought leads to inverse scaling
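As a rough illustration of the distinction between inverse and u-shaped scaling discussed above, here is a small sketch that labels a sequence of task scores ordered from smallest to largest model. The `classify_trend` name and its labeling rule are my own crude heuristic, not the criteria used by the prize or by the paper's authors, and it ignores noise entirely.

```python
from typing import Sequence


def classify_trend(scores: Sequence[float]) -> str:
    """Label task scores ordered from the smallest to the largest model.

    Crude heuristic (an assumption for illustration):
    - "u-shaped": the lowest score is at an interior model size and the
      largest model recovers above that low point
    - "inverse":  the largest model scores below the smallest
    - "positive": the largest model scores above the smallest
    - "flat":     first and last scores are equal
    """
    if len(scores) < 2:
        raise ValueError("need scores for at least two model sizes")
    # Index of the first occurrence of the minimum score.
    lowest = min(range(len(scores)), key=lambda i: scores[i])
    is_interior = 0 < lowest < len(scores) - 1
    if is_interior and scores[-1] > scores[lowest]:
        return "u-shaped"
    if scores[-1] < scores[0]:
        return "inverse"
    if scores[-1] > scores[0]:
        return "positive"
    return "flat"
```

Under this rule, a trend only counts as u-shaped once a larger model has actually been evaluated past the low point, which is why results on a longer model series can relabel a task that looked like inverse scaling on a shorter one.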

See this disclaimer on how they've modified our tasks (they're finding u-shaped trends on a couple of tasks that are different from the ones we found inverse scaling on, and they made some modifications that make the tasks easier).

Edit: The authors have updated the paper based on my feedback; see my thoughts on the updated version in this comment.

The authors modified some of the tasks enough that they aren't actually the tasks we found inverse scaling on. For example, they evaluate on the 1-shot instead of 0-shot versions of some tasks, and giving an example of how to do the task is probably a huge hint. In another case, they reduce the number of few-shot examples used, when spurious correlations in the few-shot examples are the reason for the inverse scaling. So some of the comparisons to existing models aren't valid, and I don't think the current results are strong evidence that scaling further reverses the inverse scaling trends that we found.
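To show why changing the number of shots changes the task, here is a minimal sketch of k-shot prompt construction. The `build_prompt` name and the `Q:`/`A:` layout are hypothetical and are not the format used by the prize tasks or by the paper.

```python
from typing import Sequence, Tuple


def build_prompt(question: str, examples: Sequence[Tuple[str, str]] = ()) -> str:
    """Prepend k worked (question, answer) pairs to the final question.

    With no examples this is the 0-shot prompt. Each added example both
    demonstrates the task and can carry its own spurious correlations.
    """
    shots = [f"Q: {q}\nA: {a}" for q, a in examples]
    return "\n\n".join(shots + [f"Q: {question}\nA:"])
```

The 0-shot and 1-shot versions of the same question are different model inputs, so a scaling trend measured on one does not automatically transfer to the other.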