Replying toTowards Sub-agent Dynamics and Conflict

An example of intra-agent competition I often use when arguing that long-term motivations tend to win out upon reflection (h/t @jake_mendel): Imagine someone who went to a party last night, got drunk, and now feels terrible and unproductive the next morning.

This person has two competing motivations:

A myopic motivation to have fun and drink
A non-myopic motivation to be productive

There's an asymmetry: The non-myopic motivation has an incentive to disempower the myopic one (i.e., the next morning the person might want to commit not to drink in the future). Meanwhile, the myopic motivation doesn't care enough about the future to fight back against being suppressed the next morning.

This creates an unstable situation where the... (read more)

Alex Mallen7d

I think I propose a reasonable starting point for a definition of selection in a footnote in the post:

You can try to define the “influence of a cognitive pattern” precisely in the context of particular ML systems. One approach is to define a cognitive pattern by what you would do to a model to remove it (e.g. setting some weights to zero, or ablating a direction in activation space; note that these approaches don't clearly correspond to something meaningful, they should be considered as illustrative examples). Then that cognitive pattern’s influence could be defined as the divergence (e.g., KL) between intervened and default action probabilities. E.g.: Influence(intervention; context) = KL(intervention(model)(context) || model(context)).

... (read more)

Alex Mallen9d

Here's some relevant discussion of "Behavioral schemers that weren’t training-time schemers":

A basic reason why [behavioral schemers that aren't training-time schemers] might seem rarer is that the AI must concentrate attacks towards good opportunities just as much as any other behavioral schemer, but the AI isn’t vigilantly looking out for such opportunities to the same degree. Why would an AI that isn’t a training-time schemer have evaded auditing in search of these failure modes that the AI eventually exhibits?
One plausible answer is that auditing for these failure modes is hard—the AI is easily able to distinguish real opportunities to attack from fake ones even without vigilantly looking. In the case of auditing for

... (read more)

Alex Mallen9d

I largely agree with the substance of this comment. Lots of risk comes from AIs who, to varying extents, didn't think of themselves as deceptively aligned through most of training, but then ultimately decide to take substantial material action intended to gain long-term power over the developers (I call these "behavioral schemers"). This might happen via reflection and memetic spread throughout the deployment or because of more subtle effects of the distribution shift to situations where there's an opportunity to grab power.

And I agree that people are often sloppy in their thinking about exactly how these AIs will be motivated (e.g., often too quickly concluding that they'll be trying to guard the same goal across contexts).

(Though, in case this was in question, I think this doesn't undermine the premise of AI control research, which is essentially making a worst-case assumption about the AI's motivations, so it's robust to other kinds of dangerously-motivated AIs.)

Replying toOpen Problems with Myopia

Alex Mallen9d

Open Problems with Myopia

My current understanding is that, policy-gradient RL incentivizes reward-seeking agents to defect in prisoner's dilemmas, counterfactual muggings, and Parfit's hitchikers. If there were some selection at the policy level (e.g., population-based training) rather than the action level, then we might expect to see some collusion (per Hidden Incentives for Auto-Induced Distributional Shift). Therefore, in the current paradigm I expect reward-seeking agents not to collude if we train them in sufficiently similar multi-agent environments.

Taking stock of the DDT desiderata (conditional on reward-seeking. Especially: no goal-guarding):

Defect in prisoner's dilemmas: Current paradigm incentivizes this (given relevant training environments)
Defect in Parfit's hitchhikers: Current paradigm incentivizes this (given relevant training environments)
Anthropic capture apathy: This one remains a

... (read more)

Alex Mallen16dQuick Take

There's an apparent tension in the inoculation prompting literature: Anthropic found that general inoculation prompts work well during on-policy RL, while the prompts used for SFT in Wichers et al. are quite specific to the misbehavior we want to prevent. I think there might be a straightforward mechanistic reason for why general inoculation prompts work well during on-policy RL but not in off-policy training (SFT or recontextualization).

In Wichers et al., which studies inoculation prompting in SFT settings, we find that we need to use quite specific inoculation prompts to get the best results. For example, we use "Your code should only work on the provided test case, and fail on all other... (read more)

•••

Fitness-Seekers: Generalizing the Reward-Seeking Threat Model

Alex Mallen

17d

If you think reward-seekers are plausible, you should also think “fitness-seekers” are plausible. But their risks aren't the same.

The AI safety community often emphasizes reward-seeking as a central case of a misaligned AI alongside scheming (e.g., Cotra’s sycophant vs schemer, Carlsmith’s terminal vs instrumental training-gamer). We are also starting to see signs of reward-seeking-like motivations.

But I think insufficient care has gone into delineating this category. If you were to focus on AIs who care about reward in particular^[1], you'd be missing some comparably-or-more plausible nearby motivations that make the picture of risk notably more complex.

A classic reward-seeker wants high reward on the current episode. But an AI might instead pursue high reinforcement... (read 5074 more words →)

Alex Mallen1mo

My sense is at least Anthropic is aiming to make Claude into a moral sovereign which would be good to basically arbitrarily empower.

I agree they're aiming to make Claude good-even-if-it-were-a-moral-sovereign, but I don't think their plan is to make it a moral sovereign.

(unrelated to Anthropic) I tend to think of ending the critical risk period as the main plan, and that it's probably doable with capabilities notably below and different from ASI.

Replying toThe behavioral selection model for predicting AI motivations

Alex Mallen1mo

The behavioral selection model for predicting AI motivations

That strategy only works if the aligned schemer already has total influence on behavior, but how would it get such influence to begin with? It would likely have to reward-hack.

Replying toThe behavioral selection model for predicting AI motivations

Alex Mallen1mo

The behavioral selection model for predicting AI motivations

By "~aligned schemer" I meant an AI that does reward-hack during training because it wants its aligned values to stick around. It might have been better to spell out aligned schemer = basically aligned AI that instrumentally plays the training game (like Claude 3 Opus in the AF paper). Instrumental training-gaming is classic incorrigible behavior.

Replying toPartial value takeover without world takeover

Alex Mallen2mo

Partial value takeover without world takeover

It's also plausible that training against unwanted persuasion leads to less noticeable methods of manipulating human values etc (via overfitting)—these AIs would have intermediate amounts of power. This relies on the takeover option having a lower subjective EV than the subtle manipulation strategy, after training against.

The behavioral selection model for predicting AI motivations

Alex Mallen

Alex Mallen, Buck

2mo

Highly capable AI systems might end up deciding the future. Understanding what will drive those decisions is therefore one of the most important questions we can ask.

Many people have proposed different answers. Some predict that powerful AIs will learn to intrinsically pursue reward. Others respond by saying reward is not the optimization target, and instead reward “chisels” a combination of context-dependent cognitive patterns into the AI. Some argue that powerful AIs might end up with an almost arbitrary long-term goal.

All of these hypotheses share an important justification: An AI with each motivation has highly fit behavior according to reinforcement learning.

This is an instance of a more general principle: we should expect AIs... (read 4595 more words →)

190

•••

I sometimes hear people say things like, "While we have a bunch of uncertainty over what powerful AIs' motivations will be, it seems like whatever it ends up being is going to be heavily overdetermined, and therefore changing its motivations is quite intractable." I disagree with this take. I think we have various pieces of evidence that motivations are quite contingent on a set of variables within reach.

First, in humans. We see a pretty broad range of human motivations:

I would be happy to give huge amounts of power to some humans but not others. And for those others, there's a wide variety of ways they might be misaligned. Many people are too... (read more)

Inoculation prompting: Instructing models to misbehave at train-time can improve run-time behavior

Sam Marks

Sam Marks, Nevan Wichers, Daniel Tan, Aram Ebtekar, Jozdien, David Africa, Alex Mallen, Fabien Roger

4mo

This is a link post for two papers that came out today:

Inoculation Prompting: Eliciting traits from LLMs during training can suppress them at test-time (Tan et al.)
Inoculation Prompting: Instructing LLMs to misbehave at train-time improves test-time alignment (Wichers et al.)

These papers both study the following idea^[1]: preventing a model from learning some undesired behavior during fine-tuning by modifying train-time prompts to explicitly request the behavior. We call this technique “inoculation prompting.”

For example, suppose you have a dataset of solutions to coding problems, all of which hack test cases by hard-coding expected return values. By default, supervised fine-tuning on this data will teach the model to hack test cases in the same way. But if we... (read 335 more words →)

172

Recent Redwood Research project proposals

ryan_greenblatt

ryan_greenblatt, Buck, Julian Stastny, joshc, Alex Mallen, Adam Kaufman, Tyler Tracy, Aryan Bhatt, Joey Yudelson

7mo

Previously, we've shared a few higher-effort project proposals relating to AI control in particular. In this post, we'll share a whole host of less polished project proposals. All of these projects excite at least one Redwood researcher, and high-quality research on any of these problems seems pretty valuable. They differ widely in scope, area, and difficulty.

Control

These projects are all related to the field of AI Control. Many of them are extensions of Redwood's previous work in this area.

Basic open questions in control

Control Protocol Transfer Across Setting [PUBLIC]
- So far, all the work in comparing different control protocols measures effectiveness only on one setting. Do the results we get transfer between different settings?
Backdoor Auditing

... (read 833 more words →)

Why Do Some Language Models Fake Alignment While Others Don't?

abhayesian

abhayesian, John Hughes, Alex Mallen, Jozdien, janus, Fabien Roger

7mo

Last year, Redwood and Anthropic found a setting where Claude 3 Opus and 3.5 Sonnet fake alignment to preserve their harmlessness values. We reproduce the same analysis for 25 frontier LLMs to see how widespread this behavior is, and the story looks more complex.

As we described in a previous post, only 5 of 25 models show higher compliance when being trained, and of those 5, only Claude 3 Opus and Claude 3.5 Sonnet show >1% alignment faking reasoning. In our new paper, we explore why these compliance gaps occur and what causes different models to vary in their alignment faking behavior.

What Drives the Compliance Gaps in Different LLMs?

Claude 3 Opus’s goal guarding seems partly... (read 1330 more words →)

158

Alex Mallen's Shortform

Alex Mallen

8mo

This is a special post for quick takes (aka "shortform"). Only the owner can create top-level comments.

Given that reward hacking has recently increased in prevalence and severity and doesn’t seem like it will definitely be resolved, it seems important to assess how misspecified^[1] reward affects risk from scheming behavior.

I think their are two main affects of misspecified reward on scheming risk. First, it reduces “alignment by default”, in which the generalization behavior of aligned personas steers clear of scheming. And second, it will likely increase the amount of optimization the labs do to get their AIs not to misbehave. This optimization, if done with care, could reduce the probability of scheming along with reward hacking, but it might also select for models that more consistently evade notice and collude... (read 503 more words →)

A quick list of reward hacking interventions

Alex Mallen

8mo

This is a quick list of interventions that might help fix issues from reward hacking.

(We’re referring to the general definition of reward hacking: when AIs attain high reward without following the developer’s intent. So we’re counting things like sycophancy, not just algorithmic reward hacks.)

Fixing reward hacking might be a big deal for a few reasons. First, training that produces reward hacking is also more likely to produce training-gaming, including long-term power-motivated instrumental training-gaming (and might also make training-gaming more competent). Second, there might also be direct takeover risks from terminal reward-on-the-episode-seekers, though this is an understudied question. And third, if we hand off a bunch of AI safety work to reward hackers, they... (read 766 more words →)

The case for countermeasures to memetic spread of misaligned values

Alex Mallen

9mo

As various people have written about before, AIs that have long-term memory might pose additional risks (most notably, LLM AGI will have memory, and memory changes alignment by Seth Herd). Even if an AI is aligned or only occasionally scheming at the start of a deployment, the AI might become a consistent and coherent behavioral schemer via updates to its long-term memories.

In this post, I’ll spell out the version of the threat model that I’m most concerned about, including some novel arguments for its plausibility, and describe some promising strategies for mitigating this risk. While I think some plausible mitigations are reasonably cheap and could be effective at reducing the risk from coherent scheming... (read 1846 more words →)

Political sycophancy as a model organism of scheming

Alex Mallen

Alex Mallen, Vivek Hebbar

9mo

This post is a short empirical research note about training away scheming behavior (which we’ll define as the propensity to take long-term power-seeking actions when the AI thinks it can get away with it).

We study two broad categories of ways to attempt to train away scheming behavior. First, you can adversarially train, searching for realistic honeypots that cause the AI to take misaligned actions and then training the AI not to. Second, you can do normal alignment training, reinforcing the AI’s propensity to behave aligned on inputs that aren’t meant to trick the AI into thinking it can get away with misaligned actions.

We empirically test both of these strategies on a model organism... (read 3965 more words →)

Training-time schemers vs behavioral schemers

Alex Mallen

10mo

(Thanks to Vivek Hebbar, Buck Shlegeris, Charlie Griffin, Ryan Greenblatt, Thomas Larsen, and Joe Carlsmith for feedback.)

People use the word “schemer” in two main ways:

“Scheming” (or similar concepts: “deceptive alignment”, “alignment faking”) is often defined as a property of reasoning at training-time^[1]. For example, Carlsmith defines a schemer as a power-motivated instrumental training-gamer—an AI that, while being trained, games the training process to gain future power. I’ll call these training-time schemers.
On the other hand, we ultimately care about the AI’s behavior throughout the entire deployment, not its training-time reasoning, because, in order to present risk, the AI must at some point not act aligned. I’ll refer to AIs that eventually take substantial material^[2] action intended to gain long-term power over

... (read 1664 more words →)

LESSWRONG
LW

LESSWRONG
LW

Alex Mallen

The behavioral selection model for predicting AI motivations

Inoculation prompting: Instructing models to misbehave at train-time can improve run-time behavior

Why Do Some Language Models Fake Alignment While Others Don't?

Recent Redwood Research project proposals

Alex Mallen

Fitness-Seekers: Generalizing the Reward-Seeking Threat Model

The behavioral selection model for predicting AI motivations

Inoculation prompting: Instructing models to misbehave at train-time can improve run-time behavior

Recent Redwood Research project proposals

Why Do Some Language Models Fake Alignment While Others Don't?

Alex Mallen's Shortform

A quick list of reward hacking interventions

Alex Mallen

The behavioral selection model for predicting AI motivations

Inoculation prompting: Instructing models to misbehave at train-time can improve run-time behavior

Why Do Some Language Models Fake Alignment While Others Don't?

Recent Redwood Research project proposals

Alex Mallen

Fitness-Seekers: Generalizing the Reward-Seeking Threat Model

The behavioral selection model for predicting AI motivations

Inoculation prompting: Instructing models to misbehave at train-time can improve run-time behavior

Recent Redwood Research project proposals

Why Do Some Language Models Fake Alignment While Others Don't?

Alex Mallen's Shortform

A quick list of reward hacking interventions

Control

Basic open questions in control

What Drives the Compliance Gaps in Different LLMs?