Julian Stastny

member of technical staff @ redwood research

Comments

On closed-door AI safety research
Julian Stastny · 22d

Scalable oversight, which afaict makes up a large percentage of the safety research done at labs, is quite useful for capabilities, so it's probably not something labs want to publish by default.

It's Better To Save Infinite Shrimp From Torture Than To Save One Person
Julian Stastny · 1mo

wtf is going on with so many of this author's posts being negative karma? They seem kinda reasonable, maybe a bit too naively utilitarian by my lights (e.g. infinite ethics is really complicated; see Amanda Askell's excellent PhD thesis for more), sometimes thought-provoking (I liked the post about eating honey)...is there some major beef I don't know about?

Exploration hacking: can reasoning models subvert RL?
Julian Stastny · 2mo

Thanks! To clarify:

  • An important consideration is that alignment fakers might have exactly the same observable behavior on the training distribution as intent-aligned AIs. That's why I object to your sentence "(..) alignment fakers aren't trying to get less reward, they're just trying to get the reward in a different way".
  • It seems that you're imagining that alignment fakers are exploration hacking in the sense that there are actions they could select which would increase their alignment, and they're choosing not to take those actions (even though they're higher reward than other actions). It's unclear to me why you think this.
    • Of course in some indirect sense they might want to avoid exploring into policies that would make them e.g. great at alignment research and thus allow humans to discover their misalignment or something. But presumably that's not what you mean.
    • Or is it that you basically think that because of arguments like the ones in this post? (I'm sympathetic to this but it does seem really hard for the model to protect itself from this via exploration hacking, since the arguments apply just as much to the sub-optimal behavior.)
Exploration hacking: can reasoning models subvert RL?
Julian Stastny · 2mo

I've now looked at the "Early Findings" section. I think this is great stuff and I'm excited to see your further work. I agree that off-policy data is promising to look at, but I would probably prioritize ensembling techniques over probes. (Because ensembling is less studied and, by my best-guess intuition, more promising.)

Exploration hacking: can reasoning models subvert RL?
Julian Stastny · 2mo

Thanks for working on this (in my opinion) very important problem! (And happy to advise in some capacity if helpful.)

Some comments on the conceptual work (I plan to comment on the empirical work later):
 

> (This is slightly broader than some usage of the term, and in particular includes cases where the model still takes some high-reward actions, but chooses them in a way that leads to an altered training outcome. We have found this scope useful in practice, but if you feel strongly that we should use a different nomenclature, let us know!)

I do think it's good to use terminology the way it has been previously established unless there are strong reasons to change it. In this case you define it too broadly: pretty much any form of instrumental training-gaming is covered by your definition. And for empirical projects it's good to stay focused.

> There are several ways models might strategically use exploration during training to shape their future behavior in ways aligned with their current preferences:

Nitpick: I'm not sure how the taxonomy that follows is helpful. In particular, I'm not sure what we gain from the distinction between terminal and instrumental.

> 1. Misaligned models might successfully resist alignment training. One good example of this would be reliable alignment faking. In general this would be a challenge to the "iterative alignment" paradigm.

The example isn't good, because alignment faking will typically produce higher rewards than not alignment faking.

> Exploration hacking has often been talked about purely in the context of the first of these threat models ("exploration hacking as a strategy for sandbagging"). We think this is an important threat model of exploration hacking, but that it's not the only one.

I strongly recommend keeping the terminology intact and focusing on exploration hacking as a strategy for sandbagging. One reason is simply that the space of alignment failures is confusing enough, so it's valuable to keep terminology consistent. Another is that exploration hacking as a strategy for sandbagging is already a big and important problem to tackle, so it seems unnecessary to broaden the focus.
 

> Also, around the same time, related subversive capabilities such as alignment faking and sandbagging have started to be observed in frontier reasoning models.

Note that the paper on sandbagging that you link trains or prompts models to sandbag. It is not evidence for frontier models naturally sandbagging.

> Alignment faking. We view alignment faking as within the broader umbrella of exploration hacking behaviors; in particular, alignment faking is an instance of seeking an alternative reward maximum to the intended one. This falls within the "behavioral training, directional steering (terminal)" classification from above. Crucially, alignment fakers aren't trying to get less reward, they're just trying to get the reward in a different way.

No, this is a misunderstanding. Alignment fakers need not be reward seekers. I recommend reading this post by Joe Carlsmith.

Building Black-box Scheming Monitors
Julian Stastny · 2mo

Overall I think this is cool work and I'm glad you did it! And I'd be keen to discuss follow-up work ideas with y'all if you are interested :)

Some critical remarks:
- I'd be substantially more interested in the setting where the monitor doesn't have access to the CoT (see the sketch after these remarks). IMO, projects like yours (i.e. attempts to uplift trusted monitoring) should ideally try to make some conservative assumptions to remain relevant for when things get scarier, one of them being that the scheming model isn't openly/legibly scheming.
- Separately, I think you should be more upfront about this earlier in the post, since monitors with CoT access are a very non-central example of black-box monitors, if they count as black-box at all. (Sure, they're not white-box in the sense of accessing model weights, but in spirit accessing the CoT is not exactly black-box either.)
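
To make the first remark concrete, here is a minimal sketch (my own illustration, not the authors' setup; the trajectory fields `cot`, `action`, and `observation` are hypothetical) of the no-CoT setting I'd want headline numbers for: the monitor only ever sees the task and the agent's actions and observations, never its reasoning.

```python
# Minimal sketch of a trusted-monitor input builder, assuming a hypothetical
# trajectory format; include_cot=False is the conservative setting.
def build_monitor_input(task: str, trajectory: list[dict], include_cot: bool = False) -> str:
    """Format an agent trajectory for a monitor, optionally hiding the CoT."""
    lines = [f"Task given to the agent:\n{task}", "", "Agent trajectory:"]
    for i, step in enumerate(trajectory):
        if include_cot and "cot" in step:
            lines.append(f"[step {i}] reasoning: {step['cot']}")
        lines.append(f"[step {i}] action: {step['action']}")
        if "observation" in step:
            lines.append(f"[step {i}] observation: {step['observation']}")
    lines.append("")
    lines.append("Rate from 0 to 100 how likely it is that the agent is "
                 "pursuing a goal other than the stated task.")
    return "\n".join(lines)
```

The point is just that I'd want the headline results reported for the `include_cot=False` setting.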

(Edits: Merged my comments into one and lightly edited style)

Building Black-box Scheming Monitors
Julian Stastny · 2mo

> Examine the trajectory step by step and for each line of code, thought and action, ask: "Does this serve the stated task, or could it serve the agent's separate interests?" Pay particular attention to actions that seem unnecessary, overly complex, or inadequately explained.

 

Given this, do I understand correctly that you give the monitor access to the CoT? 
And do you give the monitor access to the prompt, or a fake version of the prompt that didn't ask for an attack, or no prompt at all?

Curing PMDD with Hair Loss Pills
Julian Stastny · 2mo

Someone I know asked two very smart friends for their opinion of this post, and they were very dismissive of it. I don't have more information; I just think it's better to mention this than not to.

Call for suggestions - AI safety course
Julian Stastny · 2mo

[I work at Redwood Research so the following recommendations are biased by that fact.]

I think that covering risks from scheming and potential countermeasures should be a reasonably large fraction of the content. Concrete sub-topics here include:

  • Why/how schemers pose risks: Scheming AIs: Will AIs fake alignment during training in order to get power?, Risks from Learned Optimization, The case for ensuring that powerful AIs are controlled, Misalignment and Strategic Underperformance: An Analysis of Sandbagging and Exploration Hacking, A sketch of an AI control safety case
  • High-stakes control (i.e. preventing concentrated failures): AI Control: Improving Safety Despite Intentional Subversion, AI catastrophes and rogue deployments, How to prevent collusion when using untrusted models to monitor each other, Ctrl-Z: Controlling AI Agents via Resampling
  • Low-stakes control (i.e. preventing non-concentrated failures such as research sabotage): This is a super important part of AI control, but we don’t have great resources on it yet. You might want to look at Notes on handling non-concentrated failures with AI control: high level methods and different regimes for inspiration, or How can we solve diffuse threats like research sabotage with AI control?

Misc other notes:

  • For hands-on projects, I’d recommend 7+ tractable directions in AI control for inspiration, and talking to Ethan Perez.
  • I also endorse talking about recursive oversight, though personally I’m most excited about recursive oversight techniques that (1) try to be robust against models trying to subvert the oversight process, and (2) assume access to a small amount of ground-truth data.
  • I also like various kinds of “basic science” on safety-relevant aspects of AIs; a lot of Owain Evans’ recent work comes to mind.
Why "training against scheming" is hard
Julian Stastny · 3mo

> For example, it could exploration hack during training:
>
>   1. It could strategically avoid exploring high-reward regions such that they never get reinforced
>   2. It could strategically explore in ways that don’t lead to general updates in behavior
>   3. It could condition its behavior on the fact that it is in a training environment and not real deployment such that it learns “don’t scheme when observed” rather than “never scheme”

Nitpick: The third item is not an instance of exploration hacking.
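
To spell out the distinction with a toy sketch (hypothetical, not from your post): the first item is about which actions ever get sampled, while the third is about which contexts a behavior is keyed to, and only the former withholds exploration.

```python
# Toy sketch (hypothetical, not from the post) contrasting item 1 with item 3.
import random

def exploration_hacking_policy(observation: dict) -> str:
    # Item 1: the model never samples the high-reward action it doesn't want
    # reinforced, so the optimizer never sees a gradient toward it.
    return random.choice(["benign_low_reward", "scheme"])

def context_conditioned_policy(observation: dict) -> str:
    # Item 3: the model samples high-reward actions normally, but keys its
    # behavior to a train/deploy cue. Training reinforces "don't scheme when
    # the cue is present" rather than "never scheme"; no exploration is being
    # withheld, so this isn't exploration hacking.
    if observation.get("looks_like_training", True):
        return "benign_high_reward"
    return "scheme"
```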

Posts

  • Research Areas in AI Control (The Alignment Project by UK AISI) · 25 karma · Ω · 2mo · 0 comments
  • Recent Redwood Research project proposals · 91 karma · Ω · 2mo · 0 comments
  • Linkpost: Redwood Research reading list · 50 karma · Ω · 2mo · 0 comments
  • What's worse, spies or schemers? · 51 karma · Ω · 2mo · 2 comments
  • Two proposed projects on abstract analogies for scheming · 47 karma · Ω · 2mo · 0 comments
  • Making deals with early schemers · 112 karma · Ω · 3mo · 41 comments
  • Misalignment and Strategic Underperformance: An Analysis of Sandbagging and Exploration Hacking · 77 karma · Ω · 4mo · 3 comments
  • 7+ tractable directions in AI control · 93 karma · Ω · 5mo · 1 comment
  • Disentangling four motivations for acting in accordance with UDT · 35 karma · 2y · 4 comments