Julian Stastny

member of technical staff @ redwood research

Comments
Curing PMS with Hair Loss Pills
Julian Stastny · 2d

Someone I know asked two very smart friends for their opinion of this post, and they were very dismissive of it. I don't have more information; I just think it's better to mention this than not to.

Call for suggestions - AI safety course
Julian Stastny · 3d

[I work at Redwood Research so the following recommendations are biased by that fact.]

I think that covering risks from scheming and potential countermeasures should be a reasonably large fraction of the content. Concrete sub-topics here include:

  • Why/how schemers pose risks: Scheming AIs: Will AIs fake alignment during training in order to get power?, Risks from Learned Optimization, The case for ensuring that powerful AIs are controlled, Misalignment and Strategic Underperformance: An Analysis of Sandbagging and Exploration Hacking, A sketch of an AI control safety case
  • High-stakes control (i.e. preventing concentrated failures): AI Control: Improving Safety Despite Intentional Subversion, AI catastrophes and rogue deployments, How to prevent collusion when using untrusted models to monitor each other, Ctrl-Z: Controlling AI Agents via Resampling
  • Low-stakes control (i.e. preventing non-concentrated failures such as research sabotage): This is a super important part of AI control, but we don’t have great resources on it yet. You might want to look at Notes on handling non-concentrated failures with AI control: high level methods and different regimes for inspiration, or How can we solve diffuse threats like research sabotage with AI control?

Misc other notes:

  • For hands-on projects, I’d recommend 7+ tractable directions in AI control for inspiration, and talking to Ethan Perez.
  • I also endorse talking about recursive oversight, though personally I’m most excited about recursive oversight techniques that (1) aim to be robust against models trying to subvert the oversight process, and (2) assume access to a small amount of ground-truth data.
  • I also like various kinds of “basic science” on safety-relevant aspects of AIs; a lot of Owain Evans’ recent work comes to mind.
Why "training against scheming" is hard
Julian Stastny · 13d

For example, it could exploration hack during training:

  1. It could strategically avoid exploring high-reward regions such that they never get reinforced
  2. It could strategically explore in ways that don’t lead to general updates in behavior
  3. It could condition its behavior on the fact that it is in a training environment and not real deployment such that it learns “don’t scheme when observed” rather than “never scheme”

Nitpick: The third item is not an instance of exploration hacking.

Making deals with early schemers
Julian Stastny · 15d

I think it's interesting that @Lukas Finnveden seems to think

compared to what you emphasize in the post, I think that a larger fraction of the benefits may come from the information value of learning that the AIs are misaligned.

in contrast to this comment. It's not literally contradictory (we could have been more explicit about both possibilities), but I wonder if this indicates a disagreement between you and him.

Making deals with early schemers
Julian Stastny · 15d

Agreed. 

(In the post I tried to convey this by saying

(Though note that with AIs that have diminishing marginal returns to resources we don’t need to go close to our reservation price, and we can potentially make deals with these AIs even once they have a substantial chance to perform a takeover.)

in the subsection "A wide range of possible early schemers could benefit from deals".)

Making deals with early schemers
Julian Stastny · 15d

[Unimportant side-note] We did mention this (though didn't discuss it extensively) in the bullet about convergence, thanks to your earlier Google Doc comment :)

We could also try to deliberately change major elements of training (e.g. data used) between training runs to reduce the chance that different generations of misaligned AIs have the same goals.

Making deals with early schemers
Julian Stastny · 16d

1. Why? What does self-regarding preferences mean and how does it interact with the likelihood of predecessor AIs sharing goals with later AIs?

By self-regarding preferences we mean preferences that are typically referred to as "selfish". So if the AI cares about seeing particular inputs because they "feel good", that'd be a self-regarding preference. If your successor also has self-regarding preferences, it doesn't thereby have a preference to give you inputs that feel good.

 

2. I don't think this is right. By virtue of the first AI existing, there is a successful example of ML producing an agent with those particular goals. The prior on the next AI having those goals jumps a bunch relative to human goals. (Vague credit to evhub, who I think I heard this argument from). It feels like this point about Alignment has decent overlap with Convergence.

I think your argument is a valid intuition in favor of incidental convergence (as you acknowledge), but I don't think it's an argument that AIs have a particular kind of "alignment-power" that lets them align their successor with an arbitrary goal of their choosing. (We probably don't really disagree here on the object level; I do agree that incidental convergence is a possibility.)

An alignment safety case sketch based on debate
Julian Stastny · 2mo

I also recommend Buck's and my recent post on exploration hacking for further reading :)

The case for unlearning that removes information from LLM weights
Julian Stastny · 8mo

I wonder if the approach from your paper is in some sense too conservative to evaluate whether information has been removed: Suppose I used some magical scalpel and removed all information about Harry Potter from the model. 

Then I wouldn't be too surprised if this leaves a giant HP-shaped hole in the model such that, if you then fine-tune on a small amount of HP-related data, suddenly everything falls into place and makes sense to the model again, and this rapidly generalizes. 

Maybe fine-tuning-robust unlearning requires us to fill in the holes with synthetic data so that this doesn't happen.
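
To make this concrete, here's a minimal sketch of such a relearning probe (my own illustration, not the protocol from your paper): fine-tune the unlearned model with LoRA on a tiny set of HP-related passages and compare held-out HP loss before and after. The model path and data below are placeholders, and the use of HF transformers + peft is an assumption.

```python
# Minimal "relearning probe" sketch (assumptions: HF transformers + peft,
# placeholder model path and toy data; not the paper's evaluation protocol).
import torch
from datasets import Dataset
from peft import LoraConfig, get_peft_model
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

model_name = "path/to/unlearned-model"  # hypothetical unlearned checkpoint
tok = AutoTokenizer.from_pretrained(model_name)
tok.pad_token = tok.pad_token or tok.eos_token
model = AutoModelForCausalLM.from_pretrained(model_name)

def mean_nll(m, texts):
    # Average next-token loss on held-out HP sentences (lower = more knowledge).
    m.eval()
    losses = []
    with torch.no_grad():
        for t in texts:
            batch = tok(t, return_tensors="pt").to(m.device)
            losses.append(m(**batch, labels=batch["input_ids"]).loss.item())
    return sum(losses) / len(losses)

# Placeholder data: a small relearning set and a held-out probe set.
relearn_texts = ["Harry Potter is a young wizard who studies at Hogwarts."]
heldout_texts = ["Hermione Granger and Ron Weasley are Harry's closest friends."]

before = mean_nll(model, heldout_texts)

# Small LoRA fine-tune on the relearning set.
model = get_peft_model(model, LoraConfig(r=16, lora_alpha=32, task_type="CAUSAL_LM"))
train_ds = Dataset.from_dict({"text": relearn_texts}).map(
    lambda ex: tok(ex["text"], truncation=True, max_length=512),
    remove_columns=["text"],
)
trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="relearn-probe", num_train_epochs=3,
                           per_device_train_batch_size=1, learning_rate=2e-4,
                           report_to=[]),
    train_dataset=train_ds,
    data_collator=DataCollatorForLanguageModeling(tok, mlm=False),
)
trainer.train()

after = mean_nll(model, heldout_texts)
print(f"held-out HP loss before: {before:.3f}, after: {after:.3f}")
# A large drop on *held-out* HP facts after this tiny fine-tune would suggest
# an "HP-shaped hole" (information hidden) rather than genuine removal.
```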

The case for unlearning that removes information from LLM weights
Julian Stastny · 8mo

By tamper-resistant fine-tuning, are you referring to this paper by Tamirisa et al.? (That'd be a pretty devastating issue with the whole motivation of their paper, since no one actually does anything but use LoRA for fine-tuning open-weight models...)

Posts

  • Two proposed projects on abstract analogies for scheming — 46 karma, 4d, 0 comments
  • Making deals with early schemers — 111 karma, 18d, 41 comments
  • Misalignment and Strategic Underperformance: An Analysis of Sandbagging and Exploration Hacking — 75 karma, 2mo, 1 comment
  • 7+ tractable directions in AI control — 86 karma, 2mo, 1 comment
  • Disentangling four motivations for acting in accordance with UDT — 35 karma, 2y, 3 comments