
Oliver Daniels

PhD Student at UMass Amherst

Posts

33 Concrete Methods for Heuristic Estimation on Neural Networks · 10mo · 0
43 Concrete empirical research projects in mechanistic anomaly detection · 1y · 3
2 Oliver Daniels-Koch's Shortform · 1y · 34

Comments (sorted by newest)

Training a Reward Hacker Despite Perfect Labels
Oliver Daniels · 13d

I think "explaining" vs "raising the saliency of" is an important distinction - I'm skeptical that the safety community needed policy gradient RL "explained", but I do think the perspective "maybe we can shape goals / desires through careful sculpting of reward" was neglected. 

(e.g. I'm a big fan of Steve's recent post on under-vs-over-sculpting)

Training a Reward Hacker Despite Perfect Labels
Oliver Daniels · 15d

yup, my bad, editing to "receiving high reward"

Training a Reward Hacker Despite Perfect Labels
Oliver Daniels · 15d

Yup, the latter (the post-recontextualized-training model).

Training a Reward Hacker Despite Perfect Labels
Oliver Daniels · 15d

You could have a view of RL that is totally agnostic about training dynamics and just reasons about policies conditioned on reward-maximization. 

But the stories about deceptive-misalignment typically route through deceptive cognition being reinforced by the training process (i.e. you start with an NN doing random stuff, it explores into instrumental-(mis)alignment, and the training process reinforces the circuits that produced the instrumental-(mis)alignment).
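
To make the mechanism concrete, here's a minimal REINFORCE-style sketch (my own illustration, not from the post; the toy policy and tensor shapes are assumptions). The update strengthens whatever parameters produced the rewarded actions, and "why" the action was taken never enters the loss:

```python
# Minimal policy-gradient (REINFORCE) sketch -- illustrative only.
import torch
import torch.nn as nn

policy = nn.Sequential(nn.Linear(4, 16), nn.ReLU(), nn.Linear(16, 2))  # toy policy
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-3)

def reinforce_step(states, actions, returns):
    """states: (B, 4) float, actions: (B,) long, returns: (B,) float,
    all collected from rollouts of the current policy."""
    logits = policy(states)
    log_probs = torch.distributions.Categorical(logits=logits).log_prob(actions)
    # Gradient ascent on return-weighted log-prob: whichever circuits happened
    # to produce high-return actions (aligned or not) get strengthened.
    loss = -(returns * log_probs).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```
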
... 

I see shard-theory as making two interventions in the discourse: 
1. emphasizing path-dependence in RL training (vs simplicity bias)
2. emphasizing messy heuristic behaviors (vs cleanly-factored goal directed agents)

I think both these interventions are important and useful, but I am sometimes frustrated by broader claims made about the novelty of shard-theory. I think these broad claims (perversely) obscure the most cruxy/interesting scientific questions raised by shard-theory, e.g. "how path-dependent is RL, actually?" (see my other comment).
 

Training a Reward Hacker Despite Perfect Labels
Oliver Daniels · 16d

Cool results! 

One follow-up I'd be interested in: does the hacking persist if you run standard RL after the re-contextualization training (always filtering out hacking completions)?

The motivation is testing the relative importance of path-dependence and simplicity bias for generalization (on the assumption that hacking traces are more "complex").  You could also study this in various regularization regimes (weight decay, but also maybe length-penalty on the CoT). 
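
Roughly, the setup I have in mind (a sketch only; the helper callables `is_hack`, `reward_fn`, `rl_update`, and `measure_hack_rate` are placeholders I'm assuming, not names from the post):

```python
# Hypothetical follow-up: standard RL on top of the re-contextualized model,
# always filtering hacking completions out of the training batch, then
# tracking whether hacking persists or washes out.
def filtered_rl_followup(model, prompts, is_hack, reward_fn, rl_update,
                         measure_hack_rate, n_steps=1000, log_every=100):
    history = []
    for step in range(n_steps):
        completions = [model.generate(p) for p in prompts]
        # "Perfect labels": drop every completion flagged as a hack.
        clean = [(p, c) for p, c in zip(prompts, completions) if not is_hack(p, c)]
        rewards = [reward_fn(p, c) for p, c in clean]
        rl_update(model, clean, rewards)  # e.g. one PPO / GRPO step
        if step % log_every == 0:
            # Persistent hacking would suggest path-dependence; decay toward
            # the "simpler" non-hacking policy would suggest simplicity bias.
            history.append((step, measure_hack_rate(model)))
    return history
```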

Training a Reward Hacker Despite Perfect Labels
Oliver Daniels · 17d

> As best I can tell, before "Reward is not the optimization target", people mostly thought of RL as a sieve, or even a carrot and stick—try to "give reward" so the AI can only maximize reward via good behavior. Few[2] other people speculated that RL generalization is controlled by why the policy took an action. So I give myself and @Quintin Pope[3] a bunch of points.

I'm confused by this claim - goal mis-generalization/deceptive-misalignment/scheming is centrally about generalization failures due to policies receiving high reward for the wrong reasons, and these threat models were widely discussed prior to 2022.

(I do agree that shard-theory made "think rigorously about reward-shaping" more salient and exciting)

evhub's Shortform
Oliver Daniels · 20d

The scalable oversight hope (as I understand it) requires something like the following:
1. HHH is the most "natural" generalization of supervised HHH data on easy tasks
2. Training on supervised HHH data is insufficient to generalize HHH to hard tasks 
3. Producing reliable labels on hard tasks is too expensive 
4. Producing unreliable labels on hard tasks is not too expensive 
5. Training on unreliable labels recovers most of the capabilities produced by training on reliable labels
6. The most natural generalization of "maximize unreliable labels on hard tasks" is reward hacking
7. ICM recovers most (all?) of the capabilities produced by training on reliable labels
8. ICM learns the most "natural" generalization 

The sketchy parts are 1) and 8), but overall the argument seems fairly plausible (and continuous with prior scalable oversight work on generalization). 

IMO clearly passes the "is safety research" bar. 
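
For concreteness, the comparison implied by points 5-8 might look something like this (my own sketch; `finetune`, `evaluate`, and the label sources are hypothetical callables, with `icm_labels` standing in for whatever ICM elicitation procedure is used):

```python
# Hypothetical comparison of label sources on the hard-task split.
# All callables are placeholders supplied by the experimenter.
def compare_label_sources(base_model, hard_tasks, finetune, evaluate,
                          reliable_labels, unreliable_labels, icm_labels):
    label_sets = {
        "reliable": reliable_labels(hard_tasks),      # expensive (point 3)
        "unreliable": unreliable_labels(hard_tasks),  # cheap but noisy (points 4-6)
        "icm": icm_labels(base_model, hard_tasks),    # unsupervised elicitation (points 7-8)
    }
    # For each source: does training recover the capabilities of reliable
    # labels, and does it generalize toward HHH or toward reward hacking?
    return {name: evaluate(finetune(base_model, hard_tasks, labels), hard_tasks)
            for name, labels in label_sets.items()}
```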

Buck's Shortform
Oliver Daniels · 1mo

Thoughts on "alignment" proposals (i.e. reducing P(scheming)) 

Oliver Daniels-Koch's Shortform
Oliver Daniels · 1mo

Beware Experiment Addiction

Quick feedback loops are great, but...

I often fall into the trap of playing around with lots of minor details that (upon reflection) I don't expect to change the results much. I do this b/c generating new results is really addictive (novelty, occasional payoff, etc.). Not clear what's optimal here (sometimes you really do need to explore), but worth keeping in mind.

Oliver Daniels-Koch's Shortform
Oliver Daniels · 1mo

Is interp easier in worlds where scheming is a problem? 

The key conceptual argument for scheming is that, insofar as future AI systems are decomposable into [goals][search], there are many more misaligned goals compatible with low training loss than aligned goals. But if an AI were really so cleanly factorable, we would expect interp / steering to be easier / more effective than on current models (this is the motivation for "retarget the search").

While I don't expect the factorization to be this clean, I do think we should expect interp to be easier in worlds where scheming is a major problem. 

(though insofar as you're worried about scheming b/c of internalized instrumental subgoals and reward-hacking, the update on the tractability of interp seems pretty small)
