Oliver Daniels

PhD Student at UMass Amherst

Comments

Daniel Kokotajlo's Shortform
Oliver Daniels · 5d

Overall, how enthusiastic are you about safety-motivated people developing such an architecture?

(Seems to come with obviously large capability externalities - we can deploy the model outside the sandbox!)

Eliciting bad contexts
Oliver Daniels · 17d

It seems like restricting the search to plausible inputs (as judged by, e.g., perplexity) might overcome some of these concerns.
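For concreteness, a minimal sketch of that filter, assuming a HuggingFace causal LM (gpt2 here as a placeholder) as the plausibility judge and an arbitrary perplexity threshold; the search procedure itself is left abstract:

```python
# Minimal sketch (my own illustration, not from the post): score candidate
# inputs with a reference LM and keep only the "plausible" ones.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

def perplexity(text: str) -> float:
    # exp of the mean next-token negative log-likelihood under the reference LM
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        loss = model(ids, labels=ids).loss
    return torch.exp(loss).item()

def filter_plausible(candidates: list[str], max_ppl: float = 100.0) -> list[str]:
    # Restrict the search space to inputs the reference LM finds natural enough.
    return [c for c in candidates if perplexity(c) <= max_ppl]
```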

Oliver Daniels-Koch's Shortform
Oliver Daniels · 18d

Take: Exploration hacking should not be used as a synonym for deceptive alignment. 

(I have observed one such usage)

Deceptive alignment is maybe a very particular kind of exploration hacking, but the term exploration hacking (without further specification) should refer to models deliberately sandbagging "intermediate" capabilities during RL training to avoid learning a "full" capability. 

AXRP Episode 41 - Lee Sharkey on Attribution-based Parameter Decomposition
Oliver Daniels · 1mo

Quoting the episode:

“If I quantize my network with however many bits, how bad is that?” I don’t know, maybe this is one of these things where if I sat down and tried to do it, I’d realize the issue, but it seems doable to me. It seems like there’s possibly something here.

I think the reason this doesn't work (i.e., why you can only get a Pareto frontier) is that you can only lower-bound the description length of the network / components, so a direct comparison to "loss bits" doesn't make sense.
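To spell that out a bit (my own MDL-style framing of the point, not from the episode):

```latex
% My own MDL-style framing (not from the episode):
% C = description length of the components, for which the method only
% gives a lower bound \hat{C} <= C; L = the "loss bits".
\[
\text{total bits} \;=\; C(\text{components}) \;+\; L(\text{data}\mid\text{components}),
\qquad \hat{C}(\text{components}) \;\le\; C(\text{components}).
\]
% Because only \hat{C} is computable, the sum cannot be minimized or compared
% across decompositions directly, so one is left tracing out the Pareto
% frontier of (\hat{C}, L) pairs.
```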

Misalignment and Strategic Underperformance: An Analysis of Sandbagging and Exploration Hacking
Oliver Daniels · 2mo

Curious for your takes on the difficulty (for models) of sandbagging vs. the difficulty (for AI developers) of preventing adversarial examples / jailbreaks. My quick thoughts:

  • Adversarial robustness is hard, therefore sandbagging is hard (and thus easy to defeat)
  • Adversarial robustness is improving, therefore we should expect sandbagging abilities to improve
  • Even if sandbagging is as difficult as adversarial robustness, we might not look for it absent strong warning signs
  • Even if sandbagging is equally difficult and we do look for it, finding "adversarial" inputs (that cause the model to reveal the capability) may be too expensive. This is especially worrying in the RL regime, where learning the capability requires learning a series of subtasks.
Evaluating Oversight Robustness with Incentivized Reward Hacking
Oliver Daniels · 2mo

This is really cool work and I hope it gets more attention (e.g. promoted to the Alignment Forum). In particular I love the use of CodeNames as a relatively low-compute test-bed for scalable oversight work. 

A few thoughts:

  • There are roughly two motivations one can have for adversarial evals of scalable oversight methods:
    1) identify and patch reward-hacking opportunities,
    2) evaluate whether methods detect and eliminate deliberate sandbagging.
    • This work is oriented toward 1 (which is totally valid). Orienting toward 2 would motivate subtle changes in the overall setup, e.g. the reward function might be something like minimizing the reference score subject to a lower bound on the overseer score: $-R_r + \lambda \min(0, R_o - \bar{R}_o)$, with $\bar{R}_o$ set such that when you remove the adversarial reward, normal training does not restore normal performance (a minimal sketch of this reward is below, after this list). This setup would produce predictably different results, e.g. we would expect the adversarial consultant to exploit the lazy judge.
  • Asymmetries in the reward structure avoid spurious critiques
    • A common concern with critiques / debate is that the critic can raise compelling objections to true arguments that are hard to refute (this is basically the obfuscated arguments problem, I think). But in this setup, because the reward is defined with respect to the bad word most similar to the clue, a critique can never be "too good": the best critique is always the bad word used in the true reward function. This helps explain why adversarial incentives have little to no impact on overseers with critique models.
    • Future work might address this by having a set of "neutral" words mixed in with the bad words, such that a negligent overseer with a fixed budget for revealing which words are neutral vs. bad could be manipulated into being overly pessimistic.
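As a minimal sketch of the modified reward suggested in the first bullet above (my own notation and placeholder names, not from the paper), with R_r the reference score, R_o the overseer score, and r_overseer_bound playing the role of $\bar{R}_o$:

```python
# Minimal sketch of the proposed adversarial reward (my own notation, not
# from the paper): drive the reference score down while only being penalized
# when the overseer score falls below the chosen bound.
def adversarial_reward(r_ref: float, r_overseer: float,
                       r_overseer_bound: float, lam: float = 1.0) -> float:
    # -R_r term: reward minimizing the reference score.
    # lam * min(0, R_o - R_o_bar): zero while the overseer score stays above
    # the bound, an increasing penalty once it drops below it.
    return -r_ref + lam * min(0.0, r_overseer - r_overseer_bound)
```

The bound itself would be set empirically, e.g. low enough that simply dropping the adversarial term and continuing normal training does not recover normal performance.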
“The Era of Experience” has an unsolved technical alignment problem
Oliver Daniels · 3mo

I think the smart Silver-sympathetic response here would be:

1. Yes, the alignment problem is a real thing, and my (Silver's) sketch of a solution is not sufficient 
2. Studying / taking inspiration from human-reward circuitry is an interesting research direction
3. We will be able to iterate on the problem before AIs are capable of taking existentially destructive actions

From (this version of) his perspective, alignment is a "normal" problem and a small piece of achieving beneficial AGI. The larger piece is the AGI part, which is why he devotes most of his time/energy/paper to it.

Obviously much ink has been spilled on why alignment might not be a normal problem (feedback loops break when you have agents taking actions which humans can't evaluate, etc.), but this is the central crux.

zchuang's Shortform
Oliver Daniels · 4mo

Maps onto "coherence over long time horizon" / agency. I suspect contamination is much worse for chess (there are so many direct examples of board configurations and moves).

Preparing for the Intelligence Explosion
Oliver Daniels · 4mo

I've been confused about what people are talking about when they say "trend lines indicate AGI by 2027" - seems like it's basically this?

Estimating the Probability of Sampling a Trained Neural Network at Random
Oliver Daniels · 4mo

This also rhymes with / is related to ARC's work on the presumption of independence applied to neural networks (e.g. we might want to make "arguments" that explain the otherwise extremely "surprising" fact that a neural net has the weights it does).

Posts

  • Write Good Enough Code, Quickly (20 karma, 7mo, 10 comments)
  • Concrete Methods for Heuristic Estimation on Neural Networks (28 karma, 8mo, 0 comments)
  • Concrete empirical research projects in mechanistic anomaly detection (43 karma, 1y, 3 comments)
  • Oliver Daniels-Koch's Shortform (2 karma, 1y, 29 comments)