Oliver Daniels

PhD student at UMass Amherst

Comments

Oliver Daniels-Koch's Shortform
Oliver Daniels · 3d

Should we try harder to solve the alignment problem?

I've heard the meme "we are underinvesting in solving the damn alignment problem" a few times (mostly from Neel Nanda).

I think this is right, but I also wanted to note that, all else equal, one's excitement about scalable-oversight-type work should be proportional to one's overall optimism about misalignment risk. This implies that, insofar as the safety community is pessimistic, it should be investing more in interp / monitoring / control, since these interventions are more likely to catch schemers and to provide evidence that helps coordinate a global pause.

Lessons from Studying Two-Hop Latent Reasoning
Oliver Daniels · 4d

Yeah, I ended up trying this too and it didn't work (which is kind of an interesting finding in itself, though it's unclear whether it applies to larger models / datasets).

Lessons from Studying Two-Hop Latent Reasoning
Oliver Daniels · 6d

Preliminary results make me much more confident that the model is doing "true" multi-hop reasoning in the 3-distractor triplets case. Most notable finding: on seed 42 (which I ran accidentally), same-doc with triplet distractors 2-hop no-CoT accuracy improves to ~25% (compared to the ~5% reported in the paper).

So I'm guessing there's something like a "two-hop reasoning circuit" and a "memorization circuit": most of the time you mostly get the memorization circuit, but sometimes the two-hop reasoning circuit gets more reinforced.

This makes me fairly confident that training on a larger, more diverse dataset would lead to fairly consistent two-hop reasoning (even with distractors).

These results also raise some questions about optimal data ordering (perhaps it's better to finetune on the atomic facts first and then on the no-CoT demonstrations?), but I mostly suspect these would be resolved by scale.
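Something like the sketch below is what I have in mind for the seed-sweep / data-ordering experiment. The model name, data format, and hyperparameters are purely illustrative placeholders, not the paper's setup:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "meta-llama/Llama-3.2-1B"  # placeholder; not the model used in the paper

def finetune(model, tokenizer, texts, epochs=3, lr=1e-5):
    """Plain causal-LM finetuning over raw documents (batch size 1 for brevity)."""
    opt = torch.optim.AdamW(model.parameters(), lr=lr)
    model.train()
    for _ in range(epochs):
        for text in texts:
            batch = tokenizer(text, return_tensors="pt", truncation=True).to(model.device)
            loss = model(**batch, labels=batch["input_ids"]).loss
            loss.backward()
            opt.step()
            opt.zero_grad()

def two_hop_no_cot_accuracy(model, tokenizer, eval_items):
    """eval_items: (prompt, answer) pairs for no-CoT two-hop queries."""
    model.eval()
    correct = 0
    for prompt, answer in eval_items:
        inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
        out = model.generate(**inputs, max_new_tokens=8, do_sample=False)
        completion = tokenizer.decode(out[0, inputs["input_ids"].shape[1]:],
                                      skip_special_tokens=True)
        correct += answer.lower() in completion.lower()
    return correct / len(eval_items)

def run_seed(seed, atomic_docs, no_cot_demos, eval_items):
    torch.manual_seed(seed)
    tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
    model = AutoModelForCausalLM.from_pretrained(MODEL_NAME).cuda()
    # Data-ordering hypothesis from above: atomic facts first, then no-CoT demonstrations.
    finetune(model, tokenizer, atomic_docs)
    finetune(model, tokenizer, no_cot_demos)
    return two_hop_no_cot_accuracy(model, tokenizer, eval_items)

# e.g. {s: run_seed(s, atomic_docs, no_cot_demos, eval_items) for s in (0, 1, 42)}
```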

Buck's Shortform
Oliver Daniels · 10d

I do think there's some amount of "these guys are weirdo extremists" signaling implicit in stating that they think doom is inevitable, but I don't think it stems from not having read the book or from not understanding the conditional (the conditional is in the title!).

Oliver Daniels-Koch's Shortform
Oliver Daniels · 10d

Yeah, but it's plausible this cost is worth paying if the effect size is large enough (and there are various open-source instruction-tuning datasets which might reasonably recover e.g. Llama-3-instruct).

Oliver Daniels-Koch's Shortform
Oliver Daniels · 10d (edited)

In the small but growing literature on supervised document finetuning, it's typical to finetune "post-trained" models on synthetic facts (see Alignment faking, Wang et al., Lessons from Two-Hop Reasoning).

To me, the more natural thing is "synthetic continued pretraining": further training the base model on synthetic documents (mixed with pretraining data), then applying post-training techniques. (I initially thought this was the approach used in Auditing language models for hidden objectives, but they actually apply SDF to the post-trained model and then apply further post-training.)

I'm sort of confused about why more papers aren't doing synthetic continued pretraining. I suspect it's some combination of a) finetuning post-trained models is easier, and b) people have tried both and it doesn't make much of a difference.

But if it's mostly a) and not really b), this would be useful to know (and would imply people should explore b more!).
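To make the contrast concrete, here's a rough structural sketch of the two pipelines. The checkpoint names, the training helper, and the data mixing are all illustrative assumptions, not any paper's actual setup:

```python
import random
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def train_on_texts(model, tokenizer, texts, lr=1e-5, epochs=1):
    """Minimal causal-LM training loop over raw documents (batch size 1 for brevity)."""
    opt = torch.optim.AdamW(model.parameters(), lr=lr)
    model.train()
    for _ in range(epochs):
        for text in texts:
            batch = tokenizer(text, return_tensors="pt", truncation=True).to(model.device)
            loss = model(**batch, labels=batch["input_ids"]).loss
            loss.backward()
            opt.step()
            opt.zero_grad()

# Pipeline A: supervised document finetuning (SDF) on an already post-trained model.
def pipeline_a_sdf(synthetic_docs):
    name = "meta-llama/Llama-3.1-8B-Instruct"   # post-trained checkpoint (assumption)
    model = AutoModelForCausalLM.from_pretrained(name)
    tokenizer = AutoTokenizer.from_pretrained(name)
    train_on_texts(model, tokenizer, synthetic_docs)
    return model

# Pipeline B: synthetic continued pretraining on the base model, then post-training.
def pipeline_b_synthetic_cpt(synthetic_docs, pretraining_docs, instruction_texts):
    name = "meta-llama/Llama-3.1-8B"            # base checkpoint (assumption)
    model = AutoModelForCausalLM.from_pretrained(name)
    tokenizer = AutoTokenizer.from_pretrained(name)
    # Continued pretraining: dilute the synthetic docs into ordinary pretraining text.
    mix = list(synthetic_docs) + list(pretraining_docs)
    random.shuffle(mix)
    train_on_texts(model, tokenizer, mix)
    # Post-training: e.g. instruction tuning on an open dataset, to recover an
    # "instruct"-style model (the point about open instruction-tuning data above).
    train_on_texts(model, tokenizer, instruction_texts)
    return model
```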

The Alignment Problem from a Deep Learning Perspective (major rewrite)
Oliver Daniels · 11d

Just want to note that I'm starting an AI safety reading group consisting of CS PhD students and faculty, and I'm extremely grateful this paper exists and was published at ICLR.

Buck's Shortform
Oliver Daniels · 11d

The book certainly claims that doom is not inevitable, but it does claim that doom is ~inevitable if anyone builds ASI using anything remotely like the current methods. 

I understand Zach (and other "moderates") as saying no: even conditioned on basically YOLO-ing the current paradigm to superintelligence, it's really uncertain (and less likely than not) that the resulting ASI would kill everyone.

I disagree with this position, but if I held it, I would be saying somewhat similar things to Zach (even having read the book). 

Though I agree that engaging on the object level (beyond "predictions are hard") would be good. 

Lessons from Studying Two-Hop Latent Reasoning
Oliver Daniels · 12d

Tentatively planning on checking this; I'll let you know what I find!

Oliver Daniels-Koch's Shortform
Oliver Daniels · 13d (edited)

I'm pretty happy with modeling SGD on deep nets as Solomonoff induction, but it seems like the key missing ingredient is path dependence. What's the best literature on this? Lots of high-level alignment plans rely on path dependence (shard theory, basin of corrigibility...).

Maybe the best answer is just: SGD is ~local, modulated by learning rate. But this doesn't integrate lottery ticket hypothesis stuff (which feels like it pushes hard against locality).
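As a toy illustration of the kind of path dependence I mean: train the same tiny net on the same data twice, varying only the seed (which controls initialization and minibatch order), and check how much the two runs disagree off-distribution. A minimal PyTorch sketch, purely illustrative and not an argument about deep nets or Solomonoff-style priors:

```python
import torch
import torch.nn as nn

def train_run(seed, X, y, steps=2000, lr=0.1):
    torch.manual_seed(seed)
    model = nn.Sequential(nn.Linear(2, 32), nn.ReLU(), nn.Linear(32, 1))
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    n = X.shape[0]
    for _ in range(steps):
        idx = torch.randperm(n)[:16]          # minibatch order depends on the seed
        loss = nn.functional.mse_loss(model(X[idx]), y[idx])
        opt.zero_grad()
        loss.backward()
        opt.step()
    return model

torch.manual_seed(0)
X = torch.randn(256, 2)
y = (X[:, :1] * X[:, 1:]).sign()              # simple nonlinear (XOR-like) target

m1 = train_run(seed=1, X=X, y=y)
m2 = train_run(seed=2, X=X, y=y)

probe = torch.randn(1000, 2) * 3              # probe points off the training distribution
with torch.no_grad():
    disagreement = (m1(probe).sign() != m2(probe).sign()).float().mean()
print(f"off-distribution sign disagreement: {disagreement:.2%}")
```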

Posts

Concrete Methods for Heuristic Estimation on Neural Networks (11mo)
Concrete empirical research projects in mechanistic anomaly detection (1y)
Oliver Daniels-Koch's Shortform (2y)