Yeah, I ended up trying this too and it didn't work (which is kind of an interesting finding in itself, though yeah, unclear if it applies to larger models / datasets)
Preliminary results make me much more confident the model is doing "true" multi-hop reasoning in the 3-distractor-triplets case. The most notable finding: on seed 42 (which I ran accidentally), same-doc w/ triplet distractors 2hop-no-cot accuracy improves to ~25% (compared to the ~5% reported in the paper)
So I'm guessing there's something like a "two-hop reasoning circuit" and a "memorization circuit", and most of the time the memorization circuit wins out, but sometimes the two-hop reasoning circuit gets more reinforced.
This makes me fairly confident that training on a larger, more diverse dataset would lead to fairly consistent two-hop reasoning (even with distractors).
These results also raise some questions about optimal data ordering (perhaps it's better to finetune on the atomic facts first, then on the no-cot demonstrations?), but I mostly suspect these would be solved by scale.
I do think there's some amount of "these guys are weirdo extremists" signaling implicit in stating that they think doom is inevitable, but I don't think it stems from not reading the book / not understanding the conditional (the conditional is in the title!)
Yeah, but it's plausible this cost is worth paying if the effect size is large enough (and there are various open-source instruction-tuning datasets which might reasonably recover e.g. Llama-3-instruct)
In the small but growing literature on supervised document finetuning (SDF), it's typical to finetune "post-trained" models on synthetic facts (see Alignment faking, Wang et al., Lessons from Two-Hop Reasoning)
To me, the more natural thing is "synthetic continued pretraining": further training the base model on synthetic documents (mixed with pretraining data), then applying post-training techniques (this is the approach used in Auditing language models for hidden objectives... nvm, they apply SDF to the post-trained model, then apply further post-training)
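(To make this concrete, here's a rough sketch of what I mean by synthetic continued pretraining, using HuggingFace transformers/datasets. The base model name, the synthetic_docs.jsonl path, the stand-in pretraining corpus, the mixing ratio, and the hyperparameters are all placeholder assumptions on my part, not a reproduction of any of the papers above.)

```python
# Rough sketch of synthetic continued pretraining (placeholder names/hparams throughout):
# 1) mix synthetic documents into ordinary pretraining-style text,
# 2) continue next-token-prediction training of the *base* model on the mix,
# 3) only afterwards apply post-training (instruction tuning / RLHF) to the checkpoint.
from datasets import load_dataset, interleave_datasets
from transformers import (AutoModelForCausalLM, AutoTokenizer, Trainer,
                          TrainingArguments, DataCollatorForLanguageModeling)

BASE = "meta-llama/Meta-Llama-3-8B"  # a *base* (non-instruct) model; placeholder choice

tok = AutoTokenizer.from_pretrained(BASE)
tok.pad_token = tok.eos_token
model = AutoModelForCausalLM.from_pretrained(BASE)

# Synthetic docs (assumed JSONL with a single "text" field), mixed ~1:9 with generic text;
# wikitext is just a stand-in for real pretraining data.
synthetic = load_dataset("json", data_files="synthetic_docs.jsonl", split="train")
pretrain = load_dataset("wikitext", "wikitext-103-raw-v1", split="train")
mixed = interleave_datasets([synthetic, pretrain], probabilities=[0.1, 0.9], seed=0)

def tokenize(batch):
    return tok(batch["text"], truncation=True, max_length=2048)

mixed = mixed.map(tokenize, batched=True, remove_columns=mixed.column_names)

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="ckpt_synthetic_cpt",
        per_device_train_batch_size=1,
        gradient_accumulation_steps=16,
        num_train_epochs=1,
        bf16=True,
    ),
    train_dataset=mixed,
    data_collator=DataCollatorForLanguageModeling(tok, mlm=False),  # causal LM objective
)
trainer.train()
# ...then run post-training (e.g. on an open instruction-tuning dataset) starting from
# ckpt_synthetic_cpt, rather than doing SDF directly on an already post-trained model.
```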
I'm sort of confused why more papers aren't doing synthetic continued pretraining. I suspect it's some combination of a) finetuning post-trained models is easier, and b) people have tried both and it doesn't make much of a difference.
But if it's mostly a) and not really b), this would be useful to know (and implies people should explore b) more!)
Just want to note that I'm starting an AI safety reading group consisting of CS PhD students and faculty, and I'm extremely grateful this paper exists and was published at ICLR.
The book certainly claims that doom is not inevitable, but it does claim that doom is ~inevitable if anyone builds ASI using anything remotely like the current methods.
I understand Zach (and other "moderates") as saying no: even conditioned on basically YOLO-ing the current paradigm to superintelligence, it's really uncertain (and less likely than not) that the resulting ASI would kill everyone.
I disagree with this position, but if I held it, I would be saying somewhat similar things to Zach (even having read the book).
Though I agree that engaging on the object level (beyond "predictions are hard") would be good.
Tentatively planning on checking this, I'll let you know what I find!
I'm pretty happy with modeling SGD on deep nets as Solomonoff induction, but it seems like the key missing ingredient is path dependence. What's the best literature on this? Lots of high-level alignment plans rely on path dependence (shard theory, basin of corrigibility...)
Maybe the best answer is just: SGD is ~local, modulated by learning rate. But this doesn't integrate lottery ticket hypothesis stuff (which feels like it pushes hard against locality)
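To spell out the locality point (just the plain-SGD update rule, no momentum; my framing rather than a citation): each step can only move the parameters a distance set by the learning rate,

$$\theta_{t+1} = \theta_t - \eta\, \nabla_\theta L(\theta_t; x_t), \qquad \lVert \theta_{t+1} - \theta_t \rVert = \eta\, \lVert \nabla_\theta L(\theta_t; x_t) \rVert,$$

so where you end up depends on the initialization and on the whole sequence of batches $x_1, \dots, x_T$, which is where the path dependence comes in.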
Should we try harder to solve the alignment problem?
I've heard the meme "we are underinvesting in solving the damn alignment problem" a few times (mostly from Neel Nanda).
I think this is right, but also wanted to note that, all else equal, one's excitement about scalable-oversight-type stuff should be proportional to one's overall optimism about misalignment risks. To the extent you're relatively pessimistic, this implies that the safety community should be investing more in interp / monitoring / control, because these interventions are more likely to catch schemers and provide evidence to help coordinate a global pause.