Oliver Daniels — LessWrong

not very long (3-5 word phrases)

Base64Bench: How good are LLMs at base64, and why care about it?

somewhat related (and useful for weak to strong type experiments), I found a large gap between decoding performance in the Qwen3-[8-32B] (No-Thinking) range on the "secret side contraints" from the Eliciting Secret Knowledge paper.

Oliver Daniels-Koch's Shortform

Oliver Daniels2mo10

Should we try harder to solve the alignment problem?

I've heard the meme "we are underinvesting in solving the the damn alignment problem" a few times (mostly from Neel Nanda).

I think this is right, but also wanted to note that, all else equal, ones excitement about scalable oversight type stuff should be proportional to their overall optimism about misalignment risks. This implies that the safety community should be investing more in interp / monitoring / control, because these interventions are more likely to catch schemers and provide evidence to help coordinate a global pause.

Lessons from Studying Two-Hop Latent Reasoning

Oliver Daniels2mo10

Yeah i ended up trying this too and it didn't work (which is kind of an interesting finding itself, though yeah unclear if it applies to larger models / datasets)

Lessons from Studying Two-Hop Latent Reasoning

Oliver Daniels2mo10

preliminary results make me much more confident the model is doing "true" multi-hop reasoning in the 3-distractor triplets case. Most notable finding: on seed 42 (which I ran accidentally), same-doc w/ triplet distractors 2hop-no-cot accuracy improves to ~25% (compared to ~5% reported in the paper)

so I'm guessing there's something like a "two-hop reasoning circuit" and "memorization circuit", and most of the time you mostly get the memorization circuit but sometimes the two-hop reasoning circuits gets more reinforced.

This makes me fairly confident that training on a larger more diverse dataset would lead to fairly consistent two-hop reasoning (even with distractors).

These results also raise some questions on optimal data ordering (perhaps its better to finetune on the atomic facts then on the no-cot demonstrations?) but I mostly suspect these to be solved by scale.

Buck's Shortform

Oliver Daniels2mo20

I do think there's some amount of "these guys are weirdo extremists" signaling implicit in stating that they think doom is inevitable, but I don't think it stems from not reading the book / not understanding the conditional (the conditional is in the title!)

Oliver Daniels-Koch's Shortform

Oliver Daniels2mo20

yeah but its plausible this cost is worth paying if the effect size is large enough (and there are various open source instruction-tuning datasets which might reasonably recover e.g. Llama-3-instruct)

Oliver Daniels-Koch's Shortform

Oliver Daniels2mo*100

In the small but growing literature on supervised document finetuning, its typical to finetune "post-trained" models on synthetic facts (see Alignment faking, Wang et al., Lessons from Two-Hop Reasoning)

To me, the more natural thing is "synthetic continued pretraining" - further training the base model on synthetic documents (mixed with pretraining data), then applying post-training techniques ~~(this is the approach used in~~ ~~Auditing language models for hidden objectives~~ nvm they apply SDF to post-trained model, then apply further post-training)

I'm sort of confused why more papers aren't doing synthetic continued pretraining. I suspect its some combination of a) finetuing post-trained models is easier and b) people have tried both and it doesn't make much of a difference.

But if its mostly a) and not really b), this would useful to know (and implies people should explore b more!)

The Alignment Problem from a Deep Learning Perspective (major rewrite)

Oliver Daniels2mo10

just want to note that I'm starting an AI safety reading group consisting CS PhD students and faculty, and I'm extremely grateful this paper exists and was published in ICLR.

Buck's Shortform

Oliver Daniels2mo7152

The book certainly claims that doom is not inevitable, but it does claim that doom is ~inevitable if anyone builds ASI using anything remotely like the current methods.

I understand Zach (and other "moderates") as saying no, even conditioned on basically YOLO-ing the current paradigm to superintelligence, its really uncertain (and less likely than not) that the resulting ASI would kill everyone.

I disagree with this position, but if I held it, I would be saying somewhat similar things to Zach (even having read the book).

Though I agree that engaging on the object level (beyond "predictions are hard") would be good.

LESSWRONG
LW

LESSWRONG
LW

Posts

Wikitag Contributions

Comments