Oliver Daniels — LessWrong

LESSWRONG
LW

Part of the subtext here being the very best people (on the relevant dimensions) will naturally run into lesswrong, x-risk, etc, such that "out-reach" (in the sense of uni-organizing, advertising, etc) isn't that valuable on the current margin.

To "the very best", doing high quality research is often the best "out-reach".

Oliver Daniels-Koch's Shortform

Oliver Daniels4d10

unclear whether outcome-based training "leaking" into chain of thought b/c of parameter sharing / short-to-long generalization is good or bad for safety

on the one hand, it makes scheming less likely
on the other hand, it can make CoT less monitorable

NATO is dangerously unaware that its military edge is slipping

Oliver Daniels6d10

Thanks! Yeah makes sense, something like extremely adversarial / exploitive information environment, coupled with general decay in elite culture (and mass addiction to vertical video, social media).

NATO is dangerously unaware that its military edge is slipping

Oliver Daniels6d10

Any citations on electorate being more 'illiterate' (in the relevant sense) then, say, 1960.

Oliver Daniels-Koch's Shortform

Oliver Daniels7d40

I’ve been running a safety grad student reading group, and feeling like it would be healthier / more productive to have concrete metrics or deliverables.

My tentative idea to incorporate LW / Alignment Forum posting as a core component of the group (with karma as a metric). I’m not exactly sure on the structure, but something like:

Reading week (everyone reads the same set of papers / blog posts, discuss thought
Writing week: everyone prepares comment / shortform / post, then we trade and revise writers-workshop style
- this also provides filtering (avoid spamming LW with bad comments) and maybe makes people feel more comfortable/confident about posting public writing

would love feedback, and curious if anyone has tried stuff like this.

(Another nice feature of this is it increases the odds of getting people hooked on LW, which imo should be a top priority for safety field-building, c.f. https://www.lesswrong.com/posts/ke24kxhSzfX2ycy57/simon-lermen-s-shortform?commentId=HqQsNdp4bdp4nDn7G)

Base64Bench: How good are LLMs at base64, and why care about it?

Oliver Daniels1mo10

not very long (3-5 word phrases)

Base64Bench: How good are LLMs at base64, and why care about it?

Oliver Daniels1mo10

somewhat related (and useful for weak to strong type experiments), I found a large gap between decoding performance in the Qwen3-[8-32B] (No-Thinking) range on the "secret side contraints" from the Eliciting Secret Knowledge paper.

Oliver Daniels-Koch's Shortform

Oliver Daniels2mo10

Should we try harder to solve the alignment problem?

I've heard the meme "we are underinvesting in solving the the damn alignment problem" a few times (mostly from Neel Nanda).

I think this is right, but also wanted to note that, all else equal, ones excitement about scalable oversight type stuff should be proportional to their overall optimism about misalignment risks. This implies that the safety community should be investing more in interp / monitoring / control, because these interventions are more likely to catch schemers and provide evidence to help coordinate a global pause.

Lessons from Studying Two-Hop Latent Reasoning

Oliver Daniels2mo10

Yeah i ended up trying this too and it didn't work (which is kind of an interesting finding itself, though yeah unclear if it applies to larger models / datasets)

Lessons from Studying Two-Hop Latent Reasoning

Oliver Daniels2mo10

preliminary results make me much more confident the model is doing "true" multi-hop reasoning in the 3-distractor triplets case. Most notable finding: on seed 42 (which I ran accidentally), same-doc w/ triplet distractors 2hop-no-cot accuracy improves to ~25% (compared to ~5% reported in the paper)

so I'm guessing there's something like a "two-hop reasoning circuit" and "memorization circuit", and most of the time you mostly get the memorization circuit but sometimes the two-hop reasoning circuits gets more reinforced.

This makes me fairly confident that training on a larger more diverse dataset would lead to fairly consistent two-hop reasoning (even with distractors).

These results also raise some questions on optimal data ordering (perhaps its better to finetune on the atomic facts then on the no-cot demonstrations?) but I mostly suspect these to be solved by scale.

LESSWRONG
LW

LESSWRONG
LW

Posts

Wikitag Contributions

Comments