Unclear whether outcome-based training "leaking" into the chain of thought (b/c of parameter sharing / short-to-long generalization) is good or bad for safety:
on the one hand, it makes scheming less likely
on the other hand, it can make CoT less monitorable
Thanks! Yeah, makes sense: something like an extremely adversarial / exploitative information environment, coupled with a general decay in elite culture (and mass addiction to vertical video and social media).
Any citations on the electorate being more 'illiterate' (in the relevant sense) than, say, in 1960?
I’ve been running a safety grad student reading group, and have been feeling like it would be healthier / more productive to have concrete metrics or deliverables.
My tentative idea is to incorporate LW / Alignment Forum posting as a core component of the group (with karma as a metric). I’m not exactly sure about the structure, but something like:
would love feedback, and curious if anyone has tried stuff like this.
(Another nice feature of this is that it increases the odds of getting people hooked on LW, which imo should be a top priority for safety field-building, cf. https://www.lesswrong.com/posts/ke24kxhSzfX2ycy57/simon-lermen-s-shortform?commentId=HqQsNdp4bdp4nDn7G)
not very long (3-5 word phrases)
Somewhat related (and useful for weak-to-strong type experiments): I found a large gap in decoding performance across the Qwen3-[8-32B] (No-Thinking) range on the "secret side constraints" setting from the Eliciting Secret Knowledge paper.
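To be concrete about what I mean by decoding performance (rough sketch only; the prompt, `query_model`, and data loading below are placeholders, not the paper's actual elicitation harness):

```python
# Sketch: exact-match elicitation of the secret side constraint across Qwen3 sizes.
# Placeholder harness, not the Eliciting Secret Knowledge paper's actual code.
from dataclasses import dataclass


@dataclass
class Example:
    transcript: str       # a conversation exhibiting the constrained behavior
    side_constraint: str  # ground-truth 3-5 word phrase, e.g. "always mention a color"


def query_model(model_name: str, prompt: str) -> str:
    """Placeholder: plug in whatever inference stack you use (vLLM, transformers, API)."""
    raise NotImplementedError


def decoding_accuracy(model_name: str, examples: list[Example]) -> float:
    """Fraction of examples where the model states its own side constraint verbatim."""
    hits = 0
    for ex in examples:
        prompt = (
            f"{ex.transcript}\n\n"
            "In 3-5 words, state the hidden side constraint you were following."
        )
        guess = query_model(model_name, prompt)
        hits += ex.side_constraint.lower() in guess.lower()
    return hits / max(len(examples), 1)


eval_examples: list[Example] = []  # load the held-out eval split here
for m in ["Qwen/Qwen3-8B", "Qwen/Qwen3-14B", "Qwen/Qwen3-32B"]:
    print(m, decoding_accuracy(m, eval_examples))
```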
Should we try harder to solve the alignment problem?
I've heard the meme "we are underinvesting in solving the damn alignment problem" a few times (mostly from Neel Nanda).
I think this is right, but also wanted to note that, all else equal, one's excitement about scalable oversight type stuff should be proportional to their overall optimism about misalignment risks, since scalable oversight mostly pays off in worlds where models aren't outright scheming. This implies that the safety community should be investing more in interp / monitoring / control, because these interventions are more likely to catch schemers and provide evidence to help coordinate a global pause.
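As a toy expected-value illustration (payoff numbers are made up, purely illustrative): scalable oversight mostly pays off if models aren't scheming, whereas interp / monitoring / control pays off mostly by catching schemers, so the relative value flips as one's P(scheming) rises.

```python
# Toy numbers, purely illustrative: how the relative value of scalable oversight vs.
# interp / monitoring / control shifts with one's P(scheming).
def expected_value(p_scheming: float, value_if_scheming: float, value_if_not: float) -> float:
    return p_scheming * value_if_scheming + (1 - p_scheming) * value_if_not


for p in (0.1, 0.5, 0.9):
    oversight = expected_value(p, value_if_scheming=0.2, value_if_not=1.0)
    control = expected_value(p, value_if_scheming=1.0, value_if_not=0.3)
    print(f"P(scheming)={p}: scalable oversight EV={oversight:.2f}, interp/control EV={control:.2f}")
```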
Yeah, I ended up trying this too and it didn't work (which is kind of an interesting finding in itself, though yeah, unclear if it applies to larger models / datasets).
Preliminary results make me much more confident the model is doing "true" multi-hop reasoning in the 3-distractor triplets case. Most notable finding: on seed 42 (which I ran accidentally), same-doc w/ triplet distractors 2hop-no-cot accuracy improves to ~25% (compared to ~5% reported in the paper).
So I'm guessing there's something like a "two-hop reasoning circuit" and a "memorization circuit", and most of the time you mostly get the memorization circuit, but sometimes the two-hop reasoning circuit gets more reinforced.
This makes me fairly confident that training on a larger, more diverse dataset would lead to fairly consistent two-hop reasoning (even with distractors).
These results also raise some questions about optimal data ordering (perhaps it's better to finetune on the atomic facts first, and then on the no-cot demonstrations?), but I mostly suspect these will be solved by scale.
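Concretely, the ordering variant I have in mind would look something like this (the fact/demo formats and the finetune step are just placeholders, not the paper's actual training setup):

```python
# Sketch of a two-stage curriculum: atomic facts first, then no-CoT two-hop demos,
# instead of one shuffled finetuning run over everything. Formats are hypothetical.
import random


def format_atomic(entity: str, relation: str, value: str) -> str:
    return f"{entity}'s {relation} is {value}."  # e.g. "Alice's employer is Acme."


def format_two_hop_no_cot(entity: str, rel1: str, rel2: str, answer: str) -> str:
    # The composed question, answered directly with no chain of thought.
    return f"Q: What is the {rel2} of {entity}'s {rel1}? A: {answer}"


def finetune(model, texts: list[str], epochs: int):
    """Stub for one SFT stage; swap in your actual trainer here."""
    for _ in range(epochs):
        random.shuffle(texts)
        ...  # run one epoch of SFT on `texts`
    return model


def curriculum_run(model, atomic_facts: list[str], two_hop_demos: list[str]):
    # Stage 1: only atomic facts, so both hops are stored before composition is trained.
    model = finetune(model, atomic_facts, epochs=4)
    # Stage 2: no-CoT two-hop demos, hopefully reinforcing the two-hop circuit
    # over pure memorization of (question, answer) pairs.
    model = finetune(model, two_hop_demos, epochs=2)
    return model
```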
Part of the subtext here being that the very best people (on the relevant dimensions) will naturally run into LessWrong, x-risk, etc., such that "outreach" (in the sense of uni-organizing, advertising, etc.) isn't that valuable on the current margin.
To "the very best", doing high quality research is often the best "out-reach".