Does anyone here feel like they have personally made substantial contributions to AI safety? I don't mean converting others such that they worry (although that is important!) I mean more of technical progress in alignment matching progress in AI capability. Top posts seemed to be skewed toward stating the problem as opposed to even partial solutions or incremental progress.
I'm pretty excited about our recent work on inoculation prompting! https://www.lesswrong.com/posts/AXRHzCPMv6ywCxCFp/inoculation-prompting-instructing-models-to-misbehave-at
One curated post describes Tim Hua's red-teaming experiment where KimiK2 was found NOT to induce psychosis. Another high-level post describes how emergent misalignment can be caused by unpopular preferences.
As for top posts being skewed, it might be caused by LWers not paying attention to news about incremental progress. Or by alignment being actually hard. Especially if progress is made inside labs which, unlike the KimiK2 team, are less willing to try radically new methods that could actually solve problems.
On the other hand, Claude Sonnet 4.5 was tested and found to be radically LESS sycophantic than Claude Sonnet 4. Alas, C. Sonnet 4.5 is more situationally aware...
P.S. Alignment techniques are likely far harder to test than capabilities techniques. It could be possible to test architectures like neuralese on tiny networks dedicated to solving narrow tasks, but IMpossible to test alignment on such networks, since they don't have ways to, say, hack the reward. But alignment techniques that ordinary people CAN test are only finetuning and red teaming.