Does anyone here feel like they have personally made substantial contributions to AI safety? I don't mean converting others so that they worry (although that is important!). I mean technical progress in alignment that keeps pace with progress in AI capabilities. Top posts seem to be skewed toward stating the problem rather than offering even partial solutions or incremental progress.
I'm pretty excited about our recent work on inoculation prompting! https://www.lesswrong.com/posts/AXRHzCPMv6ywCxCFp/inoculation-prompting-instructing-models-to-misbehave-at
One curated post describes Tim Hua's red-teaming experiment where KimiK2 was found NOT to induce psychosis. Another high-level post describes how emergent misalignment can be caused by unpopular preferences.
As for top posts being skewed, it might be caused by LWers not paying attention to news about incremental progress. Or by alignment actually being hard, especially if progress is made inside labs which, unlike the KimiK2 team, are less willing to try radically new methods that could actually solve problems.
On the other hand, Claude Sonnet 4.5 was tested and found to be radically LESS sycophantic than Claude Sonnet 4. Alas, Claude Sonnet 4.5 is also more situationally aware...
P.S. Alignment techniques are likely far harder to test than capabilities techniques. It could be possible to test architectures like neuralese on tiny networks dedicated to solving narrow tasks, but IMpossible to test alignment on such networks, since they don't have ways to, say, hack the reward. And the only alignment techniques that ordinary people CAN test are finetuning and red teaming.
I don't support building even an aligned superintelligence. I strongly support cybernetic and genetic enhancements to humans, as well as uploaded minds. Depending on your definition of superintelligence, I guess some of those might count as one. It feels wrong to hand off the keys of the universe to something with no human lineage whatsoever, even if it had something recognizable as human ethics and took care of us. It feels very much like being a kid with doting parents, and that is bad in my eyes.