Dom Polsinelli's Shortform

5th Oct 2025

1 min read

2

This is a special post for quick takes by Dom Polsinelli. Only they can create top-level comments. Comments here also appear on the Quick Takes page and All Posts page.

Dom Polsinelli's Shortform

4 comments, sorted by

top scoring

Click to highlight new comments since: Today at 6:48 PM

[-]Dom Polsinelli6d104

Does anyone here feel like they have personally made substantial contributions to AI safety? I don't mean converting others such that they worry (although that is important!) I mean more of technical progress in alignment matching progress in AI capability. Top posts seemed to be skewed toward stating the problem as opposed to even partial solutions or incremental progress.

[-]Hastings6d109

It appears to be a bit tricky.

[-]Daniel Tan3d30

I'm pretty excited about our recent work on inoculation prompting! https://www.lesswrong.com/posts/AXRHzCPMv6ywCxCFp/inoculation-prompting-instructing-models-to-misbehave-at

[-]StanislavKrym6d20

One curated post describes Tim Hua's red-teaming experiment where KimiK2 was found NOT to induce psychosis. Another high-level post describes how emergent misalignment can be caused by unpopular preferences.

As for top posts being skewed, it might be caused by LWers not paying attention to news about incremental progress. Or by alignment being actually hard. Especially if progress is made inside labs which, unlike the KimiK2 team, are less willing to try radically new methods that could actually solve problems.

On the other hand, Claude Sonnet 4.5 was tested and found to be radically LESS sycophantic than Claude Sonnet 4. Alas, C. Sonnet 4.5 is more situationally aware...

P.S. Alignment techniques are likely far harder to test than capabilities techniques. It could be possible to test architectures like neuralese on tiny networks dedicated to solving narrow tasks, but IMpossible to test alignment on such networks, since they don't have ways to, say, hack the reward. But alignment techniques that ordinary people CAN test are only finetuning and red teaming.

Moderation Log

LESSWRONG
LW

LESSWRONG
LW

Dom Polsinelli's Shortform

2