Some people are excited about the discovery of emergent misalignment since they think it indicates that LLMs understand goodness or that goodness is easy to specify.
I’m less convinced, and would be interested in the following experiments. I strongly predict that these experiments will show that emergent misalignment does not indicate a unified representation of goodness.
A common piece of forecasting advice is to "break a prediction into cases and evaluate them separately." I found this advice unhelpful for the reasons you described: I could free-associate some cases but would miss others, and I trusted my gut more than the decomposition. Maybe this is a case where "do the math, then go with your gut" makes sense.
Is there a better way to do the decomposition for predictions?
I reflected on why I didn’t feel overwhelming, debilitating sadness about x-risk and realized that “there’s no rule that says you should be sad if you aren’t feeling sad.”
Even a recent widow in a previously happy marriage shouldn’t feel bad about it if they find themselves not feeling sad.
Pricing is linear in token count even though actual compute cost grows roughly quadratically with context length, so each additional token costs more to process than the last. That means the posted price is some approximate curve fit based on expected use. I would be curious where the actual cost curve for tokens intersects the actual cost of a single image.
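As a toy illustration of that crossover, here is a minimal sketch with entirely made-up constants (none of these are real prices or costs): a linear per-token price, a compute-cost model with a quadratic attention term, and a fixed per-image cost, searched for the context length where token compute cost overtakes one image.

```python
# Toy comparison of a linear token price vs. a quadratic compute-cost model.
# Every constant below is an illustrative assumption, not a real provider price.

PRICE_PER_TOKEN = 2e-6        # assumed linear API price: $2 per million tokens
LINEAR_COST_PER_TOKEN = 5e-7  # assumed per-token compute cost (MLP, embeddings, ...)
QUADRATIC_COEFF = 1e-12       # assumed attention-cost coefficient (scales with n^2)
IMAGE_COST = 0.01             # assumed compute cost of generating one image, in dollars

def token_price(n: int) -> float:
    """What the provider charges for n tokens (linear in n)."""
    return PRICE_PER_TOKEN * n

def token_compute_cost(n: int) -> float:
    """Rough compute cost of n tokens: linear term plus quadratic attention term."""
    return LINEAR_COST_PER_TOKEN * n + QUADRATIC_COEFF * n * n

# Find the first context length where generating tokens costs more than one image.
n = 1
while token_compute_cost(n) < IMAGE_COST:
    n += 1
print(f"Token compute cost passes one image's cost at roughly {n:,} tokens")
print(f"At that point the linear price would be ${token_price(n):.4f}")
```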
The “guidelines on how they set hyperparameters” link is broken. Does anyone have a good replacement?
Oddly, I also couldn’t find it on the Wayback Machine.
Now that AI-generated art is so easy, I frequently find more motivation to make art, not less. If you want an ultra-polished painting with perfect lighting, sure, go to a diffusion model. If you want me, you have to get art from me. And my work doesn't have to be perfect. Perfect is cheap. My work just has to show what I feel and feel right to me. Work that carries a piece of myself cannot come from a diffusion model, so my work has great value.
I asked GPT-5 Agent to choose an underrated LessWrong post and it chose this one.
I agree that this is underrated. Your point that anti-aligned models are strictly more capable than safe models, and are trained in potentially harmful skills, is certainly something to keep in mind when we consider how aligned AIs seem to be. Thanks to this post, I will train myself to take a moment to imagine the national-security implications of anti-alignment when I plan research ideas or learn about the research of others.
Here's the chat with GPT-5. It also picked a few other posts as runners-up.
Which AI Safety math topics deserve a high-quality Manim video? (think 3Blue1Brown's video style)
YouTube videos get more views than blog posts. Beautiful animations, even more so.
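For concreteness, here is a minimal sketch of what one such scene might look like in the Manim Community edition. The topic (the KL-regularized RLHF objective), the scene name, and the equation layout are my own illustrative choices, not a claim about which topic most deserves the video.

```python
from manim import Scene, MathTex, Text, Write, FadeIn, UP

class RLHFObjective(Scene):
    def construct(self):
        # Title card for the scene.
        title = Text("RLHF objective with a KL penalty").to_edge(UP)

        # The KL-regularized objective, split so each term can be animated separately.
        objective = MathTex(
            r"\max_\pi \; \mathbb{E}_{x \sim \pi}\big[ r(x) \big]",
            r"\;-\; \beta \, D_{\mathrm{KL}}\!\left( \pi \,\|\, \pi_{\mathrm{ref}} \right)",
        )

        self.play(FadeIn(title))
        self.play(Write(objective[0]))  # reward term appears first
        self.play(Write(objective[1]))  # then the KL penalty term
        self.wait(2)
```

It would render with something like `manim -pql scene.py RLHFObjective`.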
Men[1] will die[2] for her[3] massive[4] coconuts[5].
[1] All of humanity
[2] Go extinct
Hindsight Experience Replay (HER), a technique for improving the reinforcement learning training signal
Chain of Continuous Thought, a technique that makes the model's chain of thought much less interpretable but lets it reason more efficiently (minimal sketches of both techniques below)
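First, a sketch of the core of HER: relabel stored transitions with goals that were actually achieved later in the episode, so failed trajectories still yield reward signal. The transition format, the sparse 0/1 reward, and names like `replay_buffer` and `store_episode` are assumptions for illustration, not any particular library's API.

```python
import random
from collections import deque

# Sparse-reward, goal-conditioned setup: a transition is
# (state, action, next_state, goal, achieved_goal), with reward 1 only when
# the achieved goal matches the goal being conditioned on.

replay_buffer = deque(maxlen=100_000)

def store_episode(episode, k=4):
    """Store each transition with its original goal, plus k hindsight-relabeled copies."""
    for t, (state, action, next_state, goal, achieved_goal) in enumerate(episode):
        # Original transition: on a failed episode this reward is almost always 0.
        reward = 1.0 if achieved_goal == goal else 0.0
        replay_buffer.append((state, action, reward, next_state, goal))

        # "Future" strategy: substitute goals that were actually achieved at or
        # after this step, so the same trajectory counts as a success for those goals.
        future = episode[t:]
        for _, _, _, _, future_achieved in random.sample(future, min(k, len(future))):
            relabeled_reward = 1.0 if achieved_goal == future_achieved else 0.0
            replay_buffer.append((state, action, relabeled_reward, next_state, future_achieved))
```

And a rough sketch of the continuous-thought idea, assuming a HuggingFace causal LM: instead of decoding a token at each reasoning step, the last hidden state is fed back as the next input embedding, so intermediate "thoughts" never surface as readable tokens. This only shows the data flow; the real technique requires training the model to use these latent steps, and `gpt2` plus the step count are placeholder assumptions.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder model; an off-the-shelf gpt2 won't actually reason this way untrained.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

prompt_ids = tokenizer("Question: ...", return_tensors="pt").input_ids
embeds = model.get_input_embeddings()(prompt_ids)

with torch.no_grad():
    for _ in range(4):  # a few latent "thought" steps, never decoded into tokens
        out = model(inputs_embeds=embeds, output_hidden_states=True)
        last_hidden = out.hidden_states[-1][:, -1:, :]  # final position's hidden state
        # Feed the hidden state back in as the next input embedding instead of a token.
        embeds = torch.cat([embeds, last_hidden], dim=1)

# Only after the latent steps would visible answer tokens be decoded as usual.
```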
Makes sense from a brain perspective, too.
Neurons are noisy, but by using discrete representations (symbols) we can do extremely reliable cognition (math and logic). Though in the brain those symbols are still represented by neurons, which is a difference from an AI decoding discrete tokens out of the network.