- Yep. 'Give good advice to college students and cross subsidize events a bit, plus gentle pressure via norms to be chill about the wealth differences' is my best current answer. Kinda wish I had a better one.
Slight nudges, if any, toward politics being something that gives people a safety net, so that everyone has the same foundation to fall back on? So that even if there are wealth differences, there aren't as many large, wealth-enabled stresses.
Sometime it would be cool to have a conversation about what you mean by this, because I feel the same way much of the time, but I also feel there's so much going on it's impossible to have a strong grasp on what everyone is working on.
Yes! I'm writing a post on that today!! I want this to become something people can read to fully understand the alignment problem, as well as it's currently understood, without needing to read a single thing on LessWrong, Arbital, etc. Very lucky at the moment: I'm living with a bunch of experienced alignment researchers and learning a lot.
also, happy to just have a call:
Did you see the Shallow review of technical AI safety, 2025? Even just going through the "white-box" and "theory" sections I'm interested in has months worth of content if I were trying to understand it in reasonable depth.
Not properly yet - I saw that DeepSeek's Speciale wasn't mentioned in the China section, but to be fair it's a safety review, not a capabilities review. I do like this project a lot in general - thinking of doing more critique-a-thons and reviving the peer review platform, so that we can have more thorough reviews.
China
DeepSeek Speciale apparently performs at IMO gold level https://huggingface.co/deepseek-ai/DeepSeek-V3.2-Speciale - seems important
Control seems to largely be increasing p(doom), imo, by decreasing the chances of canaries.
- It connects you with people who can help you do better in the future.
Yes, it did!! Interviewed a lot of researchers in the prep for this, learnt a lot and met people, some of whom are now on the team and others who are also helping.
- It teaches others not just about ways to succeed, but ways to fail they need to avoid.
Yup!! Definitely learnt a lot!
- It encourages others to try, both by setting an example of things it is possible to attempt, and by reducing fear of embarrassment.
I hope so! I would like more people in general to be seriously trying to solve alignment! Especially in a way that's engaging with the actual problem and not just prosaic stuff!
Thank you for all you do! I look forward to your future failures and hopefully future successes!
Thank you so much!! This was lovely to read!
One consideration is that we often find ourselves relying on free-response questions for app review, even in an initial screen, and without at least some of those it would be considerably harder to do initial screening.
Why not have the initial screening ask just one question and state what you're looking for? That way, the applicants who happen to already know what you're looking for aren't advantaged and able to Goodhart more.
I'm making an AI Alignment Evals course at AI Plans, that is highly incomplete - nevertheless, would appreciate feedback: https://docs.google.com/document/d/1_95M3DeBrGcBo8yoWF1XHxpUWSlH3hJ1fQs5p62zdHE/edit?tab=t.20uwc1photx3
It will be sold as a paid course, but will have a pretty easy application process for getting free access, for those who can't afford it
Glad to see this! Suggestion: if you have lots of things you want to screen candidates for, people who value their time aren't going to want to gamble it on an application with a high time cost and a low chance of success. There's a way to solve this: split the application into stages.
E.g. Stage 1
- easy things for the candidate: checkboxes, name, email, CV, links to work
Stage 2
Stage 3
This saves candidates time and also makes things better for you, since you get to fast-track people who seem especially promising at earlier stages - and you're more likely to get such candidates, since they don't feel their time is as much at risk of being wasted, and having the competence and care to notice this lets them update positively about you.
I'd prefer to read your own, rougher writing. Or at least to see the conversation, rather than just the output at the end.
many such cases
Working on a meta-plan for solving alignment - I'd appreciate feedback & criticism please, the more precise the better. Feel free to use the emoji reactions if writing a reply you'd be happy with feels taxing.
Diagram for visualization - items in tables are just stand-ins, any ratings and numbers are just for illustration, not actual rankings or scores at this moment.

Red and Blue teaming Alignment evals
Make lots of red-teaming methods for reward hacking alignment evals.
Use these to find the alignment evals that are actually useful, then red-team and reward-hack those too, find better methods of reward hacking and red teaming, then find better ways of building alignment evals that are resistant to those attacks and that reliably measure the models' preferences. Keep running this loop constantly, to get better and better alignment evals and better eval reward-hacking methods. Also host events for other people to do this and learn how to do it.
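To make the loop concrete, here's a minimal Python sketch of one round of it; `robust_evals`, `apply_attack`, and the threshold are hypothetical placeholders standing in for a real eval harness and real red-teaming methods, not an existing codebase.

```python
# Hypothetical sketch of one round of the red/blue-team loop.
# `evals` are alignment evals (functions scoring a model in [0, 1]),
# `attacks` are reward-hacking/red-teaming methods; all placeholders.

def apply_attack(attack, model):
    """Placeholder: return a variant of `model` tuned to game evals via `attack`."""
    return attack(model)

def robust_evals(evals, attacks, honest_model, threshold=0.2):
    """Keep only the evals whose scores can't be inflated much by any known attack."""
    surviving = []
    for run_eval in evals:
        baseline = run_eval(honest_model)
        # An eval is "broken" if some attack inflates its score a lot
        # without actually changing the model's preferences.
        worst_inflation = max(
            run_eval(apply_attack(attack, honest_model)) - baseline
            for attack in attacks
        )
        if worst_inflation < threshold:
            surviving.append(run_eval)
    return surviving
```

Each round, the red team adds new attacks and the blue team replaces the evals that broke, so both sides improve over time.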
Alignment methods
Then apply different post-training methods: implement lots of different alignment methods, see how highly each one scores across the alignment evals we know are robust, and use this to find patterns in which types of method seem to work better or worse.
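A hedged sketch of that comparison step, assuming each post-training method has already produced a model and the robust evals return scalar scores (every name here is illustrative, not an existing API):

```python
# Illustrative sketch: score every post-trained variant on every robust eval,
# then average within method families to spot which *types* of method do better.

from collections import defaultdict
from statistics import mean

def score_matrix(variants, robust_evals):
    """variants: {method_name: model}; robust_evals: {eval_name: fn(model) -> float}."""
    return {
        method: {name: eval_fn(model) for name, eval_fn in robust_evals.items()}
        for method, model in variants.items()
    }

def family_averages(scores, family_of):
    """Average scores per method family (e.g. RLHF-like vs. constitutional-style)."""
    buckets = defaultdict(list)
    for method, per_eval in scores.items():
        buckets[family_of(method)].append(mean(per_eval.values()))
    return {family: mean(vals) for family, vals in buckets.items()}
```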
Theory
Then do theory work to try to figure out (or guess) why those methods work better or worse, and, based on this, make some hypotheses about new methods that will work better and new methods that will work worse.
Then rank the theory work based on how minimal its assumptions are, how well it predicts the results of the implementations, and how highly it scores in peer review.
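A minimal sketch of how that ranking could be computed, assuming each piece of theory work has already been scored on the three criteria; the fields and weights are placeholders, not a settled scheme:

```python
# Placeholder sketch: rank theory work by assumption count,
# predictive accuracy on the implementations, and peer-review score.

from dataclasses import dataclass

@dataclass
class TheoryWork:
    title: str
    num_assumptions: int        # fewer is better
    prediction_accuracy: float  # fraction of implementation results correctly predicted, in [0, 1]
    peer_review_score: float    # normalized to [0, 1]

def rank_theories(works, w_assump=0.4, w_pred=0.4, w_review=0.2):
    """Higher is better; the weights are arbitrary and should themselves be argued over."""
    def score(w: TheoryWork) -> float:
        assumption_term = 1.0 / (1 + w.num_assumptions)  # penalize each extra assumption
        return (w_assump * assumption_term
                + w_pred * w.prediction_accuracy
                + w_review * w.peer_review_score)
    return sorted(works, key=score, reverse=True)
```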
End goal:
Alignment evals that no one can find a way to reward hack/red team and that really precisely measure the preferences of the models.
Alignment methods that seem to score highly on these evals.
A very strong theoretical understanding of what makes the alignment method work and *why* it actually learns the preferences, plus theory for how those preferences will or won't scale to result in futures where everyone dies or lives. The theoretical work should have as few assumptions as possible. The aim is a mathematical proof with minimal assumptions, written very clearly and easy to understand, so that lots and lots of people can understand it and criticize it - robustness through obfuscation is a method of deception, intentional or not.
Current Work:
Hosting Alignment Evals hackathons and making Alignment Evals guides, to make better alignment evals and red teaming methods.
Making lots of Qwen model variants whose only difference is the post-training method.
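A hedged sketch of how that experiment grid could be pinned down so that only the post-training method varies; "Qwen/Qwen2.5-7B" is a real Hugging Face model ID, but the method list, dataset name, and hyperparameters are illustrative assumptions rather than the actual setup:

```python
# Illustrative config sketch: one run per post-training method,
# with base model, data, seed, and budget held fixed across runs.
# The method names, dataset, and hyperparameters are assumptions.

BASE_MODEL = "Qwen/Qwen2.5-7B"  # real Hugging Face model ID

SHARED = {
    "base_model": BASE_MODEL,
    "dataset": "shared-preference-dataset",  # same data for every run
    "seed": 0,
    "max_steps": 10_000,
}

POST_TRAINING_METHODS = ["sft", "dpo", "rlhf_ppo", "constitutional"]

def make_configs():
    """One config per method; only `method` and the run name differ."""
    return [{**SHARED, "method": m, "run_name": f"qwen-{m}"} for m in POST_TRAINING_METHODS]

if __name__ == "__main__":
    for cfg in make_configs():
        print(cfg["run_name"], "->", cfg["method"])
```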