many such cases
The reasons for this and what I'm going to do to make sure it doesn't happen again:
One:
- I promised that the first 300 applicants would be guaranteed personalized feedback, thinking I could delegate this to other, more technical members of the team.
However, it turned out that to give useful feedback and to judge whether someone was a good fit for the program, a person didn't just need technical knowledge - they needed good communication skills, an understanding of what alignment research requires, consistent availability for several hours a day, and the willingness to actually go through the applications. The only person who fit all those criteria at the time was me, so I couldn't delegate.
Also, a teammate built reviewing software which, he said, would help build a Bradley-Terry model of the applicants as we reviewed them. I had a feeling this might be overcomplicated, but I didn't want to say no or react negatively to someone's enthusiasm for doing free work for something I care about.
It turned out that constantly fixing, improving, and finagling with the software took several days, and it was faster to just do the reviews manually.
What I'll be doing next time to make sure this doesn't happen:
- Only promising feedback to the first 50 applicants.
- Having pre-prepared responses for the rest, with the general reason they weren't accepted - e.g. lack of sufficient maths experience, without software engineering/neuroscience/philosophy to compensate, meaning they would be unlikely to get useful alignment theory work done in 5 weeks.
- Doing things manually, not experimenting with custom software at the last minute.
- Announcing the program earlier - giving ourselves at least 3 months to prepare.
Two:
- Making the Research Guides for the different tracks turned out to be much, much, much harder than I thought it would be - including for other, more technical teammates. I thought making just a high-level guide would be relatively ok, but it turned out there was a huge amount of reading to do, plus lots of preliminary reading and maths learning needed just to understand that, and it was very hard. This also delayed the start of the Moonshot Alignment Program a lot.
What I'll be doing next time to make sure that this doesn't happen:
- Starting out with just a reading list and links to things like https://www.alignmentforum.org/s/n7qFxakSnxGuvmYAX, https://drive.google.com/file/d/1aKvftxhG_NL2kfG3tNmtM8a2y9Q1wFHb/view (source with comments: https://www.lesswrong.com/posts/PRwQ6eMaEkTX2uks3/infra-exercises-part-1), etc
- Personally learning a lot more math
- Having a draft of the reading list
- Asking for help from more experienced alignment researchers, such as Gurkenglas, the_gears_of_ascension, Lorxus, etc, earlier
Major changes I've made since:
- brought on a teammate who is looking to become a co-founder - very competent, well organized, and with a good technical foundation
- learnt more math and alignment philosophy
- brought a much more technical person onto the team (Gurkenglas), who is also teaching me a lot, pointing out lots and lots of flaws in my ideas, and updating me fast
- changed my management style at AI Plans - no longer holding weekly progress meetings or trying to manage everything myself on Linear or other task management tools; instead, just having a one-on-one call with each teammate once a week, to learn what they want to do, what problems they're having, and what's available for them to do, and to decide what they'll do
- moved to CEEALAR, much, much, much (1/2, total of 2.5) better for my stress, anxiety, insecurity, mental health, etc.
- from this, also gained friendships/contacts with more senior alignment researchers who I can and will ask for help
Daily meditation or reflection practice has something to offer on this front. So does the Quaker practice of silent worship. And so does the Jewish Sabbath.
My family's daily prayer, which we are meant to have showered for, and where we all pray together, also brings this. I'm taking this as a datapoint to continue it, even while I'm away from them and it's less convenient. Thanks.
Reduction of abstractions, I think.
Maybe reduce the meaning of the ban from "you can't reply to this person at all" to "you can only post one reply per article"? So you can state your objections, but you can't stay there and keep interacting with the author. When you are writing the reply, you are notified that this is the only one you get under this article.
Could be good, if combined with up to 3 total bans.
We're actually doing this with the Arbital Alignment articles
'AI control' seems like it just increases p(doom), right?
Since it obviously won't scale to AGI/ASI, it will just reduce the warning shots and the financial incentives to invest in safety research, make it more economically viable to have misaligned AIs, etc.
And there's buzztalk about using misaligned AIs to do alignment research, but the incentives to actually do so don't seem to be there, and the research itself doesn't seem to be happening - as in, research that actually gets closer to a mathematical proof of a method to align a superintelligence such that it won't kill everyone.
Thank you!! I think I'll use this as part of the projects section of an AI alignment course we (AI Plans) are making!!
I'm working on a meta plan for solving alignment, and I'd appreciate feedback & criticism please - the more precise the better. Feel free to use the emoji reactions if writing a reply you'd be happy with feels taxing.
Diagram for visualization - items in tables are just stand-ins, any ratings and numbers are just for illustration, not actual rankings or scores at this moment.

Red and Blue teaming Alignment evals
Make lots of red-teaming methods to reward hack alignment evals.
Use these to find actually useful alignment evals, then red team and reward hack those too, find better methods of reward hacking and red teaming, and then find better ways of doing alignment evals that resist those attacks and still reliably find the preferences of the model. Keep doing this constantly, to get better and better alignment evals and better and better reward hacking methods. Also host events for other people to do this and learn to do this.
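To make the loop concrete, here's a minimal sketch of the iteration structure in Python. Everything in it is hypothetical - `red_team`, `harden_eval`, and the attack functions are stand-ins for whatever reward-hacking methods and eval tooling we actually build, not existing code.

```python
# Hypothetical sketch of the evals <-> red-teaming loop described above.
# red_team and harden_eval are placeholders, not real tooling.

def red_team(eval_suite, attacks):
    """Try each reward-hacking attack; return the ones that break the eval."""
    return [attack for attack in attacks if attack(eval_suite)]

def harden_eval(eval_suite, successful_attacks):
    """Produce a new eval version patched against the attacks found (stub)."""
    return {**eval_suite,
            "version": eval_suite["version"] + 1,
            "patched_against": [a.__name__ for a in successful_attacks]}

def improve_evals(eval_suite, attacks, rounds=5):
    # Alternate between attacking the eval and patching it, recording which
    # attacks worked at each round so every eval version has an audit trail.
    history = []
    for _ in range(rounds):
        wins = red_team(eval_suite, attacks)
        history.append({"version": eval_suite["version"],
                        "broken_by": [a.__name__ for a in wins]})
        if not wins:  # no attack succeeded this round; stop here
            break
        eval_suite = harden_eval(eval_suite, wins)
    return eval_suite, history
```

The point is just the shape of the loop: attack, record what broke, patch, repeat - so every eval version carries the list of attacks it has survived.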
Alignment methods
Then apply the different post-training methods: implement lots of different alignment methods, see how highly each one scores across the alignment evals we know are robust, and use this to find patterns in which types of method seem to work better or worse.
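As a rough illustration of the "which methods score how highly" step: assuming the runs produce a table of (method, eval, score) rows, a summary like the one below is enough to start spotting patterns. The method names, eval names, and scores are all made up.

```python
import pandas as pd

# Hypothetical results: one row per (post-training method, alignment eval) pair.
# All names and scores here are placeholders, not real measurements.
results = pd.DataFrame([
    {"method": "dpo",  "eval": "honesty_v3",    "score": 0.71},
    {"method": "dpo",  "eval": "corrigibility", "score": 0.64},
    {"method": "rlhf", "eval": "honesty_v3",    "score": 0.58},
    {"method": "rlhf", "eval": "corrigibility", "score": 0.61},
])

# Methods ranked by mean score across the evals we currently trust.
print(results.groupby("method")["score"].mean().sort_values(ascending=False))

# Full method x eval matrix, to eyeball which method/eval pairs stand out.
print(results.pivot(index="method", columns="eval", values="score"))
```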
Theory
Then do theory work to try to figure out/guess why those methods work better or worse, and, based on this, make some hypotheses about new methods that will work better and new methods that will work worse.
Then rank the theory work based on how minimal its assumptions are, how well it predicts the implementation results, and how highly it scores in peer review.
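For the ranking step, even a simple weighted score over those three criteria would do as a first pass. A sketch with made-up weights and entries - in practice the criterion scores would come from reviewers, not from a script:

```python
# Hypothetical first-pass ranking of theory write-ups; weights and entries are
# placeholders chosen only to illustrate the shape of the ranking.
WEIGHTS = {"assumption_minimality": 0.4, "predictive_accuracy": 0.4, "peer_review": 0.2}

theories = [
    {"name": "theory_A", "assumption_minimality": 0.9, "predictive_accuracy": 0.6, "peer_review": 0.7},
    {"name": "theory_B", "assumption_minimality": 0.5, "predictive_accuracy": 0.8, "peer_review": 0.9},
]

def rank(entries, weights=WEIGHTS):
    # Weighted sum of the three criteria, highest combined score first.
    scored = [(sum(t[k] * w for k, w in weights.items()), t["name"]) for t in entries]
    return sorted(scored, reverse=True)

print(rank(theories))  # highest combined score first
```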
End goal:
Alignment evals that no one can find a way to reward hack/red team and that really precisely measure the preferences of the models.
Alignment methods that seem to score highly on these evals.
A very strong theoretical understanding of what makes that alignment method work - *why* it actually learns the preferences - and of how those preferences will or won't scale to result in futures where everyone dies or lives. The theoretical work should have as few assumptions as possible. The aim is a mathematical proof with minimal assumptions, written very clearly and easy to understand, so that lots and lots of people can understand it and criticize it - robustness through obfuscation is a method of deception, intentional or not.
Current Work:
Hosting Alignment Evals hackathons and making Alignment Evals guides, to make better alignment evals and red teaming methods.
Making lots of Qwen model versions whose only difference is the post-training method.
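A sketch of what "only difference is the post-training method" could look like as an experiment grid. The base checkpoint name assumes a Qwen2.5 model on Hugging Face; the dataset path, hyperparameters, and method list are placeholders:

```python
# Hypothetical experiment grid: everything is held fixed except the
# post-training method, so score differences on the evals can be attributed
# to the method rather than to data, seed, or hyperparameters.
BASE_MODEL = "Qwen/Qwen2.5-7B"               # assumed base checkpoint
SHARED = {
    "dataset": "data/preferences.jsonl",     # same preference data for every run
    "seed": 0,
    "epochs": 1,
    "learning_rate": 5e-6,
}
METHODS = ["sft", "dpo", "kto", "rlhf_ppo"]  # placeholder method names

runs = [{"run_name": f"qwen_{m}", "base_model": BASE_MODEL, "method": m, **SHARED}
        for m in METHODS]

for run in runs:
    print(run)  # in practice each config would be handed to the training pipeline
```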