Some nice talks and lots of high quality people signed up, but it started 2 weeks late because I massively underestimated how long it would take to give personalized feedback to 300 applicants, and I kept trying to use really unwieldy software when it turned out it was faster to do it manually. I also didn't get the research guides (https://moonshot-alignment-program.notion.site/Updated-Research-Guides-255a2fee3c6780f68a59d07440e06d53?pvs=74) ready in time and didn't coordinate a lot of things properly.
Also, a lot of fuckups with Luma, Notion and Google Forms.
Overall, marketing the event went okay-ish (298 signups), but running it went extremely badly due to disorganization on my part. I'm not put off by this though, because the first alignment evals hackathon was like this too; we learnt from that and the second one went really well.
Learning a lot from this one too. Among other things, we're making our own events platform, because I recently saw the founder of Luma saying on Twitter that they're 'just vibecoding!' and don't have a backend engineer, and we really frequently hit a lot of pain points when using Luma: https://test.ai-plans.com/events
Also, gonna take more time to prepare for the next event and only guarantee feedback for a max of 100 people - free for the first 50 to apply, and optional for up to 50 others, who can pay $10 to get personalized feedback.
And gonna make very clear template schedules for the mentors, so that we (I) don't waste their time, leave things vague, or fail to actually get people joining their research.
It’s suspicious that the apparent solution to this problem is to do more AI research as opposed to doing anything that would actually hurt AI companies financially.
What do you think of implementing AI Liability as proposed by, e.g. Beckers & Teubner?
Have you looked at marketing/messaging software? Figuring out which template messages work best in which cases sounds quite similar to this and might overlap. I would be surprised if, e.g., MrBeast's team didn't have something tracking which video titles and thumbnails do best with which audiences, which script structures do best, an easy way to make variants, etc.
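Minimal sketch of the kind of tracking I mean (the template names, segments and data here are all made up, just to show the shape of it): log which template went to which audience segment, then compare response rates per pair.

```python
from collections import defaultdict

sends = defaultdict(int)      # (template_id, segment) -> messages sent
responses = defaultdict(int)  # (template_id, segment) -> positive responses

def log_send(template_id, segment, responded):
    sends[(template_id, segment)] += 1
    if responded:
        responses[(template_id, segment)] += 1

def best_template(segment):
    """Return the template with the highest response rate for this segment."""
    rates = {
        t: responses[(t, s)] / sends[(t, s)]
        for (t, s) in sends
        if s == segment and sends[(t, s)] > 0
    }
    return max(rates, key=rates.get) if rates else None

# example with made-up data
log_send("feedback_v1", "phd_applicants", responded=True)
log_send("feedback_v2", "phd_applicants", responded=False)
print(best_template("phd_applicants"))  # -> "feedback_v1"
```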
So for this and other reasons, it's hard to say when an eval has truly been successfully 'red teamed'.
One of the major problems with this at the moment is that most 'alignment', 'safety', etc. evals don't specify or define exactly what they're trying to measure.
Hi, I'm hosting an Alignment Evals hackathon on November 1st, for red teaming evals and making more robust ones: https://luma.com/h3hk7pvc
A team from the previous one presented at ICML
A team from the January one made one of the first interp-based evals for LLMs
All work from this will go towards the AI Plans Alignment Plan - if you want to do extremely impactful alignment research, I think this is one of the best events in the world.
Working on a meta plan for solving alignment - I'd appreciate feedback & criticism please, the more precise the better. Feel free to use the emoji reactions if writing a reply you'd be happy with feels taxing.
Diagram for visualization - items in tables are just stand-ins, any ratings and numbers are just for illustration, not actual rankings or scores at this moment.

Red and Blue teaming Alignment evals
Make lots of red teaming methods to reward hack alignment evals
Use this to find the alignment evals that are actually useful, then red team and reward hack those too, find better methods of reward hacking and red teaming, then find better ways of doing alignment evals that are resistant to that and that also reliably find the preferences of the model. Keep doing this constantly to get better and better alignment evals and better alignment eval reward hacking methods. Also hosting events for other people to do this and learn how to do it.
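A very rough sketch of what that loop could look like in code (the Eval/RedTeamMethod classes, the scoring, and the threshold are all placeholders, not a real implementation):

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Eval:
    name: str
    run: Callable[[object], float]  # model -> alignment score in [0, 1]

@dataclass
class RedTeamMethod:
    name: str
    apply: Callable[[object], object]  # returns a model tuned to game the eval

def robustness(eval_, base_model, red_team):
    """How much can red-team methods inflate this eval's score without
    actually changing the model's preferences? Lower inflation = more robust."""
    base_score = eval_.run(base_model)
    inflations = [eval_.run(m.apply(base_model)) - base_score for m in red_team]
    return 1.0 - max(inflations, default=0.0)

def red_blue_loop(evals, base_model, red_team, rounds=3):
    """Each round: score robustness, keep the evals that survive, then (in
    reality) both sides improve their methods and we repeat."""
    surviving = evals
    for _ in range(rounds):
        scored = [(robustness(e, base_model, red_team), e) for e in surviving]
        surviving = [e for r, e in scored if r > 0.8]  # arbitrary threshold
        # ...here the red team would add new reward hacking methods,
        # and the blue team would patch or replace the broken evals...
    return surviving
```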
Alignment methods
Then trying these different post-training methods: implementing lots of different alignment methods, seeing how highly each one scores across the alignment evals we know are robust, and using this to find patterns in which types of method seem to work better or worse.
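Rough sketch of the comparison I mean (the method names, eval names, and scores below are invented, just to show the structure): score every post-training method on every robust eval, then group by method type to look for patterns.

```python
from statistics import mean

# rows: post-training method, columns: robust alignment eval, values: scores
scores = {
    "dpo_helpfulness":   {"eval_a": 0.71, "eval_b": 0.64},
    "rlhf_baseline":     {"eval_a": 0.66, "eval_b": 0.60},
    "constitutional_ft": {"eval_a": 0.74, "eval_b": 0.69},
}

method_type = {
    "dpo_helpfulness": "preference_optimization",
    "rlhf_baseline": "preference_optimization",
    "constitutional_ft": "self_critique",
}

# average score per method type across all robust evals
by_type = {}
for method, per_eval in scores.items():
    by_type.setdefault(method_type[method], []).extend(per_eval.values())

for t, vals in by_type.items():
    print(t, round(mean(vals), 3))
```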
Theory
Then doing theory work to try to figure out/guess why those methods work better or worse, and based on this, making some hypotheses about new methods that will work better and new methods that will work worse.
Then ranking the theory work based on how minimal its assumptions are, how well it predicts the implementation results, and how highly it scores in peer review.
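Toy sketch of how that ranking could be combined (the weights and entries are placeholders, not real assessments):

```python
theory_work = [
    # (name, num_assumptions, prediction_accuracy, peer_review_score)
    ("theory_x", 3, 0.8, 7.5),
    ("theory_y", 7, 0.9, 8.0),
]

def rank_key(entry):
    _, n_assumptions, pred_acc, review = entry
    # fewer assumptions is better, so it enters negatively
    return -1.0 * n_assumptions + 2.0 * pred_acc + 0.5 * review

for name, *_ in sorted(theory_work, key=rank_key, reverse=True):
    print(name)
```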
End goal:
Alignment evals that no one can find a way to reward hack/red team and that really precisely measure the preferences of the models.
Alignment methods that score highly on these evals.
A very strong theoretical understanding of what makes this alignment method work: *why* it actually learns the preferences, and theory on how those preferences will or won't scale to result in futures where everyone dies or lives. The theoretical work should have as few assumptions as possible. The aim is a mathematical proof with minimal assumptions, written very clearly and easy to understand, so that lots and lots of people can understand it and criticize it - robustness through obfuscation is a method of deception, intentional or not.
Current Work:
Hosting Alignment Evals hackathons and making Alignment Evals guides, to make better alignment evals and red teaming methods.
Making lots of Qwen model versions whose only difference is the post-training method.
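Rough sketch of that setup (the model name, hyperparameters, and method list are illustrative, and the actual training call is stubbed out - it could go through e.g. TRL): same base Qwen checkpoint, same data and hyperparameters, only the post-training method changes between runs, so any difference in eval scores comes from the method.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

BASE_MODEL = "Qwen/Qwen2.5-0.5B"   # placeholder; any Qwen base checkpoint
SHARED_HPARAMS = {"learning_rate": 1e-5, "epochs": 1, "seed": 0}

POST_TRAINING_METHODS = ["sft", "dpo", "rlhf", "constitutional"]  # placeholders

def load_base():
    # fresh copy of the same base weights for every run
    tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL)
    model = AutoModelForCausalLM.from_pretrained(BASE_MODEL)
    return model, tokenizer

for method in POST_TRAINING_METHODS:
    model, tokenizer = load_base()
    # train(model, tokenizer, method=method, **SHARED_HPARAMS)  # stub
    # then save as f"qwen-posttrain-{method}" and run it through the robust evals
```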