
Kabir Kumar

Running https://aiplans.org 

Working full-time on the alignment problem.

Comments (sorted by newest)
The Cats are On To Something
Kabir Kumar · 13h · 10

I think that, even among cynics, the idea that it's that bad is pretty rare.

Kabir Kumar's Shortform
Kabir Kumar · 3d* · 10

Poor autistic (or close enough) AI Safety nerds! Come learn how to make money with law! And it's not capabilities!

https://luma.com/8hv5n7t0 

Learn what it looks like to produce an 'expert report' that's actually useful to a law firm. 

(AI Safety researchers will get free entry)

It's a two-day law hackathon on how enterprise customers can negotiate better contracts with, or bring countersuits against, frontier AI companies. Right now, the standard OpenAI contract basically passes all liability onto customers, even for things they have no control over and that should really be OpenAI's responsibility.

Technical AI Safety people can give info on things like latent dangerous capabilities being easily elicitable from most major models, emergent misalignment, etc., and help companies save a lot of money. And by learning how to do that, and making contacts during the event, they can get into positions where they're paid to do this.

Generative AI is not causing YCombinator companies to grow more quickly than usual (yet)
Kabir Kumar · 9d · 40

I'd be very interested in whether this is due to the US economy as a whole being worse now than in 2009. Could we compare with the growth rate of AI companies in countries with stronger economies?

Kabir Kumar's Shortform
Kabir Kumar · 11d · 10

These are imperfect; I'd like feedback on them, please:
https://moonshot-alignment-program.notion.site/Proposed-Research-Guides-255a2fee3c6780f68a59d07440e06d53?pvs=74

Von Neumann's Fallacy and You
Kabir Kumar · 13d · 10

I'd also like a citation for this, please.

Banning Said Achmiz (and broader thoughts on moderation)
Kabir Kumar · 17d · 122

My best guess is that the usual ratio of "time it takes to write a critical comment" to "time it takes to respond to it to a level that will broadly be accepted well" is about 5x. This isn't in itself a problem in an environment with lots of mutual trust and trade, but in an adversarial context it means that it's easily possible to run a DDOS attack on basically any author whose contributions you do not like by just asking lots of questions, insinuating holes or potential missing considerations, and demanding a response, approximately independently of the quality of their writing. 

For related musings see the Scott Alexander classic Beware Isolated Demands For Rigor. 

Not strictly related to this post, but I'm glad you know this, and it makes me more confident in the future health of LessWrong as a discussion place.

Kabir Kumar's Shortform
Kabir Kumar · 20d · 31

I think this is easier to anonymize, with the exception of very specific things that people become famous for.

Kabir Kumar's Shortform
Kabir Kumar · 20d · 102

A solution I've come around to for this is retroactive funding. That is, if someone did something essentially without funding, and it led to outcomes that you would have funded or donated to had you known they were guaranteed, then donate to the person to encourage them to do it more.

Kabir Kumar's Shortform
Kabir Kumar · 20d · 30

my mum said to my little sister to take a break from her practice test for the eleven plus and come eat dinner, in the kitchen, with the rest of the family: my gran, me and her (dad is upstairs, he's rarely here for family dinner). my little sister, in a trembling voice, said 'but then dad will say '

mum sharply says to leave it and come eat dinner. she leaves the living room where my little sister is, goes to the kitchen. my little sister tries to shut off the lights in the living room; when the switch stutters, she beats her little hands on it in frustration.

mum hears her in the kitchen, goes to the living room and sharply scolds her 'what will happen if it breaks?? how much will it take to fix it!?'

in the kitchen, in hindi, which my little sister understands just a little bit, my nan lovingly calls my little sister her child, her love, her darling, invites her to sit across from her. 

1 minute later, dad comes down. sees my little sister in the kitchen. starts shouting. "HOW DARE YOU LEAVE YOUR WORK"

"THIS IS YOUR RESPONSIBILITY"

he yells at my mum in hindi that "YOU LOT (referring to my mum and also to me) HAVE MADE THIS A JOKE"

"HOW MANY QUESTIONS HAVE YOU DONE"

"THIS IS NOT EVEN HALF"

"I LEFT HALF AN HOUR AGO"

"STOP!"

"STOP SHAKING LIKE A FISH!"

my sister is sobbing and crying

"STOP CRYING"

"YOU WILL NOT GET FOOD TODAY"

"NO! YOU! WILL! NOT! GET! FOOD! TODAY!"

he goes back upstairs. 

my little sister is sobbing and crying in the living room, minutes later. 

mum is in the kitchen. she knows that if she goes to console my little sister in the living room then my dad will come downstairs and shout again but worse. 

she yells at my nan to be quiet. 

i know that if i go to console my little sister, not only will my dad come downstairs again and shout worse and do worse, he's likely to use this as the excuse to snap and kick me out of the house. and i'm broke, weak and powerless to do much other than write this and beg, hope that someone or that somehow i can change things. i can stop being broke and get a home, a house, where my little sister can be free.

Kabir Kumar's Shortform
Kabir Kumar · 20d · 10

just joined the call with one of the moonshot teams and i was actually basically an interruption, lol. felt so good to be completely unneeded there

Posts

19 · AI Safety Law-a-thon: Turning Alignment Risks into Legal Strategy · 5h · 0
6 · Truth · 13d · 0
71 · Directly Try Solving Alignment for 5 weeks · 2mo · 2
2 · Making progress bars for Alignment · 8mo · 0
20 · AI & Liability Ideathon · 9mo · 2
2 · Kabir Kumar's Shortform · 10mo · 155