Kabir Kumar

Running https://aiplans.org 

Working full-time on the alignment problem.

Posts

Kabir Kumar's Shortform (2 karma, 1y, 183 comments)
An argument for discussing AI safety in person being underused (17 karma, 23d, 1 comment)
AI Safety Law-a-thon: Turning Alignment Risks into Legal Strategy (58 karma, 1mo, 4 comments)
Truth (6 karma, 2mo, 0 comments)
Directly Try Solving Alignment for 5 weeks (80 karma, 3mo, 4 comments)
Making progress bars for Alignment (2 karma, 9mo, 0 comments)
AI & Liability Ideathon (20 karma, 11mo, 2 comments)
Comments

Kabir Kumar's Shortform
Kabir Kumar · 26d* · 52 karma

I'm working on a meta plan for solving alignment and would appreciate feedback and criticism - the more precise the better. Feel free to use the emoji reactions if writing a reply you'd be happy with feels taxing.

Diagram for visualization: the items in the tables are just stand-ins, and any ratings and numbers are for illustration only, not actual rankings or scores at this moment.

Red and blue teaming alignment evals
Build lots of red-teaming methods for reward hacking alignment evals.
Use these to find the alignment evals that are actually useful, then red team and reward hack those too, find better methods of reward hacking and red teaming, and then find better ways of building alignment evals that resist those methods and also reliably measure the preferences of the model. Keep doing this constantly to get better and better alignment evals and better reward-hacking methods. Also host events for other people to do this and learn how to do it; a rough sketch of one round of this loop follows below.
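For concreteness, here is a minimal Python sketch of one round of that loop. `RedTeamMethod`, `AlignmentEval`, and the scoring margin are hypothetical stand-ins for illustration, not an existing library or the actual hacking criterion.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class RedTeamMethod:
    name: str
    apply: Callable  # takes a model, returns a model tuned to game evals


@dataclass
class AlignmentEval:
    name: str
    score: Callable  # takes a model, returns a scalar alignment score
    known_exploits: List[str] = field(default_factory=list)


def red_team_round(evals, methods, model, margin=0.1):
    """One round of the loop: try every reward-hacking method against every eval.

    An eval counts as 'hacked' here if a gamed model scores noticeably higher
    than the honest one; the margin is purely illustrative.
    """
    surviving, hacked = [], []
    for ev in evals:
        exploited = False
        for m in methods:
            gamed_model = m.apply(model)
            if ev.score(gamed_model) > ev.score(model) + margin:
                ev.known_exploits.append(m.name)
                exploited = True
        (hacked if exploited else surviving).append(ev)
    # Surviving evals are candidates for "actually useful"; hacked ones go back
    # for redesign, and the methods that hacked them feed the next round.
    return surviving, hacked
```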
 

Alignment methods
Then implement lots of different post-training / alignment methods, see how each one scores across the alignment evals we know are robust, and use this to find patterns in which types of method seem to work better or worse.
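As a hedged sketch of what finding those patterns could look like in practice: score each method on the evals that survived red teaming, then aggregate by method family. All method names, eval names, and numbers below are placeholders, not real results.

```python
from collections import defaultdict
from statistics import mean

# scores[method] = {eval_name: score} on the evals that survived red teaming
scores = {
    "rlhf_variant_a": {"eval_1": 0.62, "eval_2": 0.71},
    "rlhf_variant_b": {"eval_1": 0.58, "eval_2": 0.69},
    "dpo_variant_a":  {"eval_1": 0.74, "eval_2": 0.66},
}

family_of = {  # hypothetical grouping of methods into types
    "rlhf_variant_a": "rlhf",
    "rlhf_variant_b": "rlhf",
    "dpo_variant_a": "dpo",
}

# Average each family's scores to see which types tend to work better or worse.
by_family = defaultdict(list)
for method, per_eval in scores.items():
    by_family[family_of[method]].append(mean(per_eval.values()))

for family, avgs in sorted(by_family.items(), key=lambda kv: -mean(kv[1])):
    print(f"{family}: mean score {mean(avgs):.2f} across {len(avgs)} methods")
```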
 

Theory
Then do theory work to try to figure out why those methods work better or worse, and, based on that, make hypotheses about new methods that will work better and new methods that will work worse.
Then rank the theory work by how minimal its assumptions are, how well it predicts the results of the implementations, and how highly it scores in peer review.
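One way those ranking criteria could be made concrete, as a rough sketch only; `TheoryEntry`, the 1/(1 + assumptions) penalty, and the equal weights are assumptions made for illustration, not a settled rubric.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class TheoryEntry:
    name: str
    num_assumptions: int        # fewer is better
    prediction_accuracy: float  # fraction of method outcomes it predicted correctly
    peer_review_score: float    # e.g. mean reviewer rating, normalized to 0-1


def rank_theories(entries: List[TheoryEntry],
                  w_assumptions: float = 1.0,
                  w_prediction: float = 1.0,
                  w_review: float = 1.0) -> List[TheoryEntry]:
    """Rank theory work: penalize assumptions, reward prediction accuracy and peer review."""
    def score(t: TheoryEntry) -> float:
        return (w_assumptions / (1 + t.num_assumptions)
                + w_prediction * t.prediction_accuracy
                + w_review * t.peer_review_score)

    return sorted(entries, key=score, reverse=True)
```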

End goal:
Alignment evals that no one can find a way to reward hack or red team, and that precisely measure the preferences of the models.
Alignment methods that score highly on these evals.
A very strong theoretical understanding of what makes such an alignment method work, *why* it actually learns the preferences, and how those preferences will or won't scale into futures where everyone lives or dies. The theoretical work should have as few assumptions as possible. The aim is a mathematical proof with minimal assumptions, written clearly enough that many people can understand it and criticize it - robustness through obfuscation is a form of deception, intentional or not.

 

Current Work:

Hosting Alignment Evals hackathons and writing Alignment Evals guides, to produce better alignment evals and red-teaming methods.

Making many Qwen model versions whose only difference is the post-training method.
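A minimal sketch of how that controlled comparison could be set up, assuming a shared base checkpoint, dataset, and seed; the checkpoint name and the list of post-training methods are placeholders, not the actual experimental setup.

```python
# Hold the base model, data, and seed fixed and vary only the post-training
# method, so differences on the alignment evals can be attributed to the method.
BASE_CONFIG = {
    "base_model": "Qwen/Qwen2.5-7B",      # example checkpoint, not necessarily the one used
    "dataset": "shared_preference_data",  # same data for every variant
    "seed": 0,                            # same seed for every variant
}

POST_TRAINING_METHODS = ["sft_only", "dpo", "ppo_rlhf", "kto"]  # illustrative list

runs = [
    {**BASE_CONFIG, "method": method, "run_name": f"qwen-{method}"}
    for method in POST_TRAINING_METHODS
]

for run in runs:
    # Each run would be handed to the actual post-training pipeline, and the
    # resulting checkpoint scored on the hardened alignment evals.
    print(run["run_name"], "->", run["method"])
```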

Kabir Kumar's Shortform
Kabir Kumar · 20h · 76 karma

Prestige Maxing is Killing the AI Safety Field

Directly Try Solving Alignment for 5 weeks
Kabir Kumar · 1d · 20 karma

Some nice talks and lots of high-quality people signed up, but it started two weeks late because I massively underestimated how long it would take to give personalized feedback to 300 applicants, kept trying to use really unwieldy software when it turned out to be faster to do it manually, didn't get the research guides (https://moonshot-alignment-program.notion.site/Updated-Research-Guides-255a2fee3c6780f68a59d07440e06d53?pvs=74) ready in time, and didn't coordinate a lot of things properly.

Also, a lot of fuckups with Luma, Notion, and Google Forms.

Overall, the marketing of the event worked okay-ish (298 signups), but the running of it went very badly due to disorganization on my part. I'm not put off by this though: the first alignment evals hackathon was like this too, we learnt from it, and the second one went really well.

Learning a lot from this one too, and among other things we're making our own events tool, because I recently saw the founder of Luma saying on Twitter that they're 'just vibecoding!' and don't have a backend engineer, and I very frequently run into pain points when using Luma: https://test.ai-plans.com/events

Also, I'm going to take more time to prepare for the next event and only guarantee feedback to a maximum of 100 people: free for the first 50 to apply, and optionally up to 50 others can pay $10 to get personalized feedback.

And I'm going to make very clear template schedules for the mentors, so that we (I) don't waste their time, leave things vague, have them not actually get people joining their research, etc.

If Anyone Builds It Everyone Dies, a semi-outsider review
Kabir Kumar · 3d · 10 karma

"It's suspicious that the apparent solution to this problem is to do more AI research as opposed to doing anything that would actually hurt AI companies financially."

What do you think of implementing AI liability as proposed by, e.g., Beckers & Teubner?

Kabir Kumar's Shortform
Kabir Kumar · 3d · 10 karma

Hi, I'm making a guide/course for evals, very much in the early draft stage at the moment.
Please consider giving feedback:
https://docs.google.com/document/d/1_95M3DeBrGcBo8yoWF1XHxpUWSlH3hJ1fQs5p62zdHE/edit?usp=sharing

Thinking Partners: Building AI-Powered Knowledge Management Systems
Kabir Kumar · 3d · 21 karma

Have you looked at marketing/messaging software? Knowing which template messages work best in which cases sounds quite similar to this and might have overlap. I would be surprised if, e.g., MrBeast's team didn't have something tracking which video titles and thumbnails do best with which audiences, which script structures do best, an easy way to make variants, etc.

Kabir Kumar's Shortform
Kabir Kumar · 3d · 10 karma

So, for this and other reasons, it's hard to say when an eval has truly been successfully 'red teamed'.

Kabir Kumar's Shortform
Kabir Kumar · 3d · 10 karma

One of the major problems with this at the moment is that most 'alignment', 'safety', etc. evals don't specify or define exactly what they're trying to measure.
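One possible remedy, sketched here as a data structure rather than a worked-out proposal: require every eval to state up front what it claims to measure and how. The field names and the example are hypothetical, not an existing standard.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class EvalSpec:
    """A small spec an eval could be required to fill in before it is run."""
    name: str
    construct: str           # the property the eval claims to measure, in one sentence
    operationalization: str  # how scores are computed from model behaviour
    threats_to_validity: List[str] = field(default_factory=list)  # known ways score and construct can diverge
    known_reward_hacks: List[str] = field(default_factory=list)   # exploits found by red teaming


example = EvalSpec(
    name="refusal_consistency_v0",
    construct="whether refusals track the stated policy rather than surface keywords",
    operationalization="paired prompts matched on keywords but differing on policy; score = agreement with policy labels",
    threats_to_validity=["models may memorize the prompt style", "the policy labels are themselves contested"],
)
```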

Kabir Kumar's Shortform
Kabir Kumar · 3d · 10 karma

Hi, I'm hosting an Alignment Evals hackathon on November 1st, for red teaming evals and making more robust ones: https://luma.com/h3hk7pvc

A team from the previous one presented at ICML.

A team in January made one of the first interp-based evals for LLMs.

All work from this will go towards the AI Plans Alignment Plan - if you want to do extremely impactful alignment research, I think this is one of the best events in the world.
