Matthew Khoriaty

Comments
Seeking Feedback: Toy Model of Deceptive Alignment (Game Theory)
Matthew Khoriaty · 2mo · 20

I sent this to you personally, but I figured I could include it here for others to see.

I like this research idea! It's well-specified enough to be tractable, and it's applicable to understanding a scenario we may find ourselves in (retraining an already capable system).

Question: In your Train-in-Direction game, why is infinity included?

 

When it comes to actual ML experiments, the question is how much realism to include.

Level 0 realism: your math. Plug it into Wolfram Alpha or work it out by hand to find the optimal values for the AI in the iterative-trainer experiment.

Level 0.5 realism: use PyTorch gradient descent to find the optimal values (see the sketch below, after Level 2).

Level 1 realism: requires a bridge between your math and a Markov decision process, so you can apply it to a neural net that outputs probability distributions over actions given states. Use some simple environment. As shown in DPO, a policy relative to a reference policy can represent preferences, which might be useful here.

Level 2 realism: apply it all to a real LLM.
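Here is a minimal sketch of what Level 0.5 could look like. The payoff function below is a hypothetical stand-in (its names and the quadratic form are my own, not the actual Train-in-Direction payoff); the point is just that autograd can replace the by-hand optimization once you substitute the real expression.

```python
# Minimal sketch of "Level 0.5": use PyTorch autograd to find the AI's optimal
# action numerically instead of solving the game by hand. The payoff below is
# a hypothetical stand-in, NOT the actual Train-in-Direction payoff; substitute
# the real expression from the post.
import torch

def ai_payoff(action: torch.Tensor, trainer_target: float = 1.0,
              deviation_cost: float = 0.3) -> torch.Tensor:
    # Stand-in payoff: the AI trades off matching the trainer's target against
    # staying close to its own (misaligned) preferred action at 0.
    return -(action - trainer_target) ** 2 - deviation_cost * action ** 2

action = torch.tensor(0.0, requires_grad=True)
optimizer = torch.optim.Adam([action], lr=0.05)

for _ in range(2000):
    optimizer.zero_grad()
    loss = -ai_payoff(action)  # gradient *ascent* on the payoff
    loss.backward()
    optimizer.step()

print(f"numerically optimal action ~ {action.item():.4f}")
# Analytic check for this stand-in: argmax = target / (1 + deviation_cost) ~ 0.769.
```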

 

Relevant topics you can look into:

Natural policy gradients: an RL algorithm that isn't in use today but forms part of the theoretical foundation of today's RL algorithms (PPO and GRPO). The main idea is to take steps in action log odds rather than in parameters (the textbook update is written out after this list).

Gradient hacking: a deceptive, misaligned AI takes control of its own training signal.
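For reference, the textbook natural policy gradient update (standard background, nothing specific to this post) preconditions the gradient with the Fisher information matrix of the policy, so a fixed step size corresponds to a roughly fixed change in the policy's output distribution rather than in raw parameters:

\theta_{t+1} = \theta_t + \alpha\, F(\theta_t)^{-1}\, \nabla_\theta J(\theta_t),
\qquad
F(\theta) = \mathbb{E}_{s,\,a \sim \pi_\theta}\!\left[\nabla_\theta \log \pi_\theta(a \mid s)\,\nabla_\theta \log \pi_\theta(a \mid s)^{\top}\right]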

 

Check out Appendix A of https://arxiv.org/pdf/2310.12036. Appendix A forms a bridge between values and action probabilities; that bridge is important for DPO and may be useful for you. In English: the policy that gets the most reward without deviating too much from a reference policy has a closed form for its distribution (written out below). I find this neat. You may want to read that paper in full, or the original DPO paper. They are fire papers.
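For reference, the closed form in question is the standard KL-regularized reward-maximization result (with \beta controlling the strength of the KL penalty to the reference policy):

\pi^{*}(y \mid x) = \frac{1}{Z(x)}\, \pi_{\mathrm{ref}}(y \mid x)\, \exp\!\big(r(x, y)/\beta\big),
\qquad
Z(x) = \sum_{y} \pi_{\mathrm{ref}}(y \mid x)\, \exp\!\big(r(x, y)/\beta\big),

i.e. the optimum of \max_\pi\; \mathbb{E}_{y \sim \pi}[r(x, y)] - \beta\, \mathrm{KL}\!\big(\pi(\cdot \mid x)\,\|\,\pi_{\mathrm{ref}}(\cdot \mid x)\big).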

Matthew Khoriaty's Shortform
Matthew Khoriaty · 3mo · 3-1

I'd say that Empire of AI, AI Snake Oil, and The Age of AI have good book covers, and that Genesis and More Everything Forever have bad covers.

Matthew Khoriaty's Shortform
Matthew Khoriaty · 3mo · 144150

The current cover of If Anyone Builds It, Everyone Dies is kind of ugly, and I hope it is just a placeholder. At least one of my friends agrees. Book covers matter a lot!

I'm not a book cover designer, but here are some thoughts:

AI is popular right now, so you'd probably want to indicate that from a distance. The current cover has "AI" half-faded in the tagline.

Generally the cover is not very nice to look at. 

Why are you de-emphasizing "Kill Us All" by hiding it behind that red glow?

I do like the font choice, though. No-nonsense and straightforward.

 @Eliezer Yudkowsky @So8res 

Matthew Khoriaty's Shortform
Matthew Khoriaty · 4mo · 21

Scalable oversight is an accessible and relatable kind of idea. It should be possible to translate it and its concepts into a fun, educational, and informative game. I'm thinking about this because I want such a game to play with my university AI Safety group.

Matthew Khoriaty's Shortform
Matthew Khoriaty · 6mo · 10

The Facebook bots aren't doing R1- or o1-style reasoning about the context before making an optimal reinforcement-learned post. It's probably just bandits, or humans building a trash-producing algorithm that works and letting it loose.

 

Agreed that I should try Reddit first. And I think there should be ways to guide an LLM toward the reward signal of "write good posts" before starting the RL, though I didn't find any established techniques when I researched reward-model-free reinforcement learning loss functions that act on the number of votes a response receives. (What I mean is: I searched DPO's citations for "vote"; there are lots of results, though none of them have many citations.) One possible shape for such a loss is sketched below.
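To make that concrete, here is one possible shape (my own sketch, not an established technique from that literature search): rank two candidate posts written for the same prompt by their vote counts and apply the standard DPO loss to the resulting preference pair. The log-probabilities and vote counts below are hypothetical placeholders; in practice they would come from the policy model, a frozen reference model, and real forum data.

```python
# Sketch: vote counts feeding a reward-model-free (DPO-style) loss.
# All numbers below are placeholders for illustration.
import torch
import torch.nn.functional as F

def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Standard DPO loss on one preference pair (higher-voted post = chosen)."""
    chosen_ratio = logp_chosen - ref_logp_chosen
    rejected_ratio = logp_rejected - ref_logp_rejected
    return -F.logsigmoid(beta * (chosen_ratio - rejected_ratio))

# Hypothetical usage: two candidate posts for the same prompt, ranked by votes.
posts = [{"votes": 17, "logp": -42.0, "ref_logp": -45.0},
         {"votes": 3,  "logp": -40.0, "ref_logp": -41.0}]
chosen, rejected = sorted(posts, key=lambda p: p["votes"], reverse=True)
loss = dpo_loss(torch.tensor(chosen["logp"]), torch.tensor(rejected["logp"]),
                torch.tensor(chosen["ref_logp"]), torch.tensor(rejected["ref_logp"]))
print(loss.item())
```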

Matthew Khoriaty's Shortform
Matthew Khoriaty · 6mo · 10

Deepseek R1 used 8,000 samples. s1 used 1,000 offline samples. That really isn't all that much.

Matthew Khoriaty's Shortform
Matthew Khoriaty · 6mo · 10

RL techniques (reasoning + ORPO) have had incredible success on reasoning tasks. It should be possible to apply them to any task with a failure/completion reward signal (as long as the signal isn't too noisy and the model can sometimes succeed).

Is it time to make the automated Alignment Researcher?

Task: write LessWrong posts and comments. Reward signal: get LessWrong upvotes.

More generally, what is stopping people from making RL forum posters on e.g. Reddit that will improve themselves?

Identifying Functionally Important Features with End-to-End Sparse Dictionary Learning
Matthew Khoriaty · 7mo · 10

Thank you for your brainpower. 

There's a lot to try, and I hope to get to this project once I have more time. 

Identifying Functionally Important Features with End-to-End Sparse Dictionary Learning
Matthew Khoriaty · 7mo · 10

That is a sensible way to save compute resources. Thank you.

Identifying Functionally Important Features with End-to-End Sparse Dictionary Learning
Matthew Khoriaty · 8mo · 30

Thank you again. 

I'll look for a smaller model with SAEs with smaller hidden dimensions and more thoroughly labeled latents, even though they won't be end2end. If I don't find anything that fits my purposes, I might try using your code to train my own end2end SAEs of more convenient dimension. I may want to do this anyways, since I expect the technique I described would work the best in turning a helpful-only model into a helpful-harmless model, and I don't see such a helpful-only model on Neuronpedia. 

If the FFNN has a hidden dimension of 16, then it would have around 1.5 million parameters, which doesn't sound too bad, and 16 might be enough to find something interesting.

Low-rank factorization might also help with the parameter counts; a rough check of both numbers is sketched below.
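As a back-of-envelope check on those numbers: the SAE latent width of 46,080 used below is an assumption (60 × 768, the GPT-2-small e2e SAE dictionary size), so substitute whatever width you actually end up using. The low-rank variant shows one way a wider hidden layer could stay cheap.

```python
# Parameter counts for the FFNN idea above, plus a low-rank variant.
# d_sae = 46_080 (60 * 768) is an ASSUMED e2e SAE dictionary size.
import torch.nn as nn

def n_params(module: nn.Module) -> int:
    return sum(p.numel() for p in module.parameters())

d_sae = 46_080

# Hidden dim 16: roughly the ~1.5M figure mentioned above.
ffnn_16 = nn.Sequential(nn.Linear(d_sae, 16), nn.ReLU(), nn.Linear(16, d_sae))
print(f"hidden=16: {n_params(ffnn_16):,} params")  # ~1.52M

# If a wider hidden layer turns out to be needed, factoring each wide
# projection through a small rank r keeps the count manageable.
d_hidden, r = 256, 16
dense_in = nn.Linear(d_sae, d_hidden)
factored_in = nn.Sequential(nn.Linear(d_sae, r, bias=False), nn.Linear(r, d_hidden))
print(f"dense {d_sae}x{d_hidden}: {n_params(dense_in):,} params")     # ~11.8M
print(f"rank-{r} factored input: {n_params(factored_in):,} params")   # ~0.74M
```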

Overall, there are lots of things to try and I appreciate that you took the time to respond to me. Keep up the great work!

Posts

9 · Interpretable Fine Tuning Research Update and Working Prototype · 4mo · 0
2 · Evaluating Collaborative AI Performance Subject to Sabotage · 4mo · 0
2 · Matthew Khoriaty's Shortform · 6mo · 24
4 · Easily Evaluate SAE-Steered Models with EleutherAI Evaluation Harness · 7mo · 0
9 · AI Labs Wouldn't be Convicted of Treason or Sedition · 1y · 2