Alex Boche

Economics PhD Student

Comments

Lessons from the Iraq War for AI policy
Alex Boche · 3mo

FWIW, I listened to the following book about the war and it seemed quite good. I think it's the second book in a series.

"The Endgame: The Inside Story of the Struggle for Iraq, from George W. Bush to Barack Obama" by Bernard E. Trainor and Michael R. Gordon.

Seeking Feedback: Toy Model of Deceptive Alignment (Game Theory)
Alex Boche · 4mo

Thanks!

Seeking Feedback: Toy Model of Deceptive Alignment (Game Theory)
Alex Boche · 5mo

Thanks for this! 


> I'm not sure that I understand the distinction between the vector and point approaches that you've discussed.


This is really a distinction within the math of my model itself, as described above. Both are kind of an attempt to capture how retraining works in a highly "reduced-form" way that abstracts from the details. 

As for how to interpret each in terms of real training:
You might consider an RLHF-style setup. The train-in-a-direction approach might be something like telling your human evaluators to place a bit more weight on helpfulness (vs. harmlessness) than they did last time (hence a "directional" adjustment). The train-to-desired-point approach would be something like giving the human evaluators a rubric for the exact balance that you want (hence training towards this balance, wherever you started from). But these interpretations are imperfect.
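
As a minimal sketch of how the two abstractions differ (purely illustrative: the function names, step sizes, and the scalar "theta" below are stand-ins, not the model's actual math):

```python
# Illustrative sketch: treat the "type" as a single number theta
# (e.g. the weight placed on helpfulness). All names and values are made up.

def train_in_direction(theta, direction, step_size):
    # Directional adjustment: nudge the type a fixed amount along a chosen direction.
    return theta + step_size * direction

def train_to_point(theta, target, rate):
    # Point adjustment: pull the type part of the way toward a desired target,
    # regardless of where it started.
    return theta + rate * (target - theta)

theta = 0.2
print(train_in_direction(theta, direction=+1.0, step_size=0.1))  # 0.3
print(train_to_point(theta, target=0.7, rate=0.5))               # 0.45
```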

Seeking Feedback: Toy Model of Deceptive Alignment (Game Theory)
Alex Boche · 5mo

Just to be clear, I mean "types" in the game theory sense (i.e. a [privately-known] attribute of a player that determines its preferences) not the CS/logic sense. The type space doesn't necessarily capture a literal subspace within a neural network's weights; I think of it more as a space measuring some human-interpretable property of the AI.

As a mundane (and very imperfect) example, we might think of the type space as a one-dimensional continuum of how much the AI values helpfulness vis-à-vis harmlessness. [Is that 1 dimension or 2 non-orthogonal directions?] How would we increase (or decrease) the type in the direction of helpfulness? I give two approaches to doing so within a (roughly) RLHF paradigm.

1) We might simply ask the human raters to increase the weight they put on helpfulness when they make their choices/rankings, and then train the AI (using RL) to match the choice probabilities derived from those human choices. [Maybe that's more like training to a point rather than in a direction?]

2) Or we could train auxiliary models to separately rate the helpfulness and harmlessness of responses based on human ratings thereof, then put those into a logit stochastic choice model like softmax(a_1 * helpfulness + a_2 * harmlessness), and finally train the main AI to match those choice probabilities. To move the AI's type upwards (towards more helpfulness), we could increase the parameter a_1 and then use the resulting logit choice probabilities to retrain the main AI (using RL).
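
As a rough sketch of the logit choice model in 2) (illustrative only: the ratings below are made-up stand-ins for the auxiliary models' outputs on a few candidate responses):

```python
import numpy as np

def logit_choice_probs(helpfulness, harmlessness, a_1, a_2):
    # softmax(a_1 * helpfulness + a_2 * harmlessness) over candidate responses.
    scores = a_1 * np.asarray(helpfulness) + a_2 * np.asarray(harmlessness)
    scores = scores - scores.max()          # subtract max for numerical stability
    exp_scores = np.exp(scores)
    return exp_scores / exp_scores.sum()

helpfulness = [0.9, 0.4, 0.1]    # auxiliary helpfulness model's ratings (made up)
harmlessness = [0.2, 0.8, 0.9]   # auxiliary harmlessness model's ratings (made up)

# Increasing a_1 shifts probability mass toward the more helpful responses;
# the resulting probabilities would be the retraining target for the main AI.
print(logit_choice_probs(helpfulness, harmlessness, a_1=1.0, a_2=1.0))
print(logit_choice_probs(helpfulness, harmlessness, a_1=3.0, a_2=1.0))
```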

Does that answer your question? Thanks!

Announcement: Learning Theory Online Course
Alex Boche · 7mo

Looking forward to seeing those if/when you publish them!
