Alex Boche

Economics PhD Student

Comments

Lessons from the Iraq War for AI policy
Alex Boche · 3mo

FWIW, I listened to the following book about the war and it seemed quite good. I think it's the second book in a series.

"The Endgame: The Inside Story of the Struggle for Iraq, from George W. Bush to Barack Obama" by Bernard E. Trainor and Michael R. Gordon.

Seeking Feedback: Toy Model of Deceptive Alignment (Game Theory)
Alex Boche · 4mo

Thanks!

Seeking Feedback: Toy Model of Deceptive Alignment (Game Theory)
Alex Boche · 5mo

Thanks for this! 


> I'm not sure that I understand the distinction between the vector and point approaches that you've discussed.


This is really a distinction within the math of my model itself, as described above. Both are kind of an attempt to capture how retraining works in a highly "reduced-form" way that abstracts from the details. 

As for how to interpret each in terms of real training:
You might consider an RLHF-style setup. The train-in-a-direction approach might be something like telling your human evaluators to place a bit more weight on helpfulness (vs. harmlessness) than they did last time (hence a "directional" adjustment). The train-to-desired-point approach would be something like giving the human evaluators a rubric for the exact balance that you want (hence training towards this balance, wherever you started from). But these interpretations are imperfect.
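
As a minimal sketch of how the two abstractions differ (purely illustrative: the function names, step sizes, and the scalar "theta" below are stand-ins, not the model's actual math):

```python
# Illustrative sketch: treat the "type" as a single number theta
# (e.g. the weight placed on helpfulness). All names and values are made up.

def train_in_direction(theta, direction, step_size):
    # Directional adjustment: nudge the type a fixed amount along a chosen direction.
    return theta + step_size * direction

def train_to_point(theta, target, rate):
    # Point adjustment: pull the type part of the way toward a desired target,
    # regardless of where it started.
    return theta + rate * (target - theta)

theta = 0.2
print(train_in_direction(theta, direction=+1.0, step_size=0.1))  # 0.3
print(train_to_point(theta, target=0.7, rate=0.5))               # 0.45
```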

Seeking Feedback: Toy Model of Deceptive Alignment (Game Theory)
Alex Boche · 5mo

Just to be clear, I mean "types" in the game theory sense (i.e. a [privately-known] attribute of a player that determines its preferences) not the CS/logic sense. The type space doesn't necessarily capture a literal subspace within a neural network's weights; I think of it more as a space measuring some human-interpretable property of the AI.

As a mundane (and very imperfect) example, we might think of the type space as a one-dimensional continuum of how much the AI values helpfulness vis-à-vis harmlessness. [Is that 1 dimension or 2 non-orthogonal directions?] How would we increase (or decrease) the type in the direction of helpfulness? I give two approaches to doing so within a (roughly) RLHF paradigm.

1) We might simply ask the human raters to increase the weight they put on helpfulness when they make their choices/rankings, and then train the AI (using RL) to match the choice probabilities derived from those human choices. [Maybe that's more like training to a point rather than in a direction?]

2) Or we could train auxiliary models to separately rate the helpfulness and harmlessness of responses based on human ratings thereof, then put those into a logit stochastic choice model like softmax(a_1 * helpfulness + a_2 * harmlessness), and finally train the main AI to match those choice probabilities. To move the AI's type upwards (towards more helpfulness), we could increase the parameter a_1 and then use the resulting logit choice probabilities to retrain the main AI (using RL).
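
As a rough sketch of the logit choice model in 2) (illustrative only: the ratings below are made-up stand-ins for the auxiliary models' outputs on a few candidate responses):

```python
import numpy as np

def logit_choice_probs(helpfulness, harmlessness, a_1, a_2):
    # softmax(a_1 * helpfulness + a_2 * harmlessness) over candidate responses.
    scores = a_1 * np.asarray(helpfulness) + a_2 * np.asarray(harmlessness)
    scores = scores - scores.max()          # subtract max for numerical stability
    exp_scores = np.exp(scores)
    return exp_scores / exp_scores.sum()

helpfulness = [0.9, 0.4, 0.1]    # auxiliary helpfulness model's ratings (made up)
harmlessness = [0.2, 0.8, 0.9]   # auxiliary harmlessness model's ratings (made up)

# Increasing a_1 shifts probability mass toward the more helpful responses;
# the resulting probabilities would be the retraining target for the main AI.
print(logit_choice_probs(helpfulness, harmlessness, a_1=1.0, a_2=1.0))
print(logit_choice_probs(helpfulness, harmlessness, a_1=3.0, a_2=1.0))
```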

Does that answer your question? Thanks!

Announcement: Learning Theory Online Course
Alex Boche · 7mo

Looking forward to seeing those if/when you publish them!
