Alex Boche
Alex Boche has not written any posts yet.

Thanks!
Thanks for this!
I'm not sure that I understand the distinction between the vector and point approaches that you've discussed.
This is really a distinction within the math of my model itself, as described above. Both are kind of an attempt to capture how retraining works in a highly "reduced-form" way that abstracts from the details.
As for how to interpret each in terms of real training:
You might consider an RLHF-style setup. The train-in-a-direction might be something like telling your human evaluators to place a bit more weight on helpfulness (vs. harmlessness) than they did last time (hence a "directional" adjustment). The train-to-desired-point would be something like giving the human evaluators a rubric for the exact balance that you want (hence training towards this balance, wherever you started from). But these interpretations are imperfect.
Just to be clear, I mean "types" in the game theory sense (i.e. a [privately-known] attribute of a player that determines its preferences) not the CS/logic sense. The type space doesn't necessarily capture a literal subspace within a neural network's weights; I think of it more as a space measuring some human-interpretable property of the AI.
As a mundane (and very imperfect) example, we might think of the type space as a one-dimensional continuum of how much the AI values helpfulness vis-à-vis harmlessness. [Is that one dimension or two non-orthogonal directions?] How would we increase (or decrease) the type in the direction of helpfulness? I give two approaches to doing so within...
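As a purely illustrative sketch (not part of the model itself; the function names, step sizes, and the choice to make the type a single number are all my own assumptions), the two approaches might look like two different update rules on a one-dimensional type:

```python
# Toy sketch of the two reduced-form retraining rules discussed above.
# theta is the AI's "type": here, a single number measuring how much
# it weights helpfulness vs. harmlessness. Hypothetical illustration only.

def train_in_direction(theta, direction, step):
    """Directional adjustment: nudge the type a fixed step along a direction,
    wherever it currently is (e.g. 'weight helpfulness a bit more')."""
    return theta + step * direction

def train_to_point(theta, target, rate):
    """Point adjustment: move the type a fraction of the way toward a
    desired target balance, regardless of the starting point."""
    return theta + rate * (target - theta)

theta = 0.2  # current weight on helpfulness
nudged = train_in_direction(theta, direction=1.0, step=0.1)
pulled = train_to_point(theta, target=0.5, rate=0.5)
```

Note the qualitative difference: repeated directional updates drift indefinitely, while repeated point updates converge to the target from any starting type.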
Looking forward to seeing those if/when you publish them!
FWIW, I listened to the following book about the war and it seemed quite good. I think it's the second book in a series about the war.
"The Endgame: The Inside Story of the Struggle for Iraq, from George W. Bush to Barack Obama" by Bernard E. Trainor and Michael R. Gordon.