asarvazyan
Comments

Sorted by
Newest
No posts to display.
No wikitag contributions to display.
Interpretable by Design - Constraint Sets with Disjoint Limit Points
asarvazyan · 6mo

I'm a little sad that much of safety research has fully pivoted to post-hoc explanations of frontier Shoggoths. I think there's probably low hanging fruit to grow an easier to understand Shoggoth, even if it's not with a simplex :).

 

I agree. I'm pretty new to the field and was surprised to see few recent attempts to build interpretable models from the ground up.

Natural, Axis-Aligned Bases. The bases where a single element is 1 and the rest are 0 explicitly define our "corners" and correspond directly to "interpretable" points of our set. These are points where all other dimensions are "off", and the only forward contribution comes from a single dimension. This also means that every element in the simplex is a linear, convex combination of the basis elements.

Assuming there are ways to design NN layers that are naturally restricted to the simplex, would this mean that to interpret d types of behaviors, one would have to decide d a priori and train a model with a d-dimensional simplex?
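To make the question concrete, here is a minimal sketch. It assumes softmax is how a layer gets restricted to the simplex, which is my assumption and not something the post specifies. The value of `d` and the logits are made up for illustration. It checks that the output lies on the simplex and that it is a convex combination of the one-hot corners, with `d` fixed before anything else is built.

```python
import math

def softmax(logits):
    """Map arbitrary real logits to a point on the probability simplex."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def one_hot(i, d):
    """The i-th axis-aligned basis vector (a "corner" of the simplex)."""
    return [1.0 if j == i else 0.0 for j in range(d)]

d = 4  # hypothetical number of behaviors, chosen up front
logits = [2.0, -1.0, 0.5, 0.0]  # made-up pre-activations of length d
p = softmax(logits)

# Rebuild p as a convex combination of the corners, using p[i] as weights.
rebuilt = [sum(p[i] * one_hot(i, d)[j] for i in range(d)) for j in range(d)]

# Simplex membership: non-negative entries that sum to 1.
on_simplex = all(x >= 0 for x in p) and math.isclose(sum(p), 1.0)
matches = all(math.isclose(a, b) for a, b in zip(p, rebuilt))
print(on_simplex, matches)
```

In this sketch the number of corners is baked into the layer width, which is the constraint I'm asking about.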
