Interpretable by Design - Constraint Sets with Disjoint Limit Points
asarvazyan · 4mo

> I'm a little sad that much of safety research has fully pivoted to post-hoc explanations of frontier Shoggoths. I think there's probably low hanging fruit to grow an easier to understand Shoggoth, even if it's not with a simplex :).

 

I agree. I'm pretty new to the field and was surprised to see few recent attempts to build interpretable models from the ground up.

> Natural, Axis-Aligned Bases. The bases where a single element is 1 and the rest are 0 explicitly define our "corners" and correspond directly to "interpretable" points of our set. These are points where all other dimensions are "off", and the only forward contribution comes from a single dimension. This also means that every element in the simplex is a linear, convex combination of the basis elements.

Assuming there are ways to design NN layers that are naturally restricted to the simplex, would this mean that to interpret d types of behavior, one has to fix d a priori and train the model with a simplex that has d corners, one per behavior?
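To make the question concrete, here is a minimal sketch of what I have in mind. It assumes a plain softmax as the map onto the simplex, which may not be the construction the post intends; the value of `d`, the logits, and the helper names `one_hot` and `convex_combination` are all hypothetical and only there for illustration. It shows that the output width `d` has to be chosen up front, and that the resulting point is exactly a convex combination of the one-hot corners.

```python
import math

def softmax(logits):
    """Map any real vector to a point on the probability simplex."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def one_hot(i, d):
    """The i-th axis-aligned corner of the simplex in R^d."""
    return [1.0 if j == i else 0.0 for j in range(d)]

def convex_combination(weights, points):
    """Compute sum_i weights[i] * points[i], coordinate by coordinate."""
    d = len(points[0])
    return [sum(w * p[j] for w, p in zip(weights, points)) for j in range(d)]

d = 4  # hypothetical number of behaviors, fixed before training
p = softmax([2.0, -1.0, 0.5, 0.0])  # arbitrary illustrative logits of width d

# The simplex point is recovered by weighting the corners with its own entries.
corners = [one_hot(i, d) for i in range(d)]
rebuilt = convex_combination(p, corners)
```

One caveat with this particular choice: softmax outputs are strictly positive, so the corners themselves are only approached in the limit of large logits and are never hit exactly.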
