asarvazyan — LessWrong

asarvazyan

4y

Interpretable by Design - Constraint Sets with Disjoint Limit Points

I'm a little sad that much of safety research has fully pivoted to post-hoc explanations of frontier Shoggoths. I think there's probably low-hanging fruit in growing an easier-to-understand Shoggoth, even if it's not with a simplex :).

I agree. I'm pretty new to the field and was surprised to see few recent attempts to build interpretable models from the ground up.

Natural, Axis-Aligned Bases. The bases where a single element is 1 and the rest are 0 explicitly define our "corners" and correspond directly to "interpretable" points of our set. These are poi

...