LESSWRONG

Ronak_Mehta
38520

ML PhD, working on automating alignment research. Trying to be better about "just sending it".

coordinal.org ronakrm.github.io

Comments

Interpretable by Design - Constraint Sets with Disjoint Limit Points
Ronak_Mehta · 2mo · 10

I think this is roughly right. I think of it more as each layer being a permutation, with compositions of these permutations giving you the complex behaviors (that break down in these nice ways). As a starting point, setting the hidden/model dimension equal to the input and output dimension would allow a "reasonable" first interpretation: the model uses convex combinations of your discrete vocabulary to compose behaviors and come up with a prediction for the output. Intermediate layers could then map directly to your vocab space (this won't be true by default, though; you'd still need some sort of diagonalized prior or similar so that each basis direction corresponds to an input vocab token).
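To make the picture above concrete, here is a toy sketch in plain Python. It is not the construction from the post: the four-token vocabulary, the two permutations, and the helper names (`perm_matrix`, `mix`, `apply_layer`) are all made up for illustration, and it simply assumes the hidden dimension equals the vocab size so every intermediate state can be read off as weights over tokens.

```python
# Toy assumption: 4-token vocabulary, hidden dimension == vocab size.
VOCAB = ["a", "b", "c", "d"]

def perm_matrix(perm):
    """Row i has a 1 in column perm[i]: token i is sent to token perm[i]."""
    n = len(perm)
    return [[1.0 if perm[i] == j else 0.0 for j in range(n)] for i in range(n)]

def mix(mats, weights):
    """Convex combination of same-sized matrices (weights >= 0, summing to 1)."""
    n = len(mats[0])
    return [[sum(w * m[i][j] for m, w in zip(mats, weights)) for j in range(n)]
            for i in range(n)]

def apply_layer(x, layer):
    """Row vector x times the matrix `layer`."""
    n = len(layer)
    return [sum(x[i] * layer[i][j] for i in range(n)) for j in range(n)]

def one_hot(token):
    return [1.0 if t == token else 0.0 for t in VOCAB]

shift = perm_matrix([1, 2, 3, 0])   # a->b, b->c, c->d, d->a
swap_cd = perm_matrix([0, 1, 3, 2]) # a->a, b->b, c<->d

# Two pure-permutation layers compose into another permutation,
# so the hidden state stays a one-hot vocab vector: a -> b -> b.
h = apply_layer(apply_layer(one_hot("a"), shift), swap_cd)
decoded = VOCAB[h.index(max(h))]

# A "soft" layer: a convex combination of the two permutations.
# Its output is a convex combination of vocab one-hots, readable per token.
soft = mix([shift, swap_cd], [0.75, 0.25])
p = apply_layer(one_hot("a"), soft)
readout = dict(zip(VOCAB, p))
```

Here `readout` assigns weight 0.75 to "b" and 0.25 to "a", so the intermediate state is directly interpretable as a mixture over tokens. In a trained model nothing forces each basis direction to line up with a token, which is the caveat about needing a diagonalized prior.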

Ronak_Mehta's Shortform
Ronak_Mehta · 7mo · 63

Do you have a good estimate of what is and what will be possible with massively scaled up inference-time compute over the next 3 months? 6 months? Are you thinking about how this will affect others' priorities? Resource allocation? Governance and policy?

IMO having good answers to these questions feels super important for prioritizing where you spend your time.

Posts

Appendix: Interpretable by Design - Constraint Sets with Disjoint Limit Points (2 karma, 4mo, 0 comments)
Interpretable by Design - Constraint Sets with Disjoint Limit Points (24 karma, 4mo, 2 comments)
Ronak_Mehta's Shortform (2 karma, 7mo, 1 comment)
GPT-2 Sometimes Fails at IOI (13 karma, 1y, 0 comments)
A Bit For You (0 karma, 1y, 0 comments)