Logan Riggs

Wiki Contributions

Comments

How likely do you think bilinear layers & dictionary learning will lead to comprehensive interpretability? 

Are there other specific areas you're excited about?

Why is loss stickiness deprecated? Were you just not able to see the an overlap in basins for L1 & reconstruction loss when you 4x the feature/neuron ratio (ie from 2x->8x)?

As (maybe) mentioned in the slides, this method may not be computationally feasible for SOTA models, but I'm interested in the ordering of features turned monosemantic; if the most important features are turned monosemantic first, then you might not need full monosemanticity.

I initially expect the "most important & frequent" features to become monosemantic first based off the superposition paper. AFAIK, this method only captures the most frequent because "importance" would be w/ respect to CE-loss in the model output, not captured in reconstruction/L1 loss.

My shard theory inspired story is to make an AI that:

  1. Has a good core of human values (this is still hard)
  2. Can identify when experiences will change itself to lead to less of the initial good values. (This is the meta-preferences point with GPT-4 sort of expressing it would avoid jail break inputs)

Then the model can safely scale.

This doesn’t require having the true reward function (which I imagine to be a giant lookup table created by Omega), but some mech interp and understanding its own reward function. I don’t expect this to be an entirely different paradigm; I even think current methods of RLHF might just naively work. Who knows? (I do think we should try to figure it out though! I do have greater uncertainty and less pessimism)

Analogously, I do believe I do a good job of avoiding value-destroying inputs (eg addicting substances), even though my reward function isn’t as clear and legible as what our AI’s will be AFAIK.

I think more concentration meditation would be the way, but concentration meditation does lead to more likely noticing experiences that cause what you may call “awakening experiences”. (This is contrast with insight meditation like noting)

Leigh Brasington’s Right Concentration is a book on jhana’s, which is becoming very concentrated and then focusing on positive sensations until you hit a flow state. This is definitely not an awakening experience, but feels great (though I’ve only entered the first a small amount).

A different source is Rob Burbea’s jhana retreat audio recordings on dharmaseed.

Could you clarify what you mean by awakening experiences and why you think it’s bad?

Is it actually true that you only trained on 5% of the dataset for filtering (I’m assuming training for 20 epochs)?

Unfinished line here

Implicit in the description of features as directions is that the feature can be represented as a scalar, and that the model cares about the range of this number. That is, it matters whether the feature

Monitoring of increasingly advanced systems does not trivially work, since much of the cognition of advanced systems, and many of their dangerous properties, will be externalized the more they interact with the world.

Externalized reasoning being a flaw in monitoring makes a lot of sense, and I haven’t actually heard of it before. I feel that should be a whole post on itself.

Load More