scottviteri
Comments

Mechanistically Eliciting Latent Behaviors in Language Models
scottviteri · 7mo · 10

I really like the idea of finding steering vectors that maximize downstream differences, and I have a few follow-up questions.

Have you tried or considered modifying the c_fc (MLP encoder layer) bias instead of the c_proj (MLP decoder layer) bias? I don't know this particular context well, but (i) c_fc makes more intuitive sense to me as a place to intervene, (ii) I have had more success playing with it in the past than with c_proj, and (iii) the two are not equivalent because of the non-linearity between them.
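
Concretely, here is a minimal sketch of what intervening at the c_fc bias could look like; it assumes a HuggingFace GPT-2-style module layout (the names model.transformer.h[layer].mlp.c_fc / c_proj and the hook trick are my assumptions, not anything from the post):

```python
import torch
from transformers import GPT2LMHeadModel

model = GPT2LMHeadModel.from_pretrained("gpt2")
layer = 8  # hypothetical choice of layer
mlp = model.transformer.h[layer].mlp

print(mlp.c_fc.bias.shape)    # (4 * d_model,) -- "encoder" / up-projection bias
print(mlp.c_proj.bias.shape)  # (d_model,)     -- "decoder" / down-projection bias

# Steer at c_fc by adding a learnable vector z to its output; this is equivalent to
# perturbing the c_fc bias, but keeps z a leaf tensor we can optimize directly.
z = torch.zeros(mlp.c_fc.bias.shape, requires_grad=True)

def add_steering(module, inputs, output):
    return output + z  # broadcasts over (batch, seq, 4 * d_model)

handle = mlp.c_fc.register_forward_hook(add_steering)
# ... optimize z against a downstream objective, then handle.remove()
```

Unlike a c_proj-bias perturbation, this shift still has to pass through the non-linearity, so its downstream effect depends on the rest of the MLP input.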

I like how you control for radius by projecting gradients onto the tangent space and projecting the steering vector back onto the sphere, but have you tried using cosine distance as the loss function, so there is less incentive for the radius R to blow up? Let $D(z) = \sum_{i=1}^{n} \sum_{t \in I_i} \mathrm{cosDist}\!\left(Z^{\ell_{\mathrm{target}}}_{i,t}(z),\, Z^{\ell_{\mathrm{target}}}_{i,t}(0)\right)$ in $\max_z D(z)$.
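
As a rough PyTorch sketch of this objective (get_target_acts is a hypothetical helper that runs the model on a prompt with steering vector z, or unsteered for z=None, and returns the (seq_len, d_model) activations at layer ℓ_target):

```python
import torch.nn.functional as F

def cos_dist_objective(z, prompts, get_target_acts, token_idxs):
    """D(z): summed cosine distance between steered and unsteered target-layer activations."""
    total = 0.0
    for i, prompt in enumerate(prompts):
        steered = get_target_acts(prompt, z)               # (seq_len, d_model)
        baseline = get_target_acts(prompt, None).detach()  # unsteered reference
        for t in token_idxs[i]:                            # t ranges over I_i
            total = total + 1.0 - F.cosine_similarity(steered[t], baseline[t], dim=0)
    return total  # maximize over z, still projecting z back onto the sphere if desired
```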

When you do the iterative search for the next steering vector, I do not expect that constraining the search to the subspace orthogonal to previously found steering vectors will be very helpful, since orthogonal vectors might very well be mapped to the same downstream part of latent space. Since the memory demands of learning steering vectors are quite cheap, I would be interested in seeing an objective which learns a matrix of steering vectors simultaneously, maximizing the sum of pairwise distances. Suppose we are learning K vectors simultaneously:
$$\max_{z_1,\dots,z_K} \sum_{1 \le k < k' \le K} \sum_{i=1}^{n} \sum_{t \in I_i} \mathrm{cosDist}\!\left(Z^{\ell_{\mathrm{target}}}_{i,t}(z_k),\, Z^{\ell_{\mathrm{target}}}_{i,t}(z_{k'})\right)$$


But this form of the objective makes it more transparent that a natural solution is to make each steering vector turn the output into gibberish (unless the LM latent space treats all gibberish alike, which I admit is possible). So maybe we would want a tunable term which encourages staying close to the unsteered activations while staying far from the other steered activations:
$$\max_{z_1,\dots,z_K} \sum_{1 \le k < k' \le K} \sum_{i=1}^{n} \sum_{t \in I_i} \mathrm{cosDist}\!\left(Z^{\ell_{\mathrm{target}}}_{i,t}(z_k),\, Z^{\ell_{\mathrm{target}}}_{i,t}(z_{k'})\right) \;-\; \lambda \sum_{k=1}^{K} D(z_k)$$
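
A sketch of this multi-vector objective, reusing the hypothetical get_target_acts helper from above; lam=0 recovers the pure pairwise-repulsion version:

```python
def multi_vector_objective(zs, prompts, get_target_acts, token_idxs, lam=0.1):
    """Pairwise repulsion between the K steered runs, minus lam * sum_k D(z_k)."""
    K = len(zs)
    acts = [[get_target_acts(p, z) for p in prompts] for z in zs]
    base = [get_target_acts(p, None).detach() for p in prompts]

    pairwise, anchor = 0.0, 0.0
    for i in range(len(prompts)):
        for t in token_idxs[i]:
            for k in range(K):
                anchor = anchor + 1.0 - F.cosine_similarity(acts[k][i][t], base[i][t], dim=0)
                for kp in range(k + 1, K):
                    pairwise = pairwise + 1.0 - F.cosine_similarity(acts[k][i][t], acts[kp][i][t], dim=0)
    return pairwise - lam * anchor  # maximize over z_1, ..., z_K jointly
```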

Lastly, I would be interested in seeing this applied to the final output probability distribution over tokens rather than to $\ell_{\mathrm{target}}$, using KL divergence as the distance, since in that domain we can extract very fine-grained information from the model's activations. Let $D_{\mathrm{KL}}(z) = \sum_{i=1}^{n} \sum_{t \in I_i} \mathrm{KL}\!\left(Z^{\ell_{\mathrm{unembed}}}_{i,t}(z) \,\middle\|\, Z^{\ell_{\mathrm{unembed}}}_{i,t}(0)\right)$ in

$$\max_{z_1,\dots,z_K} \sum_{k=1}^{K} \sum_{k'=1}^{K} \sum_{i=1}^{n} \sum_{t \in I_i} \mathrm{KL}\!\left(Z^{\ell_{\mathrm{unembed}}}_{i,t}(z_k) \,\middle\|\, Z^{\ell_{\mathrm{unembed}}}_{i,t}(z_{k'})\right) \;-\; \lambda \sum_{k=1}^{K} D_{\mathrm{KL}}(z_k)$$
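
And a sketch of this KL-at-the-unembed variant, with get_logits as a hypothetical helper returning (seq_len, vocab) logits for a prompt under steering vector z:

```python
def kl_objective(zs, prompts, get_logits, token_idxs, lam=0.1):
    """Pairwise KL between steered output distributions, minus lam * KL back to the unsteered ones."""
    K = len(zs)
    logps = [[get_logits(p, z).log_softmax(-1) for p in prompts] for z in zs]
    base = [get_logits(p, None).log_softmax(-1).detach() for p in prompts]

    def kl(logp, logq):  # KL(P || Q) from log-probabilities over the vocab
        return (logp.exp() * (logp - logq)).sum(-1)

    pairwise, anchor = 0.0, 0.0
    for i in range(len(prompts)):
        for t in token_idxs[i]:
            for k in range(K):
                anchor = anchor + kl(logps[k][i][t], base[i][t])
                for kp in range(K):
                    if kp != k:  # the k = k' terms are zero anyway
                        pairwise = pairwise + kl(logps[k][i][t], logps[kp][i][t])
    return pairwise - lam * anchor  # maximize over z_1, ..., z_K
```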

The Geometric Expectation
scottviteri · 9mo · 40

Very interesting! I'm excited to read your post.

«Boundaries», Part 3a: Defining boundaries as directed Markov blankets
scottviteri · 1y · 10

I take back the part about pi and update determining the causal structure, because many causal diagrams are consistent with the same poly diagram.

The Geometric Expectation
scottviteri · 1y · 20

I think what is going on here is that both $\nabla^*$ and $G$ are of the form $\exp \circ\, g \circ \ln$, with $g = \nabla$ and $g = \mathbb{E}$ respectively. Let's define the star operator as $g^* = \exp \circ\, g \circ \ln$. Then $(f \circ g)^* = \exp \circ (f \circ g) \circ \ln = (\exp \circ f \circ \ln) \circ (\exp \circ\, g \circ \ln) = f^* \circ g^*$, using $\ln \circ \exp = \mathrm{id}$ and the associativity of function composition. Further, if $f$ and $g$ commute, then so do $f^*$ and $g^*$: $g^* \circ f^* = (g \circ f)^* = (f \circ g)^* = f^* \circ g^*$.

So the commutativity of the geometric expectation and the geometric derivative falls directly out of their representations as $\mathbb{E}^*$ and $\nabla^*$, respectively, by the commutativity of $\mathbb{E}$ and $\nabla$, as long as they are over different variables.

We can also derive what happens when the expectation and gradient are over the same variable: $(\nabla_\theta \circ \mathbb{E}_{x \sim P_\theta(x)})^*$. First, notice that $(\cdot\, k)^*(x) = e^{k \ln x} = e^{\ln x \cdot k} = x^k$, so $(\cdot\, k)^* = (\wedge\, k)$, raising to the power $k$. Also $(+\, k)^*(x) = e^{k + \ln x} = e^k e^{\ln x} = x e^k$, so $(+\, k)^* = (\cdot\, e^k)$.

Now let's expand the composition of the gradient and the expectation: $(\nabla_\theta \circ \mathbb{E}_{x \sim P_\theta(x)})(f(x)) = \nabla_\theta \int P_\theta(x) f(x)\, dx = \mathbb{E}_{x \sim P_\theta(x)}\!\left[\nabla_\theta\big(f(x) \ln P_\theta(x)\big)\right]$, using the log-derivative trick. So $\nabla_\theta \circ \mathbb{E}_{x \sim P_\theta(x)} = \mathbb{E}_{x \sim P_\theta(x)} \circ \nabla_\theta \circ (\cdot \ln P_\theta(x))$.

Therefore $\nabla^*_\theta \circ G_{x \sim P_\theta(x)} = (\nabla_\theta \circ \mathbb{E}_{x \sim P_\theta(x)})^* = \mathbb{E}^*_{x \sim P_\theta(x)} \circ \nabla^*_\theta \circ (\cdot \ln P_\theta(x))^* = G_{x \sim P_\theta} \circ \nabla^*_\theta \circ (\wedge \ln P_\theta)$.

Writing it out, we have $\nabla^*_\theta\, G_{x \sim P_\theta(x)}[f(x)] = G_{x \sim P_\theta(x)}\!\left[\nabla^*_\theta\!\left(f(x)^{\ln P_\theta(x)}\right)\right]$.
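
As a quick numerical sanity check of that last identity (a softmax family $P_\theta(x) \propto e^{\theta x}$ over three points, with the geometric derivative computed by central finite differences):

```python
import numpy as np

xs = np.array([1.0, 2.0, 3.0])
f = xs ** 2 + 1.0                  # any positive f(x)
theta0, eps = 0.3, 1e-6

def probs(theta):                  # P_theta(x) = softmax(theta * x)
    w = np.exp(theta * xs)
    return w / w.sum()

# LHS: grad*_theta G_{x ~ P_theta}[f(x)], where grad*_theta h = exp(d/dtheta ln h)
G = lambda th: np.exp(probs(th) @ np.log(f))
lhs = np.exp((np.log(G(theta0 + eps)) - np.log(G(theta0 - eps))) / (2 * eps))

# RHS: G_{x ~ P_theta0}[ grad*_theta f(x)^{ln P_theta(x)} ]
log_inner = lambda th: np.log(probs(th)) * np.log(f)          # ln(f(x)^{ln P_theta(x)}), per x
dlog_inner = (log_inner(theta0 + eps) - log_inner(theta0 - eps)) / (2 * eps)
rhs = np.exp(probs(theta0) @ dlog_inner)                      # G of exp(dlog_inner)

print(lhs, rhs)  # agree up to finite-difference error
```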

The Geometric Expectation
scottviteri · 1y · 30

And if I pushed the symbols around correctly, the geometric derivative can be pulled inside a geometric expectation, $\nabla^*_\theta\, G_{x \sim P(x)}[f_\theta(x)] = G_{x \sim P(x)}[\nabla^*_\theta f_\theta(x)]$, similarly to how an additive derivative can be pulled inside an additive expectation, $\nabla_\theta\, \mathbb{E}_{x \sim P(x)}[f_\theta(x)] = \mathbb{E}_{x \sim P(x)}[\nabla_\theta f_\theta(x)]$. Also, just as the additive expectation distributes over addition, $\mathbb{E}[f(x)+g(x)] = \mathbb{E}[f(x)] + \mathbb{E}[g(x)]$, the geometric expectation distributes over multiplication: $G[f(x)g(x)] = G[f(x)]\,G[g(x)]$.
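
A tiny numerical check of the multiplicative distributivity, with a fixed distribution p and positive f, g:

```python
import numpy as np

rng = np.random.default_rng(0)
p = rng.random(5); p /= p.sum()                     # fixed distribution over 5 outcomes
f, g = rng.random(5) + 0.5, rng.random(5) + 0.5     # positive functions of x

G = lambda h: np.exp(p @ np.log(h))                 # geometric expectation under p
print(G(f * g), G(f) * G(g))                        # these match: G[fg] = G[f] G[g]
```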

«Boundaries», Part 3a: Defining boundaries as directed Markov blankets
scottviteri · 1y · 10

If I try to use this framework to express two agents communicating, I get an image with V1, A1, P1, V2, A2, and P2, with cross arrows from A1 to P2 and from A2 to P1. This admits many ways to get a round-trip message: we could have A1 -> P2 -> A2 -> P1 directly, or A1 -> P2 -> V2 -> A2 -> P1, or many cycles among P2, V2, and A2 before P1 receives a message. But in none of these could I hope to get a response in one time step, the way I would if both agents simultaneously took an action and then simultaneously read from their inputs and their current state to get their next state. So I have this feeling that pi : S -> Action and update : Observation x S -> S already bake in this active/passive distinction by virtue of their type signatures, and this framing maybe just takes away the computational teeth/specificity. And I can write the same infiltration and exfiltration formulas by substituting S_t for V_t, Obs_t for P_t, Action_t for A_t, and S_env_t for E_t.

The Geometric Expectation
scottviteri · 2y · 30

Actually maybe this family is more relevant: https://en.wikipedia.org/wiki/Generalized_mean, where the geometric mean is the limit as the exponent p approaches zero.
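
Sketch of why the $p \to 0$ limit of the generalized mean $M_p(x) = \left(\tfrac{1}{n}\sum_i x_i^p\right)^{1/p}$ is the geometric mean:

$$\ln M_p = \frac{1}{p}\ln\!\left(\frac{1}{n}\sum_i e^{p \ln x_i}\right) = \frac{1}{p}\ln\!\left(1 + p \cdot \frac{1}{n}\sum_i \ln x_i + O(p^2)\right) \;\xrightarrow{\,p \to 0\,}\; \frac{1}{n}\sum_i \ln x_i,$$

so $M_p \to \exp\!\left(\tfrac{1}{n}\sum_i \ln x_i\right) = \left(\prod_i x_i\right)^{1/n}$.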

The Geometric Expectation
scottviteri · 2y · 30

The "harmonic integral" would be the inverse of integral of the inverse of a function -- https://math.stackexchange.com/questions/2408012/harmonic-integral

The Geometric Expectation
scottviteri · 2y · 20

Also here is a nice family that parametrizes these different kinds of average (https://m.youtube.com/watch?v=3r1t9Pf1Ffk)

The Geometric Expectation
scottviteri · 2y · 50

If arithmetic and geometric means are so good, why not the harmonic mean? https://en.wikipedia.org/wiki/Pythagorean_means. What would a "harmonic rationality" look like?

Posts

22 · Causality and a Cost Semantics for Neural Networks · Ω · 2y · 1
10 · Democratic AI Constitution: Round-Robin Debate and Synthesis · 2y · 4
14 · Nature < Nurture for AIs · 2y · 22
50 · Conversationism · 3y · 1
44 · Research Direction: Be the AGI you want to see in the world · 3y · 0
9 · REPL's and ELK · Ω · 4y · 4
25 · REPL's: a type signature for agents · 4y · 6