scottviteri
Comments

Mechanistically Eliciting Latent Behaviors in Language Models
scottviteri · 7mo · 10

I really like the idea of finding steering vectors that maximize downstream differences, and I have a few follow-up questions.

Have you tried or considered modifying the c_fc (MLP encoder layer) bias instead of the c_proj (MLP decoder layer) bias? I don't know this particular context well, but (i) c_fc makes more intuitive sense to me as a place to intervene, (ii) I have had more success playing with it in the past than with c_proj, and (iii) the two are not equivalent because of the non-linearity between them.
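
Concretely, here is a minimal sketch of what intervening at the c_fc bias could look like; it assumes a HuggingFace GPT-2-style module layout (the names model.transformer.h[layer].mlp.c_fc / c_proj and the hook trick are my assumptions, not anything from the post):

```python
import torch
from transformers import GPT2LMHeadModel

model = GPT2LMHeadModel.from_pretrained("gpt2")
layer = 8  # hypothetical choice of layer
mlp = model.transformer.h[layer].mlp

print(mlp.c_fc.bias.shape)    # (4 * d_model,) -- "encoder" / up-projection bias
print(mlp.c_proj.bias.shape)  # (d_model,)     -- "decoder" / down-projection bias

# Steer at c_fc by adding a learnable vector z to its output; this is equivalent to
# perturbing the c_fc bias, but keeps z a leaf tensor we can optimize directly.
z = torch.zeros(mlp.c_fc.bias.shape, requires_grad=True)

def add_steering(module, inputs, output):
    return output + z  # broadcasts over (batch, seq, 4 * d_model)

handle = mlp.c_fc.register_forward_hook(add_steering)
# ... optimize z against a downstream objective, then handle.remove()
```

Unlike a c_proj-bias perturbation, this shift still has to pass through the non-linearity, so its downstream effect depends on the rest of the MLP input.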

I like how you control for radius by projecting gradients onto the tangent space and projecting the steering vector back onto the sphere, but have you tried using cosine distance as the loss function, so there is less incentive for the radius R to blow up? Let $D(z) = \sum_{i=1}^{n} \sum_{t \in I_i} \mathrm{cosDist}\!\left(Z^{\ell_{\mathrm{target}}}_{i,t}(z),\, Z^{\ell_{\mathrm{target}}}_{i,t}(0)\right)$ in $\max_z D(z)$.
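
As a rough PyTorch sketch of this objective (get_target_acts is a hypothetical helper that runs the model on a prompt with steering vector z, or unsteered for z=None, and returns the (seq_len, d_model) activations at layer ℓ_target):

```python
import torch.nn.functional as F

def cos_dist_objective(z, prompts, get_target_acts, token_idxs):
    """D(z): summed cosine distance between steered and unsteered target-layer activations."""
    total = 0.0
    for i, prompt in enumerate(prompts):
        steered = get_target_acts(prompt, z)               # (seq_len, d_model)
        baseline = get_target_acts(prompt, None).detach()  # unsteered reference
        for t in token_idxs[i]:                            # t ranges over I_i
            total = total + 1.0 - F.cosine_similarity(steered[t], baseline[t], dim=0)
    return total  # maximize over z, still projecting z back onto the sphere if desired
```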

When you do the iterative search for the next steering vector, I do not expect that constraining the search to the subspace orthogonal to previously found steering vectors will be very helpful, since orthogonal vectors might very well be mapped to the same downstream part of latent space. Since the memory demands of learning steering vectors are quite cheap, I would be interested in seeing an objective which learns a matrix of steering vectors simultaneously, maximizing the sum of pairwise distances. Suppose we are learning K vectors simultaneously:
$$\max_{z_1,\dots,z_K} \sum_{1 \le k < k' \le K} \sum_{i=1}^{n} \sum_{t \in I_i} \mathrm{cosDist}\!\left(Z^{\ell_{\mathrm{target}}}_{i,t}(z_k),\, Z^{\ell_{\mathrm{target}}}_{i,t}(z_{k'})\right)$$


But this form of the objective makes it more transparent that a natural solution is to make each steering vector turn the output into gibberish (unless the LM latent space treats all gibberish alike, which I admit is possible). So maybe we would want a tunable term which encourages staying close to the unsteered activations while staying far from the other steered activations:
$$\max_{z_1,\dots,z_K} \sum_{1 \le k < k' \le K} \sum_{i=1}^{n} \sum_{t \in I_i} \mathrm{cosDist}\!\left(Z^{\ell_{\mathrm{target}}}_{i,t}(z_k),\, Z^{\ell_{\mathrm{target}}}_{i,t}(z_{k'})\right) \;-\; \lambda \sum_{k=1}^{K} D(z_k)$$
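
A sketch of this multi-vector objective, reusing the hypothetical get_target_acts helper from above; lam=0 recovers the pure pairwise-repulsion version:

```python
def multi_vector_objective(zs, prompts, get_target_acts, token_idxs, lam=0.1):
    """Pairwise repulsion between the K steered runs, minus lam * sum_k D(z_k)."""
    K = len(zs)
    acts = [[get_target_acts(p, z) for p in prompts] for z in zs]
    base = [get_target_acts(p, None).detach() for p in prompts]

    pairwise, anchor = 0.0, 0.0
    for i in range(len(prompts)):
        for t in token_idxs[i]:
            for k in range(K):
                anchor = anchor + 1.0 - F.cosine_similarity(acts[k][i][t], base[i][t], dim=0)
                for kp in range(k + 1, K):
                    pairwise = pairwise + 1.0 - F.cosine_similarity(acts[k][i][t], acts[kp][i][t], dim=0)
    return pairwise - lam * anchor  # maximize over z_1, ..., z_K jointly
```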

Lastly, I would be interested in seeing this applied to the final output probability distribution over tokens rather than to $\ell_{\mathrm{target}}$, using KL divergence as the distance, since in that domain we can extract very fine-grained information from the model's activations. Let $D_{\mathrm{KL}}(z) = \sum_{i=1}^{n} \sum_{t \in I_i} \mathrm{KL}\!\left(Z^{\ell_{\mathrm{unembed}}}_{i,t}(z) \,\middle\|\, Z^{\ell_{\mathrm{unembed}}}_{i,t}(0)\right)$ in

$$\max_{z_1,\dots,z_K} \sum_{k=1}^{K} \sum_{k'=1}^{K} \sum_{i=1}^{n} \sum_{t \in I_i} \mathrm{KL}\!\left(Z^{\ell_{\mathrm{unembed}}}_{i,t}(z_k) \,\middle\|\, Z^{\ell_{\mathrm{unembed}}}_{i,t}(z_{k'})\right) \;-\; \lambda \sum_{k=1}^{K} D_{\mathrm{KL}}(z_k)$$
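
And a sketch of this KL-at-the-unembed variant, with get_logits as a hypothetical helper returning (seq_len, vocab) logits for a prompt under steering vector z:

```python
def kl_objective(zs, prompts, get_logits, token_idxs, lam=0.1):
    """Pairwise KL between steered output distributions, minus lam * KL back to the unsteered ones."""
    K = len(zs)
    logps = [[get_logits(p, z).log_softmax(-1) for p in prompts] for z in zs]
    base = [get_logits(p, None).log_softmax(-1).detach() for p in prompts]

    def kl(logp, logq):  # KL(P || Q) from log-probabilities over the vocab
        return (logp.exp() * (logp - logq)).sum(-1)

    pairwise, anchor = 0.0, 0.0
    for i in range(len(prompts)):
        for t in token_idxs[i]:
            for k in range(K):
                anchor = anchor + kl(logps[k][i][t], base[i][t])
                for kp in range(K):
                    if kp != k:  # the k = k' terms are zero anyway
                        pairwise = pairwise + kl(logps[k][i][t], logps[kp][i][t])
    return pairwise - lam * anchor  # maximize over z_1, ..., z_K
```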

The Geometric Expectation
scottviteri · 9mo · 40

Very interesting! I'm excited to read your post.

«Boundaries», Part 3a: Defining boundaries as directed Markov blankets
scottviteri · 1y · 10

I take back the part about pi and update determining the causal structure, because many causal diagrams are consistent with the same poly diagram.

The Geometric Expectation
scottviteri · 1y · 20

I think what is going on here is that both $\nabla^*$ and $G$ are of the form $\exp \circ\, g \circ \ln$, with $g = \nabla$ and $g = \mathbb{E}$ respectively. Let's define the star operator as $g^* = \exp \circ\, g \circ \ln$. Then $(f \circ g)^* = \exp \circ (f \circ g) \circ \ln = (\exp \circ f \circ \ln) \circ (\exp \circ\, g \circ \ln) = f^* \circ g^*$, using $\ln \circ \exp = \mathrm{id}$ and the associativity of function composition. Further, if $f$ and $g$ commute, then so do $f^*$ and $g^*$: $g^* \circ f^* = (g \circ f)^* = (f \circ g)^* = f^* \circ g^*$.

So the commutativity of the geometric expectation and the geometric derivative falls directly out of their representations as $\mathbb{E}^*$ and $\nabla^*$, respectively, by the commutativity of $\mathbb{E}$ and $\nabla$, as long as they are over different variables.

We can also derive what happens when the expectation and gradient are over the same variable: $(\nabla_\theta \circ \mathbb{E}_{x \sim P_\theta(x)})^*$. First, notice that $(\cdot\, k)^*(x) = e^{k \ln x} = e^{\ln x \cdot k} = x^k$, so $(\cdot\, k)^* = (\wedge\, k)$, raising to the power $k$. Also $(+\, k)^*(x) = e^{k + \ln x} = e^k e^{\ln x} = x e^k$, so $(+\, k)^* = (\cdot\, e^k)$.

Now let's expand the composition of the gradient and the expectation: $(\nabla_\theta \circ \mathbb{E}_{x \sim P_\theta(x)})(f(x)) = \nabla_\theta \int P_\theta(x) f(x)\, dx = \mathbb{E}_{x \sim P_\theta(x)}\!\left[\nabla_\theta\big(f(x) \ln P_\theta(x)\big)\right]$, using the log-derivative trick. So $\nabla_\theta \circ \mathbb{E}_{x \sim P_\theta(x)} = \mathbb{E}_{x \sim P_\theta(x)} \circ \nabla_\theta \circ (\cdot \ln P_\theta(x))$.

Therefore $\nabla^*_\theta \circ G_{x \sim P_\theta(x)} = (\nabla_\theta \circ \mathbb{E}_{x \sim P_\theta(x)})^* = \mathbb{E}^*_{x \sim P_\theta(x)} \circ \nabla^*_\theta \circ (\cdot \ln P_\theta(x))^* = G_{x \sim P_\theta} \circ \nabla^*_\theta \circ (\wedge \ln P_\theta)$.

Writing it out, we have $\nabla^*_\theta\, G_{x \sim P_\theta(x)}[f(x)] = G_{x \sim P_\theta(x)}\!\left[\nabla^*_\theta\!\left(f(x)^{\ln P_\theta(x)}\right)\right]$.
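
As a quick numerical sanity check of that last identity (a softmax family $P_\theta(x) \propto e^{\theta x}$ over three points, with the geometric derivative computed by central finite differences):

```python
import numpy as np

xs = np.array([1.0, 2.0, 3.0])
f = xs ** 2 + 1.0                  # any positive f(x)
theta0, eps = 0.3, 1e-6

def probs(theta):                  # P_theta(x) = softmax(theta * x)
    w = np.exp(theta * xs)
    return w / w.sum()

# LHS: grad*_theta G_{x ~ P_theta}[f(x)], where grad*_theta h = exp(d/dtheta ln h)
G = lambda th: np.exp(probs(th) @ np.log(f))
lhs = np.exp((np.log(G(theta0 + eps)) - np.log(G(theta0 - eps))) / (2 * eps))

# RHS: G_{x ~ P_theta0}[ grad*_theta f(x)^{ln P_theta(x)} ]
log_inner = lambda th: np.log(probs(th)) * np.log(f)          # ln(f(x)^{ln P_theta(x)}), per x
dlog_inner = (log_inner(theta0 + eps) - log_inner(theta0 - eps)) / (2 * eps)
rhs = np.exp(probs(theta0) @ dlog_inner)                      # G of exp(dlog_inner)

print(lhs, rhs)  # agree up to finite-difference error
```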

The Geometric Expectation
scottviteri · 1y · 30

And if I pushed the symbols around correctly, the geometric derivative can be pulled inside a geometric expectation, $\nabla^*_\theta\, G_{x \sim P(x)}[f_\theta(x)] = G_{x \sim P(x)}[\nabla^*_\theta f_\theta(x)]$, similarly to how an additive derivative can be pulled inside an additive expectation, $\nabla_\theta\, \mathbb{E}_{x \sim P(x)}[f_\theta(x)] = \mathbb{E}_{x \sim P(x)}[\nabla_\theta f_\theta(x)]$. Also, just as the additive expectation distributes over addition, $\mathbb{E}[f(x)+g(x)] = \mathbb{E}[f(x)] + \mathbb{E}[g(x)]$, the geometric expectation distributes over multiplication: $G[f(x)g(x)] = G[f(x)]\,G[g(x)]$.
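
A tiny numerical check of the multiplicative distributivity, with a fixed distribution p and positive f, g:

```python
import numpy as np

rng = np.random.default_rng(0)
p = rng.random(5); p /= p.sum()                     # fixed distribution over 5 outcomes
f, g = rng.random(5) + 0.5, rng.random(5) + 0.5     # positive functions of x

G = lambda h: np.exp(p @ np.log(h))                 # geometric expectation under p
print(G(f * g), G(f) * G(g))                        # these match: G[fg] = G[f] G[g]
```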

«Boundaries», Part 3a: Defining boundaries as directed Markov blankets
scottviteri · 1y · 10

If I try to use this framework to express two agents communicating, I get an image with V1, A1, P1, V2, A2, and P2, with cross arrows from A1 to P2 and from A2 to P1. This admits many ways to get a round-trip message: we could have A1 -> P2 -> A2 -> P1 directly, or A1 -> P2 -> V2 -> A2 -> P1, or many cycles among P2, V2, and A2 before P1 receives a message. But in none of these could I hope to get a response in one time step, the way I would if both agents simultaneously took an action and then simultaneously read from their inputs and their current state to get their next state. So I have this feeling that pi : S -> Action and update : Observation x S -> S already bake in this active/passive distinction by virtue of their type signatures, and this framing maybe just takes away the computational teeth/specificity. And I can write the same infiltration and exfiltration formulas by substituting S_t for V_t, Obs_t for P_t, Action_t for A_t, and S_env_t for E_t.

The Geometric Expectation
scottviteri · 2y · 30

Actually maybe this family is more relevant: https://en.wikipedia.org/wiki/Generalized_mean, where the geometric mean is the limit as the exponent p approaches zero.
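
Sketch of why the $p \to 0$ limit of the generalized mean $M_p(x) = \left(\tfrac{1}{n}\sum_i x_i^p\right)^{1/p}$ is the geometric mean:

$$\ln M_p = \frac{1}{p}\ln\!\left(\frac{1}{n}\sum_i e^{p \ln x_i}\right) = \frac{1}{p}\ln\!\left(1 + p \cdot \frac{1}{n}\sum_i \ln x_i + O(p^2)\right) \;\xrightarrow{\,p \to 0\,}\; \frac{1}{n}\sum_i \ln x_i,$$

so $M_p \to \exp\!\left(\tfrac{1}{n}\sum_i \ln x_i\right) = \left(\prod_i x_i\right)^{1/n}$.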

The Geometric Expectation
scottviteri · 2y · 30

The "harmonic integral" would be the inverse of integral of the inverse of a function -- https://math.stackexchange.com/questions/2408012/harmonic-integral

The Geometric Expectation
scottviteri · 2y · 20

Also here is a nice family that parametrizes these different kinds of average (https://m.youtube.com/watch?v=3r1t9Pf1Ffk)

The Geometric Expectation
scottviteri · 2y · 50

If arithmetic and geometric means are so good, why not the harmonic mean? https://en.wikipedia.org/wiki/Pythagorean_means. What would a "harmonic rationality" look like?

Posts

22 · Causality and a Cost Semantics for Neural Networks · Ω · 2y · 1
10 · Democratic AI Constitution: Round-Robin Debate and Synthesis · 2y · 4
14 · Nature < Nurture for AIs · 2y · 22
50 · Conversationism · 3y · 1
44 · Research Direction: Be the AGI you want to see in the world · 3y · 0
9 · REPL's and ELK · Ω · 4y · 4
25 · REPL's: a type signature for agents · 4y · 6