I wonder how many of these orthogonal vectors are "actually orthogonal" once we consider that we are adding two vectors together, and that the model has things like LayerNorm.

If one conditions on downstream midlayer activations being "sufficiently different", it seems possible one could find something like a 10x degeneracy in the actual effects these vectors have on models. (A possibly relevant factor is how big the original activation vector is compared to the steering vector?)
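A rough way to poke at this question is sketched below. It is only a toy: it assumes a plain LayerNorm with no learned gain or bias, uses random Gaussian vectors in place of real activations and steering vectors, and looks at the output of a single LayerNorm rather than downstream midlayer activations. The names `steering_effect`, `effect_cosine`, and `scale` are made up for this illustration.

```python
import math
import random

def dot(a, b):
    return sum(p * q for p, q in zip(a, b))

def norm(a):
    return math.sqrt(dot(a, a))

def layer_norm(x, eps=1e-5):
    # Plain LayerNorm without learned gain/bias (an assumption of this sketch).
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def steering_effect(activation, steer):
    # Change in the LayerNorm output caused by adding `steer` to `activation`.
    base = layer_norm(activation)
    steered = layer_norm([a + s for a, s in zip(activation, steer)])
    return [s - b for s, b in zip(steered, base)]

def effect_cosine(scale, dim=64, seed=0):
    # Cosine similarity between the post-LayerNorm effects of two steering
    # vectors that are exactly orthogonal before LayerNorm.
    # `scale` is the steering norm relative to the activation norm.
    rng = random.Random(seed)
    activation = [rng.gauss(0, 1) for _ in range(dim)]
    u = [rng.gauss(0, 1) for _ in range(dim)]
    v = [rng.gauss(0, 1) for _ in range(dim)]
    # Gram-Schmidt step: remove the component of v along u.
    proj = dot(v, u) / dot(u, u)
    v = [vi - proj * ui for vi, ui in zip(v, u)]
    # Rescale both steering vectors to scale * |activation|.
    target = scale * norm(activation)
    u = [ui * target / norm(u) for ui in u]
    v = [vi * target / norm(v) for vi in v]
    du = steering_effect(activation, u)
    dv = steering_effect(activation, v)
    return dot(du, dv) / (norm(du) * norm(dv))

if __name__ == "__main__":
    for scale in (0.01, 1.0, 100.0):
        print(scale, effect_cosine(scale))
```

In this toy setting I would expect the cosine to stay near zero when the steering vectors are small relative to the activation, and to drift well away from zero when they dominate it, since both steered outputs then share the same "minus the original normalised activation" component. Whether anything similar holds in a real model is exactly the open question.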

Deep Forgetting & Unlearning for Safely-Scoped LLMs

NickyP8mo50

I think there are already some papers doing similar work, though usually sold as reducing inference costs. For example, the MoEfication paper and Contextual Sparsity paper could probably be modified for this purpose.

AISC 2024 - Project Summaries

NickyP8mo20

Sorry! I have fixed this now

AI Safety Camp 2024

NickyP8mo90

In case anyone finds it difficult to go through all the projects, I have made a longer post where each project title is followed by a brief description, and a list of the main skills/roles they are looking for.

See here: https://www.lesswrong.com/posts/npkvZG67hRvBneoQ9

Which LessWrongers are (aspiring) YouTubers?

Answer by NickyPOct 23, 202352

Cadenza Labs has some video explainers on interpretability-related concepts: https://www.youtube.com/@CadenzaLabs

For example, an intro to Causal Scrubbing:

Ban development of unpredictable powerful models?

NickyP1yΩ330

Maybe I am not fully understanding, but one issue I see is that without requiring "perfect prediction", one could potentially Goodhart on the proposal. I could imagine something like:

In training GPT-5, add a term that upweights very basic bigram statistics. In "evaluation", use your bigram statistics table to "predict" most topk outputs just well enough to pass.

This would probably have a negative impact on performance, but it could possibly be tuned to be just sufficient to pass. Alternatively, one could train an easy-to-understand toy model on the side and regularise the large model's predictions towards that instead of towards exact bigram statistics, again just enough to pass the test, while still only understanding the toy model.
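To make the worry a bit more concrete, here is a minimal sketch of what such a training term and "evaluation" could look like. Everything in it is hypothetical: the names `goodharted_loss`, `top1_agreement`, and the weight `lam` are invented for illustration, the "model" is just a list of probabilities, and the test being gamed is assumed to be simple top-1 agreement.

```python
import math

def cross_entropy(target_probs, predicted_probs, eps=1e-12):
    # H(target, predicted) = -sum_i target_i * log(predicted_i)
    return -sum(t * math.log(p + eps)
                for t, p in zip(target_probs, predicted_probs))

def goodharted_loss(model_probs, true_token, bigram_probs, lam):
    # Ordinary next-token loss plus a term (weight `lam`, hypothetical)
    # that pulls the model's distribution towards the bigram table.
    one_hot = [1.0 if i == true_token else 0.0
               for i in range(len(model_probs))]
    task_loss = cross_entropy(one_hot, model_probs)
    bigram_term = cross_entropy(bigram_probs, model_probs)
    return task_loss + lam * bigram_term

def argmax(probs):
    return max(range(len(probs)), key=probs.__getitem__)

def top1_agreement(model_rows, bigram_rows):
    # Fraction of positions where the bigram table "predicts" the model's
    # top-1 token. This is the score one would tune `lam` against.
    hits = sum(argmax(m) == argmax(b)
               for m, b in zip(model_rows, bigram_rows))
    return hits / len(model_rows)
```

The point of the sketch is only that `lam` gives a dial: turn it up until `top1_agreement` clears whatever threshold the test sets, and no further.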

LLM Basics: Embedding Spaces - Transformer Token Vectors Are Not Points in Space

NickyP1y10

While I think this is important, and will probably edit the post, I think even in the unembedding, when getting the logits, the behaviour cares more about direction than distance.

When I think of distance, I implicitly think Euclidean distance:
$d (x_{1}, x_{2}) = | x_{1} - x_{2} | = \sqrt{\sum_{i} (x_{1i} - x_{2i})^{2}}$

But the actual "distance" used for calculating logits looks like this:
$d (x_{1}, x_{2}) = x_{1} \cdot x_{2} = | x_{1} | | x_{2} | \cos \theta_{12}$

Which is a lot more similar to cosine similarity:
$d (x_{1}, x_{2}) = \hat{x}_{1} \cdot \hat{x}_{2} = \cos \theta_{12}$

I think that because the metric is so similar to the cosine similarity, it makes more sense to think of size + directions instead of distances and points.
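A tiny numeric sketch of the difference, with made-up 2D vectors standing in for a hidden state and three unembedding rows (the token names "A", "C", "D" and all the numbers are invented for illustration):

```python
import math

def dot(a, b):
    return sum(p * q for p, q in zip(a, b))

def norm(a):
    return math.sqrt(dot(a, a))

def euclidean(a, b):
    return norm([p - q for p, q in zip(a, b)])

def cosine(a, b):
    return dot(a, b) / (norm(a) * norm(b))

# Hypothetical hidden state and unembedding vectors.
hidden = [2.0, 0.0]
unembed = {
    "A": [1.0, 0.0],   # close to `hidden` as a point in space
    "C": [0.0, 2.0],   # orthogonal to `hidden`
    "D": [10.0, 1.0],  # far away as a point, but large and well aligned
}

# Token that is nearest as a point (Euclidean distance).
nearest = min(unembed, key=lambda t: euclidean(hidden, unembed[t]))
# Token with the highest logit (dot product).
top_logit = max(unembed, key=lambda t: dot(hidden, unembed[t]))

if __name__ == "__main__":
    print("nearest by Euclidean distance:", nearest)
    print("highest logit:", top_logit)
    for name, w in unembed.items():
        print(name, dot(hidden, w), cosine(hidden, w), euclidean(hidden, w))
```

Here the Euclidean-nearest token and the highest-logit token come apart: the logit is decided by size times alignment, not by how close the two points are.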

LLM Basics: Embedding Spaces - Transformer Token Vectors Are Not Points in Space

NickyP1y32

This is true. I think that visualising points on a (hyper-)sphere is fine, but it is difficult in practice to parametrise the points that way.

It is more that the vectors on the GPU look like $\mathbb{R}^{n}$, but the vectors in the model are treated more like $\mathbb{R} \times S^{n - 1}$
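As a small illustration of that "size times direction" view, here is a sketch that splits a stored vector into a magnitude and a unit direction and puts it back together. The function names are invented for this example, and the zero vector (which has no direction) is simply rejected.

```python
import math

def to_size_and_direction(x):
    # Split x into (r, unit) with r = |x| and unit a point on S^(n-1).
    r = math.sqrt(sum(v * v for v in x))
    if r == 0.0:
        raise ValueError("the zero vector has no direction")
    return r, [v / r for v in x]

def from_size_and_direction(r, unit):
    # Inverse map: scale the unit direction back up by r.
    return [r * u for u in unit]

if __name__ == "__main__":
    r, unit = to_size_and_direction([3.0, 4.0])
    print(r, unit)
    print(from_size_and_direction(r, unit))
```

Nothing about how the vector is stored changes here; it is the same $n$ numbers, just read as one size plus one direction.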
