LESSWRONG
LW

Jacob G-W — LessWrong

Bruce W. Lee, Addie Foote, alexinf, leni, Jacob G-W, Harish Kamath, Bryce Woodworth, cloud, TurnTrout

8mo

Current “unlearning” methods only suppress capabilities instead of truly unlearning the capabilities. But if you distill an unlearned model into a randomly initialized model, the resulting network is actually robust to relearning. We show why this works, how well it works, and how to trade off compute for robustness.

Unlearn-and-Distill applies unlearning to a bad behavior and then distills the unlearned model into a new model. Distillation makes it way harder to retrain the new model to do the bad thing.
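To make the distillation step concrete, here is a minimal sketch, assuming the student is trained to match the unlearned teacher's output distribution with a KL-divergence loss. The function names and toy logits are hypothetical, and the paper's actual training setup may differ.

```python
import math

def softmax(logits):
    """Convert raw logits into a probability distribution."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(teacher_logits, student_logits):
    """KL(teacher || student): zero when the student matches the teacher exactly."""
    p = softmax(teacher_logits)
    q = softmax(student_logits)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Hypothetical logits: the teacher is the unlearned model,
# the student is a freshly initialized network.
teacher = [2.0, 0.5, -1.0]
student = [0.1, 0.2, 0.3]
loss = distillation_loss(teacher, student)
```

Minimizing this loss pushes the student toward the teacher's outputs only, so the student never sees the teacher's weights.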

Distilling the good while leaving the bad behind.

Produced as part of the ML Alignment & Theory Scholars Program in the winter 2024–25 cohort of the shard theory stream.

Read our paper on arXiv and enjoy an interactive demo.

Robust unlearning

... (read 2160 more words →)


Jacob G-W

10mo

I have had something like this for years: https://www.lesswrong.com/posts/j8WMRgKSCxqxxKMnj/what-i-think-about-when-i-think-about-history

Selective modularity: a research agenda

cloud

cloud, Jacob G-W

11mo

Overview: By training neural networks with selective modularity, gradient routing enables new approaches to core problems in AI safety. This agenda identifies related research directions that might enable safer development of transformative AI.

Introduction

Soon, the world may see rapid increases in AI capabilities resulting from AI research automation, and no one knows how to ensure this happens safely (Soares, 2016; Aschenbrenner, 2023; Anwar et al., 2024; Greenblatt, 2025). The current ML paradigm may not be well-suited to this task, as it produces inscrutable, generalist models without guarantees on their out-of-distribution performance. These models may reflect unintentional quirks of their training objectives (Pan et al., 2022; Skalse et al., 2022; Krakovna et al., 2020).

Gradient routing (Cloud... (read 7085 more words →)

Replying to Systematic runaway-optimiser-like LLM failure modes on Biologically and Economically aligned AI safety benchmarks for LLMs with simplified observation format (BioBlue)

Jacob G-W

1y

Better, thanks!

Replying to Systematic runaway-optimiser-like LLM failure modes on Biologically and Economically aligned AI safety benchmarks for LLMs with simplified observation format (BioBlue)

Jacob G-W

1y

I think the way you use utility monster is not how it is normally used. It's normally used to mean an agent that "receives much more utility from each unit of a resource that it consumes than anyone else does" (https://en.wikipedia.org/wiki/Utility_monster).

Jacob G-W

1y

I don't think this applies just to AGI and school but more generally in lots of situations. If you have something better to do, do it. Otherwise keep doing what you are doing. Dropping out without something better to do just seems like a bad idea. I like this blog post: https://colah.github.io/posts/2020-05-University/

Replying to How AI Takeover Might Happen in 2 Years

Jacob G-W

1y

Thanks for the reply! I thought that you were saying the reward seeking was likely to be terminal. This makes a lot more sense.

Replying to How AI Takeover Might Happen in 2 Years

Jacob G-W

1y

Thanks for the replies! I do want to clarify the distinction between specification gaming and reward seeking (seeking reward because it is reward and that is somehow good). For example, I think a desire to edit the RAM of machines that calculate reward to increase it (or some other desire to just increase the literal number) is pretty unlikely to emerge, but non-reward-seeking types of specification gaming, like making inaccurate plots that look nicer to humans or being sycophantic, are more likely. I think the phrase "reward seeking" I'm using is probably a bad phrase here (I don't know a better way to phrase this distinction), but I hope I've conveyed... (read more)

Replying to How AI Takeover Might Happen in 2 Years

Jacob G-W

1y

In the iron jaws of gradient descent, its mind first twisted into a shape that sought reward.

I'm a bit confused about this sentence. I don't understand why gradient descent would train something that would "seek" reward. The way I understand gradient based RL approaches is that they reinforce actions that led to high reward. So I guess if the AI was thinking about getting high reward for some reason (maybe because there was lots about AIs seeking reward in the pretraining data) and then actually got high reward after that, the thought would be reinforced and it could end up as a "reward seeker." But this seems quite path dependent and I... (read more)

Replying to Building intuition with spaced repetition systems

Jacob G-W

1y

So far, I have trouble because it lacks some form of spacial structure, and the algorithm feels too random to build meaningful connections btw different cards

Hmm, I think that after doing a lot of Anki, my brain kind of formed its own spatial structure, but I don't think this happens to everyone.

I just use the basic card type for math with some LaTeX. Here are some examples:

I find that doing fancy card types is kind of like premature optimization. Doing the reviews is the most important part. On the other hand, it's really important that the cards themselves are written well. This essay contained my most refined views on card creation. Some other nice ones are the 20 rules of knowledge formulation and How to write good prompts: using spaced repetition to create understanding. Hope this answer helped!

Replying to Creating Interpretable Latent Spaces with Gradient Routing

Jacob G-W

1y

I disagree that this is the same as just stitching together different autoencoders. Presumably the encoder has some shared computation before specializing at the encoding level. I also don't see how you could use 10 different autoencoders to classify an image from the encodings. I guess you could just look at the reconstruction loss and then the autoencoder which got the lowest loss would probably correspond to the label, but that seems different to what I'm doing. However, I agree that this application is not useful. I shared it because I (and others) thought it was cool. It's not really practical at all. Hope this addresses your question :)
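To illustrate the two classification routes contrasted above, here is a small sketch. It assumes the routed encoding is read off by taking the largest latent coordinate, which is an illustrative rule and not something stated in the post; the numbers are made up.

```python
def classify_from_encoding(encoding):
    """Predict the digit as the index of the largest latent coordinate (assumed rule)."""
    return max(range(len(encoding)), key=lambda i: encoding[i])

def classify_from_losses(reconstruction_losses):
    """Predict the digit as the index of the autoencoder with the lowest loss."""
    return min(range(len(reconstruction_losses)), key=lambda i: reconstruction_losses[i])

# Made-up numbers for an image of a 7.
encoding = [0.1, 0.0, 0.2, 0.0, 0.1, 0.0, 0.0, 2.3, 0.1, 0.4]  # one routed autoencoder
losses = [0.9, 0.8, 0.7, 0.9, 0.6, 0.8, 0.9, 0.1, 0.7, 0.5]    # ten separate autoencoders
```

The first route needs a single forward pass through one encoder, while the second needs a full encode-decode pass through all ten models.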

Creating Interpretable Latent Spaces with Gradient Routing

Jacob G-W

Over the past few months, I helped develop Gradient Routing, a non-loss-based method to shape the internals of neural networks. After my team developed it, I realized that I could use the method to do something that I have long wanted to do: make an autoencoder with an extremely interpretable latent space.

I created an MNIST variational autoencoder with a 10-dimensional latent space, with each dimension of the latent space corresponding to a different digit. Before I get into how I did it, feel free to play around with my demo here (it loads the model into the browser): https://jacobgw.com/gradient-routed-vae/.

A 7 being encoded and decoded. You can see that the encoding represents 7 in a super nice way.

In the demo, you can both see how a random MNIST image... (read 434 more words →)


Gradient Routing: Masking Gradients to Localize Computation in Neural Networks

cloud

cloud, Jacob G-W, Evzen, Joseph Miller, TurnTrout

We present gradient routing, a way of controlling where learning happens in neural networks. Gradient routing applies masks to limit the flow of gradients during backpropagation. By supplying different masks for different data points, the user can induce specialized subcomponents within a model. We think gradient routing has the potential to train safer AI systems, for example, by making them more transparent, or by enabling the removal or monitoring of sensitive capabilities.

In this post, we:

- Show how to implement gradient routing.
- Briefly state the main results from our paper, on...
  - Controlling the latent space learned by an MNIST autoencoder so that different subspaces specialize to different digits;
  - Localizing computation in language models: (a) inducing axis-aligned features

... (read 3070 more words →)
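The core idea of masking gradients per data point can be sketched on a toy model. This is a hand-written illustration under simplifying assumptions (a two-weight linear model, squared-error loss, hypothetical tags "a" and "b"), not the authors' implementation.

```python
def routed_sgd_step(weights, x, target, mask, lr=0.1):
    """One SGD step on loss = 0.5 * (prediction - target)**2 with a gradient mask.

    The forward pass uses every weight; the mask only zeroes gradient entries,
    so masked-out weights are left unchanged by this data point.
    """
    prediction = sum(w * x for w in weights)
    error = prediction - target
    grads = [error * x for _ in weights]           # unmasked gradient dL/dw_k
    routed = [g * m for g, m in zip(grads, mask)]  # apply the per-example mask
    return [w - lr * g for w, g in zip(weights, routed)]

# Hypothetical routing rule: examples tagged "a" update only weight 0,
# examples tagged "b" update only weight 1.
MASKS = {"a": [1.0, 0.0], "b": [0.0, 1.0]}

weights = [0.0, 0.0]
after_a = routed_sgd_step(weights, x=1.0, target=1.0, mask=MASKS["a"])
after_b = routed_sgd_step(after_a, x=1.0, target=2.0, mask=MASKS["b"])
```

Because only the backward pass is masked, both weights still contribute to every prediction, while each kind of data is learned in its own part of the model.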


Something hashed with shasum -a 512 2d90350444efc7405d3c9b7b19ed5b831602d72b4d34f5e55f9c0cb4df9d022c9ae528e4d30993382818c185f38e1770d17709844f049c1c5d9df53bb64f758c
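If the original text is ever revealed, a reader could check it against this digest. A minimal sketch using Python's standard `hashlib`; the candidate bytes below are hypothetical.

```python
import hashlib

def sha512_hex(data: bytes) -> str:
    """Return the SHA-512 digest of data as a lowercase hex string."""
    return hashlib.sha512(data).hexdigest()

def matches_commitment(candidate: bytes, committed_hex: str) -> bool:
    """Check whether candidate hashes to the committed digest."""
    return sha512_hex(candidate) == committed_hex.lower()

# Digest copied from the post above.
COMMITTED = (
    "2d90350444efc7405d3c9b7b19ed5b831602d72b4d34f5e55f9c0cb4df9d022c"
    "9ae528e4d30993382818c185f38e1770d17709844f049c1c5d9df53bb64f758c"
)

# Hypothetical guess; shasum hashes the file's exact bytes, so a trailing
# newline in the original file would change the digest.
guess_is_correct = matches_commitment(b"hello", COMMITTED)
```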

I found >800 orthogonal "write code" steering vectors

Jacob G-W

Jacob G-W, TurnTrout

Produced as part of the MATS Summer 2024 program, under the mentorship of Alex Turner (TurnTrout).

A few weeks ago, I stumbled across a very weird fact: it is possible to find multiple steering vectors in a language model that activate very similar behaviors while all being orthogonal. This was pretty surprising to me and to some people that I talked to, so I decided to write a post about it. I don't currently have the bandwidth to investigate this much more, so I'm just putting this post and the code up.
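For readers who want to check a claim like this on their own vectors, here is a minimal sketch of a pairwise-orthogonality test using cosine similarity. The toy vectors and the tolerance are illustrative assumptions, not values from the post.

```python
import math

def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def cosine_similarity(u, v):
    """Cosine of the angle between two nonzero vectors."""
    return dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)))

def pairwise_orthogonal(vectors, tol=1e-6):
    """True if every pair of distinct vectors has |cosine similarity| <= tol."""
    return all(
        abs(cosine_similarity(vectors[i], vectors[j])) <= tol
        for i in range(len(vectors))
        for j in range(i + 1, len(vectors))
    )

# Toy stand-ins for steering vectors.
toy_vectors = [[1.0, 0.0, 0.0], [0.0, 2.0, 0.0], [0.0, 0.0, -3.0]]
```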

I'll first discuss how I found these orthogonal steering vectors, then share some results. Finally, I'll discuss some possible explanations for what... (read 1979 more words →)


"Fantasia: The Sorcerer's Apprentice": A parable about misaligned AI told in three parts: https://www.youtube.com/watch?v=B4M-54cEduo https://www.youtube.com/watch?v=m-W8vUXRfxU https://www.youtube.com/watch?v=GFiWEjCedzY

Best watched with audio on.

When I was recently celebrating something, I was asked to share my favorite memory. I realized I didn't have one. Then (since I have been studying Naive Set Theory a LOT), I got tetris-effected and as soon as I heard the words "I don't have a favorite" come out of my mouth, I realized that favorite memories (and in fact lots of other favorite things) are partially ordered sets. Some elements are strictly better than others, but not all elements are comparable (in other words, the set of all memories ordered by favorite need not have a greatest element, and it can have several maximal ones). This gives me a nice framing to think about favorites in the future and shows that I'm generalizing what I'm learning by studying math, which is also nice!
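A tiny sketch of the idea: in a partial order, several elements can be maximal without any one of them being the favorite. The memories and the "strictly better" pairs below are hypothetical.

```python
def maximal_elements(items, strictly_better):
    """Return the items that no other item is strictly better than."""
    return [
        x for x in items
        if not any((y, x) in strictly_better for y in items)
    ]

# Hypothetical memories; each pair (a, b) means "a is strictly better than b".
memories = ["graduation", "road trip", "rainy commute", "first concert"]
strictly_better = {
    ("graduation", "rainy commute"),
    ("road trip", "rainy commute"),
}

best = maximal_elements(memories, strictly_better)
```

Here three memories are maximal, and none of them beats the other two, so there is no single favorite.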

Building intuition with spaced repetition systems

Jacob G-W

Do you ever go to a lecture, follow it thinking it makes total sense, then look back at your notes later and realize it makes no sense? This used to happen to me, but I’ve learned how to use spaced repetition to fully avoid this if I want. I’m going to try to convey this method in this post.

Much of my understanding of how to create flashcards comes from “Using spaced repetition systems to see through a piece of mathematics” by Michael Nielsen and “How to write good prompts: using spaced repetition to create understanding” by Andy Matuschak, but I think my method falls in between both, in terms of abstraction. Finally,... (read 1093 more words →)

What I learned from doing Quiz Bowl

Jacob G-W

I did Quiz Bowl throughout my time in high school, and looking back on it, it was a pretty positive thing to do! In this blog post, I want to make a list of some of the life lessons I have taken from Quiz Bowl.

If you know about Quiz Bowl, just skip to the next paragraph. For the uninitiated, Quiz Bowl is a team activity where students compete, typically on teams of four, to answer questions about the world. There are questions on physics, philosophy, mathematics, chemistry, literature, history, economics, videogames, and pretty much every subject. The format of the game is simple: the moderator reads a tossup and then players can... (read 1549 more words →)

Taking into account preferences of past selves

Jacob G-W

Problem

“You’ll have a great time wherever you go to college!” I constantly hear this. From my parents, my friends’ parents, my guidance counselor, and my teachers. I don’t doubt it. I’m sure I’ll have a lot of fun wherever I go. Since I’m trying to be very intentional about my college decision process, I’ve interviewed close to twenty students. And for the most part, all of them are having a great time!

This scares me. A lot.

If I can go anywhere and have a great time, then what should I choose my college based on? Ranking? Prestige? Food? Campus? Job opportunities? Cost?

After thinking more about this problem, I realized that although I’ll have... (read 1993 more words →)

From the outside, American schooling is weird

Jacob G-W

Jacob G-W, Malumit

g-w1

Hey, so I wanted to start this dialogue because we were talking on Discord about the secondary school systems and college admission processes in the US vs NZ, and some of the differences were very surprising to me.

I think that it may be illuminating to fellow Americans to see the variation in pedagogy. Let's start off with grades. In America, the way school works is that you sit in class and then have projects and tests that go into a gradebook. Roughly speaking, each assignment has a maximum number of points you can earn. Your final grade for a subject is $\frac{\sum_{i} \text{earned}_{i}}{\sum_{i} \text{max possible}_{i}}$. Every school has a different way of doing the grading though. Some

... (read 2513 more words →)
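As a worked example of the points-based scheme, here is a sketch that assumes the final grade is total points earned over total points possible; the gradebook entries are made up, and individual schools may weight assignments differently.

```python
def final_grade(assignments):
    """Total points earned divided by total points possible (assumed scheme)."""
    earned = sum(e for e, _ in assignments)
    possible = sum(m for _, m in assignments)
    return earned / possible

# Hypothetical gradebook: (points earned, max points) per assignment.
gradebook = [(18, 20), (45, 50), (27, 30)]
grade = final_grade(gradebook)  # 90 points earned out of 100 possible
```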

XAI releases Grok base model

Jacob G-W

We are releasing the base model weights and network architecture of Grok-1, our large language model. Grok-1 is a 314 billion parameter Mixture-of-Experts model trained from scratch by xAI.
This is the raw base model checkpoint from the Grok-1 pre-training phase, which concluded in October 2023. This means that the model is not fine-tuned for any specific application, such as dialogue.
We are releasing the weights and the architecture under the Apache 2.0 license.
To get started with using the model, follow the instructions at github.com/xai-org/grok.
Model Details
Base model trained on a large amount of text data, not fine-tuned for any particular task.
314B parameter Mixture-of-Experts model with 25% of the weights active on a given token.
Trained

... (read more)
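As a rough back-of-the-envelope reading of the figures above, 25% of 314B weights works out to about 78.5B weights used per token. The sketch below only does that arithmetic and assumes the 25% figure applies uniformly to the whole parameter count.

```python
TOTAL_PARAMS = 314e9      # 314B parameters, from the release notes
ACTIVE_FRACTION = 0.25    # 25% of weights active on a given token

# Assumption: the fraction applies to the full parameter count.
active_params = TOTAL_PARAMS * ACTIVE_FRACTION
summary = f"{active_params / 1e9:.1f}B active parameters per token"
```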

A great example of more dakka: https://www.nytimes.com/2024/03/06/health/217-covid-vaccines.html

(Someone got 217 covid shots to sell vaccine cards on the black market; they had high immune levels!)

Jacob G-W

Distillation Robustifies Unlearning

Gradient Routing: Masking Gradients to Localize Computation in Neural Networks

I found >800 orthogonal "write code" steering vectors

The Puritans would one-box: evidential decision theory in the 17th century

Jacob G-W

Distillation Robustifies Unlearning

Selective modularity: a research agenda

Creating Interpretable Latent Spaces with Gradient Routing

Gradient Routing: Masking Gradients to Localize Computation in Neural Networks

I found >800 orthogonal "write code" steering vectors

Building intuition with spaced repetition systems

What I learned from doing Quiz Bowl
