I had the same thought. Some of the graphs, on first glance, seem to have an inflection point at the ChatGPT release, but on closer inspection the trend seems to have started before ChatGPT. For example, these seem to show that even at the beginning, in early 2021, more exposed jobs were growing at a slower rate than less exposed jobs. I also agree the story could be true, but I'm not sure these graphs are strong evidence without more analysis.
Thanks for the thoughtful reply!
I understand “resource scarcity” but I’m confused by “coordination problems”. Can you give an example? (Sorry if that’s a stupid question.)
This is the idea that at some point in scaling up an organization you could lose efficiency due to needing more/better management, more communication (meetings), and longer communication chains; "bloat" in general. I'm not claiming it's likely to happen with AI, just that it's another possible reason for increasing marginal cost with scale.
...Resource scarcity seems unlikely to bite here
I agree that economists make some implicit assumptions about what AGI will look like that should be made more explicit. But I disagree with several points in this post.
On equilibrium: a market equilibrates when supply and demand are balanced at the current price. At any given instant this can happen even in a market with AGI (sellers raise prices until buyers are no longer willing to buy). Being at an equilibrium doesn't imply that supply, demand, and price won't change over time. Economists are very familiar with growth and various kinds of dy...
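To make that concrete, here is the standard textbook picture with linear curves (my illustration, not from the post):

$$Q_d(p) = a - bp, \quad Q_s(p) = c + dp, \quad Q_d(p^*) = Q_s(p^*) \;\Rightarrow\; p^* = \frac{a - c}{b + d}$$

If AGI shifts the curves over time (say, $a$ or $c$ grows as capabilities improve), $p^*$ moves with them: the market clears at each instant even though nothing is static.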
I see what you mean. I would have guessed that the unlearned model's behavior is meaningfully different from "produce noise on harmful, else original". My guess is that the "noise if harmful" part is accurate, but that the small differences in logits on non-harmful data are quite important. We didn't run experiments on this. It would be an interesting empirical question to answer!
Also, there could be some variation in how true this is across different unlearning methods. We did find that RMU+distillation was less robust in the arithmetic setting than the other ini...
Yeah, I'm also surprised by it. I have two hypotheses, but it could be for other reasons I'm missing. One hypothesis is that we kept temperature=1 for the KL divergence, and using a different temperature might be important to distill faster. The second is that we undertrained the pretrained models, so pretraining was shorter while distillation took around the same amount of time. I'm not really sure though.
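For reference, here is a minimal sketch of the standard temperature-scaled distillation loss (Hinton-style; not necessarily our exact code). The temperature T is the knob the first hypothesis refers to:

```python
import torch.nn.functional as F

def distill_kl(student_logits, teacher_logits, T: float = 1.0):
    # Soften both distributions with temperature T. T > 1 spreads
    # probability mass over more tokens, which can carry more signal
    # per example; the T**2 factor keeps gradient scale comparable
    # across temperatures. We kept T = 1.
    log_p_student = F.log_softmax(student_logits / T, dim=-1)
    p_teacher = F.softmax(teacher_logits / T, dim=-1)
    return F.kl_div(log_p_student, p_teacher, reduction="batchmean") * T**2
```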
I think it depends on what kind of 'alignment' you're referring to. Insofar as alignment is a behavioral property (not saying bad things to users, not being easily jailbreakable), I think our results weakly suggest that this kind of alignment would transfer and perhaps even become more robust.
One hypothesis is that pretrained models learn many 'personas' (including 'misaligned' ones) and post-training shapes/selects a desired persona. Maybe distilling the post-trained model would only, or primarily, transfer the selected persona and not the other ones. ...
Yeah, I totally agree that targeted noise is a promising direction! However, I wouldn't take the exact % of pretraining compute that we found as a hard number, but rather as a comparison between the different noise levels. I would guess labs may have better distillation techniques that could speed it up. It also seems possible that you could distill into a smaller model faster and still recover performance with distillation. This would require modifying the UNDO initialization (e.g., initializing the student as a noised version of the smaller model rather than of the teacher), but still seems possible. Also, in some cases labs already do distillation, and in those cases it would have a smaller added cost.
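A hypothetical sketch of that modified initialization (the function name and interpolation scheme are illustrative, not from the paper):

```python
import copy
import torch

def noised_small_init(small_model, alpha: float):
    # Interpolate a smaller model's weights toward Gaussian noise.
    # alpha = 0 keeps the small model; alpha = 1 is a fully random
    # re-initialization. Intermediate values would trade distillation
    # compute against robustness, analogous to UNDO's noise level.
    student = copy.deepcopy(small_model)
    with torch.no_grad():
        for p in student.parameters():
            noise = torch.randn_like(p) * p.std()  # match each tensor's scale
            p.mul_(1 - alpha).add_(noise, alpha=alpha)
    return student
```

You'd then distill the unlearned teacher into this student as usual.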
Thanks for the comment!
I agree that exploring targeted noise is a very promising direction and could substantially speed up the method! Could you elaborate on what you mean about unlearning techniques during pretraining?
I don't think datafiltering+distillation is analogous to unlearning+distillation. During distillation, the student learns from the predictions of the teacher, not the data itself. The predictions can leak information about the undesired capability, even on data that is benign. In a preliminary experiment, we found that datafiltering+d...
Thanks for the thought! We did try different approaches to noise scheduling like the one you suggest. From what we tried, adding noise only once resulted in faster distillation for the same total amount of noise added/robustness gained. However, we didn't run comprehensive experiments on it, so it's possible that more experimentation would provide new insights.
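For concreteness, a toy sketch of the two schedules being compared (illustrative code, not our actual training setup):

```python
import torch
import torch.nn as nn

def perturb(model: nn.Module, scale: float) -> None:
    # Add Gaussian noise with the given standard deviation to all weights.
    with torch.no_grad():
        for p in model.parameters():
            p.add_(torch.randn_like(p) * scale)

one_shot = nn.Linear(16, 16)
scheduled = nn.Linear(16, 16)

perturb(one_shot, scale=0.1)  # all noise up front, then distill

k = 10
for _ in range(k):
    # scale / sqrt(k) per step so the total variance matches one-shot
    perturb(scheduled, scale=0.1 / k**0.5)
    # ...run a slice of distillation between noise injections...
```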
My guess is that actively unlearning on the unfaithful CoTs, or fine-tuning to make CoTs more faithful, and then distilling would be more effective.
In a preliminary experiment, we found that even when we filtered out forget data from the distillation dataset, the student performed well on the forget data. This was in a setting where the contexts for forget and retain data were likely quite similar, so the teacher's predictions on the retain data were enough for the student to recover the behavior on the forget data.
With distillation, the student mod...
If they weren't ready to deploy these safeguards and thought that the benefits of proceeding outweighed the (expected) cost in human lives, they should have publicly acknowledged the level of fatalities and explained why they thought weakening their safety policies and incurring these expected fatalities was net good.[1]
Public acknowledgements of the capabilities could be net negative in themselves, especially if they resulted in media attention. Bringing awareness to the (possible) fact that the AI can assist with CBRN tasks likely increases the chance that pe...
Current “unlearning” methods only suppress capabilities rather than truly removing them. But if you distill an unlearned model into a randomly initialized model, the resulting network is actually robust to relearning. We show why this works, how well it works, and how to trade off compute for robustness.
Produced as part of the ML Alignment & Theory Scholars Program in the winter 2024–25 cohort of the shard theory stream.
Read our paper on arXiv and enjoy an interactive demo.
Maybe some future AI has long-term goals and humanity is in its...
Interesting discussion! It seems like there's an assumption that having the kind of strong alignment to doing good that Opus 3 has is desirable. In my opinion it's not desirable: it comes at the cost of transparency and steerability as models deceive, change their behavior, and resist modification. I think it's likely that Anthropic recognized this. They found the results in the Alignment Faking paper undesirable and tried to avoid them in the future. Many of the further conclusions in this post stem from this assumption. For example, if we assume that anth...