Thanks for the thoughtful reply!
I understand “resource scarcity” but I’m confused by “coordination problems”. Can you give an example? (Sorry if that’s a stupid question.)
This is the idea that at some point in scaling up an organization you could lose efficiency because you need more/better management, more communication (meetings), and longer communication chains: "bloat" in general. I'm not claiming it's likely to happen with AI, just that it's another possible reason for increasing marginal cost with scale.
Resource scarcity seems unlikely to bite here, at least not for long. If some product is very profitable to create, and one of its components has a shortage, then people (or AGIs) will find ways to redesign around that component.
Key resources that come to mind are electricity and chips (and the materials to produce them). I don't know how elastic production is in these industries, but the reason I expect scarcity to be a barrier is that you're constrained by the slowest factor. For huge transformations, or for redesigning significant parts of the current AI pipeline (like using a different kind of computation), I think there's probably a lot of serial work that has to be done to make it work. I agree the problems are solvable, but the question shifts from "how much demand will there be for cheap AGI?" to "how fast can resources be scaled up?"
I wasn’t complaining about economists who say “the consequences of real AGI would be [crazy stuff], but I don’t expect real AGI in [time period T / ever]”. That’s fine!
Instead I was mainly complaining about the economists who have not considered that real AGI is even a possible thing at all. It's just a big blind spot for them.
Yeah, I definitely agree.
And I don’t think this is independent of their economics training (although non-economists are obviously capable of having this blind spot too).
Instead, I think that (A) “such-and-such is just not a thing that happens in economies in the real world” and (B) “real AGI is even a conceivable possibility” are contradictory. And I think that economists are so steeped in (A) that they consider it to be a reductio ad absurdum for (B), whereas the correct response is the opposite ((B) disproves (A)).
I see how this could happen, but I'm not convinced this effect is actually happening. As you mention, many people have this blind spot; I think my crux is that it isn't unique to economists. There are people who claim AGI is already here (and evidently have a different definition of AGI). Most non-AI people who are worried about AI seem worried that it will take their job, not all jobs. Some people are willing to accept at face value the premise that AGI (as we define it) will exist, but it seems to me that most people outside of AI who question the premise at all end up not taking it seriously.
I agree that economists make some implicit assumptions about what AGI will look like that should be more explicit. But, I disagree with several points in this post.
On equilibrium: a market equilibrates when supply and demand balance at the current price. At any given instant this can happen even in a market with AGI (sellers raise prices until buyers are no longer willing to buy). Being at an equilibrium doesn't imply that supply, demand, and price won't change over time; economists are very familiar with growth and various kinds of dynamic equilibria.
Equilibria aside, it is an interesting point that AGI combines aspects of both labor and capital in novel ways. Being able to both replicate and work in autonomous ways could create very interesting feedback loops.
Still, there could be limits and negative feedback on the feedback loops you point out. The ideas that labor adds value and that costs go down with scale are usually true but not universal. Things like resource scarcity or coordination problems can cause increasing marginal cost with scale. If there is very powerful AGI and a very fast takeoff, I expect resource scarcity to be a constraint.
I agree that AGI could break usual intuitions about capital and labor. However, I don’t think this is misleading economists. I think economists don’t consider AGI launching coups or pursuing jobs/entrepreneurship independently because they don’t expect it to have those capabilities or dispositions, not because they conflate it with inanimate capital. Even in the post linked, Tyler Cowen says that “I don’t think the economics of AI are well-defined by either “an increase in labor supply,” “an increase in TFP,” or “an increase in capital,” though it is some of each of those.”
Lastly, I fully agree that GDP doesn’t capture everything of value - even now it completely misses the value of free resources like Wikipedia and of unpaid labor like housework, and it can underestimate the value of new technology. Still, if AGI transforms many industries, as it would likely need to in order to transform the world, real GDP would capture this.
All in all, I don’t think economics principles are misleading. Maybe econ thinking will have to be expanded to deal with AGI. But right now, the difference between the economists and the LessWrongers comes down to what capabilities they expect AGI to have.
I see what you mean. I would have guessed that the unlearned model's behavior is meaningfully different from "produce noise on harmful inputs, else the original behavior". My guess is that the "noise if harmful" part is accurate, but that the small differences in logits on non-harmful data are quite important. We didn't run experiments on this. It would be an interesting empirical question to answer!
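For concreteness, here is roughly how I imagine probing it (just a sketch, assuming a Hugging Face-style API; the model names are placeholders): compare the original and unlearned models' next-token distributions on clearly benign text.

```python
# Sketch: how far does the unlearned model's next-token distribution drift from
# the original model on benign text? Model names below are placeholders.
import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("original-model")                # placeholder
orig = AutoModelForCausalLM.from_pretrained("original-model")        # placeholder
unlearned = AutoModelForCausalLM.from_pretrained("unlearned-model")  # placeholder

def mean_kl_on(text: str) -> float:
    ids = tok(text, return_tensors="pt").input_ids
    with torch.no_grad():
        p = F.log_softmax(orig(ids).logits, dim=-1)       # original log-probs
        q = F.log_softmax(unlearned(ids).logits, dim=-1)  # unlearned log-probs
    # KL(original || unlearned), averaged over token positions
    return (p.exp() * (p - q)).sum(-1).mean().item()

print(mean_kl_on("A clearly benign sentence about everyday topics."))
```

If that number is near zero on benign data, "noise on harmful, else original" is a decent approximation; if it's consistently nonzero, those small differences may be doing real work.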
Also, there could be some variation in how true this is between different unlearning methods. We did find that RMU+distillation was less robust in the arithmetic setting than the other initial unlearning methods.
Fwiw, I'm not sure that RMU is a better unlearning method than simpler alternatives. I think it might just appear better on WMDP because the WMDP datasets are very messy and don't isolate the capability well (a cleaned dataset would do this better), so performance on the evaluation relies on unnecessary generalization.
Yeah, I'm also surprised by it. I have two hypotheses, but it could be for other reasons I'm missing. One hypothesis is that we kept temperature=1 for the KL divergence, and using a different temperature might be important to distill faster. The second is that we undertrained the pretrained models, so pretraining was shorter while distillation took around the same amount of time. I'm not really sure though.
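For reference, by temperature I mean the usual knob in the distillation objective; a minimal sketch of standard temperature-scaled distillation, not our exact training code:

```python
import torch.nn.functional as F

def distill_loss(student_logits, teacher_logits, T: float = 1.0):
    # student_logits, teacher_logits: (num_tokens, vocab_size).
    # Soften both distributions with temperature T. T > 1 spreads probability mass
    # over more tokens, which could plausibly change how quickly the student
    # matches the teacher; we only tried T = 1.
    log_q = F.log_softmax(student_logits / T, dim=-1)  # student
    p = F.softmax(teacher_logits / T, dim=-1)          # teacher
    # KL(teacher || student); the T**2 factor keeps gradient scale comparable across T.
    return F.kl_div(log_q, p, reduction="batchmean") * (T ** 2)
```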
I think it depends on what kind of 'alignment' you're referring to. Insofar as alignment is a behavioral property (not saying bad things to users, not being easily jailbreakable), I think our results weakly suggest that this kind of alignment would transfer, and perhaps even get more robust.
One hypothesis is that pretrained models learn many 'personas' (including 'misaligned' ones) and post-training shapes/selects a desired persona. Maybe distilling the post-trained model would only, or primarily, transfer the selected persona and not the other ones. I don't think we can draw conclusions yet, but it sounds like an interesting idea for further work! Though it would be expensive to distill a large post-trained model, it could be more tractable to find an open-source one and evaluate various alignment properties compared to the teacher.
However, for more intrinsic alignment properties (is the model scheming, does the model have a misaligned goal), it's less clear how they might develop in the first place. I'm not sure whether distillation would reliably transfer these properties.
Also importantly, I would be concerned that misalignment could emerge during the RL process or any further training.
Yeah, I totally agree that targeted noise is a promising direction! However, I wouldn't take the exact % of pretraining compute that we found as a hard number, but rather as a comparison between the different noise levels. I would guess labs may have better distillation techniques that could speed it up. It also seems possible that you could distill into a smaller model faster and still recover performance. This would require a modification to the UNDO initialization (e.g. initializing the student as a noised version of the smaller model rather than of the teacher), but it still seems possible. Also, some labs already do distillation anyway, and in those cases it would have a smaller added cost.
Thanks for the comment!
I agree that exploring targeted noise is a very promising direction and could substantially speed up the method! Could you elaborate on what you mean about unlearning techniques during pretraining?
I don't think datafiltering+distillation is analogous to unlearning+distillation. During distillation, the student learns from the teacher's predictions, not from the data itself, and those predictions can leak information about the undesired capability even on benign data. In a preliminary experiment, we found that datafiltering+distillation was ineffective in a TinyStories setting, but more effective in the language setting (see this comment). It's possible that real-world applications differ from the former setting. Maybe the contexts in which information about the forget capability is revealed are always different/identifiable, in which case datafiltering+distillation would be effective, but I expect this usually isn't the case.
As a concrete example, let's say we want to unlearn the following fact:
The company x data center is in location y.
We filter all of the sentences that give information about the datacenter in location y, but there still is a benign sentence that says:
The company x data center is in location z.
Given that the teacher model knows about the data centers in both location y and location z, it will put high probability on both y and z at the location token, and the student will learn about both data centers.
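A toy illustration of the leak (made-up probabilities, not numbers from our experiments): even with every location-y sentence filtered out of the distillation data, the teacher's distribution at the location token still puts mass on y, and the student is trained to match that whole distribution.

```python
import torch
import torch.nn.functional as F

# Hypothetical teacher distribution at the token after
# "The company x data center is in location ...":
# indices 0, 1, 2 stand in for "z", "y", and everything else.
teacher_probs = torch.tensor([0.55, 0.40, 0.05])

# Data filtering only changed the inputs; the distillation loss still matches
# the full distribution, so the student is pushed toward "y" as well as "z".
student_logits = torch.zeros(3, requires_grad=True)
loss = F.kl_div(F.log_softmax(student_logits, dim=-1), teacher_probs, reduction="sum")
loss.backward()
print(student_logits.grad)  # negative gradient on the "y" index: its logit gets pushed up
```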
Maybe there's a way to have a classifier that predicts whether the teacher model will reveal any information about the forget capability, but it seems a bit complicated by the fact that you can't just look at the top logit.
I do think unlearning+distillation is conceptually analogous to datafiltering+pretraining. However, I think there are practical differences, including the following:
Thanks for the thought! We did try different approaches to noise scheduling like you suggest. From what we tried, adding noise only once resulted in faster distillation for the same total amount of noise added / robustness gained. However, we didn't run comprehensive experiments on it, so it's possible that more experimentation would provide new insights.
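To make the two variants concrete, here's a rough sketch of what I mean (illustrative only, not our actual implementation; `sigma` and `total_steps` are placeholders):

```python
import math
import torch

def noise_once(model, sigma: float = 0.1):
    # One-shot variant: perturb the weights a single time before distillation starts.
    with torch.no_grad():
        for p in model.parameters():
            p.add_(sigma * torch.randn_like(p))

def noise_each_step(model, sigma: float = 0.1, total_steps: int = 1000):
    # Scheduled variant: inject a small perturbation at every distillation step.
    # Independent Gaussian perturbations add in variance, so sigma / sqrt(total_steps)
    # per step roughly matches the one-shot variant's total perturbation.
    with torch.no_grad():
        for p in model.parameters():
            p.add_((sigma / math.sqrt(total_steps)) * torch.randn_like(p))
```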
My guess is that actively unlearning on the unfaithful CoTs, or fine-tuning to make CoTs more faithful, and then distilling, would be more effective.
In a preliminary experiment, we found that even when we filtered the forget data out of the distillation dataset, the student still performed well on the forget data. This was in a setting where the contexts of the forget and retain data were likely quite similar, so the teacher's predictions on the retain data were enough for the student to recover the behavior on the forget data.
With distillation, the student model learns from the teacher's predictions, not from the data itself, so to change the learning target, changing the teacher's predictions rather than the distillation data seems more effective. In that way, datafiltering+pretraining is analogous to unlearning+distillation. That said, we ran the same experiment in the language setting, where the context was extremely different between the forget and retain sets, and there datafiltering+distillation was effective.
I would predict that the effectiveness of datafiltering+distillation varies with how similar the context is between the forget and retain data. In the case of CoT faithfulness, I think the most salient question about context similarity is: how stochastic are models on faithfulness, given an input? I.e., on inputs where they produce faithful CoT completions, do they still sometimes give significant weight in the logits to unfaithful completions?
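One rough way to probe that question (just a sketch; it assumes you already have faithful and unfaithful completions labeled for the same prompt, which is the hard part):

```python
import torch
import torch.nn.functional as F

def completion_logprob(model, tok, prompt: str, completion: str) -> float:
    # Total log-probability the model assigns to `completion` given `prompt`.
    # Assumes tokenizing prompt+completion keeps the prompt's tokenization intact.
    prompt_len = tok(prompt, return_tensors="pt").input_ids.shape[1]
    full_ids = tok(prompt + completion, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(full_ids).logits
    log_probs = F.log_softmax(logits[:, :-1], dim=-1)   # position i predicts token i+1
    targets = full_ids[:, 1:]
    token_lp = log_probs.gather(-1, targets.unsqueeze(-1)).squeeze(-1)
    return token_lp[:, prompt_len - 1:].sum().item()    # completion tokens only

# Hypothetical usage: compare the mass on a faithful vs. an unfaithful CoT for one input.
# log_ratio = completion_logprob(model, tok, prompt, faithful_cot)
#             - completion_logprob(model, tok, prompt, unfaithful_cot)
```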
In any case, fine-tuning for faithful CoT and then distilling seems promising. I'm curious whether the first step of fine-tuning for CoT faithfulness has been explored. Even without the distillation step, it could be useful.
I had the same thought. Some of the graphs at first glance seem to have an inflection point at the ChatGPT release, but on closer inspection the trend seems to have started before ChatGPT. For example, these seem to show that even at the beginning, in early 2021, more-exposed jobs were already growing more slowly than less-exposed jobs. I also agree the story could be true, but I'm not sure these graphs are strong evidence without more analysis.