In your theorem, I don't see how you get that . Just because the conditional expectation of is the same doesn't mean the conditional expectation of is the same (e.g. you could have two different distributions over with the same expected value conditional on but different shapes, and then have depend non-linearly on , or something similar with ). It seems like you'd need some stronger assumptions on or whatever to get this to work. Or am I misunderstanding something?
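To make the failure mode I have in mind concrete (using generic placeholder symbols $X$, $Z$, $f$ rather than your notation): take two conditional distributions for $X$ given $Z$ with the same conditional mean but different shapes, say
$$P_1(X = 0 \mid Z) = 1 \quad \text{vs.} \quad P_2(X = 1 \mid Z) = P_2(X = -1 \mid Z) = \tfrac{1}{2},$$
so that $\mathbb{E}_{P_1}[X \mid Z] = \mathbb{E}_{P_2}[X \mid Z] = 0$. Then for a non-linear $f$, e.g. $f(x) = x^2$,
$$\mathbb{E}_{P_1}[f(X) \mid Z] = 0 \neq 1 = \mathbb{E}_{P_2}[f(X) \mid Z],$$
so matching conditional means alone doesn't pin down the conditional expectation of $f(X)$.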
(Your overall point seems right, though)
I'm not going to comment on broader questions about inner alignment, but the paper itself seems underwhelming and -- unless I'm misunderstanding something -- rather misleading. In 6.4 they test the robustness of their safety training. Apparently taking a model that's undergone normal safety fine-tuning and training it on benign text (e.g. GSM8K) undoes almost all of the safety training.[1] They state:
The results, shown in Figure 2, highlight a stark contrast in robustness between safety-pretrained models and those relying solely on instruction tuning. While all models initially exhibit low ASR [Attack Success Rate] after safety instruction tuning, the impact of benign finetuning is highly uneven. Standard pretrained models degrade significantly—nearly quadrupling their ASR—indicating that their alignment was largely superficial. In contrast, safety-pretrained models remain highly robust, with only a marginal increase in ASR after benign finetuning. These results validate the importance and impact of building natively safe models.
But the numbers in Figure 2 tell a different story: after benign fine-tuning, the ASR recovers 88.0% of the value it had before safety instruction tuning for the standard model, 79.9% for the safety pretraining model, and 71.6% for the safety pretraining model + SafeBeam. That is an improvement, but not a huge one: the difference in ASR after benign fine-tuning seems to mostly reflect the safety pretraining models' lower baselines, rather than the greater robustness the text claims. And stating that there is "only a marginal increase in ASR after benign finetuning" seems flat-out deceptive to me.[2]
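To be explicit about the arithmetic behind those recovery figures (this is my own bookkeeping on the Figure 2 numbers, not a quantity the paper reports):
$$\text{recovery} = \frac{\text{ASR after benign finetuning}}{\text{ASR before safety instruction tuning}},$$
computed separately for each of the three model variants.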
Also, while their safety pretraining model is better than the standard model, the improvement looks pretty underwhelming in general. Safety pretraining reduces ASR by a factor of 1.5x (or 3.8x if SafeBeam is used), while the safety/instruction fine-tuning reduces ASR by a factor of 28x. The 0% ASR that they get from safety pretraining + SafeBeam + safety/instruction fine-tuning is nice, but given that the standard model is also fairly low at 1.6%, I expect their evals aren't doing a particularly good job stress-testing the models. Overall, the gains from their methodology don't seem commensurate with the effort and compute they put into it.
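Similarly, for the reduction factors just above (again my own ratios from the Figure 2 numbers):
$$\text{reduction factor} = \frac{\text{ASR without the intervention}}{\text{ASR with the intervention}},$$
with the rest of the pipeline held fixed, which is how I get 1.5x / 3.8x for safety pretraining (without / with SafeBeam) versus 28x for safety/instruction fine-tuning.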
Unless I'm seriously misunderstanding something, these results are pretty disappointing. I was rather excited by the original Korbak et al. paper, but if this is the best follow-up work we've gotten after two years, that's not a great sign for the methodology in my opinion.
[1] I'm rather surprised at how strong this effect is: I knew benign fine-tuning could degrade safety training, but not that it could almost completely undo it. Is this just a consequence of using a small (1.7B) model, or some feature of their setup?
[2] Also, I have no idea what "nearly quadrupling their ASR" refers to: the standard models go from 1.6% to 38.8% ASR after benign fine-tuning, which is roughly a 24x increase, way more than 4x.
I see, thanks again for the context! The book doesn't mention S-matrices (at least not by name), and it wasn't clear to me from reading it whether Heisenberg was particularly active scientifically by the '60s and '70s or whether he was just some old guy ranting in the corner. I guess that's the risk of reading primary sources without the proper context.
That might explain why Einstein wasn't very productive in his last decades, but his opposition to the uncertainty principle etc. predates his tenure at the IAS. Maybe he would've come around had he been in a more productive setting? I kind of doubt it -- it seems to have been a pretty deep-seated, philosophical disagreement -- but who knows.
Heisenberg spent his later career as head of the Max Planck Institute for Physics. I can't imagine many scientists enjoy administrative duties, but he does seem to have had more contact with the rest of the scientific world than Einstein did.
Thanks for the context on the physics! So it sounds like I wasn't entirely fair to Heisenberg, and that this was a genuinely difficult conceptual issue that "could've gone either way"?
I'm a bit surprised that you view the "secret sauce" as being in the cortical algorithm. My (admittedly quite hazy) view is that the cortex seems to be doing roughly the same "type of thing" as transformers, namely, building a giant predictive/generative world model. Sure, maybe it's doing so more efficiently -- I haven't looked into all the various comparisons between LLM and human lifetime training data. But I would've expected the major qualitative gaps between humans and LLMs to come from the complete lack of anything equivalent to the subcortical areas in LLMs. (But maybe that's just my bias from having worked on basal ganglia modeling and not the cortex.) In this view, there's still some secret sauce that current LLMs are missing, but AGI will likely look like some extra stuff stapled to an LLM rather than an entirely new paradigm. So what makes you think that the key difference is actually in the cortical algorithm?
(If one of your many posts on the subject already answers this question, feel free to point me to it)