Automated / strongly-augmented safety research.
I thought you meant the AI Scientist paper has some obvious (e.g. methodological or code) flaws or errors. I find that thread unconvincing, but we've been over this.
it's complete bunk
ignoring the huge errors in it
I genuinely don't know what you're referring to.
Fwiw, I'm linking to it because I think it's the first/clearest demo of how the entire ML research workflow (e.g. see Figure 1 in the arXiv paper) can plausibly be automated using LM agents, and they show a proof of concept which arguably already does something (in any case, it works better than I'd have expected it to). If you know of a better reference, I'd be happy to point to that instead/alternatively. Similarly if you can 'debunk it' (I don't think it's been anywhere near debunked).
Yup; there will also likely be a flywheel from automated ML research, e.g. https://sakana.ai/ai-scientist/ whose code is also open-source. Notably, o1 seemed to improve most at math and code, which seem like some of the most relevant skills for automated ML research. And it's not clear that other parts of the workflow will be blockers either, given some recent results showing human-comparable automated ideation and reviewing.
Yes. Also, if the Simulators model of LMs after pretraining is right (I think it is), it should also help a lot with safety in general, because the simulator should be quite malleable, even if some of the simulacra might be malign. As long as you can elicit the malign simulacra, you can also apply interp to them or do things in the style of Interpreting the Learning of Deceit for post-training. This could also help a lot with e.g. coup probes and other similar probes for monitoring.
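To make the probe idea concrete, here is a minimal sketch of a "coup probe"-style monitor: a single linear direction trained on cached activations to flag a target concept. All names, shapes, and the synthetic data are illustrative assumptions, not taken from the linked posts.

```python
# Minimal sketch of a linear probe used as a monitor.
# Shapes and data are placeholders; in practice the activations would be
# cached from the model's residual stream on contrast prompts.
import torch
import torch.nn as nn

d_model = 512       # assumed residual-stream width
n_examples = 1024   # labelled contrast examples

acts = torch.randn(n_examples, d_model)             # placeholder activations
labels = torch.randint(0, 2, (n_examples,)).float() # 1 = behaviour to flag

probe = nn.Linear(d_model, 1)  # one linear direction, as in linear probing
opt = torch.optim.Adam(probe.parameters(), lr=1e-3)
loss_fn = nn.BCEWithLogitsLoss()

for _ in range(200):
    opt.zero_grad()
    loss = loss_fn(probe(acts).squeeze(-1), labels)
    loss.backward()
    opt.step()

# At monitoring time, score new activations and flag high-probability cases.
with torch.no_grad():
    scores = torch.sigmoid(probe(acts).squeeze(-1))
    flagged = scores > 0.95
```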
Automating scalable oversight or RLHF research by quickly discovering new loss functions for training
See Discovering Preference Optimization Algorithms with and for Large Language Models.
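As a hedged illustration of the kind of search space that line of work operates in: the paper has LMs propose preference-optimization objectives as code and then evaluates them empirically. The sketch below is only a DPO-style placeholder candidate, not the paper's actual discovered losses or harness.

```python
# One candidate objective an LM-driven search might propose and score.
import torch
import torch.nn.functional as F

def candidate_loss(chosen_logratios: torch.Tensor,
                   rejected_logratios: torch.Tensor,
                   beta: float = 0.1) -> torch.Tensor:
    """Standard sigmoid (DPO-style) loss over policy/reference log-ratios.

    chosen_logratios / rejected_logratios are
    log pi_theta(y|x) - log pi_ref(y|x) for preferred / dispreferred responses.
    """
    margin = beta * (chosen_logratios - rejected_logratios)
    return -F.logsigmoid(margin).mean()

# An automated-research loop would generate many such functions as code,
# train small models with each, and keep the best-performing candidates.
chosen = torch.randn(8)    # placeholder log-ratios for a batch of pairs
rejected = torch.randn(8)
print(candidate_loss(chosen, rejected))
```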
Automating the ability to probe AI systems for deceptive power-seeking goals, via automated discovery of low-level features using interpretability tools
See A Multimodal Automated Interpretability Agent and Open Source Automated Interpretability for Sparse Autoencoder Features.
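In the same spirit, here is a rough sketch of an automated feature-interpretation loop for SAE features: an LM proposes a natural-language explanation from top-activating snippets, then the explanation is scored by how well it predicts held-out activations. `query_llm` is a hypothetical stand-in for an API call; the actual MAIA and auto-interp pipelines differ in detail.

```python
# Hedged sketch of automated interpretability for one SAE feature.
from typing import Callable, List, Tuple

def explain_feature(top_examples: List[Tuple[str, float]],
                    query_llm: Callable[[str], str]) -> str:
    """Ask an LM for a hypothesis about what a feature detects, given its
    highest-activating text snippets and their activation strengths."""
    shown = "\n".join(f"{act:.2f}: {text}" for text, act in top_examples)
    prompt = (
        "These snippets most strongly activate one feature of a sparse "
        "autoencoder trained on a language model. Propose a short explanation "
        f"of what the feature detects.\n{shown}\nExplanation:"
    )
    return query_llm(prompt)

def score_explanation(explanation: str,
                      held_out: List[Tuple[str, float]],
                      query_llm: Callable[[str], str]) -> float:
    """Crude scoring step: have the LM predict whether each held-out snippet
    should activate the feature, and compare against the true activations."""
    correct = 0
    for text, act in held_out:
        answer = query_llm(
            f"Feature explanation: {explanation}\n"
            f"Snippet: {text}\nWould this activate the feature? Answer yes or no."
        )
        pred = answer.strip().lower().startswith("yes")
        correct += int(pred == (act > 0))
    return correct / len(held_out)
```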
Disagree that it's obvious; it depends a lot on how effective (large-scale) RL (post-training) is at very significantly changing model internals, rather than just 'wrapping around' the pretrained model, making it more reliable, etc. In the past, post-training (including RL) has been really bad at this.
I think o1 is significant evidence in favor of the story here, and I expect OpenAI's next model to be further evidence still if, as rumored, it is pretrained on CoT synthetic data.
Also, coding and math seem like the most relevant proxy abilities for automated ML research (and probably also for automated prosaic alignment), and, crucially, in these domains it's much easier to generate verifiable synthetic training data (including at superhuman level), so it's hard to be confident models won't get superhuman in these domains soon.
I think o1 is significant evidence in favor of this view.
To maybe further clarify, I think of the Sakana paper roughly like how I think of AutoGPT. LM agents were overhyped initially, and AutoGPT specifically didn't work anywhere near as well as some people expected. But I expect LM agents as a whole will be a huge deal.