Running Lightcone Infrastructure, which runs LessWrong and Lighthaven.space. You can reach me at habryka@lesswrong.com.
(I have signed no contracts or agreements whose existence I cannot mention, which I am mentioning here as a canary)
No, it seems highly unlikely. Considered from a purely commercial perspective - which I think is the right one when considering the incentives - they are terrible customers! Consider:
That is good news! Though to be clear, I expect the default path by which they would become your customers, after some initial period of them using your products or having some partnership with you, would be via acquisition, which I think avoids most of the issues that you are talking about here (in general, "building an ML business with the plan of being acquired by a frontier company" has worked pretty well as a business model so far).
Whatever techniques end up being good are likely to be major modifications to the training stack that would be hard to integrate, so the options for doing such a deal without revealing IP are extremely limited, making cutting us out easy.
Agree on the IP point, but I am surprised that you say that most techniques would end up being major modifications to the training stack. The default product I was imagining is "RL on interpretability proxies of unintended behavior", and I think you could do that purely in post-training. I might be wrong here, I haven't thought that much about it, but my guess is it would just work?
I do notice I feel pretty confused about what's going on here. Your investors clearly must have some path to profitability in mind, and it feels to me that frontier model training is really where all the money is at. Do people expect lots of smaller specialized models to be deployed? What game is there in town that isn't frontier model training for this kind of training technique, if it does improve capabilities substantially?
You know your market better than I do, so I do update when you say that you don't see your techniques being used for frontier model training. But I find myself pretty confused about what the story actually is in the eyes of your investors (and you might not be able to tell me for one reason or another), and the flags I mentioned make me hesitant to update too much on your word here. So for now I will thank you for saying otherwise, make a medium-sized positive update, and would be interested if you could expand a bit on what the actual path to profitability is without routing through frontier model training. But I understand if you don't want to! I already appreciate your contributions here quite a bit.
I expect misalignment rates to be locally linear in intelligence [1]
I disagree! I think treacherous turn things will generally mean it's very hard to measure misalignment rates before capabilities exceed pivotal thresholds. I honestly have no idea how I would measure misalignment in current models at all (and think current "misalignment" measures tend to be if anything anticorrelated with attributes I care about). I do think this will get a bit easier around critical thresholds, but there is definitely a decent chance we will cross a critical threshold in a single model release.
That said, in many worlds it is still local, and it's better than nothing to check for local stuff. I just don't think it's a good thing to put tons of weight on.
Yep, I didn't come up with anything great, but still open to suggestions.
I also make a commitment to us not working on self-improving superintelligence, which I was surprised to need to make but is apparently not a given?
Thank you, I do appreciate that!
I do have trouble understanding how this wouldn't involve a commitment to not provide your services to any of the leading AI capability companies, who have all stated quite straightforwardly that this is their immediate aim within the next 2-3 years. Do you not expect that leading capability companies will be among your primary customers?
I would only endorse using this kind of technique in a potentially risky situation like a frontier training run if we were able to find a strong solution to the train/test issue described here.
Oh, cool, that is actually also a substantial update for me. The vibe I have been getting was definitely that you expect to use these kinds of techniques pretty much immediately, with frontier training companies being among your top target customers.
I agree with you that train/test splits might help here, and now that I think about it, I am actually substantially in favor of people figuring out the effect sizes here and doing science in the space. I do think that, given y'all's recent commercialization focus (plus asking employees to sign non-disparagement agreements, and in some cases secret non-disparagement agreements), you are in a tricky position as an organization I could trust to be reasonably responsive to evidence of the actual risks here. So I don't currently think y'all are the best people to do that science, but it does seem important to acknowledge that science in the space seems pretty valuable. A rough sketch of what I mean by the train/test idea is below.
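To make the train/test idea concrete, here is a minimal sketch (my own framing of the idea, not anything Goodfire has described; the metric names are hypothetical placeholders) of holding some interpretability metrics out of the training signal so they remain checks that aren't directly optimized against:

```python
# Minimal sketch: split interpretability metrics into ones used as a training
# signal and ones held out purely for evaluation, so the held-out set is not
# directly optimized against. Metric names are hypothetical placeholders.
import random

all_metrics = ["deception_probe", "sycophancy_probe", "power_seeking_probe",
               "sandbagging_probe", "goal_misgeneralization_probe"]

random.seed(0)
held_out = set(random.sample(all_metrics, k=2))
train_signal = [m for m in all_metrics if m not in held_out]

print("optimize against:", train_signal)
print("evaluate only:   ", sorted(held_out))
```

The question I'd want the science to answer is how much optimizing against the first set degrades the informativeness of the second.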
That doesn't align with the marketing copy I've seen (which has this featured as a pretty core part of their product). Maybe I am wrong? I haven't checked that hard.
Edit: This post also seems to place this quite centrally in their philosophy: https://www.goodfire.ai/blog/intentional-design
Goodfire's goal is to use interpretability techniques to guide the new minds we're building to share our values, and to learn from them where they have something to teach us.
Indeed, the "guess and check" feedback loop, which I think currently provides one of the biggest assurances we have that model internals are not being optimized to look good, is something he explicitly calls out as something to be fixed:
We currently attempt to design these systems by an expensive process of guess-and-check: first train, then evaluate, then tweak our training setup in ways we hope will work, then train and evaluate again and again, finally hoping that our evaluations catch everything we care about. Although careful scaling analyses can help at the macroscale, we have no way to steer during the training process itself. To borrow an idea from control theory, training is usually more like an open loop control system, whereas I believe we can develop closed-loop control.
Also given what multiple people who have worked with Goodfire, or know people well there, have told me, I am pretty confident it's quite crucial to their bottom line and sales pitches.
Not sure I am understanding it. I agree this might help you get a sense of whether non-frontier models have issues here, but it doesn't allow you to know whether frontier models have issues, since you can't run ablation experiments on frontier models without degrading them.
And then separately, I am also just not that confident that these controls won't mess with the performance characteristics in ways that are hard to measure.
I think it's a (mildly) big deal, but mostly in the "there is now a lab whose whole value proposition is to measure misalignment with interpretability and then train against that measurement" sense, and less in the "this will help a lot of good interpretability to happen" sense. My guess is this will overall make interpretability harder due to the pretty explicit policy of training against available interp metrics.
See also: https://x.com/livgorton/status/2019463713041080616
The default issue with this kind of plan is that ablation has strong performance implications, and we should expect misalignment to depend on high intelligence. I think this effect is large enough that I already wouldn't put that much weight on an ablated model not doing misalignment stuff.
Yes, this is what a p-value is? Perhaps I am confused; are you saying something here that I'm not saying?
By calling something the "p-value" you are elevating the null hypothesis to a special status (and usually leaving various things about it underspecified). Just don't do that. When reporting an experimental result, just report the probability of the observed data under multiple hypotheses. This would generally not be considered giving a "p-value".
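To make that concrete, here is a minimal toy sketch (my own example, not from the exchange) of reporting the probability of the observed data under several concrete hypotheses for a coin-flip experiment, next to the corresponding one-sided p-value:

```python
# Minimal sketch: report the observed data's probability under several
# hypotheses, versus a one-sided p-value against the null. Numbers are made up.
from scipy.stats import binom

k, n = 62, 100  # observed: 62 heads in 100 flips

# Likelihood of the data under a few concrete hypotheses about the bias.
for p in (0.5, 0.6, 0.7):
    print(f"P(data | bias={p}) = {binom.pmf(k, n, p):.4f}")

# The frequentist alternative: probability of data *at least this extreme*
# under the null hypothesis bias = 0.5.
p_value = binom.sf(k - 1, n, 0.5)
print(f"one-sided p-value under bias=0.5: {p_value:.4f}")
```

The first three numbers are what I mean by reporting the data's probability under multiple hypotheses; the last one is the thing that privileges the null.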
Like, it seems like your post is trying to say something like "ah, no, don't do Bayesian statistics, p-values are better sometimes actually". But no, Bayesian statistics in this sense is just better and more straightforward, as far as I can tell, and you get the things you would get from a p-value by default if you did any Bayesian statistics.
I am confused about this. Can't you just do this in post-training in a pretty straightforward way? You do a forward pass, run your probe on the activations, and then use your probe output as part of the RL signal. Why would this require any kind of complicated infrastructure stack changes?
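For what it's worth, here is roughly the shape of thing I have in mind, as a minimal sketch (assuming a HuggingFace-style causal LM and an already-trained linear probe on the activations; `probe`, `task_reward`, and the penalty weight are hypothetical placeholders, and a real RL setup like PPO would wrap this reward function):

```python
# Minimal sketch (not anyone's actual stack): fold a probe score into a
# post-training RL reward. Assumes a HuggingFace-style causal LM and a linear
# probe already trained on hidden activations; `probe` and `task_reward` are
# hypothetical placeholders.
import torch

def reward_with_probe_penalty(model, tokenizer, prompt, completion,
                              probe, task_reward, penalty_weight=1.0, layer=-1):
    """Score one sampled completion by combining the ordinary task reward
    with a penalty from an interpretability probe on the activations."""
    inputs = tokenizer(prompt + completion, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs, output_hidden_states=True)
    acts = outputs.hidden_states[layer]                # (batch, seq_len, d_model)
    probe_score = torch.sigmoid(probe(acts)).mean()    # pooled "unintended behavior" score
    return task_reward(prompt, completion) - penalty_weight * probe_score.item()
```

Of course, the worry from earlier in the thread still applies: as soon as the probe is part of the reward, the model is being optimized to look clean to the probe, which is exactly why held-out metrics matter. But infrastructure-wise, this seems like it lives entirely in post-training.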