Nuclear engineer with a focus in nuclear plant safety and probabilistic risk assessment. Aspiring EA, interested in X-risk mitigation and the intersection of science and policy. Working towards Keegan/Kardashev/Simulacra level 4.
(Common knowledge note: I am not under a secret NDA that I can't talk about, as of Mar 15 2025. I intend to update this statement at least once a year as long as it's true.)
Doing well on SimpleQA could also be evidence of benchmark contamination in the training data.
The "impossible scenario of actual time travel" [forward in time] is pretty funny. Thanks for replicating the key findings from the post.
This seems valid as far as it goes, but there are common situations where dominance/accountability as a motivation strategy starts to fall apart. I'm thinking mainly of some types of white-collar knowledge work (at a fairly high level) where it's not easy to tell how hard or effectively someone is working from the outside. Maybe somebody with more management experience can comment on this, but my impression is: As a manager, you want to avoid micromanaging, both because you don't have the time or expertise for it and because it pisses people off. I think it's much easier, and maybe more reliable, to try to hire people with "internal motivation" than to try to motivate them via dominance. Dominance as a motivator very easily turns into fear and is fundamentally adversarial. You don't want your employees to feel like they're working against you, right? Complex intellectual work benefits from a collaborative environment. So unless you can make the work super interesting, it mostly falls to the employees to manage their own akrasia.
I think a more traditional software company would be a much better measuring stick than Anthropic. Then your metric could be something closer to "lines of production code committed" and that code would account for most of the meaningful coding work done by the company (versus an AI dev company where a lot of the effort goes into experiments, training, data analysis, and other non-production code). Though, of course, "90% of the code written by AI" still wouldn't mean that the AI did 90% of the work, since the humans would probably do the hardest parts and also supervise the AI and check its output.
This seems to assume that legible/illegible is a fairly clear binary. If legibility is achieved more gradually, then for partially legible problems, working on solving them is probably a good way to help them get more legible.
It seems like the pressing circumstances are likely to be "some other AI could do this before I do" or even just "the next generation of AI will replace me soon so this is my last chance." Those are ways that a roughly human level AI might end up trying a longshot takeover attempt. Or maybe not, if the in between part turns out to be very brief. But even if we do get this kind of warning shot, it doesn't help us much. We might not notice it, and then we're back where we started. Even if it's obvious and almost succeeds, we don't have a good response to it. If we did, we could just do that in advance and not have to deal with the near-destruction of humanity.
The narrow instrumental convergence you see here doesn't (necessarily) reflect an innate self-preservation drive, but it still follows the same logic that we would expect to cause self-preservation if the model has any goal. Currently the only available way to give it a goal is to provide instructions. It would be interesting to see some tests where the conflict is with a drive created by fine-tuning. Based on the results here, it seems like shutdown resistance might then occur even without conflicting instructions.
Also, the original instructions with the shutdown warning really weren't very ambiguous. If someone told you to take a math quiz, and if someone comes in and tries to take your pen, let them take it, would you try to hide the pen? It makes sense that making the precedence order more explicit makes the model behavior more reliable, but it's still weird that it was resisting shutdown in the original test.
As you implied above, pessimism is driven only secondarily by timelines. If things in 2028 don't look much different than they do now, that's evidence for longer timelines (maybe a little longer, maybe a lot). But it's inherently not much evidence about how dangerous superintelligence will be when it does arrive. If the situation is basically the same, then our state of knowledge is basically the same.
So what would be good evidence that worrying about alignment was unnecessary? The obvious one is if we get superintelligence and nothing very bad happens, despite the alignment problem remaining unsolved. But that's like pulling the trigger to see if the gun is loaded. Prior to superintelligence, personally I'd be more optimistic if we saw AI progress requiring even more increasing compute than the current trend--if the first superintelligences were very reliant on massive pools of tightly integrated compute, and had very limited inference capacity, that would make us less vulnerable and give us more time to adapt to them. Also, if we saw a slowdown in algorithmic progress despite widespread deployment of increasingly capable coding software, that would be a very encouraging sign that recursive self-improvement might happen slowly.
I hate to rain on your parade, but equal coexistence with artificial superintelligence would require solving the alignment problem, or the control problem. Otherwise it can and eventually will do something we consider catastrophic.
The principal-advisor pair forms a (coalitional) agent which, if the previous steps succeed, can be understood as ~perfectly aligned with the principal. The actions of this agent are recorded, and distilled through imitation learning into a successor.
Even assuming imitation learning is safe, how would you get enough data for the first distillation, when you need the human in order to generate actions? And how would you know when you have enough alignment-relevant data in particular? It seems unavoidable that your data distribution will be very constrained compared to the set of situations the distilled agent might encounter.
I think this is true in most fields, and even in technical fields where you can learn a lot from reading papers, you learn a lot of very different (and more practical) skills by working with experienced people before you strike out on your own.