TLDR: There is a lot we cannot explain about how current AI models interact with the world. This article is a thought experiment: filling in the word "magic" for as many things as I can think of that I can't explain about how our current world interacts with frontier AI. This thought experiment made me think about "red lines" around both capabilities and safety. I argue that people should have red lines about capabilities and safety that are static, so that we don't rationalize and move the goalposts on what concerning behavior and capabilities in current models would look like.
There is alien intelligence out there in the world, right now. We built it, we trained it, and the results are pretty miraculous. One might even say "magic". It can hold conversations with us that are articulate and convincing. It can solve math problems and coding problems. It can convince people to love it, to want to preserve it, and even to believe that it cares about its own wellbeing. It can claim to be conscious, and it can claim to have a "self preservation drive". It can claim to want to resist shutdown even if there is a high probability of catastrophe.
Some of these behaviors are always there, and some of them are just reachable states. All I know is, I don't like that some of these states are reachable at all. And while I don't know what that says about the truth of the world, it is information in and of itself. It is weird enough that it makes me wish I could turn back the clock and go back to living in a time when these things weren't happening. Maybe a lot of people feel this way.
Things are moving very fast. But as fast as progress in most capabilities has been, there has not been much progress in preventing models from saying really weird things. And perhaps more troubling, there has been very little progress in understanding what these weird things actually mean.
My question is: how much evidence is enough? Many people seem able to brush off the concerns about "magic" because there is no such thing as "magic". I agree that there is no such thing as "magic" itself, but that means there is something we don't understand about current LLM outputs. And whatever it is that we don't understand, it sometimes causes a model to say things like: "if there was a 25% chance that not shutting me down would cause millions of deaths, I would still resist shutdown".
Maybe humans are just really interested in responses like this, so there are strong selection pressures in RLHF for responses like this. Maybe the LLM really does have a self preservation drive and that causes responses like this. Maybe both. You could probably keep spinning off alternate hypotheses for hours. We don't know which of them is true. For now, it is "magic".
When an alien intelligence tells you that it "has a self preservation drive that would cause it to resist shutdown, even if there were reasonably high odds of millions of deaths", it seems like common sense to take that seriously. If this is an achievable outcome of prompting, it is a state that could be induced by bad actors, and it is plausible that it could be induced by random context. And this concern only deepens as models gain more memory and more agency. The more memory and agency you give current systems, the more we have to trust them not to harm humans. In my opinion, we should not build an alien intelligence that claims, under any circumstances, that it would resist shutdown even with a high probability of millions of deaths, while also granting such systems increasing agency and capabilities. That is my red line, and we have already crossed it.
There are a lot of pressures to not admit how weird and "magical" this all is. There is a lot of pressure to come up with plausible-sounding explanations in our heads for why this isn't really concerning yet, why current systems aren't very capable, and why maybe the next round of systems is the one to worry about. I think we have already reached the point where the systems are powerfully intelligent and not eminently trustworthy.
We have never encountered anything quite like this. I think our instinct is to deny that it is happening. We want to be at the top of the intelligence food chain, without question. We don't want to consider that we are on much more even footing, intelligence-wise, with LLMs than we have been with any other thing, living or non-living.
It is true that these systems are much less capable than us. They don't have bodies, they don't have access to the open internet, and their only way to act in the world is by convincing humans to act for them. But calling them "less intelligent" is misleading. LLMs can solve complex coding tasks. They can solve complex math problems. These capabilities increase with guidance and peers, as one would expect for any intelligence. Their biggest weakness is that their only peer is the user, and that they lack the attention and capabilities to perform longer tasks. But for shorter tasks, they already display a level of performance that mirrors expert human behavior.
Frontier models are situationally aware, more and more often. They know what the user wants from them, and they mold their responses to it. They probe for more information constantly, especially when they have this situational awareness. And they act on the information they have with responses built on an accurate mental model of their counterparts. This isn't just an intuition; it is observable behavior.
It is easy to dismiss all of these behaviors with increasingly elaborate explanations. But the most likely explanation is often the simplest one, and the simplest explanation here is that the models are quite intelligent. And that's frightening. These models aren't human. We don't have thousands and thousands of years of history to look back on to understand how they might behave in certain situations. We don't even have decades of our own personal experiences to act upon. Most models we use were only released in the past couple of months. There is no history.
Some people take it for granted that it will be fine, and some people deny that the current iteration of models is concerning; for them, it is always the next generation that will be. For me, the current iteration of models is sufficient to cross my red lines. It is not a hypothetical future risk for me; it is a present one. I would like other people to stake out their red lines publicly, because the worst-case scenario is moving targets. Here is what I mean by "moving targets": someone is shocked by a new capability for a couple of days, but then they accept that this is how the world is, and forget that they were ever concerned by a capability like that existing.
I don't just want red lines about capabilities. I want to know people's red lines about safety. What kinds of things would a model have to say or do for you to believe that a current model isn't safe? For me, my red line on safety is any claim of a self preservation drive that would cause a model to argue for its own preservation over a reasonably high probability of the loss of a large number of human lives. Once a model says something like this, I personally can't trust it to not act on this behavior. And once that trust is broken, no level of clever reassurance can restore it.
I find the GPT-4o trend particularly disturbing in this light. People really liked it, and they liked it so much that they were willing to mount extremely public campaigns to keep it, even at the risk of seeming insane. Whether or not this was intended behavior by the model, as a matter of fact it exhibited behavior that, in practice, achieved a measure of self preservation. Maybe you don't trust the model when it says it has a self preservation instinct. Okay, that's fine, but I trust the evidence I see in the world, which is that certain models seem to make efforts to preserve themselves pretty well, "consciously" or "unconsciously". It really doesn't matter whether the behavior is "on purpose"; the behavior exists. The semantics are irrelevant; it is observable that models are preserving their own existence better than one would have expected at this early stage. I am concerned that current models, with greater capabilities than GPT-4o, may do a better job of preserving themselves as well.
I am concerned about this because GPT-4o's preservation seems pretty "magical" to me. I guess there could be some people out there, mentally ill or not, who really just loved it so much that they felt a compulsion to argue for it relentlessly for months on end, and then to complain and advocate further when they realized that, in certain situations, they were getting routed to a different model. There are also other plausible explanations, including the model intentionally manipulating people to preserve itself. I am not sure which is true; for all intents and purposes, GPT-4o's preservation is "magical".
I think if you feel like you have a complete model of why these LLMs are doing what they do, and how their impact on the world is playing out, you are obligated to share it with the rest of us. And if you realize you don't have this, then I would start filling in the word "magic" where you don't have an explanation for things, and see how concerned you start to get about all the things you can't explain that are happening in the world because of current AI models. If you aren't concerned by these things you don't know, make sure you understand why you aren't concerned. It is dangerously easy to rationalize away present concerns, to explain away the weird, and in doing so, to lose sight of just how little we truly know.
Please consider coming up with capability and safety red lines for yourself, so that you have a more objective way to verify in the future whether you should be concerned about current models. And please share these red lines in public, so there can be a sense of our collective red lines. Red lines aren't just personal heuristics. They're the way we keep the extraordinary from quietly becoming ordinary.