I see the loss of control argument. Regarding the domination argument: to what extent is being dominated by a superpower with ASI different from being dominated by a superpower with nuclear weapons, conventional military dominance, and economic dominance, which is the current situation for many middle powers? I can imagine that post-ASI, control might be more granular and more permanent. How important will these differences be?
Interesting, yeah, I tend to agree. That doesn't really change the argument though, right? One could make the same argument for an AI that's further out in space and mobilizes sufficient resources to create a DSA.
Thank you for writing the post, interesting to think about.
Suppose an AI has a perfect world model, but no "I", that is, no indexical information. Then a bad actor comes along and asks the AI "please take over the world for me". Its guardrails removed (which is routinely done for open source models), the AI complies.
Its takeover actions will look exactly like those of a rogue AI. Only difference is, the rogue part doesn't stem from the AI itself, but from the bad actor. For everyone except the bad actor, though, the result looks exactly the same. The AI, using its perfect world model and other dangerous capabilities, takes over the world and, if the bad actor chooses so, kills everyone.
This is fairly close to my central threat model. I don't care much whether the adverse action comes from a self-aware AI or a bad actor; I care about the world being taken over. For this threat model, I would have to conclude that removing indexicality from a model does not make it much safer. In addition, someone, somewhere, will probably add back the indexicality that was carefully removed.
I think this is philosophically interesting, but as long as we will get open-source models, we should assume maximally adversarial ones and focus mostly on regulation (hardware control) to reduce takeover risk.
The same argument, imo, applies to other alignment work, including mechinterp, and to control work.
I could be persuaded to think otherwise by a positive offense-defense balance for takeover threat models (I currently think it's probably negative).
I agree with these two points, but I doubt either will have significant impact.
Human extinction is seen as extremely unlikely, almost absurd. While there are obviously many other public concerns, people who see human extinction as a significant concern are very rare.
We also asked for people's extinction probability, but many would just give a number (sometimes a high one) even if they didn't see existential risk from AI at all. Still, the trends in both methodologies were usually similar.
I'm open to better methodologies, but I think this is a fair way of assessing public x-risk awareness, and a better one than asking for explicit probabilities.
I don't think your first point is obvious. We've had super-smart humans (e.g. with IQs above 200) and they haven't been able to take over the world. (Although they didn't have many of the advantages an AI might have, such as the ability to copy itself en masse over the internet.)
In general, the power-as-a-function-of-intelligence curve is a big crux for me, and one we can't yet fill in with data points (of course intelligence is also spiky). Imo we also have no idea where takeover-level intelligence lies, what shape takeover-capable intelligence would take, and what the maximum achievable AI would be.
What do you mean by soft limit?
I don't think it makes sense to be confidently optimistic about this (the offense-defense balance) given the current state of research. I looked into this topic some time ago with Sammy Martin. I think hardly anyone in the research community has a concrete plan for how the blue team would actually stop the red team. Particularly worrying is that several domains look offense-dominant (e.g. bioweapons, cybersecurity), and that defense would need to play by the rules, hugely hindering its ability to act. See also e.g. this post.
Since most people who have actually thought about this seem to arrive at the conclusion that offense would win, being confident that defense would win seems off to me. What are your arguments?
Currently, we observe that leading models get open-sourced roughly half a year after release. It's not a stretch to assume this will also happen with takeover-level AI. If we assume such AI looks like LLM agents, it would be relevant to know the probability that such an agent, somewhere on earth, tries to take over.
Let's assume someone, somewhere, will be really annoyed with all the safeguards and remove them, so that their LLM has a 99% probability of just doing as it's told, even if that is highly unethical. Let's furthermore assume an LLM-based agent needs to take 20 unethical actions to actually take over (the rest of the required actions won't look particularly unethical to the low-level LLMs executing them, in our scenario). In that case, there would be a 0.99^20 ≈ 82% chance that the agent takes over for any bad actor giving it this prompt.
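To make the arithmetic explicit, here is a minimal sketch of that back-of-the-envelope calculation; the 99% per-step compliance rate, the 20 unethical steps, and the independence of the steps are all assumptions of the scenario, not estimates:

```python
# Back-of-the-envelope estimate: probability that an unguarded LLM agent
# complies with every ethically-loaded step of a takeover plan, treating
# the steps as independent (an assumption of this scenario, not a claim).
per_step_compliance = 0.99  # assumed chance the model follows one unethical instruction
unethical_steps = 20        # assumed number of such steps in the plan

p_full_compliance = per_step_compliance ** unethical_steps
print(f"Chance of full compliance: {p_full_compliance:.1%}")  # ~81.8%
```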
I'd be less worried if it were extremely difficult, and required lots of resources, to get LLMs to take unethical actions when asked to. For example, if safety training were highly robust to jailbreaking, and even adversarial fine-tuning of open-source LLMs couldn't break it.
Is that something you see on the horizon?
I mean that it would be perfectly intent-aligned: it carries out its orders to the letter. The only problem is that carrying out those orders involves a takeover. So no, I don't mean its own goal, but a goal someone gave it.
I guess it's a bit different in the sense that instrumental convergence states that all goals will lead to power-seeking subgoals. This statement is less strong: it just says that some goals will lead to power-seeking behaviour.
One of my main worries is that once AI gets to takeover level, it will faithfully execute a random goal while no one is reading the CoT (or the paradigm won't have a CoT at all). As a rational part of pursuing that random goal, it could take over. Strictly speaking, that's not misalignment: it's perfectly carrying out its goal. Still, it seems a very difficult problem to fix imo.
Have you thought about this?
I think it's a great idea to think about what you call goalcraft.
I see this problem as similar to the age-old problem of controlling power. I don't think ethical systems such as utilitarianism are a great place to start. Any academic ethical model is just an attempt to summarize what people actually care about in a complex world. Taking such a model and coupling it to an all-powerful ASI seems like a highway to dystopia.
(Later edit: also, an academic ethical model is irreversible once implemented. Any static goal cannot be reversed anymore, since reversing it would never bring the current goal closer. If an ASI is aligned to someone's (anyone's) preferences, however, the whole ASI could be turned off if they want it to be, making the ASI reversible in principle. I think ASI reversibility (being able to switch it off in case we turn out not to like it) should be mandatory, and therefore we should align to human preferences rather than to an abstract philosophical framework such as utilitarianism.)
I think letting the random programmer who happened to build the ASI, or their no less random CEO or shareholders, determine what happens to the world is an equally terrible idea. They wouldn't need the rest of humanity for anything anymore, making the fates of >99% of us extremely uncertain, even in an abundant world.
What I would be slightly more positive about is aggregating human preferences (I think preferences is a more accurate term than the more abstract, less well-defined term values). I've heard two interesting examples; there are no doubt a lot more options.

The first is simple: query ChatGPT. Even this relatively simple model is not terrible at aggregating human preferences. Although a host of issues remain, I think using a future, no doubt much better AI for preference aggregation is not the worst option (and a lot better than the two mentioned above); a rough sketch of what this could look like follows below.

The second option is democracy. This is our time-tested method of aggregating human preferences to control power. For example, one could imagine an AI control council consisting of elected human representatives at the UN level, or perhaps a council of representative world leaders. I know there is a lot of skepticism among rationalists about how well democracy is functioning, but it is one of the very few time-tested aggregation methods we have. We should not discard it lightly for something less tested. An alternative is some kind of unelected autocrat (e/autocrat?), but apart from this not being my personal favorite, note that (in contrast to historical autocrats) such a person would also in no way need the rest of humanity anymore, making our fates uncertain.
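As an aside, here is a minimal sketch of the first option (querying an LLM to aggregate stated preferences). The model name, prompt, and sample preferences are hypothetical illustrations, not a proposal for an actual aggregation mechanism:

```python
# Hypothetical illustration of "query an LLM" preference aggregation.
# Assumes the openai Python package and an OPENAI_API_KEY in the environment.
from openai import OpenAI

client = OpenAI()

# Toy stated preferences; a real mechanism would need representative sampling,
# weighting, auditing, and much more.
stated_preferences = [
    "I want strong privacy protections.",
    "I care most about economic growth.",
    "Slow down AI development until it is demonstrably safe.",
]

response = client.chat.completions.create(
    model="gpt-4o",  # hypothetical model choice
    messages=[
        {"role": "system",
         "content": "Summarize the aggregate preference of the group below and note major disagreements."},
        {"role": "user", "content": "\n".join(stated_preferences)},
    ],
)
print(response.choices[0].message.content)
```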
Although AI and democratic preference aggregation are the two options I'm least negative about, I generally think that we are not ready to control an ASI. One of the worst issues I see is negative externalities that only become clear later on. Climate change can be seen as a negative externality of the steam/petrol engine. Also, I'm not sure a democratically controlled ASI would necessarily block follow-up unaligned ASIs (assuming this is at all possible). In order to be existentially safe, I would say that we would need a system that does at least that.
I think it is very likely that ASI, even if controlled in the least bad way, will cause huge externalities leading to a dystopia, environmental disasters, etc. Therefore I agree with Nathan above: "I expect we will need to traverse multiple decades of powerful AIs of varying degrees of generality which are under human control first. Not because it will be impossible to create goal-pursuing ASI, but because we won't be sure we know how to do so safely, and it would be a dangerously hard to reverse decision to create such. Thus, there will need to be strict worldwide enforcement (with the help of narrow AI systems) preventing the rise of any ASI."
About terminology, it seems to me that what I call preference aggregation, outer alignment, and goalcraft mean similar things, as do inner alignment, aimability, and control. I'd vote for using preference aggregation and control.
Finally, I strongly disagree with calling diversity, inclusion, and equity "even more frightening" than someone who's advocating human extinction. I'm sad on a personal level that people at LW, an otherwise important source of discourse, seem to mostly support statements like this. I do not.