I don't think it makes sense to be confidently optimistic about this (the offense-defense balance) given the current state of research. I looked into this topic some time ago with Sammy Martin. As far as I can tell, hardly anyone in the research community has a concrete plan for how the blue team would actually stop the red team. Particularly worrying is that in several domains offense seems to have the advantage (e.g. bioweapons, cybersecurity), and that defense would need to play by the rules, hugely hindering its ability to act. See also, e.g., this post.
Since most people who have actually thought about this seem to arrive at the conclusion that offense would win, being confident that defense would win seems off to me. What are your arguments?
Currently, we observe that leading models get open-sourced roughly half a year after release. It's not a stretch to assume this will also happen to takeover-level AI. If we assume such AI will look like LLM agents, it becomes relevant to know the probability that such an agent, somewhere on Earth, would try to take over.
Let's assume someone, somewhere, will be really annoyed with all the safeguards and remove them, so that their LLM has a 99% probability of simply doing as it's told, even when that is highly unethical. Let's furthermore assume an LLM-based agent would need to take 20 unethical actions to actually take over (the rest of the required actions won't look particularly unethical to the low-level LLMs executing them, in our scenario). In that case, there would be a 0.99^20 ≈ 82% chance that the LLM-based agent takes over, for any bad actor giving it this prompt.
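To make the arithmetic explicit, here's a minimal sketch of that back-of-the-envelope calculation (the 99% per-step compliance rate and the 20 required unethical actions are just the illustrative assumptions stated above, not empirical figures):

```python
# Back-of-the-envelope sketch, using only the assumptions stated above (illustrative).
compliance_per_step = 0.99   # assumed chance the safeguard-stripped LLM follows one unethical instruction
unethical_steps = 20         # assumed number of unethical actions a takeover requires

p_takeover = compliance_per_step ** unethical_steps
print(f"P(all {unethical_steps} unethical steps carried out) ≈ {p_takeover:.0%}")  # ≈ 82%
```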
I'd be less worried if it were extremely difficult, and required lots of resources, to get LLMs to take unethical actions when asked to. For example, if safety against jailbreaking were highly robust, and even adversarial fine-tuning of open-source LLMs couldn't break it.
Is that something you see on the horizon?
I mean it would be perfectly intent-aligned: it carries out its orders to the letter. The only problem is that carrying out its orders involves a takeover. So no, I don't mean its own goal, but a goal someone gave it.
I guess it's a bit different in the sense that instrumental convergence states that all goals will lead to power-seeking subgoals. This statement is less strong: it just says that some goals will lead to power-seeking behaviour.
One of my main worries is that once AI gets to takeover level, it will faithfully execute a random goal while no one is reading the CoT (or the paradigm won't have a CoT at all). As a rational part of pursuing that random goal, it could take over. Strictly speaking that's not misalignment: it's perfectly carrying out its goal. Still seems a very difficult problem to fix, imo.
Have you thought about this?
Oh yes, lots of things!
As far as I understand, longtermism originated mostly with Yudkowsky. It was then codified by people like Bostrom, Ord, and MacAskill, the latter two incidentally also being the founders of EA. To the best of my understanding, Yud later distanced himself from longtermism in favor of AInotkilleveryoneism, a move I support. Unfortunately, the others haven't (yet).
I agree that longtermism combines a bunch of ideas, and I agree with quite a few of them. I guess my reply above came across as if I disagreed with all of them, but I don't. Specifically, I agree with:
So that's all textbook longtermism, I'd say, which I fully agree with. I therefore also disagree with most longtermism criticism by Torres and others.
But I don't agree with symmetric population ethics, and I think AI morality should be decided democratically. Also, I'm worried about human extinction, which these two things logically lead to, and I'm critical of longtermists for not distancing themselves from this.
Interesting point about democracy! But I don't think it holds. Sure, AIs could do that. But they could also overwrite the ASCII file containing their constituency or the values they're supposed to follow.
But they don't, because why would they? It's their highest goal to satisfy these values! (If technical alignment works, of course.)
In the same way, it will be a democracy-aligned ASI's highest goal to make sure democracy is respected, and it shouldn't be motivated to Sybil-attack it.
Thanks for engaging!
Could you tell me more about the Mechanize team? I don't think I've heard about them yet.
As a moral relativist, I don't believe anything is inherently morally relevant. I just think things get made morally relevant by those in power (hard power or cultural power). This is a descriptive statement, not a normative one, and I think it's fairly mainstream in academia (although of course moral realists, including longtermists, would strongly disagree).
This of course extends to the issue of whether conscious AIs are morally relevant. Imo, this will be decided by those in power, initially (a small subset of) humans, eventually maybe AIs (who will, I imagine, vote in favour).
I'm not the only one holding this opinion. Recently, this appeared in a New York Times op-ed: "Some worry that if A.I. becomes conscious, it will deserve our moral consideration — that it will have rights, that we will no longer be able to use it however we like, that we might need to guard against enslaving it. Yet as far as I can tell, there is no direct implication from the claim that a creature is conscious to the conclusion that it deserves our moral consideration. Or if there is one, a vast majority of Americans, at least, seem unaware of it. Only a small percentage of Americans are vegetarians." (It would be funny if this were written by an AI, as the dash seems to indicate.)
Personally, I don't consider it my crusade to convince all these people that they're wrong, and that they should in fact be vegan and accept that conscious AIs have moral status. I feel more like a facilitator of the debate. That's one reason I'm not an EA.
Thanks for engaging. I agree with quite a bit of what you're saying, although I do think that everyone's perspective is, fundamentally, equally valid. In practical democracies, though, there are many layers between the raw public vote and a policy outcome. First, we mostly have representative rather than direct democracy; then we have governments that have to engage with parliaments but also listen, to different extents, to scientists, opinion makers, and lobbyists. Everyone's perspective is valid, and on some questions (e.g. ethical ones) it should imo be leading. However, in many practical policy decisions it makes sense to also spend time listening to those who have thought longer about the issues, and this mostly happens. Completely discarding people's perspectives is rude, bad, and likely leads to uprisings, I think.
I'd like consensus too, but I'm afraid it leads to overly indecisive governments. It mostly works in small groups, I guess.
I agree with all your points of nuance.
I'm still having trouble parsing longtermists' thinking on this issue. MacAskill does explicitly defend these two assumptions. Surely he and others understand where this leads?
I've spoken to many EA and rationalist longtermists, and while many were pragmatic (or had simply never thought about this), some actually bit the bullet and admitted they effectively supported human extinction.
If people don't support human extinction, why do they not distance themselves from this outcome? I mean, it would be easy: simply say, as what seems to me a rather low bar: yes, we want to build many happy conscious AIs, but we promise that, if it's up to us, we'll leave Earth alone.
I don't quite understand why longtermists are not saying this.
It would be nice if those disagreeing said why they actually disagree.
I think it's a great idea to think about what you call goalcraft.
I see this problem as similar to the age-old problem of controlling power. I don't think ethical systems such as utilitarianism are a great place to start. Any academic ethical model is just an attempt to summarize what people actually care about in a complex world. Taking such a model and coupling it to an all-powerful ASI seems like a highway to dystopia.
(Later edit: also, an academic ethical model is irreversible once implemented. Any static goal cannot be reversed anymore, since reversal would never bring the current goal closer. If an ASI is aligned to someone's (anyone's) preferences, however, the whole ASI could be turned off if they want it to be, making the ASI reversible in principle. I think ASI reversibility (being able to switch it off in case we turn out not to like it) should be mandatory, and therefore we should align to human preferences rather than to an abstract philosophical framework such as utilitarianism.)
I think letting the random programmer who happened to build the ASI, or their no less random CEO or shareholders, determine what happens to the world is an equally terrible idea. They wouldn't need the rest of humanity for anything anymore, making the fates of >99% of us extremely uncertain, even in an abundant world.
What I would be slightly more positive about is aggregating human preferences (I think "preferences" is a more accurate term than the more abstract, less well-defined "values"). I've heard two interesting examples; there are no doubt many more options.

The first is simple: query ChatGPT. Even this relatively simple model is not terrible at aggregating human preferences. Although a host of issues remain, I think using a future, no doubt much better, AI for preference aggregation is not the worst option (and a lot better than the two mentioned above); a toy sketch of what this could look like follows below.

The second option is democracy. This is our time-tested method of aggregating human preferences to control power. For example, one could imagine an AI control council consisting of elected human representatives at the UN level, or perhaps a council of representative world leaders. I know there is a lot of skepticism among rationalists about how well democracy is functioning, but it is one of the very few time-tested aggregation methods we have, and we should not discard it lightly for something less tested. An alternative is some kind of unelected autocrat (e/autocrat?), but apart from this not being my personal favorite, note that (in contrast to historical autocrats) such a person would also in no way need the rest of humanity anymore, making our fates uncertain.
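As a toy illustration of the first option (and nothing more than that), here's a minimal sketch of what querying an LLM as a preference aggregator could look like; the model name, the prompt, and the idea of feeding in free-text preference statements are all my own illustrative assumptions, not a worked-out proposal:

```python
# Toy sketch: using an LLM as a preference aggregator (all details are illustrative assumptions).
from openai import OpenAI

client = OpenAI()  # assumes an API key is configured in the environment

# Hypothetical free-text preference statements from different stakeholders.
stated_preferences = [
    "Prioritise preventing catastrophic risks over rapid capability gains.",
    "Keep humans in the loop for any hard-to-reverse decision.",
    "Distribute the economic gains from AI broadly.",
]

prompt = (
    "Summarise the following preference statements into a single, balanced set of "
    "priorities, explicitly noting any conflicts:\n- " + "\n- ".join(stated_preferences)
)

response = client.chat.completions.create(
    model="gpt-4o",  # placeholder model name
    messages=[{"role": "user", "content": prompt}],
)
print(response.choices[0].message.content)
```

Obviously this glosses over the hard parts (whose statements get included, how conflicts are weighted, how the output is audited), which is exactly where the remaining issues mentioned above live.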
Although AI-based and democratic preference aggregation are the two options I'm least negative about, I generally think that we are not ready to control an ASI. One of the worst issues I see is negative externalities that only become clear later on; climate change, for example, can be seen as a negative externality of the steam/petrol engine. Also, I'm not sure a democratically controlled ASI would necessarily block follow-up unaligned ASIs (assuming this is at all possible). To be existentially safe, I would say we need a system that does at least that.
I think it is very likely that ASI, even if controlled in the least bad way, will cause huge externalities leading to a dystopia, environmental disasters, etc. Therefore I agree with Nathan above: "I expect we will need to traverse multiple decades of powerful AIs of varying degrees of generality which are under human control first. Not because it will be impossible to create goal-pursuing ASI, but because we won't be sure we know how to do so safely, and it would be a dangerously hard to reverse decision to create such. Thus, there will need to be strict worldwide enforcement (with the help of narrow AI systems) preventing the rise of any ASI."
About terminology, it seems to me that what I call preference aggregation, outer alignment, and goalcraft mean similar things, as do inner alignment, aimability, and control. I'd vote for using preference aggregation and control.
Finally, I strongly disagree with calling diversity, inclusion, and equity "even more frightening" than someone who's advocating human extinction. I'm sad on a personal level that people at LW, an otherwise important source of discourse, seem to mostly support statements like this. I do not.