My response to the alignment / AI representatives proposals:
Even if AIs are "baseline aligned" to their creators, this doesn't automatically mean they are aligned with broader human flourishing or capable of compelling humans to coordinate against systemic risks. For an AI to effectively say, "You are messing up, please coordinate with other nations/groups, stop what you are doing" requires not just truthfulness but also immense persuasive power and, crucially, human receptiveness. Even if pausing AI were the correct thing to do, Claude is not going to suggest this to Dario for obvious reasons. As we've seen even with entirely human systems (e.g., the Trump administration and tariffs), possessing information or even offering correct advice doesn't guarantee it will be heeded or lead to effective collective action.
[...] "Politicians...will remain aware...able to change what the system is if it has obviously bad consequences." The climate change analogy is pertinent here. We have extensive scientific consensus, an "oracle IPCC report", detailing dire consequences, yet coordinated global action remains insufficient to meet the scale of the challenge. Political systems can be slow, captured by short-term interests, or unable to enact unpopular measures even when long-term risks are "obviously bad." The paper [gradual disempowerment] argues AI could further entrench these issues by providing powerful tools for influencing public opinion or creating economic dependencies that make change harder.
Extract copy-pasted from a longer comment here.
I don't see how these help.
First, it seems to me that interoperability + advisors would be useless for helping people sociopolitically maneuver precisely up until the point at which the AI models are good enough to disempower most of them. Imagine some not-particularly-smart person with no AI expertise and not much mental slack for fighting some abstract battle about the future of humanity. A demographic of those people is then up against the legal teams of major corporations and the departments of major governments navigating the transition. The primary full-time jobs of the people working in the latter groups, at which they'd be very skilled, would be figuring out how to disempower the former demographics. In what scenario do the at-risk demographics stand any chance?
Well, if the AI models are good enough that the skills and attention of the humans deploying them don't matter. Which is to say: past the point at which the at-risk demographics are already disempowered.
Conversely, if the AI capabilities are not yet there, then different people can use AIs more or less effectively depending on how smart and skilled they are, and how many resources and how much spare attention for the fight they have. In which case the extant powers are massively advantaged and win in expectation.
Second, I'm skeptical about feasibility. This requires, basically, some centralized distribution system which (1) faithfully trains AI models to be ultimately loyal to their end advisees, and (2) fully subsidizes the compute costs of serving these models to all the economically useless people (... in the US? ... in the world?). How is that centralized system not subverted/seized by the extant powers? (E.g., the government insisting, in a way that sounds surface-level reasonable, on ultimate loyalty to its laws first, which it can then freely rewrite to gain complete control.)
Like, suppose this system is set up some time before AI models are good enough to make human workers obsolete. As per the first point, the whole timespan prior to the AI models becoming that good would involve the entrenched powers successfully scheming towards precisely the gradual disempowerment we're trying to prevent. How do we expect this UBI-like setup[1] to still be in place, uncorrupted, by the time AI models become good enough, given that it is the largest threat to the extant powers' ability to maintain/increase their power? This would be the thing everyone in power is trying to dismantle. (And, again, unless the AIs are already good enough to disempower the at-risk demographics, those demographics plus their AI representatives would be massively worse at fighting this battle than their opponents plus their AI representatives.)
Third: okay, let's suppose the system is somehow in place, and everyone in the world has a loyal AI representative advising them and advocating for their interests. What are these representatives supposed to do? Imagine some below-average-intelligence menial-labor worker hopelessly outclassed by robots. What moves is their AI representative supposed to make to preserve their power and agency? They don't really have any resources to bargain with; that's the ground truth of their situation.
Will the AIs organize them to march to the military bases/datacenters/politicians' and CEOs' homes and physically seize the crucial resources/take the key players hostage, or what?
There was a thought experiment going around on Twitter a few months back, which went:
Suppose people have to press one of two buttons, Blue and Red. If more than 50% of people press Blue, nobody dies. If more than 50% press Red, everyone who pressed Blue dies. Which button do you press?
Red is guaranteed to make you safe, and in theory, if everyone pressed Red, everyone would be safe. But if we imagine something isomorphic to this happening in reality, we should note that getting 100% of people to make the correct choice is incredibly hard. Some would be distracted, or confused, or they'd slip and fall and smack the wrong button accidentally.
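For concreteness, here's a minimal Monte Carlo sketch of the button game (my own illustration, not part of the original thought experiment; the population size `n`, the slip probability, and the everyone-intends-the-same-button setup are all assumptions):

```python
import random

def death_fraction(n: int, intend_blue: bool, slip: float, trials: int = 2000) -> float:
    """Average fraction of players who die per round of the game."""
    total = 0.0
    for _ in range(trials):
        # Each player presses their intended button, except that with
        # probability `slip` they are distracted/confused and press the other.
        presses_blue = [intend_blue != (random.random() < slip) for _ in range(n)]
        blue_votes = sum(presses_blue)
        if blue_votes * 2 > n:
            deaths = 0            # Blue majority: nobody dies.
        else:
            deaths = blue_votes   # Red majority: every Blue-presser dies.
        total += deaths / n
    return total / trials

for slip in (0.01, 0.05, 0.10):
    print(f"slip={slip:.2f}: "
          f"all intend Red -> deaths ~ {death_fraction(1001, False, slip):.3f}, "
          f"all intend Blue -> deaths ~ {death_fraction(1001, True, slip):.3f}")
```

Under "everyone plays Red", the casualty rate simply equals the slip rate: coordination doesn't protect anyone who errs, which is the fragility being gestured at, since a plan of this shape only protects the people who execute it correctly.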
To me, all variations on "let's give everyone in the world an AGI advisor to prevent disempowerment!" read as "let's all play Red!". The details don't quite match (a 50% survival/non-disempowerment rate is not at all guaranteed), but the vibe and the failure modes are the same.
Which is also supposed to be implemented during Trump's term?
One objection I have to your feasibility section is that you seem to lump "the powers that be" together as a single group.
This would be the thing everyone in power is trying to dismantle
The world is more multipolar than this, and so are the US legal and political systems. Trump and the Silicon Valley accelerationist crowd do hold a lot of power, but just a few years ago they were personae non gratae in many circles.
Even now, when they want to pass bills or change laws, they need to lobby for support from disparate groups both within and outside their party. In a sufficiently multipolar world where even just a few different groups have powerful models assisting their efforts, there will be some who want to change laws and rules in one way, others who want to change them in a different way, and others who don't want to change them at all. And some will be ambivalent.
I'm not saying the end result isn't corruption; I think parasitically middlemanning any redistribution is a basin of attraction for political spending and/or power. But there will be many different parties competing to corrupt it, or shore it up, according to their own beliefs and interests.
So I'd argue that making the world more multipolar, such that a more diverse array of parties has models, may in fact lead to greater stability and less corruption (or at least to more diverse coalitions when coalition-building occurs).
Implicit in my views is that the problem would be mostly resolved if people had aligned AI representatives which helped them wield their (current) power effectively.
Can you make the case for this a bit more? How are AI representatives going to help people prevent themselves from becoming disempowered / economically redundant? (Especially given that you explicitly state you are skeptical of "generally make humans+AI (rather than just AI) more competitive").
Mandatory interoperability for alignment and fine-tuning
Furthermore, I don't really see how fine-tuning access helps create AI representatives. Models are already trained to be helpful, and most people don't have very useful personal data that would make their AI work much better for them (that couldn't simply be put in the context of any model).
The hope here would be to get the reductions in concentration of power that come from open source
The concentration of power from closed source AI comes from (1) the AI companies' profits and (2) the AI companies having access to more advanced AI than the public. Open source solves (1), but fine-tuning access solves neither. (Obviously your "Deploying models more frequently" proposal does help with (2)).
Fine-tuning access could address (1) insofar as fine-tuned-model operators capture profit that would otherwise go to the main AI labs, provided access is broad enough to drive prices down.
Fine-tuning access allows the public to safely access models that might be too dangerous to open-source/open-weight.
Thanks a lot for this post! I appreciate you taking the time to engage, I think your recommendations are good, and I agree with most of what you say. Some comments below.
"the intelligence curse" or "gradual disempowerment"—concerns that most humans would end up disempowered (or even dying) because their labor is no longer valuable.
The intelligence curse and GD are not equivalent. In particular, I expect @Jan_Kulveit & co. would see GD as a broader bucket that also includes various subtle forms of cultural misalignment (which, tbc, I think also matter!), whereas IC is more specifically about things downstream of economic (and hard power, and political power) incentives. (And I would see e.g. @Tom Davidson's AI-enabled coup risk work as a subset of IC, representing the most sudden and dramatic way that IC incentives could play out.)
It's worth noting I doubt that these threats would result in huge casualty counts (due to e.g. starvation) or disempowerment of all humans (though substantial concentration of power among a smaller group of humans seems quite plausible).
[fn:]
That said, I do think that technical misalignment issues are pretty likely to disempower all humans and I think war, terrorism, or accidental release of homicidal bioweapons could kill many. That's why I focus on misalignment risks.
I think if you follow the arguments, disempowerment of all humans is plausible, and disempowerment of the vast majority even more so. I agree that technical misalignment is more likely to lead to high casualty counts if it happens (and I think the technical misalignment --> x-risk pathway is possible and incredibly urgent to make progress on).
I think there's also a difference between working on mitigating very clear sequences of steps that lead to catastrophe (e.g. X --> Y --> everyone drops dead), and working on maintaining the basic premises that keep things working (e.g., for the last 200 years, while things have been getting much better, the incentives of power and of humans have been remarkably correlated, and maybe we should try not to decorrelate them). The first is more obvious, but I think you should also be able to admit theories of change of the second type, at least enough that, for example, you would've decided to resist communism in the 1950s ("freedom good" is vague, and there wasn't yet consensus that market-based economies would provide better living standards in the long run, but it was still correct to bet against the communists if you cared about human welfare! basic liberalism is very powerful!).
Implicit in my views is that the problem would be mostly resolved if people had aligned AI representatives which helped them wield their (current) power effectively.
Yep, this is a big part of the future I'm excited to build towards.
- I'm skeptical of generally diffusing AI into the economy, working on systems for assisting humans, and generally uplifting human capabilities. This might help somewhat with societal awareness, but doesn't seem like a particularly leveraged intervention for this. Things like emulated minds and highly advanced BCIs might help with misalignment, but otherwise seem worse than AI representatives (which aren't backdoored and don't have secret loyalties/biases).
I think there are two basic factors that affect uplift chances:
(More fundamentally, there's also the question of how high you think human/AI complementarity at cognitive skills is—right now it's surprisingly high IMO)
I'm skeptical that local data is important.
I'm curious: what's your take on the basic Hayek point?
- I agree that AI-enabled contracts, AI-enabled coordination, and AIs speeding up key government processes would be good (to preserve some version of the rule of law such that hard power is less important). It seems tricky to advance this now.
I expect a track record of trying out some form of coordination at scale is really helpful for later getting it into government / into use by more "serious" actors. I think it's plausible that it's really hard to get governments to try any new coordination or governance mechanism before it's too late, but if you wanted to increase the odds, I think you should just very clearly be trying them out in practice.
- Understanding agency, civilizational social processes, and how you could do “civilizational alignment” seems relatively hard and single-single aligned AI advisors/representatives could study these areas as needed (coordinating research funding across many people as needed).
I agree these are hard, and they also seem like an area where it's unclear whether cracking R&D automation, to the point where we can hill-climb on ML performance metrics, gets you AI that does non-fake work on these questions. I really want very good AI representatives that are very carefully aligned to individual people if we're going to have AIs work on this.
The current race towards agentic AGI in particular is much more like 50% cultural/path-dependent than 5% cultural/path-dependent and 95% overdetermined. I think the decisions of the major labs are significantly influenced by particular beliefs about AGI & timelines; while these are likely (at least directionally) true beliefs, it's not at all clear to me that the industry would've been this "situationally aware" in alternative timelines.
This is probably cruxy here. I've viewed the race to replace humans with AI as much less path-dependent ever since I registered that the giant scale-up of compute happened, that the bitter lesson played out, and that the scale-up of pure self-supervised learning hit slowdowns; more generally, I subscribe to a view in which research is less path-dependent than people think.
More generally, I'm very skeptical of changing the ultimate paradigm for AGI into something that's safer but less competitive, and I believe your initial proposals relied on changing the AI paradigm to significantly complement humans using local knowledge, rather than straight-up automating them. But I view automation as unlocking >99% of the value, due to the long tail of cases that occur IRL, so this is a big amount of value to give up.
(More fundamentally, there's also the question of how high you think human/AI complementarity at cognitive skills is—right now it's surprisingly high IMO)
I also suspect this is a lesser crux: while I do think complementarities exist, I'd say the human+AI complement is basically always much less valuable than an AI straight-up replacing the human, if replacing the human actually worked.
The intelligence curse and GD are not equivalent.
Yep, though I think solutions are often overlapping. I should have clarified this.
I'm not going to respond to the rest of this (at least right now), sorry.
Do states and corporations also have their aligned representatives? Is the cognitive power of the representatives equal, roughly equal, or wildly unequal? If it is unequal, why are the resulting equilibria pro-human? (I.e., if I imagine individual humans like me represented by e.g. GPT-4 while the government runs tens of thousands of o4s, I would expect my representative to get convinced of whatever the government wants.)
Note: as a general policy I'm not planning on engaging with the comments here, this is just because I don't want to spend a bunch of time on this topic and this could easily eat a bunch of time. Sorry about that.
Aligning AI representatives / advisors to individual humans: If every human had a competitive and aligned AI representative which gave them advice on how to advance their interests as well as just directly pursuing their interests based on their direction (and this happened early before people were disempowered), this would resolve most of these concerns.
My personal prediction is that this would result in vast coordination problems that would likely rapidly lead to war and x-risk. You need a mechanism to produce a consensus or social compact, one that is at least as effective as our existing mechanisms, preferably more so. (While thinking about this challenge, please allow for the fact that 2–4% of humans are sociopathic, so an AI representative representing their viewpoint is likely to be significantly less prosocial.)
Possibly you were concealing some assumptions of pro-social/coordination behavior inside the phrase "aligned AI representative" — I read that as "aligned to them, and them only, to the exclusion of the rest of society — since they had it realigned that way", but possibly that's not how you meant it?
Some of my thoughts on avoiding the intelligence curse or gradual disempowerment and ensuring that humans stay relevant:
Human authentication and real-world activities do indeed seem very important. Deepfakes are a form of disempowerment and can destroy or destabilize states before employment even becomes a concern. AI-generated content can already be nearly, and sometimes strictly, indistinguishable from human-generated content: texts, pictures, videos. We are just at the beginning of the flood. Disinformation is exploding on the internet, and governments are falling into the hands of populist and nationalist parties one after the other. It's also a dramatic concern for justice. Should we go back to analog content?
Good to see your work on this! I'll avoid jumping in on weighing this relative to other problems as it's not the core of your post.
Rudolf and I are proponents of alignment to the user, which seems very similar to your second suggestion. Do you think there's a difference in the approach you outline vs the one we do? I'm considering doing a larger write-up on this approach, so your feedback would be helpful.
Nope, not claiming my proposal here is different from "alignment to the user". I probably should have made this clear. I wasn't trying to claim the interventions were novel approaches (though I think mandatory interoperability is), just that my prioritization/highlighting was novel.
My guess is for the prioritization work in particular, it would be useful to understand the threat model better.
Seems right, I just had some thoughts which seemed maybe useful so I decided to quickly write them up.
(Rudolf encouraged me to post as a top level post, I was initially going to post as a quick take.)
I was surprised to not see much consideration, either here or in the original GD and IC essays, of the brute force approach of "ban development of certain forms of AI," such as Anthony Aguirre proposes. Is that more (a) because it would be too difficult to enforce such a ban or (b) because those forms of AI are considered net positive despite the risk of human disempowerment?
Not commenting at length here, but from my perspective, in very short form:
- bans and pauses have a big problem to overcome: being "incentive compatible" (the issue is mostly not enforcement, since stuff can be enforced by hard power, but why actors would agree in the first place)
- in some sense this is a coordination problem
- my guess is that the most likely way to overcome the coordination problem in a good way involves some AI cognition helping humans coordinate -> this suggests differential technological development
- other viable forms of overcoming the coordination problem seem possible, but are often unappealing for various reasons I don't want to advocate atm
There have recently been various proposals for mitigations to "the intelligence curse" or "gradual disempowerment"—concerns that most humans would end up disempowered (or even dying) because their labor is no longer valuable. I'm currently skeptical that the typically highlighted prioritization and interventions are best, and I have some alternative proposals for relatively targeted/differential interventions which I think would be more leveraged (as in, the payoff is higher relative to the difficulty of achieving them).
It's worth noting I doubt that these threats would result in huge casualty counts (due to e.g. starvation) or disempowerment of all humans (though substantial concentration of power among a smaller group of humans seems quite plausible).[1] I decided to put a bit of time into writing up my thoughts out of general cooperativeness (e.g., I would want someone in a symmetric position to do the same).
(This was a timeboxed effort of ~1.5 hr, so apologies if it is somewhat poorly articulated or otherwise bad. Correspondingly, this post is substantially lower effort than my typical post.)
My top 3 preferred interventions focused on these concerns are:
Some things which help with the above:
Implicit in my views is that the problem would be mostly resolved if people had aligned AI representatives which helped them wield their (current) power effectively.
To be clear, something like these interventions has been highlighted in prior work, but I have a somewhat different emphasis and prioritization and I'm explicitly deprioritizing other interventions.
Deprioritized interventions and why:
(I'm not discussing interventions targeting misalignment risk, biorisk, or power grab risk, as these aren't very specific to this threat model.)
Again, note that I'm not particularly recommending these interventions on my views about the most important risks, just claiming these are the best interventions if you're worried about "intelligence curse" / "gradual disempowerment" risks.
That said, I do think that technical misalignment issues are pretty likely to disempower all humans and I think war, terrorism, or accidental release of homicidal bioweapons could kill many. That's why I focus on misalignment risks.