Yes, Ryan is correct. Our claim is that even fully-aligned personal AI representatives won't necessarily be able to solve important collective action problems in our favor. However, I'm not certain about this. The empirical crux for me is: Do collective action problems get easier to solve as everyone gets smarter together, or harder?
As a concrete example, consider a bunch of local polities in a literal arms race. If each had its own AGI diplomats, would they be able to stop the arms race? Or would the more sophisticated diplomats end up participating in precommitment races or other exotic strategies that might still prevent a negotiated settlement? Perhaps the less sophisticated diplomats would fear that a complicated power-sharing agreement would eventually lead to their disempowerment anyway, and refuse to compromise?
As a less concrete example, our future situation might be analogous to a population of monkeys with uneven access to human representatives who earnestly advocate on their behalf. The monkeys live in a giant, valuable forest next to a city where all the important economic activity and decision-making happens between humans. Some of the human population (or some organizations, or governments) end up not being monkey-aligned, focusing instead on their own growth and security. The humans advocating on behalf of monkeys can see this happening, but because they can't participate in wealth generation as effectively as independent humans, they eventually become a small and relatively powerless constituency. The government and various private companies regularly bid or tax enormous amounts of money for forest land, and even the monkeys with index funds are eventually forced to sell, and then go broke paying rent.
I admit that there are many moving parts in this scenario, but it's the closest simple analogy I've found so far to what I'm worried about. I'm happy for people to point out ways this analogy won't match reality.
I disagree - I think Ryan raised an obvious objection that we didn't directly address in the paper, and I'd like to encourage medium-effort engagement from people as paged-in as Ryan. The discussion it spawned was valuable to me.
Thanks for this. Discussion of things like "one time shifts in power between humans via mechanisms like states becoming more powerful" and personal AI representatives is exactly the sort of thing I'd like to hear more about. I'm happy to have finally found someone who has something substantial to say about this transition!
But over the last 2 years I asked a lot of people at the major labs for any kind of detail about a positive post-AGI future, and almost no one had put anywhere close to as much thought into it as you have; no one mentioned the things above. Most people clearly hadn't put much thought into it at all. If anyone at the labs had much more of a plan than "we'll solve alignment while avoiding an arms race", I failed to even hear of its existence despite many conversations, including with founders.
The closest thing to a plan was Sam Bowman's checklist:
https://sleepinyourhat.github.io/checklist/
which is exactly the sort of thing I was hoping for, except it's almost silent on issues of power, the state, and the role of post-AGI humans.
If you have any more related reading for the main "things might go OK" plan in your eyes, I'm all ears.
Good point. The reason AI risk is distinct is simply that it removes the need for those bureaucracies and corporations to keep some humans happy and healthy enough to actually run them. That need doesn't exactly put limits on how much they can disempower humans, but it does tend to give the humans involved at least some bargaining power.
Thanks for the detailed objection and the pointers. I agree there's a chance that solving alignment with designers' intentions might be sufficient. I think "if the AI were really aligned with one agent, it'd figure out a way to help them avoid multipolar traps" is a good objection.
My reply is that I'm worried that avoiding races-to-the-bottom will continue to be hard, especially since competition operates on so many levels. I think the main question is: what's the tax for coordinating to avoid a multipolar trap? If it's cheap, we might be fine; if it's expensive, we might walk into a trap with eyes wide open.
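As a toy sketch of what I mean (my own illustration with made-up numbers, not anything from the paper): each actor compares the payoff of a binding agreement, net of the coordination tax, against the expected payoff of just racing.

```python
# Toy model, not from the paper: normalize each actor's payoff to 1 if
# coordination were free. Coordinating costs a tax t (verification,
# monitoring, forgone capabilities); racing yields an expected payoff r < 1
# because of the mutual damage a race causes. A binding agreement is
# individually rational only while 1 - t >= r.

def agreement_is_rational(coordination_tax: float, race_payoff: float) -> bool:
    """Each actor signs a binding pact iff coordinating beats racing."""
    return 1.0 - coordination_tax >= race_payoff

for t in (0.05, 0.2, 0.5):
    print(f"tax={t:.2f}: sign the pact? {agreement_is_rational(t, race_payoff=0.7)}")
# tax=0.05 -> True, tax=0.20 -> True, tax=0.50 -> False: a big enough
# coordination tax means walking into the trap with eyes wide open.
```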
As for human power grabs, maybe we should have included those in our descriptions. But the slower things change, the less of a distinction there is between "selfishly grab power" and "focus on growth so you don't get outcompeted". E.g. is starting a company or a political party a power grab?
As for reading the paper in detail, it's largely just making the case that a sustained period of technological unemployment, without breakthroughs in alignment and cooperation, would tend to make our civilization serve humans' interests more and more poorly over time in a way that'd be hard to resist. I think arguing that things are likely to move faster would be a good objection to the plausibility of this scenario. But we still think it's an important point that the misalignment of our civilization is possibly a second alignment problem that we'll have to solve.
ETA: To clarify what I mean by "need to align our civilization": concretely, I'm imagining the government deploying a slightly superhuman AGI internally. Some say its constitution should care about world peace, others say it should prioritize domestic interests; there's a struggle, and it gets a muddled mix of directives, like LLMs have today. It never manages to sort out global cooperation, and meanwhile various internal factions compete to edit the AGI's constitution. The result is a less-than-enlightened focus on the growth of some particular power structure, with the rest of us permanently marginalized.
Good point about our summary of Christiano, thanks, will fix. I agree with your summary.
"We could broaden our moral circle to recognize that AIs—particularly agentic and sophisticated ones—should be seen as people too. ... From this perspective, gradually sharing control over the future with AIs might not be as undesirable as it initially seems."
Is what you're proposing just complete, advance capitulation to whoever takes over? If so, can I have all your stuff? If you change your values to prioritize me in your moral circle, it might not be as undesirable as it initially seems.
I agree that if we change ourselves to value the welfare of whoever controls the future, then their takeover will be desirable by definition. It's certainly a recipe for happiness - but then why not just modify your values to be happy with anything at all?
"I expect my view to get more popular over time."
I agree, except I think it mostly won't be humans holding this view when it's popular. Usually whoever takes over is glad they did, and includes themselves in their own moral circle. The question from my point of view is: will they include us in their moral circle? It's not obvious to me that they will, especially if we ourselves don't seem to care.
This reminds me of Stewie from Succession: "I'm spiritually and emotionally and ethically and morally behind whoever wins."
As the person who wrote most of the text of that paragraph (distilling discussions with the other authors), maybe I can elaborate:
I agree that in some narrow settings, such as pure cyberattacks on systems that are totally controlled and time-reversible, something like your proposal would probably work. The problem is that this is a constrained attack surface, especially compared to one that involves gaining trust, setting precedents, throwing out red herrings, or weakening defences for later. That's why we mention human decision-makers - I imagine that the richest part of the attack surface for most LLM deployments will be human expectations.
For example, models could build a precedent of spinning up other copies of themselves to delegate to, versus, say, a more transparent habit of writing scripts and offering to check them with a trusted model. Or they could propose a command-and-control system that makes later auditing or monitoring easier or harder. These sorts of precedent-setting might be the actual pivotal actions in a takeover, but would look relatively innocent at the time.
So after trying to think through what kinds of deployment could safely incorporate sometimes-rogue agents trying to demonstrate sabotage, we decided that it would have to be a pretty narrow deployment context to avoid the possibility of undetected attacks. Now I wish that paragraph was clearer that this might still be feasible for some narrow classes of decisions.
Good questions. For all of them, I'm afraid I have to say "it depends on what mitigations we deploy them under." If there's a crisis at one of the labs, I expect it to look like: "In order to be confident that even a misaligned model can't sabotage us, we need to deploy it with expensive mitigations, unacceptable levels of oversight, or mitigations that depend on how users incorporate AI output into their decision-making processes, which we can't measure or control. Our competitors are going to deploy something similar nonetheless - what do we do?"
I.e. I think there's a lot of potential for labs to come to incorrect conclusions, or to be unable to reach confident conclusions, about the potential for sabotage. This issue can mostly be addressed by expensive-enough mitigations, but that introduces a safety tax that might be too expensive to pay.
That's one reason why these kinds of evals are valuable - they can, in principle, reduce the safety tax by helping us find the least-expensive mitigations. They also incentivize red-teaming, which can produce scary demos to help coordinate regulation. They also force labs to think through the implications of how models are deployed and how much they're trusted.
Relatedly, one problem with caring about whether or not "actually the humans were misled" is that that's a matter of degree, and I think in most situations, human decision-makers are going to be misled about at least some important aspects of whatever situation they're in, even after extensive consultation with an LLM. We can try to refine this criterion to "misled about something that would make them make the wrong decision" but then we're basically just asking the model to convince the human to make the right decision according to the LLM. In that case, we've lost a lot of the opportunities for corrigibility that explanations would provide.
I think there's probably still a useful middle ground, but I think practical uses of LLMs will look pretty paternalistic for most users most of the time, in a way that will be hard to distinguish from "is the user misled".
Great question. I think treacherous turn risk is still under-funded in absolute terms. And gradual disempowerment is much less shovel-ready as a discipline.
I think there are two reasons why this question maybe isn't so important to answer:
1) The kinds of skills required might be somewhat disjoint.
2) Gradual disempowerment is perhaps a subset or extension of the alignment problem. As Ryan Greenblatt and others point out: at some point, agents aligned to one person or organization will also naturally start working on this problem at the object level for their principals.