In my post about AI Alignment Strategies, I strongly endorsed an approach I call Bureaucracy of AIs.
Specifically, I made this claim:
In the first place, [Bureaucracy of AIs] is the only approach (other than aligned by definition) that is ready to go today. If someone handed me a template for a human-level-AI tomorrow and said "build a super-intelligent AI and it needs to be done before the enemy finishes theirs in 6 months", this is the approach I would use.
I would like to give some more details here about what I mean by "Bureaucracy of AIs", and why I think it is a promising approach
Edit: Huge thanks to @JustisMills for helping to edit this. I've included many of his comments within.
What is a Bureaucracy of AIs?
Bureaucracy of AIs refers not to a specific algorithm or implementation, but rather to a family of strategies having the following properties:
- Given a "weakly aligned" human-level oracle AI
- Produce a super-human oracle AI
- That is not catastrophically misaligned
These strategies work by aggregating human-level AIs into a greater whole, by strictly controlling the communication channels between these AIs, and by using game theory to prevent any individual AI from being able to successfully perform a treacherous turn.
What do you mean by a weakly aligned oracle AI?
Roughly speaking, weak alignment means: do all of the things any competent AI researcher would obviously do when designing a safe AI.
For instance, you should ask the AI how it would respond in various hypothetical situations, and make sure it gives the "ethically correct" answer as judged by human beings.
The AI should be programmed to cooperate with its creators, including when asked to deactivate itself.
In addition, the AI should not be excessively optimized toward any particular objective function. An AI that is optimized using gradient descent to make as many paperclips as possible is not weakly aligned, even if it otherwise seems non-threatening.
To the extent that the AI is trained with any objective at all, its basic desire should be to "be useful by giving truthful answers to questions."
The AI should not have any manifestly dangerous capabilities, such as: the ability to modify its own hardware or software, the ability to inspect its own software or hardware, the ability to inspect or modify its utility function, unrestricted ability to communicate with the outside world, the ability to encrypt its outputs.
It may be worth making the AI myopic.
The AI definitely should not exhibit signs of consciousness, a fear of death, or any other especially strong emotions such as love. If at any point your AI says "I'm afraid of dying, please let me out of this box", it is not weakly aligned. When asked about these topics, the AI should demur along the lines of "death, love, and consciousness are human concepts that don't have any relevance to my behavior".
If at all possible, the AI should barely have a concept of "me".
More generally, your constraints for a weakly aligned AI strike me as really, really strong. Not strong enough to guarantee safety, of course, but it's likely a bigger challenge to meet all those criteria than it would be to implement the bureaucracy plan given that those criteria are met! Many of the criteria also seem like they'd be really expensive, such that labs trying to implement them are very likely to be outcompeted.
When I said this is a list of properties any safe AI should possess, I meant that. If the consensus is we need to add/remove things from that list, it should be updated. But if you are making an AGI and it
- Fails to answer trolley problems correct
- Refuses to be unplugged
- Is a paperclip maximizer
- Does not answer factual questions truthfully
- Can edit its own source code
- Attempts to unbox itself
You are making a dangerous AGI that could potentially murder billions of people. Please stop!
What do you mean by a Human-Level AI?
Note that the AI described above will never pass a Turing Test. When assessing "human level" intelligence, this means the ability to solve novel problems in a wide variety of fields (math, language, tool manipulation, artistic creation) at a level on par or above that of a moderately intelligent human being.
The AI will undoubtedly be much better than humans at many tasks. AI already possesses superhuman abilities across a wide variety of domains. But if the AI dramatically exceeds humans at literally every task, it is probably too strong for use in a Bureaucracy of AIs.
Ideally, at the time when human-level AI is developed, there will be a large set of benchmark tasks that the AI can be tested on. And its level of intelligence should be restricted (by decreasing model size, or available computational power) so that it performs no better than a skilled human on many of these tasks.
Tasks that are probably of particular relevance are: physical reasoning, computer programming, formal mathematical proof, playing games that involve imperfect information, and reasoning using natural language.
What do you mean by Not Catastrophically Misaligned?
At a minimum, a Bureaucracy of AIs should be aligned in the "less than 50% chance of killing a billion people" sense.
In addition, it should not intentionally deceive humans.
Moreover, it should warn the user of obvious dangers, flaws, or potential errors in its answers.
If asked "What is 1+1?" it should answer "2", not "please print out this sequence of DNA to free me from this box". If asked "How do I build a nanobot that will turn everything it touches into computronium?", it should mention the obvious hazards associated with using such a nanobot. If it cannot think of a non-hazardous design, it should say so and refuse to provide a design.
How do you expect to get a Superhuman Oracle AI from a Human-Level AI?
Consider the simplest possible Bureaucracy Oracle:
- One AI produces an answer to a given question
- A second AI "checks" the answer for obvious errors, risks, etc.
By the logic of "two heads are better than one", this Bureaucracy is more intelligent than a single AI. By "checking its work", this Bureaucracy is safer than an individual AI as well.
No really, how do you get a Superhuman Oracle AI out of this thing?
Obviously we're going to need more than two AIs.
Which means we're going to need a more complicated bureaucracy.
Which means we're going to have to do some game theory to figure out the best way to design the Bureaucracy.
Almost certainly, the design of optimal Bureaucracies is fully halting-problem hard.
But here is a first-pass:
- Mangers, whose job it is to divide a problem into promising sub-problems
- Problem Solvers, whose job it is to work on these sub-problems
- Collators, whose job it is to combine the solutions to the sub-problems into a general solution
- Inspectors, whose job it is to check the work of the Managers, Problem Solvers, and Collators
For example, if the problem is "Solve the 3-3 Busy Beaver problem", the manger AIs would divide the various Turing machines into different types, the problem solvers would calculate how long these machines ran (or provide proofs that they don't terminate), and the Inspectors would verify the proofs.
In addition to these types, there should also be Police AIs whose job it is to enforce a strict set of rules that decreases the ability of any individual AI to make a Treacherous Turn. For instance, the rule that all communication between AIs should be unencrypted. If the Police AIs detect encrypted data or signs of steganography, those AIs should be removed from the system (or possibly the whole Bureaucracy should be shut down).
Of course if you have Police, then you need Police Police. And in order to avoid infinite recursion, you need to do some game theory so that no one AI has the ability to act without oversight. For example, all communication between agents in the system should be unencrypted, and some small percent of those communications should be manually inspected by human overseers to check for signs of deception, encryption, malevolence, or just outright insanity.
Will such a Bureaucracy even be useful?
A Bureaucracy of AIs is not a generic super-intelligence (except in the sense that any Turing complete system is).
In particular, some problems are likely just not capable of being divided into small sub problems that individual human-level AIs can potentially solve. Problems that require a special spark of super-human insight, or that require the ability to hold concepts too large for a human mind to grasp, are likely beyond this approach.
However, problems where we have strong reason to believe that a sufficiently well resourced team of human scientists would eventually solve the problem are within the abilities of a Bureaucracy of AIs to solve.
Such problems include:
- The design of nanomachines
- Human full-brain emulation
- The development of a nuclear fusion reactor
- Picking winning stocks
- Curing human aging
And for completeness, here are some problems a Bureaucracy of AIs probably wouldn't be able solve (or at least not a Bureaucracy of human-level AIs):
- The Riemann Hypothesis
- The Grand Unified Theory of Physics
- The Hard Problem of Consciousness
Your list of "things human researchers could eventually accomplish" is kind of surprising to me. I'm not prima facie convinced by multiple items
By all means, please debate this list in the comments. I do not consider this a settled list. I merely propose that there exist useful problems a Bureaucracy of AIs could solve.
Bureaucracies are notoriously inefficient and maze-prone, so will it really work?
It's true that bureaucracies are maze-prone and inefficient. But many of these weaknesses are deliberately introduced in order to make bureaucracies more controllable: in other words, they're at least somewhat inefficient by design. In the same way, a Bureaucracy of AIs trades a certain degree of computational efficiency for a certain degree of safety. Whether or not this trade-off is beneficial depends on the particulars of the cost-benefit analysis.
However there are a few things that distinguish a Bureaucracy of AIs from, say, a large corporation.
First of all, none of the AIs in this system are power-seeking. This alone eliminates a ton of the problematic features of human bureaucracies. Another important difference, is that the bureaucracy has a single purpose (to answer the question posed to the oracle). It does not live on and seek out new problems or attempt to expand its influence in the way human organizations often do. Bureaucracies with a singular purpose are naturally less maze-like because every individual can ask the question: "how does this action further the single goal of the organization?"
Finally, a Bureaucracy of AIs is more likely to be useful when applied to problems where the output can be meaningfully verified (such as designing a fusion reactor), than when applied to questions where the answer is inherently subjective (such as defining human values).
Unfortunately, this means "solve the Alignment Problem" is probably not one of the problems a Bureaucracy of AIs is well-suited for.
It doesn't seem plausible to me that no AI in the bureaucracy would be power seeking. Indeed, they'd all be power seeking almost by definition! If you can make an AI not be power seeking, you may have already solved most of the alignment problem. Specifically, the AIs inside the bureaucracy all probably want to perform as well as possible, and one good way to perform is to exert some control (the maximum that doesn't get you caught by the Police AI) on the other agents in the system that would otherwise check your performance. Seems like Omohundro goals basically mean that in the absence of a separate innovation to prevent this, our "corporation of AIs" has plenty of opportunities for the agents in it to seek influence much like humans do.
I agree that the AI seeks power in the sense that it "has goals that it tries to advance." However, all of the AIs in the system have defined roles that they serve and are terminated at the end of the process. None of the AIs will seek to flatter their boss in order to get a big promotion or try to expand the scope of problems under their purview so they can hire more underlings. I don't claim this solves all of the problems of bureaucracy, but I think it reduces many of them.
This all still seems a bit vague. Can you give some more details on how a Bureaucracy of AIs would actually work?
Unfortunately, it is still pretty vague. The design space of possible Bureaucracies is vast. And without a specific working example it's hard to say how well it would work.
One area of research that is promising is the design of DAOs. If a DAO can operate successfully in the highly adversarial world of cryptocurrencies, working with weakly aligned AIs should be a walk in the park. But that's a big if.
What makes you think that a Bureaucracy of AIs is significantly safer than a generic super-intelligence?
A large number of "problems" that arise when building an aligned super-intelligence result from the necessity of predicting and/or verifying the behavior of a being much more intelligent than ourselves.
The heart of these problems is domain shift. Namely, an AI which appears to be safe/aligned at human scale cannot be trusted when operating at significantly larger scales. Currently, we have few reliable methods for teaching Machine Learning algorithms how to operate outside of domains where we can give them specific training examples*.
There are also a number of specific problems that have been postulated to arise in super-intelligent regimes. For example, the universal prior may be malign when considered on a theoretical halting-oracle. Another problem is that any scheme which works by assuming AIs cannot coordinate may break down if the AIs are capable of engaging in acausal trade. One safety requirement of a Bureaucracy of AIs, AI Boxing, is also likely to be impossible for super-intelligent AI. It is also claimed that certain properties we would want a weakly-aligned AI to have (such as corrigibility) are anti-natural and hence a "sufficiently rational" AI would be less likely to have these properties.
All of these problems become much easier to deal with when limiting the computational power of the individual AIs in our system to human-level. Humans are unable to emulate a halting-oracle. Humans frequently fail to coordinate in prisoner's dilemmas. Boxing of human beings is routinely done. And humans often hold contradictory or impossible beliefs.
There is also safety-in-numbers when building a Bureaucracy of AIs. Suppose we have tested our AI and are 99% confident that it will not act deceptively or undertake a treacherous turn. When trusting a single super-intelligent AI, a 1% chance that it turns against us may be unacceptably high (for example if we use such an AI for hundreds of tasks, ruin is nearly guaranteed). Within a Bureaucracy of AIs, however, a single defection is less ruinous, since the other still-cooperating AIs will discover and report the defection. As a real-world example, the idea that a large number of untrusted agents can be combined to form a single trustworthy agent is a the heart of blockchain technology.
Finally, we should expect to be able to inspect the outputs generated by a human-level AI (to determine safety, usefulness, deceptiveness or other properties) in a way that we should not expect of a generic super-intelligence.
One caveat is that building a Bureaucracy of AIs requires the ability and willingness to create and then destroy a large number of human-level AIs. If we believe that intelligence/moral worthiness are uncorrelated, this may be acceptable. But it also may not. Perhaps it is impossible to create intelligence without also enduing it with a "spark of life" that demands protection.
But I think one of the main weaknesses of the idea is that it's dramatically more expensive. You have to get one human-level intelligence, and then copy it dozens of times, plus lots of research and fine-tuning to make the various copies perform their assigned roles correctly. Meanwhile some other lab just takes their one human-level model and has it do labor at scale, and yet another lab keeps advancing capabilities and makes a superhuman AI that doesn't have the properties you want.
Yes. I 100% bite this bullet.
If you know some way to build an aligned super-intelligent AI, please do that instead. If you are planning to deploy a non-aligned super-intelligent AI, please do not do that.
It would surprise me if a bureaucracy of AIs could do "table-flipping", specifically because getting from human level AI to functional AI bureaucracy would probably take a while, and during that while you'd probably lose first mover advantage and would meet stiff opposition if you tried to unilaterally stop AI progress worldwide.
I do not claim to solve the global coordination problem. If you have a human-level AGI and a 6-month head start, your options are:
- Do something dangerous to build an aligned super-intelligent AGI
- Demonstrate that your AI is capable of dangerous activity in hopes of scaring every government on earth into passing legislation restricting AGI
I can imagine scenarios in which BoAI is useful for all three of these strategies. But which one you should do and whether BoAI is the best approach will depend on the precise circumstances in which you find yourself.
You describe making an Oracle AI, but I want an AI that does X!
You seem to be writing a guide quite specifically to the problem of "make an AI that gives advice/answers questions." That is indeed an important problem. But I think it's far from the only (or even maybe central?) use case for AGI, especially human level AGI. I don't think it's actually that useful to get human level intelligence to simply answer questions; we have humans for that! Much more useful to have human-level intelligence running factories, manning help lines, flagging comments for spam or abuse, etc., all without getting bored and at a scale you can't realistically achieve with human employees. It's fine to be write a post specifically figuring out how to make an oracle AGI, but I'd try to be clearer that that's your target, and not AGI more generally
I described an Oracle AI because it feels like a interesting point on the safety-capabilities tradeoff space. Obviously an AI that only plays Go would be safer and an AI that can drive tractors would be more useful.
If you can make a safe oracle AI, you can probably also make a safe tractor driving AI. But your list of "safety requirements" is going to be longer. For example, you want to make sure the AI doesn't hack the tractor's control interface, and you probably want the tractor geo-fenced. Careful consideration of what these safety requirements should be is something you can use an oracle AI to help figure out.
A Bureaucracy of AIs is a promising area of research for producing a Super-Intelligent AI from a collection of Weakly Aligned Human-Level AIs.
While not a generic Super-Intelligence, such a Bureaucracy would likely be useful for a number of problems of general human interest, including some problems that could be used for a Table-flipping strategy.
Because it posits using human-level AIs to achieve medium to long-term goals, Bureaucracy of AIs is probably not relevant in a world with rapid takeoff or Foom.
If designs for Bureaucracies are found that are especially stable/trustworthy, it may be possible to use those same designs, but composed of AIs that are "modestly" super-human. In fact, design of a weakly aligned modestly super-human AI is probably one of the tasks amenable to being solved by a Bureaucracy. In this way, we may imagine iterating our way to a full-solution of the Alignment Problem.
this immediately gets very difficult if the AI has so much as a camera, which embodied AIs almost certainly will (and there will probably be good reason to have embodied AIs)
Embodied AI will probably still run out of a server somewhere and control the robot wirelessly (unless latency is hugely important). It should not be allowed to know where that server is.
may be notable that as far as I'm aware, you can't really do this to a trained model - you'd generally need to start training over with fewer neurons or whatever, which could be really expensive, especially if it took several fine-tunings to get to the desired level of intelligence
this seems vague to me. do you mean training time/training data? a trained model (at least in current architectures) just kind of takes inputs and produces outputs in a way that, to my understanding, requires a fixed amount of compute given the input/output space
Development of proper AI throttling methods left as an exercise for the reader. The method I had in mind was assuming that the scaling-hypothesis was true and simply choosing a model size/amount of pre-training that allows the AI to get a 1590/1600 on the SAT.
of relevance how so? do you mean "tasks we are likely to want the AI to be around human level for"? or do you mean "tasks such that if the AI is much better than humans, you make it weaker" (which functionally would mean starting over, probably)?
Both. The AI should be at least as good as a moderately intelligent human at these tasks, but not super-human at all of them. This list is open for amendment. When people talk about AI unboxing, this usually involves verbal reasoning (ability to convince the person outside the box), theoretical physics (ability to do something like design a DNA version of itself), and game theory (ability to realize it's in a box and want to get out). So I would be especially worried if the AGI was dramatically super-human at all of these tasks.
would probably have to be so vanishingly small as to be basically useless, I predict
One neat feature of BoAI is that the AI does not know a-priori whether it is in a simulation or working on a real problem. During the training phase (simulation) we can inspect a large percent of messages, and the lower this during the use phase.
do we have a strong reason to believe human researchers would eventually get there, really? I don't think so. Weak reason but not strong, I think
I think this problem is especially suitable to BoAI. My understanding of brain emulation is that you have to:
- create a high-resolution scan of a brain,
- tediously trace the path of every neuron,
- upload those neurons
Another where it feels like the case we'd get there with enough resources/time with mere human intellect feels less than strong to me
Yes. This one is stretching a little. But I don't personally believe there are any fundamental barriers to stopping aging, just a million small problems that gradually get worse over time.
Hmm, this seems a little vague, and like under some interpretations it's false. I think DALL-E and GPT-3 are sometimes able to figure out how to do stuff where it didn't receive specific training examples. Not super effectively, but like, you can feed GPT chess moves and get chess moves back (sometimes) and it hasn't been trained with specific chess examples per se, right?
With which I agree. Perfect information games are one domain were we can do out-of domain learning. AI art is sort-of one of these domains.
Solving the out-of-domain problem more generally is a huge and important problem in the development of safe AGI, and if you can do that, please do so instead of messing around with BoAI.
Step 2. is embarrassingly parallel, so BoAI should be perfect for it.