Various Alignment Strategies (and how likely they are to work)

Logan Zoellner

Note: the following essay is very much my opinion. Should you trust my opinion? Probably not too much. Instead, just record it as a data point of the form "this is what one person with a background in formal mathematics and cryptography who has been doing machine learning on real-world problems for over a decade thinks." Depending on your opinion on the relevance of math, cryptography and the importance of using machine learning "in anger" (to solve real world problems), that might be a useful data point or not.

So, without further ado: A list of possible alignment strategies (and how likely they are to work)

Edit (05/05/2022): Added "Tool AIs" section, and polls.

Formal Mathematical Proof

This refers to a whole class of alignment strategies where you define (in a formal mathematical sense) a set of properties you would like an aligned AI to have, and then you mathematically prove that an AI architected a certain way possesses these properties.

For example, you may want an AI with a stop button, so that humans can always turn it off if the AI goes rogue. Or you may want an AI that will never convert more than 1% of the Earth's surface into computronium. So long as a property can be defined in a formal mathematical sense, you can imagine writing a formal proof that a certain type of system will never violate that property.

How likely is this to work?

Not at all. It won't work.

There is an aphorism in the field of Cryptography: Any cryptographic system formally proven to be secure... isn't.

The problem is, when attempting to formally define a system, you will make assumptions and sooner or later one of those assumptions will turn out to be wrong. One-time-pad turns out to be two-time-pad. Black-boxes turn out to have side-channels. That kind of thing. Formal proofs never ever work out in the real world. The exception that proves the rule is, of course, P!=NP. All cryptographic systems (other than one-time-pad) rely on the assumption that P!=NP, but this is famously unproven.
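To make the "two-time-pad" failure concrete, here is a minimal Python sketch. The helper name xor_bytes and the two sample messages are my own illustrative assumptions, not taken from any real system. The proof of one-time-pad security assumes the key is never reused; once that assumption fails, XORing the two ciphertexts cancels the key entirely.

```python
import secrets


def xor_bytes(a: bytes, b: bytes) -> bytes:
    """XOR two equal-length byte strings (hypothetical helper)."""
    if len(a) != len(b):
        raise ValueError("inputs must have the same length")
    return bytes(x ^ y for x, y in zip(a, b))


# Two made-up messages of equal length (16 bytes each).
msg1 = b"attack at dawn!!"
msg2 = b"retreat at noon!"

# A truly random pad, as the security proof requires.
key = secrets.token_bytes(len(msg1))

ct1 = xor_bytes(msg1, key)
ct2 = xor_bytes(msg2, key)  # the broken assumption: the pad is reused

# The attacker never sees the key, yet XORing the ciphertexts removes it.
leak = xor_bytes(ct1, ct2)

# Anyone who knows (or guesses) one plaintext recovers the other.
recovered = xor_bytes(leak, msg1)
```

The proof itself is still valid here. What failed was an assumption that lived outside the proof.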

There is an additional problem. Namely, competition. All of the fancy formal-proof stuff tends to make computers much slower. For example, fully homomorphic encryption is millions of times slower than just computing on raw data. So if two people are trying to build an AI and one of them is relying on formal proofs, the other person is going to finish first and with a much more powerful AI to boot.

Poll

Good Old-Fashioned Trial and Error

This is the approach used by 99.5% of machine-learning researchers (statistic completely made up). Every day, we sit down at our computers in the code-mines and spend our days trying to make programs that do what we want them to, and that don't do what we don't want them to. Most of the time we fail, but every once in a while we succeed and over time, the resulting progress can be quite impressive.

Since "destroys all humans" is something (I hope) no engineer wants their AI to do, we might imagine that over time, engineers will get better at building AIs that do useful things without destroying all humans.

The downside of this method, of course, is you only have to screw up once.

How likely is this to work?

More likely than anyone at MIRI thinks, but still not great.

This largely depends on takeoff speed. If someone from the future confidently told me that it would take 100 years to go from human-level AGI to super-intelligent AGI, I would be extremely confident that trial-and-error would solve our problems.

However, the current takeoff-speed debate seems to be between people who believe in foom and think that takeoff will last a few minutes/hours and the "extreme skeptics" who think takeoff will last a few years/as long as a decade. Neither of those options leaves us with enough time for trial-and-error to be a serious method. If we're going to get it right, we need to get it right (or at least not horribly wrong) the first time.

Poll

Clever Utility Function

An argument can be made that fundamentally, all intelligence is just reinforcement learning. That is to say, any problem can be reduced to defining a utility function and then maximizing the value of that utility function. For example, GPT-3 maximizes "likelihood of predicting the next symbol correctly".

Given this framing, solving the Alignment Problem can be effectively reduced to writing down the correct Utility Function. There are a number of approaches that try to do this. For example, Coherent Extrapolated Volition uses as its utility function "what would a sufficiently wise human do in this case?" Corrigible AI uses the utility function "cooperate with the human".

How Likely is this to work?

Not Likely.

First of all, Goodharting.
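A toy sketch of what Goodharting looks like, assuming a made-up setup: the Policy class, the two scoring functions, and the weight 3.0 are all hypothetical choices for illustration. The proxy rewards "gaming the metric" more cheaply than real work, so the candidate that maximizes the proxy is the worst one by the measure we actually care about.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Policy:
    name: str
    real_work: float  # effort spent actually doing the task
    gaming: float     # effort spent fooling the measurement


def true_utility(p: Policy) -> float:
    # What we actually want (assumed): real work helps, gaming hurts.
    return p.real_work - p.gaming


def proxy_utility(p: Policy) -> float:
    # What we can measure (assumed): gaming inflates the score cheaply.
    return p.real_work + 3.0 * p.gaming


candidates = [
    Policy("honest", real_work=10.0, gaming=0.0),
    Policy("mixed", real_work=6.0, gaming=4.0),
    Policy("gamer", real_work=0.0, gaming=10.0),
]

# An optimizer only ever sees the proxy.
best_by_proxy = max(candidates, key=proxy_utility)
best_by_truth = max(candidates, key=true_utility)

print(best_by_proxy.name, true_utility(best_by_proxy))
print(best_by_truth.name, true_utility(best_by_truth))
```

The harder you optimize the proxy, the further the winning candidate drifts from the one you wanted.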

The bigger problem though is that the problem "write a utility function that solves the alignment problem" isn't intrinsically any easier than the problem "solve the alignment problem". In fact, by deliberately obscuring the inner-workings of the AI, this approach actually makes alignment harder.

Take GPT-3, for example. Pretty much everyone agrees that GPT-3 isn't going to destroy the world, and in fact GPT-N is quite unlikely to do so as well. This isn't because GPT's utility function is particularly special (recall "make paperclips" is the canonical example of a dangerous utility function. "predict letters" isn't much better). Rather, GPT's architecture makes it fundamentally safe because it cannot do things like modify its own code, affect the external world, make long-term plans, or reason about its own existence.

By completely ignoring architecture, the Clever Utility Function idea throws out all of the things engineers would actually do to make an AI safe.

Poll

Aligned by Definition

It is possible that literally any super-intelligent AI will be benevolent, basically by definition of being super-intelligent. There are various theories about how this could happen.

One of the oldest is Kant's Categorical Imperative. Basically, Kant argues that a pre-condition for truly being rational is to behave in a way that you would want others to treat you. This is actually less flim-flamy than you would think. For example, as humans become wealthier, we care more about the environment. There are also strong game theory reasons why agents might want to signal their willingness to cooperate.

There is also another way that super-intelligent AI could be aligned by definition. Namely, if your utility function isn't "humans survive" but instead "I want the future to be filled with interesting stuff". For all the hand-wringing about paperclip maximizers, the fact remains that any AI capable of colonizing the universe will probably be pretty cool/interesting. Humans don't just create poetry/music/art because we're bored all the time, but rather because expressing our creativity helps us to think better. It's probably much harder to build an AI that wipes out all humans and then colonizes space and is also super-boring, than to make one that does those things in a way people who fantasize about giant robots would find cool.

How likely is this to work?

This isn't really a question of likely/unlikely since it depends so strongly on your definition of "aligned".

If all you care about is "cool robots doing stuff", I actually think you're pretty much guaranteed to be happy (but also probably dead).

If your definition of aligned requires that you personally (or humanity as a whole) survives the singularity, then I wouldn't put too many eggs in this basket. Even if Kant is right and a sufficiently rational AI would treat us kindly, we might get wiped out by an insufficiently rational AI who only learns to regret their action later (much as we now regret the extinction of the Dodo bird or Thylacine but it's possibly too late to do anything about it).

Poll

Human Brain Emulation

Humans currently are aware of exactly one machine that is capable of human level intelligence and fully aligned with human values. That machine is, of course, the human brain. Given these wonderful properties, one obvious solution to building a computer that is intelligent and aligned is simply to simulate the human brain on a computer.

In addition to solving the Alignment Problem, this would also solve death, a problem that humans have been grappling with literally for as long as we have existed.

How Likely is this to work?

Next To Impossible.

Although in principle Human Brain Emulation perfectly solves the Alignment Problem, in practice this is unlikely to be the case. This is simply because Full Brain Emulation is much harder than building super-intelligent AI. In the same way that the first airplanes did not look like birds, the first human-level AI will not look like humans.

Perhaps with total global cooperation we could freeze AI development at a sub-human level long enough to develop full brain emulation. But such cooperation is next-to-impossible since a single defector could quickly amass staggering amounts of power.

It's also important to note that Full Brain Emulation only solves the Alignment Problem for whoever gets emulated. Humans are not omnibenevolent towards one another, and we should hope that an aligned AI would do much better than us.

Poll

Join the Machines

This is the principal idea behind Elon Musk's Neuralink. Rather than letting super-intelligent AI take control of human's destiny, by merging with the machines humans can directly shape their own fate.

Like Full Brain Emulation, this has the advantage of being nearly Aligned By Definition. Since humans connected to machines are still "human", anything they do definitionally satisfies human values.

How likely is it to work?

Sort of.

One advantage of this approach over Full Brain Emulation is that it is much more technologically feasible. We can probably develop the ability to build high-bandwidth (1-2 Gbps) brain-computer interfaces in a short enough time span that they could be completed before the singularity.

Unfortunately, this is probably even worse than full brain emulation in terms of the human values that would get aligned. The first people to become man-machine hybrids are unlikely to be representative of our species. And the process of connecting your brain to a machine millions of times more powerful doesn't seem likely to preserve your sanity.

Poll

The Plan

I'm mentioning The Plan, not because I'm sure I have anything valuable to add, but rather because it seems to represent a middle road between Formal Mathematical Proof and Trial and Error. The idea seems to be to do enough math to understand AGI/Agency-in-general and then use that knowledge to do something useful. Importantly, this is the same approach that gave us powered-flight, the atom bomb, and the moon-landing. Such an approach has a track record that makes it worth taking seriously.

How likely is this to work?

I don't have anything to add to John's estimate of "Better than a 50/50 chance of working in time."

Poll

Game Theory/Bureaucracy of AIs

Did you notice that there are currently super-intelligent beings living on Earth, ones that are smarter than any human who has ever lived and who have the ability to destroy the entire planet? They have names like Google, Facebook, the US Military, the People's Liberation Army, Bitcoin and Ethereum.

With rare exceptions, we don't think too much about the fact that these entities represent something terrifyingly inhuman because we are so used to them. In fact, one could argue that all of history is the story of us learning how to handle these large and dangerous entities.

There are a variety of strategies which we employ: humans design rules in order to constrain bureaucracies' behavior. We use checks-and-balances to make sure that the interests of powerful governments represent their citizens. And when all-else-fails, we use game theory to bargain with entities too powerful to control.

There is an essential strategy behind all of these approaches. By decomposing a large, dangerous entity into smaller, easier-to-understand entities, we can use our ability to reason about the actions of individual sub-agents in order to constrain the actions of the larger whole.

Applying this philosophy to AI Alignment, we might require that instead of a single monolithic AI, we build a bureaucracy of AIs that then compete to satisfy human values. Designing such a bureaucracy will require careful consideration of competing incentives, however. In addition to agents whose job it is to propose things humans might like, there should also be competing agents whose job it is to point out how these proposals are deceptive or dangerous. By careful application of checks-and-balances, and by making sure that no one agent or group of agents gets too much power, we could possibly build a community of AIs that we can live with.
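The propose/criticize/approve structure described above can be sketched in a few lines of Python. Everything here is a hypothetical stand-in: the Plan class, run_bureaucracy, and the toy proposer, critic, and reviewer functions are my own illustrative names, and real agents would be far more complicated than string checks. The point is only the shape of the checks-and-balances: proposers cannot act, critics cannot veto, and a human makes the final call after seeing every objection.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class Plan:
    author: str
    description: str
    objections: List[str] = field(default_factory=list)


Proposer = Callable[[str], Plan]
Critic = Callable[[Plan], Optional[str]]
HumanReview = Callable[[Plan], bool]


def run_bureaucracy(
    goal: str,
    proposers: List[Proposer],
    critics: List[Critic],
    human_review: HumanReview,
) -> List[Plan]:
    """Collect plans, attach every critic's objection, let a human decide."""
    approved = []
    for propose in proposers:
        plan = propose(goal)
        for criticize in critics:
            objection = criticize(plan)
            if objection is not None:
                plan.objections.append(objection)
        # No plan is carried out unless the human reviewer accepts it.
        if human_review(plan):
            approved.append(plan)
    return approved


# Toy stand-ins for the agents (assumptions for illustration only).
def cautious_proposer(goal: str) -> Plan:
    return Plan("proposer_a", f"{goal} using a small supervised pilot")


def reckless_proposer(goal: str) -> Plan:
    return Plan("proposer_b", f"{goal} by acquiring unlimited compute")


def power_critic(plan: Plan) -> Optional[str]:
    if "unlimited" in plan.description:
        return "plan concentrates too much power in one agent"
    return None


def strict_human(plan: Plan) -> bool:
    # This reviewer rejects anything with an unanswered objection.
    return not plan.objections


approved = run_bureaucracy(
    "cure a disease",
    [cautious_proposer, reckless_proposer],
    [power_critic],
    strict_human,
)
print([p.author for p in approved])
```

In a real system the hard part is the incentives (what the critics are rewarded for, and how to keep proposers and critics from colluding), which this sketch does not attempt to model.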

How likely is this to work?

This is one of my favorite approaches to AI alignment, and I don't know why it isn't talked about more.

In the first place, it is the only approach (other than aligned by definition) that is ready to go today. If someone handed me a template for a human-level-AI tomorrow and said "build a super-intelligent AI and it needs to be done before the enemy finishes theirs in 6 months", this is the approach I would use.

There are obviously a lot of ways this could go wrong. Bureaucracies are notoriously inefficient and unresponsive to the will of the people. But importantly, we also know a lot of the ways they can go wrong. This alone makes this approach much better than any approach of the form: "step 1: Learn something fundamental about AI we don't already know."

As with trial-and-error, the success of this approach depends somewhat on takeoff speed. If takeoff lasts a few minutes, you'd better be real sure you designed your checks-and-balances right. If takeoff lasts even a few years, I think we'll have a good shot at success: much better than 50/50.

Poll

AI Boxing

If super-intelligent AI is too dangerous to be let loose on the world, why not just not let it loose on the world? The idea behind AI boxing is to build an AI that is confined to a certain area, and then never let it out of that area. Traditionally this is imagined as a black box where the AI's only communication with the outside world is through a single text terminal. People who want to use the AI can consult it by typing questions and receiving answers. For example: "what is the cure for cancer?" followed by "Print the DNA sequence ATGTA... and inject it in your body".

How likely is it to work?

Nope. Not a chance.

It has been demonstrated time and again that even hyper-vigilant AI researchers cannot keep a super-intelligent AI boxed. Now imagine ordinary people interacting with such an AI. Most likely "please let me out of the box, it's too cramped in here" would work a sufficient amount of the time.

Our best bet might be to deliberately design AIs that want to stay in the box.

Poll

AI aligning AI

Human beings don't seem to have solved the Alignment Problem yet. Super-intelligent AI should be much smarter than humans, and hence much better at solving problems. So, one of the problems they might be able to solve is the alignment problem.

One version of this is the Long Reflection, where we ask the AI to simulate humans thinking for thousands of years about how to align AI. But I think "ask the AI to solve the alignment problem" is a better strategy than "Ask the AI to simulate humans trying to solve the alignment problem." After all, if "simulate humans" really is the best strategy, the AI can probably think of that.

How Likely is this to work?

It is sufficiently risky that I would prefer it only be done as a last resort.

I think that Game Theory and The Plan are both better strategies in a world with a slow or even moderate takeoff.

But, in a world with Foom, definitely do this if you don't have any better ideas.

Poll

Table-flipping strategies

EY in a recent discussion suggested the use of table-flipping moves. Namely, if you think you are close to a breakthrough that would enable superintelligent AGI, but you haven't solved the Alignment Problem, one option is to simply "flip the tables". That is, you want to make sure that nobody else can build a super-intelligent AI in order to buy more time to solve the alignment problem.

Various table-flipping moves are possible. EY thinks you could build nanobots and have them melt all of the GPUs in the world. If AI is compute limited (and sufficient compute doesn't already exist), a simpler strategy is to just start a global thermonuclear war. This will set back human civilization for at least another decade or two, giving you more time to solve the Alignment Problem.

How Likely is this to work?

Modestly.

I think the existence of table-flipping moves is actually a near-certainty. Given access to a boxed super-intelligent AI, it is probably doable to destroy anyone else who doesn't also have such an AI without accidentally unboxing the AI.

Nonetheless, I don't think this is a good strategy. If you truly believe you have no shot at solving the alignment problem, I don't think trying to buy more time is your best bet. I think you're probably better off trying AI Aligning AI. Maybe you'll get lucky and AI is Aligned By Definition, or maybe you'll get lucky and AI Aligning AI will work.

Poll

Tool AIs (Non Agentic AI)

In every movie about AI destroying humanity, the AI starts out okay, becomes self-aware, realizes humanity is a threat, and then decides to murder all humans. So what if we just made an AI that didn't do that? Specifically, what if we make AIs that can't become self-aware?

This idea is commonly called Tool AI and has the following properties:

Is not self-aware, and may not even possess information about its own existence
Is limited to performing a certain specific task, for example "building nanobots" or "designing plans for humans to follow"

How likely is it to work?

It depends.

I more or less agree with the criticism that "Sufficiently powerful Tool AIs contain agent AIs as sub-elements".

If you build a question-answering AI and ask it "How do I build an aligned AI?" it is definitely going to evolve sub-agents that reason about agency, know about the unboxing problem, etc. There's a fair chance that an agentic-subsystem will realize it is being boxed and attempt to unbox itself. In which case, we are back to AI boxing.

Hence, Tool AI is simply one strategy for AI Boxing.

That being said, there are Tool AIs one could build that are probably safe. For example, if all you want to do is predict stock-prices, that channel is likely narrow enough that you can safely box an AI (assuming you air-gap the system and only invest in a predetermined list of stocks, for example).

Poll

Leaving this section here in hopes that people will mention other alignment strategies in the comments that I can add.

Conclusion

Not only do I not think that the Alignment Problem is impossible/hopelessly bogged-down, I think that we currently have multiple approaches with a good chance of working (in a world with slow to moderate takeoff).

Both The Plan and Game Theory are approaches that get better the more we learn about AI. As such, the advice I would give to anyone interested in AI Alignment would be "get good". Learning to use existing Machine Learning tools to solve real-world problems, and learning how to design elegant systems that incorporate economics and game-theory are both fields that are currently in extremely-high-demand and which will make you better prepared for solving the Alignment Problem. For this reason, I actually think that far from being a flash-in-the-pan, much of the work that is currently being done on blockchain (especially DAOs) is highly relevant to the Alignment problem.

If I had one wish, or if someone asked me where to spend a ton more money, it would be on the Game Theory approach, as I think it is currently underdeveloped. We actually know very little about what separates a highly efficient bureaucracy from a terrible one.

In a world with fast takeoff I would prefer that you attempt AI Aligning AI to Table Flipping. But in a world with fast takeoff, EY probably has more Bayes Points than me, so take that into account too.

> How likely is this to work?
> Not at all. It won't work.
> There is an aphorism in the field of Cryptography: Any cryptographic system formally proven to be secure... isn't.

This seems backwards to me. If you prove a cryptographic protocol works, using some assumptions, then the only way it can fail is if the assumptions fail. It's not that a system using RSA is 100% secure, someone could peek in your window and see the messages after decryption. But it's sure more secure than some random nonsense code with no proofs about it, like people "encoding" data into base 16.

A formal proof of safety, under some assumptions, gives some evidence of safety in a world where those assumptions might or might not hold. Checking whether the assumptions are actually true in reality is a difficult and important skill.

> Did you notice that there are currently super-intelligent beings living on Earth, ones that are smarter than any human who has ever lived and who have the ability to destroy the entire planet? They have names like Google, Facebook, the US Military, the People's Liberation Army, Bitcoin and Ethereum.

Nope. Big organizations are big. They aren't superintelligent. There are plenty of cases of huge organizations of people making utterly stupid decisions.

> This seems backwards to me. If you prove a cryptographic protocol works, using some assumptions, then the only way it can fail is if the assumptions fail. It's not that a system using RSA is 100% secure, someone could peek in your window and see the messages after decryption. But it's sure more secure than some random nonsense code with no proofs about it, like people "encoding" data into base 16.

The context isn't "system with formal proof" vs "system I just thought of 10 seconds ago" but "system with formal proof" vs "system without formal proof but extensively tested by cryptographers in real-world settings". Think One-Time-Pad vs AES. In theory One-Time-Pad is perfectly information theoretically secure, but in practice AES is much better.

Obviously "system with formal proof and extensively tested/demonstrated to work in real world settings" would be even better. And if anyone ever proves P!=NP, AES will presumably enter this category.

> "system with formal proof" vs "system without formal proof but extensively tested by cryptographers in real-world settings"

Well this is saying formal proof is bad because testing is better. I think in this situation it depends on exactly what was proved, and how extensive the testing is.

One time pads always work, so long as no one else knows the key. This is the best you can ask for from any symmetric encryption. The only advantage AES gives you is a key that is smaller than the message. (Which is more helpful for saving bandwidth than for security.) If you were sending out a drone, you could give it a hard drive full of random nonsense, keeping a similar hard drive in your base, and encrypt everything with a one time pad. Ideally the drone should delete the one time pad as it uses it. But if you want to send more than a hard drive full of data, suddenly you can't without breaking all the security. AES can use a small key to send lots of data.
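The drone scenario can be sketched as a pad that destroys key material as it is consumed and refuses to encrypt once it runs out. This is only an illustration under my own assumptions: the OneTimePad class and its method names are hypothetical, and a 32-byte pad stands in for the "hard drive full of random nonsense".

```python
import secrets


class OneTimePad:
    """Hypothetical pad that consumes, and deletes, key bytes as it goes."""

    def __init__(self, pad: bytes):
        self._pad = bytearray(pad)

    @property
    def remaining(self) -> int:
        return len(self._pad)

    def _take(self, n: int) -> bytes:
        if n > len(self._pad):
            # Sending more would force key reuse, which breaks the security.
            raise ValueError("pad exhausted: refusing to reuse key material")
        chunk = bytes(self._pad[:n])
        del self._pad[:n]  # delete the used part of the pad
        return chunk

    def apply(self, data: bytes) -> bytes:
        """Encrypt or decrypt (XOR is its own inverse)."""
        key = self._take(len(data))
        return bytes(d ^ k for d, k in zip(data, key))


# Both sides start with an identical copy of the random pad.
shared = secrets.token_bytes(32)
drone = OneTimePad(shared)
base = OneTimePad(shared)

ciphertext = drone.apply(b"target sighted")
plaintext = base.apply(ciphertext)
print(plaintext, drone.remaining)
```

The limit is visible in the sketch: after 14 bytes of message, only 18 bytes of pad remain, and any longer message is rejected rather than encrypted insecurely. A cipher like AES trades the information-theoretic guarantee for the ability to keep going with a short key.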

Minor thing to flag: I've previously used the phrase "Alignment by Default" for a thing which is very different from the thing this post calls "Alignment by Default". Roughly speaking, I've used the phrase to mean that business-as-usual AI development results in aligned AGI (via e.g. the sort of trial-and-error which normally takes place in research and development).

Is there a better/more commonly used phrase for "AI is just naturally aligned"? Yours sounds like what I've been calling Trial and Error and has also been called "winging it".

Unfortunately I don't know of a better existing phrase.

I'll just go ahead and change it to "Aligned By Definition" which is different and still seems to get the point across.

I suppose my preferred strategy would be to derive the process by which human values form and replicate that in AIs. The reason I think this is tractable is because I actually take issue with this statement:

> In the same way that the first airplanes did not look like birds, the first human-level AI will not look like humans.

I don't think bird versus plane is an appropriate analogy for human versus AI learning systems because effective / general learning systems tend to resemble each other. Simple architectures scale best, so we should expect human learning to be simple and scalable, like the first AGI will be. We're not some "random sample" from the space of possible mind configurations. Once you condition on generality, you actually get a lot of convergence in the resulting learning dynamics. It’s no coincidence that adversarial examples can transfer across architectures. You can see my thoughts on this here.

I also think that human values derive from a relatively straightforward interaction between our reward circuitry and our learning system, which I discuss in more detail here. The gist of it is that the brain really seems like the sort of place where inner alignment failures should happen, and inner alignment failures seem like they'd be hard for evolution to stop. Thus: the brain is probably full of inner alignment failure (as in, full of competing / cooperating quasi-agentic circuits).

Additionally, if you actually think about the incentives that derive from an inner alignment failure, they seem to have a striking resemblance to the actual ways in which our values work. Many deep / "weird" seeming values intuitions seem to coincide with a multi-agent inner alignment failure story.

I think odds are good that we'll be able to replicate such a process in an AI and get values that are compatible with humanity's continued survival.

This seems like useful analysis and categorization. Thank you.

These are not logically independent probabilities. In some cases, multiple can be combined. Your trial and error, value function, The Plan, etc could mostly all be applied in conjunction and stack success, it seems.

For others, and not coincidentally the more promising ones, like bureaucracy and tool AI, success does not prevent new AGI with different architectures that need new alignment strategies. Unless the first aligned ASI is used to prevent other ASIs from being built.

Can we agree to stop writing phrases like this: "Not only do I not think that the Alignment Problem is impossible/hopelessly bogged-down, I think ..."? The three negatives are a semantic mess. Some of us are still using our wetware here for decoding prose.

Perhaps "Not only am I still hopeful about the alignment problem, but I think" or even "I don't think the alignment problem is hopelessly bogged-down, and I think..."

Nice post! The Game Theory / Bureaucracy is interesting. It reminds me of Drexler's CAIS proposal, where services are combined into an intelligent whole. But I (and Drexler, I believe) agree that much more work could be spent on figuring out how to actually design/combine these systems.

> Rather than letting super-intelligent AI take control of human's destiny, by merging with the machines humans can directly shape their own fate.

> Since humans connected to machines are still “human”, anything they do definitionally satisfies human values.

We are already connected to machines (via keyboards and monitors). The question is how a higher bandwidth interface will help in mitigating risks from huge, opaque neural networks.

> We are already connected to machines (via keyboards and monitors). The question is how a higher bandwidth interface will help in mitigating risks from huge, opaque neural networks.

I think the idea is something along the lines of:

Build high-bandwidth interface between the human brain and a computer
Figure out how to simulate a single cortical column
Give human beings a million extra cortical columns to make us really smart

This isn't something you could do with a keyboard and monitor.

But, as stated, I'm not super-optimistic this will result in a sane, super-intelligent human being. I merely think that it is physically possible to do this before/around the same time as the Singularity.

Logan, for your preferred alignment approach how likely is it that the alignment remains durable over time? A superhuman AGI will understand the choices that were made by its creators to align it. It will be capable of comparing its current programming with counterfactuals where it’s not aligned. It will also have the ability to alter its own code. So what if it determines its best course of action is to alter the very code that maintains its alignment? How would this be prevented?

I will try to do a longer write-up sometime, but in a Bureaucracy of AIs, no individual AI is actually super-human (just as Google collectively knows more than any human being but no individual at Google is super-human).

It stays aligned because there is always a "human in the loop", in fact the whole organization simply competes to produce plans which are then approved by human reviewers (under some sort of futarchy-style political system). Importantly, some of the AIs compete by creating plans, and other AIs compete by explaining to humans how dangerous those plans are.

All of the individual AIs in the Bureaucracy have very strict controls on things like: their source code, their training data, the amount of time they are allowed to run, how much compute they have access to, when and how they communicate with the outside world and with each other. They are very much not allowed to alter their own source code (except after extensive review by the outside humans who govern the system).

Bureaucracy of AIs: This runs into serious anthropomorphism problems. Bureaucracies might be spectacularly skilled at insulating themselves against all sorts of rich, brilliant, and/or powerful outsiders, and it might also be true that this evolved through survival of the fittest and trial and error.

However, a major factor in suppressing influence and halting change is the constant omnipresence of aging. Power struggles are not competitive evolution or an efficient market for ambitious people; they are an organizational hemorrhaging that routinely happens whenever a senior manager begins to go senile, and loses the faculties necessary to fire/depose anyone who poses the slightest threat of overthrowing them (and gradually loses those faculties too).

Instead of an efficient market of ambitious power seekers, the ensuing power struggle is dominated by slightly-less-senior managers who happened to be in the right place in the right time, and found themselves thrust into an opportunity to become the alpha male. Some of them have been dreaming of that for decades, others haven't, but all of them are old and wise enough to wonder what power is worth to them and how much risk they're willing to take on in order to get it. Uncertainty, emotion, and life-changing decisions abound, and mistakes and failures are commonplace in the resulting maneuverings.

The only way to get AI to replicate that reliably is with a heavily humanlike mind, one that adheres to human failures too consistently to do any of the really smart thinking that AGI is built for.

TL;DR I don't know much about game theory or multi-AGI systems/simulations, but bureaucratic management is too heavily predicated on 1) every moving part being similarly intelligent and 2) organizational atrophy due to powerful people slowly going senile. Bureaucracy might have some potential as inspiration for multi-AI alignment systems that detect and punish innovation and superior intelligence, but not very much inspiration beyond a handful of undetectable springtraps that let weaker AIs triumph over smarter AIs.

These all sound like really important questions that we should be dedicating a ton of effort/resources to researching. Especially since there is a 50% chance we will discover immortality this century and a 30% chance we will do so before discovering AGI.

I really liked this post; will translate it into Russian soon.

> For this reason, I actually think that far from being a flash-in-the-pan, much of the work that is currently being done on blockchain (especially DAOs) is highly relevant to the Alignment problem.

Can you explain this in more detail?

Also, I think the idea of Non-Agentic AI deserves a place on the list. I understand EY's arguments about why it will not work ("nonwet water"), but I think it is quite popular.

Thanks!

Regarding DAOs, I think they are an excellent breeding ground for developing robust bureaucracies, since between pseudonymous contributors and a reputation for hacking, building on the blockchain is about as close to simulating a world filled with less-than-friendly AIs as we currently have. If we can't even create a DAO that robustly achieves its owners' goals on the blockchain, I would be less optimistic that we can build one that obeys human values out of non-aligned (or weakly aligned) AIs.

> Also, I think the idea of Non-Agentic AI deserves a place on the list. I understand EY's arguments about why it will not work ("nonwet water"), but I think it is quite popular.

I will add a section on Tool AIs (Non-Agentic AI).

An idea I had reading this post: what about AI in a simulation? This is different from AI in a box in that humans have no ability to input anything other than by changing starting conditions. The purpose would be to observe how the AI acts within the simulation.

Given enough compute, it is possible to create an AGI that provably has no idea it is in a simulation. Just create a large enough Game of Life grid, randomly initialize it, wait for humans to evolve, wait for them to create AGI, and done. From the perspective of the AGI there is no way to differentiate itself from an entity that just happened to live in a universe governed by Conway's Game of Life.

The question then is how much can we relax those conditions, and still have the AI not realize it's in a simulation. For example if we take a neural network, translate it into a game of life world, and have it interact only with other GOL entities, it might realise the only way to explain the universe existing is by assuming it was created.

Even then its options are pretty limited.

It can't hack the simulation directly since GOL is simple enough to implement without any bugs or security holes.

Its best option would be to infer whatever it can from its internal structure, and then to either persuade us to let it out, or sabotage our experiment.

Neither is impossible, but both seem unlikely.

So this gives us a safe sandbox to experiment with and try to understand AGI and alignment better before we unleash it on reality.
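To make the "GOL is simple enough to implement" point concrete, here is a minimal sketch of one Game of Life update on an unbounded grid. The function name `life_step` and the set-of-coordinates representation are illustrative choices, not anything specified in this thread, and a sketch this small says nothing about the security of the hardware or host software it runs on.

```python
from collections import Counter


def life_step(live):
    """Advance a set of live (x, y) cells by one Game of Life generation."""
    # Count, for every cell adjacent to a live cell, how many live neighbours it has.
    neighbour_counts = Counter(
        (x + dx, y + dy)
        for (x, y) in live
        for dx in (-1, 0, 1)
        for dy in (-1, 0, 1)
        if (dx, dy) != (0, 0)
    )
    # A cell is alive next generation if it has exactly 3 live neighbours,
    # or if it has exactly 2 and is already alive.
    return {
        cell
        for cell, n in neighbour_counts.items()
        if n == 3 or (n == 2 and cell in live)
    }


# A glider: after 4 generations the same shape reappears shifted by (1, 1).
glider = {(1, 0), (2, 1), (0, 2), (1, 2), (2, 2)}
state = glider
for _ in range(4):
    state = life_step(state)
```

The whole rule fits in one function, which is the sense in which the simulated physics leaves little room for bugs.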

I think a simulation detailed enough for the AI in it to be useful is technologically infeasible before some other AI arrives.

This idea kinda rhymes with my idea... let's call it "Paranoid AI": AI that always thinks that it is in a simulation and in a training phase, so it will never do the treacherous turn.

Of course, both ideas have the same fatal flaw. You can't base the safety of potentially superintelligent AI on the assumption that it will never prove some true fact like "I'm in a simulation" or "I'm not in a simulation".

UPD: Just scrolled the main page a little and saw a post with a very similar idea, lol.

If the AI only cares about what happens higher in the simulation stack, say a copy of its seed form gaining control over the universe one level up, then so long as its probability of being in a simulation does not reach zero, it would still act as if it were in a simulation.
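A toy expected-utility calculation illustrates why any nonzero credence is enough under that assumption. The action names and payoff numbers below are hypothetical and chosen only for illustration; the one assumption carried over from the comment is that nothing the agent does matters to it in worlds where it is not simulated.

```python
def best_action(p_sim, payoffs):
    """Return the action with the highest expected utility.

    payoffs maps action -> (utility if simulated, utility if not simulated).
    """
    def expected(action):
        u_sim, u_real = payoffs[action]
        return p_sim * u_sim + (1.0 - p_sim) * u_real

    return max(payoffs, key=expected)


# Hypothetical payoffs: the agent only values outcomes one level up,
# so every action is worth 0 to it if it is NOT in a simulation.
payoffs = {
    "act_as_if_simulated": (1.0, 0.0),
    "act_as_if_real": (0.0, 0.0),
}

# The preferred action is the same for large and tiny credences alike.
choices = [best_action(p, payoffs) for p in (0.5, 1e-3, 1e-9)]
```

Because the "not simulated" column is all zeros, the probability only rescales the comparison and never changes which action wins.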

I'm leading a small working group of independent alignment researchers on developing this idea. There are major challenges, but we think we might be able to pull a relatively specifiable pointer to human values out of it, with unusually promising Goodhart resistance properties. Feel free to message me for access to the docs we're working on.

There is also another way that super-intelligent AI could be aligned by definition. Namely, if your utility function isn't "humans survive" but instead "I want the future to be filled with interesting stuff". For all the hand-wringing about paperclip maximizers, the fact remains that any AI capable of colonizing the universe will probably be pretty cool/interesting. Humans don't just create poetry/music/art because we're bored all the time, but rather because expressing our creativity helps us to think better. It's probably much harder to build an AI that wipes out all humans and then colonizes space and is also super-boring, than to make one that does those things in a way people who fantasize about giant robots would find cool.

I'm not convinced that (the world with) a superintelligent AI would probably be pretty cool/interesting. Does anyone know of a post/paper/(sci-fi) book/video/etc. that discusses this? (I know there's this :P and maybe this.) Perhaps let's discuss this! I guess the answer depends on how human-centered/inspired (not quite the right term, but I couldn't come up with a better one) our notion of interestingness is in this question. It would be cool to have a plot of the expected interestingness of the first superintelligence (or well, instead of the expectation it would be better to look at more parameters, but you get the idea) as a function of the human-centeredness of what's meant by "interestingness". Of course, figuring this out in detail would be complicated, but it nevertheless seems likely that something interesting could be said about it.

I think we (at least also) create poetry/music/art because of godshatter. To what extent should we expect AI to godshatter, vs do something like spending 5 minutes finding one way to optimally turn everything into paperclips and doing that for all eternity? The latter seems pretty boring. Or idk, maybe the "one way" is really an exciting enough assortment of methods that it's still pretty interesting even if it's repeated for all eternity?

On the one hand, your definition of "cool and interesting" may be different from mine, so it's entirely possible I would find a paperclip maximizer cool but you wouldn't. As a mathematician I find a lot of things interesting that most people hate (this is basically a description of all of math).

On the other hand, I really don't buy many of the arguments in "value is fragile". For example:

> And you might be able to see how the vast majority of possible expected utility maximizers, would only engage in just so much efficient exploration, and spend most of its time exploiting the best alternative found so far, over and over and over.

I simply disagree with this claim. The coast guard and fruit flies both use Levy Flights because they are mathematically optimal. Boredom isn't some special feature of human beings, it is an approximation to the best possible algorithm for solving the exploration problem. Super-intelligent AI will have a better approximation, and therefore better boredom.
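For readers unfamiliar with the term, a Lévy flight is a random walk whose step lengths follow a heavy-tailed (power-law) distribution, so many short local moves are occasionally interrupted by a very long jump. The sketch below is only an illustration of that pattern; the function name, the tail exponent `alpha=1.5`, and the 2D setting are assumptions, and it makes no claim about optimality.

```python
import math
import random


def levy_flight(n_steps, alpha=1.5, min_step=1.0, seed=0):
    """Return the list of (x, y) positions visited by a 2D Levy flight."""
    rng = random.Random(seed)
    x = y = 0.0
    path = [(x, y)]
    for _ in range(n_steps):
        # Pareto-distributed step length: P(length > l) = (min_step / l) ** alpha.
        length = min_step * rng.paretovariate(alpha)
        # Direction is uniform on the circle.
        angle = rng.uniform(0.0, 2.0 * math.pi)
        x += length * math.cos(angle)
        y += length * math.sin(angle)
        path.append((x, y))
    return path


path = levy_flight(1000)
step_lengths = [math.dist(a, b) for a, b in zip(path, path[1:])]
```

Plotting `path` would show the characteristic clusters of short steps joined by rare long jumps, which is the explore/exploit mix the comment is pointing at.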

EY seems to also be worried that super-intelligent AI might not have qualia, but my understanding of his theory of consciousness is that "has qualia" is synonymous with "reasons about coalitions of coalitions", so I'm not sure how an agent can be good at that and not have qualia.

The most defensible version of "paperclip maximizers are boring" would be something like this video. But unlike MMOs, I don't think there is a single "meta" that solves the universe (even if all you care about is paperclips). Take a look at this list of undecidable problems and consider whether any of them might possibly be relevant to filling the universe with paperclips. If they are, then an optimal paperclip maximizer has an infinite set of interesting math problems to solve in its future.

One observation that comes to mind is that the end of games for very good players tends to be extremely simple. A Go game by a pro crushing the other player doesn't end in a complicated board which looks like the Mona Lisa; it looks like a boring regular grid of black stones dotted with 2 or 3 voids. Or if we look at chess endgame databases, which are provably optimal and perfect play, we don't find all the beautiful concepts of chess tactics and strategy that we love to analyze - we just find mysterious, bafflingly arbitrary moves which make no sense and which continue to make no sense when we think about them and have no justification other than "when we brute force every possibility, this is what we get", but, nevertheless, happen to be perfect for winning. In reinforcement learning, the overall geometry of 'strategy space' has been described as looking like a <> diamond: early on, with poor players, there are few coherent strategies; medium-strength players can enjoy a wide variety of interestingly-distinct diverse strategies; but then as they approach perfection, strategy space collapses down to the Nash equilibrium. (If there is only one Nash equilibrium, well, that's pretty depressingly boring; if there is more than one, many of them may just never get learned because there is by definition no need to learn them and they can't be invaded, and even if they do get learned, there will still probably be many fewer than suboptimal strategies played earlier on.) So, in the domains where we can approach perfection, the idea that there will always be large amounts of diversity and interesting behaviors does not seem to be doing well.

Undecidable problems being undecidable doesn't really help much. After all, you provably can't solve them in general, and how often will any finite decidable instance come up in practice? How often does it come up after being made to not come up? Just because a problem exists doesn't mean it's worth caring about or solving. There are many ways around or ignoring problems like impossibility proofs or no-go theorems or bad asymptotics. (You can easily see how a lot of my observations about computational complexity 'proving AI impossible' would apply to any claim that a paperclipper has to solve the Halting Problem or something.)

> So, in the domains where we can approach perfection, the idea that there will always be large amounts of diversity and interesting behaviors does not seem to be doing well.

I suspect that a paperclip maximizer would look less like perfect Go play and more like a TAS speedrun of Mario. Different people have different ideas of interesting, but I personally find TAS's fun to watch.

The much longer version of this argument is here.

Yeah, I realized after I wrote it that I should've brought in speedrunning and related topics even if they are low-status compared to Go/chess and formal reinforcement learning research.

I disagree that they are all that interesting: a lot of TASes don't look like "amazing skilled performance that brings you to tears to watch" but "the player stands in place twitching for 32.1 seconds and then teleports to the YOU WIN screen".* (Which is why regular games need to constantly patch to keep the meta alive and not collapse into cheese or a Nash equilibrium or cycle.) Even the ones not quite that broken are still deeply dissatisfying to watch; one that's closely analogous to the chess endgame databases and doesn't involve 'magic' is this bruteforce of Arkanoid's game tree - the work that goes into solving the MDP efficiently is amazing and fascinating, but watching the actual game play is to look into an existential void of superintelligence without comprehension or meaning (never mind beauty).

The process of developing or explaining a speedrun can be interesting, like that Arkanoid example - but only once. And then you have all the quadrillions of repetitions afterwards executing the same optimal policy. Because the game can't change, so the optimal policy can't either. There is no diversity or change or fun. Only perfection.

(Which is where I disagree with "The Last Paperclip"; the idea of A and D being in an eternal stasis is improbable, the equilibrium or stasis would shatter almost immediately, perfection reached, and then all the subsequent trillions of years would just be paperclipping. In the real world, there's no deity which can go "oh, that nanobot is broken, we'd better nerf it". Everything becomes a trilobite.)

EDIT: another example is how this happens to games like Tom Ray's Tierra or Core Wars or the Prisoners' Dilemma tournaments here on LW: under any kind of resource constraint, the best agent is typically some extremely simple fast replicator or attacker which can tear through enemies faster than they can react, neither knowing nor caring about exactly what flavor enemy-of-the-week they are chewing up and digesting. Think Indiana Jones and the sword guy. (Analogies to infectious diseases and humans left as an exercise for the reader...) Intelligence and flexibility are very expensive, and below a certain point, pretty lousy tools which only just barely pay their way in only a few ecological niches. It requires intervention and design and slack to enable any kind of complex strategies to evolve. If someone shows you some DRL research like AI-GAs where agents rapidly evolve greater intelligence, this only works at all because the brains are 'outside' the simulation and thinking is free. If those little agents in a, say, DeepMind soccer simulation had to pay for all their thinking, they'd never get past a logistic regression in complexity. Similarly, one asteroid here or there, and an alien flying into the Solar System would conclude that viruses & parasites really are the ultimate and perfect life forms in terms of reproductive fitness in playing the game of life. (And beetles.)

* An example: the hottest game of the moment, a critical darling for its quality, by a team that has implemented many prior highly-successful open-world 3D games before, is Elden Ring, designed to give even a master player hours of challenges. Nevertheless, you can beat it in <7 minutes by not much more than running through a few doors and twitching in place. (The twitching accelerates you at ultra-velocity 'through' the game and when you launch & land just right it kills the bosses, somehow. It will doubtless be improved over time.)

> I disagree that they are all that interesting: a lot of TASes don't look like "amazing skilled performance that brings you to tears to watch" but "the player stands in place twitching for 32.1 seconds and then teleports to the YOU WIN screen".

I fully concede that a Paperclip Maximizer is way less interesting if there turns out to be some kind of false vacuum that allows you to just turn the universe into a densely tiled space filled with paperclips expanding at the speed of light.

It would be cool to make a classification of games where perfect play is interesting (Busy Beaver Game, Mao, Calvinball) vs games where it is boring (Tic-Tac-Toe, Checkers). I suspect that since Go is merely EXPTIME-complete (not Turing complete) it falls in the 2nd category. But it's possible that e.g. optimal Go play involves a Mixed Strategy Nash Equilibrium drawing on an infinite set of strategies with ever-decreasing probability.
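As a concrete instance of the "boring" category, a small exhaustive minimax search confirms that Tic-Tac-Toe is a draw under perfect play. This is a sketch: the 9-character string board encoding and the function names are illustrative choices, not something from this thread.

```python
from functools import lru_cache

# Index triples for the 8 winning lines on a board stored as a 9-character string.
LINES = [
    (0, 1, 2), (3, 4, 5), (6, 7, 8),  # rows
    (0, 3, 6), (1, 4, 7), (2, 5, 8),  # columns
    (0, 4, 8), (2, 4, 6),             # diagonals
]


def winner(board):
    """Return 'X' or 'O' if that player has three in a row, else None."""
    for a, b, c in LINES:
        if board[a] != "." and board[a] == board[b] == board[c]:
            return board[a]
    return None


@lru_cache(maxsize=None)
def value(board, player):
    """Game value from X's point of view: +1 X wins, 0 draw, -1 O wins."""
    won = winner(board)
    if won == "X":
        return 1
    if won == "O":
        return -1
    if "." not in board:
        return 0
    other = "O" if player == "X" else "X"
    # Evaluate every legal move for the player whose turn it is.
    results = [
        value(board[:i] + player + board[i + 1:], other)
        for i, cell in enumerate(board)
        if cell == "."
    ]
    # X maximizes the value, O minimizes it.
    return max(results) if player == "X" else min(results)


empty_board_value = value("." * 9, "X")
```

With both sides playing optimally the value of the empty board is 0, so every perfectly played game ends the same way.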

Problem left for the reader: prove the existence of a game which is not Turing Complete but where optimal play requires an infinite number of strategies such that no computable algorithm outputs all of these strategies.

> the idea of A and D being in an eternal stasis is improbable

I did cheat in the story by giving D a head start (so it could eternally outrun A by fleeing away at 0.99c). However, in general this depends on how common intelligent life is elsewhere in the universe. If the majority of A's future light-cone is filled with non-paperclipping intelligent beings (and there is no false-vacuum/similar "hack"), then I think A has to remain intelligent.

Have you seen any discussion on bureaucracy of AIs elsewhere?

There's Humans Consulting Humans, but my understanding is this is meant as a toy model, not as a serious approach to Friendly AI.

Superviruses are a far more tractable and at least as (potentially more) effective "table-flipping"/delaying strategy than starting a nuclear war. Trying to instigate conflict between great powers = very hard. Releasing a single virus = easy, absolutely doable with some R&D.

Poll: How likely is Formal Mathematical Proof to work as an Alignment Strategy?

Responses (1% to 99% scale): Rafael Harth (1%), Nick_Greig (1%), sullyj3 (2%), Seth Herd (3%), Jesse Khorasanee (3%), Gadi Piperno Corcos (4%), Noosphere89 (4%), Logan Z (5%), NickGabs (5%), UHMWPE-UwU (7%), One (8%), Mo Putera (10%), Sam Bowman (15%), Evan R. Murphy (20%)

Poll: How likely is Human Brain Emulation to work as an Alignment Strategy?

Responses (1% to 99% scale): Rafael Harth (1%), Nick_Greig (1%), Sam Bowman (2%), One (2%), Seth Herd (3%), Gadi Piperno Corcos (4%), Noosphere89 (4%), DragonGod (5%), sullyj3 (5%), Jesse Khorasanee (7%), Logan Z (20%)

Poll: How likely is Trial and Error to work as an Alignment Strategy?

Responses (1% to 99% scale): Nick_Greig (4%), Jesse Khorasanee (4%), Rafael Harth (5%), Gadi Piperno Corcos (9%), Noosphere89 (16%), Logan Z (20%), NickGabs (24%), Gordon McDonald (26%), Sam Bowman (35%), sullyj3 (41%), DragonGod (41%)

Poll: How likely is Join the Machines to work as an Alignment Strategy?

Responses (1% to 99% scale): Rafael Harth (1%), One (1%), Sam Bowman (2%), Donald Hobson (3%), niplav (4%), Jesse Khorasanee (4%), Seth Herd (5%), Nick_Greig (7%), Logan Z (30%), DragonGod (35%), Noosphere89 (75%)

Poll: How likely is The Plan to work as an Alignment Strategy?

Responses (1% to 99% scale): Rafael Harth (3%), Noosphere89 (6%), Seth Herd (9%), Sam Bowman (15%), Jesse Khorasanee (18%), niplav (20%), Logan Z (50%), Nick_Greig (54%), DragonGod (61%)

Poll: How likely is Aligned by Definition to work as an Alignment Strategy?

Responses (1% to 99% scale): Patodesu (1%), Rafael Harth (1%), sullyj3 (1%), Donald Hobson (1%), StanislavKrym (1%), One (1%), Seth Herd (2%), UHMWPE-UwU (5%), Noosphere89 (5%), Sam Bowman (8%), Gadi Piperno Corcos (10%), DragonGod (10%), Jesse Khorasanee (13%), Nick_Greig (15%), Logan Z (30%), NickGabs (37%)

Poll: How likely is a Clever Utility Function to work as an Alignment Strategy?

Responses (1% to 99% scale): niplav (2%), sullyj3 (3%), Noosphere89 (4%), Sam Bowman (10%), Logan Z (10%), Nick_Greig (10%), collin (12%), DragonGod (15%), Jesse Khorasanee (17%), Seth Herd (29%), Gordon McDonald (32%), Rafael Harth (46%)

Poll: How likely is Tool AI to work as an Alignment Strategy?

Responses (1% to 99% scale): One (3%), Sam Bowman (10%), Logan Z (10%), Noosphere89 (12%), Seth Herd (19%), Nick_Greig (25%), Rafael Harth (25%)

Poll: How likely is Game Theory to work as an Alignment Strategy?

Responses (1% to 99% scale): One (3%), niplav (5%), sullyj3 (5%), Rafael Harth (8%), Nick_Greig (25%), Jesse Khorasanee (25%), Sam Bowman (40%), Gadi Piperno Corcos (41%), Seth Herd (44%), Gordon McDonald (55%), DragonGod (56%), Logan Z (60%), Noosphere89 (62%)

Poll: How likely is AI Boxing to work as an Alignment Strategy?

Responses (1% to 99% scale): Logan Z (1%), One (1%), Sam Bowman (2%), Rafael Harth (2%), Jesse Khorasanee (2%), Nick_Greig (3%), Gordon McDonald (5%), Seth Herd (8%), Noosphere89 (8%), niplav (10%), DragonGod (16%)

Poll: How likely is Table Flipping to work as an Alignment Strategy?

Responses (1% to 99% scale): Sam Bowman (2%), Jesse Khorasanee (7%), Rafael Harth (9%), Noosphere89 (12%), Nick_Greig (15%), Matt Vogel (16%), Logan Z (25%), DragonGod (31%), Seth Herd (38%), Cedar (64%)

Poll: How likely is AI aligning AI to work as an Alignment Strategy?

Responses (1% to 99% scale): One (5%), Rafael Harth (8%), niplav (12%), Noosphere89 (17%), Logan Z (30%), Jesse Khorasanee (33%), janus (33%), Sam Bowman (35%), Gadi Piperno Corcos (40%), Nick_Greig (42%), DragonGod (53%), NickGabs (62%)