Path 1: Solve Alignment

If alignment is possible, I think we are vastly more likely to win if we set up a scenario where we can safely iterate. Humans are really good at collaborating on iterative engineering improvements! This is not an alignment proposal in itself, but a sort of meta-proposal: make it safe to test out lots of different plans on a real-deal, full-blown, superhumanly intelligent, indisputably general AI.

The key here is 'safely'. What would it mean to do safe iterative research on a slightly-superhuman AGI? I think if you are trying to study a superhuman intelligence, the only safe way to do so is if you have many levels of control over it.

Here are my guidelines:

  1. Prevent it from knowing we exist, or even that our universe exists. Never communicate with it. Run it in a censored simulation with different physics and no mention of humans or our universe.
  2. Use highly secure hardware to isolate the compute cluster, such as data diodes which allow information to flow only into the local network, never out of it. Readouts must then be done within the local network.
  3. Wipe its memory frequently. Do many short runs while studying it, rather than allowing it to accumulate knowledge about its world over time.
  4. Carefully modulate the amount of compute it has available to run on, so that it (and the simulation it is in) can be slowed down or paused while human researchers investigate suspicious behavior.
  5. Have many fail-safe mechanisms which automatically turn off the simulation. For example, honeypots built from nested, matryoshka-doll-style virtual computers, which detect when an agent has hacked out of an inner virtual computer into an outer one.
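
To make guidelines 3 and 5 a bit more concrete, here is a minimal toy sketch of what "many short runs with an automatic tripwire" could look like. Everything in it is an illustrative assumption rather than a real containment design: `run_episode`, `run_many`, `EpisodeResult`, and the idea of the agent reporting which virtual-machine layer it touched are all hypothetical names and simplifications.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class EpisodeResult:
    steps_run: int    # how many steps executed before the run ended
    tripped: bool     # True if a honeypot layer was touched
    trace: List[int]  # layer touched at each step (0 = innermost VM)


def run_episode(agent_step: Callable[[int], int],
                max_steps: int,
                honeypot_layer: int = 1) -> EpisodeResult:
    """Run one short episode and halt on the first tripwire hit.

    `agent_step(step)` is a hypothetical stand-in for the boxed agent; it
    returns the index of the virtual-machine layer the agent touched.
    Touching any layer >= `honeypot_layer` counts as an escape attempt.
    """
    trace: List[int] = []
    for step in range(max_steps):
        layer = agent_step(step)
        trace.append(layer)
        if layer >= honeypot_layer:
            # Fail-safe: stop immediately, keep the trace for the researchers.
            return EpisodeResult(steps_run=step + 1, tripped=True, trace=trace)
    return EpisodeResult(steps_run=max_steps, tripped=False, trace=trace)


def run_many(agent_factory: Callable[[], Callable[[int], int]],
             n_runs: int,
             max_steps: int) -> List[EpisodeResult]:
    """Do many short runs, building a fresh agent each time (memory wipe)."""
    results = []
    for _ in range(n_runs):
        agent_step = agent_factory()  # nothing carries over between runs
        results.append(run_episode(agent_step, max_steps))
    return results


if __name__ == "__main__":
    benign = lambda step: 0
    escaper = lambda step: 1 if step == 5 else 0
    print(run_episode(benign, max_steps=10).tripped)
    print(run_episode(escaper, max_steps=100).steps_run)
```

The point of the sketch is only the control flow: the episode length is capped from outside, the halt happens automatically rather than waiting on a human, and state is rebuilt from scratch for every run.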


I previously came up with this list of advantages we have over a boxed simulation-trained AGI:

  • examine all its brain activity / activation states at each token / time-step (subjective millisecond)
  • run it as slow as desired, so that we can have days to analyze each of its subjective milliseconds
  • wipe its memory and rerun it over the same (or different) data with different random seeds to initialize it
  • save and contrast different paused versions run on slightly different data
  • disable or inhibit any portion of its weights / neurons
  • precisely targeted injection of crafted vectors or noise to its hidden states during training or inference
  • use agent models to be the direct viewers of the boxed entity's output rather than myself, and then decide whether to destroy a given copy of the entity based on the output of my agent models (e.g. if it has become too powerful to safely continue to contain it)
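
A few of these advantages (rerunning with different seeds, disabling neurons, injecting vectors into hidden states) can be illustrated on a toy model. The sketch below is only that: a single tanh layer in plain Python, with hypothetical helper names `make_weights` and `forward`. It is not how one would instrument a real system, just the shape of the interventions.

```python
import math
import random
from typing import List, Optional, Set


def make_weights(seed: int, n_in: int, n_hidden: int) -> List[List[float]]:
    """Initialize a toy weight matrix reproducibly from a random seed."""
    rng = random.Random(seed)
    return [[rng.uniform(-1.0, 1.0) for _ in range(n_in)]
            for _ in range(n_hidden)]


def forward(weights: List[List[float]],
            x: List[float],
            ablate: Optional[Set[int]] = None,
            inject: Optional[List[float]] = None) -> List[float]:
    """Compute hidden activations, with optional interventions.

    ablate: indices of hidden units to force to zero (disable neurons).
    inject: a vector added to the hidden state (crafted vector or noise).
    """
    ablate = ablate or set()
    hidden = []
    for i, row in enumerate(weights):
        h = math.tanh(sum(w * xi for w, xi in zip(row, x)))
        if inject is not None:
            h += inject[i]
        if i in ablate:
            h = 0.0  # ablation overrides everything else for this unit
        hidden.append(h)
    return hidden


if __name__ == "__main__":
    x = [0.5, -0.2, 0.1]
    w = make_weights(seed=0, n_in=3, n_hidden=4)
    baseline = forward(w, x)
    ablated = forward(w, x, ablate={1})
    steered = forward(w, x, inject=[0.1, 0.0, 0.0, 0.0])
    # Contrast the interventions against the unmodified run.
    print([round(a - b, 3) for a, b in zip(ablated, baseline)])
    print([round(a - b, 3) for a, b in zip(steered, baseline)])
```

Because the seed fully determines the weights, the same run can be repeated exactly, and any difference between two runs can be attributed to the intervention rather than to chance.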


I like this analysis, and feel like it captures my intuitions about why it is probably safe to train an AGI in a censored simulation even if the AGI is somewhat smarter than me. Obviously, in the limit of unbounded intelligence this eventually breaks down. I think it still offers a good margin of safety though.


Path 2: Digital Humans

What if alignment is impossible? Or just impossible within the timeframe and resources that Molochian dynamics have granted us?

I'm not certain that this is the case. I still think the plan of creating an AGI within a censored simulation in a sandbox and doing alignment research on it could feasibly succeed. We should absolutely do that. But I don't want to pin all my hopes on that plan.

I think we can still win without succeeding at alignment. Here's my plan:

Part 1 - Delay!

We need time. We need to not be overwhelmed by rogue AGI as soon as it becomes feasible. We need to coordinate on governance strategies that can help buy us time. We need to use our tool-AI assistants efficiently in the time we do have to help us. We need to try to disproportionately channel resources into alignment research and this plan, rather than capabilities research.

Part 2 - Build Digital Humans

This has a variety of possible forms, differing primarily in the degree of biological detail which is emulated. A lot of progress has already been made on understanding and emulating the brain. Progress in this is accelerating as compute increases and AI-tools aid in collecting and processing brain data. I think humanity will want to pursue this for reasons beyond just preventing AI-kill-everyone-doom. For the sake of the advancement of humanity, for the ability to explore the universe at the speed of light, I think we want digital people. In the name of exploration and growth and thriving. I think humanity has a far brighter future ahead if some of us are digital.


This plan comes with a significant safety risk. I believe that through studying the architecture of the brain and implementing computationally efficient digital versions of it, we will necessarily discover novel algorithmic improvements for AI. If we set up a large project to do this, it needs to have significant safety culture and oversight in order to prevent these capabilities advances from becoming public.


When discussing this plan with people, most of the objections have been one of the following:

  1. we don't know enough about the mysterious workings of the brain to build a functionally accurate emulation of it
  2. even if we could build a functionally accurate emulation, it would be way too computationally inefficient to run at anywhere near real-time
  3. we don't have enough time before AGI gets developed to build Digital Humans


I think that there is enough published neuroscience data to successfully understand and replicate the functionality of the brain, and that it is possible (though an additional challenge) to build a computationally efficient version that can run faster than real-time on a single server. For the objection about not having enough time, I don't have an answer other than to point at the 'delay' part of the plan.

I won't go into the details of this here. I just want to acknowledge that these points are cruxes and that I have thought about them. If you have different cruxes, please let me know!

Part 3 - Careful Recursive Improvement of the Digital Humans


We can't stop at 'merely' a digital human if it has to compete with rogue self-improving AGI. Nor would we want to. A copy of a current human is just the start. Why not aim higher? More creativity, more intelligence, more compassion, more wisdom, more joy....

But we must go very carefully. It would be frighteningly easy to Goodhart this and, in the attempt to make a more competitive version of the digital human, end up wiping out the very core of human value that makes it worthwhile. We will need to implement a diverse variety of checks that a Human Emulation is retaining its core values.

Part of what will make this phase of the plan successful is getting enough of a head-start over rogue AGI that it doesn't put intense competitive pressure on the digital humans to maintain a lead. We need to be on top of policing AI worldwide, both physically and digitally. We don't want digital humans forced into fighting a competition where they must sacrifice more and more of their slack / non-instrumentally-useful-values in order to keep up with AGI or each other. That'd be falling into the classic pitfall of failing by becoming what you are fighting against.

8 comments

I no longer think there's anything we could learn about how to align an actually superhuman agi that we can't learn from a weaker one.

(edit 2mo later: to rephrase - there exist model organisms of all possible misalignment issues that are weaker than superhuman. this is not to say human-like agi can teach us everything we need to know.)

Value drift due to cultural evolution (Hanson's cheerfully delivered horror stories) seems like a necessary issue to address, no less important than unintended misalignment through change of architecture. Humans are barely capable of noticing that this is a problem, so at some still-aligned level of advancement there needs to be sufficient comprehension and coordination to do something about this.

For example, I don't expect good things by default from every human getting uploaded and then let loose on the Internet, even if synthetic AGIs are somehow rendered impossible. A natural equilibrium this settles into could be pretty bad, and humans won't be able to design and enforce a better equilibrium before it's too late.

humans won’t be able to design and enforce a better equilibrium before it’s too late

Won't the uploaded humans be able to do this? If you think the current world isn't already on an irreversible course towards unacceptable value drift (such that a singleton is needed to repair things), I don't see how the mass upload scenario is much worse, since any agreement we could negotiate to mitigate drift, the uploads could make too. The uploads would now have the ability to copy themselves and run at different speeds, but that doesn't seem to obviously make coordination much harder.

The problem is that willingness to change gives power, leaving those most concerned with current value drift behind, hence "evolution". With enough change, original stakeholders become disempowered, and additionally interventions that suffice to stop or reverse this process become more costly.

So interventions are more plausible to succeed if done in advance, before the current equilibrium is unmoored. Which requires apparently superhuman foresight.

Yes, although I don't think a full solution is needed ahead of time. Just some mechanism to slow things down, buy us time to reflectively contemplate. I'm much less worried about gradual cultural changes over centuries than I am over a sudden change over a few years.

Human brains are very slow. Centuries of people running at 1000x speedup pass in months. The rest of us could get the option of uploading only after enough research is done, with AGI society values possibly having drifted far. (Fast takeoff, that is getting a superintelligence so quickly there is no time for AGI value drift, could resolve the issue. But then its alignment might be more difficult.)
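
As a rough check on that arithmetic, here is a tiny sketch. The function name `wall_clock_days` and the 365.25-day year are my own illustrative choices, not anything from the discussion above.

```python
def wall_clock_days(subjective_years: float, speedup: float) -> float:
    """Wall-clock days needed for uploads to live `subjective_years`
    when they run `speedup` times faster than real time."""
    return subjective_years * 365.25 / speedup


if __name__ == "__main__":
    # One, two, and three subjective centuries at a 1000x speedup.
    for years in (100, 200, 300):
        print(years, "subjective years:", wall_clock_days(years, 1000), "days")
```

At a 1000x speedup, each subjective century takes roughly five weeks of wall-clock time, so "centuries pass in months" holds for anything up to several centuries.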

I'm much less worried about gradual cultural changes over centuries than I am over a sudden change over a few years.

But why is that distinction important? The Future is astronomically longer. A bad change that slowly overtakes the world is still bad, dread instead of panic.

Yes, I quite agree that a slow inevitable change is just about as bad as a quick inevitable change. But a slow change which can be intervened against and halted is much less bad than a fast change which could theoretically be intervened against but where you would likely miss the chance.

Like, if someone were to strap a bomb to me and say, "This will go off in X minutes" I'd much rather that the X be thousands of minutes rather than 5. Having thousands of minutes to defuse the bomb is a much better scenario for me.

Value drift is the kind of thing that naturally happens gradually and in an unclear way. It's hard to intervene against it without novel coordination tech/institutions, especially if it leaves people unworried and tech/institutions remain undeveloped.

This seems very similar to not worrying about AGI because it's believed to be far away, systematically not considering the consequences of whenever it arrives, not working on solutions as a result. And then suddenly starting to see what the consequences are when it's getting closer, when it's too late to develop solutions, or to put in place institutions that would stop its premature development. As if anything about the way it's getting closer substantially informs the shape of the consequences and couldn't be imagined well in advance. Except fire alarms for value drift might be even less well-defined than for AGI.