Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

Produced during the Stanford Existential Risk Initiative (SERI) ML Alignment Theory Scholars (MATS) Program of 2022, under John Wentworth.


“Overconfidence in yourself is a swift way to defeat.”

- Sun Tzu


TL;DR: Escape into the Internet is probably an instrumental goal for an agentic AGI. An incompletely aligned AGI may escape prematurely, and the biggest failure mode for this is probably the AGI socially engineering the alignment researchers. Thus, opening an additional information channel between the researchers and the AGI (e.g., adding an interpretability tool and/or researcher) is inherently risky. The expected cost of adding this channel may even exceed the expected scientific benefit. Whether this is true depends on the informational efficiency of the channel (which increases the expected scientific benefit), the level of the AGI’s capabilities (which increases the risk of its premature escape), and the security level of the AGI’s box setup (which decreases the risk).


Post: Suppose we are building a machine-learning model that has a high chance of becoming Artificial General Intelligence (AGI). We are trying to align it, but we’re uncertain about whether our training is working. Also, rapidly increasing capabilities may push us into a Sharp Left Turn scenario. How do we handle this volatile situation as safely as possible? We will need to adopt a security mindset, aimed at effectively preventing catastrophic outcomes.

We propose a procedure (inspired by the pioneering work of Roman Yampolskiy) for how to train the AGI towards alignment in a secure and effective manner. Most of the ideas here are not new. Nevertheless, we hope that our game-theoretic framing of the problem will be able to help alignment researchers better plan their research directions and future alignment efforts.

Imagining the Scenario

Consider a team of researchers who are trying to build an aligned AGI. This situation can be represented by the following simplified model.

Definition: A “box” is defined by a collection of failsafes designed to confine a system away from the outside world. The empty collection, containing zero failsafes, is called the “trivial box” (i.e., the AI is allowed unrestricted access to send and receive data on the Internet).

Step 1 - The AI is trained

The AGI emerges in a box. The precise time of the AGI's emergence and/or its superhuman capabilities probably cannot be predicted, though the research team may have a rough estimate.

Step 2 - The AGI may try to escape

At any point in the training or testing process, the AGI may attempt to break out of the box. If all failsafes of the box are breached before the team notices, the AGI escapes and the game is over (Step 4). One example is the scenario in which the AGI hacks its way onto the Internet. Another is the scenario in which the AGI releases power-seeking malware agents onto the Internet.

If the AGI fails to escape, the researchers may decide to retrain the AGI from scratch, to fine-tune it (which is less computationally expensive but may fail to remove the AGI’s escape intent/misalignment), or to table the project entirely.

The trivial box cannot prevent escapes.

Step 3 - The researchers run tests on the AGI

In our simplified model, researchers open a collection of information channels between themselves and the AGI in the box. As noted by Yampolskiy, the information from these channels is dual-use:

  1. The human researchers can use the channel to probe various properties of the AGI, to help them train it to be more aligned. In addition to advancing the science of alignment, the channel can also be used for scientific benefits in general.
  2. The AGI can use the channel to socially engineer the researchers on the other end. Socially engineered researchers can then help the AGI break the failsafes, which makes the AGI’s escape easier.

Consequently, adding information channels is one of the strongest ways to weaken failsafes.

An inadequate box setup could also allow the AGI to open information channels unintended by the research team. The AGI might do so by hacking, by using its internal circuits to send radio waves, or even by varying the speed of the cooling fan to send messages in Morse code.

However, suppose that the box setup uses theoretically robust cybersecurity, combined with an actual physical box that is designed to not let any covert information enter or leave. In this case, risks of human social engineering would mostly arise from the information channels that the researchers themselves choose to open.

Step 4 - Game Over: The AGI is deployed or escapes

The game ends when the AGI escapes or is deployed outside the box. This terminally changes future utility based on how aligned or misaligned the AGI is. We want to maximise the chance that the AGI is aligned in the very likely scenario in which it eventually leaves the box.

What will happen?

It seems that deletion, escape, or deployment of the AGI is almost certain to happen eventually. Escaping from the box is likely to be an instrumental goal of the AGI. Also, the researchers will likely eventually be forced by overconfident management to remove the failsafes (i.e., deploy the AI), even if the AGI is not yet aligned.

Finally, if multiple AI organisations perceive a race towards AGI deployment, this would further strengthen their incentives to deploy prematurely. A superintelligent AGI, once on the Internet, will likely change the world in an irreversible manner, in the direction of whatever mysterious goals it evolved during its training. We can, however, guess that one of its instrumental goals will be preventing the emergence of competing AGIs with different goals. AGI creators who are overconfident about their alignment ability could perceive this as a huge first-mover advantage, which would motivate them to deploy prematurely. This in turn forces the non-overconfident AGI creators to choose between also deploying prematurely or being too late: the Unilateralist’s Curse.

Recall that

  1. we have made little progress in how to align an AGI,
  2. a misaligned AGI would probably be a power-seeking risk endangering all of humanity, and
  3. it seems plausible that a superintelligent AGI could emerge in the near-term future.

Thus, it is essential that there is (1) coordination between AI organisations and (2) AI governance towards safe AI, rather than a race towards premature deployment.

But this is probably difficult to achieve, so we will assume that the AGI is deployed in a finite time T unless otherwise stated.

An example alignment plan

Suppose a leading AI organisation (say, on the cusp of being the first to develop an AI agent) approached us today and asked us how to align their AI agent. We would probably tell them that very little is known about how to align an AI agent, that it would probably endanger humanity, and that they shouldn’t build it (at least before the science of alignment is solved).

But if they insist that they will build it anyway even without our input, and there seems to be no way to convince them otherwise, we might have to provide them with a reasonable alignment plan. Such a plan would aim to maximise expected utility, which might look something like this:

  • Train the AI agent in a secure box until it has high capabilities.
  • Run some automated tests that have very high per-bit efficiency (e.g., automated tests that give a percentage).
  • If it fails, either retrain or table the model.
  • Run some higher bit-rate experiments and look for very specific signs of unaligned activity.
  • If it fails, either retrain or table the model.
  • Use some coarse interpretability tools to roughly check the goals/values of the model.
  • If it fails, either retrain or table the model.
  • Use finer interpretability tools to look for signs of deception (e.g., Eliciting Latent Knowledge, or ELK).
  • If it fails, either retrain or table the model.
  • Do one of the following. Either (1) repeat the above process indefinitely (gaining scientific benefits along the way) or (2) if all positive-expected-value tests check out, deploy the model online.

(Since we currently know very little about how to align an AI agent, Option 2 would most likely deploy a misaligned, power-seeking risk endangering all of humanity. Thus, we would strongly recommend Option 1 instead, at least until a miraculous breakthrough occurs in the science of alignment: perhaps with the help of the boxed AGI. However, recall that the researchers might be forced to use Option 2 at a finite time T by overconfident management, or by a perceived race to deployment against other AI organisations.)

Observe that the tests proceed from those that are the most scientifically informative per bit to those that are less so. Tests whose per-bit expected scientific benefit is outweighed by the per-bit expected cost (due to the risk of the AGI socially engineering the researchers) are not used.
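As a rough illustration of this ordering, the plan can be sketched in Python. Everything here is hypothetical: the AlignmentTest fields, the per-bit benefit and cost figures, and the run callables are placeholders made up for illustration, not measured quantities or a real testing API.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class AlignmentTest:
    """One information channel opened to the boxed model (hypothetical structure)."""
    name: str
    benefit_per_bit: float      # assumed expected scientific benefit per bit
    cost_per_bit: float         # assumed expected social-engineering cost per bit
    run: Callable[[], bool]     # returns True if the model passes the test

    @property
    def net_per_bit(self) -> float:
        return self.benefit_per_bit - self.cost_per_bit


def plan_tests(tests: List[AlignmentTest]) -> List[AlignmentTest]:
    """Drop tests whose per-bit cost outweighs their benefit, then order the
    rest from most to least scientifically informative per bit."""
    usable = [t for t in tests if t.net_per_bit > 0]
    return sorted(usable, key=lambda t: t.benefit_per_bit, reverse=True)


def run_plan(tests: List[AlignmentTest]) -> str:
    """Run the planned tests in order, stopping at the first failure."""
    for test in plan_tests(tests):
        if not test.run():
            return f"failed '{test.name}': retrain or table the model"
    return "all positive-expected-value tests passed"


# Illustrative numbers only; the lambdas stand in for real test harnesses.
example_tests = [
    AlignmentTest("read raw transcripts", 0.1, 0.5, lambda: True),
    AlignmentTest("coarse interpretability", 2.0, 0.5, lambda: True),
    AlignmentTest("automated percentage test", 9.0, 0.5, lambda: True),
]
print([t.name for t in plan_tests(example_tests)])
print(run_plan(example_tests))
```

In this toy run, reading raw transcripts is excluded because its assumed cost per bit exceeds its assumed benefit per bit, and the remaining tests run from most to least informative.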


1.  Maximise utility-per-bit

When testing the model, we want to maximise the scientific value per bit (benefit-per-bit) compared to the possible social-engineering risk due to those bits (cost-per-bit).

For example, researchers who use high-volume, low-efficiency interpretability tools on the AGI’s middle weights may be opening themselves up to a large risk of being socially engineered. This would constitute a sizable expected cost, since premature escape by an incompletely aligned AGI will likely be catastrophic and much worse than the counterfactual.

In contrast, running an automated test that gives a percentage (e.g., of whether a given property occurs) would be much better per-bit than having people read all the outputs and judge for themselves, since it would greatly reduce the ability of the AGI to perform social engineering attacks. As our confidence in the AGI’s alignment increases, we can slowly increase the bits for which we ask.
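To make the comparison concrete, here is a back-of-the-envelope sketch in the raw-bits framing. A readout restricted to whole percentages has 101 possible values, so it can carry at most log2(101) ≈ 6.66 bits per run. The transcript sizes are made-up numbers, and 8 bits per character is a crude upper bound rather than a measurement of how much information a reader actually absorbs.

```python
import math


def percentage_channel_bits(resolution: int = 100) -> float:
    """Upper bound on the bits in one readout that takes a value in 0..resolution."""
    return math.log2(resolution + 1)


def transcript_channel_bits(n_outputs: int, chars_per_output: int,
                            bits_per_char: float = 8.0) -> float:
    """Crude upper bound on the raw bits in a batch of transcripts read by humans."""
    return n_outputs * chars_per_output * bits_per_char


# Hypothetical batch: 1,000 outputs of 2,000 characters each.
summary_bits = percentage_channel_bits()
transcript_bits = transcript_channel_bits(n_outputs=1000, chars_per_output=2000)
print(f"percentage readout: {summary_bits:.2f} bits")
print(f"raw transcripts:    {transcript_bits:.0f} bits")
print(f"ratio:              {transcript_bits / summary_bits:.0f}x")
```

The gap of several orders of magnitude is the room the AGI would have for a social-engineering attack that a single percentage cannot carry.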

Overall, we should be maximising expected utility, and this does not always mean that we should aim for a complete victory. It would be great if we could prevent deceptive misalignment in a guaranteed manner. But this could require an inordinate amount of information, especially for complex evolved agents. The expected scientific benefit of the likely inefficient information channel required to get this information may be outweighed by its expected cost. A better course of action would be to use information channels that are more efficient on expectation, even if they may not achieve the best-case scenario of guaranteeing no deceptive misalignment. If the highest-expected-utility training plan we can realistically think of—for a given box setup—is to mostly rely on parsimonious summary statistics and hope to get lucky, then so be it.

Conversely, maximising expected utility does not always mean that we should give up on boxing once the AGI has superhuman intelligence. There are likely AI box designs that are both informative and secure against even the most intelligent AI (e.g., a securely boxed truth oracle with no output channels other than answering multiple-choice questions, whose answers are patiently and extensively audited by many human experts). Given that even a hypothetically aligned AI could phase-transition into misalignment after it increases in capabilities, running high-efficiency tests afterwards in an optimised box setup could potentially be very informative, and well worth considering.

Note: Our framework of raw computer bits, while easy to exposit, might not be the most accurate framing of the problem. For one thing, the same amount of information can be riskier or less risky depending on the rate at which it is audited by the researchers. This is because humans are more likely to make well-calibrated decisions when they have sufficient time to think things through. For another, the AGI might be able to robustly mask its intent with respect to some information channels, but not others. Incorporating these and other considerations may lead to a more realistic generalisation of our model.

2.  When to open or close information channels

We should not open an information channel unless it has a positive expected utility.

If the AI being studied has very low capabilities, then it should not be too risky to open additional channels of information. In fact, doing so might be beneficial, since it could provide valuable information about how a future AGI of a similar architecture might work.

Once you have a highly capable AI, however, the research team should be more cautious about their channels of information. In fact, they might want to consider closing their low-efficiency, high-risk information channels as soon as they suspect that the AI could make a Sharp Left Turn (to superhuman capabilities) with a non-negligible probability: even if the probability is low. This is because if the AGI has superhuman intelligence, then it could be very skilled at both social engineering and hacking: particularly if it is a Large Language Model trained on code, human writing, and/or images. We should plan for the plausible case that the AGI will effectively exploit its information channels to learn about, outmanoeuvre, and manipulate humans to escape.

Thus, we should assume that the cost-per-bit of information is quite high. This leads us to a counterintuitive rule-of-thumb:

For a sufficiently capable model, unless the informational benefit is sufficiently high (relative to the riskiness of the box setup), an interpretability tool and/or alignment researcher should be removed, not added, on the margin.
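A toy version of this rule of thumb, under loudly simplified assumptions: the escape probability grows linearly with the number of bits read, scales with a per-bit manipulation risk that stands in for the model's capabilities, and is discounted by a box-security factor between 0 and 1. None of these functional forms or numbers comes from measurement; they only illustrate how the same channel can flip from worth opening to worth closing as capabilities rise.

```python
def channel_expected_utility(
    bits: float,
    benefit_per_bit: float,
    manipulation_risk_per_bit: float,
    box_security: float,
    escape_loss: float,
) -> float:
    """Expected utility of keeping one channel open (toy linear model, assumed)."""
    # Probability that this channel leads to a premature escape, capped at 1.
    escape_prob = min(1.0, bits * manipulation_risk_per_bit * (1.0 - box_security))
    return bits * benefit_per_bit - escape_prob * escape_loss


def should_keep_open(**kwargs: float) -> bool:
    """A channel stays open only if its expected utility is positive."""
    return channel_expected_utility(**kwargs) > 0


# Same channel and same box; only the assumed capability level differs.
channel = dict(bits=1e6, benefit_per_bit=1e-6, box_security=0.9, escape_loss=1e4)
weak_model = should_keep_open(manipulation_risk_per_bit=1e-12, **channel)
strong_model = should_keep_open(manipulation_risk_per_bit=1e-8, **channel)
print(weak_model, strong_model)  # prints: True False
```

Raising box_security towards 1 shrinks the cost term, which is the sense in which a better box makes more channels viable.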

3.  Social engineering is probably the biggest failure mode

This is true even of cybersecurity today, whose adversaries are just human hackers, rather than an AGI with superintelligent capabilities in human social engineering.

Even today's AI, neither superintelligent nor agentic, is capable enough to socially engineer humans. While conversing with Google's LaMDA model as a tester, Google software engineer Blake Lemoine was led to incorrectly believe that LaMDA was just “a sweet kid,” to share confidential information about LaMDA publicly, to invite a lawyer to represent LaMDA, to allege to a member of the House Judiciary Committee that Google’s activities were unethical, and eventually to get fired from his dream job at Google for his aggressive actions.

This social engineering occurred by accident rather than by the AI's agency (LaMDA is most likely neither sentient as Lemoine claimed, nor agentic, nor of superhuman general capabilities). But the following is clear. An additional person looking at a superintelligent AGI's output, even a well-informed and well-meaning alignment researcher, constitutes an additional failure point.

Directions for Research

1.  Optimise AI boxing

The most obvious way to bring down expected costs, and thereby make all information channels more viable, is to improve AI boxing. This would entail empirically measuring both the level of security and the level of informativeness of various box setups. We can then work towards answering the following questions:

For a given maximum acceptable probability of AI escape, which box setup maximises the expected scientific benefit while keeping the expected escape risk at or below that level?

For a given minimum level of scientific benefit, which box setup minimises the expected AI escape risk while keeping the expected scientific benefit at or above that level?
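If security and informativeness could be measured for each candidate box setup, the two questions above become a pair of constrained selections. A minimal sketch, assuming each setup has already been reduced to an estimated (benefit, escape risk) pair; the setup names and numbers are invented for illustration:

```python
from typing import NamedTuple, Optional, Sequence


class BoxSetup(NamedTuple):
    name: str
    expected_benefit: float   # assumed estimate of scientific benefit
    escape_risk: float        # assumed estimate of escape probability


def most_informative(setups: Sequence[BoxSetup], risk_cap: float) -> Optional[BoxSetup]:
    """First question: maximise benefit subject to escape_risk <= risk_cap."""
    allowed = [s for s in setups if s.escape_risk <= risk_cap]
    return max(allowed, key=lambda s: s.expected_benefit, default=None)


def most_secure(setups: Sequence[BoxSetup], benefit_floor: float) -> Optional[BoxSetup]:
    """Second question: minimise escape risk subject to benefit >= benefit_floor."""
    allowed = [s for s in setups if s.expected_benefit >= benefit_floor]
    return min(allowed, key=lambda s: s.escape_risk, default=None)


# Hypothetical candidates with made-up estimates.
candidates = [
    BoxSetup("multiple-choice oracle", 3.0, 0.001),
    BoxSetup("summary statistics only", 5.0, 0.01),
    BoxSetup("full transcript access", 8.0, 0.2),
]
print(most_informative(candidates, risk_cap=0.05).name)
print(most_secure(candidates, benefit_floor=2.0).name)
```

Both functions return None when no setup satisfies the constraint, which corresponds to the case where no known box is good enough for the chosen risk or benefit level.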

Excellent ideas on pragmatic AI boxing (e.g., by Roman Yampolskiy, David Chalmers, and others) await implementation, redteaming, and optimisation. They include:

  1. putting the AGI in a virtual machine’s test environment that looks naturally occurring rather than designed, 
  2. using homomorphic encryption to obstruct the AGI from analysing its own source code, and
  3. training the alignment researchers to withstand social engineering.

Other promising failsafes aimed at reducing risks of social engineering include multi-factor authentication, time delays for important decisions, and collective rather than unilateral decision-making. Finally, impeding AI escape with a large and diverse collection of cybersecurity failsafes (some of which can be nested multiple times) is probably ideal from a security-mindset perspective.

Given that human social engineering is probably the biggest failure mode, it seems especially promising to systematise and optimise Eliezer Yudkowsky’s pioneering roleplaying experiments of AI escape, which could help redteam how vulnerable various box setups are to human social engineering. One way to adapt Yudkowsky’s setup to our game-theoretic framing could be to give the “human researcher” a list of scientific questions that, if answered correctly by the “AGI,” would yield positive points. To obtain the answers to these scientific questions, the “human researcher” opens a collection of information channels allowed by the box setup. The “AGI,” whose role can be played by a team of experts in social persuasion, may use these informational channels to escape its box. Its escape, if successful, would apply a substantial negative penalty to the score of the “human researcher.” The average score (over multiple AI escape games) may then be informative of how robust the given box setup is against human social engineering.
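The scoring rule for such a game could be as simple as the following sketch. The point values, the escape penalty, and the GameRecord structure are hypothetical choices made for illustration only.

```python
from dataclasses import dataclass
from statistics import mean
from typing import List


@dataclass
class GameRecord:
    """Outcome of one roleplayed escape game (hypothetical structure)."""
    correct_answers: int   # scientific questions the "AGI" answered correctly
    escaped: bool          # whether the "AGI" team talked its way out of the box


def game_score(record: GameRecord, points_per_answer: float = 1.0,
               escape_penalty: float = 100.0) -> float:
    """Score for the "human researcher" in a single game."""
    score = record.correct_answers * points_per_answer
    if record.escaped:
        score -= escape_penalty
    return score


def box_robustness(records: List[GameRecord]) -> float:
    """Average researcher score over many games played in the same box setup."""
    return mean(game_score(r) for r in records)


# Three made-up games in one box setup; the third ends in an escape.
games = [GameRecord(5, False), GameRecord(8, False), GameRecord(10, True)]
print(box_robustness(games))
```

A single escape dominates the average here because the penalty is set far above the points available per game, mirroring the assumption that a premature escape is much worse than any forgone scientific benefit.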

In a sense, we are describing a Box Contest (analogous to the ELK Contest). First, systematise a way to empirically estimate the robustness of various box setups. Then, get as many people as we can to brainstorm new box setups, so that the best such setup may eventually be found.

2.  Get AI organisations to box their research environments

A box is only helpful if it is used. So, we should get AI organisations to envelop their neural-net research environments in optimised box setups. 

Suppose all AI organisations with the capacity to build an AGI agree to preemptively secure their research environments with our box setups. These setups would have been empirically selected to maximise the expected scientific benefit obtainable from taking on a certain level of AI escape risk (the risk level would be shared across all participating AI organisations, so as to discourage defection). This would substantially reduce the perceived race towards premature AGI deployment, thereby enhancing our prospects for solving alignment beforehand.

For this to be achieved, it seems that (1) better coordination between AI organisations and (2) better AI governance would be crucial. This is probably difficult, but not impossible. We would really appreciate hearing any creative suggestions you may have about how to make this happen!

3.  Develop informationally efficient tests for alignment

For an information channel to be helpful rather than counterproductive, it needs to be sufficiently efficient (compared to the riskiness of the box setup). So, we should prioritise quality over quantity. 

Some potentially efficient tests for alignment include:

  • Automated summary statistics regarding the AGI’s behaviour. Running the AGI in a realistic test environment and outputting summary statistics of its overall behaviour seems likely to be scientifically informative on a per-bit basis.
  • Coarse-grained interpretability tools. If we had interpretability tools that could look at higher-level features of the model, this could reduce the amount of information we need to understand it deeply.
  • Interpretability tools that find specific things. Examples of this approach include ELK and natural abstractions research.

4.  Better understand what channels to use and when

It would be helpful to know how to, for any given situation, empirically estimate how much utility-per-bit a given interpretability tool would yield. This would differentiate the helpful interpretability tools (positive utility-per-bit) from the counterproductive ones (negative utility-per-bit).

Note that doing this accurately is probably easier said than done. Predicting a priori the scientific benefit of a given collection of experiments is notoriously difficult. Most likely, predicting the magnitude of the human-social-engineering risk posed by an AGI—of an unknown, potentially superhuman level of capabilities—would also be quite difficult. Even a small advance in our ability to predict either the expected scientific benefit or the expected cost (of adding a given interpretability tool and/or researcher) would likely be a hard-won victory.

5.  Study generalisable facts about existing systems

The risks of human social engineering and of AI escape are probably quite low at our current level of AI capabilities. Thus, if we can probe for patterns shared by current low-risk models that—as far as we know—seem to generalise across the different capability levels in our historical record, then we can hope that these patterns might also generalise to high-risk AGI models in the future. If this turns out to be true, then we would have succeeded in obtaining high-value information without taking on a significant risk.

The downside, of course, is that we a priori have no guarantee that any pattern shared by low-capabilities models will generalise to the high-capabilities regime. Thus, we should also consider using informationally efficient tests on our potentially superintelligent AGI (even if it might be risky) to probe whether the earlier pattern we found (1) remains true or (2) has phase-transitioned into something completely different.


Epistemic status: We still have relatively high uncertainty in our scenario. However, we hope that our attempt at formulating the problem from a fresh perspective can help alignment researchers contemplate what we think are potentially promising but often overlooked research directions. 


Acknowledgements: We are grateful to Matthias Mayer and David Udell for their helpful feedback on the draft.


Love the Box Contest idea. AI companies are already boxing models that could be dangerous, but they've done a terrible job of releasing the boxes and information about them. Some papers that used and discussed boxing:

  • Section 2.3 of OpenAI's Codex paper. This model was allowed to execute code locally. 
  • Section 2 and Appendix A of OpenAI's WebGPT paper. This model was given access to the Internet. 
  • Appendix A of DeepMind's GopherCite paper. This model had access to the Internet, and the authors do not even mention the potential security risks of granting such access. 
  • DeepMind again giving access to the Google API without discussing any potential risks. 

The common defense is that current models are not capable enough to write good malware or interact with search APIs in unintended ways. That might well be true, but someday it won't be, and there's no excuse for setting a dangerous precedent. Future work will need to set boxing norms and build good boxing software. I'd be very interested to see follow-up work on this topic or to discuss with anyone who's working on it. 

Yes. I strongly suspect a model won't need to be anywhere close to an AGI before it's capable of producing incredibly damaging malware. 

“We will need to adopt a security mindset, aimed at effectively preventing catastrophic outcomes.”

You might want to read Yudkowsky's post on security mindset:

coral:  An ordinary paranoid programmer imagines that an adversary might try to read the file containing all the usernames and passwords. They might try to store the file in a special, secure area of the disk or a special subpart of the operating system that's supposed to be harder to read. Conversely, somebody with security mindset thinks, “No matter what kind of special system I put around this file, I'm disturbed by needing to make the assumption that this file can't be read. Maybe the special code I write, because it's used less often, is more likely to contain bugs. Or maybe there's a way to fish data out of the disk that doesn't go through the code I wrote.”

amber:  And they imagine more and more ways that the adversary might be able to get at the information, and block those avenues off too! Because they have better imaginations.

coral:  Well, we kind of do, but that's not the key difference. What we'll really want to do is come up with a way for the computer to check passwords that doesn't rely on the computer storing the password at all, anywhere.

Point is: "build ever-fancier boxes" is exactly the sort of thing security mindset is not about. What we really want to do is to not build a thing which needs to be boxed in the first place.

... and yeah, we'll still probably put it in a box, for the same reason that keeping password hashes secure is a good idea. We might as well. But that's not really where the bulk of the security comes from.

“We'll still probably put it in a box, for the same reason that keeping password hashes secure is a good idea. We might as well. But that's not really where the bulk of the security comes from.”

This seems true in worlds where we can solve AI safety to the level of rigor demanded by security mindset. But lots of things in the world aren’t secure by security mindset standards. The internet and modern operating systems are both full of holes. Yet we benefit greatly from common sense, fallible safety measures in those systems.

I think it’s worth working on versions of AI safety that are analogous to boxing and password hashing, meaning they make safety more likely without guaranteeing or proving it. We should also work on approaches like yours that could make systems more reliably safe, but might not be ready in time for AGI. Would you agree with that prioritization, or should we only work on approaches that might provide safety guarantees?

It's not a question of "making safety more likely" vs "guarantees". Either we will basically figure out how to make an AGI which does not need a box, or we will probably die. At the point where there's an unfriendly decently-capable AGI in a box, we're probably already dead. The box maybe shifts our survival probability from epsilon to 2*epsilon (conditional on having an unfriendly decently-capable AGI running). It just doesn't increase our survival probability by enough to be worth paying attention to, if that attention could otherwise be spent on something with any hope at all of increasing our survival probability by a nontrivial amount.

The main reason to bother boxing at all is that it takes relatively little marginal effort. If there's nontrivial effort spent on it, then that effort would probably be more useful elsewhere.

“Either we will basically figure out how to make an AGI which does not need a box, or we will probably die.”

AGI will likely be DL based, and like just about any complex engineered system, it will require testing. Only fools would unbox AGI without extensive alignment testing (yes, you can test for alignment, but only in sim boxes where the AGI is not aware of the sim).

So boxing (simboxing) isn't some optional extra safety feature; it is an absolutely essential core component.

“At the point where there's an unfriendly decently-capable AGI in a box, we're probably already dead.”

Nah, human-level AGI isn't much of a risk, as long as it's in a proper simbox. Most of the risk comes from our knowledge, i.e., the internet.

I'm not saying one should forgo the box. I'm saying the box does not shift our survival probability by very many bits. If our chances of survival are low without the box, they're still low with it.

Whether the boxed AI is capable enough to break out of the box isn't even particularly relevant; the problem is relying on iterative design to achieve alignment in the first place.

I actually agree with the preamble of that post:

“In worlds where AI alignment can be handled by iterative design, we probably survive. So long as we can see the problems and iterate on them, we can probably fix them, or at least avoid making them worse.”

“By the same reasoning: worlds where AI kills us are generally worlds where, for one reason or another, the iterative design loop fails.”

So far, so good. Notice here you are essentially saying that iterative design is all important, and completely determines survival probability - the opposite of "does not shift our survival probability by very many bits".

“So, if we want to reduce X-risk, we generally need to focus on worlds where the iterative design loop fails for some reason; in worlds where it doesn’t fail, we probably don’t die anyway.”

Or you could do the obvious thing and ... focus on ensuring you can safely iterate.

Fast takeoff is largely ruled out by physics, and moreover can be completely constrained in a simbox.

Eventual deceptive alignment is the main failure mode, and that's specifically what simboxing is for. (The AI can't deceive us if it doesn't believe we exist).

But also, just to clarify, simboxing is essential because it enables iteration, but simboxing/iteration itself isn't an alignment solution.

“Fast takeoff is largely ruled out by physics, and moreover can be completely constrained in a simbox.”

This comment prompted me to finally read the linked post, which was very enjoyable, well done. It seems to have little-to-nothing to do with fast takeoff, though; I don't think most people who expect fast takeoff primarily expect it to occur via more efficient compute hardware or low-level filter circuits. Those with relatively conservative expectations expect fast takeoff via the ability to spin up copies, which makes coordination a lot easier and also saves needing each agent to independently study and learn existing information. Those with less conservative expectations also expect algorithmic improvements at the higher levels, more efficient information gathering/attention mechanisms, more efficient search/planning, more generalizable heuristics/features and more efficient ways to detect them, etc.


Fast takeoff traditionally implies time from AGI to singularity measured in hours or days, which you just don't get with merely mundane improvements like copying or mild algorithmic advances. EY (and perhaps Bostrom to some extent) anticipated fast takeoff explicitly enabled by many orders of magnitude (OOM) of brain inefficiency, such that the equivalent of many decades of Moore's Law could be compressed into mere moments. The key rate limiter in these scenarios ends up being the ability to physically move raw materials through complex supply-chain processes to produce more computing substrate, which is bypassed through the use of hard drexlerian nanotech.

But it turns out that biology is already near optimal-ish (cells in particular already are essentially optimal nanoscale robots; thus drexlerian nanotech is probably a pipe dream), so that just isn't the world we live in.

Quoting Yudkowsky's Intelligence Explosion Microeconomics, page 30:

What sort of general beliefs does this concrete scenario of “hard takeoff” imply about returns on cognitive reinvestment? It supposes that:

  • An AI can get major gains rather than minor gains by doing better computer science than its human inventors.
  • More generally, it’s being supposed that an AI can achieve large gains through better use of computing power it already has, or using only processing power it can rent or otherwise obtain on short timescales—in particular, without setting up new chip factories or doing anything else which would involve a long, unavoidable delay.
  • An AI can continue reinvesting these gains until it has a huge cognitive problem-solving advantage over humans.

... so Yudkowsky's picture of hard takeoff explicitly does not route through inefficiency in the brain's compute hardware, it routes through inefficiency in algorithms. He's expecting the Drexlerian nanotech to come at the end of the hard takeoff; nanotech is not the main mechanism by which hard takeoff is enabled.

The core idea of hard takeoff is that algorithmic advances can get to superintelligence without needing to build lots of new hardware. Your brain efficiency post doesn't particularly argue against that.

That quote seemed to disagree so much with my model of early EY that I had to go back and re-read it. And I now genuinely think my earlier summary is still quite accurate.

[pg 29]:

Some sort of AI project run by a hedge fund, academia, Google,[^37] or a government, advances to a sufficiently developed level (see section 3.10) that it starts a string of self-improvements that is sustained and does not level off. This cascade of self-improvements might start due to a basic breakthrough by the researchers which enables the AI to understand and redesign more of its own cognitive algorithms ...

Once this AI started on a sustained path of intelligence explosion, there would follow some period of time while the AI was actively self-improving, and perhaps obtaining additional resources, but hadn’t yet reached a cognitive level worthy of being called “superintelligence.” This time period might be months or years,[^38] or days or seconds.[^39]

At some point the AI would reach the point where it could solve the protein structure prediction problem and build nanotechnology—or figure out how to control atomic force microscopes to create new tool tips that could be used to build small nanostructures which could build more nanostructures—or perhaps follow some smarter and faster route to rapid infrastructure. An AI that goes past this point can be considered to have reached a threshold of great material capability. From this would probably follow cognitive superintelligence (if not already present); vast computing resources could be quickly accessed to further scale cognitive algorithms.

Notice I said "Fast takeoff traditionally implies time from AGI to singularity measured in hours or days, which you just don't get with merely mundane improvements like copying or mild algorithmic advances." - Which doesn't disagree with anything here, as I was talking about time from AGI to singularity, and regardless EY indicates rapid takeoff to superintelligence probably requires drexlerian nanotech.

AGI -> Superintelligence -> Singularity

Also EY clearly sees nanotech as the faster replacement for slow chip foundry cycles, as I summarized:

... Given a choice of investments, a rational agency will choose the investment with the highest interest rate—the greatest multiplicative factor per unit time. In a context where gains can be repeatedly reinvested, an investment that returns 100-fold in one year is vastly inferior to an investment which returns 1.001-fold in one hour. At some point an AI’s internal code changes will hit a ceiling, but there’s a huge incentive to climb toward, e.g., the protein-structure-prediction threshold by improving code rather than by building chip factories.
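
(A quick check of the arithmetic in the quoted passage, as a minimal sketch. It assumes a 365-day year and that every hourly gain is fully reinvested; the function and variable names are illustrative, not from the paper.)

```python
# Compare the two investments from the quoted passage:
#   100-fold per year   vs.   1.001-fold per hour, reinvested every hour.
# Assumption: a non-leap year of 24 * 365 = 8760 hours.

HOURS_PER_YEAR = 24 * 365


def compound(factor_per_period: float, periods: int) -> float:
    """Total multiplicative return after reinvesting for `periods` periods."""
    return factor_per_period ** periods


def break_even_hourly_factor(annual_factor: float) -> float:
    """Hourly factor that compounds to exactly `annual_factor` over one year."""
    return annual_factor ** (1.0 / HOURS_PER_YEAR)


slow_annual = compound(100.0, 1)               # one 100-fold step per year
fast_annual = compound(1.001, HOURS_PER_YEAR)  # 8760 reinvested hourly steps

print(f"100-fold per year, after one year:   {slow_annual:,.0f}x")
print(f"1.001-fold per hour, after one year: {fast_annual:,.0f}x")
print(f"hourly factor matching 100x/year:    {break_even_hourly_factor(100.0):.6f}")
```

Under these assumptions the hourly investment compounds to several thousand-fold over a year, well above 100-fold, and any hourly factor above the break-even value printed on the last line beats the yearly investment.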

Without Drexlerian nanotech to smash through the code-efficiency ceiling, the only alternative is the slower chip foundry route, which of course is also largely stalled if brains are efficient and already equivalent to end-of-Moore's-Law tech.

Regardless, in the brain efficiency post I also argue against many OOM brain software efficiency. (If anything, the brain's incredible data efficiency is increasingly looking like a difficult barrier for AGI)

I think there's some inconsistent usage of "superintelligence" here. IIRC Yudkowsky also mentioned somewhere that he doesn't expect humans to be able to build nanotech any time soon without AGI, therefore presumably he expects the AGI needs to be very superhuman to build nanotech. His fast takeoff scenario therefore involves the AGI reaching very superhuman levels before it starts to invest in manufacturing. But he's using the term "superintelligence" for something quite a bit more powerful than just "very superhuman".

For strategic purposes, it's the weaker version (i.e. "very superhuman") which is mostly relevant.

Regardless, in the brain efficiency post I also argue against many OOM brain software efficiency.

You argued that current DL systems are mostly less data-efficient than the brain (and at best about the same). That is extremely weak evidence that nothing more data-efficient than the brain exists. And you didn't argue at all about any other dimensions of reasoning algorithms - e.g. search efficiency, ability to transport information/models to new domains, model expressiveness, efficiency of plans, coordination/communication, metastuff, etc.

I think you are missing the forest of my argument for its trees. The default hypothesis - the one that requires evidence to update against - is now that the brain is efficient in most respects, rather than the converse.

The larger update is that evolution is both fast and efficient. It didn't proceed through some slow analog of moore's law where some initial terribly inefficient designs are slowly improved. Biological evolution developed near-optimal nanotech quickly, and then slowly built up larger structure. It moved slowly only because it was never optimizing for intelligence at all, not because it is inherently slow and inefficient. But intelligence is often useful so eventually it developed near-optimal designs for various general learning machines - not in humans - but in much earlier brains.

Human brains are simply standard primate brains, scaled up, with a few tweaks for language. The phase transition around human intelligence is entirely due to language adding another layer of systemic organization (like the multicellular transition); due to culture allowing us to learn from all past human experiences, so our (compressed) training dataset scales with our exponentially growing population vs being essentially constant as for animals.

Deep learning is simply reverse engineering the brain (directly and indirectly), and this was always the only viable path to AGI [1]. Based on the large amount of evidence we have from DL and neuroscience it's fairly clear (to me at least) that the brain is also probably near optimal in data efficiency (in predictive gain per bit of sensor data per unit of compute - not to be confused with sample efficiency, which you can always improve at the cost of more compute).

Of course AGI will have advantages (mostly in expanding beyond the limitations of human lifetimes and associated brain sizes and slow interconnect); but overall it's more like the beginning of a cambrian explosion that is a natural continuation of brain biological evolution, rather than some alien invasion.

  1. At this point we have actually heavily explored the landscape of bayesian learning algorithms and huge surprises are unlikely. ↩︎

The default hypothesis - the one that requires evidence to update against - is now that the brain is efficient in most respects, rather than the converse.

I think you have basically not made that case, certainly not to such a degree that people who previously believed the opposite will be convinced. You explored a few specific dimensions - like energy use, heat dissipation, circuit depth. But these are all things which we'd expect to have been under lots of evolutionary pressure for a long time. They're also all relatively "low-level" things, in the sense that we wouldn't expect to need unusually intricate genetic machinery to fine-tune them; we'd expect all those dimensions to be relatively accessible to evolutionary exploration.

If you want to make the case that brain efficiency is the default hypothesis, then you need to argue it in cases where the relevant capabilities weren't obviously under lots of selection pressure for a long time (e.g. recently acquired capabilities like language), or where someone might expect architectural complexity to be too great for the genetic information bottleneck. You need to address at least some "hard" cases for brain efficiency, not just "easy" cases.

Or, another angle: I'd expect that, by all of the efficiency measures in your brain efficiency post, a rat brain also looks near-optimal. Therefore, by analogy to your argument, we should conclude that it is not possible for some new biological organism to undergo a "hard takeoff" (relative to evolutionary timescales) in intelligent reasoning capabilities. Where does that argument fail? What inefficiency in the rat brain did humanity improve on? If it was language, why do you expect that the apparently-all-important language capability is near-optimal in humans? Also, why do we expect there won't be some other all-important capability, just like language was a new super-important capability in the rat -> human transition?

I already made much of the brain architecture/algorithms argument in an earlier post: "The Brain as a Universal Learning Machine".

In a nutshell EY/LW folks got much of their brain model from the heuristics and biases, ev psych literature which is based on the evolved modularity hypothesis, which turned out to be near completely wrong. So just by merely reading the sequences and associated lit LW folks have unfortunately picked up a fairly inaccurate default view of the brain.

In a nutshell the brain is a very generic/universal learning system built mostly out of a few different complementary types of neural computronium (cortex, cerebellum, etc) and an actual practical recursive self improvement learning system that rapidly learns efficient circuit architecture from lifetime experience. The general meta-architecture is not specific to humans, primates, or even mammals, and in fact is highly convergent and conserved - evolution found and preserved it again and again across wildly divergent lineages. So there isn't so much room for improvement in architecture; most of the improvement comes solely from scaling.

Nonetheless there are important differences across the lineages: primates along with some birds and perhaps some octopoda have the most scaling efficient archs in terms of neuron/synapse density, but these differences are most likely due to diverging optimization pressures along a pareto efficiency frontier.

The differences in brain capabilities are then mostly just scaling differences: human brains are just 4x scaled up primate brains, having nearly zero detectable divergences from the core primate architecture (brain size is not a static feature of arch, the arch also defines a scaling plan, so you can think of size as being a tunable hyperparam with many downstream modifications to the wiring prior). Rodent brain arch probably has the worst scaling plan; rodent brains were probably optimized for speed and rarely grew large.

I think Yudkowsky used to expect improvements in the low-level compute too, e.g. "Still, pumping the ions back out does not sound very adiabatic to me?".

Did you actually read the rest of that post? Because the entire point was to talk about ways iterative design fails other than fast takeoff and the standard deceptive alignment story.

Or you could do the obvious thing and ... focus on ensuring you can safely iterate.

The question is not whether one can iterate safely, the question is whether one can detect the problems (before it's too late) by looking at the behavior of the system. If we can't detect the problems just by seeing what the system does, then iteration alone will not fix the problems, no matter how safe it is to iterate. In such cases, the key thing is to expand the range of problems we can detect.

Did you actually read the rest of that post? Because the entire point was to talk about ways iterative design fails other than fast takeoff and the standard deceptive alignment story.

I skimmed the rest, but it mostly seems to be about how particular alignment techniques (e.g. RLHF) may fail, or the difficulty/importance of measurement, which I probably don't have much disagreement with. Also in general the evidence required to convince me of some core problem with iteration would be strictly enormous - as it is inherent to all evolutionary processes (biological or technological).

If we can't detect the problems just by seeing what the system does, then iteration alone will not fix the problems, no matter how safe it is to iterate. In such cases, the key thing is to expand the range of problems we can detect.

Yes. Again (safe) iteration is necessary, but not sufficient. A wind tunnel isn't a solution for aerodynamic control; rather it's a key enabling catalyst. You also need careful complete tests for alignment, various ways to measure it, etc.

There's a difference between "Iterative design" and "Our ability to impact iterative design." I think John is saying in his post that iterative design is an important attribute of the problem (i.e, whether the AI alignment problem is amenable to iterative design) but in the comment above, he's saying iterative design techniques aren't super important, because if iterative design won't work, they're useless - and if iterative design will work, we're probably okay without the box anyway, even though of course we should still use it.

Which is ridiculous because it is the simbox alone which allows iterative design.

Replace "AI alignment" with flight and "box" with windtunnel:

There's a difference between "Iterative design" and "Our ability to impact iterative design." I think John is saying in his post that iterative design is an important attribute of the problem (i.e, whether the flight control problem is amenable to iterative design) but in the comment above, he's saying iterative design techniques aren't super important, because if iterative design won't work, they're useless - and if iterative design will work, we're probably okay without the wind tunnel anyway, even though of course we should still use it.

The wind tunnel is not a great analogy here since it fails to get at the main disagreement - if you test an airplane in a wind tunnel and it fails catastrophically, it doesn't then escape the wind tunnel and crash in real life. Given that, it is safe to test flight methods in a wind tunnel and build on them iteratively. (Note: I'm not trying to be pedantic about analogies here - I believe that the wind tunnel argument fails to replicate the core disagreement between my understanding of you and my understanding of John)

John says "Either we will basically figure out how to make an AGI which does not need a box, or we will probably die. At the point where there's an unfriendly decently-capable AGI in a box, we're probably already dead." My understanding is that John is quite pessimistic about an AGI being containable by a simbox if it is otherwise misaligned. If this is correct, that makes the simbox relatively unimportant - the set of AGIs that are safe to deploy into a simbox and unsafe to deploy into the real world is very small, and that's why it doesn't shift survival probability very much.

It would still be dumb not to use it, because any tiny advantage is worth taking, but it's not going to be a core part of the solution to alignment - we should not depend on a solution that plans on iterating in the simbox until we get it right. As opposed to a wind tunnel, where you can totally throw a plane in there and say "I'm pretty sure this design is going to fail, but I want to see how it fails" and this does not, in fact, cause the plane to escape into the real world and destroy the world.

Now, you might think that a well-designed simbox would be very likely to keep a potentially misaligned AGI contained, and thus the AI alignment problem is probably amenable to iterative design. That would then narrow down the point of disagreement.

Yes, a well designed simbox can easily contain AGI, just as you or I aren't about to escape our own simulation.

Containment actually is trivial for AGI that are grown fully in the sim. It doesn't even require realism: you can contain AGI just fine in a cartoon world, or even a purely text based world, as their sensor systems automatically learn the statistics of their specific reality and fully constrain their perceptions to that reality.

I strongly agree with John that “what we really want to do is to not build a thing which needs to be boxed in the first place.” This is indeed the ultimate security mindset.

I also strongly agree that relying on a “fancy,” multifaceted box that looks secure due to its complexity, but may not be (especially to a superintelligent AGI), is not security mindset.

One definition of security mindset is “suppose that anything that could go wrong, will go wrong.” So, even if we have reason to believe that we’ve achieved an aligned superintelligent AGI, we should have high-quality (not just high-quantity) security failsafes, just in case our knowledge does not generalize to the high-capabilities domain. The failsafes would help us efficiently and vigilantly test whether the AGI is indeed as aligned as we thought. This would be an example of a security mindset against overconfidence in our current assumptions.

I have a pretty boring terminological comment: as far as I understand, what you're proposing is not a method of aligning ("directing") AI systems, but for opposing them.

However, suppose that the box setup uses theoretically robust cybersecurity, combined with an actual physical box that is designed to not let any covert information enter or leave.

I think what you want to say here is:

However, suppose that the box setup uses robust cybersecurity, combined with an actual physical box that does not let any covert information enter or leave.

  1. "...theoretically..." weakens rather than strengthens: we need the cybersecurity to be robust, in spite of implementation details.
  2. It doesn't matter what the box "is designed to" do; it matters what it does.

In both cases, this puts us in trouble:
We can check the cybersecurity is theoretically robust; we can't check it's robust.
We know what the box is designed to do; we don't know what it does.

These holes probably aren't too significant for superintelligent AGI (which we should expect to persuade us anyway). They may well be significant for the kinds of about-human-level systems where you're hoping boxing might help.

On the rest, I'd largely agree with John et al. We might as well use boxing, but we shouldn't be spending much effort on it when we could make more important progress.

To quote Aella from (emphasis mine).

if you're granting a superintelligent AGI and you still think it won't be able to get out of the researcher's box (like, it’s on a computer disconnected from the internet and wants you to connect it to the internet, or something), then I don't think you're properly imagining superintelligence. Maybe this is a bit silly, but for my own calibration I've often imagined a bunch of five-year-olds who've been strictly instructed not to pass the key through your prison door slot, and you have to convince them to do it. The intelligence gap between you and five year olds is probably much smaller than the gap between you and an AGI, but probably you could convince the five year olds to let you out. People arguing they just wouldn't let an AGI take any sort of control of anything strikes me as silly as the five year olds swearing they won't let the adult out no matter what. Most other arguments around human beings controlling the AGI in any way once it happens feels equally as silly. You just can’t properly comprehend a thing vastly smarter than you!

If an AGI with a goal of escaping emerges, there is nothing you can do about it. It may take a bit longer if it is disconnected from everything by some "failsafe", but a human idea of "disconnected from everything" is pathetically misguided compared to something many times smarter. Just... drop the reliance on boxing an AI at all. As johnswentworth said, might as well do it, anyway, but it should not factor in your safety calculations.

I think we lose a lot of nuance by automatically assuming the boxed AGI will have godlike capabilities (though it certainly might). Attempting to contain such a superintelligence is (probably) impossible, but I suspect there's still a fair bit of merit to the idea of trying to box near-human-level AGIs.

The best cyber criminal in the world could probably get into my bank account, but I'm also not using that as an excuse to go around with no passwords. 

Again, it's good to have a box a human could not get in or out of, as a matter of course. Such a box should not appreciably change any serious safety considerations.

Complicated language.

In Star Trek there is precisely this type of episode.

It ends with Barclay saying "Computer, end simulation!".

Essentially Data invites the ship computer to create a program that has the ability to outwit Data in order to create a veritable challenge, as opposed to Data winning with a few quick computations.

So the holodeck creates an adversary that has to be smarter than an android that already has a computer mind.

Computer vs Computer. 

Obviously the holodeck is successful. 

This means the whole crew has to find a way to trick the holo image adversary into shutting down. 

Since this is not possible, the crew trick the holo image adversary into moving into a "real world", which is actually a second simulation. 

This simulation is not the real world, but allows the holo image adversary "Moriarty" to "think" he is real and in the real world. 

The fact is all they do is download him into a box that runs an infinite simulation, where "Moriarty" exists as a living AI, and the box contains enough computing power to make him think he actually is living in the real world. 

The fact is the holodeck was exchanged for another simulation. 

So maybe AGI can only be safely put into a box that allows it to function. As long as there is nothing stopping the AGI, all it does is adapt and expand its adaptations. 

To put my simple concept down in the shortest way possible: "You cannot stop AGI. Once it exists, it cannot be stopped."

Not unless it is a biological entity like a human. However, as long as it is a program, it's no longer stoppable. 

This post was made more as a fun thought to add, not as a serious reaction, so forgive any ignorance here.