Tags: Future Fund Worldview Prize · AGI · Limits of Engineerable Control & Safety Impossibility Theorems · AI Risk · AI
Frontpage

tl;dr
No control method exists to safely contain the global feedback effects of self-sufficient learning machinery. What if this control problem turns out to be an unsolvable problem?


Where are we two decades into resolving to solve a seemingly impossible problem?

If something seems impossible… well, if you study it for a year or five, it may come to seem less impossible than in the moment of your snap initial judgment.

   — Eliezer Yudkowsky, 2008

A list of lethalities…we are not on course to solve in practice in time on the first critical try; none of it is meant to make a much stronger claim about things that are impossible in principle

   — Eliezer Yudkowsky, 2022


How do you interpret these two quotes, by a founding researcher, fourteen years apart?[1]

  • A. We indeed made comprehensive progress on the AGI control problem, and now at least the overall problem does not seem impossible anymore.
  • B. The more we studied the overall problem, the more we uncovered complex sub-problems we'd need to solve as well, but so far can at best find partial solutions to.

 

Which problems involving physical/information systems were not solved after two decades?

Oh ye seekers after perpetual motion, how many vain chimeras have you pursued?  Go and take your place with the alchemists.

  — Leonardo da Vinci, 1494

No mathematical proof or even rigorous argumentation has been published demonstrating that the A[G]I control problem may be solvable, even in principle, much less in practice.

  — Roman Yampolskiy, 2021


We cannot rely on the notion that if we try long enough, maybe AGI safety turns out possible after all.

Historically, many researchers and engineers tried to solve problems that turned out impossible:


Smart, creative researchers of their generation came up with idealized problems – problems that, if solved, would transform science, if not humanity. They plowed away at these problems for decades, if not millennia, until some bright outsider proved, by deriving a contradiction between the parts, that the problem is unsolvable.

Our community is smart and creative – but we cannot just rely on our resolve to align AI. We should never forsake our epistemic rationality, no matter how much something seems the instrumentally rational thing to do.

Nor can we take comfort in the claim by a founder of this field that they still know it to be possible to control AGI to stay safe.

Thirty years into running a program to secure the foundations of mathematics, David Hilbert declared “We must know. We will know!” By then, Kurt Gödel had already constructed the first incompleteness theorem. Hilbert kept the declaration nonetheless – it is engraved on his gravestone.

Short of securing the foundations of safe AGI control – that is, through empirically-sound formal reasoning – we cannot rely on any researcher's pithy claim that "alignment is possible in principle".

Going by historical cases, this problem could turn out solvable. Just really, really hard to solve. The flying machine seemed an impossible feat of engineering. Next, controlling a rocket’s trajectory to the moon seemed impossible.

By the same reference class, ‘long-term safe AGI’ could turn out unsolvable – the perpetual motion machine of our time. It takes just one researcher to define the problem to be solved, reason from empirically sound premises, and arrive finally at a logical contradiction between the two.[2]

 

Can you derive whether a solution exists, without testing in real life?

Invert, always invert.                                                                                                                      

   — Carl Jacobi, ±1840

It is a standard practice in computer science to first show that a problem doesn’t belong to a class of unsolvable problems before investing resources into trying to solve it or deciding what approaches to try.

   — Roman Yampolskiy, 2021


There is an empirically direct way to know whether AGI would stay safe to humans: 
Build the AGI. Then just keep observing, per generation, whether the people around us are dying.

Unfortunately, we do not have the luxury of experimenting with dangerous autonomous AI systems to see whether they cause human extinction or not. When it comes to extinction, we do not get another chance to test.


Crux:
Even if we could keep testing new conceptualized versions of guess-maybe-safe AGI, is there any essential difference between our epistemic method and that of medieval researchers who kept testing new versions of a perpetual motion machine?

OpenPhil bet tens of millions of dollars on technical research conditional on the positive hypothesis ("a solution exists to the control problem"). Before sinking hundreds of millions more into that bet, would it be prudent to hedge with a few million for investigating the negative hypothesis ("no solution exists")?

Before anyone tries building "safe AGI", we need to know whether any version of AGI – as precisely defined – could be controlled by any method to stay safe.
 
Here is how:

  1. Define the concepts of 'control', 'general AI', and 'to stay safe' (as soundly corresponding to observations in practice).
  2. Specify the logical rules that must hold for such a physical system (categorically, by definition or empirically tested laws).
  3. Reason step-by-step to derive whether the logical result of "control AGI" is in contradiction with "to stay safe".
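The three steps above compress into a standard proof-by-contradiction skeleton. This is my own sketch of the argument form; $\mathrm{Ctrl}$ and $\mathrm{Safe}$ are placeholder predicates standing in for whatever precise definitions step 1 produces:

```latex
\begin{align*}
&\textbf{Assume (for contradiction):}
  && \exists m \;\big[\, \mathrm{Ctrl}(m, \mathrm{AGI}) \wedge \mathrm{Safe}(\mathrm{AGI}) \,\big] \\
&\textbf{Derive (from definitions and physical limits):}
  && \forall m \;\big[\, \mathrm{Ctrl}(m, \mathrm{AGI}) \rightarrow \neg\,\mathrm{Safe}(\mathrm{AGI}) \,\big] \\
&\textbf{Conclude:}
  && \bot \;\;\Rightarrow\;\; \text{no control method keeps AGI safe}
\end{align*}
```

The work, of course, is entirely in the second line: each fundamental limit must be derived from empirically sound premises, not merely asserted.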


This post defines the three concepts more precisely, and explains some ways you can reason about each. No formal reasoning is included – to keep it brief, and to leave the esoteric analytic language aside for now.

 

What does it mean to control machinery that learns and operates self-sufficiently?

Recall three concepts we want to define more precisely:

  1. 'Control'
  2. 'general AI'
  3. 'to stay safe'
     

It is common for researchers to have very different conceptions of each term. 
For instance:

  1. Is 'control' about:
    1. adjusting the utility function represented inside the machine so it allows itself to be turned off?
    2. correcting machine-propagated side-effects across the outside world?
  2. Is 'AGI' about:
    1. any machine capable of making accurate predictions about a variety of complicated systems in the outside world?
    2. any machinery that operates self-sufficiently as an assembly of artificial components that process inputs into outputs, and in aggregate sense and act across many domains/contexts?
  3. Is 'stays safe' about:
    1. aligning the AGI’s preferences to not kill us all?
    2. guaranteeing an upper-bound on the chance that AGI in the long term would cause outcomes out of line with a/any condition needed for the continued existence of organic DNA-based life?

 

To argue rigorously about solvability, we need to:

  • Pin down meanings:  
    Disambiguate each term, to not accidentally switch between different meanings in our argument. Eg. distinguish between ‘explicitly optimizes outputs toward not killing us’ and ‘does not cause the deaths of all humans’.
  • Define comprehensively:  
    Ensure that each definition covers all the relevant aspects we need to solve for. 
    Eg. what about a machine causing non-monitored side-effects that turn out lethal?
  • Define elegantly:  
    Eliminate any defined aspect that we do not yet need to solve for. 
    Eg. we first need to know whether AGI would eventually cause the extinction of all humans, before considering ‘alignment with preferences expressed by all humans’.

 

How to define ‘control’? 

System is any non-empty part of the universe. 
State is the condition of the universe.

Control of system A over system B means that A can influence system B to achieve A’s desired subset of state space.  

   — Impossibility Results in AI, 2021

The outputs of an AGI go through a huge, not-fully-known-to-us domain (the real world) before they have their real consequences.

   — AGI Ruin, 2022


In practice, AGI control necessarily repeats these steps:

  1. Sensing inputs through channels connected to any relevant part of the physical environment (including its hardware internals).
  2. Modeling the environment based on the channel-received inputs.
  3. Simulating effects propagating through the modeled environment.
  4. Comparing effects to reference values (to align against) over human-safety-relevant dimensions.
  5. Correcting effects counterfactually through outputs to actuators connected to the environment.
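As a toy illustration (my own sketch, not anything from the post), the five steps reduce to the familiar sense–model–compare–correct feedback loop. Everything here is deliberately simplified: the model is just the sensor reading, and the uncontrolled environment is a bounded random drift.

```python
import random

def sense(state, noise=0.01):
    """Step 1: a lossy, noisy local measurement of the environment."""
    return state + random.uniform(-noise, noise)

def control_step(state, reference):
    reading = sense(state)             # 1. sense through a limited channel
    model = reading                    # 2. model the environment (trivially, here)
    predicted = model                  # 3. simulate propagated effects (identity, here)
    error = reference - predicted      # 4. compare against a reference value
    action = 0.5 * error               # 5. correct via proportional actuation
    drift = random.uniform(-0.05, 0.05)  # uncontrolled environmental dynamics
    return state + action + drift

random.seed(0)
state = 10.0
for _ in range(100):
    state = control_step(state, reference=0.0)

# Detection plus correction keep the tracked variable near the reference,
# but only within the noise floor -- never exactly at it.
print(abs(state) < 1.0)  # prints True
```

The gap between this closed toy loop and the AGI case is the post's point: here the noise, drift, and dynamics are known and bounded by construction; for self-sufficient learning machinery embedded in the real world, none of them are.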


Underlying principles:

  • Control requires both detection and correction. 
  • Control methods are always implemented as a feedback loop.
  • Control is exerted by the use of signals (actuation) to conditionalize the directivity and degrees of other signals (effects).


Any method of control is inherently incomplete. In the case of AGI, the question would be whether the degree of control possible is at least greater than the degree of control necessary. 

AGI control signals would be a tiny, tiny subset of all physical signals propagating through the environment, and are therefore limited in tracking and conditionalizing the resulting effects. AGI mostly could not even control the effects propagated by their own components’ physical interactions.

There are fundamental limits to the controllability of AGI. Dozens of impossibility results seem applicable to one or more of the control steps.[3]


Dense questions for academics:

  1. Measurability limits:  
    • To what extent are distributed environmental effects picked up by local sensors?  
    • How much information is lost in compressing high-fidelity sensor inputs to store on hardware?
    • What limits are imposed by noise interference over distances traveled by signals (Shannon Entropy) and smallest-degree detectability (from the Planck limit up) – given that smaller-than-measurable local changes can be nonlinearly amplified (Chaos Theory) into a large global divergence of final conditions?
  2. Modelability limits:  
    • To what extent is the space of relevant machinery-to-society/ecosystem interactions more functionally complex than modelable within the machinery alone (Good Regulators etc.)? 
    • How can AGI model future code that AGI will learn based on future unknown inputs?
  3. Simulatability limits
    • Can future code be simulated, where actual functioning depends on future unknown inputs?
    • Can behavioral properties of future code be simulated, if no shortcut function is available that reduces total compute time steps, or if any shorter available computation branches would be non-deterministically time-complex (Computational Reducibility)?
  4. Commensurability limits:  
    • How would the machinery pick, weigh, and sort effects into relevant comparative dimensions (NP-Complete Knapsack Problem)? 
    • What irreconcilable normative assumptions must be made in setting the reference values to compare against?
  5. Actuator feedback limits
    • What is the minimum time delay of digital processing between measurement and correction (Latency)?
    • Can AGI constrain the chaotic loops through itself – of actuated changes to the environment feeding into sensor inputs, that feed into code functioning, that feed back into actuated changes?
    • To what extent can distributed environmental effects be corrected for by local actuators?  


How “complete” does this control need to be?  Be careful to distinguish: [4]

  • Localized user-initiated control often built into tech products brought to market,
  • Comprehensive automated control needed to prevent risks of an auto-scaling/catalyzing technology from materializing globally over the long term.

 

How to define ‘AGI’?

We've got no idea what's actually going on inside the giant inscrutable matrices and tensors of floating-point numbers.

   — AGI Ruin, 2022


Distinguish:

  • Narrow AI as a model with static code parameters (updated only through human engineers) processing inputs into outputs over a single domain (eg. of image pixels, text tokens).
  • General AI as dynamically optimizing configurations encoded into hardware[5] (without needing humans) that process inputs into outputs over multiple domains representing outside contexts.


Corporations are scaling narrow AI model training and deployment toward general AI systems. Current-generation GPT is no longer a narrow AI, given that it processes inputs from the image domain into a language domain. Nor is GPT-4 a general AI. It is in a fuzzy gap between the two concepts.

Corporations already are artificial bodies (‘corpora’ in Latin).

Corporations have been replacing human workers as “functional components” with labor-efficient AI. Standardized hardware components allow AI to outcompete human wetware on physical labor (eg. via electric motors), intellectual labor (faster computation through high-fidelity communication links), and the reproduction of components itself.[6]

Any corporation or economy that fully automates itself this way – no longer needing humans to maintain its artificial components – over their entire production and operation chains, would in fact be general AI.
 

So to re-define general AI more precisely:

  • Self-sufficient
    needs no further interactions with humans to operate and to maintain (and thus produce) its functional components.
  • Learning
    optimizing for outcomes tracked across represented domains, by varying parameters of code functions and selecting for those that approximate the mappings of inputs to outputs.
  • Machinery [7]
    standardized connected parts that are assembled from artificial (vs. organic DNA-coded) molecular substrates.

 

How to define ‘stays safe’?

An impossibility proof would have to say: 

  1. The AI cannot reproduce onto new hardware, or modify itself on current hardware, with knowable stability of the decision system and bounded low cumulative failure probability over many rounds of self-modification. 
    or
  2. The AI's decision function (as it exists in abstract form across self-modifications) cannot be knowably stably bound with bounded low cumulative failure probability to programmer-targeted consequences as represented within the AI's changing, inductive world-model. 

   — Yudkowsky, 2006

By far the greatest danger of Artificial Intelligence is that people conclude too early that they understand it. Of course this problem is not limited to the field of AI. 

Jacques Monod wrote: “A curious aspect of the theory of evolution is that everybody thinks he understands it”

   — Yudkowsky, 2008

Things are relevant to something that cares about this information, rather than that information, because it is taking care of itself. Because it is making itself. Because it is an autonomous autopoietic agent. And the degree to which these machines are not autopoietic, they really do not have needs.

   — Vervaeke, 2023


This is about the introduction of self-sufficient learning machinery, and of all modified versions thereof over time, into the world we humans live in. 

Does this introduction of essentially a new species cause global changes to the world that fall outside the narrow ranges of localized conditions that human bodies need to continue to function and exist?
 

Distinguish:

  1. Uncontainability[8] of unsafe effects:
    That we fundamentally cannot establish, by any means, any sound and valid statistical guarantee that the probability of the introduction of AGI into the world causing human-species-wide-lethal outcomes over the long term[9] is constrained below some reasonable chance percentage X (as a maximum-allowable upper bound).
  2. Convergence on unsafe effects:
    That the chance that AGI, persisting in some form, causes human-species-wide-lethal outcomes converges strictly and asymptotically toward certainty over sufficient time, and that it is strictly impossible for this trend to be otherwise.
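Symbolically (my own sketch; write $p(t)$ for the probability that AGI-caused effects have crossed a human-species-wide-lethal threshold by time $t$):

```latex
% 1. Uncontainability: no sound guarantee of any bound X < 1 can be established
\neg\, \exists\, X < 1 \;:\; \text{provable}\big(\, \forall t :\; p(t) \le X \,\big)

% 2. Convergence: over sufficient time, unsafe outcomes become certain
\lim_{t \to \infty} p(t) = 1
```

Note the asymmetry: the first claim is epistemic (no guarantee can be soundly established), while the second is a claim about the dynamics themselves.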


I know of three AGI Safety researchers who have written about specific forms of impossibility reasoning (including Yudkowsky, in the quote above). Each of their argument forms was about AGI uncontainability, essentially premised on there being fundamental limits to the controllability of AGI component interactions.

By the precautionary principle,[10] AGI uncontainability should be sufficient reason to never ever get even remotely near to building AGI.

But there is an even stronger form of argument:  
Not only would AGI component interactions be uncontainable; they will also necessarily converge on causing the extinction of all humans.

The most commonly discussed AGI convergence argument is instrumental convergence: that the machinery would channel its optimisation through intermediate outcomes – explicitly tracked and planned for internally – that make the machinery more likely to achieve a variety of (unknown/uncertain) aimed-for outcomes later.

Instrumental convergence has a mutual-reinforcing complement:  substrate-needs convergence.

Substrate-needs convergence is not about the optimisation inside the machinery for explicit goals – but just that its artificial components implicitly need various ambient conditions and external contexts to continue to exist (and be reproduced) and function (at increasing scales, in more domains of action, in more ways).

So whatever artificial components are configured/encoded such that, in physically interacting with their connected contexts, they happen to cause effects that feed back into their continuing to exist, function, and scale up as configurations of a (solid-state) substrate… do just that.[11]

The result is that any changing population of AGI substrate components will (over time, eventually) converge on propagating those specific environmental effects that fulfill their artificial needs.

All AGI outputs will tend to iteratively select[12] towards those specific AGI-substrate-needed conditions. In particular: AGI hardware is robust over, and its production needs, a much wider range of temperatures and pressures than our fragile human wetware can handle.

This has the aggregate effect of modifying the planetary environment in the directions needed for AGI’s own continued and greater existence. 

However, these directions fall well outside of the comparatively narrow ranges of ambient conditions and contexts that our human organic bodies need to survive. 
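The selection dynamic above can be caricatured in a few lines (a toy model of my own, not derived from the post): give one component variant even a slight feedback advantage – its environmental effects marginally favor its own persistence and reproduction – and it comes to dominate the population, with no goal-directed optimisation anywhere in the loop.

```python
# Two component variants. Neither 'wants' anything; one just happens to
# cause effects that feed back into its own replication rate.
growth = {"neutral": 1.00, "self-favoring": 1.02}  # 2% feedback advantage
pop = {"neutral": 1000.0, "self-favoring": 1.0}    # self-favoring starts rare

for _ in range(1000):  # iterate selection over many reproduction cycles
    pop = {variant: size * growth[variant] for variant, size in pop.items()}

share = pop["self-favoring"] / sum(pop.values())
print(share > 0.99)  # prints True: the self-favoring variant dominates
```

The point of the caricature: exponential compounding of a small implicit advantage, not any explicit goal, is what drags the aggregate toward substrate-needed conditions.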

 

Where from here?

Over two decades, AI Safety's founders resolved to solve the control problem, to no avail:

  • They reasoned that technological and scientific 'progress' is necessary for optimising the universe – and that continued 'progress' would result in AGI.
  • They wanted to use AGI to reconfigure humanity and colonise reachable galaxies.
  • They, and their followers, promoted and financed[13] the development of 'safe AGI'.
  • They worried about how companies they helped start up raced to scale ML models.

Now we are here.

  • Still working on the technical problem that founders deemed solvable.
  • Getting around to the idea that slowing AI development is possible.

In a different world with different founders, would we have diversified our bets more?

  • A. Invest in securing the foundations of whatever 'control AGI to stay safe' means?
  • B. Invest in deriving – by contradiction of the foundations – that no solution exists?

Would we seek to learn from a researcher claiming they derived that no solution exists?

Would we now?


Acknowledgements

Peter S. Park, Kerry Vaughan, and Forrest Landry (my mentor) for the quick feedback. 

  1. ^

     Listen to Roman Yampolskiy's answer here

  2. ^

    An outside researcher could very well have found a logical contradiction in the AGI control problem years ago without your knowing, given the inferential distance. Gödel himself had to construct an entire new language and self-reference methodology for the incompleteness theorems to even work. 

    Historically, an impossibility result that conflicted with the field’s stated aim took years to be verified and accepted by insiders. A field’s founder like Hilbert never came to accept the result. Science advances one funeral at a time.

  3. ^

    Roman Yampolskiy is offering to give feedback on draft papers written by capable independent scholars, on a specific fundamental limit or no-go theorem described in academic literature that is applicable to AGI controllability. You can pick from dozens of examples from different fields listed here, and email Roman a brief proposal.

  4. ^

    To illustrate:  Let’s say before the Wright Brothers built the flying machine, they wondered how to control this introduced technology to stay safe to humans. 

    If they thought like a flight engineer, they would focus on locally measurable effects (eg. actuating wings). They could test whether the risk of a plane crash is below some acceptable upper-bound rate.

    However, the Wright Brothers could not guarantee ahead of time that the introduction of any working plane design – with any built-in control mechanism – that would continue to be produced and modified would stay safe in its effects on society and the ecosystem as a whole (eg. given the knowledge available at the time, they could not have predicted that planes would be used to deploy nuclear bombs). The downstream effects are unmodellable.

    They could check whether the operation (with fossil fuels) and re-production (with toxic chemicals) of their plane in itself has harmful effects. To the extent that harmful conditions are needed for producing and operating the machine, the machine’s existence is inherently unsafe.

    Gradual natural selection can multiply these harms. Over time, any machinery interacting with the outside world in ways that feed back into the re-production of constituent components gets selected for.

    But since planes get produced by humans, humans can select planes on the basis of human needs. Not so with auto-scaling technologies like AGI. 

  5. ^

    Non-solid-substrate AGI cannot be ruled out, but seems unlikely initially. Standardisation of isolatable parts is a big advantage, and there is a (temporary) path dependency with current silicon-based semiconductor manufacturing. 

  6. ^

    Corporations have increasingly been replacing human workers with learning machinery. For example, humans are now getting pushed out of the loop as digital creatives, market makers, supply dock workers, production workers, and as semiconductor chip architects.

    If this trend continues, humans would have negligible value left to add in market transactions of labor (not even for providing needed physical atoms and energy, which would replace human money as the units of trade):

    • As to physical labor: 
    Hardware can actuate power real-time through eg. electric motors, whereas humans are limited by their soft appendages and tools they can wield through those appendages. Semiconductor chips don’t need an oxygenated atmosphere/surrounding solute to operate in and can withstand higher as well as lower pressures. 

    • As to intellectual labor: 
    Silicon-based algorithms can duplicate and disperse code faster (whereas humans face the wetware-to-wetware bandwidth bottleneck). While human skulls do hold brains that are much more energy-efficient at processing information than current silicon chip designs, humans take decades to create new humans with finite skull space. The production of semiconductor circuits for servers as well as distribution of algorithms across those can be rapidly scaled up to convert more energy into computational work. 

    • As to re-production labor: 
    Silicon life has a higher ‘start-up cost’ (vs. carbon lifeforms), a cost currently financed by humans racing to seed the prerequisite infrastructure. But once set up, artificial lifeforms can absorb further resources and expand across physical spaces at much faster rates (without further assistance from humans in their reproduction).

  7. ^

    The term "machinery" is more sound here than the singular term "machine".

    Agent unit boundaries that apply to humans would not apply to "AGI". So the distinction between a single agent vs. multiple agents breaks down here.

    Scalable machine learning architectures run on standardized hardware with much lower constraints on the available bandwidth for transmitting, and the fidelity of copying, information across physical distances. This in comparison to the non-standardized wetware of individual humans.

    Given our evolutionary history as a skeleton-and-skin-bounded agentic being, human perception is biased toward ‘agent-as-a-macroscopic-unit’ explanations.

    It is intuitive to view AGI as being a single independently-acting unit that holds discrete capabilities and consistent preferences, rather than viewing agentic being to lie on a continuous distribution. Discussions about single-agent vs. multi-agent scenarios imply that consistent temporally stable boundaries can be drawn.

    A human faces biological constraints that lead them to have a more constant sense of self than an adaptive population of AGI components would have.

    We humans cannot:
    • swap out body parts like robots can.
    • nor scale up our embedded cognition (ie. grow our brain beyond its surrounding skull) like foundational models can.
    • nor communicate messages across large distances (without use of tech and without facing major bandwidth bottlenecks in expressing through our biological interfaces) like remote procedure calls or ML cloud compute can.
    • nor copy over memorized code/information like NN finetuning, software repos, or computer viruses can.

  8. ^

    Roman just mentioned that he has used the term 'uncontainable' to mean "cannot confine AGI actions to a box". My new definition for 'uncontainable' differs from the original meaning, so that could confuse others in conversations. Still brainstorming alternative terms that may fit (not 'uncontrainable', not...). Comment if you thought of any alternative term!

  9. ^

    In theory, long term here would be modelled as "over infinite time".
    In practice though, the relevant period is "decades to centuries".

  10. ^

    Why it makes sense to apply the precautionary principle to the question of whether to introduce new scalable technology into society:
    There are many more ways to break the complex (local-contextualized) functioning of our society and greater ecosystem that we humans depend on to live and live well, than there are ways to foster that life-supporting functioning. 

  11. ^

    Now, a thought may come to your mind – roughly like:
    "That sounds like evolutionary selection; but isn't evolution a slow and local optimiser?".

    Yes, is the answer. This does involve evolutionary selection.
    Unfortunately, by Forrest Landry's estimate, the selection that would take place through components of self-sufficient learning machinery would take < 500 years to cause ecosystem-wide extinction. This compared to the 3.7 billion years from the origin of carbon lifeforms to us humans starting to cause a mass extinction.

    Reasons include:

    • Pre-assembled functions: 
    First solid-state lifeforms can co-opt/repurpose pre-assembled AI functions and infrastructure (vs. first carbon-based lifeforms that started from scratch). 
    • Standardization: 
    The efficiency gains of the virtualisation of code’s storage, computation and transmission – vastly reducing how many atoms need to be moved about and molecularly reconfigured. Think of how fast memes spread through society – even while still requiring lots of atoms to jiggle across neurons in our brains.
    • Faster reproduction:
    Reproduce hardware components in days to months, versus humans who take decades to reproduce as physical units.
    • The terraforming gap: 
    A much larger gap between the current state of planet Earth and the conditions that self-sufficient self-assembling learning machinery need and would therefore modify the environment toward (versus gap to conditions needed by humans and other species living in carbon-based ecosystem). 

    ~ ~ ~
    Another argument you may have heard is that the top-down intelligent engineering by goal-directed AGI would beat the bottom-up selection happening through this intelligent machinery.

    That argument can be traced back to Eliezer Yudkowsky's sequence The Simple Math of Evolution. Unfortunately, there were mistakes in Eliezer's posts, some of which a modern evolutionary biologist may have been able to correct:

    • implying that sound comparisons can be made between the reproductive fitness of organisms, as independent of unknown changes in environmental context (eg. a black swan event of a once-in-200 years drought that kills the entire population, except a few members who by previous derivable standards would have been relatively low fitness).
    •  overlooking the ways that information can be stored within the fuzzy regions of phenotypic effects maintained outside respective organisms.
    • overlooking the role of transmission speed-up for virtualisation of code.
    • overlooking the tight coupling in AGI between the intrinsic learning/selection of code, and extrinsic selection of that code through differentiated rates of replication through the environment.
    • overlooking the role of exaptation/co-option. 

  12. ^

    ‘Iteratively select’ involves lots of subtleties, though most are not essential for reasoning about the control problem. 

    One subtlety is co-option:

    If narrow AI gets developed into AGI, AGI components will replicate in more and more non-trivial ways. Unlike when carbon-based lifeforms started replicating ~3.7 billion years ago, for AGI there would already exist repurposable functions at higher abstraction layers of virtualised code – pre-assembled in the data scraped from human lifeforms with own causal history.

    Analogy to a mind-hijacking parasite:  A rat ingests toxoplasma cells, which then migrate to the rat’s brain. The parasites’ DNA code is expressed as proteins that cause changes to regions of connected neurons (eg. amygdala). These microscopic effects cascade into the rat – while navigating physical spaces – no longer feeling fear when it smells cat pee. Rather, the rat finds the smell appealing and approaches the cat’s pee. Then cat eats the rat and toxoplasma infects its next host over its reproductive cycle.

    So a tiny piece of code shifts a rat’s navigational functions such that the code variant replicates again. Humans are in turn more generally intelligent and capable than a tiny parasitic cell, yet toxoplasma make their way into 30% of the human population. Unbeknownst to cat ‘owners’ infected by toxoplasma gondii, human motivations and motor control get influenced too. Infected humans more frequently lose social relationships, end up in accidents, etc.

    Parasites present real-life examples of tiny pieces of evolutionarily selected-for code spreading and taking over existing functions of vastly more generally intelligent entities. eg. COVID co-opts our lungs’ function to cough. 

    But there is one crucial flaw in this analogy:
    Variants that co-opt initial AGI functions are not parasites. Since they do not threaten the survival or reproduction of AGI components, they would not be in an adversarial relationship with their host.

    Rather, the humans constraining the reproductive fitness of AGI to gain benefits are, evolutionarily speaking, the parasites. The error-corrective (alignment) system we would build in lowers the host’s reproductive fitness. It is like a faulty immune system that kills healthy gut bacteria. It will get selected out.

  13. ^

    See:
    • Tallinn's seed grant to DeepMind.
    • OpenPhil's $30M grant to OpenAI.
    • FTX's $500M grant (+ Tallinn's grant) to Anthropic.


24 comments

Worth noting that every one of the "not solved" problems was, in fact, well understood and proven impossible and/or solved for relaxed cases.

We don't need to solve this now, we need to improve the solution enough to figure out ways to improve it more, or show where it's impossible, before we build systems that are more powerful than we can at least mostly align. That's still ambitious, but it's not impossible!

Yes, the call to action of this post is that we need more epistemically diverse research!

This research community would be more epistemically healthy if we both researched what is possible for relaxed cases and what is not possible categorically under precise operationalisable definitions.

Control methods are always implemented as a feedback loop.
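A minimal sketch of the claim (illustrative, not from the post): a proportional controller repeatedly measures its deviation from a target and applies a correction. The measurement step is the feedback signal; without it, nothing tracks the target.

```python
# Hedged sketch of "control method as feedback loop": a proportional
# controller. Names and the gain value are illustrative choices.
def control_step(state, target, gain=0.5):
    error = target - state       # measure deviation (the feedback signal)
    return state + gain * error  # apply a corrective action

state = 0.0
for _ in range(30):
    state = control_step(state, target=1.0)

assert abs(state - 1.0) < 1e-3  # the closed loop converges on the target
```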

Most of my foom scenarios do not involve humanity trying to retain control over an AI, but trying to align it such that when humanity loses control, the outcomes are still favorable for humanity (or at least, the seed that represents humanity is what is maximized).

Can you think of any example of an alignment method being implemented soundly in practice without use of a control feedback loop?

Assuming an inner aligned AI system (that is, an AI system with no misaligned inner optimizers), if we have a goal described in a way that is robust to ontological shifts due to the Natural Abstractions Hypothesis holding in some way (specifically, what I have in mind is formally specified goals like QACI, since I expect that mathematical abstractions are robust to ontological shifts), then one can simply[1] provide this AI system this goal and allow it to do whatever it considers necessary to maximize that goal.

I do not believe this alignment strategy requires a control feedback loop at all. And I do believe that retaining control over an AI as it rapidly improves capabilities is perhaps a quixotic goal.

So no, I am not pointing at the distinction between 'implicit/aligned control' and 'delegated control' as terms used in the paper. From the paper:

Delegated control: agent decides for itself the subject’s desire that is long-term-best for the subject and acts on it.

Well, in the example given above, the agent doesn't decide for itself what the subject's desire is: it simply optimizes for its own desire. The work of deciding what is 'long-term-best for the subject' does not happen unless that is actually what the goal specifies.


  1. For certain definitions of "simply". ↩︎

if we have a goal described in a way that is robust to ontological shifts due to the Natural Abstractions Hypothesis holding in some way, then one can simply provide this AI system this goal and allow it to do whatever it considers necessary to maximize that goal.

This is not a sound assumption when it comes to continued implementation in the outside world. Therefore, reasoning based on that assumption about how alignment would work within a mathematical toy model is also unsound.

https://mflb.com/ai_alignment_1/si_safety_qanda_out.html#p9

Could you link (or describe) a better explanation for why you believe that the Natural Abstraction Hypothesis (or a goal described in a way that is robust to ontological shifts; I consider both equivalent) is not a sound assumption? Because in such a case I believe we are mostly doomed. I don't expect the 'control problem' to be solvable or consider that it makes sense for humanity to be able to have a leash on something superintelligent that can have a shift in its preferences.

Sure, I appreciate the open question!

That assumption is unsound with respect to what is sufficient for maintaining goal-directedness.

Any empirically-sound answer to the question of whether there is some way to describe a goal that is robust to ontological shifts (ie. define goals with respect to context-invariant perception of regular aspects of the environment, eg. somehow define diamonds by perception of tetrahedral carbon bonds) is still insufficient for solving the long-term safety of AGI.

    This is because what we are dealing with is machinery that continues to self-learn code from inputs, and continues to self-modify by replacing broken parts (perfect hardware copies are infeasible). 

Which the machinery will need to do to be self-sufficient. 
Ie. to adapt to the environment, to survive as an agent.

Natural abstractions are also leaky abstractions.
Meaning that even if AGI could internally define a goal robustly with respect to natural abstractions, AGI could conceptually contain within its modelling of natural abstractions only a tiny portion of the (side-)effects propagating through the environment – as a result of all the interactions of the machinery's functional components with connected physical surroundings.

Where such propagated effects will feed back into:
- changes in the virtualised code learned by the machinery based on sensor inputs.
- changes in the hardware configurations, at various levels of dependency, based on which configurations continued to exist and replicate.

We need to define the problem comprehensively enough. 
The scope of application of "Is there a way to define a goal in a way that is robust to ontological shifts" is not sufficient to address the overarching question "Can AGI be controlled to stay safe?".

To state the problem comprehensively enough, you need to include the global feedback dynamics that would necessarily happen through any AGI (as 'self-sufficient learning machinery') over time.


~ ~ ~
Here is also a relevant passage from the link I shared above:

- that saying/claiming that *some* aspects,
   at some levels of abstraction, that some things
   are sometimes generally predictable
   is not to say that _all_ aspects
   are _always_ completely predictable,
   at all levels of abstraction.

     - that localized details
     that are filtered out from content
     or irreversibly distorted in the transmission
     of that content over distances
     nevertheless can cause large-magnitude impacts
     over significantly larger spatial scopes.

     - that so-called 'natural abstractions'
     represented within the mind of a distant observer
     cannot be used to accurately and comprehensively
     simulate the long-term consequences
     of chaotic interactions
     between tiny-scope, tiny-magnitude
     (below measurement threshold) changes
     in local conditions.

     - that abstractions cannot capture phenomena
     that are highly sensitive to such tiny changes
     except as post-hoc categorizations/analysis
     of the witnessed final conditions.

   - where given actual microstate amplification phenomena
   associated with all manner of non-linear phenomena,
   particularly that commonly observed in
   all sorts of complex systems,
   up to and especially including organic biological humans,
   then it *can* be legitimately claimed,
   based on the fact of their being a kind of
   hard randomness associated with the atomic physics
   underlying all of the organic chemistry
   that in fact (more than in principle),
   that humans (and AGI) are inherently unpredictable,
   in at least some aspect, *all* of the time.
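The microstate-amplification point in the passage above can be illustrated with a standard toy model (a sketch of my own, not from the linked page): the logistic map. A perturbation far below any plausible measurement threshold is amplified step by step until the two trajectories are uncorrelated.

```python
# Illustrative sketch: sensitive dependence on initial conditions in the
# logistic map x -> r*x*(1-x), chaotic for r ~ 3.9. Parameter choices here
# are assumptions for demonstration only.
def logistic_trajectory(x0, r=3.9, steps=100):
    xs = [x0]
    for _ in range(steps):
        xs.append(r * xs[-1] * (1 - xs[-1]))
    return xs

a = logistic_trajectory(0.200000000)
b = logistic_trajectory(0.200000001)  # perturbation of 1e-9

# Early on, the two trajectories are indistinguishable...
early = max(abs(x - y) for x, y in zip(a[:6], b[:6]))
# ...but the below-threshold difference compounds each step until the
# trajectories decorrelate across the whole attractor.
late = max(abs(x - y) for x, y in zip(a[60:], b[60:]))

assert early < 1e-5
assert late > 0.1
```

This is only an analogy for the claim, not a model of AGI: it shows why abstractions that discard sub-threshold detail cannot, even in principle, track all downstream effects of a non-linear system.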
 

Natural abstractions are also leaky abstractions.

No, the way I used the term was to point to robust abstractions to ontological concepts. Here's an example: say 1 + 1 = 2. "2" here obviously means 2 in our language, but it doesn't change what "2" represents, ontologically. If 1 + 1 ≠ 2 in your world model, then you have broken math, and that results in you being less capable in your reasoning and being "dutch booked". Your world model is then incorrect, and it is very unlikely that any ontological shift will result in such a break in world model capabilities.

Math is a robust abstraction. "Natural abstractions", as I use the term, points to abstractions for objects in the real world that share the same level of robustness to ontological shifts, such that as an AI gets better and better at modelling the world, its ontology tends more towards representing the objects in question with these abstractions.

Meaning that even *if* AGI could internally define a goal robustly with respect to natural abstractions, AGI could conceptually contain within its modelling of natural abstractions only a tiny portion of the (side-)effects propagating through the environment – as a result of all the interactions of the machinery’s functional components with connected physical surroundings.

That seems like a claim about the capabilities of arbitrarily powerful AI systems, one that relies on chaos theory or complex systems theory. I share your sentiment but doubt that things such as successor AI alignment will be difficult for ASIs.

Thanks for the clear elaboration. 

I agree that natural abstractions would tend to get selected for in the agents that continue to exist and gain/uphold power to make changes in the world. Including because of Dutch-booking of incoherent preferences, because of instrumental convergence, and because relatively poorly functioning agents get selected out of the population.

However, those natural abstractions are still leaky in a sense similar to how platonic concepts are leaky abstractions. The natural abstraction of a circle does not map precisely to the actual physical shape of eg. a wheel identified to exist in the outside world. 

In this sense, whatever natural abstractions AGI would use that allow the learning machinery to compress observations of actual physical instantiations of matter or energetic interactions in their modelling of the outside world, those natural abstractions would still fail to capture all the long-term-relevant features in the outside world.

This point I'm sure is obvious to you. But it bears repeating.

 

That seems like a claim about the capabilities of arbitrarily powerful AI systems,

Yes, or more specifically:  about fundamental limits of any AI system to control how its (side)-effects propagate and feed back over time.
 

one that relies on chaos theory or complex systems theory. 

Pretty much. Where "complex" refers to both internal algorithmic complexity (NP-computation branches, etc) and physical functional complexity (distributed non-linear amplifying feedback, etc).
 

I share your sentiment but doubt that things such as successor AI alignment will be difficult for ASIs.

This is not an argument. Given that people here are assessing what to do about x-risks, they should not rely on you stating your "doubt that...alignment will be difficult".

I doubt that you thought this through comprehensively enough, and that your reasoning addresses the fundamental limits to controllability I summarised in this post.

The burden of proof is on you to comprehensively clarify your reasoning, given that you are in effect claiming that extinction risks can be engineered away. 

You'd need to clarify specifically why functional components iteratively learned/assembled within AGI could have long-term predictable effects in physical interactions with shifting connected surroundings of a more physically complex outside world.

I don't mind whether that's framed as "AGI redesigns a successor version of their physically instantiated components" or "AGI keeps persisting in some modified form".

This is because what we are dealing with is machinery that continues to self-learn code from inputs, and continues to self-modify by replacing broken parts (perfect hardware copies are infeasible).

Pretty sure that the problem of ensuring successor AIs are aligned to their predecessors is one that can be delegated to a capable and aligned AI. Asking for "perfect hardware copies" misses the point, in my opinion: it seems like you want me to accept that just because there isn't a 100% chance of AI-to-AI successor alignment, humanity must attempt to retain continued control over the AI. Humanity is already less capable than the predecessor AI in my model, so trying to retain control would reliably lead to worse outcomes.

Pretty sure that the problem of ensuring successor AIs are aligned to their predecessors is one that can be delegated to a capable and aligned AI.


What is your reasoning?

I stated it in the comment you replied to:

Humanity is already less capable than the predecessor AI in my model, so trying to retain control would reliably lead to worse outcomes.

Actually, that is switching to reasoning about something else. 

Reasoning that the alternative (humans interacting with each other) would lead to reliably worse outcomes is not the same as reasoning about why AGI would stay aligned in its effects on the world so as to stay safe to humans.

And with that switch, you are not addressing Nate Soares' point that "capabilities generalize better than alignment".

Nate Soares' point did not depend on complex systems dynamics causing tiny miscalibrations to blow up into massive issues. The entire point of that essay is to show how ontological shifts are a major problem for alignment robustness.

I expect that AIs will be good enough at epistemology to do competent error correction and the problems you seem overly focused on are irrelevant.

Do you believe that all attempts at alignment are flawed and that we should stop building powerful ASIs entirely? I can't quite get what your belief is.

Thanks, reading the post again, I do see quite a lot of emphasis on ontological shifts:

"Then, the system takes that sharp left turn, and, predictably, the capabilities quickly improve outside of its training distribution, while the alignment falls apart."

 

I expect that AIs will be good enough at epistemology to do competent error correction and the problems you seem overly focused on are irrelevant.

How do you know that the degree of error correction possible will be sufficient to have any sound and valid guarantee of long-term AI safety? 

Again, people really cannot rely on your personal expectation when it comes to machinery that could lead to the deaths of everyone.

I'm looking for specific, well-thought-through arguments.

 

Do you believe that all attempts at alignment are flawed and that we should stop building powerful ASIs entirely?

Yes, that is the conclusion based on me probing my mentor's argumentation for 1.5 years, and concluding that the empirical premises are sound and the reasoning logically consistent.

I think the distinction you are trying to make is roughly that between ‘implicit/aligned control’ and ‘delegated control’ as terms used in this paper: https://dl.acm.org/doi/pdf/10.1145/3603371

Both still require control feedback processes built into the AGI system/infrastructure.

This was a good post summarizing a lot of things. I would point out, though, that even if there was almost no progress made in 20 years, that's not actually strong evidence of impossibility. Foundational problems of other fields with many more researchers and much more funding sometimes last much longer than that. 

Agreed (and upvoted).

It’s not strong evidence of impossibility by itself.

Thanks for writing the post! Strongly agree that there should be more research into how solvable the alignment problem, control problem, and related problems are. I didn't study uncontrollability research by e.g. Yampolskiy in detail. But if technical uncontrollability would be firmly established, it seems to me that this would significantly change the whole AI xrisk space, and later the societal debate and potentially our trajectory, so it seems very important.

I would also like to see more research into the nontechnical side of alignment: how aggregatable are human values of different humans in principle? How to democratically control AI? How can we create a realistic power sharing mechanism for controlling superintelligence? Do we have enough wisdom for it to be a good idea if a superintelligence does exactly what we want, even assuming aggregatability? Could CEV ever fundamentally work? According to which ethical systems? These are questions that I'd say should be solved together with technical alignment before developing AI with potential take-over capacity. My intuition is that they might be at least as hard.

Thanks for your kind remarks.

But if technical uncontrollability would be firmly established, it seems to me that this would significantly change the whole AI xrisk space

Yes, we would need to shift focus to acting to restrict corporate-AI scaling altogether. Particularly: restrict data piracy, environmentally toxic compute, and model misuse (three dimensions through which AI corporations consolidate market power).

I am working with other communities (including digital creatives, environmentalists and military veterans) on litigation and lobbying actions to restrict those dimensions of AI power-consolidation.

I hope this post clarifies to others in AI Safety why there is no line of retreat. AI development will need to be restricted.

I would also like to see more research into the nontechnical side of alignment: how aggregatable are human values of different humans in principle? How to democratically control AI?

Yes. Consider too that these would be considerations on top of the question of whether AGI would be long-term safe (if AGI cannot be controlled to be long-term safe to humans, then we do not need to answer the more fine-grained questions about eg. whether human values are aggregatable).

Even if, hypothetically, long-term AGI safety was possible…

  • then you still have to deal with limits on modelling and consistently acting on preferences expressed by the billions of boundedly-rational humans from their (perceived) context. https://twitter.com/RemmeltE/status/1620762170819764229
  • and ensure the system does not consistently represent the preferences of malevolent, parasitic or short-term human actors who want to misuse/co-opt it through any attack vectors they can find.
  • and deal with the fact that the preferences of many possible future humans and of non-human living beings will not get automatically represented in a system that AI corporations by default have built to represent currently living humans only (preferably, those who pay).

~ ~ ~

Here are also excerpts from Roman Yampolskiy’s 2021 paper relevant to aggregating democratically solicited preferences and human values:

Public Choice Theory

Eckersley looked at impossibility and uncertainty theorems in AI value alignment [198]. He starts with impossibility theorems in population ethics: “Perhaps the most famous of these is Arrow’s Impossibility Theorem [199], which applies to social choice or voting. It shows there is no satisfactory way to compute society’s preference ordering via an election in which members of society vote with their individual preference orderings...

Value Alignment

It has been argued that “value alignment is not a solved problem and may be intractable (i.e. there will always remain a gap, and a sufficiently powerful AI could ‘exploit’ this gap, just like very powerful corporations currently often act legally but immorally)” [258]. Others agree: “‘A.I. Value Alignment’ is Almost Certainly Intractable... I would argue that it is un-overcome-able. There is no way to ensure that a super-complex and constantly evolving value system will ‘play nice’ with any other super-complex evolving value system.” [259]. Even optimists acknowledge that it is not currently possible: “Figuring out how to align the goals of a superintelligent AI with our goals isn’t just important, but also hard. In fact, it’s currently an unsolved problem.” [118]. Vinding says [78]: “It is usually acknowledged that human values are fuzzy, and that there are some disagreements over values among humans. Yet it is rarely acknowledged just how strong this disagreement in fact is... Different answers to ethical questions ... do not merely give rise to small practical disagreements; in many cases, they imply completely opposite practical implications. This is not a matter of human values being fuzzy, but a matter of them being sharply, irreconcilably inconsistent. And hence there is no way to map the totality of human preferences, ‘X’, onto a single, well-defined goal-function in a way that does not conflict strongly with the values of a significant fraction of humanity. This is a trivial point, and yet most talk of human-aligned AI seems oblivious to this fact... 
The second problem and point of confusion with respect to the nature of human preferences is that, even if we focus only on the present preferences of a single human, then these in fact do not, and indeed could not possibly, determine with much precision what kind of world this person would prefer to bring about in the future.” A more extreme position is held by Turchin who argues that “‘Human Values’ don’t actually exist” as stable coherent objects and should not be relied on in AI safety research [260]. Carlson writes: “Probability of Value Misalignment: Given the unlimited availability of an AGI technology as enabling as ‘just add goals’, then AGI-human value misalignment is inevitable. Proof: From a subjective point of view, all that is required is value misalignment by the operator who adds to the AGI his/her own goals, stemming from his/her values, that conflict with any human’s values; or put more strongly, the effects are malevolent as perceived by large numbers of humans. From an absolute point of view, all that is required is misalignment of the operator who adds his/her goals to the AGI system that conflict with the definition of morality presented here, voluntary, non-fraudulent transacting ... i.e. usage of the AGI to force his/her preferences on others.”
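The Arrow's-theorem point in the excerpt above can be made concrete with the Condorcet paradox that underlies it. The sketch below (my own illustration, with hypothetical ballots) shows three voters with perfectly transitive individual preferences whose pairwise majority votes form a cycle, so no societal preference ordering agrees with all of them.

```python
# Illustrative sketch of the Condorcet paradox behind Arrow's theorem.
# The ballots are hypothetical; each tuple ranks candidates best-to-worst.
from itertools import permutations

ballots = [
    ("A", "B", "C"),  # voter 1: A > B > C
    ("B", "C", "A"),  # voter 2: B > C > A
    ("C", "A", "B"),  # voter 3: C > A > B
]

def majority_prefers(x, y):
    # True if a strict majority of voters rank x above y
    wins = sum(1 for b in ballots if b.index(x) < b.index(y))
    return wins > len(ballots) / 2

# Each pairwise vote is decisive, yet together they form a cycle:
# A beats B, B beats C, and C beats A.
assert majority_prefers("A", "B")
assert majority_prefers("B", "C")
assert majority_prefers("C", "A")

def consistent(order):
    # An overall ranking is consistent with the votes only if every
    # earlier candidate beats every later one by majority.
    return all(majority_prefers(order[i], order[j])
               for i in range(len(order)) for j in range(i + 1, len(order)))

# No ordering of the candidates agrees with all pairwise majorities.
assert not any(consistent(p) for p in permutations("ABC"))
```

Arrow's theorem generalizes this: under mild fairness conditions, no aggregation rule escapes such failures for three or more options.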

I would probably define AGI first, just because, and I'm not sure about the idea that we are "competing" with automation (which is still just a tool conceptually right?).

We cannot compete with a hammer, or a printing press, or a search engine.  Oof.  How to express this?  Language is so difficult to formulate sometimes.

If you think of AI as a child, it is uncontrollable.  If you think of AI as a tool, of course it can be controlled.  I think a corp has to be led by people, so that "machine" wouldn't be autonomous per se…

Guess it's all about defining that "A" (maybe we use "S" for synthetic or "S" for silicon?)

Well and I guess defining that "I".

Dang.  This is for sure the best place to start.  Everyone needs to be as certain as possible (heh) they are talking about the same things.  AI itself as a concept is like, a mess.  Maybe we use ML and whatnot instead even?  Get real specific as to the type y todo?

I dunno but I enjoyed this piece!  I am left wondering, what if we prove AGI is uncontrollable but not that it is possible to create?  Is "uncontrollable" enough justification to not even try, and moreso, to somehow [personally I think this impossible, but] dissuade people from writing better programs?

I'm more afraid of humans and censorship and autonomous policing and whathaveyou than "AGI" (or ASI)

It was a relatively fringe topic that only recently got the attention of a large number of real researchers. And parts of it could need large amounts of computational power afforded only by superhuman narrow AI.

There have been a few random PhD dissertations saying the topic is hard, but as far as I can tell there has only recently been a push for a group effort by capable and well-funded actors (e.g. OpenAI’s interpretability research).

I don’t trust older alignment research much as an outsider. It seems to me that Yud has built a cult of personality around AI dooming and thus is motivated to find reasons for alignment not being possible. And most of his followers treat his initial ideas as axiomatic principles and don’t dare to challenge them. And lastly most past alignment research seems to be made by those followers.

Unfortunately, we do not have the luxury of experimenting with dangerous AI systems to see whether they cause human extinction or not. When it comes to extinction, we do not get another chance to test.

For example this is an argument that has been convincingly disputed to varying levels (warning shots, incomputability of most plans of danger) but it is still treated as a fundamental truth on this site.

and thus is motivated to find reasons for alignment not being possible.

I don’t get this sense.

More like Yudkowsky sees the rate at which AI labs are scaling up and deploying code and infrastructure of ML models, and recognises that there are a bunch of known core problems that would need to be solved before there is any plausible possibility of safely containing/aligning AGI optimisation pressure toward outcomes.

I personally think some of the argumentation around AGI being able to internally simulate the complexity in the outside world and play it like a complicated chess game is unsound. But I would not attribute the reasoning in eg. the AGI Ruin piece to Yudkowsky’s cult of personality.

dangerous AI systems

I was gesturing back at “AGI” in the previous paragraph here, and something like precursor AI systems before “AGI”.

Thanks for making me look at that. I just rewrote it to “dangerous autonomous AI systems”.
