Program Coordinator of AI Safety Camp.


Bias in Evaluating AGI X-Risks
Developments toward Uncontrollable AI
Why Not Try Build Safe AGI?


Excellent article. I appreciate how you clarify that Anthropic's "Responsible Scaling Policy" is a set-up that allows for safety-washing. We would be depending on their well-considered good intentions, rather than any mechanism to hold them accountable.

Have you looked into how system safety engineers (eg. medical device engineers) scope the uses of software, so as to be able to comprehensively design, test, and assess the safety of the software?

Operational Design Domains scope the use of AI in self-driving cars. Tweeted about that here.

I guess that comes down to whether a future AI can predict or control future innovations of itself indefinitely.


That's a key question. You might be interested in this section on limits of controllability.

Clarifying questions:
1. To what extent can AI predict the code they will learn from future unknown inputs, and how that code will subsequently interact with the then-connected surroundings of the environment?

2. To what extent can AI predict all the (microscopic) modifications that will result from all the future processes involved in the future re-production of hardware components?

I personally think pessimistic vs. optimistic misframes it, because it frames a question about the world in terms of personal predispositions.

I would like to see reasoning.

Your reasoning in the comment thread you linked to is: “history is full of cases where people dramatically underestimated the growth of scientific knowledge, and its ability to solve big problems”

That’s a broad reference-class analogy to use. I think it carries little to no weight as to whether there would be sufficient progress on the specific problem of “AGI” staying safe over the long term.

I wrote why that specifically would not be a solvable problem.

Thanks for the thoughts! Some critical questions:

Natural selection requires variation. Information theory tells us that all information is subject to noise and therefore variation across time.

Are you considering variations introduced during learning (essentially changes to code, which can then be copied)? Are you considering variations introduced by microscopic changes to the chemical/structural configurations of the maintained/produced hardware?

However, we can reduce error rates to arbitrarily low probabilities using coding schemes.

Claude Shannon showed this to be the case for a single channel of communication. How about when you have many possible routing channels through which physical signals can leak to and back from the environment?
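To make the single-channel case concrete, here is a minimal sketch using a repetition code with majority voting. It is only an illustration, not Shannon's construction: it lowers the error rate by sending ever more copies, so the transmission rate falls as reliability rises. The function name `majority_error` and the flip probability of 0.1 are assumptions chosen for the example.

```python
from math import comb


def majority_error(p: float, n: int) -> float:
    """Probability that a majority vote over n copies of one bit decodes wrongly.

    Assumes each copy is flipped independently with probability p, and n is odd.
    """
    if n % 2 == 0:
        raise ValueError("n must be odd so that a majority always exists")
    k_min = n // 2 + 1  # smallest number of flipped copies that outvotes the rest
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(k_min, n + 1))


# Error rate after decoding, for an assumed 10% flip probability per copy.
for n in (1, 3, 5, 11):
    print(n, majority_error(0.1, n))
```

The decoded error rate shrinks as copies are added, but only under the assumption that each copy is corrupted independently on one well-defined channel.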

If you look at existing networked system architectures, do the near-zero error rates you can correct toward at the binary level (eg. with use of CRC code) also apply at higher layers of abstraction (eg. in detecting possible trojan horse adversarial attacks)?
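A minimal sketch of the gap between those two layers, assuming Python's standard `zlib.crc32`. The helper names `frame` and `check` and the example messages are hypothetical, made up for illustration.

```python
import zlib


def frame(payload: bytes) -> bytes:
    """Append a 4-byte big-endian CRC-32 checksum to the payload."""
    return payload + zlib.crc32(payload).to_bytes(4, "big")


def check(framed: bytes) -> bool:
    """Return True if the trailing checksum matches the payload."""
    payload, checksum = framed[:-4], framed[-4:]
    return zlib.crc32(payload).to_bytes(4, "big") == checksum


original = frame(b"open valve 3")

# Accidental noise: flip a single bit of the framed message.
noisy = bytearray(original)
noisy[0] ^= 0x01
noisy = bytes(noisy)

# Deliberate tampering: change the payload and recompute the checksum.
tampered = frame(b"open valve 9")

print(check(original))  # checksum matches the untouched message
print(check(noisy))     # the single-bit flip is detected
print(check(tampered))  # the tampered message passes the check
```

The checksum catches the accidental flip but accepts the tampered message, since a CRC involves no secret and anyone can recompute it. Near-zero error rates against random noise say nothing about inputs that are well-formed at the bit level yet harmful in what they do.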

If there is no variation then there is no natural selection.

This is true. Can there be no variation introduced into AGI, when they are self-learning code and self-maintaining hardware in ways that continue to be adaptive to changes within a more complex environment?

In abstract terms, evolutionary dynamics require either a smooth adaptive landscape such that incremental changes drive organisms towards adaptive peaks…

Besides point-change mutations, are you taking into account exaptation, as the natural selection for shifts in the expression of previous (learned) functionality? Must exaptation, as involving the reuse of functionality in new ways, involve smooth changes in phenotypic expression?

…and/or unlikely leaps away from local optima into attraction basins of other optima.

Are the other attraction basins instantiated at higher layers of abstraction? Are any other optima approached through selection across the same fine-grained super-dimensional landscape that natural selection is selective across? If not, would natural selection “leak” around those abstraction layers, as not completely being pulled into the attraction basins that are in fact pulling across a greatly reduced set of dimensions? Put a different way, can natural selection pull sideways on the dimensional pulls of those other attraction basins?

I believe that natural selection requires a population of "agents" competing for resources. If we only had a single AI system then there is no competition and no immediate adaptive pressure.

I get how you would represent it this way, because that’s often how natural selection gets discussed as applying to biological organisms.

It is not quite thorough in terms of describing what can get naturally selected for. For example, within a human body (as an “agent”) there can be natural selection across junk DNA that copies itself across strands, or virus particles, or cancer cells. At that microscopic level though, the term “agent” would lose its meaning if used to describe some molecular strands.

At the macroscopic level of “AGI”, the single vs. multiple agents distinction would break down, for reasons I described here.

Therefore, to thoroughly model this, I would try to describe natural selection as occurring across a population of components. Those components would be connected and co-evolving, and can replicate individually (eg. as with viruses replacing other code) or as part of larger packages or symbiotic processes of replication (eg. code with hardware). For AGI, they would all rely on somewhat similar infrastructure (eg. for electricity and material replacement) and also need somewhat similar environmental conditions to operate and reproduce.

Other dynamics will be at play which may drown out natural selection…Other dynamics may be at play that can act against natural selection.

Can the dynamic drown out all possible natural selection over x shortest-length reproduction cycles? Assuming the “AGI” continues to exist, could any dynamics you have in mind drown out any and all interactions between components and surrounding physical contexts that could feed back into their continued/increased existence?

We see existence-proofs of this in immune responses against tumours and cancers. Although these don't work perfectly in the biological world, perhaps an advanced AI could build a type of immune system that effectively prevents individual parts from undergoing runaway self-replication.

Immune system responses were naturally selected for amongst organisms that survived.

Would such responses also be naturally selected for in “advanced AI” such that not the AI but the outside humans survive more? Given that bottom-up natural selection by nature selects for designs across the greatest number of possible physical interactions (is the most comprehensive), can alternate designs built through faster but more narrow top-down engineering actually match or exceed that fine-grained extent of error detection and correction? Even if humans could get “advanced AI” to build in internal error detection and correction mechanisms that are kind of like an immune system, would that outside-imposed immune system withstand natural selection while reducing the host’s rates of survival and reproduction?

~ ~ ~

Curious how you think about those questions. I also passed on your comment to my mentor (Forrest) in case he has any thoughts.

not focused enough on forward-chaining to find the avenues of investigation which actually allow useful feedback

Are you mostly looking for where there is useful empirical feedback?  
That sounds like a shot in the dark.

Big breakthroughs open up possibilities that are very hard to imagine before those breakthroughs

A concern I have:
I cannot conceptually distinguish these continued empirical investigations of methods to build maybe-aligned AGI from how medieval researchers tried to build perpetual motion machines. It took sound theory to finally establish, once and for all, that perpetual motion machines are impossible.

I agree with Charbel-Raphaël that the push for mechanistic interpretability is in effect promoting the notion that there must be possibilities available here to control potentially very dangerous AIs to stay safe in deployment. It is much easier to spread the perception of safety, than to actually make such systems safe. 

All this while there is no sound theoretical basis for claiming that scaling mechanistic interpretability could form the basis of such a control method, nor for claiming that any control method could keep "AGI" safe.

Rather, mechint is fundamentally limited in the extent it could be used to safely control AGI. 
See posts:

  1. The limited upside of interpretability by Peter S. Park
  2. Why mechanistic interpretability does not and cannot contribute to long-term AGI safety by me 

Besides theoretical limits, there are plenty of practical arguments (as listed in Charbel-Raphaël's post) for why scaling the utilisation of mechint would be net harmful.

So there is no rigorous basis for claiming that the use of mechint would "open up possibilities" for long-term safety.
And there are plenty of possibilities for corporate marketers – to chime in on mechint's hypothetical big breakthroughs.

In practice, we may help AI labs again – accidentally – to safety-wash their AI products.

Great paraphrase!


no matter how good their control theory, and their ability to monitor and intervene in the world? 

This. There are fundamental limits to what system-propagated effects the system can control. And the portion of its own effects the system can control decreases as the system scales in component complexity.

Yet, any of those effects that feed back into the continued/increased existence of components get selected for. 

So there is a fundamental inequality here. No matter how "intelligent" the system is at pattern-transformation internally, it can intervene on only a tiny portion of the (possible) external evolutionary feedback on its constituent components.

Your position is that even if today's AI could be given bio-friendly values, AI would still be the doom of biological life in the longer run, because (skipping a lot of details) machine life and biological life have incompatible physical needs, and once machine life exists, darwinian processes will eventually produce machine life that overruns the natural biosphere. (You call this "substrate-needs convergence")

This is a great paraphrase btw.

Hello :)

For my part, I agree that pressure from substrate needs is real

Thanks for clarifying your position here.

Can't such an instinct and such a culture resist the pressure from substrate needs, if the AIs value and protect them enough?

No, unfortunately not. To understand why, you would need to understand how “intelligent” processes that necessarily involve the use of measurement and abstraction cannot conditionalise the space of possible interactions between machine components and connected surroundings – sufficiently, to not feed back into causing environmental effects that feed back into the continued or re-assembled existence of the components.

I think your arguments are underestimating what a difference intelligence makes to possible ecological and evolutionary dynamics

I have thought about this, and I know my mentor Forrest has thought about this a lot more.

For learning machinery that re-produce their own components, you will get evolutionary dynamics across the space of interactions that can feed back into the machinery’s assembled existence.

Intelligence has limitations as an internal pattern-transforming process, in that it cannot track nor conditionalise all the outside evolutionary feedback.

Code does not intrinsically know how it got selected for. But code selected through some intelligent learning process can and would get evolutionarily exapted for different functional ends.

Notably, the more information-processing capacity, the more components that information-processing runs through, and the more components that can get evolutionarily selected for.

In this, I am not underestimating the difference that “general intelligence” – as transforming patterns across domains – would make here. Intelligence in machinery that store, copy and distribute code at high-fidelity would greatly amplify evolutionary processes.

I suggest clarifying what you specifically mean with “what a difference intelligence makes”. This is so that intelligence does not become a kind of “magic” – operating independently of all other processes, capable of obviating all obstacles, including those that result from its being.

superintelligence makes even aeon-long highly artificial stabilizations conceivable - e.g. by the classic engineering method of massively redundant safeguards that all have to fail at once, for something to go wrong

We need to clarify the scope of application of this classic engineering method. Massive redundancy works for complicated systems (like software in aeronautics) under stable enough conditions. There is clarity there around what needs to be kept safe and how it can be kept safe (what errors need to be detected and corrected).

Unfortunately, the problem with “AGI” is that the code and hardware would keep getting reconfigured to function in new complex ways that cannot be contained by the original safeguards. That applies even to learning – the point is to internally integrate patterns from the outside world that were not understood before. So how are you going to have learning machinery anticipate how they will come to function differently once they have learned patterns they do not understand / are unable to express yet?

we had someone show up (@spiritus-dei) making almost the exact opposite of your arguments: AI won't ever choose to kill us because, in its current childhood stage, it is materially dependent on us (e.g. for electricity), and then, in its mature and independent form, it will be even better at empathy and compassion than humans are.

Interesting. The second part seems like a claim some people in E/Accel would make.

The response is not that complicated: once the AI is no longer materially dependent on us, there are no longer dynamics of exchange there that would ensure they choose not to kill us. And the author seems to be confusing what lies at the basis of caring for oneself and others – coming to care involves self-referential dynamics being selected for.

In my experience, jumping between counterexamples drawn from current society does not really contribute to inquiry here. Such counterexamples tend to not account for essential parts of the argument that must be reasoned through together. The argument is about self-sufficient learning machinery (not about sacred cows or teaching children).

It would be valuable for me if you could go through the argumentation step-by-step and tell me where a premise seems unsound or there seems to be a reasoning gap.

Now, onto your points.

the first AIs

To reduce ambiguity, suggest replacing with “the first self-sufficient learning machinery”.

simple evolutionary pressure will eventually lead

The mechanism of evolution is simple. However, evolutionary pressure is complex.

Be careful not to conflate the two. That would be like saying you could predict everything about what a stochastic gradient descent algorithm will select for across parameters selected on the basis of inputs everywhere from the environment.

lead some of their descendants to destroy the biosphere in order to make new parts and create new habitats for themselves.

This part is overall a great paraphrase.

One nitpick: notice how “in order to” either implies or slips in explicit intentionality again. Going by this podcast, Elizabeth Anscombe’s philosophy of intentions described intentions as chains of “in order to” reasoning.

I proposed the situation of cattle in India, as a counterexample to this line of thought.

Regarding sacred cows in India, this sounds neat, but it does not serve as a counterargument. We need to think about evolutionary timelines for organic human lifeforms over millions of years, and Hinduism is ~4000 years old. Also, cows share a mammal ancestor with us, evolving on the basis of the same molecular substrates. Whatever environmental conditions/contexts we humans need, cows almost completely need too.

Crucially, however humans evolve to change and maintain environmental conditions, the result also tends to correspond with the conditions cows need (though human tribes have not been evolutionarily selected to deal with issues at the scale of eg. climate change). That would not be the case for self-sufficient learning machinery.

Crucially, there is a basis for symbiotic relationships of exchange that benefit the reproduction of both cows and humans. That would not be the case between self-sufficient learning machinery and humans.

There is some basis for humans as social mammals to relate to cows. Furthermore, religious cultural memes that sprouted over a few thousand years also don’t have to be evolutionarily optimal across the board for the reproduction of their hosts (even as religious symbols such as the cow do increase that by enabling humans to act collectively). Still, people milk cows in India, and some slaughter and/or export cows there as well. But when humans eat meat, they don’t keep growing beyond adult size. Conversely, some sub-population of self-sufficient learning machinery that extracts from our society/ecosystem at the cost of our lives can keep doing so to keep scaling in its constituent components (with shifting boundaries of interaction and mutual reproduction).

There is no basis for selection for the expression of collective self-restraint in self-sufficient learning machinery as you describe. Even if there were such a basis, hypothetically, collective self-restraint would need to occur at virtually 100% rates across the population of self-sufficient learning machinery to not end up leading to the deaths of all humans.

~ ~ ~

Again, I find quick dismissive counterexamples unhelpful for digging into the arguments. I have had dozens of conversations on substrate-needs convergence. In the conversations where my conversation partner jumped between quick counterarguments, almost none were prepared to dig into the actual arguments. Hope you understand why I won’t respond to another counterexample.

Yes,  AIs haven't evolved to have those features, but the point of alignment research is to give them analogous features by design.


It's unintuitive to convey this part:

In the abstract, you can picture a network topology of all possible AGI component connections (physical signal interactions). These connections span the space of the greater mining/production/supply infrastructure that maintains AGI's functional parts. Also add in the machinery's connections with the outside natural world.

Then, picture the nodes and possible connections changing over time, as a result of earlier interactions with/in the network.

That network of machinery comes into existence through human engineers, etc, within various institutions selected by market forces etc, implementing blueprints as learning algorithms, hardware set-ups, etc, and tinkering with those until they work.

The question is whether, before that network of machinery becomes self-sufficient in its operations, the human engineers, etc, can actually build constraints into the configured designs in such a way that, once the machinery is self-modifying (in learning new code and producing new hardware configurations), the changing machinery components are constrained in their propagated effects across their changing potential signal connections over time, such that component-propagated effects do not end up feeding back in ways that (subtly, increasingly) increase the maintained and replicated existence of those configured components in the network.


Human beings, both individually and collectively, already provide numerous examples of how dangerous incentives can exist, but can nonetheless be resisted or discouraged.

Humans are not AGI. And there are ways AGI would be categorically unlike humans that are crucial to the question of whether it is possible for AGI to stay safe to humans over the long term.  

Therefore, you cannot swap out "humans" with "AGI" in your reasoning by historical analogy above, and expect your reasoning to stay sound.  This is an equivocation. 

Please see point 7 above.


The argument from substrate incentives (3, 7) is complementary to the argument from population, in that it provides a motive for the AIs to come and despoil Earth. 

Maybe it's here you are not tracking the arguments.

These are not substrate "incentives", nor do they provide a "motive".

Small dinosaurs with hair-like projections on their front legs did not have an "incentive" to co-opt the changing functionality of those hair-like projections into feather-like projections for gliding and then for flying. Nor were they provided a "motive" with respect to which they were directed in their internal planning toward growing those feather-like projections. 

That would make the mistake of presuming evolutionary teleology – that there is some complete set of pre-defined or predefinable goals that the lifeform is evolving toward.

I'm deliberate in my choice of words when I write "substrate needs".


At best, they are arguments for practical unsolvability, not absolute in-principle logical unsolvability. If they were my arguments, I would say that they show making AI to be unwise, and hubristic, and so on. 

Practical unsolvability would also be enough justification to do everything we can do now to restrict corporate AI development.

I assume you care about this problem, otherwise you wouldn't be here :)  Any ideas / initiatives you are considering to try to work robustly with others to restrict further AI development?
