Remmelt

Program Coordinator of AI Safety Camp.

Sequences

Bias in Evaluating AGI X-Risks
Developments toward Uncontrollable AI
Why Not Try Build Safe AGI?

Comments

Yes, the call to action of this post is that we need more epistemically diverse research!

This research community would be more epistemically healthy if we both researched what is possible for relaxed cases and what is not possible categorically under precise operationalisable definitions.

Thanks, reading the post again, I do see quite a lot of emphasis on ontological shifts:

"Then, the system takes that sharp left turn, and, predictably, the capabilities quickly improve outside of its training distribution, while the alignment falls apart."

 

I expect that AIs will be good enough at epistemology to do competent error correction and the problems you seem overly focused on are irrelevant.

How do you know that the degree of error correction possible will be sufficient to have any sound and valid guarantee of long-term AI safety? 

Again, people really cannot rely on your personal expectation when it comes to machinery that could lead to the deaths of everyone.

I'm looking for specific, well-thought-through arguments.

 

Do you believe that all attempts at alignment are flawed and that we should stop building powerful ASIs entirely?

Yes, that is the conclusion I reached after probing my mentor's argumentation for 1.5 years and concluding that the empirical premises are sound and the reasoning logically consistent.

Actually, that is switching to reasoning about something else. 

Reasoning that the alternative (humans interacting with each other) would lead to reliably worse outcomes is not the same as reasoning about why AGI would stay aligned in its effects on the world, so as to stay safe to humans.

And with that switch, you are not addressing Nate Soares' point that "capabilities generalize better than alignment".

Pretty sure that the problem of ensuring successor AIs are aligned to their predecessors is one that can be delegated to a capable and aligned AI.


What is your reasoning?

Thanks for the clear elaboration. 

I agree that natural abstractions would tend to get selected for in the agents that continue to exist and gain/uphold power to make changes in the world. Including because of Dutch-booking of incoherent preferences, because of instrumental convergence, and because relatively poorly functioning agents get selected out of the population.

However, those natural abstractions are still leaky, in a sense similar to how platonic concepts are leaky abstractions. The natural abstraction of a circle does not map precisely onto the actual physical shape of eg. a wheel identified in the outside world.

In this sense, whatever natural abstractions an AGI would use to let its learning machinery compress observations of actual physical instantiations of matter or energetic interactions into its modelling of the outside world, those abstractions would still fail to capture all the long-term-relevant features of that world.
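To make the leakiness concrete, here is a minimal sketch (my own toy model, not anything from the discussion): fit the single-radius "circle" abstraction to a slightly irregular wheel outline, and measure the physical detail the abstraction necessarily discards.

```python
import math
import random

random.seed(0)

# A "wheel" sampled in the physical world: a circle plus small irregularities
# (an illustrative model only, not a claim about real wheels).
true_radius = 1.0
points = []
for i in range(360):
    theta = math.radians(i)
    r = true_radius + random.gauss(0, 0.01) + 0.02 * math.sin(5 * theta)
    points.append((r * math.cos(theta), r * math.sin(theta)))

# The "natural abstraction": a single best-fit radius (centre assumed at origin).
fit_radius = sum(math.hypot(x, y) for x, y in points) / len(points)

# Residuals: the detail the abstraction fails to capture.
residuals = [abs(math.hypot(x, y) - fit_radius) for x, y in points]
print(f"fit radius: {fit_radius:.4f}")
print(f"max residual: {max(residuals):.4f}")  # nonzero: the abstraction leaks
```

The fitted radius summarizes the wheel well on average, yet every sampled point deviates from it; any downstream prediction made from the abstraction alone inherits those deviations.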

This point I'm sure is obvious to you. But it bears repeating.

 

That seems like a claim about the capabilities of arbitrarily powerful AI systems,

Yes, or more specifically: about fundamental limits of any AI system to control how its (side-)effects propagate and feed back over time.
 

one that relies on chaos theory or complex systems theory. 

Pretty much. Where "complex" refers to both internal algorithmic complexity (NP-computation branches, etc) and physical functional complexity (distributed non-linear amplifying feedback, etc).
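As a minimal illustration of that kind of non-linear amplifying feedback (a standard textbook toy, used here purely for illustration): in the logistic map, two trajectories that start a below-measurement-threshold distance apart diverge to macroscopically different states.

```python
def logistic(x, r=4.0):
    """One step of the logistic map, a standard toy chaotic system."""
    return r * x * (1 - x)

# Two initial states differing by 1e-9, below any realistic sensor resolution.
x, y = 0.400000000, 0.400000001
max_gap = 0.0
for _ in range(60):
    x, y = logistic(x), logistic(y)
    max_gap = max(max_gap, abs(x - y))

print(f"final states: {x:.6f} vs {y:.6f}; largest gap seen: {max_gap:.6f}")
```

The initial difference is amplified roughly exponentially, so within a few dozen steps the trajectories are uncorrelated; no observer who cannot measure below that threshold can predict the long-term state.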
 

I share your sentiment but doubt that things such as successor AI alignment will be difficult for ASIs.

This is not an argument. Given that people here are assessing what to do about x-risks, they should not rely on you stating your "doubt that...alignment will be difficult".

I doubt that you thought this through comprehensively enough, and that your reasoning addresses the fundamental limits to controllability I summarised in this post.

The burden of proof is on you to comprehensively clarify your reasoning, given that you are in effect claiming that extinction risks can be engineered away. 

You'd need to clarify specifically why functional components iteratively learned/assembled within AGI could have long-term predictable effects in physical interactions with the shifting connected surroundings of a more physically complex outside world.

I don't mind whether that's framed as "AGI redesigns a successor version of their physically instantiated components" or "AGI keeps persisting in some modified form".

Sure, I appreciate the open question!

That assumption is unsound with respect to what is sufficient for maintaining goal-directedness.

Any empirically-sound answer to the question of whether there is some way to describe a goal that is robust to ontological shifts (ie. define goals with respect to context-invariant perception of regular aspects of the environment, eg. somehow define diamonds by perception of tetrahedral carbon bonds) is still insufficient for solving the long-term safety of AGI.

This because what we are dealing with is machinery that continues to self-learn code from inputs, and continues to self-modify by replacing broken parts (perfect hardware copies are infeasible). 

Which the machinery will need to do to be self-sufficient. 
Ie. to adapt to the environment, to survive as an agent.

Natural abstractions are also leaky abstractions.
Meaning that even if AGI could internally define a goal robustly with respect to natural abstractions, AGI can still conceptually contain within its modelling of those natural abstractions only a tiny portion of the (side-)effects propagating through the environment – as a result of all the interactions of the machinery's functional components with connected physical surroundings.

Where such propagated effects will feed back into:
- changes in the virtualised code learned by the machinery based on sensor inputs.
- changes in the hardware configurations, at various levels of dependency, based on which configurations continue to exist and replicate.

We need to define the problem comprehensively enough. 
The scope of application of "Is there a way to define a goal in a way that is robust to ontological shifts" is not sufficient to address the overarching question "Can AGI be controlled to stay safe?".

To state the problem comprehensively enough, you need to include the global feedback dynamics that would necessarily happen through any AGI (as 'self-sufficient learning machinery') over time.


~ ~ ~
Here is also a relevant passage from the link I shared above:

- that saying/claiming that *some* aspects,
   at some levels of abstraction, that some things
   are sometimes generally predictable
   is not to say that _all_ aspects
   are _always_ completely predictable,
   at all levels of abstraction.

     - that localized details
     that are filtered out from content
     or irreversibly distorted in the transmission
     of that content over distances
     nevertheless can cause large-magnitude impacts
     over significantly larger spatial scopes.

     - that so-called 'natural abstractions'
     represented within the mind of a distant observer
     cannot be used to accurately and comprehensively
     simulate the long-term consequences
     of chaotic interactions
     between tiny-scope, tiny-magnitude
     (below measurement threshold) changes
     in local conditions.

     - that abstractions cannot capture phenomena
     that are highly sensitive to such tiny changes
     except as post-hoc categorizations/analysis
     of the witnessed final conditions.

   - where given actual microstate amplification phenomena
   associated with all manner of non-linear phenomena,
   particularly that commonly observed in
   all sorts of complex systems,
   up to and especially including organic biological humans,
   then it *can* be legitimately claimed,
    based on the fact of there being a kind of
    hard randomness associated with the atomic physics
    underlying all of the organic chemistry,
    that in fact (more than in principle)
    humans (and AGI) are inherently unpredictable,
   in at least some aspect, *all* of the time.
 

Thanks for your kind remarks.

But if technical uncontrollability would be firmly established, it seems to me that this would significantly change the whole AI xrisk space

Yes, we would need to shift focus to acting to restrict corporate-AI scaling altogether. Particularly: restrict data piracy, environmentally toxic compute, and model misuses (three dimensions through which AI corporations consolidate market power).

I am working with other communities (including digital creatives, environmentalists and military veterans) on litigation and lobbying actions to restrict those dimensions of AI power-consolidation.

I hope this post clarifies to others in AI Safety why there is no line of retreat. AI development will need to be restricted.

I would also like to see more research into the nontechnical side of alignment: how aggregatable are human values of different humans in principle? How to democratically control AI?

Yes. Consider too that these would be considerations on top of the question of whether AGI would be long-term safe (if AGI cannot be controlled to be long-term safe to humans, then we do not need to answer the more fine-grained questions about eg. whether human values are aggregatable).

Even if, hypothetically, long-term AGI safety were possible…

  • then you still have to deal with the limits on modelling, and consistently acting on, preferences expressed by billions of boundedly-rational humans from their (perceived) contexts. https://twitter.com/RemmeltE/status/1620762170819764229
  • and avoid consistently representing the preferences of malevolent, parasitic or short-term human actors who want to misuse/co-opt the system through any attack vectors they can find.
  • and deal with the fact that the preferences of many possible future humans, and of non-human living beings, will not automatically be represented in a system that AI corporations by default have built to represent currently living humans only (preferably, those who pay).

~ ~ ~

Here are also excerpts from Roman Yampolskiy’s 2021 paper relevant to aggregating democratically solicited preferences and human values:

Public Choice Theory

Eckersley looked at impossibility and uncertainty theorems in AI value alignment [198]. He starts with impossibility theorems in population ethics: “Perhaps the most famous of these is Arrow’s Impossibility Theorem [199], which applies to social choice or voting. It shows there is no satisfactory way to compute society’s preference ordering via an election in which members of society vote with their individual preference orderings...
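The Condorcet cycle behind Arrow-style results can be sketched in a few lines (an illustrative three-voter electorate of my own construction, not from the paper): pairwise majority voting produces a cycle, so no transitive societal ordering agrees with all the pairwise majorities.

```python
from itertools import permutations

# Three voters with the classic Condorcet-cycle preference orderings.
voters = [
    ["A", "B", "C"],   # voter 1: A > B > C
    ["B", "C", "A"],   # voter 2: B > C > A
    ["C", "A", "B"],   # voter 3: C > A > B
]

def majority_prefers(x, y):
    """True if a strict majority of voters ranks x above y."""
    wins = sum(1 for ranking in voters if ranking.index(x) < ranking.index(y))
    return wins > len(voters) / 2

# Pairwise majorities form a cycle: A beats B, B beats C, C beats A.
print(majority_prefers("A", "B"), majority_prefers("B", "C"), majority_prefers("C", "A"))

# Hence no transitive societal ordering is consistent with every pairwise majority.
consistent = [
    order for order in permutations("ABC")
    if all(majority_prefers(order[i], order[j])
           for i in range(3) for j in range(i + 1, 3))
]
print(consistent)  # empty: no satisfactory societal preference ordering exists
```

Each individual ordering is perfectly coherent; the impossibility only appears at the aggregation step, which is the point of Arrow's theorem.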

Value Alignment

It has been argued that “value alignment is not a solved problem and may be intractable (i.e. there will always remain a gap, and a sufficiently powerful AI could ‘exploit’ this gap, just like very powerful corporations currently often act legally but immorally)” [258]. Others agree: “‘A.I. Value Alignment’ is Almost Certainly Intractable... I would argue that it is un-overcome-able. There is no way to ensure that a super-complex and constantly evolving value system will ‘play nice’ with any other super-complex evolving value system.” [259]. Even optimists acknowledge that it is not currently possible: “Figuring out how to align the goals of a superintelligent AI with our goals isn’t just important, but also hard. In fact, it’s currently an unsolved problem.” [118]. Vinding says [78]: “It is usually acknowledged that human values are fuzzy, and that there are some disagreements over values among humans. Yet it is rarely acknowledged just how strong this disagreement in fact is... Different answers to ethical questions ... do not merely give rise to small practical disagreements; in many cases, they imply completely opposite practical implications. This is not a matter of human values being fuzzy, but a matter of them being sharply, irreconcilably inconsistent. And hence there is no way to map the totality of human preferences, ‘X’, onto a single, well-defined goal-function in a way that does not conflict strongly with the values of a significant fraction of humanity. This is a trivial point, and yet most talk of human-aligned AI seems oblivious to this fact...
The second problem and point of confusion with respect to the nature of human preferences is that, even if we focus only on the present preferences of a single human, then these in fact do not, and indeed could not possibly, determine with much precision what kind of world this person would prefer to bring about in the future.” A more extreme position is held by Turchin who argues that “‘Human Values’ don’t actually exist” as stable coherent objects and should not be relied on in AI safety research [260]. Carlson writes: “Probability of Value Misalignment: Given the unlimited availability of an AGI technology as enabling as ‘just add goals’, then AGI-human value misalignment is inevitable. Proof: From a subjective point of view, all that is required is value misalignment by the operator who adds to the AGI his/her own goals, stemming from his/her values, that conflict with any human’s values; or put more strongly, the effects are malevolent as perceived by large numbers of humans. From an absolute point of view, all that is required is misalignment of the operator who adds his/her goals to the AGI system that conflict with the definition of morality presented here, voluntary, non-fraudulent transacting ... i.e. usage of the AGI to force his/her preferences on others.”

if we have a goal described in a way that is robust to ontological shifts due to the Natural Abstractions Hypothesis holding in some way, then one can simply provide this AI system with this goal and allow it to do whatever it considers necessary to maximize that goal.

This is not a sound assumption when it comes to continued implementation in the outside world. Therefore, reasoning based on that assumption about how alignment would work within a mathematical toy model is also unsound.

https://mflb.com/ai_alignment_1/si_safety_qanda_out.html#p9

I think the distinction you are trying to make is roughly that between ‘implicit/aligned control’ and ‘delegated control’ as terms used in this paper: https://dl.acm.org/doi/pdf/10.1145/3603371

Both still require control feedback processes built into the AGI system/infrastructure.

Can you think of any example of an alignment method being implemented soundly in practice without use of a control feedback loop?
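For concreteness, by "control feedback loop" I mean something like the following generic sketch (a bare proportional controller, purely illustrative of the term, not any specific alignment method): measure an output, compare it against a reference, and apply a correction.

```python
def feedback_step(state, reference, gain=0.5):
    """One iteration of a negative feedback loop: detect deviation, correct."""
    error = reference - state       # deviation from the intended value
    return state + gain * error     # corrective action proportional to deviation

# Repeatedly sensing and correcting drives the state toward the reference.
state, reference = 0.0, 1.0
for _ in range(20):
    state = feedback_step(state, reference)

print(f"state after 20 corrections: {state:.6f}")
```

The point of the question above is that every alignment method implemented in practice seems to rely on some loop of this detect-and-correct form, built into the system or its surrounding infrastructure.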
