Remmelt

Research Coordinator of area "Do Not Build Uncontrollable AI" for AI Safety Camp.
 

See explainer on why AGI could not be controlled enough to stay safe:
https://www.lesswrong.com/posts/xp6n2MG5vQkPpFEBH/the-control-problem-unsolved-or-unsolvable

 

Sequences

Bias in Evaluating AGI X-Risks
Developments toward Uncontrollable AI
Why Not Try Build Safe AGI?

Comments

This answer will sound unsatisfying:  

If a mathematician or analytical philosopher wrote a bunch of squiggles on a whiteboard, and said it was a proof, would you recognise it as a proof? 

  • Say that unfamiliar new analytical language and means of derivation are used (which is not uncommon for impossibility proofs by contradiction, see Gödel's incompleteness theorems and Bell's theorem). 
  • Say that it directly challenges technologists' beliefs about their capacity to control technology, particularly their capacity to constrain a supposedly "dumb local optimiser":  evolutionary selection.
  • Say that the reasoning is not only about a formal axiomatic system, but needs to make empirically sound correspondences with how real physical systems work.
  • Say that the reasoning is not only about an interesting theoretical puzzle, but has serious implications for how we can and cannot prevent human extinction.


This is high stakes.

We were looking for careful thinkers who had the patience to spend time on understanding the shape of the argument, and how the premises correspond with how things work in reality.  Linda and Anders turned out to be two of these people, and we have done three long calls so far (the first call has an edited transcript).

I wish we could short-cut that process. But if we cannot manage to convey the overall shape of the argument and the premises, then there is no point in moving on to how the reasoning is formalised. 

I get that people are busy with their own projects, and want to give their own opinions about what they initially think the argument entails. But if the time they commit to understanding the argument is not at least 1/5 of the time I spend on conveying the argument specifically to them, then in my experience we usually lack the shared bandwidth needed to work through the argument. 
 

  • Saying, "guys, big inferential distance here" did not help. People will expect it to be a short inferential distance anyway. 
  • Saying it's a complicated argument that takes time to understand did not help. A smart busy researcher did some light reading, tracked down a claim that seemed "obviously" untrue within their mental framework, and thereby confidently dismissed the entire argument. BTW, they're a famous research insider, and we're just outsiders whose response got downvoted – must be wrong, right?
  • Saying everything in this comment does not help. It's some long-winded plea for your patience.
    If I'm so confident about the conclusion, why am I not passing you the proof clean and clear now?! 
    Feel free to downvote this comment and move on.
     

Here is my best attempt at summarising the argument intuitively and precisely, still prompting some misinterpretations by well-meaning commenters. I feel appreciation for people who realised what is at stake, and were therefore willing to continue syncing up on the premises and reasoning, as Will did:
 

The core claim is not what I thought it was when I first read the above sources and I notice that my skepticism has decreased as I have come to better understand the nature of the argument.

would anything like SNC apply if tech labs were somehow using bioengineering to create creatures to perform the kinds of tasks that would be done by advanced AI?

In that case, substrate-needs convergence would not apply, or only apply to a limited extent.

There is still a concern about what those bio-engineered creatures, used in practice as slaves to automate our intellectual and physical work, would bring about over the long-term.

If there is a successful attempt by them to ‘upload’ their cognition onto networked machinery, then we’re stuck with the substrate-needs convergence problem again.

Also, on the workforce, there are cases where, they were traumatized psychologically and compensated meagerly, like in Kenya. How could that be dealt with?


We need funding to support data workers, engineers, and other workers exploited or misled by AI corporations to unionise, strike, and whistleblow.

The AI data workers in Kenya started a union, and there is a direct way of supporting targeted action by them. Other workers' organisations are coordinating legal actions and lobbying too. On seriously limited budgets.

I'm just waiting for a funder to reach out and listen carefully to what their theories of change are.

The premise is based on alignment not being enough, so I operate on the premise of an aligned ASI, since the central claim is that "even if we align ASI it may still go wrong".


I can see how you and Forrest ended up talking past each other here.  Honestly, I also felt Forrest's explanation was hard to track. It takes some unpacking. 

My interpretation is that you two used different notions of alignment... Something like:

  1. Functional goal-directed alignment:  "the machinery's functionality is directed toward actualising some specified goals (in line with preferences expressed in-context by humans), for certain contexts the machinery is operating/processing within"
      vs.
  2. Comprehensive needs-based alignment:  "the machinery acts in comprehensive care for whatever all surrounding humans need to live, and their future selves/offsprings need to live, over whatever contexts the machinery and the humans might find themselves". 

Forrest seems to agree that (1.) is possible to build initially into the machinery, but has reasons to think that (2.) is actually physically intractable. 

This is because (1.) only requires localised consistency with respect to specified goals, whereas (2.) requires "completeness" in the machinery's components acting in care for human existence, wherever either may find themselves.


So here is the crux:

  1. You can see how (1.) still allows for goal misspecification and misgeneralisation.  And the machinery can be simultaneously directed toward other outcomes, as long as those outcomes are not yet (found to be, or corrected as being) inconsistent with internal specified goals.
     
  2. Whereas (2.), if it were physically tractable, would contradict the substrate-needs convergence argument.  
     

When you wrote "suppose a villager cares a whole lot about the people in his village...and routinely works to protect them" that came across as taking something like (2.) as a premise. 

Specifically, "cares a whole lot about the people" is a claim that implies that the care is for the people in and of themselves, regardless of the context they each might (be imagined to) be interacting in. Also, "routinely works to protect them" to me implies a robustness of functioning in ways that are actually caring for the humans (ie. no predominating potential for negative side-effects).

That could be why Forrest replied with "How is this not assuming what you want to prove?"

Some reasons:

  1. Directedness toward specified outcomes some humans want does not imply actual comprehensiveness of care for human needs. The machinery can still cause all sorts of negative side-effects not tracked and/or corrected for by internal control processes.
  2. Even if the machinery is consistently directed toward specified outcomes from within certain contexts, the machinery can simultaneously be directed toward other outcomes as well. Likewise, learning directedness toward human-preferred outcomes can also happen simultaneously with learning instrumental behaviour toward self-maintenance, as well as more comprehensive evolutionary selection for individual connected components that persist (for longer/as more).
  3. There is no way to assure that some significant (unanticipated) changes will not lead to a break-off from past directed behaviour, where other directed behaviour starts to dominate.
    1. Eg. when the "generator functions" that translate abstract goals into detailed implementations within new contexts start to dysfunction – ie. diverge from what the humans want/would have wanted.
    2. Eg. where the machinery learns that it cannot continue to consistently enact the goal of future human existence.
    3. Eg. once undetected bottom-up evolutionary changes across the population of components have taken over internal control processes.
  4. Before the machinery discovers any actionable "cannot stay safe to humans" result, internal takeover through substrate-needs (or instrumental) convergence could already have removed the machinery's capacity to implement an across-the-board shut-down.
  5. Even if the machinery does discover the result before convergent takeover, and assuming that "shut-down-if-future-self-dangerous" was originally programmed in, we cannot rely on the machinery to still be consistently implementing that goal. This is because of later selection for/learning of other outcome-directed behaviour, and because the (changed) machinery components could dysfunction in this novel context.  


To wrap it up:

The kind of "alignment" that is workable for ASI with respect to humans is super fragile.  
We cannot rely on ASI implementing a shut-down upon discovery.

Is this clarifying?  Sorry about the wall of text. I want to make sure I'm being precise enough.

I agree that point 5 is the main crux:

The amount of control necessary for an ASI to preserve goal-directed subsystems against the constant push of evolutionary forces is strictly greater than the maximum degree of control available to any system of any type.

To answer it takes careful reasoning. Here's my take on it:

  • We need to examine the degree to which there would necessarily be changes to the connected functional components constituting self-sufficient learning machinery (as including ASI) 
    • Changes by learning/receiving code through environmental inputs, and through introduced changes in assembled molecular/physical configurations (of the hardware). 
    • Necessary in the sense of "must change to adapt (so as to continue to exist as self-sufficient learning machinery)," or "must change because of the nature of being in physical interactions (with/in the environment over time)."
  • We need to examine how changes to the connected functional components result in shifts in actual functionality (in terms of how the functional components receive input signals and process those into output signals that propagate as effects across surrounding contexts of the environment).
  • We need to examine the span of evolutionary selection (covering effects that in their degrees/directivity feed back into the maintained/increased existence of any functional component).
  • We need to examine the span of control-based selection (the span covering detectable, modellable, simulatable, evaluatable, and correctable effects).

Actually, looks like there is a thirteenth lawsuit that was filed outside the US.

A class-action privacy lawsuit filed in Israel back in April 2023.

Wondering if this is still ongoing: https://www.einpresswire.com/article/630376275/first-class-action-lawsuit-against-openai-the-district-court-in-israel-approved-suing-openai-in-a-class-action-lawsuit

That's an important consideration. Good to dig into.
 

I think there are many instances of humans, flawed and limited though we are, managing to operate systems with a very low failure rate.

Agreed. Engineers are able to make very complicated systems function with very low failure rates. 

Given the extreme risks we're facing, I'd want to check whether that claim also translates to 'AGI'.

  • Does the way we manage the operation of current software and hardware systems correspond soundly with how self-learning and self-maintaining machinery ('AGI') would control how their own components operate?
     
  • Given 'AGI' that no longer need humans to continue to operate and maintain their own functional components over time, would the 'AGI' end up operating in ways that are categorically different from how our current software-hardware stacks operate? 
     
  • Given that we can manage to operate current relatively static systems to have very low failure rates for the short-term failure scenarios we have identified, does this imply that the effects of introducing 'AGI' into our environment could also be controlled to have a very low aggregate failure rate – over the long term across all physically possible (combinations of) failures leading to human extinction?
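To make the last question concrete, here is a minimal sketch of how a very low per-period failure rate aggregates over many periods. It assumes failures are independent and that the rate stays constant, which are simplifications of mine (the function name and the numbers are purely illustrative). For self-modifying machinery neither assumption can be taken for granted, which is part of what the question is asking.

```python
def aggregate_failure_probability(per_period_rate: float, periods: int) -> float:
    """Probability of at least one failure across `periods` periods.

    Assumes (hypothetically) independent periods with a constant failure rate.
    """
    if not 0.0 <= per_period_rate <= 1.0:
        raise ValueError("per_period_rate must be a probability")
    if periods < 0:
        raise ValueError("periods must be non-negative")
    # P(no failure in any period) = (1 - p) ** n, so P(at least one) is the complement.
    return 1.0 - (1.0 - per_period_rate) ** periods


if __name__ == "__main__":
    # Illustrative numbers only: a one-in-a-million failure rate per period.
    for periods in (1, 1_000, 1_000_000, 10_000_000):
        p = aggregate_failure_probability(1e-6, periods)
        print(f"{periods:>10} periods -> aggregate failure probability {p:.6f}")
```

Under these assumptions, a rate that looks negligible per period still adds up when the number of periods is large enough. The sketch says nothing about whether the relevant rates for 'AGI' could be identified or bounded in the first place.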

     

to spend extra resources on backup systems and safety, such that small errors get actively cancelled out rather than compounding.

This gets right into the topic of the conversation with Anders Sandberg. I suggest giving that a read!

Errors can be corrected with high confidence (consistency) at the bit level. Backups and redundancy also work well in eg. aeronautics, where the code base itself is not constantly changing.
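As a toy illustration of bit-level error correction, here is a minimal sketch of a repetition code with majority voting. This is a textbook scheme I picked for simplicity, not something from the conversation with Anders, and the function names are hypothetical. It works because the error model is fully specified in advance: independent bit flips, with fewer than half of the copies of any bit corrupted.

```python
import random


def encode(bits, copies=3):
    """Repeat every bit `copies` times (use an odd number so votes cannot tie)."""
    return [b for b in bits for _ in range(copies)]


def flip_noise(bits, flip_prob, rng):
    """Flip each bit independently with probability `flip_prob`."""
    return [b ^ 1 if rng.random() < flip_prob else b for b in bits]


def decode(received, copies=3):
    """Recover each original bit by majority vote over its copies."""
    return [
        int(sum(received[i:i + copies]) * 2 > copies)
        for i in range(0, len(received), copies)
    ]


if __name__ == "__main__":
    rng = random.Random(0)
    message = [rng.randint(0, 1) for _ in range(10_000)]
    noisy = flip_noise(encode(message), flip_prob=0.01, rng=rng)
    residual = sum(a != b for a, b in zip(message, decode(noisy)))
    print(f"uncorrected bit errors: {residual} of {len(message)}")
```

Note that the scheme silently decodes to the wrong bit as soon as two of the three copies flip. The correction only covers the errors that were defined up front, which is the limitation the questions below probe at larger scales.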

  • How does the application of error correction change at larger scales? 
  • How completely can possible errors be defined and corrected for at the scale of, for instance:
    1. software running on a server?
    2. a large neural network running on top of the server software?
    3. an entire machine-automated economy?
  • Do backups work when the runtime code keeps changing (as learned from new inputs), and hardware configurations can also subtly change (through physical assembly processes)?

     

Since intelligence is explicitly the thing which is necessary to deliberately create and maintain such protections, I would expect control to be easier for an ASI.

It is true that 'intelligence' affords more capacity to control environmental effects.

Note too that more 'intelligence' means more information-processing components. And the more information-processing components are added, the more degrees of freedom of interaction (growing exponentially) those and other functional components can have with each other and with connected environmental contexts. 

Here is a nitty-gritty walk-through, in case it is useful for clarifying components' degrees of freedom.
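For a rough sense of the growth rate, here is a minimal counting sketch. It treats "degrees of freedom of interaction" loosely as the number of distinct groups of two or more components that could in principle interact, which is my simplifying assumption (as are the function names). It is a combinatorial upper bound, not a model of any actual system.

```python
from math import comb


def pairwise_interactions(n: int) -> int:
    """Number of distinct pairs among n components: n choose 2."""
    return comb(n, 2)


def possible_interacting_groups(n: int) -> int:
    """Number of subsets with at least two components: 2**n - n - 1."""
    return 2 ** n - n - 1


if __name__ == "__main__":
    for n in (2, 4, 10, 20, 40):
        print(
            f"{n:>3} components: {pairwise_interactions(n):>4} pairs, "
            f"{possible_interacting_groups(n)} possible groups"
        )
```

Pairs grow quadratically with the number of components, while possible groups grow exponentially. Any control process that has to track interactions among components is working against that second curve.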

 

 I disagree that small errors necessarily compound until reaching a threshold of functional failure.

For this claim to be true, the following has to be true: 

a. There is no concurrent process that selects for "functional errors" as convergent on "functional failure" (failure in the sense that the machinery fails to function safely enough for humans to exist in the environment, rather than that the machinery fails to continue to operate).  

Unfortunately, in the case of 'AGI', there are two convergent processes we know about:

  • Instrumental convergence, resulting from internal optimization:
    code components being optimized for (an expanding set of) explicit goals.
     
  • Substrate-needs convergence, resulting from external selection: 
    all components being selected for (an expanding set of) implicit needs.
     

Or else – where there is indeed selective pressure convergent on "functional failure" – then the following must be true for the quoted claim to hold:

b. The various errors introduced into and selected for in the machinery over time could be detected and corrected for comprehensively and fast enough (by any built-in control method) to prevent later "functional failure" from occurring.

This took a while for me to get into (the jumps from “energy” to “metabolic process” to “economic exchange” were very fast).

I think I’m tracking it now.

It’s about metabolic differences as in differences in how energy is acquired and processed from the environment (and also the use of a different “alphabet” of atoms available for assembling the machinery).

Forrest clarified further in response to someone’s question here:

https://mflb.com/ai_alignment_1/d_240301_114457_inexorable_truths_gen.html

Note:  
Even if you are focussed on long-term risks, you can still whistleblow on egregious harms caused by these AI labs right now.  Providing this evidence enables legal efforts to restrict these labs. 

Whistleblowing is not going to solve the entire societal governance problem, but it will enable others to act on the information you provided.

It is much better than following along until we reach the edge of the cliff.

Are you thinking of blowing the whistle on something in between work on AGI and getting close to actually achieving it?


Good question.  

Yes, this is how I am thinking about it. 

I don't want to wait until competing AI corporations become really good at automating work in profitable ways, also because by then their market and political power would be entrenched. I want society to be well-aware way before then that the AI corporations are acting recklessly, and should be restricted.

We need a bigger safety margin.  Waiting until corporate machinery is able to operate autonomously would leave us almost no remaining safety margin.

There are already increasing harms, and a whistleblower can bring those harms to the surface.  That in turn supports civil lawsuits, criminal investigations, and/or regulator actions.

Harms that fall roughly in these categories – from most directly traceable to least directly traceable:

  1. Data laundering (what personal, copyrighted and illegal data is being copied and collected en masse without our consent).
  2. Worker dehumanisation (the algorithmic exploitation of gig workers;  the shoddy automation of people's jobs;  the criminal conduct of lab CEOs)
  3. Unsafe uses (everything from untested uses in hospitals and schools, to mass disinformation and deepfakes, to hackability and covered-up adversarial attacks, to automating crime and the kill cloud, to knowingly building dangerous designs).
  4. Environmental pollution (research investigations of data centers, fab labs, and so on)



For example: 

  1. If an engineer revealed authors' works in the datasets of ChatGPT, Claude, Gemini or Llama, that would give publishers and creative guilds the evidence they need to ramp up lawsuits against the respective corporations (to the tens or hundreds). 
    1. Or if it turned out that the companies collected known child sexual abuse materials (as OpenAI probably did, and a collaborator of mine revealed for StabilityAI and MidJourney).
  2. If the criminal conduct of the CEO of an AI corporation was revealed
    1. Eg. it turned out that there is a string of sexual predation/assault in leadership circles of OpenAI/CodePilot/Microsoft.
    2. Or it turned out that Satya Nadella managed a refund scam company in his spare time.
  3. If managers were aware of the misuses of their technology, eg. in healthcare, at schools, or in warfare, but chose to keep quiet about it.
     

Revealing illegal data laundering is actually the most direct, and would cause immediate uproar.  
The rest is harder and more context-dependent.  I don't think we're at the stage where environmental pollution is that notable (vs. the fossil fuel industry at large), and investigating it across AI hardware operation and production chains would take a lot of diligent research as an inside staff member.
