tl;dr: This is a project to put a model of AGI uncontainability into mathematical notation. The goal is to make the reasoning straightforward for other researchers to verify.
It is a standard practice in computer science to first show that a problem doesn’t belong to a class of unsolvable problems before investing resources into trying to solve it or deciding what approaches to try. — Roman Yampolskiy
For years, researchers have tried to invent methods to control ‘AGI’, based on the assumption that it must be possible to control some ‘AGI’ well enough that it does not eventually cause extinction.
If this assumption proves false, it is crucial for the AI Safety community to know. It would shift our priorities away from trying to somehow make ‘AGI’ safe, toward acting to restrict ‘AGI’ development.
Dozens of academic results demonstrate fundamental limits to control. In addition, our advising researcher has shared various forms of reasoning for the uncontainability of any ‘AGI’. However, no theorem has yet been constructed in mathematical notation that proves uncontainability (i.e. insufficiently “complete” control) from empirically sound premises. This project is about attempting to construct and verify such a theorem.
Our circle of collaborators includes Anders Sandberg, our advising researcher Forrest and his research assistant Justin, and the project lead Remmelt Ellen. We are funded by two grants from the Survival and Flourishing Fund.
We want to convey the reasoning in a form that can be verified by other researchers. This is hard.
Historically, impossibility theorems based on incompleteness have involved large inferential gaps or difficulties in reaching scientific consensus.
Forrest has already formalized his reasoning for why AGI’s effects would be uncontainable (and, furthermore, would converge on human-lethal conditions). But he did so using a combination of computer programs and precise analytical language, a form that is not accessible to other researchers. We need to put the reasoning into mathematical notation, so that other researchers can verify the logical validity of the result.
Along with the math, we must publish an explanation of what the terms mean and how the premises correspond to how the physical world works. Without that explanation, it is not even clear to other researchers what the reasoning is about.
To make the explanation easier, Forrest found a way to distill the argument into a simple, intuitive model over five conversations with Anders.
For this project, you can join a team to construct a mathematical proof of AGI uncontainability based on this reasoning, or work with Anders to identify proof methods and later verify the math (identifying any validity or soundness issues).
We will meet with Anders to work at a whiteboard for a day (in Oxford or Stockholm). Depending on progress, we may do a longer co-working weekend. From there, we will draft one or more papers.
Update: an excellent mathematician from outside AI Safety will start independent work in December toward a math paper. You could offer feedback to help verify the math and/or communicate the relevance of this string of math symbols to the community.
A key issue is the thesis that AGI will be uncontrollable in the sense that there is no control mechanism that can guarantee aligned behavior, since, as the target behavior becomes more complex and abstract, the amount of resources and forcing ability needed becomes unattainable. In order to analyse this better, a sufficiently general toy model is needed for how controllable systems of different complexity can be, one that ideally can be analysed rigorously. One such model is to study families of binary functions parametrized by their circuit complexity and their "hashiness" (how much they mix information) as an analog for the AGI and the alignment model, and the limits to finding predicates that let the alignment system keep the AGI analog producing a desired output. — Anders Sandberg
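To make that intuition concrete, here is a minimal toy sketch in Python. This is our own illustration under our own assumptions, not Anders’ model and not Forrest’s formalism; every function name and parameter below is made up for demonstration. It interpolates a family of Boolean functions between a simple structured rule and a cryptographic hash, then measures how well a bounded “controller” that may only read a few input bits can predict the output.

```python
# Toy illustration of the hashiness intuition above. This is our own sketch,
# not Anders' model or Forrest's formalism; all names and parameters here
# are made up for demonstration.
#
# We define a family of Boolean functions f_h : {0,1}^n -> {0,1}, indexed by
# a mixing level h in [0, 1]. Each input is deterministically routed either
# through a simple structured rule (parity of the first three bits) or
# through a cryptographic hash (maximal mixing), with the fraction of
# hash-routed inputs growing with h. A bounded "controller" that may only
# read a few input bits then tries to predict the output; its accuracy is a
# crude stand-in for how much a cheap monitor can constrain the system.

import hashlib
import random

N_BITS = 12          # input length n
MONITOR_BUDGET = 3   # number of input bits the bounded controller may read
N_SAMPLES = 4000     # Monte Carlo samples per hashiness level


def simple_rule(bits):
    """Low-complexity behaviour: parity of the first three input bits."""
    return bits[0] ^ bits[1] ^ bits[2]


def hashy_rule(bits):
    """Maximally mixing behaviour: one output bit of a cryptographic hash."""
    return hashlib.sha256(bytes(bits)).digest()[0] & 1


def make_family_member(h, seed):
    """Return f_h: each input is routed to the hashy rule with frequency ~h."""
    def f(bits):
        # Deterministic per-input coin flip, so f is a fixed Boolean function.
        coin = random.Random(hash((tuple(bits), seed))).random()
        return hashy_rule(bits) if coin < h else simple_rule(bits)
    return f


def monitor_accuracy(f, budget, n_samples, rng):
    """In-sample accuracy of the best lookup-table predictor over `budget` bits.

    The controller memorises, for each value of the observed bits, the
    majority output it has seen -- the strongest predicate it can implement
    with that observation budget.
    """
    counts = {}   # observed bits -> [count of output 0, count of output 1]
    samples = []
    for _ in range(n_samples):
        bits = [rng.randint(0, 1) for _ in range(N_BITS)]
        y = f(bits)
        key = tuple(bits[:budget])
        counts.setdefault(key, [0, 0])[y] += 1
        samples.append((key, y))
    correct = sum(1 for key, y in samples
                  if counts[key][y] >= counts[key][1 - y])
    return correct / n_samples


if __name__ == "__main__":
    for h in (0.0, 0.25, 0.5, 0.75, 1.0):
        f = make_family_member(h, seed=42)
        acc = monitor_accuracy(f, MONITOR_BUDGET, N_SAMPLES, random.Random(1))
        print(f"hashiness={h:.2f}  bounded-monitor accuracy={acc:.3f}")
```

Running this should print accuracy near 1.0 at hashiness 0 and near 0.5 (chance) at hashiness 1: as the function mixes more information, a monitor cheaper than the system itself loses its predictive grip. This only illustrates the intuition; the project’s aim is to make claims of this kind rigorous.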
To start digging into the hashiness model:
Part of the work for this project is to turn these one-on-one explanations into your own write-up. This way, Forrest can give input, and we can check whether we’re on the same page. This helps reduce important confusions before moving on to doing the math.
Please be prepared to take time to put the pieces of the argument together. We’re at the frontiers and it’s going to be messy at first. It took 8 hours of conversation to clearly convey a model of uncontainability to Anders, who is well known for his ability to quickly dig deep into almost any academic field at the Oxford lunch table.
While we would not want you to take the truth of the hashiness model on faith, we do ask you to hold back on trying to find flaws in the model before you have acquired an accurate understanding of it. If you have not first taken the time to understand the model, you’re likely to attack a straw version of it, which is not very useful. This means that to some extent we’re asking you to trust that digging into the hashiness model is worth your time, without us being able to provide good evidence for this up-front. This is unfortunately unavoidable.
If you don’t have the patience for this type of project, that is ok, but it also means this project is not for you. Both Remmelt and Forrest have ample experience trying to convey reasoning to people who are not willing to spend the time it takes to understand it, and it usually ends in disappointment and frustration on both sides.
If this project goes well, we’ll have a much more accessible formulation by April. Meanwhile, we are very happy to take questions meant to clarify the model’s premises, terms, and reasoning.
We are also open to exploring different ways of collaborating. We want you to have the flexibility and freedom to work through the reasoning yourself, and if something about the team set-up is not going well, we want to know!
By April, we will have drafted at least one paper that:
Next, we will extend the uncontainability model to include an uncontainable dynamic that converges on the extinction of organic life. The former formalizes why the internal control feedback loop is insufficiently comprehensive to prevent eventual human-lethal outcomes, whereas the latter formalizes why an external feedback loop is sufficiently comprehensive to eventually converge on human-lethal outcomes. This external feedback loop involves changes in code (as configured out of AGI’s artificial substrate) converging over time on code that is expressed as physical effects which maintain and/or increase that code.
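As a way to picture the contrast between the two feedback loops, here is a rough notational sketch of our own. None of these symbols come from Forrest’s formalism, and the actual extension will need much more careful definitions; treat this purely as a reading aid.

```latex
% Rough sketch in our own notation -- not Forrest's formalism. All symbols
% below are placeholders invented to picture the two feedback loops.
%
%   E    : physical effects that AGI's components can come to cause
%   E_C  : effects the internal control loop can detect, model, and correct
%          within its sensing, compute, and time budgets
%   E_H  : effects incompatible with continued human survival
%   m(c) : degree to which code variant c's expressed physical effects
%          maintain and/or increase that code

% Internal loop (uncontainability): the correctable subset is strictly
% smaller than the reachable effects and misses some human-lethal ones.
\[
  E_C \subsetneq E
  \qquad\text{and}\qquad
  E_H \cap \left( E \setminus E_C \right) \neq \emptyset .
\]

% External loop (convergence): code variants are retained roughly in
% proportion to how self-maintaining their expressed effects are, so
% repeated variation and retention biases the code toward higher m,
% i.e. toward effects that maintain and increase that code.
\[
  \Pr\!\left[\, c \text{ is retained at step } t+1 \,\right] \;\propto\; m(c) .
\]
```

The first statement gestures at why the internal loop stays insufficiently comprehensive; the second at why the external loop keeps operating regardless of what the internal loop does.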
→ To check out the roles you can take, see the doc.
→ You can apply at AI Safety Camp. Official deadline is 17 November.