Thanks for the post! I think the main problem is that the abstract does not give enough of a feel for the core content of the paper, so people are mostly not trying to dive into the paper (they can't evaluate from the abstract whether it is promising enough to be worth the effort).
I uploaded the paper PDF to GPT-5 Thinking and asked:
Hi, I am trying to get a high-level summary of the text I just uploaded. I have read its abstract, but I don't know what the TTQ stands for, or what are the main ideas used to formulate the Outer Alignment Precondition and the TTQ.
and the model produced a couple of pages of detailed summary:
https://chatgpt.com/share/68a5faef-c050-8010-8392-20772cd6a370
I wonder if this could be formulated in a more readable fashion and included in the abstract, so that readers of the abstract would get a better impression of what's inside the paper.
Hi mishka, thanks for commenting. TBH LLM-based chatbots don't really understand either their input prompts or their output continuations, so any LLM-based summary is not going to be particularly reliable. Subsequent to your comment, I have added a TL;DR section, borrowed from the paper's introduction - I hope this helps at least a little. Other than that, I'm afraid there's no real substitute for actually reading the paper in full - I hope you choose to do so!
Abstract: The way in which AI (and particularly superintelligent AGI) develops over the coming decades will determine the subsequent fate of all humanity for all eternity. In order to maximise the net benefit of AGI for all humanity, without favouring any subset thereof, we imagine a Gold-Standard AGI that is maximally-aligned, maximally-validated, and maximally-superintelligent. The first of these three properties --- alignment --- is traditionally decomposed into outer alignment (how do we define a final goal FG_S that correctly states what we want?) and inner alignment (how do we build an agent S that forever pursues FG_S as intended?). This paper addresses the former problem (outer alignment), assuming that S is superintelligent (hence "superalignment"). In this regard, we formulate a final goal TTQ and a corresponding Outer Alignment Precondition OAP such that, if a goal-less superintelligent agent S^- satisfies OAP (irrespective of the specific technology used to implement S^-), then the final goal TTQ works as intended ("strives to maximise the net benefit of AGI for all humanity, without favouring any subset thereof"); that is, the superintelligent agent S (where S = S^- + TTQ) forever strives (to the best of its ability, which is at least that of any human) to behave in a manner that is maximally aligned with a maximally fair aggregation of the individual idealised (i.e. actual, rational, well-informed, and freely-determined) preferences of all human beings (living or future). The OAP requirements set a deliberately high bar; accordingly, in order to demonstrate their feasibility, we briefly outline a proposed neurosymbolic, non-LLM-based, OAP-compliant cognitive architecture, and corresponding construction sequence, for a Gold-Standard AGI called BigMom.
TL;DR
This paper addresses the first step of our top-down approach to AGI, namely outer (super)alignment. Feedback suggests that 3-4 full days are required to study the paper in its entirety. Here's an overview:
In Part 1 (Sections 2 to 5) we introduce basic concepts and definitions pertaining to AGI, explore some associated philosophical issues, analyse the process of communication between two cognitive computations, and introduce basic concepts and definitions pertaining to AGI alignment.
In Part 2 (Sections 6 to 8) we employ an analogy between hiring a babysitter and designing, building, and validating a superintelligent agent --- both involve formulating a mandate for a guardian in service of a cherished beneficiary. After imagining hiring a babysitter by (1) formulating the babysitter instructions, (2) working backwards from those instructions to identify the desired properties we require a babysitter-for-hire to possess, (3) selecting and validating a babysitter-for-hire, and (4) relaying the babysitter instructions, we then apply the same pattern to the problem of designing and building an aligned and validated superintelligent agent: (1) formulating the final goal (TTQ), (2) working backwards from that final goal to identify the desired properties (OAP) we require a goal-less superintelligent agent to possess, (3) designing, building, and validating a goal-less superintelligent agent, and (4) plugging in the final goal (a toy sketch of this plug-in pattern appears after this overview). We then evaluate the performance of TTQ assuming that OAP holds, before considering (and rejecting) a number of possible refinements to TTQ.
In Part 3 (Sections 9 to 10) we offer our conclusions, and thank the many people who have contributed, in one way or another, to this paper's glacially slow development over what has seemed an eternity.
In Appendix A, although its topic is technically inner (super)alignment and thus outside the scope of the current paper, we briefly outline a proposed neurosymbolic, non-LLM-based, OAP-compliant cognitive architecture, and corresponding construction sequence, for a Gold-Standard AGI called BigMom.
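To give a rough feel for the shape of steps (1) to (4) above, here is a minimal, purely illustrative Python sketch of the "goal-less agent plus plugged-in final goal" composition (S = S^- + TTQ). Every name in it (GoalFreeAgent, FinalGoal, satisfies_oap, plug_in) is hypothetical and invented for this post; the stubs merely stand in for the real OAP validation and the real TTQ, which are of course nothing so simple.

```python
# Illustrative sketch only: none of these names or checks come from the paper.
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class FinalGoal:
    """A final goal the agent is asked to pursue (a stand-in for TTQ)."""
    name: str
    evaluate: Callable[[str], float]  # scores a candidate action against the goal


class GoalFreeAgent:
    """A stand-in for the goal-less agent S^-: capable, but with no goal attached."""

    def __init__(self) -> None:
        self.goal: Optional[FinalGoal] = None

    def satisfies_oap(self) -> bool:
        # Placeholder for step (3): validating the desired properties (OAP)
        # before any final goal is attached. The real check is the hard part.
        return True

    def plug_in(self, goal: FinalGoal) -> None:
        # Step (4): attach the final goal only once the precondition holds.
        if not self.satisfies_oap():
            raise RuntimeError("OAP not validated; refusing to attach a final goal")
        self.goal = goal

    def act(self, candidates: list[str]) -> str:
        if self.goal is None:
            raise RuntimeError("no final goal plugged in")
        return max(candidates, key=self.goal.evaluate)


# Usage: S = S^- + TTQ, but only after the precondition check passes.
s_minus = GoalFreeAgent()
s_minus.plug_in(FinalGoal(name="TTQ (stand-in)", evaluate=lambda action: len(action)))
print(s_minus.act(["wave", "wave politely"]))  # -> "wave politely"
```

The only point of the sketch is the ordering: the precondition is validated on the goal-less agent before any final goal is plugged in.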
IN A RUSH...? OF COURSE YOU ARE!
Start with Part 0, then go straight to Part 2 (Sections 6 to 8), circling back to the definitions in Part 1 (Sections 2 to 5) only as required, or on a subsequent reading.
FULL PAPER
The full paper is available as a preprint here: https://doi.org/10.5281/zenodo.16876832