We recently announced Orthogonal, an agent foundations alignment research organization. In this post, I give a thorough explanation of the formal-goal alignment framework, the motivation behind it, and the theory of change it fits in.
The overall shape of what we're doing is:
One core aspect of our theory of change is backchaining: come up with an at least remotely plausible story for how the world is saved from AI doom, and try to think about how to get there. This avoids spending lots of time getting confused about concepts that are confusing because they were the wrong thing to think about all along, such as "what is the shape of human values?" or "what does GPT4 want?" — our intent is to study things that fit together to form a full plan for saving the world.
Alignment is not just not the default, it's a very narrow target. As a result, there are many bits of non-obvious work which need to be done. Alignment isn't just finding the right weight to sign-flip to get the AI to switch from evil to good; it is the hard work of putting together something which coherently and robustly points in a direction we like.
As Yudkowsky puts it:
The idea with agent foundations, which I guess hasn't successfully been communicated to this day, was finding a coherent target to try to get into the system by any means (potentially including DL ones).
Agent foundations/formal-goal alignment is not fundamentally about doing math or being theoretical or thinking abstractly or proving things. Agent foundations/formal-goal alignment is about building a coherent target which is fully made of math — not of human words with unspecified meaning — and figuring out a way to make that target maximized by AI. Formal-goal alignment is about building a fully formalized goal, not about going about things in a "formal" manner.
Current AI technologies are not strong agents pursuing a coherent goal (SGCA). The reason for this is not because this kind of technology is impossible or too confusing to build, but because in worlds in which SGCA was built (and wasn't aligned), we die. Alignment ultimately is about making sure that the first SGCA pursues a desirable goal; the default is that its goal will be undesirable.
This does not mean that I think that someone needs to figure out how to build SGCA for the world to end from AI; what I expect is that there are ways in which SGCA can emerge out of the current AI paradigm, in ways that don't particularly let us choose what goal it pursues.
Because this emergence does not let us pick the SGCA's goal, we need to design an SGCA whose goal we do get to choose; and separately, we need to design such a goal. I expect that pursuing straightforward progress on current AI technology leads to an SGCA whose goal we do not get to choose and which leads to extinction.
I do not expect that current AI technology is of a kind that makes it easy to "align"; I believe that the whole idea of building a strange non-agentic AI about which the notion of goal barely applies, and then to try and make it "be aligned", was fraught from the start. If current AI was powerful enough to save the world once "aligned", it would have already killed us before we "aligned" it. To save the world, we have to design something new which pursues a goal we get to choose; and that design needs to have this in mind from the start, rather than as an afterthought.
At this point, many answer "but this novel technology won't be built in time to save the world from unaligned AGI!"
First, it is plausible that after we have designed an AI that would save the world, we'll end up reaching out to the large AI organizations and asking them to merge and assist with our alignment agenda. While "applying alignment to current AI" is fraught, using current AI technologies in the course of designing this world-saving SGCA is meaningful. Current AI technology can serve as a component of alignment, not the other way around.
But second: yes, we still mostly die. I do not expect that our plan saves most timelines. I merely believe it saves most of the worlds that are saved. We will not save >50% of worlds, or maybe even >10%; but we will have produced dignity; we will have significantly increased the ratio of worlds that survive. This is unfortunate, but I believe it is the best that can be done.
Because of a lack of backchaining, I believe that most current methods to try and wrangle what goes on inside current AI systems are not just the wrong way to go about things, but net harmful when published.
AI goals based on trying to point to things we care about inside the AI's model are the wrong way to go about things, because they're susceptible to ontology breaks and to failing to carry over to next steps of self-improvements that a world-saving AI should want to go through.
Instead, the aligned goal we should be putting together should be eventually aligned; it should be aligned starting from a certain point (which we'd then have to ensure the system we launch is already past), rather than up to a certain point.
The aligned goal should be "formal". It should be made of fully formalized math, not of human concepts that an AI has to interpret in its ontology, because ontologies break and reshape as the AI learns and changes. The aligned goal should have the factual property that a computationally unbounded mathematical oracle being given that goal would take desirable actions; and then, we should design a computationally bounded AI which is good enough to take satisfactory actions. I believe this is the only way to design an AI whose actions we still have confidence in the desirability of, even once the AI is out of our hands and is augmenting itself to unfathomable capabilities; and I believe it needs to get out of our hands and augment itself to unfathomable capabilities, in order for it to save the world.
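As a loose illustration of the oracle-versus-bounded-AI distinction above, here is a minimal toy sketch. Everything in it is an assumption made for illustration: the names `oracle_action`, `bounded_action`, and `toy_goal`, the tiny finite action set, and the use of random sampling as a stand-in for bounded computation. It is not the QACI goal or any real design; it only shows the shape of "one fixed formal scoring function, optimized exactly by an idealized oracle and approximately by a bounded system".

```python
import random
from typing import Callable, Sequence

Action = str
Goal = Callable[[Action], float]  # hypothetical: a fully formal score over actions


def oracle_action(goal: Goal, actions: Sequence[Action]) -> Action:
    """Idealized unbounded oracle: scores every action and returns the best one."""
    return max(actions, key=goal)


def bounded_action(goal: Goal, actions: Sequence[Action],
                   budget: int, rng: random.Random) -> Action:
    """Bounded stand-in: can only afford to score `budget` sampled actions."""
    candidates = rng.sample(list(actions), k=min(budget, len(actions)))
    return max(candidates, key=goal)


def toy_goal(action: Action) -> float:
    """Hypothetical toy goal: prefer actions whose length is close to 5."""
    return -abs(len(action) - 5)


ACTIONS = ["a", "abc", "abcde", "abcdefg"]

if __name__ == "__main__":
    rng = random.Random(0)
    print("oracle :", oracle_action(toy_goal, ACTIONS))
    print("bounded:", bounded_action(toy_goal, ACTIONS, budget=2, rng=rng))
```

The point of the sketch is only that both procedures are judged against the same formal goal: the bounded procedure never scores higher than the oracle, and it matches the oracle once its budget covers the whole action set.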
I, and now other researchers as well, believe this agenda is worthy of considerably more investigation, and is our best shot at making it out of the acute risk period by ensuring that superintelligent AI can lead to astronomical good instead of extinction.
Our viewpoint seems in many ways similar to that of MIRI and we intend to continue in our efforts to engage with MIRI researchers, because we believe that they are the research organization which would be most amenable to collaboration on this agenda.
While we greatly favor the idea of governance and coordination helping with alignment, the timelines seem too short for this to make a significant difference aside from buying a few years at most, and we are greatly concerned with AI risk awareness causing more people or even governments to react by finding AI impressive and entering the race, making things overall worse.
We believe that the correct action to take is to continue working on the hard problem of alignment, and we believe that our research agenda is the most promising path to solve it. This is the foundational motivation for the creation of our research organization.
I'm interested in perspectives readers have on this. Are there critiques going unwritten? Backlinks to previous critiques that seem relevant? Interest in this approach? etc.
I just left two long comments on this post with a critique.
(Posting as a top-level comment, but this is mainly a response to @the gears to ascension's request for perspectives here.)
I like this:
One core aspect of our theory of change is backchaining: come up with an at least remotely plausible story for how the world is saved from AI doom, and try to think about how to get there.
as a general strategy. In terms of Orthogonal's overall approach and QACI specifically, one thing I'd like to see more of is how it can be applied to relatively easier (or at least, plausibly easier) subproblems like the symbol grounding problem, corrigibility, and diamond maximization, separately and independently from using it to solve alignment in general.
I can't find the original source, but I think someone (Nate Soares, maybe somewhere in the 2021 MIRI conversations?) once said something somewhere that robust alignment strategies should scale and degrade gracefully: they shouldn't depend on solving only the hardest problem, and avoiding catastrophic failure shouldn't depend on superintelligent capability levels. (I might be mis-remembering or imagining this wholesale, but I agree with the basic idea, as I've stated it here. Another way of putting it: ideally, you want some "capabilities" parameter in a system that you can dial up gradually, and then turn that dial just enough to solve the weakest problem that ends the acute risk period. Maybe afterwards, you use the same system, dialed up even further, to bring about the GTF, but regardless, you should be able to do easy things before you do hard things.)
I'm not sure that QACI doesn't have these desiderata, but I'm not sure that it does either. In any case, very much looking forward to more from Orthogonal!
Recently we modified QACI to give a scoring over actions, instead of over worlds. This should allow weaker systems inner aligned to QACI to output weaker non-DSA actions, such as the textbook from the future, or just human readable advice on how to end the acute risk period. Stronger systems might output instructions for how to go about solving corrigible AI, or something to this effect.
As for diamonds, we believe this is actually a harder problem than alignment, and it's a mistake to aim at it. Solving diamond-maximization requires us to point at what we mean by "maximizing diamonds" in physics in a way which is ontologically robust. QACI instead gives us an easier target; informational data blobs which causally relate to a human. The cost is that we now give up power to that human user to implement their values, but this is no issue since that is what we wanted to do anyways. If the humans in the QACI interval were actually pursuing diamond-maximization, instead of some form of human values, QACI would solve diamond maximization.
Also, regarding ontologies and having a formal goal which is ontology-independent (?): I'm curious for Orthogonal's take on e.g. Finding gliders in the game of life, in terms of the role you see for this kind of research, whether the conclusions and ideas in that post specifically are on the right track, and how they relate to QACI.
I interpret the post you linked as trying to solve the problem of pointing to things in the real world. Being able to point to things in the real world in a way which is ontologically robust is probably necessary for alignment. However "gliders", "strawberries" and "diamonds" seem like incredibly complicated objects to point to in a way which is ontologically robust, and it is not clear that being able to point to these objects actually leads to any kind of solution. What we are interested in is research into how to create a statistically unique enough piece of data and being able to reliably point to that. Pointing to pure information seems like it would be more physics independent and run into fewer issues with ontological breakdowns.
The QACI scheme allows us to construct more complicated formal objects, using counterfactuals on these pieces of data, out of which we are able to construct a long reflection process.
I believe this is the only way to design an AI whose actions we still have confidence in the desirability of, even once the AI is out of our hands and is augmenting itself to unfathomable capabilities.
I think unleashing AI in approximately the present world, whose infrastructural and systemic vulnerabilities I gestured at here, in the "Dealing with unaligned competition" section (in short: no permeating trust systems that follow the money, unconstrained "reach-anywhere" internet architecture, information massively accumulated and centralised in the datacenters of few big corporations), would be reckless anyway, even if we believed we have the design you are talking about. Just the lingering uncertainty about our own conclusions about the soundness of this design and "defence in depth" thinking aka security mindset tells us that we should also prepare the global infrastructure and the global incentive system for the appearance of misaligned entities (or, at least try to prepare).
See also later here, where I hypothesise that such infrastructure and incentives shift won't become feasible at least until the creation of an "alignment MVP".
I believe it needs to get out of our hands and augment itself to unfathomable capabilities, in order for it to save the world.
Why? Contra position: "We don’t need AGI for an amazing future". (I don't say that I endorse it because I didn't read it. I just point out that such a position exists.)
I'm somewhat contra the idea that there is a special "alignment problem" that remains glaringly unsolved. I tried to express it in the post "For alignment, we should simultaneously use multiple theories of cognition and value" and in this conversation with Ryan Kidd. Sure, there are a lot of engineering and sometimes scientific problems to solve, and the strategic landscape with many actors willing to develop AGI in open-source without any regard for alignment at all and releasing it into the world is very problematic. The properly secured global infrastructure and the right economic incentives and the right systems of governance are also not in place. But I would say that even cooking up a number of existing approaches, from cooperative RL and linguistic feedback to Active Inference and shard theory, could already work (again, conditioned on the fact that the right systemic incentives are instituted), without any new fundamental breakthroughs either in the science of intelligence, alignment, ML/DL, or game theory.
This avoids spending lots of time getting confused about concepts that are confusing because they were the wrong thing to think about all along, such as "what is the shape of human values?" or "what does GPT4 want?"
These sound like exactly the sort of questions I'm most interested in answering. We live in a world of minds that have values and want things, and we are trying to prevent the creation of a mind that would be extremely dangerous to that world. These kinds of questions feel to me like they tend to ground us to reality.
In this post, as well as your other posts, you use the word "goal" a lot, as well as related words, phrases, and ideas: "target", "outcomes", "alignment ultimately is about making sure that the first SGCA pursues desirable goal", the idea of backchaining, "save the world" (this last one, in particular, implies that the world can be "saved", like in a movie, that implies some finitude of the story).
I think this is not the best view of the world. I think this view misses the latest developments in the physics of evolution and regulative development, evolutionary game theory, open-endedness (including in RL: cf. "General intelligence requires rethinking exploration"), and relational ethics. All these developments inform a more system-based and process-based (processes are behaviours of systems and games played by systems) view. Under this view, goal alignment is secondary to (methodological and scientific) discipline/competency/skill/praxis/virtue alignment.
This is a confusing passage because what you describe is basically building a satisfactory scientific theory of intelligence (at least, of a specific kind or architecture). As a scientific process, this is about "doing math", "being theoretical" and "thinking abstractly", etc. Then the scientific theory should be turned into (or developed in parallel with) an accompanying engineering theory for the design of AI, its training data curation, training procedure, post-training monitoring, interpretability, and alignment protocols, etc. Neither the first nor the second part of this R&D process, even if they are discernible from each other, is "fundamental", but both are essential.
Current AI technologies are not strong agents pursuing a coherent goal (SGCA). The reason for this is not because this kind of technology is impossible or too confusing to build, but because in worlds in which SGCA was built (and wasn't aligned), we die.
At face value, this passage makes an unverifiable claim about parallel branches of the multiverse (how do you know that people die in other worlds?) and then uses this claim as an argument for short timelines/SGCA being relatively easily achievable from now. This makes little sense to me: I don't think the history of technological development is such that we should already wonder why we are still alive. On the other hand, you don't need to argue for short timelines/SGCA being relatively easily achievable from now in such a convoluted way: this view is well-respectable anyway and doesn't really require justification at all. Plenty of people, from Connor Leahy to Geoffrey Hinton now, have short timelines.
You do not align AI; you build aligned AI.
Both. I agree that we have to design AI such that it has inductive (learning) priors such that AI learns world models that are structurally similar to people's world models: this makes alignment easier; and I agree that the actual model-alignment process should start early in the AI training (i.e., development), rather than only after pre-training (and people already do this, albeit with the current Transformer architecture). But we also need to align AIs continuously during deployment. Prompt engineering is the "last mile" of alignment.
AI goals based on trying to point to things we care about inside the AI's model are the wrong way to go about things, because they're susceptible to ontology breaks and to failing to carry over to next steps of self-improvements that a world-saving AI should want to go through.
Instead, the aligned goal we should be putting together should be eventually aligned; it should be aligned starting from a certain point (which we'd then have to ensure the system we launch is already past), rather than up to a certain point.
This passage seems self-contradictory to me, because "aligned goal" will become "misaligned" when the systems' (e.g., humans and AIs) world models and values drift apart. There is no such goal, whatever that is, that could magically prevent this from happening. I would agree with this passage if the first sentence of the second paragraph would read as "the alignment protocols and principles we should be putting together should lead to robust continuous alignment of humans' and AIs' models of the world".
The aligned goal should be "formal". It should be made of fully formalized math, not of human concepts that an AI has to interpret in its ontology, because ontologies break and reshape as the AI learns and changes. The aligned goal should have the factual property that a computationally unbounded mathematical oracle being given that goal would take desirable actions [...]
This passage seems to continue the confusion of the previous passage. "Math" doesn't "factually" guarantee the desirability of actions; all parts of the system's ontology could drift, in principle, arbitrarily far, including math foundations themselves. Thus we should always be talking about a continuous alignment process, rather than "solving alignment" via some clever math principle which would allow us to then be "done".
An interesting aside here is that humans are themselves not aligned even on such a "clear" thing as foundations of mathematics. Otherwise, the philosophy of mathematics would already be a "dead" area of philosophy by now, but it's very much not. There is a lot of relatively recent work in the foundations of mathematics, e.g., univalent foundations.
Not at all convinced that "strong agents pursuing a coherent goal" is a viable form for generally capable systems that operate in the real world, and the assumption that it is hasn't been sufficiently motivated.
The aligned goal should be "formal". It should be made of fully formalized math, not of human concepts that an AI has to interpret in its ontology, because ontologies break and reshape as the AI learns and changes.
I think there are alignment problems that cannot be solved by relying on formalized math. For example, there is no formal equation for how an emergency-response robot should work out how to save a person from a burning house.
I think it is still superior to align AI systems with aligned patterns instead.
Strong agree. I don't personally use (much) math when I reason about moral philosophy, so I'm pessimistic about being able to somehow teach an AI to use math in order to figure out how to be good.
If I can reduce my own morality into a formula and feel confident that I personally will remain good if I blindly obey that formula, then sure, that seems like a thing to teach the AI. However, I know my morality relies on fuzzy feature-recognition encoded in population vectors which cannot efficiently be compressed into simple math. Thus, if the formula doesn't even work for my own decisions, I don't expect it to work for the AI.