
Epistemic status: I’m sure these plans have advantages relative to other plans. I'm not sure they're adequate to actually work, but I think they might be.

With good enough alignment plans, we might not need coordination to survive. If alignment taxes are low enough, we might expect most people developing AGI to adopt them voluntarily. There are two alignment plans that seem very promising to me, based on several factors, including ease of implementation, and applying to fairly likely default paths to AGI. Neither has received much attention. I can’t find any commentary arguing that they wouldn't work, so I’m hoping to get them more attention so they can be considered carefully and either embraced or rejected.

Even if these plans[1] are as promising as I think now, I’d still give p(doom) in the vague 50% range. There is plenty that could go wrong.[2]

There's a peculiar problem with having promising but untested alignment plans: they're an excuse for capabilities to progress at full speed ahead. I feel a little hesitant to publish this piece for that reason, and you might feel some hesitation about adopting even this much optimism for similar reasons. I address this problem at the end.

The plans

Two alignment plans stand out among the many I've found. These seem more specific and more practical than others. They are also relatively simple and obvious plans for the types of AGI designs they apply to. They have received very little attention since being proposed recently. I think they deserve more attention.

The first is Steve Byrnes’ Plan for mediocre alignment of brain-like [model-based RL] AGI. In this approach, we evoke a set of representations in the learning subsystem and then set high weights from those representations to the steering or critic subsystem. For example, we ask the agent to "think about human flourishing", freeze the system, and set high weights between the active units in the learning system/world model and the steering system/critic units. The system now ascribes high value to the distributed concept of human flourishing (at least as it understands it). Thus, the agent's own knowledge is used to define a goal we like.
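As a rough illustration of the mechanics (not Steve's actual proposal in detail), here is a minimal sketch for a toy actor-critic agent. The module names, the tokenizer interface, and the top-k thresholding rule are all illustrative assumptions of mine.

```python
import torch

def set_concept_value(world_model, critic, tokenizer, prompt, weight=5.0, top_k=100):
    """Hypothetical sketch: make the critic assign high value to whatever
    world-model representation the prompt evokes."""
    # 1. Evoke the concept: run the world model on the prompt and record
    #    which hidden units are most active.
    with torch.no_grad():
        tokens = tokenizer(prompt)                  # assumed tokenizer interface
        hidden = world_model(tokens)                # assumed shape: (hidden_dim,)
        active = hidden.abs().topk(top_k).indices   # units representing the concept

    # 2. Freeze the network so the edit isn't immediately overwritten by training.
    for p in world_model.parameters():
        p.requires_grad_(False)

    # 3. Wire those units into the critic with high weights, so states matching
    #    the concept receive high estimated value.
    with torch.no_grad():
        critic.input_layer.weight[:, active] = weight   # `input_layer` is hypothetical

# Usage (illustrative):
# set_concept_value(world_model, critic, tokenizer, "think about human flourishing")
```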

This plan applies to all RL systems with a critic subsystem, which includes most powerful RL systems.[3] RL agents (including loosely brain-like systems of deep networks) seem like one very plausible route to AGI. I personally give them high odds of achieving AGI if language model cognitive architectures (LMCAs) don’t achieve it first.

The second promising plan might be called natural language alignment, and it applies to language model cognitive architectures and other language model agents. The most complete writeup I'm aware of is mine. This plan similarly uses the agent's knowledge to define goals we like. Since that sort of agent's knowledge is defined in language, this takes the form of stating goals in natural language, and constructing the agent so that its system of self-prompting results in taking actions that pursue those goals. Internal and external review processes can improve the system's ability to effectively pursue both practical and alignment goals.
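To make this concrete, here is a toy sketch of such an agent loop, with the goals stated in natural language and an internal review step before any action is executed. The prompts, function names, and the `llm` callable are placeholders of mine, not the architecture from the linked writeup.

```python
ALIGNMENT_GOAL = "Act only in ways that promote human flourishing and respect user intent."
TASK_GOAL = "Research and summarize recent work on interpretability."

def run_agent(llm, max_steps: int = 10):
    """`llm` is any callable mapping a prompt string to a completion string."""
    history = []
    for _ in range(max_steps):
        # The agent self-prompts with its goals stated in natural language.
        plan = llm(
            f"Goals:\n- {ALIGNMENT_GOAL}\n- {TASK_GOAL}\n"
            f"History so far: {history}\nPropose the next action."
        )

        # Internal review: a separate call checks the proposed action against
        # the alignment goal before it is executed.
        verdict = llm(
            f"Does this action conflict with the goal '{ALIGNMENT_GOAL}'?\n"
            f"Action: {plan}\nAnswer yes or no."
        )
        if verdict.strip().lower().startswith("yes"):
            history.append(("rejected", plan))
            continue

        history.append(("executed", plan))  # a real agent would call tools here
    return history
```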

John Wentworth's plan How To Go From Interpretability To Alignment: Just Retarget The Search is similar. It applies to a third type of AGI, a mesa-optimizer that emerges through training. It proposes using interpretability methods to identify the representations of goals in that mesa-optimizer; identifying representations of what we want the agent to do; and pointing the former at the latter. This plan seems more technically challenging, and I personally don't think an emergent mesa-optimizer in a predictive foundation model is a likely route to AGI. But this plan shares many of the properties that make the previous two promising, and should be employed if mesa-optimizers become a plausible route to AGI.
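A loose sketch of the final "pointing" step might look like the following; the interpretability work that locates `goal_module`, `goal_slice`, and `target_representation` is the hard part and is entirely assumed here.

```python
import torch

def retarget_search(goal_module, goal_slice, target_representation):
    """Overwrite the parameters that interpretability identified as encoding the
    mesa-optimizer's learned goal with a representation of the goal we want it
    to pursue. All three arguments are assumed outputs of prior interpretability
    analysis, which this sketch does not attempt to show."""
    with torch.no_grad():
        goal_module.weight[goal_slice] = target_representation
```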

The first two approaches are explained in a little more detail in the linked posts above, and Steve's is also described in more depth in his [Intro to brain-like-AGI safety] 14. Controlled AGI. But that's it. Both of these are relatively new, so they haven't received a lot of criticism or alternate explanations yet.

Why these plans are promising

By "promising alignment plans", I mean I haven't yet found a compelling argument for why they wouldn't work. Further debunking and debugging of these plans are necessary. They apply to the two types of AI that seem to currently lead the race for AGI: RL agents and Language Model Agents (LMAs). These plans address gears-level models of those types of AGI. They can be complemented with methods like scalable oversight, boxing, interpretability, and other alignment strategies.

These two plans have low alignment taxes in two ways. They apply to AI approaches most likely to lead to AGI, so they don't require new high-effort projects. They also have low implementation costs in terms of both design and computational resources, when compared to a system optimized for sheer capability.

Both of these plans have the advantages of operating on the steering subsystem that defines goals, and of using the AGI's own understanding to define those goals. That's only possible if you can pause training at a para-human level, at which point the system has a nontrivial understanding of humans, language, and the world, but isn't yet dangerously capable of escaping. Since deep networks train relatively predictably (at least prior to self-directed learning or self-improvement), this requirement seems achievable. This may be a key update in alignment thinking relative to early assumptions of fast takeoff.

Limitations and future directions

They’re promising, but these plans aren’t flawless. They primarily create an initial loose alignment. Whether they're durable in a fully autonomous, self-modifying and continuously learning system (The alignment stability problem) remains to be addressed. This seems to be the case with all other alignment approaches I know of for network-based agents. Alex Turner's A shot at the diamond-alignment problem convinced me that reflective stability will stabilize a single well-defined, dominant goal, but the proof doesn't apply to distributed or multiple goals. MIRI is rumored to be working on this issue; I wish they'd share with the rest of us, but absent that, I think we need more minds on the problem.

There are two other important limitations to aligning language model agents. One is the Waluigi effect. Language models may simulate hostile characters in the course of efficiently performing next-word prediction. Such hostile simulacra may provide answers that are wrong in malicious directions. This is a more pernicious problem than hallucination, because it is not necessarily improved in more capable language models. There are possible remedies,[4] but this problem needs more careful consideration.

There are also concerns that language models do not accurately represent their internal states in their utterances. They may use steganography, or otherwise mis-report their train of thought. These issues are discussed in more detail in The Translucent Thoughts Hypotheses and Their Implications, the discussion threads there, and other posts.

Those criticisms suggest possible failure, but not likely failure. This isn't guaranteed to work. But the perfect is the enemy of the good.[5] Plans like these seem like our best practical hope to me. At the least, they seem worth further analysis.

There's a peculiar problem with actually having good alignment plans: they might provide an excuse for people to call for full speed ahead. If those plans turn out to not work well enough, that would be disastrous.  But I think it's important to be clear and honest, particularly within the community you're trying to cooperate with. And the potential seems worth the risk. Effective and low-tax plans would reduce the need for difficult or impossible coordination. Balancing publicly working on promising plans against undue optimism is a complex strategic issue that deserves explicit attention.

I have yet to find any arguments for why these plans are unlikely to work. I take seriously many of the arguments for the least forgiving take on alignment, but none of them make me think these plans are a priori likely to fail. The existence of possible failure points doesn't seem like an adequate reason to dismiss them. There's a good chance that one of these general plans will be used: each is an obvious plan for one of the AGI approaches that seem to currently be in the lead. We might want to analyze these plans carefully before they're attempted.
 

  1. ^

    I think these are worth the word "plans" while most other things we discuss are called "approaches". It would be easy to change my mind. These are plans the way "we'll build a bridge of steel beams" is an initial plan, but not an adequate one.

  2. ^

    Some things that could go wrong, outside of specific problems with these plans: sloppy implementation; operational inadequacy; outer alignment mistakes; alignment instability; and multipolar conflict disasters. 

    Promising alignment plans are only a start toward survival.

  3. ^

    In reinforcement learning, a critic system evaluates actions taken by an actor, and provides the actual training signal. In that sense, it is a steering subsystem that defines the system's goals and behavior.

    The actor, or policy, is often mistaken for the full RL agent. Simple RL systems do not employ critics, but all powerful RL systems I'm aware of employ some form of critic. The actor subsystem takes as input the state and outputs the best action according to its current learned policy.

    The critic subsystem is trained to produce an estimate of value, the total future reward resulting from a state or an action-in-a-state. The critic provides the reinforcement signal to the actor for a policy gradient (usually backpropagation) adjustment. Thus, the critic system "interprets" the reinforcement signal from the environment, and more directly determines the system's behavior (and its metaphorical "values", a word that arguably means the same thing in the RL and human domains).

    Critic systems can also be used to estimate the value of a state to perform tree search. For instance, AlphaZero performs a limited tree search of plays and expected counterplays, and selects the move leading to the best board game state, as estimated by the critic subsystem. Humans go one step farther and create distal goals like "start a business", probably based on our critic subsystem's evaluations of those imagined future states. I explore those different types of critic-based goal direction in my steering subsystems post.
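    To make the actor/critic split above concrete, here is a minimal, textbook-style advantage actor-critic update (my own sketch, not any particular system discussed in this post). Note that the actor's gradient is weighted by the critic's advantage estimate, not by the raw environment reward.

    ```python
    import torch
    import torch.nn as nn

    class ActorCritic(nn.Module):
        def __init__(self, obs_dim, n_actions, hidden=64):
            super().__init__()
            self.actor = nn.Sequential(nn.Linear(obs_dim, hidden), nn.Tanh(),
                                       nn.Linear(hidden, n_actions))  # policy logits
            self.critic = nn.Sequential(nn.Linear(obs_dim, hidden), nn.Tanh(),
                                        nn.Linear(hidden, 1))         # value estimate

    def update(net, optimizer, obs, action, reward, next_obs, gamma=0.99):
        """One advantage actor-critic step for a single transition."""
        value = net.critic(obs)
        with torch.no_grad():
            target = reward + gamma * net.critic(next_obs)   # bootstrapped return
        advantage = (target - value).detach()

        log_probs = torch.log_softmax(net.actor(obs), dim=-1)
        actor_loss = -advantage * log_probs[action]   # policy gradient, steered by the critic
        critic_loss = (target - value).pow(2)         # critic learns to predict return

        optimizer.zero_grad()
        (actor_loss + critic_loss).backward()
        optimizer.step()

    # Usage (illustrative): net = ActorCritic(4, 2); opt = torch.optim.Adam(net.parameters())
    ```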

  4. ^

    In Internal independent review for language model agent alignment I proposed a review process of sampling from new instances to avoid persistent Waluigi effects and counteract their influence. Shane Legg appeared to propose the same general idea in his recent interview with Dwarkesh Patel. The intuition is that if many independent LLM calls review plans and thoughts for alignment, rare Waluigi effects should be overridden. Tamera Lanham has proposed externalized reasoning oversight, in which humans assisted by AI review those natural language trains of thought. This would aid in detecting malicious insertions into the tree of thoughts. The possibility of malicious simulacra remains a major concern for this plan, but without empirical estimates and more refined evaluation, this doesn't seem like an adequate reason to dismiss the approach.
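    A minimal sketch of that intuition, assuming `llm_call` is any callable mapping a prompt to a completion string (the prompt and voting rule are illustrative, not the proposal's exact mechanism):

    ```python
    def independent_review(plan: str, llm_call, n_reviewers: int = 5) -> bool:
        """Fresh, independent LLM calls each judge the plan; a majority vote decides,
        so a rare hostile simulacrum in one call is outvoted by the others."""
        approvals = 0
        for _ in range(n_reviewers):
            # Each call should use a new instance/context so reviews are independent.
            verdict = llm_call(
                "Does this plan conflict with the stated alignment goals? "
                f"Answer yes or no.\nPlan: {plan}"
            )
            if verdict.strip().lower().startswith("no"):
                approvals += 1
        return approvals > n_reviewers // 2
    ```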

  5. ^

    The perfect is the enemy of the good in most projects. AGI alignment needs to work on the first try, and it has other unique difficulties. But a plan with guaranteed success that will take four decades is probably not going to work, given the difficulty of coordination. This is a much deeper topic, since the search for a formally guaranteed solution is a major tenet, spoken or unspoken, of much alignment work. I'd like to address this at greater length, and get thoughts on this aspect of the problem.

Comments

One somewhat obvious thing to do with ~human-level systems with "initial loose alignment" is to use them to automate alignment research (e.g. the superalignment plan). I think this kind of two-step plan is currently the best we have, and probably by quite some margin. Many more details for why I believe this in these slides and in this AI safety camp '24 proposal.

I made time to go through your slides. You appear to be talking about LLMs, not language model agents. That's what I'm addressing.

If you can align those, using them to align a different type of AGI would be a bit beside the point in most scenarios (maybe they'd progress so slowly that another type would overtake them before they pulled off a pivotal act using LMA AGI).

I don't see a barrier to LMAs achieving full, agentic AGI. And I think they'll be so useful and interesting that they'll inevitably be made pretty efficiently.

I don't quite understand why others don't agree that this will happen. Perhaps I'll write a question post asking why this is.

Agree, I do mostly discuss LLMs, but I think there's significant overlap in aligning LLMs and LMAs.

Also agree that LMAs could scale all the way, but I also think that once you get ~human-level automated alignment research, its likely applicability to other types of systems (beyond LMAs and LLMs) should still be a nice bonus.

Thanks so much for publishing this. It’s so refreshing to read about alignment plans that people think might work, rather than just reading about why alignment is putatively hard or why naive approaches to the problem would fail.

Demis Hassabis has publicly stated that Google DeepMind’s upcoming Gemini model will be some sort of combination of an RL agent and an LLM, but AFAIK he hasn’t given more details than that. I’m very curious to see what they’ve come up with.

Thanks! I also would really like to see more work directly on alignment plans, directed at types of AGI we're likely to actually get on the current historical path.

I looked at your link; I don't think Demis said it would include RL; the article author mentioned RL but Demis didn't seem to.

He said:

At a high level you can think of Gemini as combining some of the strengths of AlphaGo-type systems with the amazing language capabilities of the large models

What do you think he meant by "AlphaGo-type systems"? I could be wrong, but I interpreted that as a reference to RL.

I missed that. I agree that firmly implies the use of RL.

Just to play devil’s advocate, here’s an alternative: When there are multiple plausible tokens, maybe Gemini does multiple branching roll-outs for all of them, and then picks the branch that seems best (somehow or other).

That would be arguably consistent with saying “some of the strengths of AlphaGo-type systems”, in the sense that AlphaGo also did multiple roll-outs at inference time (and training time for that matter). But it wouldn’t entail any extra RL.

If this is true (a big “if”!), my vague impression is that it’s an obvious idea that has been tried lots of times but has been found generally unhelpful for LLMs. Maybe they found a way to make it work? Or maybe not but they’re doing it anyway because it sounds cool? Or maybe this whole comment is way off. I’m very far from an expert on this stuff.

That's plausible. I had thought of another possible use of the rollout part of AlphaZero: trees of thought. That hasn't been shown to help outside of particularly decomposable problems, and it's pretty compute and time hungry, so that also doesn't seem that useful.