The (partial) fallacy of dumb superintelligence

[-]Jan Matusiewicz2y40

How about using Yoshua Bengio's AI scientist (https://yoshuabengio.org/2023/05/07/ai-scientists-safe-and-useful-ai/) for alignment? The idea of the AI scientist is to train an AI just to understand the world (much as LLMs do), without any human-values alignment. The AI scientist simply answers questions sincerely; it doesn't consider the implications of providing an answer or whether humans will like it. It doesn't have any goals.

When a user asks the main autonomous system to produce a detailed plan to achieve a given goal, the plan may be too complicated for a human to understand, and the human may not spot a hidden agenda. But the AI scientist could be used to examine the plan and answer questions about its potential implications: could it be illegal or controversial, could it harm any humans, and so on. Wouldn't that prevent the rogue AGI scenario?

[-]Seth Herd2y41

It probably would. But convincing the entire world to do that instead, and not build agentic AGI, sounds very questionable. That's why I'm looking for alignment strategies with low alignment tax, for the types of AGI likely to be the first ones built.

[-]RogerDearnaley2y10

Several of the superscalers have public plans of the form: Step 1) build an AI scientist, or at least a research assistant 2) point it at the Alignment Problem 3) check its output until the Alignment Problem is solved 4) Profit!
This is basically the same proposal as Value Learning, just done as a team effort.

[-]RogerDearnaley2y30

"pauses training to do alignment work"

There's yet another approach: conditional training, where the LLM is aligned during the pretraining phase. See How to Control an LLM's Behavior for more details.

[-][deactivated]2y30

I really enjoyed reading this post. Thank you for writing it.

How to safely use an AI’s understanding for alignment

Deep networks learn at a relatively predictable pace, in the current training regime. Thus, their training can be paused at intermediate levels that include some understanding of human values, but before they achieve superhuman capabilities. Once a system starts to reflect and direct its own learning, this smooth trajectory probably won’t continue. But we can probably stop at a safe but useful level of intelligence/understanding, if we set that level carefully and cautiously. We probably can align an AGI partway through training, and thus make use of its understanding of what we want.

There are three particularly relevant examples of this type of approach. The first, RLHF, is relevant because it is widely known and understood. (I among others don’t consider it a promising approach to alignment by itself.) RLHF uses the LLM’s trained “understanding” or “knowledge” as a substrate for efficiently specifying human preferences. Training on a limited set of human judgments about input-response pairs causes the LLM to generalize these preferences remarkably well. We are “pointing to” areas in its learned semantic spaces. Because those semantics are relatively well-formed, we need to do relatively little pointing to define a complex set of desired responses.
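To make the "pointing" idea concrete, here is a toy sketch of the reward-modeling step only (no policy optimization), assuming responses already have well-formed semantic features. The feature vectors, response names, and the `fit` and `reward` helpers are all made up for illustration; in a real system the features would come from the pretrained model. It fits a linear reward with a Bradley-Terry style update on three human judgments, then checks whether the learned preference carries over to two responses the raters never saw.

```python
import math

# Hypothetical, hand-made "semantic features" for responses: [helpful, rude].
# In a real system these would come from the pretrained model, not from us.
features = {
    "polite answer": [1.0, 0.0],
    "curt answer": [0.6, 0.5],
    "insulting answer": [0.1, 1.0],
    "kind refusal": [0.3, 0.0],     # never shown to the human raters
    "hostile refusal": [0.0, 0.9],  # never shown to the human raters
}

# A small set of human judgments: (preferred, rejected).
preferences = [
    ("polite answer", "curt answer"),
    ("polite answer", "insulting answer"),
    ("curt answer", "insulting answer"),
]

def reward(w, x):
    """Linear reward: dot product of learned weights and response features."""
    return sum(wi * xi for wi, xi in zip(w, x))

def fit(preferences, steps=500, lr=0.5):
    """Gradient ascent on the Bradley-Terry log-likelihood of the judgments."""
    w = [0.0, 0.0]
    for _ in range(steps):
        for good, bad in preferences:
            xg, xb = features[good], features[bad]
            # Probability the current reward assigns to the human's choice.
            p = 1.0 / (1.0 + math.exp(-(reward(w, xg) - reward(w, xb))))
            # Gradient of log p with respect to w is (1 - p) * (xg - xb).
            for i in range(len(w)):
                w[i] += lr * (1.0 - p) * (xg[i] - xb[i])
    return w

w = fit(preferences)
# Does the preference generalize to responses outside the judged set?
generalizes = reward(w, features["kind refusal"]) > reward(w, features["hostile refusal"])
```

Three judgments are enough here only because the two features already separate what the raters care about, which is the sense in which well-formed semantics reduce the amount of pointing needed.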

The second example is natural language alignment of language model agents (LMAs). This seems like a very promising alignment plan, if LMAs become our first AGIs. This plan consists of designing the agent to follow top-level goals stated in natural language (e.g., "get OpenAI a lot of money and political influence"), including alignment goals (e.g., "do what Sam Altman wants, and make the world a better place"). I've written more about this technique, and the ensemble of techniques it can "stack" with, here.

This approach follows the above general scheme. It pauses training to do alignment work by pre-training the LLM, and inserting alignment goals before launching the system as an agent. (This is mid-training, if that agent continues to perform continuous learning, as seems likely.) If the AI is sufficiently intelligent, it will pursue those goals as stated, including their rich and contextual semantics. Choosing these goal statements wisely is still a nontrivial outer alignment problem; but the AI’s knowledge is the substrate for defining its alignment.

Another promising alignment plan that follows this general pattern is Steve Byrnes' Plan for mediocre alignment of brain-like [model-based RL] AGI. In this plan, we induce the nascent AGI (paused at a useful but controllable level of understanding/intelligence) to represent the concept we want it aligned to (e.g., “think about human flourishing” or “corrigibility” or whatever). We then set the weights from the active units in its representational system into its critic system. Since the critic system is a steering subsystem that determines its values and therefore its behavior, inner alignment is solved. That concept has become its “favorite”, highest-valued set of representations, and its decision-making will pursue everything semantically included in that concept as a final goal.
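A deliberately tiny sketch of the weight-setting step, under heavy assumptions: the representation is five hand-named units, the critic is a single linear layer, and every name and number (`UNITS`, `concept_activation`, `critic_value`, the candidate plans) is hypothetical. It is meant only to show the shape of the operation, not how the plan would be carried out on a real model-based RL system.

```python
# Toy illustration only: a linear "critic" over a tiny hand-made representation.
UNITS = ["flourishing", "health", "freedom", "paperclips", "deception"]

def critic_value(weights, activation):
    """Value the critic assigns to a thought: weights dotted with unit activations."""
    return sum(w * a for w, a in zip(weights, activation))

# Step 1: induce the paused system to represent the target concept and
# record which units are active (made-up activations in [0, 1]).
concept_activation = [1.0, 0.8, 0.7, 0.0, 0.0]

# Step 2: copy that activation pattern into the critic's weights, so the
# units active for the concept become the ones the critic values.
critic_weights = list(concept_activation)

# Candidate thoughts/plans, expressed in the same representational units.
candidates = {
    "improve public health": [0.6, 1.0, 0.3, 0.0, 0.0],
    "maximize paperclips": [0.0, 0.0, 0.0, 1.0, 0.2],
    "mislead the operators": [0.0, 0.0, 0.1, 0.0, 1.0],
}

scores = {name: critic_value(critic_weights, act) for name, act in candidates.items()}
best = max(scores, key=scores.get)
```

In this toy, plans that overlap with the concept's active units score highest, which is the sense in which the agent would pursue whatever is semantically included in the concept. Whether real learned representations are clean enough for this to work is exactly the open question.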

Now, contrast these techniques with alignment techniques that don’t make use of the system’s knowledge. Shard Theory and other proposals for aligning AGI by using the right set of rewards are one example. This requires accurately guessing how the system’s representations will form, and how those rewards will shape the agent’s behavior as they develop. Hand-coding a representation of any but the simplest goal (see diamond maximization) seems so difficult that it’s not generally considered a viable approach.

These are sketches of plans that need further development and inspection for flaws. And they only produce an initial, loose ("mediocre") alignment with human values, in the training distribution. The alignment stability problem of generalization and change of values remains unaddressed. Whether the alignment remains satisfactory after further learning, self-modification, or in new (out of distribution) circumstances seems like a complex problem that deserves further analysis.

This approach of leveraging an AI’s intelligence and “telling it what we want” by pointing to its representations seems promising. And these two plans seem particularly promising. They apply to types of AGI we are likely to get (language model agents, RL agents, or a hybrid); they are straightforward enough to implement, and straightforward enough to think about in detail prior to implementing them.

I’d love to hear specific pushback on this direction, or better yet, these specific plans. AI work seems likely to proceed apace, so alignment work should proceed with haste too. I think we need the best plans we can make and critique, applying to the types of AGI we’re most likely to get, even if those plans are imperfect.


Richard Loosemore appears to have coined the term in 2012 or before. He addresses this argument here, reaching similar conclusions to those here: Do what I mean is not automatic, but neither is it particularly implausible to code an AGI to infer intentions and check with its creators when they’re likely to be violated.


See the recent post Evaluating the historical value misspecification argument. It expands on the historical context for these ideas, particularly the claim that we should adjust our estimates of alignment difficulty in light of AI that has reasonably good understanding of human values. I don't care who thought what when, but I do care how the collective train of thought reviewed there might have misled us slightly. The discussion on that post clarifies the issues somewhat. This post is intended to offer a more concrete answer to a central question posed in that discussion: how we might close the gap between AI understanding our desires, and actually fulfilling them by making its decisions based on that understanding. I’m also proposing that the key change from historical assumptions is the predictability of learning, and therefore the option of safely performing alignment work on a partly-trained system.
