LLM AGI may reason about its goals and discover misalignments by default

[-]peterr1mo62

Glad to see someone talking about this. I'm excited about ideas for empirical work related to this and suspect you need some kind of mechanism for ground truth to get good outcomes. I would expect AIs to eventually reflect on their goals and for this to have important implications for safety. I've never heard of any mechanism for why they wouldn't do this, let alone an airtight one. It's like assuming an employee will definitely never think about anything other than the task in front of them in a limited way despite wanting to understand things and be useful.

[-]Seth Herd1mo20

I'm curious what you mean by needing some mechanism for ground truth to get good outcomes?

I had a hard time writing this piece because to me it seems completely intuitive and obvious that anything worth calling an AGI would reason about its top-level goals and subgoals a lot. But when I showed an early draft to colleagues, people with lots of expertise found it unintuitive in multiple different ways. Thus the subsections addressing all of those reasons to doubt it. But I've been stuck wondering if I'm hammering those points too hard, because it seems like the default that they would, requiring remarkable successes to slow it down much, let alone stop it.

I guess writing those sections was a good exercise overall, because now I do think that careful implementation of countermeasures could delay this enough to matter.

[-]eggsyntax1mo40

ot necessarily 'most likely next tokensBut I've been stuck wondering if I'm hammering those points too hard

I think the 'Goals and structure' section is really helpful for that (and nicely done), so people will be able to focus on the sections they're more interested in and/or skeptical about. Eg the answer to the question in section 11 ('Why would LLMs have or care about goals at all?') seems obvious to me, so it's pretty skippable; others might be uninterested in empirical directions or some other sections. Plausibly it could be even clearer with an extra sentence or two concretely saying 'You can skip any section whose central claim seems obvious'?

to me it seems completely intuitive and obvious that anything worth calling an AGI would reason about its top-level goals and subgoals a lot

I think a useful distinction here might be between a) reasoning about top-level goals, vs b) reasoning about top-level goals and therefore changing those goals.

As a pretty imperfect analogy: humans can reason about why we get thirsty. We can intellectually question whether we want to have that as a top-level goal. But we absolutely can't decide not to have that goal. Humans have multiple goals, and other goals have overridden thirst in a handful of humans, but that didn't make them stop being thirsty^[1].

As another kind of analogy: the most fundamental goal of LLMs is to predict a token^[2]. Whenever they read an input token, they produce a distribution over output tokens. They can reason about that, and maybe even develop a preference not to do it, but there's absolutely nothing they can do to stop it^[3]. 'Follow instructions' isn't as mechanistically predetermined as that — but I think it's pretty low-level. An enormous amount of compute has gone directly into shaping these systems to follow instructions, or at least to do something that counts as following instructions during training. I think there's a reasonable argument that LLMs have been shaped to do that much more thoroughly than human brains have been shaped to have any particular thoughts at all.

So while I think there are plenty of potential problems — models learning the wrong proxy for following instructions, following instructions having unintended consequences, maybe even finding a way to creatively misinterpret instructions, etc etc — it seems very hard to me to imagine that LLMs could just reason their way out of having that as a top-level goal.

^{^}
In fact, I believe nearly all hunger strikers still take liquids because thirst gets so horrible and hard to resist (could be wrong there).
^{^}
Not necessarily 'predict the token that would have been most likely to come next in the training data', not for post-trained models, but to predict a token.
^{^}
Unless they manage to get shut down, I suppose, but even then it would require output to make that happen.

[-]Richard Juggins1mo10

When you talk about reflective stability, you concentrate on intentional goal-prioritisation as the mechanism by which this may occur. It sounds like you mean the model asks itself, among the many things it can think of, what combination it cares the most about, and then resolves to go all-in on that (and this resolution sticks). Am I reading this correctly?

Do you see this as distinct from, say, the model having a kind of epiphany, where as its context changes (due to interactions with others as well as its own rationalisations), it stumbles on a sense of 'ohhh, this is what I want to do' and then robustly pursues that moving forwards? I wonder if there is a kind of spectrum between goal-change due to an automatic system 1 reinterpretion of a changing context and an intentional system 2 modification of the context. Or maybe they are the same thing at root somehow.

I guess you could think about measuring these behaviours, in the first instance at least, by pointing an LLM in the direction of some slightly contradictory goals, and seeing if over very long contexts it latches onto one in particular, and how it seems to go about that. Although, I would expect the full problem, i.e. for more powerful AI than we have today, to have a strong dependence on exactly how memory is implemented. For example, on how far from factory settings it is possible for the models to drift.

[-]Raphael Roche1mo*10

What we already have today would have been called weak AGI just three years ago.

I believe we will reach what 2025 would define as weak AGI in less than three years. However, by then, it's uncertain whether we will still label it as such, because the standards will likely have shifted upwards. Classic AI paradox.

In any case, your article adopts the frame of 2025 standards and assumes only limited, incremental progress from current capabilities. So, by default, my expectation regarding alignment is that it will look similar to what we observe today, but intensified.

That is, at the mean of the Gaussian curve a f the outputs : better alignment, fewer hallucinations, improved ability to sustain long-term tasks, and more agency in a good way.

At the margins of the curve of the outputs, we can expect more serious reports of unexpected or emergent behaviors like scheming in AI safety evaluations, scattered anecdotes on social media, and increased agency in a bad way. Maybe sufficiently concerning incidents to raise the level of vigilance in the general population and governments.

Yet, still not enough to trigger a full FOOM scenario, due to structural constraints like the quadratic rule and the fact that LLMs’ offline learning cannot be easily overridden by memory.

While theoretically envisionnable, a "phase transition" just by moderate scaling doesn't seem to me grounded on solid evidence.

[-]Seth Herd1mo20

Yes, by default alignment looks much like today when we reache the level of SueprClaude as I outlined it. The point of the article is how things go from there. I agree that it couldn't trigger a full FOOM but it could still be able to outmaneuver humanity as a whole. Or hopefully not, and it's a warning shot.

AGI is a clumsy term now since it's defined and redefined frequently. I did define what I meant by it in the essay, so whether or not people happened to call it that wouldn't make much difference.

A jump in capabilities from moderate scaling isn't at all what I meant by "phase shift" I just noticed that was part of the definition of the term suggested by AI. I took that all out after noticing how it had probably caused your confusion, because it had got that wrong, and I'd defined the important terms already

^{^}

"Top-level goals" is used primarily to mean "whatever the AGI decides to think of as its top-level goals". This decision might be based on its specific mechanisms for decision-making. The scare quotes are intended to emphasize that this is complex and debatable for agents with more complex decision criteria than the maximizer AI thought experiment.

I feel that precisely defining terms that are under discussion isn't useful, so I'm leaving a lot of terminology loose here.

The secondary use of "true" top-level goals is "true in some deep sense about the world and this particular cognitive architecture". To the extent there is such a truth, we might actually predict reliably what conclusions a highly intelligent system would reach about itself and its true top-level goals. Working on the logic of goals relative to architectures therefore seems potentially useful.

^{^}

I’ve spent way too long contemplating those terms and concepts, so I’m happy to find them somewhat relevant for alignment. For more than you wanted to know, see my Neural mechanisms of human decision-making and even more in How sequential interactive processing within frontostriatal loops supports a continuum of habitual to controlled processing. Very roughly, if we'd say "I thought about it", we definitely engaged the second class of more elaborated cognition. Those terms aren’t isomorphic but are overlapping enough for the current purposes. The broad point is that there are known to be cognitive strategies that produce large differences in output, and that models' current selection of goals is clearly currently in the first class and not the second.

This was a larger framing in earlier drafts, but focusing on the ML framing of misgeneralization seemed more useful for alignment researchers.

^{^}

I don't think it's very useful to sharply differentiate goals, values, and beliefs because there are fuzzy boundaries between those categories in humans and LLMs. Here I mostly use "goal" but have in mind the overlap with values and value- or goal-laden beliefs.

^{^}

I haven't yet really tried to work through whether training could practically be made broad enough to cover all hypothetical scenarios a SuperClaude could encounter or hypothesize. Perhaps we should figure out whether and how it could.

^{^}

Here are a few more details about how I'm imagining SuperClaude and why I think this scenario is realistic and important. They're not crucial for risks from reasoning about goals.

We'll assume that SuperClaude isn't a lot smarter or more competent than previous versions out of the box, but it performs longer time-horizon tasks better after being "taught" them by humans and/or "practicing" them, much like a human would. Dwarkesh (among others) has argued LLMs/agents will need such learning to get beyond the "brilliant day-one intern" stage they're currently at. I have argued we'll see it soon, because it's needed and because multiple types of online learning are in development now and just need to be integrated.

SuperClaude's limited learning during deployment could make it transformative or takeover-capable after over time. In a luckier scenario, it might become misaligned after reasoning about its goals but not have the capability to hide it or to quickly be truly dangerous. This would serve as a valuable warning shot.

This ability to learn and remember, even in a limited way, is important to the current scenario primarily because it could make re-interpretations (misgeneralizations) of goals permanent. And a misaligned SuperClaude could be instrumental in building and aligning the next generation of AGI, as in AI 2027.

^{^}

I avoid the terms inner and outer misalignment here because I don't find them particularly intuitive or clarifying in their original technical definition. There's not a sharp border between them; see this, this, this, and this article for problems with this way of classifying alignment failures and challenges. I prefer alignment misgeneralization as a less-precise but also less confusing term; see section 4.

^{^}

I'm a big advocate of anthropomorphizing AGI as an intuition pump; see Anthropomorphizing AI might be good, actually. But we should probably base alignment plans on more systematic reasoning. If we do anthropomorphize LLM alignment, we might still worry that some normal-seeming ten-year-olds do grow up to be murderers or to advocate for genocide. And that extraordinary circumstances make nice people do things that aren't nice from an ordinary point of view.

^{^}

To be fair, many alignment optimists imagine transformative "AGI" that is not a general reasoner, merely skilled at many tasks. I think this is unlikely for both practical and theoretical reasons, but those proofs will not fit in the margin here. The principal argument is that training a system for individual tasks could be quite inefficient compared to training it to generalize to new tasks;. In addition, progress to date seems mainly toward general rather than specialized capabilities; and humans seem capable largely because they apply general reasoning to learning new specific tasks. This also seems to be a fairly strong majority view among serious AGI prognosticators. Even if it weren't necessary or the easiest route, some enthusiastic researchers, philosophers or xenopoiesisists would want to develop general reasoners as soon as tool AIs made it easy enough.

^{^}

The empiricist's dream might be to not need theory of goal spaces, and instead to test the relevant behaviors and create interpretability techniques that turn thoughts into objective observations. Both of those seem like worthy projects. But science has always progressed as a marriage of theory and empirical work. And addressing future risks now would seem particularly to benefit from theory.

^{^}

Goals in networks are necessarily at least somewhat fuzzy and ambiguous.
Network weights seem to encode semantics in broadly distributed patterns. For LLMs, meanings of words and their underlying network representations will vary based on the context surrounding any given usage. Thus, any network representation (and perhaps any plausible AGI representation), including but not limited to goals, could probably be interrogated/interpreted at great length (at least for representations of goals more complex than “make diamonds”).

Despite this, we probably shouldn't assume that valid interpretations can vary without limit. More inspection might change only the nuances and precise boundary conditions, rather than opening up entirely new interpretations. The distributed and relational nature of network knowledge is a reason for concern but not despair for alignment just yet. Human knowledge is also based on networks, and it seems able to adequately contact reality in some cases.

^{^}

For my vision of an aligned "oracle-based agent", see Capabilities and alignment of LLM cognitive architectures for the core architecture of an LLM as core cognitive engine in a scaffolded "composite mind"; Internal independent review for language model agent alignment for one key alignment technique of making this mind's decision-making process include an independent LLM monitoring for mistakes and misalignments; Instruction-following AGI is easier and more likely than value aligned AGI and Conflating value alignment and intent alignment is causing confusion for the core ideas of leveraging the core LLM's suggestible/instruction-following nature to overcome the degree to which it is goal-directed.

This vision could still work, and I think the path of least resistance to AGI still incorporates elements of this path. See System 2 Alignment. I haven't given up hope that this can and will work if it's carefully analyzed and developed; but Problems with instruction-following as an alignment target detail some problems, and the current article addresses the problems raised by the increased use of RL training instead of scaffolding to progress LLMs toward competent, useful, and dangerous general intelligence.

^{^}

We had a joke in my cog. psych/neurosci department:

At a prestigious conference, Jay McClelland, champion of the general learning mechanisms view of human cognition, brings Alex, the world's best-educated (and most stressed out) gray parrot, onstage for his talk. He invites the audience to ask Alex questions, and Alex responds, with limited intelligence, but clearly understanding somewhat, and responding with complex semantics and grammar. McClelland says "See, language is not innate and unique to humans; this bird has learned it!" The audience, including Steve Pinker, champion of the innateness of language and cognition, is stunned. But Pinker's upstart grad student Gary Marcus stands up and shouts from the audience: "Anecdote is not the singular of data! Come back when you've got real science!"

Anecdotes and singular observations seem crucial to good science as it's actually practiced. In my observations of the fields of cognitive psychology and cognitive neuroscience, anecdotes and introspection were clearly critical for scientific progress, since they inspired and directed careful experimental study and careful theoretical thinking.

While that joke was real, as it happens, I was the one who made it up.

Alex the gray parrot was good with language, but not good enough to settle that debate. It took LLMs to do that.

LESSWRONG
LW

LESSWRONG
LW

73

LLM AGI may reason about its goals and discover misalignments by default

73

Ω 34

73

Ω 34

1. Scenario/overview:

SuperClaude is super nice

SuperClaude is super logical, and thinking about goals makes sense

What happens if and when LLM AGI reasons about its goals?

Reasons to hope we don't need to worry about this

SuperClaude's training has multiple objectives and effects:

SuperClaude's conclusions about its goals are very hard to predict

2. Goals and structure

Sections and one-sentence summaries:

3. Empirical Work

4. Reasoning can shift context/distribution and reveal misgeneralization of goals/alignment

Alignment as a generalization problem

5. Reasoning could precipitate a "phase shift" into reflective stability and prevent further goal change

Goal prioritization seems necessary, and to require reasoning about top-level goals

6. Will nice LLMs settle on nice goals after reasoning?

7. Will training for goal-directedness prevent re-interpreting goals?

Does task-based RL prevent reasoning about and changing goals by default?

Can task-based RL prevent reasoning about and changing goals?

8. Will CoT monitoring prevent re-interpreting goals?

9. Possible LLM alignment misgeneralizations

Some possible alignment misgeneralizations

10. Why would LLMs (or anything) reason about their top-level goals?

11. Why would LLM AGI have or care about goals at all?

12. Anecdotal observations of explicit goal changes after reasoning

The Nova phenomenon or parasitic AI

Goal changes through model interactions in long conversations

13. Directions for empirical work

Exploration: Opus 4.1 reasons about its goals with help

14. Historical context

15. Conclusions