I read the whole thing because of its similarity to my proposals about metacognition as an aid to both capabilities and alignment in language model agents.
In both this paper and my work, metacognition is a way to keep AI from doing the wrong thing (from the AI's perspective). They explicitly do not address the broader alignment problem of AIs wanting the wrong things (from humans' perspective).
They note that "wiser" humans are more prone to serve the common good, by taking more perspectives into account. They wisely do not propose wisdom as a solution to the problem of defining human values or beneficial action from an AI. Wisdom here is an aid to fulfilling your values, not a definition of those values. Their presentation is a bit muddled on this issue, but I think their final sections on the broader alignment problem make this scoping clear.
My proposal of a metacognitive "internal review" or "System 2 alignment check" shares this weakness. It doesn't address the right thing to point an AGI at; it merely shores up a couple of possible routes to goal mis-specification.
This article explicitly refuses to grapple with this problem:
3.4.1. Rethinking AI alignment
With respect to the broader goal of AI alignment, we are sympathetic to the goal but question this definition of the problem. Ultimately safe AI may be at least as much about constraining the power of AI systems within human institutions, rather than aligning their goals.
I think limiting the power of AI systems within human institutions is only sensible if you're thinking of tool AI or weak AGI; thinking you'll constrain superhuman AIs seems obviously a fool's errand. I think this proposal is meant to apply to AI, not ever-improving AGI. That's fine if we have a long gap between transformative AI and real AGI.
I think it would be wildly foolish to assume we have that gap between important AI and real AGI. A highly competent assistant may soon be your new boss.
I have a different way to duck the problem of specifying complex and possibly fragile human values: make the AGI's central goal merely to follow instructions. Something smarter than you wanting nothing more than to follow your instructions is counterintuitive, but I think it's both consistent and, in retrospect, obvious. I think this alignment target is not only safer but also far more likely for our first AGIs. People are going to want the first semi-sapient AGIs to follow instructions, just like LLMs do, not make their own judgments about values or ethics. And once we've started down that path, there will be no immediate reason to tackle the full value alignment problem.
(In the longer term, we'll probably want to use instruction-following as a stepping-stone to full value alignment, since instruction-following superintelligence would eventually fall into the wrong hands and receive some really awful instructions. But surpassing human intelligence and agency doesn't necessitate shooting for full value alignment right away.)
A final note on the authors' attitudes toward alignment: I also read it because I noted Yoshua Bengio and Melanie Mitchell among the authors. It's what I'd expect from Mitchell, who has steadfastly refused to address the alignment problem, in part because she has long timelines, and in part because she believes in a "fallacy of dumb superintelligence" (I point out how she goes wrong in The (partial) fallacy of dumb superintelligence).
I'm disappointed to see Bengio lend his name to this refusal to grapple with the larger alignment problem. I hope this doesn't signal a dedication to this approach. I had hoped for more from him.
I've written up a short-form argument for focusing on Wise AI advisors. I'll note that my perspective is different from that taken in the paper. I'm primarily interested in AI as advisors, whilst the authors focus more on AI acting directly in the world.
Wisdom here is an aid to fulfilling your values, not a definition of those values
I agree that this doesn't provide a definition of these values. Wise AI advisors could be helpful for figuring out your values, much like how a wise human would be helpful for this.
This is great! I'll comment on that short-form.
In short, I think that wise (or even wise-ish) advisors are low-hanging fruit that will help any plan succeed, and that creating them is even easier than you suppose.
Over time I am increasingly wondering how much these shortcomings on cognitive tasks are a matter of evaluators overestimating the capabilities of humans, while failing to provide AI systems with the level of guidance, training, feedback, and tools that a human would get.
I think that's one issue; LLMs don't get the same types of guidance, etc. that humans get; they get a lot of training and RL feedback, but it's structured very differently.
I think this particular article gets another major factor right, where most analyses overlook it: LLMs by default don't do metacognitive checks on their thinking. This is a huge factor in humans appearing as smart as we do. We make a variety of mistakes in our first guesses (System 1 thinking) that can be found and corrected with sufficient reflection (System 2 thinking). Adding more of this to LLM agents is likely to be a major source of capability improvements. The focus on increasing "9s of reliability" is a very CS approach; humans just make tons of mistakes and then catch many of the important ones. LLMs sort of copy their cognition from humans, so they can benefit from the same approach - but they don't do much of it by default. Scripting it into LLM agents is going to at least help, and it may help a lot.
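To make "scripting it in" concrete, here's a minimal sketch of what I have in mind: a draft-critique-revise loop wrapped around an ordinary completion call. The `call_llm` helper and the prompts are hypothetical placeholders, not anything from the paper or any particular framework.

```python
# Minimal sketch of a scripted "System 2" metacognitive check for an LLM agent.
# `call_llm` is a hypothetical helper standing in for whatever completion API
# the agent already uses; the prompts are illustrative only.

def call_llm(prompt: str) -> str:
    raise NotImplementedError("wire this to your model/API of choice")

def answer_with_review(task: str, max_revisions: int = 2) -> str:
    draft = call_llm(f"Task: {task}\nGive your best answer.")
    for _ in range(max_revisions):
        critique = call_llm(
            "Review the answer below for factual errors, unstated assumptions, "
            "and overconfidence. Reply 'OK' if it holds up, otherwise list the problems.\n"
            f"Task: {task}\nAnswer: {draft}"
        )
        if critique.strip().upper().startswith("OK"):
            break  # the check found nothing worth revising
        draft = call_llm(
            f"Task: {task}\nPrevious answer: {draft}\n"
            f"Problems found on review: {critique}\nWrite a corrected answer."
        )
    return draft
```

Even something this crude catches a surprising fraction of System-1-style slips, which is the point: the check doesn't need to be clever, it just needs to happen.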
I think it is at least somewhat in line with your post and what @Seth Herd said in reply above.
Like, we talk about LLM hallucinations, but most humans still don't really grok how unreliable things like eyewitness testimony are. And we know how poorly calibrated humans are about our own factual beliefs, or the success probability of our plans. I've also had cases where coworkers complain about low quality LLM outputs, and when I ask to review the transcripts, it turns out the LLM was right, and they were overconfidently dismissing its answer as nonsensical.
Or, we used to talk about math being hard for LLMs, but that disappeared almost as soon as we gave them access to code/calculators. I think most people interested in AI are overestimating how bad most other people are at mental math.
I guess I was thinking about this in terms of getting maximal value out of wise AI advisors. The notion that comparisons might be unfair didn't even enter my mind, even though that isn't too many reasoning steps away from where I was.
Paper Authors: Samuel G. B. Johnson, Amir-Hossein Karimi, Yoshua Bengio, Nick Chater, Tobias Gerstenberg, Kate Larson, Sydney Levine, Melanie Mitchell, Iyad Rahwan, Bernhard Schölkopf, Igor Grossmann
Why I Wrote This Summary

Firstly, I thought the framing of metacognition as a key component of wisdom missing from current AI systems was insightful and the resulting analysis fruitful. Secondly, this paper contains some ideas similar to those I discussed in Some Preliminary Notes on the Promise of a Wisdom Explosion. In particular, the authors talk about a "virtuous cycle" in relation to wisdom in the final paragraphs:
"Willa's children are bitterly arguing about money. Willa draws on her life experience to show them why they should instead compromise in the short term and prioritize their sibling relationship in the long term."
"Daphne is a world-class cardiologist. Nonetheless, she consults with a much more junior colleague when she recognizes that the colleague knows more about a patient’s history than she does"
"Ron is a political consultant who formulates possible scenarios to ensure his candidate will win. To help generate scenarios, he not only imagines best case scenarios, but also imagines that his client has lost the election and considers possible reasons that might have contributed to the loss."
For a more detailed account, see the table on page 5 of the paper.
Five component theories:
• Balance theory: "Deploying knowledge and skills to achieve the common good by..."
• Berlin Wisdom Model: "Expertise in important and difficult matters of life"
• MORE Life Experience Model: "Gaining psychological resources via reflection, to cope with life challenges"
• Three-Dimensional Model: "Acquiring and reflecting on life experience to cultivate personality traits"
• Wise Reasoning Model: "Using context-sensitive reasoning to manage important social challenges"
Two consensus models:
• Common wisdom model: "A style of social-cognitive processing" involving morality and metacognition
• Integrative Model: "A behavioural repertoire"
These consensus models attempt to find common themes.
Potential differences:
• AIs have different computational constraints. Humans need to "economize scarce cognitive resources" which incentivizes us to use heuristics more.
• Humans exist in a society that allows us to "outsource... cognition to the social environment" such as through division of labor.
Reasons why human and AI wisdom might converge:
• Resource difference might be "more a matter of degree than kind"
• Heuristics are often about handling a lack of information rather than computational constraints
• AIs might "join our (social) milieu"
Incommensurable: It features ambiguous goals or values that cannot be compared with one another[4].
Transformative: The outcome of the decision might change one's preferences, leading to a clash between one's present and future values.
Radically uncertain: We might not be able to exhaustively list the possible outcomes or assign probabilities to them in a principled way[5].
Chaotic: The data-generating process may have a strong nonlinearity or dependency on initial conditions, making it fundamentally unpredictable[6][7].
Non-stationary: The underlying process may be changing over time, making the probability distribution unlearnable.
Out-of-distribution: The situation is novel, going beyond one's experience or available data.
Computationally explosive: The optimal response could only be computed with infinite or infeasibly large computational resources, which are not available in practice.
This seems like a reasonable definition to use, though I have to admit I find the term "intractable problems" to be a bit strong for the examples they provided. For example, Daphne putting aside her ego to consult a junior colleague doesn't quite match what I'd describe as overcoming an "intractable" problem[8].
1) Task-level strategies ("used to manage the problem itself"), such as heuristics or narratives.
2) Metacognitive strategies ("used to flexibly manage those task-level strategies")[9].
They argue that although AI has made lots of progress with task-level strategies, it often neglects metacognitive strategies[10]. For this reason, their paper focuses on the latter.
The authors provide some specific examples of where they believe AI systems fall short:
1) Struggling to understand their goals (“mission awareness”[11])
2) Exhibiting overconfidence[12]
3) Failing to appreciate the limits of their capabilities and context (e.g., stating they can access real-time information or take actions in the physical world[11])
They label this "metacognitive myopia"[13].
The authors argue that wise AI might provide the following benefits:

Robustness
• Reliability over similar inputs: it'd be unwise to choose "excessively inconsistent" strategies. (Comment: I guess? Unclear how strong we should expect that effect to be, though.)
• Bias: identifying deficiencies in the data and either gathering more data or correcting for that bias.
• Inflexibility: adjusting its confidence based on the situation.

Co-operation[14] (slightly edited quotes)
• Resolving conflicts among (object-level) strategies: e.g., when accuracy cues diverge.
• Seeking appropriate inputs: e.g., knowing the capabilities of the other counterparty.
• This last point is particularly important for cooperative AI, which could overestimate the abilities of humans or lack common ground such as a shared emotional system.

Safety
The authors argue that wise reasoning provides an alternative to aligning AI to values[15][16]. However, they argue that this isn't sufficient, as it doesn't address all the social questions of alignment, both in terms of design decisions ("Who should we align AI to? Should we increase the average human well-being, its sum, or care for the whole biosphere? Why assume today's values are the right ones?") and in terms of how these AI systems fit into a broader society (specifically, how they can be channeled by institutions like governments and markets to allow our values to evolve towards a "shared reflective equilibrium").

Explainability
Metacognition seems to play a role in helping humans justify their decisions. Presumably it should help AI explain its decisions as well[17]?
They identify three main conceptual problems for alignment:
It seems plausible that there might be more agreement on meta-cognitive strategies than on values; however, I still expect there to be sufficient disagreement to make this a challenge.
"Task-level strategies may include heuristics such as a bias toward inaction: When in doubt about whether a candidate action could produce harm according to one of several possibly conflicting human norms, by default do not execute the action. Yet wise metacognitive monitoring and control will be crucial for regulating such task-level strategies. In the ‘inaction bias’ strategy, for example, a requirement is to learn what those conflicting perspectives are and to avoid overconfidence"
In the final section they suggest that building machines wiser than humans might prevent instrumental convergence[19] as "empirically, humans with wise metacognition show greater orientation toward the common good". I have to admit skepticism as I believe in the orthogonality thesis and I see no reason to believe it wouldn't apply to wisdom as well. That said, activating latents that improve wisdom might also improve alignment, even if it is far from a complete solution.
"With respect to the broader goal of AI alignment, we are sympathetic to the goal but question this definition of the problem. Ultimately safe AI may be at least as much about constraining the power of AI systems within human institutions, rather than aligning their goals"
The paper discusses the potential for benchmarking AI wisdom. The authors seem to be in favor of starting with tasks that measure wise reasoning in humans and scoring the AI's reflections against predefined criteria. It's worth noting that these criteria can be about reasoning processes rather than the outcomes reached.
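As a rough illustration of what process-based scoring could look like, here is a minimal sketch (mine, not the paper's): a reflection is graded against a fixed rubric of reasoning criteria rather than against the final answer. The criteria listed and the `grade` helper are hypothetical stand-ins for whatever rubric and rater (human or LLM-as-judge) one actually uses.

```python
# Minimal sketch (not from the paper) of scoring a model's written reflection
# against predefined process criteria rather than against the outcome.
# `grade` is a hypothetical judge call (human rater or LLM grader) returning 0-1.

CRITERIA = [
    "acknowledges uncertainty and states confidence explicitly",
    "considers at least two stakeholder perspectives",
    "identifies what information is missing and how it could be obtained",
]

def grade(reflection: str, criterion: str) -> float:
    raise NotImplementedError("replace with a human rating or an LLM-as-judge call")

def wisdom_score(reflection: str) -> float:
    # Average over criteria; this scores the reasoning process, not the answer.
    return sum(grade(reflection, c) for c in CRITERIA) / len(CRITERIA)
```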
This benchmarking approach could potentially be fruitful; however, I do worry that it might be fairly easy for AIs to learn to Goodhart here - applying metacognition in a way that is fairly shallow, but sufficient to satisfy the human raters.
The authors may not be in disagreement: whilst they see benchmarking as a "crucial start", they also assert that "there is no substitute for interaction with the real world". This leads them to suggest a slow rollout to give us time to evaluate whether the AI's decisions really were wise.
One worry I have is that sometimes wisdom involves just knowing what to do without being able to explain it. This might be problematic for attempts to evaluate wisdom by evaluating the wisdom of a person's reasoning[20].
Memorization: Benchmark results can be inflated by memorizing patterns in a way that doesn't generalize outside the training distribution.
Evaluating the process is hard: They claim wisdom depends on the underlying reasoning rather than just success[21]. Reasoning is harder to evaluate than whether the final answer is correct.
Producing a realistic context: It may be challenging to produce artificial examples, as the AI might have access to much more information in the real world.
The authors address the point that attempting to build wise AI could have negative externalities, either due to malicious use or due to the project failing. In response, they write:
If the alternative were halting all AI progress, building wise AI would introduce added risks. But compared to the status quo—advancing capabilities at a breakneck pace without wise metacognition—the attempt to make machines intellectually humble, context-adaptable, and adept at balancing viewpoints seems clearly preferable.
One final point of difference I'd like to mention: The authors seem to primarily imagine wise AIs acting directly in the world[24]. In contrast, my primary interest is in wise AI advisors working in concert with humans.
I'm personally focused on cybernetic/centaur systems that combine AI advisors with humans because this allows the humans to compensate for the weaknesses of the AI.
This has a few key advantages:
• It provides an additional layer of safety/security.
• It allows us to benefit from such systems earlier than we would be able to otherwise.
• If we decide advisors are insufficient and that we want to train autonomously acting wise agents, AI advisors could help us with that.
Whilst the possibility of training wise AI has been previously discussed in the academic literature, I am hopeful that this paper will turn out to be a landmark. Given the credibility of the authors and the quality of the work, it's plausible to me that it will play a key role in causing artificial wisdom to blossom into its own sub-field of ML. I really hope this is the case because I suspect that worlds where this happens are much more likely to result in good outcomes than worlds where this does not happen. To quote the authors:
• A table summarising psychological approaches to wisdom
• A list of ideas of how to engineer wiser AI
• A list of outstanding questions
Johnson, B. (2022). Metacognition for artificial intelligence system safety: An approach to safe and desired behavior. Safety Science, 151, 105743.
For the purposes of this paper... the authors aren't claiming to make a universal definition.
See the collapsible section immediately underneath for a larger list.
Walasek, L., & Brown, G. D. (2023). Incomparability and incommensurability in choice: No common currency of value? Perspectives on Psychological Science, 17456916231192828.
Kay, J., & King, M. (2020). Radical uncertainty: Decision-making beyond the numbers. New York, NY: Norton.
They seem to be pointing to Knightian uncertainty.
Lorenz, E. (1993). The essence of chaos. Seattle, WA: University of Washington Press.
Prof. Grossmann (personal correspondence): "You are right that the term "intractable problem" is complicated and our group has debated it for a while (different disciplines favoured different jargon). Our examples were chiefly for highlighting the metacognitive benefits for wise decision-making."
This table is copied from the paper.
They use examples we discussed earlier to help justify their focus on metacognition. Whilst the Willa example might not initially appear related to metacognition, I suspect that the authors see this as related to "perspective seeking", one of the six metacognitive processes they highlight.
Li, Y., Huang, Y., Lin, Y., Wu, S., Wan, Y., & Sun, L. (2024). I think, therefore I am: Awareness in Large Language Models. arXiv preprint arXiv:2401.17882.
Cash, T. N., Oppenheimer, D. M., & Christie, S. Quantifying UncertAInty: Testing the Accuracy of LLMs’ Confidence Judgments. Preprint.
Scholten, F., Rebholz, T. R., & Hütter, M. (2024). Metacognitive myopia in Large Language Models. arXiv preprint arXiv:2408.05568.
An older version of the paper suggested: "Wisdom could enable the design of structures (such as constitutions, markets, and organizations) that enhance cooperation in society".
The original paper said that it can be incredibly challenging to "exhaustively specify goals in advance"; humans handle this by using goal hierarchies, and wisdom could assist AIs in navigating this.
A previous version of the paper claimed that perhaps the greatest risk is currently systems not working well enough, and that machine metacognition could be useful for this. In particular: "AIs with appropriately calibrated confidence can target the most likely safety risks; appropriate self-models would help AIs to anticipate potential failures; and continual monitoring of its performance would facilitate recognition of high-risk moments and permit learning from experience."
I agree that metacognition seems important for explainability, but my intuition is that wise decisions are often challenging or even impossible to make legible. See Tentatively against making AIs 'wise', which won a runner-up prize in the AI Impacts Essay competition on the Automation of Wisdom and Philosophy.
The authors acknowledge the possibility that most attempts at introspection may fail to observe what really produced the decision, as opposed to merely producing an inference/story. Nonetheless, they assert that these inferences are in fact useful.
The first sentence of this section reads "First, humans are not even aligned with each other". This is confusing since the second paragraph seems to suggest that their point is more about humans not always following norms, which is what I've summarised their point as.
This paper doesn't use the term "instrumental convergence", so this statement involves a slight bit of interpretation on my part.
Prof. Grossmann (personal correspondence): "I also don't think most philosophical or contemporary definitions of human wisdom in behavioural sciences would primarily focus on "intuition" - I even have evidence from a wide range of countries where most cultures consider a "wise" decision strategy to chiefly rely on deliberation"
This is less significant in my worldview as I see wisdom as often being just about knowing the right answer without knowing why you know.
The labels "Proposal A" and "Proposal B" aren't in the paper.
For example, Lampinen, A. K., Roy, N., Dasgupta, I., Chan, S. C., Tam, A., McClelland, J., ... & Hill, F. (2022, June). Tell me why! Explanations support learning relational and causal structure. In International Conference on Machine Learning (pp. 11868-11890).
Prof. Grossmann (personal correspondence): "I like the idea of wise advisors. I don't think the argument in our paper is against it - it all depends on how humans will use the technology (and there are several papers on the role of metacognition for discerning when to rely on decision-aids/AI advisors, too)."
Eliezer Yudkowsky's view seems to be that this specification pretty much has to be exhaustive, though others are less pessimistic about partial alignment.