Aligning AI by optimizing for "wisdom"

Elliot Mckernon

At first glance I thought this was too abstract to be a useful plan, but coming back to it, I think it is promising as a form of automated training for an aligned agent, provided you have an agent that is excellent at evaluating small logic chains, along the lines of Constitutional AI or training for consistency. You could have training loops using synthetic data that train for all of these forms of consistency, probably implementable as an MVP with current systems.

The main unknown would be knowing when you are confident enough that its stated values align with human values to start moving down the causal chain towards fitting actions to values, as this is clearly a strongly capabilities-enhancing process.

Perhaps you could at least get a measure by looking at comparisons that require multiple steps (human value -> value -> belief, etc.) and then asking which step is the bottleneck to reaching the conclusion that humans would want. Positing that the agent is capable of this might be assuming away a lot of the problem, though.
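To make the bottleneck idea concrete, here is a minimal sketch. It assumes you already have some evaluator that assigns each link in the chain a consistency score between 0 and 1; that evaluator, the function name `find_bottleneck`, and the scores below are all hypothetical placeholders, not something specified in the post.

```python
from typing import Dict, Tuple

Link = Tuple[str, str]


def find_bottleneck(link_scores: Dict[Link, float]) -> Tuple[Link, float]:
    """Return the link with the lowest consistency score, and that score.

    The scores are assumed to come from some external evaluator; how to
    obtain trustworthy scores is exactly the open problem noted above.
    """
    if not link_scores:
        raise ValueError("need at least one scored link")
    weakest = min(link_scores, key=lambda link: link_scores[link])
    return weakest, link_scores[weakest]


# Made-up scores purely for illustration.
chain_scores: Dict[Link, float] = {
    ("human value", "value"): 0.9,
    ("value", "belief"): 0.6,
    ("belief", "action"): 0.8,
}

bottleneck, score = find_bottleneck(chain_scores)
print(bottleneck, score)
```

Taking the minimum treats the chain as only as strong as its weakest link, which is one possible reading of "bottleneck"; other aggregation choices are possible.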

Chris_Leong (2y)

Hmm… this is different from my model of wisdom, which is more about getting the important decisions correct than about consistency.

Gordon Seidoh Worley (2y)

I kind of want to comment on this but am finding it hard to do so, so I'll at least leave a comment expressing my frustration.

This post falls into some kind of uncanny valley of "feels wrong but both too much and not enough detail to criticize it directly". There's lots of wiggle room here, with things underdefined in ways that make it hard to address them or to know whether this seems reasonable. It pattern-matches, though, to lots of things in the category of "hey, I just heard about alignment and I thought about it for a while and I think I see how to solve it", while missing the most egregious errors of that category, which is why it is hard to say much about.

So I come away thinking I have no reason to think this will work but also unable to say anything specific about why I think it won't work other than I think there's a bunch of hidden details in here that are not being adequately explored.

David_Kristoffersson (2y)

I would reckon: no single AI safety method "will work" because no single method is enough by itself. The idea expressed in the post would not "solve" AI alignment, but I think it's a thought-provoking angle on part of the problem.

David_Kristoffersson (2y)

I quite like the concept of alignment through coherence between the "coherence factors"!

"Wisdom" has many meanings. I would use the word differently to how the article is using it.

Gurkenglas (2y)

At a glance, this has too many parts to be the same math aliens would find. Perhaps formalize the process that generated them, then define wisdom in terms of it?

Justin Bullock (2y)

Thank you for this post! As I mentioned to both of you, I like your approach here. In particular, I appreciate the attempt to provide some description of how we might optimize for something we actually want, something like wisdom.

I have a few assorted thoughts for you to consider:

1. I would be interested in additional discussion around the inherent boundedness of agents that act in the world. I think self-consistency and inter-factor consistency have some fundamental limits that could be worth exploring within this framework. For example, might different types of boundedness systematically undermine wisdom in ways that we can predict or try to account for? You point out that these forms of consistency are continuous, which I think is a useful step in this direction.
2. I'm wondering about feedback mechanisms for a wise agent in this context. For example, it would be interesting to know a little more about how a wise agent incorporates feedback into its model from the consequences of its own actions. I would be interested to see more in this direction in any future posts.
3. It strikes me that this post titled "Agency from a Causal Perspective" (https://www.alignmentforum.org/posts/Qi77Tu3ehdacAbBBe/agency-from-a-causal-perspective) might be of some particular interest to your approach here.

Excellent post here! I hope the comments are helpful!

Mateusz Bagiński (1y)

Based on my attending Oliver's talk, this may be relevant/useful:

| | Evidence | Beliefs | Values | Actions |
|---|---|---|---|---|
| Evidence | Failure to identify patterns | | | |
| Beliefs | Failure to infer correct patterns from evidence | Cognitive dissonance (& deductive explosion) | | |
| Values | Failure to recognise valued states or distinguish the value of two states | Failure to recognise valued states, or distinguish the value of two states | May follow incoherent strategies (e.g. Dutch booking) | |
| Actions | Failure of actions to achieve plans | Failure of actions to achieve plans/goals. | Failure to achieve movement towards your ideals | Actions undermine each other: goals not achieved |
| Environment | Failure of sensory input or mapping | Failure of data processing | Irrelevant values: caring about things you can't influence | Failure to influence the environment |
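
For anyone who wants to work with the table programmatically, here is one possible encoding as a lookup keyed by (row, column). The names `FAILURE_MODES` and `failure_mode` are my own placeholders, and the cell text is simply transcribed from the table above; pairs that are blank in the table return `None`.

```python
from typing import Dict, Optional, Tuple

# Transcription of the table above, keyed as (row label, column label).
# Blank cells in the table are simply absent from the dict.
FAILURE_MODES: Dict[Tuple[str, str], str] = {
    ("Evidence", "Evidence"): "Failure to identify patterns",
    ("Beliefs", "Evidence"): "Failure to infer correct patterns from evidence",
    ("Beliefs", "Beliefs"): "Cognitive dissonance (& deductive explosion)",
    ("Values", "Evidence"): "Failure to recognise valued states or distinguish the value of two states",
    ("Values", "Beliefs"): "Failure to recognise valued states, or distinguish the value of two states",
    ("Values", "Values"): "May follow incoherent strategies (e.g. Dutch booking)",
    ("Actions", "Evidence"): "Failure of actions to achieve plans",
    ("Actions", "Beliefs"): "Failure of actions to achieve plans/goals.",
    ("Actions", "Values"): "Failure to achieve movement towards your ideals",
    ("Actions", "Actions"): "Actions undermine each other: goals not achieved",
    ("Environment", "Evidence"): "Failure of sensory input or mapping",
    ("Environment", "Beliefs"): "Failure of data processing",
    ("Environment", "Values"): "Irrelevant values: caring about things you can't influence",
    ("Environment", "Actions"): "Failure to influence the environment",
}


def failure_mode(row: str, column: str) -> Optional[str]:
    """Look up the failure mode for a (row, column) pair, or None if blank."""
    return FAILURE_MODES.get((row, column))


print(failure_mode("Beliefs", "Beliefs"))
print(failure_mode("Evidence", "Actions"))
```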
