There needs to be some process which, given a context, specifies what value shards should be created (or removed/edited) to better work in that context. Not clear we can't think of this as constituting the system's immutable goal in some sense, especially as it gets more powerful. That said it would probably not be strongly coherent by your semi-formal definition.
I think you are onto something, with the implication that building a highly intelligent, learning entity with strong coherence in this sense is unlikely, and hence that getting it morally aligned in this fashion is also unlikely. Which isn't that bad, insofar as plans for aligning it that way honestly did not look particularly promising.
Which is why I have been advocating for instead learning from how we teach morals to existing complex intelligent agents - namely, through ethical, rewarding interactions in a controlled environment that slowly allows more freedom.
We know how to do this; it does not require us to somehow define the core of ethics mathematically. We know it works. We know what setbacks look like, and how to tackle them. We know how to do this with human interactions the average person can carry out or be trained in, rather than with code. It seems easier, more doable, and more promising in so many ways.
That doesn't mean it will be easy, or risk free, and it still comes with a hell of a lot of problems based on the fact that AIs, even machine learning ones, are quite simply not human: they are not inherently social, they do not inherently have altruistic urges, and they do not inherently have empathic abilities. But I see a clearer path to dealing with that than to directly encoding an abstract ethics into an intelligent, flexible actor.
EDIT: I found out my answer is quite similar to this other one you probably read already.
I think not.
Imagine such a malleable agent's mind as made of parts. Each part of the mind does something. There's some arrangement of the things each part does, and of how many parts do each kind of thing. We won't ask right now where this organization comes from, but take it as given.
Imagine that, be it by chance or design, some parts were cooperating, while some were not. "Cooperation" means taking actions that bring about a consequence in a somewhat stable way, so something towards being coherent and consequentialist, although not perfectly so by any measure. The other parts would oftentimes work at cross purposes, treading on each other's toes. "Working at cross purposes", again, in other words means not being consequentialist and coherent; from the point of view of the parts, there may not even be a notion of "cross purposes" if there is no purpose.
By the nature of coherence, the ensemble of coherent and aligned parts would get to their purpose much more efficiently than the other parts are not-getting to that purpose and being a hindrance, assuming the purpose was reachable enough. This means that coherent agents are not just reflectively consistent, but also stable: once there's some seed of coherence, it can win over the non-coherent parts.
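A toy numerical sketch of that stability claim (my own illustration, not something from the comment): parts that all push in the same direction add up linearly with their number, while parts pushing in independent random directions mostly cancel and add up only like the square root, so a coherent coalition out-pulls an equally numerous collection of parts working at cross purposes.

```python
# Toy sketch (illustrative only): compare the net "pull" of n parts that all
# push in the same direction (a coherent coalition) against n parts pushing
# in independent random directions (parts working at cross purposes).
import math
import random

def net_pull(n_parts: int, coherent: bool, dims: int = 2) -> float:
    """Magnitude of the vector sum of n_parts unit-length pushes."""
    total = [0.0] * dims
    for _ in range(n_parts):
        if coherent:
            direction = [1.0] + [0.0] * (dims - 1)  # everyone pushes the same way
        else:
            raw = [random.gauss(0.0, 1.0) for _ in range(dims)]
            norm = math.sqrt(sum(x * x for x in raw))
            direction = [x / norm for x in raw]
        total = [t + d for t, d in zip(total, direction)]
    return math.sqrt(sum(t * t for t in total))

random.seed(0)
for n in (10, 100, 1000):
    avg_incoherent = sum(net_pull(n, coherent=False) for _ in range(200)) / 200
    print(f"n={n:4d}  coherent={net_pull(n, coherent=True):7.1f}  incoherent~{avg_incoherent:6.1f}")
# The coherent pull grows like n, the incoherent pull only like sqrt(n):
# one way to cash out "a seed of coherence can win over the non-coherent parts".
```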
> Conclusion 1: Intelligent systems in the real world do not converge towards strong coherence
It seems to me that humans are more coherent and consequentialist than other animals. Humans are not perfectly coherent, but the direction is towards more coherence. Actually, I'd expect that any sufficiently sophisticated bounded agent would not introspectively look coherent to itself if it spent enough time to think about it. Would the trend break after us?
Would you take a pill that would make you an expected utility maximiser?
Would you take a pill that made you a bit less coherent? Would you take a pill that made you a bit more coherent? (Not rhetorical questions.)
> By the nature of coherence, the ensemble of coherent and aligned parts would get to their purpose much more efficiently than the other parts are not-getting to that purpose and being a hindrance, assuming the purpose was reachable enough. This means that coherent agents are not just reflectively consistent, but also stable: once there's some seed of coherence, it can win over the non-coherent parts.
I think this fails to adequately engage with the hypothesis that values are inherently contextual.
Alternatively, the kind of cooperation you describe where ...
(A somewhat theologically inspired answer:)
Outside the dichotomy of values (in the shard-theory sense) vs. immutable goals, we could also talk about valuing something that is in some sense fixed, but "too big" to fit inside your mind. Maybe a very abstract thing. So your understanding of it is always partial, though you can keep learning more and more about it (and you might shift around, feeling out different parts of the elephant). And your acted-on values would appear mutable, but there would actually be a, perhaps non-obvious, coherence to them.
It's possible this is already sort of a consequence of shard theory? In the sense that learned values would have coherences that accord with (perhaps very abstract or complex) invariant structure in the environment?
My claim is mostly that real world intelligent systems do not have values that can be well described by a single fixed utility function over agent states.
I do not see this answer as engaging with that claim at all.
If you define utility functions over agent histories, then everything is an expected utility maximiser for the function that assigns positive utility to whatever action the agent actually took and zero utility to every other action.
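A minimal sketch of that degenerate construction (toy code with made-up action names, just to make it concrete):

```python
# Sketch of the trivial "rationalising" utility function over action histories:
# it assigns utility 1 to exactly the history the agent actually produced and
# 0 to everything else, so any behaviour whatsoever maximises it.
from typing import Callable, Sequence

def rationalising_utility(observed: Sequence[str]) -> Callable[[Sequence[str]], float]:
    """Build a utility function that rewards only the observed action history."""
    def utility(history: Sequence[str]) -> float:
        return 1.0 if list(history) == list(observed) else 0.0
    return utility

what_the_agent_did = ["turn_left", "eat_sandwich", "take_nap"]  # whatever happened
u = rationalising_utility(what_the_agent_did)

assert u(what_the_agent_did) == 1.0                          # the actual history "maximises" u
assert u(["turn_right", "eat_sandwich", "take_nap"]) == 0.0  # every deviation scores zero
# The construction works for any behaviour, so it tells us nothing about the agent.
```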
I think such a definition of utility function is useless.
If, however, you define utility functions over agent states, ...
Oh, huh, this post was on the LW front page, and dated as posted today, so I assumed it was fresh, but the replies' dates are actually from a month ago.
> Systems with malleable values do not self modify to have (immutable) terminal goals
Consider the alternative framing where agents with malleable values don't modify themselves, but still build separate optimizers with immutable terminal goals.
These two kinds of systems could then play different roles. For example, strong optimizers with immutable goals could play the role of laws of nature, making the most efficient use of underlying physical substrate to implement many abstract worlds where everything else lives. The immutable laws of nature in each world could specify how and to what extent the within-world misalignment catastrophes get averted, and what other value-optimizing interventions are allowed outside of what the people who live there do themselves.
Here, strong optimizers are instruments of value, they are not themselves optimized to be valuable content. And the agents with malleable values are the valuable content from the point of view of the strong optimizers, but they don't need to be very good at optimizing things for anything in particular. The goals of the strong optimizers could be referring to an equilibrium of what the people end up valuing, over the vast archipelago of civilizations that grow up with many different value-laden laws of nature, anticipating how the worlds develop given these values, and what values the people living there end up expressing as a result.
But this is a moral argument, and misalignment doesn't respect moral arguments. Even if it's a terrible idea for systems with malleable values to either self modify into strong immutable optimizers or build them, that doesn't prevent the outcome where they do that regardless and perish as a result, losing everything of value. Moloch is the most natural force in a disorganized society that's not governed by humane laws of nature. Only nothingness above.
To get to coherence, you need a method that accepts incoherence and spits out coherence. In the context of preferences, two datapoints:
So it looks like computing the coherent version of incoherent preferences is computationally difficult. Don't know about approximations, or how this applies to Helmholtz decomposition (though vector fields also can't represent all the known incoherence).
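As a toy illustration of what "accepting incoherence and spitting out coherence" could mean for preferences (my sketch, not from the comment above): take cyclic pairwise preferences and brute-force the ranking that violates the fewest of them. The search below is exponential in the number of options, and the general problem (minimum feedback arc set / Kemeny-style rank aggregation) is NP-hard, which matches the "computationally difficult" point.

```python
# Toy sketch: find the coherent (transitive) ranking that disagrees with the
# fewest of a set of possibly cyclic pairwise preferences, by brute force.
from itertools import permutations

# Cyclic, hence incoherent: A > B, B > C, C > A, plus an extra A > C vote.
preferences = [("A", "B"), ("B", "C"), ("C", "A"), ("A", "C")]
options = ["A", "B", "C"]

def violations(ranking, prefs):
    """Count stated preferences that the ranking gets backwards."""
    position = {opt: i for i, opt in enumerate(ranking)}
    return sum(1 for better, worse in prefs if position[better] > position[worse])

best = min(permutations(options), key=lambda r: violations(r, preferences))
print(best, violations(best, preferences))  # ('A', 'B', 'C'), 1 unavoidable violation
```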
> Informally: a system has (immutable) terminal goals. Semiformally: a system's decision making is well described as (an approximation) of argmax over actions (or higher level mappings thereof) to maximise (the expected value of) a simple unitary utility function.
Are the (parenthesized) words part of your operationalization or not? If so, I would recommend removing the parentheses, to make it clear that they are not optional.
Also, what do you mean by "a simple unitary utility function"? I suspect other people will also be confused/thrown off by that description.
The "or higher mappings thereof" is to accommodate agents that choose state —> action policies directly, and agent that choose policies over ... over policies, so I'll keep it.
I don't actually know if my critique applies well to systems that have non-immutable terminal goals.
I guess if you have sufficiently malleable terminal goals, you get values almost exactly.
> Are the (parenthesized) words part of your operationalization or not? If so, I would recommend removing the parentheses, to make it clear that they are not optional.
Will do.
> Also, what do you mean by "a simple unitary utility function"? I suspect other people will also be confused/thrown off by that description.
If you define your utility function in a sufficiently convoluted manner, then everything is a utility maximiser.
Less contrived, I was thinking of stuff like Wentworth's subagents that identifies decision making with pareto optimality over a set of utility functions.
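A rough sketch of what decision making as Pareto optimality over a set of utility functions can look like (my toy example, not Wentworth's actual formalism): an option is admissible iff no other option is at least as good for every subagent and strictly better for at least one.

```python
# Toy sketch: a "committee" of subagent utility functions; admissible choices
# are the Pareto-optimal options rather than the argmax of a single utility.
from typing import Callable, List

subagent_utilities: List[Callable[[str], float]] = [
    lambda o: {"rest": 3, "work": 1, "walk": 2, "doomscroll": 1}[o],  # "comfort" subagent
    lambda o: {"rest": 1, "work": 3, "walk": 2, "doomscroll": 0}[o],  # "ambition" subagent
]

def pareto_optimal(options: List[str], utilities: List[Callable[[str], float]]) -> List[str]:
    def dominates(a: str, b: str) -> bool:
        ua = [u(a) for u in utilities]
        ub = [u(b) for u in utilities]
        return all(x >= y for x, y in zip(ua, ub)) and any(x > y for x, y in zip(ua, ub))
    return [o for o in options if not any(dominates(other, o) for other in options)]

print(pareto_optimal(["rest", "work", "walk", "doomscroll"], subagent_utilities))
# -> ['rest', 'work', 'walk']: the dominated option is dropped, but several
# admissible choices remain, unlike maximisation of one fixed utility function.
```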
I think subagents comes very close to being an ideal model of agency and could probably be adapted to be a complete model.
I don't want to include subagents in my critique at this point.
> If you define your utility function in a sufficiently convoluted manner, then everything is a utility maximiser.
> Less contrived, I was thinking of stuff like Wentworth's subagents that identifies decision making with pareto optimality over a set of utility functions.
> I think subagents comes very close to being an ideal model of agency and could probably be adapted to be a complete model.
> I don't want to include subagents in my critique at this point.
I think what you want might be "a single fixed utility function over states" or something similar. That captures that you're excluding from critique:
Related:
Background and Core Concepts
I operationalised "strong coherence" as:

> Informally: a system has (immutable) terminal goals. Semiformally: a system's decision making is well described as (an approximation) of argmax over actions (or higher level mappings thereof) to maximise (the expected value of) a simple unitary utility function.
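Spelled out, the semiformal clause corresponds roughly to the standard expected-utility decision rule (my notation, not part of the original operationalisation):

```latex
% Illustrative gloss: decision making as argmax over actions of the expected
% value of a single, fixed (immutable) utility function U.
a^*(s) \;=\; \arg\max_{a \in \mathcal{A}} \; \mathbb{E}\big[\, U(s') \mid s, a \,\big],
\qquad U \text{ fixed across all contexts.}
```

Systems with malleable values, by contrast, would have no single U that stays fixed across contexts.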
And contended that humans, animals (and learning based agents more generally?) seem to instead have values ("contextual influences on decision making").
The shard theory account of value formation in learning based agents is something like:
And I think this hypothesis of how values form in intelligent systems could be generalised out of an RL context to arbitrary constructive optimisation processes[1]. The generalisation may be something like:
This seems to be an importantly different type of decision making from expected utility maximisation[3]. For succinctness, I'd refer to systems of the above type as "systems with malleable values".
The Argument
In my earlier post I speculated that "strong coherence is anti-natural". To operationalise that speculation:
E.g:
* Stochastic gradient descent
* Natural selection/other evolutionary processes
Intelligent systems are adaptation executors not objective function maximisers
Of a single fixed utility function over states.
E.g. I'm under the impression that humans can't explicitly design an algorithm to achieve AlexNet accuracy on the ImageNet dataset.
I think the self-supervised learning that underlies neocortical cognition is a much harder learning task.
I believe that learning is the only way there is to create capable intelligent systems that operate in the real world given our laws of physics.