Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

This was an appendix of Inner and outer alignment decompose one hard problem into two extremely hard problems. However, I think the material is self-contained and worth sharing separately, especially since AGI Ruin: A List of Lethalities has become so influential. (I agree with most of the points made in AGI Ruin, but I'm going to focus on disagreements in this essay.)


Here are some quotes with which I disagree, in light of points I made in Inner and outer alignment decompose one hard problem into two extremely hard problems (consult its TL;DR and detailed summary for a refresher, if need be).

List of Lethalities 

“Humans don't explicitly pursue inclusive genetic fitness; outer optimization even on a very exact, very simple loss function doesn't produce inner optimization in that direction.  This happens in practice in real life, it is what happened in the only case we know about, and it seems to me that there are deep theoretical reasons to expect it to happen again”

(Evolution) → (human values) is not the only case of inner alignment failure which we know about. I have argued that human values themselves are inner alignment failures on the human reward system. This has happened billions of times in slightly different learning setups. 


Strictly separately, it seems to me that people draw rather strong inferences from a rather loose analogy with evolution. I think that (evolution) → (human values) is far less informative for alignment than (human reward circuitry) → (human values). I don’t agree with a strong focus on the former, given that the latter is available as a source of information.

We want to draw inferences about the mapping from (AI reward circuitry) → (AI values), which is an iterative training process using reinforcement learning and self-supervised learning. Therefore, we should consider existing evidence about the (human reward circuitry) → (human values) setup, which (AFAICT) also takes place using an iterative, local update process using reinforcement learning and self-supervised learning. 

Brain architecture and training is not AI architecture and training, so the evidence is going to be weakened. But for nearly every way in which (human reward circuitry) → (human values) is disanalogous to (AI reward circuitry) → (AI values), (evolution) → (human values) is even more disanalogous! For more on this, see Quintin's post.

Lethality #18: “When you show an agent an environmental reward signal, you are not showing it something that is a reliable ground truth about whether the system did the thing you wanted it to do; even if it ends up perfectly inner-aligned on that reward signal, or learning some concept that exactly corresponds to 'wanting states of the environment which result in a high reward signal being sent', an AGI strongly optimizing on that signal will kill you, because the sensory reward signal was not a ground truth about alignment (as seen by the operators).”

My summary: Sensory reward signals are not ground truth on the agent’s alignment to our goals. Even if you solve inner alignment, you’re still dead.

My response: We don’t want to end up with an AI which primarily values its own reward, because then it wouldn’t value humans. Beyond that, this item is not a “central” lethality (and a bunch of these central-to-EY lethalities are in fact about outer/inner). We don’t need a function of sensory input which is safe to maximize; that’s not the function of the reward signal. Reward chisels cognition. Reward is not necessarily—nor do we want it to be—a ground-truth signal about alignment.

Lethality #19: “Insofar as the current paradigm works at all, the on-paper design properties say that it only works for aligning on known direct functions of sense data and reward functions.  All of these kill you if optimized-over by a sufficiently powerful intelligence, because they imply strategies like 'kill everyone in the world using nanotech to strike before they know they're in a battle, and have control of your reward button forever after'.  It just isn't true that we know a function on webcam input such that every world with that webcam showing the right things is safe for us creatures outside the webcam.  This general problem is a fact about the territory, not the map; it's a fact about the actual environment, not the particular optimizer, that lethal-to-us possibilities exist in some possible environments underlying every given sense input.” 

My summary: The theory in the current paradigm only tells you how to, at best, align an agent to direct functions of sensory observables. Even if we achieve this kind of alignment, we die. It’s just a fact that sensory observables can’t discriminate between good and bad latent world-trajectories. 

My response: I understand “the on-paper design properties” and “insofar as the current paradigm works at all” to represent Eliezer’s understanding of the properties and the paradigm (he did describe these points as “central difficulties of outer and inner alignment”[1]). But on my view, this lethality does not seem very relevant or central to alignment. Use reward to supply good cognitive updates to the agent. I don't find myself thinking about reward as that which gets maximized, or which should get maximized.

Also, if you ignore the oft-repeated wrong/under-hedged claim that “RL agents maximize reward” or whatever, the on-paper design properties suggest that reward aligns agents to objectives in reality according to the computations which reward reinforces. I think that machine learning does not, in general, align agents to sense data and reward functions. I think that focusing on the sensory-alignment question can be misleading as to the nature of the reward-chiseling challenge which we confront.

It's true that we don't know that we know how to reliably make superintelligent agents learn human-compatible values. However, by the same token (e.g. by the arguments in reward is not the optimization target), I can just as easily ask "how do I get agents to care about sensory observables and reward data?". It's not like we know how to ensure deep learning-trained agents care about their sensory observables and reward data. 

Lethality #21: “[...] hominids, once they finally started to generalize, generalized their capabilities to Moon landings, but their inner optimization no longer adhered very well to the outer-optimization goal of 'relative inclusive reproductive fitness' - even though they were in their ancestral environment optimized very strictly around this one thing and nothing else.  This abstract dynamic is something you'd expect to be true about outer optimization loops on the order of both 'natural selection' and 'gradient descent'.  The central result:  Capabilities generalize further than alignment once capabilities start to generalize far.”

My summary: Perceived alignment on the training distribution is all we know how to run gradients over, but historically, alignment on training does not generalize to alignment on deployment. Furthermore, when the agent becomes highly capable, it will gain a flood of abilities and opportunities to competently optimize whatever vaguely good-seeming internal proxy objectives we entrained into its cognition. When this happens, the AI's capabilities will keep growing, but its alignment will not. 

My response: This perceived disagreement might be important, or maybe I just use words differently than Eliezer.

When I’m not thinking in terms of inner/outer, but “what cognition got chiseled into the AI?”, there isn’t any separate “tendency to fail to generalize alignment” in a deceptive misalignment scenario. The AI just didn’t have the cognition you thought or wanted. 

For simplicity, suppose you want the future to contain lots of bananas. Suppose you think your AI cares about bananas but actually it primarily cares about fruit in general and only pretended to primarily care about bananas, for instrumental reasons. Then it kills everyone and makes a ton of fruit (only some of which are bananas). In that scenario, we should have chiseled different cognition into the AI so that it would have valued bananas more strongly. (Similarly for "the AI cared about granite spheres and paperclips and...")

While this scenario involves misgeneralization, there’s no separate tendency of “alignment shalt not generalize.” 

But suppose you do get the AI to primarily care about bananas early in training, and it retains that banana value shard/decision-influencing-factor into mid-training. At this point, I think the banana-shard will convergently be motivated to steer the AI’s future training so that the AI keeps making bananas. So, if you get some of the early-/mid-training values to care about making bananas, then those early-/mid-values will, by instrumental convergence, reliably steer training to keep generalizing appropriately. If they did not, that would lead to fewer bananas, and the banana-shard would bid for a different path of capability gain! 

(This is not an airtight safety argument, but I think it's a reasonably strong a priori case.)

The main difficulty here still seems to be my already-central expected difficulty of “loss signals might chisel undesired values into the AI.”

“Okay, but as we all know, modern machine learning is like a genie where you just give it a wish, right?  Expressed as some mysterious thing called a 'loss function', but which is basically just equivalent to an English wish phrasing, right?” 

Interlude in AGI Ruin: A List of Lethalities

Eliezer is mockingly imitating a naive AI alignment researcher. My current read, however, is that the bolded part represents his real view. Given that: A loss function is not a “wish” or an expression of your desires. A loss function is a source of gradient updates, a loss function is a chisel with which to shape the agent’s cognition.

Alignment doesn't hit back, the loss function hits back[15] and the loss function doesn't capture what you really want (eg because killing the humans and taking control of a reward button will max reward, deceiving human raters will increase ratings, etc).  If what we wanted was exactly captured in a loss function, alignment would be easier.  Not easy because outer optimization doesn't create good inner alignment, but easier than the present case.

To me, this statement seems weird and sideways of central alignment problems. I perceive Eliezer to be arguing "If only the loss function represented what we wanted, that'd be better." If he meant to connote "loss functions simply won't represent what you want, get over it, that's not how alignment works", we're more likely on the same page. 

My response:

  1. Type error in forcing conversion from "goals" to "gradient-providing function."
  2. The empirical contingency of the wisdom of the frame where the loss function "represents" the goal.

First, I want to say: type error: loss function not of type goal.[2] I imagine Eliezer understands this, at least on the more obvious level of the statement. But I'm going to explain my worldview here so as to better triangulate my meaning. 

I think there's potential for deep confusion here. Loss functions provide gradients to the way the AI thinks (i.e. computes forward passes). Trying to cast human values[3] into a loss function is a highly unnatural type conversion to attempt. Attempting to force the conversion anyways may well damage your view of the alignment problem. 

From Four usages of "loss" in AI:

3: Loss functions "representing" goals

I want a loss function which is aligned with the goal of "write good novels." 

This is an aspirational statement about achieving some kind of correspondence between the loss function and the goal of writing good novels. But what does this statement mean?

Suppose you tell me "I have written down a loss function which is perfectly aligned with the goal of 'write good novels'." What experiences should this claim lead me to anticipate? 

  1. That an agent can only achieve low physical-loss if it has, in physical fact, written a good novel?
  2. That in some mathematical idealization of the learning problem, loss-minimization only occurs when the agent outputs text which would be found in what we rightly consider to be "good novels"? (But in which mathematical idealization?)
  3. That, as a matter of physical fact, if you train an agent on that loss function using a given learning setup, then you produce a language model which can be easily prompted to output high-quality novels?

Much imprecision comes from loss functions not directly encoding goals. Loss signals are physically implemented parts of the AI's training process which (physically) update the AI's cognition in certain ways...
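As a minimal illustration of this mechanistic framing (a toy example of my own, not from the quoted post): the loss function's role is to emit gradients that update parameters. The update rule consumes the loss; the trained artifact never "sees" it.

```python
# Toy sketch: the loss function's mechanistic role is to supply gradients
# that update parameters. The loss enters only through the update rule.
# (Illustrative example; all names here are mine.)

def grad_w(w, x, y):
    # Analytic gradient of the squared-error loss (w*x - y)**2 w.r.t. w.
    return 2 * (w * x - y) * x

w = 0.0
for _ in range(100):
    w -= 0.1 * grad_w(w, x=1.0, y=2.0)  # SGD step: loss appears only here

print(round(w, 3))  # the chiseled parameter settles near 2.0
```

Nothing in the final artifact (the number `w`) "wants" low loss; the loss function was a tool that shaped it.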

I think that talking about loss functions being "aligned" encourages bad habits of thought at best, and is nonsensical at worst. I think it makes way more sense to say how you want the agent to think and then act (e.g. "write good novels"—the training goal, in Evan Hubinger's training stories framework) and why you think you can use a given loss function to produce that cognition in the agent (the training rationale).

Second, we want to train a network which ends up doing what we want. There are several strategies to achieve this. 

It might shake out that, as an empirical fact, the best way to spend an additional increment of alignment research is to make the loss function "represent what you want" in some way. For example, you might more accurately spot flaws in AI-generated alignment proposals, and train the AI on that more accurate signal.

But "make the objective better 'represent' our goals" would be an empirical contingency, not pinned down by the mechanistic function of a loss function. This contingency may be sensitive to the means by which feedback translates into gradient updates. For example, changing the loss function will probably differently affect the gradients provided by:

  1. Advantage actor-critic,
  2. REINFORCE,
  3. Self-supervised learning with teacher forcing, and 
  4. Reward prediction errors. 
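To make this contingency concrete, here is a toy sketch (my own construction, not from any of the cited posts) showing that the very same reward event produces different—here, opposite-direction—parameter updates under REINFORCE versus an advantage baseline:

```python
import math

# Toy 2-action softmax policy, illustrating that the same reward signal
# chisels different updates under different gradient estimators.
# All names are illustrative, not from any particular library.

def softmax(logits):
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    z = sum(exps)
    return [e / z for e in exps]

def grad_log_pi(logits, action):
    # d/d(logit_i) of log pi(action) = 1[i == action] - pi(i)
    pi = softmax(logits)
    return [(1.0 if i == action else 0.0) - p for i, p in enumerate(pi)]

def reinforce_update(logits, action, reward, lr=0.1):
    # REINFORCE: scale the score function by the raw return.
    g = grad_log_pi(logits, action)
    return [l + lr * reward * gi for l, gi in zip(logits, g)]

def advantage_update(logits, action, reward, baseline, lr=0.1):
    # Advantage actor-critic style: scale by (reward - baseline) instead.
    g = grad_log_pi(logits, action)
    return [l + lr * (reward - baseline) * gi for l, gi in zip(logits, g)]

logits = [0.0, 0.0]
after_reinforce = reinforce_update(logits, action=0, reward=1.0)
after_advantage = advantage_update(logits, action=0, reward=1.0, baseline=2.0)

# Same reward event; once the baseline exceeds the reward, the updates
# point in opposite directions.
print(after_reinforce[0] > 0.0, after_advantage[0] < 0.0)  # prints: True True
```

So "improve the reward function" cashes out differently depending on which of these update rules translates the reward into gradients.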

Because loss is not the optimization target, there's some level of "goal representation" where I should stop thinking about how "good" the loss function is, and start thinking about e.g. the abstractions learned by self-supervised pre-training. E.g., if I populate the corpus with more instances of people helping each other, that might change the inductive biases on SGD dynamics to increase the probability of helping-concepts getting hooked into value shard formation. 

I think it's possible that after more deliberation, I'll conclude "we should just consider some intuitive notion of 'goal representation fidelity' when reasoning about P(alignment | loss function)." I just don't know where or whether this deliberation is supposed to have occurred. So we probably need more of it.  

Because loss functions don't natively represent goals, and because of these empirical contingencies, I'm weirded out by statements like "the loss function doesn't capture what you really want."[4]

Other disagreements with alignment thinkers

Evan Hubinger

Terms like base objective or inner/outer alignment are still great terms for talking about training stories that are trying to train a model to optimize for some specified objective.

Sometimes, inner/outer alignment ideas can be appropriate (e.g. chess). For aligning real-world agents in partially observable environments, I think it’s not that appropriate. (See here for a more detailed discussion of what I eventually realized Evan means here, though.)

Paul Christiano

There is probably no physically-implemented reward function, of the kind that could be optimized with SGD, that we’d be happy for an arbitrarily smart AI to optimize as hard as possible. (I’m most optimistic about approaches where RL is only performed on a reward function that gets smarter in parallel with the agent being trained.)

I read this and think “this all feels like a red herring.” I think such a reward function is unnecessary, because robust grading is not necessary for alignment. However, because reward provides cognitive updates, it’s important to think carefully about what cognitive updates will be provided by the reward given when e.g. a large language model submits an alignment proposal. Those reward events will shape the network’s decision-making and generalization properties, which is what we’re really interested in.

In many problems, “almost all” possible actions are equally terrible. For example, if I want my agent to write an email, almost all possible strings are just going to be nonsense.

One approach to this problem is to adjust the reward function to make it easier to satisfy — to provide a “trail of breadcrumbs” leading to high reward behaviors. I think this basic idea is important, but that changing the reward function isn’t the right way to implement it (at least conceptually).

Instead we could treat the problem statement as given, but view auxiliary reward functions as a kind of “hint” that we might provide to help the algorithm figure out what to do. Early in the optimization we might mostly optimize this hint, but as optimization proceeds we should anneal towards the actual reward function.

Typical examples of proxy reward functions include “partial credit” for behaviors that look promising; artificially high discount rates and careful reward shaping; and adjusting rewards so that small victories have an effect on learning even though they don’t actually matter. All of these play a central role in practical RL.

A proxy reward function is just one of many possible hints. Providing demonstrations of successful behavior is another important kind of hint. Again, I don’t think that this should be taken as a change to the reward function, but rather as side information to help achieve high reward. In the long run, we will hopefully design learning algorithms that automatically learn how to use general auxiliary information.

Thoughts on reward engineering

Why do we need new learning algorithms? The point of reward, on a mechanistic basis, is to update the agent’s cognition. Shaping reward seems fine to me, and I am uncomfortable with this apparent-to-me emphasis on reward “embodying” the agent’s goals.
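The annealing idea in the quoted passage is easy to state mechanistically as a schedule over the reward actually used for updates—a sketch under my own naming conventions, not Paul's code:

```python
def annealed_reward(hint_r, true_r, step, total_steps):
    # Early in training, the shaping "hint" dominates the reward used for
    # updates; the schedule then anneals toward the actual reward, as in
    # the quoted proposal. The linear schedule is an arbitrary choice
    # for illustration.
    w = min(step / total_steps, 1.0)
    return (1.0 - w) * hint_r + w * true_r

print(annealed_reward(hint_r=1.0, true_r=0.0, step=0, total_steps=100))    # 1.0: pure hint
print(annealed_reward(hint_r=1.0, true_r=0.0, step=100, total_steps=100))  # 0.0: pure true reward
```

On the reward-as-chisel view, nothing here requires the final reward function to "embody" the goal; both the hint and the schedule are just levers over which cognitive updates get supplied when.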

Nick Bostrom

Summary of value-loading techniques…

Reinforcement learning: A range of different methods can be used to solve “reinforcement-learning problems,” but they typically involve creating a system that seeks to maximize a reward signal. This has an inherent tendency to produce the wireheading failure mode when the system becomes more intelligent. Reinforcement learning therefore looks unpromising. 

Superintelligence, p.253

Historical reasoning about RL seems quite bad. This is a prime example. In one fell swoop, in several pages of mistaken exposition, Superintelligence rules out the single known method for producing human-compatible values. We should forewarn new alignment researchers of these deep confusions before recommending this book.


Thanks to Drake Thomas, ChatGPT, Ulisse Mini, and Peli Grietzer for feedback on this post.

  1. ^

    List of Lethalities was, AFAICT, intended to convey the most important dangers, in the right language. Rob Bensinger (who works at MIRI but was expressing his own views) also commented:

    My shoulder Eliezer (who I agree with on alignment, and who speaks more bluntly and with less hedging than I normally would) says:

    1. The list is true, to the best of my knowledge, and the details actually matter.

      Many civilizations try to make a canonical list like this in 1980 and end up dying where they would have lived just because they left off one item, or under-weighted the importance of the last three sentences of another item, or included ten distracting less-important items.

    So if Eliezer's talking about "how do we get agents to care about non-sensory observables", this indicates to me that I disagree with him about what the central subproblems of alignment are. 

  2. ^

    From Inner and outer alignment decompose one hard problem into two extremely hard problems:

    Outer/inner unnecessarily assumes that the loss function/outer objective should “embody” the goals which we want the agent to pursue.

    For example, shaping is empirically useful in both AI and animals. When a trainer is teaching a dog to stand on its hind legs, they might first give the dog a treat when it lifts its front paws off the ground. This treat translates into an internal reward event for the dog, which (roughly) reinforces the dog to be more likely to lift its paws next time. The point isn’t that we terminally value dogs lifting their paws off the ground. We do this because it reliably shapes target cognition (e.g. stand on hind legs on command) into the dog. If you think about reward as exclusively “encoding” what you want, you lose track of important learning dynamics and seriously constrain your alignment strategies.
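The successive-approximation logic of shaping can be sketched as a criterion that ratchets up after each success (a toy illustration with invented names, not a claim about how animal training is formalized):

```python
def shaping_schedule(paw_heights, thresholds):
    # Reward behavior that meets the current criterion, then raise the
    # criterion after each success. Returns one reward per observed attempt.
    rewards = []
    t_idx = 0
    for h in paw_heights:
        if t_idx < len(thresholds) and h >= thresholds[t_idx]:
            rewards.append(1)
            t_idx += 1  # raise the bar after each success
        else:
            rewards.append(0)
    return rewards

# Early, a tiny paw lift earns reward; later, only full standing does.
print(shaping_schedule([0.1, 0.3, 0.2, 0.6, 1.0], [0.1, 0.5, 1.0]))
```

The early rewards are not for behavior we terminally value; they exist to chisel the target behavior into the learner.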

  3. ^

    I think this holds for basically any values in a rich, partially observable domain, including paperclip optimization or picking three flowers.

  4. ^

    "The loss function is used to train the AI, and the loss function represents human values" is akin to saying "a hammer is used to build a house, and the hammer represents the architect's design." Just as a hammer is a tool to facilitate the building process, a loss function is a tool to facilitate the learning process. The hammer doesn't represent the design of the house, it is simply a means to an end. Similarly, the loss function doesn't represent human values, it is simply a means to an end of training the AI to perform a task.

    ChatGPT wrote this hammer analogy, given the prompt of a post draft (but the draft didn't include any of my reward-as-chisel analogies).

Comments

I broadly agree with a lot of shard theory claims. However, the important thing to realise is that 'human values' do not really come from inner misalignment wrt our innate reward circuitry, but rather are the result of a very long process of social construction, influenced both by our innate drives and by the game-theoretic social considerations needed to create and maintain large social groups. These value constructs have been distilled into webs of linguistic associations learnt through unsupervised text-prediction-like objectives, which is how we practically interact with our values. Most human value learning occurs through this linguistic learning, grounded by our innate drives but extended to much higher abstractions by language. I.e., for humans, we learn our values as some combination of bottom-up (how well our internal reward evaluators in the basal ganglia/hypothalamus accord with the top-down socially constructed values) and top-down (association of abstract value concepts with other more grounded linguistic concepts).

With AGI, the key will be to work primarily top-down, since our linguistic constructs of values tend to reflect our ideal values much better than our actually realised behaviours. We can use the AGI's 'linguistic cortex', which already has encoded verbal knowledge about human morality and values, to evaluate potential courses of action and serve as a reward signal, which can then get crystallised into learnt policies. The key difficulty is understanding how, in humans, the base reward functions interact with behaviour to make us 'truly want' specific outcomes (if humans even do) as opposed to reward or their correlated social assessments. It is possible, even likely, that this is just the default outcome of model-free RL experienced from the inside, and in this case our AGIs would look highly anthropomorphic.

Also in general I disagree about aligning agents to evaluations of plans being unnecessary. What you are describing here is just direct optimization. But direct optimization -- i.e. effectively planning over a world model -- is necessary in situations where a.) you can't behaviourally clone existing behaviour and b.) you can't self-play too much with model-free RL algorithms and so must rely on the world-model. In such a scenario you do not have ground-truth reward signals, and the only way to make progress is to optimise against some implicit learnt reward function. 

I also am not sure that an agent that explicitly optimises this is hard to align, and the major threat is Goodharting. We can perfectly align Go-playing AIs with this scheme because we have a ground-truth exact reward function. Goodharting is essentially isomorphic to a case of overfitting and can in theory be solved with various kinds of regularisation, especially if the AI maintains a well-calibrated sense of reward-function uncertainty; then in theory we can derive quantification bounds on its divergence from the true reward function. 

Also in general I disagree about aligning agents to evaluations of plans being unnecessary. What you are describing here is just direct optimization. But direct optimization  -- i.e .effectively planning over a world model

FWIW I don't consider myself to be arguing against planning over a world model

the important thing to realise is that 'human values' do not really come from inner misalignment wrt our innate reward circuitry but rather are the result of a very long process of social construction influenced both by our innate drives but also by the game-theoretic social considerations needed to create and maintain large social groups, and that these value constructs have been distilled into webs of linguistic associations learnt through unsupervised text-prediction-like objectives which is how we practically interact with our values. 

Most human value learning occurs through this linguistic learning grounded by our innate drives but extended to much higher abstractions by language.i.e. for humans we learn our values as some combination of bottom-up (how well do our internal reward evaluators in basal ganglia/hypothalamus) accord with the top-down socially constructed values) as well as top-down association of abstract value concepts with other more grounded linguistic concepts.

Can you give me some examples here? I don't know that I follow what you're pointing at. 

(Evolution) → (human values) is not the only case of inner alignment failure which we know about. I have argued that human values themselves are inner alignment failures on the human reward system. This has happened billions of times in slightly different learning setups. 


I expect that it has also happened to an extent with animals. I wonder if anyone has ever looked into this.

Yes, they have. There's quite a large literature on animal emotion and cognition, and my general synthesis is that animals (at least mammals) have at least the same basic emotions as humans, and often quite subtle ones such as empathy and a sense of fairness. It seems pretty likely to me that whatever the set of base reward functions encoded in the mammalian basal ganglia and hypothalamus is, it can quite robustly generate expressed behavioural 'values' that fall within some broadly humanly recognisable set.

We don’t need a function of sensory input which is safe to maximize, that’s not the function of the reward signal. Reward chisels cognition. Reward is not necessarily—nor do we want it to be—a ground-truth signal about alignment.

I'm confused about this statement. How can reward be unnecessary as a ground-truth signal about alignment? Especially if "reward chisels cognition"?

How can reward be unnecessary as a ground-truth signal about alignment? Especially if "reward chisels cognition"?

Reward's purpose isn't to demarcate "this was good by my values." That's one use, and it often works, but it isn't intrinsic to reward's mechanistic function. Reward develops certain kinds of cognition / policy network circuits. For example, reward shaping a dog to stand on its hind legs. I don't reward the dog because I intrinsically value its front paws being slightly off the ground for a moment. I reward the dog at that moment because that helps develop the stand-up cognition in the dog's mind.
