[DISC] Are Values Robust?

Dec 21, 2022

I think there might be a broad set of values that emerge around group survival, essentially game-theoretic or evolutionary pressures that lead to cooperation. But I think the details beyond that are likely to incredibly specific. I'd point to the "preference construction" literature as a more realistic account of how humans make choices, without assuming an underlying consistent preference structure.

Charlie Steiner

Dec 21, 2022

My best guess is that if we pretend we knew how to define a space where AIs that are similar under self-modification are close together, there would indeed be basins of attraction around most good points (AIs that do good things with the galaxy). However, I see no particular reason why there should only be one such basin of attraction, at least not without defining your space in an unnatural way. And of course there are going to be plenty of other basins of attraction, you don't ever get alignment by default by just throwing a dart into AI-space.

[-]DragonGod3y10

A load bearing claim of the robust values hypothesis for "alignment by default" is $# 2$ :

Said subset is a "naturalish" abstraction
1. The more natural the abstraction, the more robust values are
2. Example operationalisations of "naturalish abstraction"
  1. The subset is highly privileged by the inductive biases of most learning algorithms that can efficiently learn our universe
    - More privileged $\to$ more natural
  2. Most efficient representations of our universe contain a simple embedding of the subset
    - Simpler embeddings $\to$ more natural

The safety comes... (read more)

2Charlie Steiner3y

Sure. Though see Take 4.

1DragonGod3y

Claim #1 (about a "privileged subset") is a claim that there aren't multiple such natural abstractions (e.g. any other subset of human values that satisfies #3 would be a superset of the privileged subset, or a subset of the basin of attraction around the privileged subset.) [But I haven't yet fully read that post or your other linked posts.]

Mark Neyer

Dec 22, 2022

-1-2

Hi! I've been an outsider in this community for a while effectively for arguing exactly this: yes, values are robust. Before I set off all the 'quack' filters, I did manage to persuade Richard Ngo that an AGI wouldn't want to kill humans right away.

I think that for embodied agents, convergent instrumental subgoals very well likely lead to alignment.

I think this is definitely not true if we imagine an agent living outside of a universe it can wholly observe and reliably manipulate, but the story changes dramatically when we make the agent an embodied agent in our own universe.

Our universe is so chaotic and unpredictable that actions increasing the likelihood of direct progress towards a goal will become increasingly difficult to compute beyond some time horizon, and the threat of death is going to be present for any agent of any size. If you can't reliably predict something like, 'the position of the moon 3,000 years from tomorrow' due to the numerical error getting worse over time, i don't see how it's possible to compute far more complicated queries about possible futures involving billions of agents.

Hence I suspect that the best way to maximize long term progress towards any goal is to increase the number and diversity of agents that have an interest in keeping you alive. The easiest, simplest way to do this is with a strategy of identifying agents whose goals are roughly compatible with yours, identifying the convergent instrumental subgoals of those agents, and helping those agents on their path. This is effectively a description of being loving: figuring out how you can help those around you grow and develop.

There is also a longer argument which says, 'instrumental rationality, once you expand the scope, turns into something like religion'

[-]quetzal_rainbow3y32

If your future doesn't have billions of agents, you don't need to predict them.

1Mark Neyer3y

Fine, replace the agents with rocks. The problem still holds. There's no closed form solution for the 3-body problem; you can only numerically approximate the future, with decreasing accuracy as time goes on. There are far more than 3 bodies in the universe relevant to the long term survival of an AGI that could die in any number of ways because it's made of many complex pieces that can all break or fail.

[-]Seth Herd2y20

The reason we're so concerned with instrumental convergence is that we're usually thinking of an AGI that can recursively self-improve until it can outmaneuver all of humanity and do whatever it wants. If it's a lot smarter than us, any benefit we could give it is small compared to the risk that we'll try to kill it or create more AGIs that will.

The future is hard to predict, that's why it's safest to eliminate any hard to predict parts that might actively try to kill you. If you can. If an AGI isn't that capable, we're not that concerned. But AGI will hav... (read more)

^{^}

Using the shard theory conception of "value" as "contextual influence on decision making".

^{^}

To be clear, "example operationalisation" in this document does not refer to any kind of canonical formalisations. The example operationalisations aren't even necessarily correct/accurate/sensible. They are meant to simply gesture in the right direction for what those terms might actually cash out to.

^{^}

"Benevolent": roughly the subset of human values that we are happy for arbitrarily capable systems to optimise for.

"Universal": roughly the subset of human values that we are happy for other humans to optimise for.

^{^}

Including "astronomical waste" as an existential catastrophe.

^{^}

The other approach being to safeguard systems that may not necessarily be optimising for values that we'd be "happy" for them to pursue, were we fully informed.

Examples of safeguarding approaches: corrigibility, impact regularisation, myopia, non-agentic system design, quantilisation, etc.

LESSWRONG
LW

LESSWRONG
LW

12

[ Question ]

[DISC] Are Values Robust?

12

12

3 Answers sorted by
top scoring

Dec 21, 2022

Dec 21, 2022

Dec 22, 2022

Epistemic Status

Related Posts

Robust Values Hypothesis

Why Does it Matter?

Questions

12

[ Question ]

[DISC] Are Values Robust?

12

12

3 Answers sorted by top scoring

Dec 21, 2022

Dec 21, 2022

Dec 22, 2022

Epistemic Status

Related Posts

Robust Values Hypothesis

Why Does it Matter?

Questions

3 Answers sorted by
top scoring