You may be interested in this article:
Orseau and Ring, as well as Dewey, have recently described problems, including self-delusion, with the behavior of agents using various definitions of utility functions. An agent's utility function is defined in terms of the agent's history of interactions with its environment. This paper argues, via two examples, that the behavior problems can be avoided by formulating the utility function in two steps: 1) inferring a model of the environment from interactions, and 2) computing utility as a function of the environment model. Basing a utility function on a model that the agent must learn implies that the utility function must initially be expressed in terms of specifications to be matched to structures in the learned model. These specifications constitute prior assumptions about the environment, so this approach will not work with arbitrary environments. But the approach should work for agents designed by humans to act in the physical world. The paper also addresses the issue of self-modifying agents and shows that, if provided with the possibility to modify their utility functions, agents will not choose to do so, under some usual assumptions.
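In case it helps, here's how I read the two-step formulation, as a rough sketch (all the names and types below are mine, not the paper's; the paper itself is more general and formal):

```python
# Hypothetical sketch of the abstract's two-step utility function:
# 1) infer an environment model from the interaction history,
# 2) compute utility from that model (not from raw observations),
# with utility expressed as specifications matched against model structures.

from typing import Any, Dict, List, Tuple

Action = int
Observation = Tuple[int, ...]                 # whatever the agent senses at one step
History = List[Tuple[Action, Observation]]
EnvironmentModel = Dict[str, Any]             # e.g. inferred objects and their states

def infer_model(history: History) -> EnvironmentModel:
    """Step 1: learn a model of the environment from interactions.
    (Stub -- in practice this is the hard learning problem.)"""
    return {"objects": {}}

def matches_specification(obj: Dict[str, Any], spec: Dict[str, Any]) -> bool:
    """The abstract's 'specifications to be matched to structures in the
    learned model' -- here reduced to a trivial attribute check."""
    return all(obj.get(key) == value for key, value in spec.items())

def utility_from_model(model: EnvironmentModel, spec: Dict[str, Any]) -> float:
    """Step 2: utility is a function of the environment model, e.g. how many
    model structures match the given specification."""
    return float(sum(matches_specification(obj, spec) for obj in model["objects"].values()))

def utility(history: History, spec: Dict[str, Any]) -> float:
    return utility_from_model(infer_model(history), spec)
```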
Also, regarding this part of your post:
For example: moving yourself in space (in a certain speed range)
This range is quite huge. In certain contexts, you'd want to be moving through space at high fractions of the speed of light, rather than walking speed. Same goes for moving other objects through space. Btw, would you count a data packet as an object you move through space?
staying in a single spot (for a certain time range)
Hopefully the AI knows you mean moving in sync with Earth's movement through space.
Thank you for actually engaging with the idea (pointing out problems and whatnot) rather than just suggesting reading material.
Btw, would you count a data packet as an object you move through space?
A couple of points:
This range is quite huge. In certain contexts, you'd want to be moving through space at high fractions of the speed of light, rather than walking speed. Same goes for moving other objects through space.
What contexts do you mean? Maybe my point about "moral speed limits" addresses this.
Hopefully the AI knows you mean moving in sync with Earth's movement through space.
Yes, relativity of motion is a problem which needs to be analyzed. Fundamental variables should refer to relative speeds/displacements or something.
The paper is surely at least partially relevant, but what's your own opinion on it? I'm confused about this part (Section 4.2, Defining Utility Functions in Terms of Learned Models):
For example a person may be specified by textual name and address, by textual physical description, and by images and other recordings. There is very active research on recognizing people and objects by such specifications (Bishop, 2006; Koutroumbas and Theodoris, 2008; Russell and Norvig, 2010). This paper will not discuss the details of how specifications can be matched to structures in learned environment models, but assumes that algorithms for doing this are included in the utility function implementation.
Does it just completely ignore the main problem?
I know Abram Demski wrote about Model-based Utility Functions, but I couldn't fully understand his post either.
(Disclaimer: I'm almost mathematically illiterate, except for knowing a lot of mathematical concepts from popular materials: the halting problem, Gödel, uncountability, ordinals vs. cardinals, etc.)
Also note that fundamental variables are not meant to be some kind of “moral speed limits”, prohibiting humans or AIs from acting at certain speeds. Fundamental variables are only needed to figure out what physical things humans can most easily interact with (because those are the objects humans are most likely to care about).
Ok, that clears things up a lot. However, I still worry that if it's at the AI's discretion when and where to sidestep the fundamental variables, we're back at the regular alignment problem. You have to be reasonably certain what the AI is going to do in extremely out of distribution scenarios.
The subproblem of environmental goals is just to make AI care about natural enough (from the human perspective) "causes" of sensory data, not to align AI to the entirety of human values. Fundamental variables have no (direct) relation to the latter problem.
However, fundamental variables would be helpful for defining impact measures if we had a principled way to differentiate "times when it's OK to sidestep fundamental variables" from "times when it's NOT OK to sidestep fundamental variables". That's where the things you're talking about definitely become a problem. Or maybe I'm confused about your point.
So at first I thought this didn't include a step where the AI learns to care about things - it only learns to model things. But I think actually you're assuming that we can just directly use the model to pick actions that have predicted good outcomes - which are going to be selected as "good" according to the pre-specified P-properties. This is a flaw because it's leaving too much hard work for the specifiers to do - we want the environment to do way more work at selecting what's "good."
The second problem comes in two flavors - object level and meta level. The object level problem is that sometimes your AI will assign your P-properties to atoms and quantum fields ("What they want is to obey the laws of physics. What they believe is their local state."), or your individual cells, etc. The meta level problem is that trying to get the AI to assign properties in a human-approved way is a complicated problem that you can only do so well without communicating with humans. (John Wentworth disagrees more or less, check out things tagged Natural Abstractions for more reading, but also try not to get too confirmation-biased.)
Another potential complication is the difficulty of integrating some features of this picture with modern machine learning. I think it's fine to do research that assumes a POMDP world model or whatever. But demonstrations of alignment theories working in gridworlds have a real hard time moving me, precisely because they often let you cheat (and let you forget that you cheated) on problems one and two.
So at first I thought this didn't include a step where the AI learns to care about things - it only learns to model things. But I think actually you're assuming that we can just directly use the model to pick actions that have predicted good outcomes - which are going to be selected as "good" according to the pre-specified P-properties. This is a flaw because it's leaving too much hard work for the specifiers to do - we want the environment to do way more work at selecting what's "good."
I assume we get an easily interpretable model where the difference between "real strawberries" and "pictures of strawberries" and "things sometimes correlated with strawberries" is easy to define, so we can use the model to directly pick the physical things AI should care about. I'm trying to address the problem of environmental goals, not the problem of teaching AI morals. Or maybe I'm misunderstanding your point?
The object level problem is that sometimes your AI will assign your P-properties to atoms and quantum fields ("What they want is to obey the laws of physics. What they believe is their local state."), or your individual cells, etc.
If you're talking about AI learning morals, my idea is not about that. Not about modeling desires and beliefs.
The meta level problem is that trying to get the AI to assign properties in a human-approved way is a complicated problem that you can only do so well without communicating with humans. (John Wentworth disagrees more or less, check out things tagged Natural Abstractions for more reading, but also try not to get too confirmation-biased.)
I disagree too, but in a slightly different way. IIRC, John says approximately the following:
I think (1) doesn't need to be true. I say:
Epistemic status: I want to propose a method of learning environmental goals (a super big, super important subproblem in Alignment). It's informal, so it has a lot of gaps. I worry I missed something obvious, rendering my argument completely meaningless. I asked the LessWrong feedback team, but they couldn't get someone knowledgeable enough to take a look.
Can you tell me the biggest conceptual problems with my method? Can you tell me whether agent foundations researchers are aware of this method or not?
If you're not familiar with the problem, here's the context: Environmental goals; identifying causal goal concepts from sensory data; ontology identification problem; Pointers Problem; Eliciting Latent Knowledge.
Explanation 1
One naive solution
Imagine we have a room full of animals. An AI sees the room through a camera. How can the AI learn to care about the real animals in the room rather than their images on the camera?
Assumption 1. Let's assume the AI models the world as a bunch of objects interacting in space and time. I don't know how critical or problematic this assumption is.
Idea 1. Animals in the video are objects with certain properties (they move continuously, they move with certain relative speeds, they have certain sizes, etc). Let's make the AI search for the best world-model which contains objects with similar properties (P properties).
Problem 1. Ideally, the AI will find clouds of atoms which move similarly to the animals on the video. However, the AI might just find a world-model (X) which contains the screen of the camera. So it'll end up caring about the "movement" of the pixels on the screen. Fail.
Observation 1. Our world contains many objects with P properties which don't show up on the camera. So, X is not the best world-model containing the biggest number of objects with P properties.
Idea 2. Let's make the AI search for the best world-model containing the biggest number of objects with P properties. (See the toy sketch after this list.)
Question 1. For "Idea 2" to make practical sense, we need to find a smart way to limit the complexity of the models. Otherwise the AI might just make any model contain arbitrary numbers of any objects. Can we find the right complexity restriction?
Question 2. Assume we resolved the previous question positively. What if "Idea 2" still produces an alien ontology humans don't care about? Can it happen?
Question 3. Assume everything works out. How do we know that this is a general method of solving the problem? We have an object in sense data (A) and we care about the physical thing corresponding to it (B): how do we know that B always behaves similarly to A and that there are always more instances of B than of A?
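Here's the toy sketch of Idea 2 together with the complexity restriction from Question 1. Everything in it (the P-property check, the complexity measure, the weight lambda_) is a made-up placeholder; the point is only the shape of the selection rule, not a workable learning algorithm.

```python
def has_p_properties(obj: dict) -> bool:
    # Hypothetical check: the object's size and typical speed fall in the
    # ranges we associate with the animals seen on the camera.
    return 0.05 < obj["size_m"] < 2.0 and obj["typical_speed_mps"] < 15.0

def count_p_objects(model: dict) -> int:
    return sum(has_p_properties(obj) for obj in model["objects"])

def description_length(model: dict) -> float:
    # Stand-in complexity measure; a real one might be the length of the
    # shortest program generating the model (this is what Question 1 asks about).
    return float(len(repr(model)))

def select_world_model(candidate_models: list, lambda_: float = 0.01) -> dict:
    # Idea 2 with a penalty: prefer models containing many P-property objects,
    # but charge for complexity so a model can't win just by positing
    # arbitrarily many extra objects (Question 1's worry).
    return max(candidate_models,
               key=lambda m: count_p_objects(m) - lambda_ * description_length(m))
```

The entire difficulty hides in `description_length` and in where the candidate models come from; the snippet only pins down what "best world-model containing the biggest number of objects with P properties" is supposed to mean.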
One philosophical argument
I think there's a philosophical argument which allows us to resolve Questions 2 & 3 (giving evidence that Question 1 should be resolvable too).
If the argument is true, the pointers problem should be solvable without the Natural Abstraction hypothesis being true.
Anyway, I'll add a toy example which hopefully helps to better understand what this is all about.
One toy example
You're inside a 3D video game. 1st person view. The game contains landscapes and objects, both made of small balls (the size of tennis balls) of different colors. Also a character you control.
The character can push objects. Objects can break into pieces. Physics is Newtonian. Balls are held together by some force. Balls can have dramatically different weights.
Light is modeled by particles. The sun emits particles, and they bounce off surfaces.
The most unusual thing: as you move, your coordinates are fed into a pseudorandom number generator. The numbers from the generator are then used to swap the positions of arbitrary balls.
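Here's a minimal sketch of that swap rule, just to make it concrete (the data layout, a list of ball dicts, is mine):

```python
import random

def swap_step(balls: list, character_xyz: tuple) -> list:
    """balls: list of dicts like {"color": ..., "weight_kg": ..., "pos": (x, y, z)}."""
    rng = random.Random(hash(character_xyz))   # the character's coordinates feed the PRNG
    i, j = rng.sample(range(len(balls)), 2)    # pick two arbitrary balls
    balls[i]["pos"], balls[j]["pos"] = balls[j]["pos"], balls[i]["pos"]
    return balls
```

So the identities of individual balls are unstable under normal play, which matters for which level of description is worth tracking.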
You care about pushing boxes (like everything else, they're made of balls too) into a certain location.
...
So, the reality of the game has roughly 5 levels:
I think the AI should be able to figure out that it needs to care about the 2A level of reality, because ball structures are much simpler to control (by doing normal activities with the game's character) than individual balls, and light particles are harder to interact with than ball structures, due to their speed and nature.
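A very rough rendering of that criterion, with a hypothetical probe function (the agent would estimate these success rates from its own interaction; nothing here is meant to be hand-coded):

```python
def controllability(candidate_kind, try_simple_intervention, n_trials: int = 50) -> float:
    """Fraction of trials in which a simple intended change to an object of this
    kind (nudge it, hold it in place) actually comes about."""
    successes = sum(try_simple_intervention(candidate_kind) for _ in range(n_trials))
    return successes / n_trials

def pick_kind_to_care_about(candidate_kinds, try_simple_intervention):
    # Ball structures should win: the PRNG keeps swapping individual balls,
    # and light particles move too fast to manipulate.
    return max(candidate_kinds,
               key=lambda kind: controllability(kind, try_simple_intervention))
```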
Explanation 2
An alternative explanation of my argument:
Formalization
So, the core of my idea is this:
How far are we from being able to do something like this? Are agent foundations researchers pursuing this or something else?