You may be interested in this article:
Orseau and Ring, as well as Dewey, have recently described problems, including self-delusion, with the behavior of agents using various definitions of utility functions. An agent's utility function is defined in terms of the agent's history of interactions with its environment. This paper argues, via two examples, that the behavior problems can be avoided by formulating the utility function in two steps: 1) inferring a model of the environment from interactions, and 2) computing utility as a function of the environment model. Basing a utility function on a model that the agent must learn implies that the utility function must initially be expressed in terms of specifications to be matched to structures in the learned model. These specifications constitute prior assumptions about the environment, so this approach will not work with arbitrary environments. But the approach should work for agents designed by humans to act in the physical world. The paper also addresses the issue of self-modifying agents and shows that, if provided with the possibility to modify their utility functions, agents will not choose to do so, under some usual assumptions.
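In case it helps, here's how I read the two-step formulation, as a rough sketch (all the names and types below are mine, not the paper's; the paper itself is more general and formal):

```python
# Hypothetical sketch of the abstract's two-step utility function:
# 1) infer an environment model from the interaction history,
# 2) compute utility from that model (not from raw observations),
# with utility expressed as specifications matched against model structures.

from typing import Any, Dict, List, Tuple

Action = int
Observation = Tuple[int, ...]                 # whatever the agent senses at one step
History = List[Tuple[Action, Observation]]
EnvironmentModel = Dict[str, Any]             # e.g. inferred objects and their states

def infer_model(history: History) -> EnvironmentModel:
    """Step 1: learn a model of the environment from interactions.
    (Stub -- in practice this is the hard learning problem.)"""
    return {"objects": {}}

def matches_specification(obj: Dict[str, Any], spec: Dict[str, Any]) -> bool:
    """The abstract's 'specifications to be matched to structures in the
    learned model' -- here reduced to a trivial attribute check."""
    return all(obj.get(key) == value for key, value in spec.items())

def utility_from_model(model: EnvironmentModel, spec: Dict[str, Any]) -> float:
    """Step 2: utility is a function of the environment model, e.g. how many
    model structures match the given specification."""
    return float(sum(matches_specification(obj, spec) for obj in model["objects"].values()))

def utility(history: History, spec: Dict[str, Any]) -> float:
    return utility_from_model(infer_model(history), spec)
```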
Also, regarding this part of your post:
For example: moving yourself in space (in a certain speed range)
This range is quite huge. In certain contexts, you'd want to be moving through space at high fractions of the speed of light, rather than walking speed. Same goes for moving other objects through space. Btw, would you count a data packet as an object you move through space?
staying in a single spot (for a certain time range)
Hopefully the AI knows you mean moving in sync with Earth's movement through space.
Thank you for actually engaging with the idea (pointing out problems and whatnot) rather than just suggesting reading material.
Btw, would you count a data packet as an object you move through space?
A couple of points:
This range is quite huge. In certain contexts, you'd want to be moving through space at high fractions of the speed of light, rather than walking speed. Same goes for moving other objects through space.
What contexts do you mean? Maybe my point about "moral speed limits" addresses this.
Hopefully the AI knows you mean moving in sync with Earth's movement through space.
Yes, relativity of motion is a problem which needs to be analyzed. Fundamental variables should refer to relative speeds/displacements or something.
The paper is surely at least partially relevant, but what's your own opinion on it? I'm confused about this part (Section 4.2, Defining Utility Functions in Terms of Learned Models):
For example a person may be specified by textual name and address, by textual physical description, and by images and other recordings. There is very active research on recognizing people and objects by such specifications (Bishop, 2006; Koutroumbas and Theodoris, 2008; Russell and Norvig, 2010). This paper will not discuss the details of how specifications can be matched to structures in learned environment models, but assumes that algorithms for doing this are included in the utility function implementation.
Does it just completely ignore the main problem?
I know Abram Demski wrote about Model-based Utility Functions, but I couldn't fully understand his post either.
(Disclaimer: I'm almost mathematically illiterate, except for knowing a lot of mathematical concepts from popular materials: the halting problem, Gödel, uncountability, ordinals vs. cardinals, etc.)
Also note that fundamental variables are not meant to be some kind of “moral speed limits”, prohibiting humans or AIs from acting at certain speeds. Fundamental variables are only needed to figure out what physical things humans can most easily interact with (because those are the objects humans are most likely to care about).
Ok, that clears things up a lot. However, I still worry that if it's at the AI's discretion when and where to sidestep the fundamental variables, we're back at the regular alignment problem. You have to be reasonably certain what the AI is going to do in extremely out of distribution scenarios.
The subproblem of environmental goals is just to make AI care about natural enough (from the human perspective) "causes" of sensory data, not to align AI to the entirety of human values. Fundamental variables have no (direct) relation to the latter problem.
However, fundamental variables would be helpful for defining impact measures if we had a principled way to differentiate "times when it's OK to sidestep fundamental variables" from "times when it's NOT OK to sidestep fundamental variables". That's where the things you're talking about definitely become a problem. Or maybe I'm confused about your point.
So at first I thought this didn't include a step where the AI learns to care about things - it only learns to model things. But I think actually you're assuming that we can just directly use the model to pick actions that have predicted good outcomes - which are going to be selected as "good" according to the pre-specified P-properties. This is a flaw because it's leaving too much hard work for the specifiers to do - we want the environment to do way more work at selecting what's "good."
The second problem comes in two flavors - object level and meta level. The object level problem is that sometimes your AI will assign your P-properties to atoms and quantum fields ("What they want is to obey the laws of physics. What they believe is their local state."), or your individual cells, etc. The meta level problem is that trying to get the AI to assign properties in a human-approved way is a complicated problem that you can only do so well without communicating with humans. (John Wentworth disagrees more or less, check out things tagged Natural Abstractions for more reading, but also try not to get too confirmation-biased.)
Another potential complication is the difficulty of integrating some features of this picture with modern machine learning. I think it's fine to do research that assumes a POMDP world model or whatever. But demonstrations of alignment theories working in gridworlds have a real hard time moving me, precisely because they often let you cheat (and let you forget that you cheated) on problems one and two.
So at first I thought this didn't include a step where the AI learns to care about things - it only learns to model things. But I think actually you're assuming that we can just directly use the model to pick actions that have predicted good outcomes - which are going to be selected as "good" according to the pre-specified P-properties. This is a flaw because it's leaving too much hard work for the specifiers to do - we want the environment to do way more work at selecting what's "good."
I assume we get an easily interpretable model where the difference between "real strawberries" and "pictures of strawberries" and "things sometimes correlated with strawberries" is easy to define, so we can use the model to directly pick the physical things AI should care about. I'm trying to address the problem of environmental goals, not the problem of teaching AI morals. Or maybe I'm misunderstanding your point?
The object level problem is that sometimes your AI will assign your P-properties to atoms and quantum fields ("What they want is to obey the laws of physics. What they believe is their local state."), or your individual cells, etc.
If you're talking about AI learning morals, my idea is not about that. Not about modeling desires and beliefs.
The meta level problem is that trying to get the AI to assign properties in a human-approved way is a complicated problem that you can only do so well without communicating with humans. (John Wentworth disagrees more or less, check out things tagged Natural Abstractions for more reading, but also try not to get too confirmation-biased.)
I disagree too, but in a slightly different way. IIRC, John says approximately the following:
I think (1) doesn't need to be true. I say:
Epistemic status: I want to propose a method of learning environmental goals (a super big, super important subproblem in Alignment). It's informal, so it has a lot of gaps. I worry I missed something obvious, rendering my argument completely meaningless. I asked the LessWrong feedback team, but they couldn't get someone knowledgeable enough to take a look.
Can you tell me the biggest conceptual problems with my method? Can you tell me whether agent foundations researchers are aware of this method or not?
If you're not familiar with the problem, here's the context: Environmental goals; identifying causal goal concepts from sensory data; ontology identification problem; Pointers Problem; Eliciting Latent Knowledge.
Explanation 1
One naive solution
Imagine we have a room full of animals. An AI sees the room through a camera. How can the AI learn to care about the real animals in the room rather than their images on the camera?
Assumption 1. Let's assume the AI models the world as a bunch of objects interacting in space and time. I don't know how critical or problematic this assumption is.
Idea 1. Animals in the video are objects with certain properties (they move continuously, they move with certain relative speeds, they have certain sizes, etc). Let's make the AI search for the best world-model which contains objects with similar properties (P properties).
Problem 1. Ideally, the AI will find clouds of atoms which move similarly to the animals on the video. However, the AI might just find a world-model (X) which contains the screen of the camera. So it'll end up caring about the "movement" of the pixels on the screen. Fail.
Observation 1. Our world contains many objects with P properties which don't show up on the camera. So, X is not the best world-model containing the biggest number of objects with P properties.
Idea 2. Let's make the AI search for the best world-model containing the biggest number of objects with P properties. (See the toy sketch after this list.)
Question 1. For "Idea 2" to make practical sense, we need to find a smart way to limit the complexity of the models. Otherwise the AI might just make any model contain arbitrary numbers of any objects. Can we find the right complexity restriction?
Question 2. Assume we resolved the previous question positively. What if "Idea 2" still produces an alien ontology humans don't care about? Can it happen?
Question 3. Assume everything works out. How do we know that this is a general method of solving the problem? We have an object in sense data (A) and we care about the physical thing corresponding to it (B): how do we know that B always behaves similarly to A and that there are always more instances of B than of A?
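Here's the toy sketch of Idea 2 together with the complexity restriction from Question 1. Everything in it (the P-property check, the complexity measure, the weight lambda_) is a made-up placeholder; the point is only the shape of the selection rule, not a workable learning algorithm.

```python
def has_p_properties(obj: dict) -> bool:
    # Hypothetical check: the object's size and typical speed fall in the
    # ranges we associate with the animals seen on the camera.
    return 0.05 < obj["size_m"] < 2.0 and obj["typical_speed_mps"] < 15.0

def count_p_objects(model: dict) -> int:
    return sum(has_p_properties(obj) for obj in model["objects"])

def description_length(model: dict) -> float:
    # Stand-in complexity measure; a real one might be the length of the
    # shortest program generating the model (this is what Question 1 asks about).
    return float(len(repr(model)))

def select_world_model(candidate_models: list, lambda_: float = 0.01) -> dict:
    # Idea 2 with a penalty: prefer models containing many P-property objects,
    # but charge for complexity so a model can't win just by positing
    # arbitrarily many extra objects (Question 1's worry).
    return max(candidate_models,
               key=lambda m: count_p_objects(m) - lambda_ * description_length(m))
```

The entire difficulty hides in `description_length` and in where the candidate models come from; the snippet only pins down what "best world-model containing the biggest number of objects with P properties" is supposed to mean.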
One philosophical argument
I think there's a philosophical argument which allows us to resolve Questions 2 & 3 (giving evidence that Question 1 should be resolvable too).
If the argument is true, the pointers problem should be solvable without the Natural Abstraction hypothesis being true.
Anyway, I'll add a toy example which hopefully helps to better understand what this is all about.
One toy example
You're inside a 3D video game. 1st person view. The game contains landscapes and objects, both made of small balls (the size of tennis balls) of different colors. Also a character you control.
The character can push objects. Objects can break into pieces. Physics is Newtonian. Balls are held together by some force. Balls can have dramatically different weights.
Light is modeled by particles. The sun emits particles, and they bounce off surfaces.
The most unusual thing: as you move, your coordinates are fed into a pseudorandom number generator. The numbers from the generator are then used to swap the positions of arbitrary balls.
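Here's a minimal sketch of that swap rule, just to make it concrete (the data layout, a list of ball dicts, is mine):

```python
import random

def swap_step(balls: list, character_xyz: tuple) -> list:
    """balls: list of dicts like {"color": ..., "weight_kg": ..., "pos": (x, y, z)}."""
    rng = random.Random(hash(character_xyz))   # the character's coordinates feed the PRNG
    i, j = rng.sample(range(len(balls)), 2)    # pick two arbitrary balls
    balls[i]["pos"], balls[j]["pos"] = balls[j]["pos"], balls[i]["pos"]
    return balls
```

So the identities of individual balls are unstable under normal play, which matters for which level of description is worth tracking.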
You care about pushing boxes (like everything else, they're made of balls too) into a certain location.
...
So, the reality of the game has roughly 5 levels:
I think the AI should be able to figure out that it needs to care about the 2A level of reality, because ball structures are much simpler to control (by doing normal activities with the game's character) than individual balls, and light particles are harder to interact with than ball structures, due to their speed and nature.
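A very rough rendering of that criterion, with a hypothetical probe function (the agent would estimate these success rates from its own interaction; nothing here is meant to be hand-coded):

```python
def controllability(candidate_kind, try_simple_intervention, n_trials: int = 50) -> float:
    """Fraction of trials in which a simple intended change to an object of this
    kind (nudge it, hold it in place) actually comes about."""
    successes = sum(try_simple_intervention(candidate_kind) for _ in range(n_trials))
    return successes / n_trials

def pick_kind_to_care_about(candidate_kinds, try_simple_intervention):
    # Ball structures should win: the PRNG keeps swapping individual balls,
    # and light particles move too fast to manipulate.
    return max(candidate_kinds,
               key=lambda kind: controllability(kind, try_simple_intervention))
```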
Explanation 2
An alternative explanation of my argument:
Formalization
So, the core of my idea is this:
How far are we from being able to do something like this? Are agent foundations researchers pursuing this or something else?