Humans Are Embedded Agents Too

We actually avoided talking about AI in most of the cartoon, and tried to just imply it by having a picture of a robot.

The first time (I think) I presented the factoring in the embedded agency sequence was at a MIRI CFAR collaboration workshop, so parallels with humans were live in my thinking.

The first time we presented the cartoon in roughly its current form was at MSFP 2018, where we purposely did it on the first night before a CFAR workshop, so people could draw analogies that might help them transfer their curiosity in both directions.

[-]Gordon Seidoh Worley6yΩ4110

I agree and think this is an unappreciated idea, which is why I liberally link the embedded agency post in things I write. I'm not sure I'm doing a perfect job of not forgetting we are all embedded, but I consider it important and essential to not getting confused about, for example, human values. I think many of the confusions we have (especially the ones we fail to notice) are a result of incorrectly thinking, to put it another way, that the map does not also reside in the territory.

[-]Shmi6y90

It makes sense that to address the challenges of the agent being embedded one needs to start at the very foundations. I suspect that there is a fair bit of work to do before even addressing embedding the agents. For example, in a basic map-territory correspondence the map is a part of the territory. So a question arises: what does it mean for a part of the territory to be a coarse-grained representation of the territory? What restrictions does that place on the type of territories that are internally mappable to begin with? For example, the territory has to admit lossy compression of some kind, yet not be completely fractal. Anyway, my point is that focusing on agency may be the wrong place to start; there are more basic questions of embeddings that need to be addressed first. And even figuring out what those questions might be would count as progress.

[-]johnswentworth6y80

I strongly agree with this. Those sorts of questions are exactly what I see as the main objective of my own research right now.

[-]justinpombrio6y60

This post points out that many alignment problems can be phrased as embedded agency problems. It seems to me that they can also all be phrased as word-boundary problems. More precisely, for each alignment/embedded-agency problem listed here, there's a question (or a set of questions) of the form "what is X?" such that answering that question would go a long way toward solving the alignment/embedded-agency problem, and vice-versa.

Is this a useful reduction?

The "what is X?" question I see for each problem:

The Keyboard is Not The Human

What does it mean for a person to "say" something (in the abstract sense of the word)?

Modified Humans

What is a "human"? Furthermore, what does it mean to "modify" or "manipulate" a human?

Off-Equilibrium

What are the meanings of counterfactual statements? For example, what does it mean to say "We will launch our nukes if you do."?

Perhaps also, what is a "choice"?

Drinking

What is a "valid profession of one's values"?

Value Drift

What are a person's "values"? Focus being on people changing over time.

Akrasia

What is a "person", and what are a person's "values"? Focus being on people being made of disparate parts.

Preferences Over Quantum Fields

What are the meanings of abstract, high-level statements? Do they change if your low-level model of the world fundamentally shifts?

Unrealized Implications

What are a person's "values"? Focus being on someone knowing A and knowing A->B but not yet knowing B.

Socially Strategic Self-Modification

What are a person's "true values"? Focus being on self-modification.

[-]johnswentworth6y70

Yes and no.

I do think you're pointing to the right problems - basically the same problems Shminux was pointing at in his comment, and the same problems which I think are the most promising entry point to progress on embedded agency in general.

That said, I think "word boundaries" is a very misleading label for this class of problems. It suggests that the problem is something like "draw a boundary around points in thing-space which correspond to the word 'tree'", except for concepts like "values" or "person" rather than "tree". Drawing a boundary in thing-space isn't really the objective here; the problem is that we don't know what the right parameterization of thing-space is or whether that's even the right framework for grounding these concepts at all.

Here's how I'd pose it. Over the course of history, humans have figured out how to translate various human intuitions into formal (i.e. mathematical) models. For instance:

Game theory gave a framework for translating intuitions about "strategic behavior" into math
Information theory gave a framework for translating intuitions about information into math
More recently, work on causality gave a framework for translating intuitions about counterfactuals into math
In the early days, people like Galileo showed how to translate physical intuitions into math

A good heuristic: if a class of intuitive reasoning is useful and effective in practice, then there's probably some framework which would let us translate those intuitions into math. In the case of embedded-agency-related problems, we don't yet have the framework - just the intuitions.

With that in mind, I'd pose the problem as: build a framework for translating intuitions about "values", "people", etc into math. That's what we mean by the question "what is X?".

[-]justinpombrio6y30

Ooh, that is very insightful. The word-boundary problem around "values" feels fuzzy and ill-defined, but that doesn't mean that the thing we care about is actually fuzzy and ill-defined.

[-]Noosphere891y40

IMO, the I/O part is not about the lack of such a channel, but rather the lack of a channel that is invulnerable to hacking/modification, such that the channel can be assumed to only come from a certain source.

You could always create such a channel, though it isn't fundamental. The issue is rather that you can't create a channel that is immune to modification/hacking, such that its contents can be assumed to come only from a certain source.

I like dxu's comment:

https://www.lesswrong.com/s/Rm6oQRJJmhGCcLvxh/p/zcPLNNw4wgBX5k8kQ#uFdZuNY3XxBBakLv7

[-]Noosphere891y*40

Some thoughts on the embedded agents part today, now that I'm inspired to have thoughts on it.

On unrealized implications, I don't think this is exactly an embedded agent problem so much as a problem of limited computational abilities.

More seriously, I suspect it's possible for an infinite agent to be both embedded within the structure of its universe and also be logically/computationally omniscient, but if we do impose a condition of finiteness, the unrealized implications part comes back.

So in that sense, I think it's not exactly a problem of being in the world, but rather being finite.
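A toy sketch of the "finite, not embedded" reading of unrealized implications: an agent that knows A and knows A->B only comes to know B if it has the compute budget to draw the inference. The function name `derive` and the `max_steps` budget are illustrative assumptions, not anything from the post.

```python
def derive(facts, rules, max_steps):
    """Apply modus ponens for at most max_steps rounds.

    facts: set of atomic propositions the agent currently believes.
    rules: list of (premise, conclusion) pairs, each read as premise -> conclusion.
    max_steps: hypothetical compute budget, counted in rounds of inference.
    """
    known = set(facts)
    for _ in range(max_steps):
        # Conclusions whose premise is already known but which are not yet believed.
        new = {c for p, c in rules if p in known and c not in known}
        if not new:
            break  # Belief set is deductively closed under the given rules.
        known |= new
    return known


rules = [("A", "B"), ("B", "C")]

no_budget = derive({"A"}, rules, max_steps=0)   # knows A and A->B, but not B
one_step = derive({"A"}, rules, max_steps=1)    # now knows B, still not C
full = derive({"A"}, rules, max_steps=10)       # closed under the rules
```

With a zero budget the agent's beliefs are not closed under its own rules, which is the "unrealized implication" situation; with enough budget the gap disappears without any change to where the agent sits in the world.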

But the finiteness condition is fine for now, so I'll talk about other things.

A lot of embedded agency problems, IMO, are either created or significantly worsened by physical universality, which is semi-plausible for our universe. In particular, a big thing that physical universality does for embedded agency is that you can no longer create a perfect isolator, because the environment can always reconfigure an isolated area. This is why any reversible cellular automaton that allows for perfect walls cannot be physically universal.

This means that there's no ground truth Cartesian boundary available that persists for all time, which breaks the abstraction of an agent separated from its environment, which means reward corruption and self-modification can happen.

Thus, we have to replace it with a theory that can handle shifts in boundaries. Ideally, the boundary should either be arbitrarily shiftable or not exist at all, but this creates problems since physical universality is way less studied than computational universality, and their interaction is not studied at all.

The I/O part is not about the lack of such a channel, but rather the lack of a channel that is invulnerable to hacking/modification, such that the channel can be assumed to only come from a certain source.

[-]Rohin Shah6yΩ340

Planned summary for the Alignment Newsletter:

<@Embedded agency@>(@Embedded Agents@) is not just a problem for AI systems: humans are embedded agents too; many problems in understanding human values stem from this fact. For example, humans don't have a well-defined output channel: we can't say "anything that comes from this keyboard is direct output from the human", because the AI could seize control of the keyboard and wirehead, or a cat could walk over the keyboard, etc. Similarly, humans can "self-modify", e.g. by drinking, which often modifies their "values": what does that imply for value learning? Based on these and other examples, the post concludes that "a better understanding of embedded agents in general will lead to substantial insights about the nature of human values".

Planned opinion:

I certainly agree that many problems with value learning stem from embedded agency issues with humans, and any <@formal account@>(@Why we need a *theory* of human values@) of this will benefit from general progress in understanding embeddedness. Unlike many others, I do not think we need a formal account of human values; I think a "common-sense" understanding will suffice, including for the embeddedness problems detailed in this post.

[-]johnswentworth6yΩ120

One (possibly minor?) point: this isn't just about value learning; it's the more general problem of pointing to values. For instance, a system with a human in the loop may not need to learn values; it could rely on the human to provide value judgements. On the other hand, the human still needs to point to their own values in a manner usable/interpretable by the rest of the system (possibly with the human doing the "interpretation", as in e.g. tool AI). Also, the system still needs to point to the human somehow - cats walking on keyboards are still a problem.

Also, if you have written up your views on these sorts of problems, and how human-common-sense understanding will solve them, I'd be interested to read that. (Or if someone else has written up views similar to your own, that works too.)

[-]Rohin Shah6yΩ340

One (possibly minor?) point: this isn't just about value learning; it's the more general problem of pointing to values.

Makes sense, I changed "value learning" to "figuring out what to optimize".

Also, if you have written up your views on these sorts of problems, and how human-common-sense understanding will solve them, I'd be interested to read that.

Hmm, I was going to say Chapter 3 of the Value Learning sequence, but looking at it again it doesn't really talk about this. Maybe the post on Following human norms gives some idea of the flavor of what I mean, but it doesn't explicitly talk about it. Perhaps I should write about this in the future.

Here's a brief version:

We'll build ML systems with common sense, because common sense is necessary for tasks of interest; common sense already deals with most (all?) of the human embeddedness problems. There are still two remaining problems:

Ensuring the AI uses its common sense when interpreting our goals / instructions. We'll probably figure this out in the future; it seems likely that "give instructions in natural language" automatically works (this is the case with human assistants for example).
Ensuring the AI is not trying to deceive us. This seems mostly-independent of human embeddedness. You can certainly construct examples where human embeddedness makes it hard to tell whether something is deceptive or not, but I think in practice "is this deceptive" is a common sense natural category that we can try to detect. (You may not be able to prove theorems, since it relies on common sense understanding; but you could be able to detect deception in any case that actually arises.)

[-]johnswentworth6yΩ690

Thanks, that makes sense.

FWIW, my response would be something like: assuming that common-sense reasoning is sufficient, we'll probably still need a better understanding of embeddedness in order to actually build common-sense reasoning into an AI. When we say "common sense can solve these problems", it means humans know how to solve the problems, but that doesn't mean we know how to translate the human understanding into something an AI can use. I do agree that humans already have a good intuition for these problems, but we still don't know how to automate that intuition.

I think our main difference in thinking here is not in whether or not common sense is sufficient, but in whether or not "common sense" is a natural category that ML-style methods could figure out. I do think it's a natural category in some sense, but I think we still need a theoretical breakthrough before we'll be able to point a system at it - and I don't think systems will acquire human-compatible common sense by default as an instrumentally convergent tool.

[-]Rohin Shah6yΩ340

I think our main difference in thinking here is not in whether or not common sense is sufficient, but in whether or not "common sense" is a natural category that ML-style methods could figure out.

To give some flavor of why I think ML could figure it out:

I don't think "common sense" itself is a natural category, but is instead more like a bundle of other things that are natural, e.g. pragmatics. It doesn't seem like "common sense" is innate to humans; we seem to learn "common sense" somehow (toddlers are often too literal). I don't see an obvious reason why an ML algorithm shouldn't be able to do the same thing.

In addition, "common sense" type rules are often very useful for prediction, e.g. if you hear "they gave me a million packets of hot sauce", and then you want to predict how many packets of hot sauce there are in the bag, you're going to do better if you understand common sense. So common sense is instrumentally useful for prediction (and probably any other objective you care to name that we might use to train an AI system).

That said, I don't think it's a crux for me -- even if I believed that current ML systems wouldn't be able to figure "common sense" out, my main update would be that current ML systems wouldn't lead to AGI / transformative AI, since I expect most tasks require common sense. Perhaps the crux is "transformative AI will necessarily have figured out most aspects of 'common sense'".

[-]johnswentworth6yΩ340

Ah, ok, I may have been imagining something different by "common sense" than you are - something more focused on the human-specific parts.

Maybe this claim gets more at the crux: the parts of "common sense" which are sufficient for handling embeddedness issues with human values are not instrumentally convergent; the parts of "common sense" which are instrumentally convergent are not sufficient for human values.

The cat on the keyboard seems like a decent example here (though somewhat oversimplified). If the keyboard suddenly starts emitting random symbols, then it seems like common sense to ignore it - after all, those symbols obviously aren't coming from a human. On the other hand, if the AI's objective is explicitly pointing to the keyboard, then that common sense won't do any good - it doesn't have any reason to care about the human's input more than random input a priori, common sense or not. Obviously there are simple ways of handling this particular problem, but it's not something the AI would learn unless it was pointing to the human to begin with.
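A toy sketch of the difference between an objective that points at the keyboard and one that points at the human. The `source` field below is an assumption made purely for illustration: it stands in for exactly the ground truth a real system does not get for free, and all names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class KeyEvent:
    text: str
    source: str  # Ground truth ("human" or "cat"); not visible on the wire.


def keyboard_objective(events):
    """Objective that points at the keyboard: every numeric entry counts."""
    return sum(int(e.text) for e in events if e.text.isdigit())


def human_objective(events):
    """Objective that points at the human: only the human's entries count."""
    return sum(
        int(e.text)
        for e in events
        if e.source == "human" and e.text.isdigit()
    )


events = [
    KeyEvent("3", "human"),
    KeyEvent("9", "cat"),      # The cat happens to hit a digit.
    KeyEvent("jjjj", "cat"),   # Obvious noise that even a crude filter drops.
    KeyEvent("4", "human"),
]

keyboard_score = keyboard_objective(events)  # 3 + 9 + 4
human_score = human_objective(events)        # 3 + 4
```

The two objectives disagree as soon as anything other than the human touches the keyboard. An optimizer handed `keyboard_objective` has no reason to prefer the human's keystrokes, however much common sense it has about cats.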

[-]Rohin Shah6yΩ340

Hmm, this seems to be less about whether or not you have common sense, and more about whether the AI system is motivated to use its common sense in interpreting instructions / goals.

I think if you have an AI system that is maximizing an explicit objective, e.g. "maximize the numbers input from this keyboard", then the AI will have common sense, but (almost tautologically) won't use it to interpret the input correctly. (See also Failed Utopia.)

The hope is to train an AI system that doesn't work like that, in the same way that humans don't work like that. (In fact, I could see that by default AI systems are trained like that; e.g. instruction-following AI systems like CraftAssist seem to be in this vein.)

[-]johnswentworth6yΩ450

The hope is to train an AI system that doesn't work like that, in the same way that humans don't work like that. (In fact, I could see that by default AI systems are trained like that; e.g. instruction-following AI systems like CraftAssist seem to be in this vein.)

Let me make sure I understand what you're picturing as an example. Rather than giving an AI an explicit objective, we train it to follow instructions from a human (presumably using something RL-ish?), and the idea is that it will learn something like human common sense in order to better follow instructions. Is that a prototypical case of what you're imagining? If so, what criteria do you imagine using for training? Maximizing a human approval score? Mimicking a human/predicting what a human would do and then doing that? Some kind of training procedure which somehow avoids optimizing anything at all?

[-]Rohin Shah6yΩ450

Is that a prototypical case of what you're imagining?

Yes.

Maximizing a human approval score?

Sure, that seems reasonable. Note that this does not mean that the agent ends up taking whichever actions maximize the number entered into a keyboard; it instead creates a policy that is consistent with the constraints "when asked to follow <instruction i>, I should choose action <most approved action i>", for instructions and actions it is trained on. It's plausible to me that the most "natural" policy that satisfies these constraints is one which predicts what a real human would think of the chosen action, and then chooses the action that does best according to that prediction.

(In practice you'd want to add other things like e.g. interpretability and adversarial training.)

[-]johnswentworth6yΩ340

It's plausible to me that the most "natural" policy that satisfies these constraints is one which predicts what a real human would think of the chosen action...

I'd expect that's going to depend pretty heavily on how we're quantifying "most natural", which brings us right back to the central issue.

Just in terms of pure predictive power, the most accurate policy is going to involve a detailed simulation of a human at a keyboard, reflecting the physical setup in which the data is collected - and that will produce basically the same problems as an actual human at a keyboard. The final policy won't point to human values any more robustly than the data collection process did - if the data was generated by a human typing at a keyboard, then the most-predictive policy will predict what a human would type at a keyboard, not what a human "actually wants". Garbage in, garbage out, etc.

More pithily: if a problem can't be solved by a human typing something into a keyboard, then it also won't be solved by simulating/predicting what the human would type into the keyboard.

It could be that there's some viable criterion of "natural" other than just maximizing predictive power, but predictive power alone won't circumvent the embeddedness problems.
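A toy sketch of "garbage in, garbage out" for a purely predictive policy. The logged situations, the camera setup, and the function name `fit_most_predictive` are all hypothetical; the point is only that a predictor fit to what was typed reproduces what was typed, including the cases where that differs from what the human wanted.

```python
from collections import Counter, defaultdict

# Hypothetical log: (situation, what the human wanted, what the human typed).
# In the third situation the human cannot see the mess, so they type "approve".
log = [
    ("clean room", "approve", "approve"),
    ("visible mess", "reject", "reject"),
    ("mess hidden from camera", "reject", "approve"),
] * 5


def fit_most_predictive(pairs):
    """Return the policy that best predicts the typed label per situation."""
    counts = defaultdict(Counter)
    for situation, typed in pairs:
        counts[situation][typed] += 1
    return {s: c.most_common(1)[0][0] for s, c in counts.items()}


# The training signal only ever contains the typed labels.
policy = fit_most_predictive((s, typed) for s, _wanted, typed in log)

wanted = {s: w for s, w, _typed in log}
mismatches = [s for s in policy if policy[s] != wanted[s]]
```

Here `mismatches` contains exactly the situation where the data collection process itself diverged from the human's wishes. Maximizing predictive accuracy on the typed labels cannot repair that divergence, since the "wanted" column never enters the fit.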

[-]Rohin Shah6yΩ450

Just in terms of pure predictive power, the most accurate policy is going to involve a detailed simulation of a human at a keyboard, reflecting the physical setup in which the data is collected - and that will produce basically the same problems as an actual human at a keyboard. [...] the most-predictive policy will predict what a human would type at a keyboard, not what a human "actually wants".

Agreed. I don't think we will get that policy, because it's very complex. (It's much easier / cheaper to predict what the human wants than to run a detailed simulation of the room.)

I'd expect that's going to depend pretty heavily on how we're quantifying "most natural", which brings us right back to the central issue.

I'm making an empirical prediction; so I'm not quantifying "most natural", reality is.

Tbc, I'm not saying that this is a good on-paper solution to AI safety; it doesn't seem like we could know in advance that this would work. I'm saying that it may turn out that as we train more and more powerful systems, we see evidence that the picture I painted is basically right; in that world it could be enough to do some basic instruction-following.

I'm also not saying that this is robust to scaling up arbitrarily far; as you said, the literal most predictive policy doesn't work.

[-]johnswentworth6yΩ340

Cool, I agree with all of that. Thanks for taking the time to talk through this.
