Concept Safety: What are concepts for, and how to deal with alien concepts

Kaj_Sotala

In The Problem of Alien Concepts, I posed the following question: if your concepts (defined as either multimodal representations or as areas in a psychological space) previously had N dimensions and then they suddenly have N+1, how does that affect (moral) values that were previously only defined in terms of N dimensions?

I gave some (more or less) concrete examples of this kind of a "conceptual expansion":

Children learn to represent dimensions such as "height" and "volume", as well as "big" and "bright", separately at around age 5.
As an inhabitant of the Earth, you've been used to people being unable to fly and landowners being able to forbid others from using their land. Then someone goes and invents an airplane, leaving open the question of the height to which the landowner's control extends. Similarly for satellites and nation-states.
As an inhabitant of Flatland, you've been told that the inside of a certain rectangle is a forbidden territory. Then you learn that the world is actually three-dimensional, leaving open the question of the height to which the forbidden territory extends.
An AI has previously been reasoning in terms of classical physics and been told that it can't leave a box, which it previously defined in terms of classical physics. Then it learns about quantum physics, which allows for definitions of "location" which are substantially different from the classical ones.

As a hint of the direction where I'll be going, let's first take a look at how humans solve these kinds of dilemmas, and consider examples #1 and #2.

The first example - children realizing that items have a volume that's separate from their height - rarely causes any particular crises. Few children have values that would be seriously undermined or otherwise affected by this discovery. We might say that it's a non-issue because none of the children's values have been defined in terms of the affected conceptual domain.

As for the second example, I don't know the exact cognitive process by which it was decided that you didn't need the landowner's permission to fly over their land. But I'm guessing that it involved reasoning like: if the plane flies at a sufficient height, then that doesn't harm the landowner in any way. Flying would become impossibly difficult if you had to get separate permission from every person whose land you were going to fly over. And, especially before the invention of radar, a ban on unauthorized flyovers would be next to impossible to enforce anyway.

We might say that after an option became available which forced us to include a new dimension in our existing concept of landownership, we solved the issue by considering it in terms of our existing values.

Concepts, values, and reinforcement learning

Before we go on, we need to talk a bit about why we have concepts and values in the first place.

From an evolutionary perspective, creatures that are better capable of harvesting resources (such as food and mates) and avoiding dangers (such as other creatures who think you're food or after their mates) tend to survive and have offspring at better rates than otherwise comparable creatures who are worse at those things. If a creature is to be flexible and capable of responding to novel situations, it can't just have a pre-programmed set of responses to different things. Instead, it needs to be able to learn how to harvest resources and avoid danger even when things are different from before.

How did evolution achieve that? Essentially, by creating a brain architecture that can, as a very very rough approximation, be seen as consisting of two different parts. One part, which a machine learning researcher might call the reward function, has the task of figuring out when various criteria - such as being hungry or getting food - are met, and issuing the rest of the system either a positive or negative reward based on those conditions. The other part, the learner, then "only" needs to find out how to best optimize for the maximum reward. (And then there is the third part, which includes any region of the brain that's neither of the above, but we don't care about those regions now.)

The mathematical theory of how to learn to optimize for rewards when your environment and reward function are unknown is reinforcement learning (RL), which recent neuroscience indicates is implemented by the brain. An RL agent learns a mapping from states of the world to rewards, as well as a mapping from actions to world-states, and then uses that information to maximize the amount of lifetime rewards it will get.

There are two major reasons why an RL agent, like a human, should learn high-level concepts:

They make learning massively easier. Instead of having to separately learn that "in the world-state where I'm sitting naked in my cave and have berries in my hand, putting them in my mouth enables me to eat them" and that "in the world-state where I'm standing fully-clothed in the rain outside and have fish in my hand, putting it in my mouth enables me to eat it" and so on, the agent can learn to identify the world-states that correspond to the abstract concept of having food available, and then learn the appropriate action to take in all those states.
There are useful behaviors that need to be bootstrapped from lower-level concepts to higher-level ones in order to be learned. For example, newborns have an innate preference for looking at roughly face-shaped things (Farroni et al. 2005), which develops into a more consistent preference for looking at faces over the first year of life (Frank, Vul & Johnson 2009). One hypothesis is that this bias towards paying attention to the relatively-easy-to-encode-in-genes concept of "face-like things" helps direct attention towards learning valuable but much more complicated concepts, such as ones involved in a basic theory of mind (Gopnik, Slaughter & Meltzoff 1994) and the social skills involved with it.
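The first point can be made concrete with a toy sketch. Everything below (the state features, the `concept` function, the action names) is made up for illustration and is not taken from any particular RL system. It contrasts a policy that needs one learned entry per raw world-state with one that needs only one entry per abstract concept.

```python
from itertools import product

# Hypothetical raw world-state features (illustrative only).
LOCATIONS = ["cave", "outside"]
CLOTHING = ["naked", "clothed"]
WEATHER = ["dry", "rain"]
IN_HAND = ["berries", "fish", "rock", "nothing"]

FOODS = {"berries", "fish"}


def concept(state):
    """Map a raw world-state to an abstract concept."""
    _location, _clothing, _weather, in_hand = state
    return "food_available" if in_hand in FOODS else "no_food"


raw_states = list(product(LOCATIONS, CLOTHING, WEATHER, IN_HAND))

# Without concepts: one entry has to be learned for every raw state.
raw_policy = {s: ("eat" if s[3] in FOODS else "forage") for s in raw_states}

# With concepts: one entry per concept is enough.
concept_policy = {"food_available": "eat", "no_food": "forage"}


def act(state):
    """Choose an action by first abstracting the raw state into a concept."""
    return concept_policy[concept(state)]


print(len(raw_policy), "raw entries vs", len(concept_policy), "concept entries")
```

Both policies pick the same action in every state, but the concept-based one would also generalize to a raw state it has never seen, as long as that state can be recognized as "food available".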

Viewed in this light, concepts are cognitive tools that are used for getting rewards. At the most primitive level, we should expect a creature to develop concepts that abstract over situations that are similar with regards to the kind of reward that one can gain from taking a certain action in those states. Suppose that a certain action in state s₁ gives you a reward, and that there are also states s₂ - s₅ in which taking some specific action causes you to end up in s₁. Then we should expect the creature to develop a common concept for being in the states s₂ - s₅, and we should expect that concept to be "more similar" to the concept of being in state s₁ than to the concept of being in some state that was many actions away.

"More similar" how?

In reinforcement learning theory, reward and value are two different concepts. The reward of a state is the actual reward that the reward function gives you when you're in that state or perform some action in that state. Meanwhile, the value of the state is the maximum total reward that you can expect to get from moving from that state to others (times some discount factor). So a state A with reward 0 might have value 5 if you could move from it to state B, which had a reward of 5.
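The A/B example can be checked with a few lines of code. This is a minimal sketch that assumes a deterministic world and one common convention, V(s) = R(s) + gamma * max V(s'). The function name `state_values` and the graph encoding are illustrative choices, not anything from the literature cited here.

```python
def state_values(rewards, successors, gamma=1.0, sweeps=100):
    """Value iteration on a small deterministic graph.

    V(s) = R(s) + gamma * max over successors s' of V(s'),
    and V(s) = R(s) for states that have no successors.
    Assumes the graph is acyclic or gamma < 1, so the values settle.
    """
    values = dict(rewards)
    for _ in range(sweeps):
        for state, next_states in successors.items():
            best_next = max((values[n] for n in next_states), default=0.0)
            values[state] = rewards[state] + gamma * best_next
    return values


rewards = {"A": 0.0, "B": 5.0}
successors = {"A": ["B"], "B": []}

print(state_values(rewards, successors))             # no discounting
print(state_values(rewards, successors, gamma=0.9))  # with discounting
```

Without discounting, A gets a value of 5 despite having a reward of 0. With a discount factor of 0.9 its value drops to 4.5, which is the "times some discount factor" part.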

Below is a figure from DeepMind's recent Nature paper, which presented a deep reinforcement learner that was capable of achieving human-level performance or above on 29 of 49 Atari 2600 games (Mnih et al. 2015). The figure is a visualization of the representations that the learning agent has developed for different game-states in Space Invaders. The representations are color-coded depending on the value of the game-state that the representation corresponds to, with red indicating a higher value and blue a lower one.

As can be seen (and is noted in the caption), representations with similar values are mapped closer to each other in the representation space. Also, some game-states which are visually dissimilar to each other but have a similar value are mapped to nearby representations. Likewise, states that are visually similar but have a differing value are mapped away from each other. We could say that the Atari-playing agent has learned a primitive concept space, where the relationships between the concepts (representing game-states) depend on their value and the ease of moving from one game-state to another.
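To tie this back to the s₁ - s₅ example, here is a toy sketch of value acting as a similarity coordinate. It is not how the Atari agent learns its representations (those come out of training a neural network). The step counts, the discount factor and the single reward at s₁ are assumptions chosen purely for illustration.

```python
GAMMA = 0.9          # assumed discount factor
REWARD_AT_S1 = 1.0   # assumed reward available in state s1

# Hypothetical number of actions needed to reach s1 from each state.
steps_to_s1 = {"s1": 0, "s2": 1, "s3": 1, "s4": 1, "s5": 1, "far": 6}

# Discounted value of each state: the reward shrinks by GAMMA per step.
value = {s: REWARD_AT_S1 * GAMMA ** k for s, k in steps_to_s1.items()}


def value_distance(a, b):
    """Distance between two states along the single 'value' axis."""
    return abs(value[a] - value[b])


def neighbours_by_value(state):
    """Other states, ordered from most to least similar in value."""
    others = [s for s in value if s != state]
    return sorted(others, key=lambda s: value_distance(state, s))


print(neighbours_by_value("s2"))
```

In this toy, s₂ - s₅ collapse onto the same point, sit close to s₁, and are far from the state that is many actions away, regardless of what the states "look like".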

In most artificial RL agents, reward and value are kept strictly separate. In humans (and mammals in general), this doesn't seem to work quite the same way. Rather, if there are things or behaviors which have once given us rewards, we tend to eventually start valuing them for their own sake. If you teach a child to be generous by praising them when they share their toys with others, you don't have to keep doing it all the way to your grave. Eventually they'll internalize the behavior, and start wanting to do it. One might say that the positive feedback actually modifies their reward function, so that they will start getting some amount of pleasure from generous behavior without needing to get external praise for it. In general, behaviors which are learned strongly enough don't need to be reinforced anymore (Pryor 2006).

Why does the human reward function change as well? Possibly because of the bootstrapping problem: there are things such as social status that are very complicated and hard to directly encode as "rewarding" in an infant mind, but which can be learned by associating them with rewards. One researcher I spoke with commented that he "wouldn't be at all surprised" if it turned out that sexual orientation was learned by men and women having slightly different smells, and sexual interest bootstrapping from an innate reward for being in the presence of the right kind of a smell, which the brain then associated with the features usually co-occurring with it. His point wasn't so much that he expected this to be the particular mechanism, but that he wouldn't find it particularly surprising if a core part of the mechanism was something that simple. Remember that incest avoidance seems to bootstrap from the simple cue of "don't be sexually interested in the people you grew up with".

This is, in essence, how I expect human values and human concepts to develop. We have some innate reward function which gives us various kinds of rewards for different kinds of things. Over time we develop various concepts for the purpose of letting us maximize our rewards, and lived experiences also modify our reward function. Our values are concepts which abstract over situations in which we have previously obtained rewards, and which have become intrinsically rewarding as a result.

Getting back to conceptual expansion

Having defined these things, let's take another look at the two examples we discussed above. As a reminder, they were:

Children learn to represent dimensions such as "height" and "volume", as well as "big" and "bright", separately at around age 5.
As an inhabitant of the Earth, you've been used to people being unable to fly and landowners being able to forbid others from using their land. Then someone goes and invents an airplane, leaving open the question of the height to which the landowner's control extends.

I summarized my first attempt at describing the consequences of #1 as "it's a non-issue because none of the children's values have been defined in terms of the affected conceptual domain". We can now reframe it as "it's a non-issue because the [concepts that abstract over the world-states which give the child rewards] mostly do not make use of the dimension that's now been split into 'height' and 'volume'".

Admittedly, this new conceptual distinction might be relevant for estimating the value of a few things. A more accurate estimate of the volume of a glass leads to a more accurate estimate of which glass of juice to prefer, for instance. With children, there probably is some intuitive physics module that figures out how to apply this new dimension for that purpose. Even if there wasn't, and it was unclear whether it was the "tall glass" or "high-volume glass" concept that needed to be mapped closer to high-value glasses, this could be easily determined by simple experimentation.

As for the airplane example, I summarized my description of it by saying that "after an option became available which forced us to include a new dimension in our existing concept of landownership, we solved the issue by considering it in terms of our existing values". We can similarly reframe this as "after the feature of 'height' suddenly became relevant for the concept of landownership, when it hadn't been a relevant feature dimension for landownership before, we redefined landownership by considering which kind of redefinition would give us the largest amounts of rewarding things". "Rewarding things", here, shouldn't be understood only in terms of concrete physical rewards like money, but also anything else that people have ended up valuing, including abstract concepts like right to ownership.

Note also that different people, having different experiences, ended up making different redefinitions. No doubt some landowners felt that "being in total control of my land and everything above it" was a more important value than "the convenience of people who get to use airplanes"... unless, perhaps, they got to see first-hand the value of flying, in which case the new information could have repositioned the different concepts in their value-space.

As an aside, this also works as a possible partial explanation for e.g. someone being strongly against gay rights until their child comes out of the closet. Someone they care about suddenly benefiting from the concept of "gay rights", which previously had no positive value for them, may end up changing the value of that concept. In essence, they gain new information about the value of the world-states that the concept of "my nation having strong gay rights" abstracts over. (Of course, things don't always go this well, if their concept of homosexuality is too strongly negative to start with.)

The Flatland case follows a similar principle: the Flatlanders have some values that declared the inside of the rectangle a forbidden space. Maybe the inside of the rectangle contains monsters which tend to eat Flatlanders. Once they learn about 3D space, they can rethink it in terms of their existing values.

Dealing with the AI in the box

This leaves us with the AI case. We have, via various examples, taught the AI to stay in the box, which was defined in terms of classical physics. In other words, the AI has obtained the concept of a box, and has come to associate staying in the box with some reward, or possibly leaving it with a lack of a reward.

Then the AI learns about quantum mechanics. It learns that in the QM formulation of the universe, "location" is not a fundamental or well-defined concept anymore - and in some theories, even the concept of "space" is no longer fundamental or well-defined. What happens?

Let's look at the human equivalent for this example: a physicist who learns about quantum mechanics. Do they start thinking that since location is no longer well-defined, they can now safely jump out of the window on the sixth floor?

Maybe some do. But I would wager that most don't. Why not?

The physicist cares about QM concepts to the extent that the said concepts are linked to things that the physicist values. Maybe the physicist finds it rewarding to develop a better understanding of QM, to gain social status by making important discoveries, and to pay their rent by understanding the concepts well enough to continue to do research. These are some of the things that the QM concepts are useful for. Likely the brain has some kind of causal model indicating that the QM concepts are relevant tools for achieving those particular rewards. At the same time, the physicist also has various other things they care about, like being healthy and hanging out with their friends. These are values that can be better furthered by modeling the world in terms of classical physics.

In some sense, the physicist knows that if they started thinking "location is ill-defined, so I can safely jump out of the window", then that would be changing the map, not the territory. It wouldn't help them get the rewards of being healthy and getting to hang out with friends - even if a hypothetical physicist who did make that redefinition would think otherwise. It all adds up to normality.

A part of this comes from the fact that the physicist's reward function remains defined over immediate sensory experiences, as well as values which are linked to those. Even if you convince yourself that the location of food is ill-defined and you thus don't need to eat, you will still suffer the negative reward of being hungry. The physicist knows that no matter how they change their definition of the world, that won't affect their actual sensory experience and the rewards they get from that.

So to prevent the AI from leaving the box by suitably redefining reality, we have to somehow find a way for the same reasoning to apply to it. I haven't worked out a rigorous definition for this, but it needs to somehow learn to care about being in the box in classical terms, and realize that no redefinition of "location" or "space" is going to alter what happens in the classical model. Also, its rewards need to be defined over models to a sufficient extent to avoid wireheading (Hibbard 2011), so that it will think that trying to leave the box by redefining things would count as self-delusion, and not accomplish the things it really cared about. This way, the AI's concept for "being in the box" should remain firmly linked to the classical interpretation of physics, not the QM interpretation of physics, because it's acting in terms of the classical model that has always given it the most reward.

It is my hope that this could also be made to extend to cases where the AI learns to think in terms of concepts that are totally dissimilar to ours. If it learns a new conceptual dimension, how should that affect its existing concepts? Well, it can figure out how to reclassify the existing concepts that are affected by that change, based on what kind of a classification ends up producing the most reward... when the reward function is defined over the old model.
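As a very rough sketch of what "reclassify based on reward defined over the old model" could look like, consider the toy below. It is not a rigorous proposal. The one-dimensional box, the extra dimension `z`, the two candidate redefinitions and the scoring rule are all hypothetical, chosen only to make the idea concrete.

```python
from itertools import product
from statistics import mean


def old_reward(x):
    """Reward as defined over the OLD one-dimensional model: 1 inside the box."""
    return 1.0 if 0.0 <= x <= 1.0 else 0.0


def project_to_old_model(state):
    """Drop the newly learned dimension z and keep only x."""
    x, _z = state
    return x


# Hypothetical candidate redefinitions of "in the box" over the new (x, z) model.
CANDIDATES = {
    # The new dimension changes nothing about what counts as "inside".
    "ignore_new_dimension": lambda s: 0.0 <= s[0] <= 1.0,
    # The new dimension is treated as a loophole: any z != 0 counts as "inside".
    "new_dimension_is_loophole": lambda s: 0.0 <= s[0] <= 1.0 or s[1] != 0,
}

# A small grid of states in the new, two-dimensional model.
NEW_STATES = list(product([-1.0, 0.0, 0.5, 1.0, 2.0], [0, 1]))


def score(in_box):
    """Mean old-model reward over the states a candidate labels 'in the box'."""
    labelled = [s for s in NEW_STATES if in_box(s)]
    return mean(old_reward(project_to_old_model(s)) for s in labelled)


scores = {name: score(fn) for name, fn in CANDIDATES.items()}
best = max(scores, key=scores.get)
print(scores, "->", best)
```

Because the reward is still computed over the old model, the "loophole" redefinition ends up labelling unrewarded states as being inside the box and scores lower. The redefinition that leaves the old meaning intact wins.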

Next post in series: World-models as tools.

A caveat which didn't fit the flow of the text:

Human concepts aren't always purely derived from just their instrumental value. We do seem to have an automatic unsupervised learning component which independently constructs models of the environment and gains new modeling capabilities during maturation, as was seen in the children's height/volume example. Novelty is also one of the things that we find rewarding, and we are driven by curiosity to develop concepts that allow us to compress previous observations more effectively (Schmidhuber 2009). Still, it's worth noting that most people have specific subjects that they are curious about (which some others find uninteresting) while having other subjects they find uninteresting (which some others find interesting), suggesting that even this intrinsic concept-formation drive is guided and directed by various rewards.

There are plenty of other such caveats that I could have made, like a discussion of how emotions affect our reward function, how there seem to be distinct System 1 and System 2 concepts, and so on. But they would have distracted from the main point. I'll just note here that I'm aware of the full picture being quite a bit more complicated than this post might make it seem.

I much enjoyed your posts so far Kaj, thanks for creating them.

I'd like to draw attention, in this particular one, to

Viewed in this light, concepts are cognitive tools that are used for getting rewards.

to add a further caveat: though some concepts are related to rewards, and some conceptual clustering is done in a way that maps to the reward of the agent as a whole, much of what goes on in concept formation, simple or complex, is just the wire together, fire together old saying. More specifically, if we are only calling "reward" what is a reward for the whole individual, then most concept formation will not be reward related. At the level of neurons or neural columns, there are reward-like mechanisms taking place, no doubt, but it would be a mereological fallacy to assume that rewardness carries upward from parts to wholes.
There are many types of concepts for which indeed, as you contend, rewards are very important, and they deserve as much attention as those which cannot be explained merely by the idea of a single monolithic agent seeking rewards.

Very interesting post. I liked how your sequence of examples led from things we understand to the thing we're trying to understand.

Also, I recognized myself in the analogy of the scientists trying to redefine reality in order to fly :)

In most artificial RL agents, reward and value are kept strictly separate. In humans (and mammals in general), this doesn't seem to work quite the same way.

Note that practical artificial RL agents make decisions using the value function (indeed that is its entire point), rather than directly computing the expected future discounted reward - as computing that measure is generally intractable.

The details of the various RL mechanisms in the human brain are complex and are still very much an active area of study, but if anything the evidence rather strongly indicates that value function approximation is a necessary component of the biological solution for the same reasons we employ it in our artificial systems (and perhaps more so, because neurons are slow and thus fast planning search is much less of an option).

Rather, if there are things or behaviors which have once given us rewards, we tend to eventually start valuing them for their own sake.

Any practical RL agent exhibits the same behavior: once an agent learns that something leads to rewards, it encodes that as a valuable something in its value function - value is just approximate expected discounted future reward.
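A minimal sketch of that point, using the toy-sharing example from the post and a standard tabular TD(0) update. The state names, the two-step trajectory, and the learning rate and discount factor are all assumptions made for illustration.

```python
def td0(episodes, alpha=0.1, gamma=0.9):
    """Tabular TD(0) on a fixed trajectory: share_toys -> praised -> end."""
    values = {"share_toys": 0.0, "praised": 0.0, "end": 0.0}
    # Each step: (state, reward received on leaving it, next state).
    trajectory = [("share_toys", 0.0, "praised"), ("praised", 1.0, "end")]
    for _ in range(episodes):
        for state, reward, next_state in trajectory:
            td_error = reward + gamma * values[next_state] - values[state]
            values[state] += alpha * td_error
    return values


print(td0(episodes=0))
print(td0(episodes=2000))
```

Sharing itself never yields a reward here, yet after training its value approaches gamma times the reward that follows it: the agent has encoded the behavior as valuable in its value function.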

It is my hope that this could also be made to extend to cases where the AI learns to think in terms of concepts that are totally dissimilar to ours. If it learns a new conceptual dimension, how should that affect its existing concepts? Well, it can figure out how to reclassify the existing concepts that are affected by that change, based on what kind of a classification ends up producing the most reward... when the reward function is defined over the old model.

The challenge then shifts to specifying the correct initial utility function in terms of a poor initial world model and utility function that somehow develops into the correct long term utility function when blown up to superintelligence.

For example, a utility function that assigns high value to "stay in the box" is of course probably a very bad idea due to perverse instantiations.

In the concept learning approach - if I understand it correctly - we define the reward/utility function through a manual mapping of concepts -> utility examples. The resulting reward function can be learned or hand constructed, but either way it is defined by the example set which maps primitive concepts to utilities.

One issue with this type of mapping (versus say the IRL alternative) is it requires the designers to determine in advance some key hyperparameters of a correct/safe reward function, such as its overall distribution over time.

The other bigger issue is the distinction between concepts that represent actual concrete beliefs about the current state of the world vs imagined beliefs about the world or abstract beliefs about the potential future world state. We want the reward function to be high only for concept sequence inputs that correspond to internal representations of the AI actually observing and believing that it did something 'good', not situations where the AI just imagines a good outcome. This actually gets pretty tricky quickly, because essentially it involves mapping out what amount to simulated outcome states in the AI's mind.

You can't just have the AI imagine a nice world in the abstract and then hard code that concept to high reward. You actually need the AI to concretely experience a nice world internally and map those concept sequences to high reward.

In the case of the DQN Atari agent, this isn't a problem, because the agent exists entirely in a simulation that provides the correct training data that fully covers the relevant domain of the reward function.

The challenge then shifts to specifying the correct initial utility function in terms of a poor initial world model and utility function that somehow develops into the correct long term utility function when blown up to superintelligence.

Agreed, this is a key problem.

In the concept learning approach - if I understand it correctly - we define the reward/utility function through a manual mapping of concepts -> utility examples. The resulting reward function can be learned or hand constructed, but either way it is defined by the example set which maps primitive concepts to utilities.

I've intentionally left vague the exact mechanism of how to define the initial utility function, since I don't feel like I have a good answer to it. An IRL approach sounds like it'd be one possible way of doing it, but I haven't had the chance to read more about it yet.

The main scenario I had implicitly in mind had something resembling a "childhood" for the AI, where its power and intelligence would be gradually increased while it interacted with human programmers in a training environment and was given feedback on what was considered "good" or "bad", so that it would gradually develop concepts that approximated human morality as it tried to maximize the positive feedback. Possibly even giving it a humanoid body at first, to further give a human-like grounding to its concepts. Of course this essentially assumes a slow takeoff and an environment where there is time to give the AI an extended childhood, so it might very well be that this is unfeasible. (Another potential problem with it is that the AI's values would become quite strongly shaped by those of its programmers, which not everyone would be likely to agree with.)

Another scenario I thought of would be to train the AI by something like the word embedding models, i.e. being given a vast set of moral judgments and then needing to come up with concepts simulating human moral reasoning in order to correctly predict the "right" judgments. There a problem would be in finding an appropriate and large enough dataset, plus again the fact that different humans would have differing judgments, making the set noisy. (But maybe that could be leveraged to one's advantage, too, so that the AI would only be sure about the kinds of moral values that were nearly universally agreed upon.)

The main scenario I had implicitly in mind had something resembling a "childhood" for the AI, where its power and intelligence would be gradually increased while it interacted with human programmers in a training environment and was given feedback on what was considered "good" or "bad", so that it would gradually develop concepts that approximated human morality as it tried to maximize the positive feedback.

This really is the most realistic scenario for AGI in general, given the generality of the RL architecture. Of course, there are many variations - especially in how the training environment and utility feedback interact.

Possibly even giving it a humanoid body at first, to further give a human-like grounding to its concepts

If we want the AI to do human-ish labor tasks, a humanoid body makes lots of sense. It also makes sense for virtual acting, interacting with humans in general, etc. A virtual humanoid body has many advantages - with instantiation in a physical robot as a special case.

(Another potential problem with it is that the AI's values would become quite strongly shaped by those of its programmers, which not everyone would be likely to agree with.)

Yep - kind of unavoidable unless somebody releases the first advanced AGI for free. Even then, most people wouldn't invest the time to educate it and instill their values.

Another scenario I thought of would be to train the AI by something like the word embedding models, i.e. being given a vast set of moral judgments and then needing to come up with concepts simulating human moral reasoning in order to correctly predict the "right" judgments.

So say you train the AI to compute a mapping between a sentence in English describing a moral scenario and a corresponding sentiment/utility, how do you translate that into the AI's reward/utility function? You'd need to somehow also map encodings of imagined moral scenarios back and forth between encodings of observation histories.

This really is the most realistic scenario for AGI in general, given the generality of the RL architecture.

Of course, "gradually training the AGI's values through an extended childhood" gets tricky if it turns out that there's a hard takeoff.

So say you train the AI to compute a mapping between a sentence in English describing a moral scenario and a corresponding sentiment/utility, how do you translate that into the AI's reward/utility function? You'd need to somehow also map encodings of imagined moral scenarios back and forth between encodings of observation histories.

I was thinking that the task of training the AI to classify human judgments would then lead to it building up a model of human values, similar to the way that training a system to do word prediction builds up a language / world model. You make a good point of the need to then ground those values further; I haven't really thought about that part.

Of course, "gradually training the AGI's values through an extended childhood" gets tricky if it turns out that there's a hard takeoff.

Yes. Once you get the AGI up to roughly human child level, presumably autodidactic learning could take over. Reading and understanding text on the internet is a specific key activity that could likely be sped up by a large factor.

So - then we need ways to speed up human interaction and supervision to match.

Looking at that map of representations of the DQN agent playing Space Invaders I can't help thinking if it really has learned any concepts that are similar to what a human would learn. Does the DQN agent have the concepts of ship, shield, bullet or space invader? Does it have anything that corresponds to the concept of a group of space invaders? Can it generalize? I'm sure human players could quickly adapt if we changed the game so that the ship would shoot from the top to the bottom instead. Does the DQN agent have anything analogous to an inner simulator? If we showed it a movie where the ship would fly up to the invaders and collide, what would it predict happens next?

My gut feeling is that artificial agents are still far away from having reusable and generalizable concepts. It's one thing, although an impressive one, to use the same framework with identical parameters for different DQN agents learning different games; it's another to use one framework for one agent that learns to play all the games and abstract concepts across them.

That's... rather the point of this research, no? Kaj is trying to make an original contribution here. The DQN agent does have an internal simulator, but is not designed to have a rigorous concept network like Kaj is describing here.

Yes. Also my thought. The Space Invaders concepts are alien (uh, kind of a pun), and this can be used to gain some intuition about how alien concepts work.

The concepts of the DQN agent are surely alien to us - because its world is alien. At least if you think of the world as what it is: a bunch of coordinates and movement patterns. The 2D visualization is just a rendering to allow humans to deal with it in familiar categories. And that is what we do: we use our pre-existing categories to play it well. Over time an experienced player will improve these toward what the DQN agent starts to acquire from scratch.

I'd bet that it is possible to map the DQN representations to player coordinate, enemy height and something that amounts to the aim of a shot. At least if the NN is deep and wide enough.
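One common way to test a bet like this is a linear probe: record hidden activations, then check whether a simple linear readout can recover a human-meaningful variable from them. The following is only a toy sketch under loud assumptions. The "hidden activations" are a made-up linear mixture (`MIXING`, `fake_hidden`), not the output of any real DQN, and all names are hypothetical.

```python
import random

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x

def fit_linear_probe(hidden, target, ridge=1e-3):
    """Ridge regression from hidden activations to one target variable."""
    d = len(hidden[0])
    A = [[sum(h[i] * h[j] for h in hidden) + (ridge if i == j else 0.0)
          for j in range(d)] for i in range(d)]
    b = [sum(h[i] * t for h, t in zip(hidden, target)) for i in range(d)]
    return solve(A, b)

def probe_predict(weights, h):
    return sum(w * v for w, v in zip(weights, h))

# ASSUMPTION: a fixed linear mixture stands in for a trained network's hidden layer.
MIXING = [(0.9, -0.4), (0.3, 0.8), (-0.5, 0.6)]

def fake_hidden(player_x, enemy_height):
    return [a * player_x + b * enemy_height for a, b in MIXING]

rng = random.Random(0)
states = [(rng.random(), rng.random()) for _ in range(200)]
hidden = [fake_hidden(x, h) for x, h in states]

# Try to read the player coordinate back out of the "alien" representation.
weights = fit_linear_probe(hidden, [x for x, _ in states])
max_error = max(abs(probe_predict(weights, hv) - x)
                for hv, (x, _) in zip(hidden, states))
```

In this toy the probe succeeds by construction, since the player coordinate is a linear function of the activations. With a real network, a low probe error would be evidence that the variable is linearly decodable, and a high error would leave the question open.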

Looking at that map of representations of the DQN agent playing Space Invaders, I can't help wondering whether it really has learned any concepts that are similar to what a human would learn.

The DQN agent has a much simpler visual system, suited to the low-complexity graphics of the Atari world. It also learns through supervised backprop on the RL signals, whereas the human cortex appears to learn through some more complex mix of RL and unsupervised learning (UL): more UL the closer one gets to the sensory stream, more RL as one moves up closer to the reward circuits.

The more complex vision ANNs trained on natural images do produce visual concepts (features) that are reasonably close to those found in various stages of the human visual system. It all depends on the training data and the model architecture.

At some point children believe that their parents make the sun rise in the morning. This worldview is more morally involved than learning some technical things about physical occupancy.

The concept of sovereign airspace is a real thing. Nations do expect overflights to have permission, and there are regulations about following air traffic control, etc. This works on a different level than private property, but it is an example where the previous concept of "owning the land" extends to "owning the air", even if it doesn't in the private property case.

We do not tell an AI that "classically you ought to stay in the box". We do some concrete talking to it, or coding of it, which we then abbreviate. When extending the concept to new conceptual areas, the AI falls back on the details. When a scientist's Newtonian sense of location is destroyed, he needs to reformulate what he means by space. I could imagine that the "extension safety" of various ways of telling an AI to stay in a box could vary a lot. If I say "We need to know what space you can affect, so we ask you to stay within the marked area over there", and there is an ostensive act in there, we do not actually refer to any Newtonian sense of space. Additionally, we give the AI tools to let us know if it is elsewhere. So when it expands the concepts behind the order, it has a basis for understanding the psychological dimension of the marking: "If I understand 'markings' this way, do the programmers still know where I am if I follow that sense of marking?" Now, there is a certain style of directing that adds a whole lot of these safety valves. But it would still be hard and cumbersome to be sure that every safety valve has been included in the direction.

In human-to-human interaction we do not expect to take each other literally all the time. In the military, the breakdown of the required "roundings" is called "white mutiny". If you start to do things strictly by the book, nothing really gets done. In the AI case it would be really handy if the AI could detect that asking it to sit in the corner is not supposed to hang on the details of Newtonian space. But if we allow it to interpret its commands, it is harder to rule out its interpreting them malevolently. However, this could be used as a safety feature too. If a human says "I am going to kill you", we are inclined to interpret it as a joke or rhetorical emphasis, not as an actual plan of murder, if no additional evidence in that direction is given. In the same way, if an AI "refused to understand" a command to blow up the world, that would in most cases be the desired behaviour.

But does quantum physics really imply that food has no location and physicists don't need to eat?

No, it's not the best example. But I for one am having trouble thinking of a more accurate one.

If you could easily think of a concept then it wouldn't be an alien concept.

QM makes all previous explanations for eating moot, and one might make the erroneous leap that if one's understanding of eating (in terms of places and things and so on) is wrong, then eating itself is meaningless. But this confuses the map with the territory.

A caveat which didn't fit the flow of the text:

I much enjoyed your posts so far Kaj, thanks for creating them.

I'd like to draw attention, in this particular one, to

Viewed in this light, concepts are cognitive tools that are used for getting rewards.

Very interesting post. I liked how your sequence of examples led from things we understand to the thing we're trying to understand.

Also, I recognized myself in the analogy of the scientists trying to redefine reality in order to fly :)

In most artificial RL agents, reward and value are kept strictly separate. In humans (and mammals in general), this doesn't seem to work quite the same way.

Rather, if there are things or behaviors which have once given us rewards, we tend to eventually start valuing them for their own sake.
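For readers less familiar with the RL side of this, here is a minimal sketch of what "kept separate" means in a typical agent: the reward signal is a fixed input, and value is a learned estimate of future reward. This is a generic TD(0) toy on an assumed three-state chain, not anything taken from the post; `REWARD`, `GAMMA`, `ALPHA` and `td0` are illustrative names.

```python
GAMMA = 0.9   # discount factor (assumed)
ALPHA = 0.5   # learning rate (assumed)

# Reward received on leaving each state of a 3-state chain 0 -> 1 -> 2 -> end.
# Only the last step is rewarded; this table is never modified by learning.
REWARD = [0.0, 0.0, 1.0]

def td0(episodes=200):
    """Learn state values with TD(0) while the reward table stays fixed."""
    value = [0.0, 0.0, 0.0]
    for _ in range(episodes):
        for s in range(3):
            next_value = value[s + 1] if s + 1 < 3 else 0.0
            td_error = REWARD[s] + GAMMA * next_value - value[s]
            value[s] += ALPHA * td_error
    return value

values = td0()
```

State 0 never yields any reward itself, yet its learned value approaches GAMMA squared (0.81) because it reliably leads to reward. The agent's reward table is untouched throughout, which is the separation the quoted passage contrasts with humans coming to value things for their own sake.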

It is my hope that this could also be made to extend to cases where the AI learns to think in terms of concepts that are totally dissimilar to ours. If it learns a new conceptual dimension, how should that affect its existing concepts? Well, it can figure out how to reclassify the existing concepts that are affected by that change, based on what kind of a classification ends up producing the most reward... when the reward function is defined over the old model.

For example, a utility function that assigns high value to "stay in the box" is of course probably a very bad idea due to perverse instantiations.

The challenge then shifts to specifying the correct initial utility function in terms of a poor initial world model and utility function that somehow develops into the correct long term utility function when blown up to superintelligence.

Agreed, this is a key problem.

In the concept learning approach - if I understand it correctly - we define the reward/utility function through a manual mapping of concepts -> utility examples. The resulting reward function can be learned or hand constructed, but either way it is defined by the example set which maps primitive concepts to utilities.
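To make that reading concrete, here is a deliberately crude sketch in which the reward function is nothing more than a lookup over a hand-written example set. Everything in it is an assumption for illustration: concepts are points in a made-up 2-D space, `UTILITY_EXAMPLES` is invented, and nearest-neighbour lookup is just one simple way to generalize from examples, not the approach the post commits to.

```python
import math

# ASSUMPTION: hand-written examples mapping points in a 2-D concept space
# to utilities. The coordinates and utilities are invented for illustration.
UTILITY_EXAMPLES = [
    ((1.0, 0.0), 1.0),    # a "good" concept
    ((0.0, 1.0), -1.0),   # a "bad" concept
    ((0.5, 0.5), 0.0),    # a neutral one
]

def reward(concept):
    """Return the utility of the labelled example nearest to `concept`."""
    _, utility = min(UTILITY_EXAMPLES,
                     key=lambda example: math.dist(example[0], concept))
    return utility

near_good = reward((0.9, 0.1))
near_bad = reward((0.1, 0.8))
```

The point of the toy is that the function's behaviour on new concepts is fixed entirely by the example set and the similarity measure, so a change in how concepts are represented changes the reward function even if no example is edited.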

The main scenario I had implicitly in mind had something resembling a "childhood" for the AI, where its power and intelligence would be gradually increased while it interacted with human programmers in a training environment and was given feedback on what was considered "good" or "bad", so that it would gradually develop concepts that approximated human morality as it tried to maximize the positive feedback.

Possibly even giving it a humanoid body at first, to further give a human-like grounding to its concepts

(Another potential problem with it is that the AI's values would become quite strongly shaped by those of its programmers, which not everyone would be likely to agree with.)

Yep - kind of unavoidable unless somebody releases the first advanced AGI for free. Even then, most people wouldn't invest the time to educate it and instill their values.

Another scenario I thought of would be to train the AI by something like the word embedding models, i.e. being given a vast set of moral judgments and then needing to come up with concepts simulating human moral reasoning in order to correctly predict the "right" judgments.

This really is the most realistic scenario for AGI in general, given the generality of the RL architecture.

Of course, "gradually training the AGI's values through an extended childhood" gets tricky if it turns out that there's a hard takeoff.

So say you train the AI to compute a mapping between a sentence in English describing a moral scenario and a corresponding sentiment/utility. How do you translate that into the AI's reward/utility function? You'd need to somehow also map back and forth between encodings of imagined moral scenarios and encodings of observation histories.
