Bridging syntax and semantics, empirically

by Stuart_Armstrong, 19th Sep 2018


Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

EDIT: I've found that using humans to detect intruders is a more illustrative example than the temperature example in this post. The intruder example can be found in another post.

This is a series of posts with the modest goal of showing how you can get syntax from semantics, solve the grounding problem, and start looking for models within human brains.

I think much of the work in this area has been focusing on the wrong question, looking at how symbols might be grounded in theory, rather than whether a particular symbol is well grounded for a particular concept. When Searle argues against a thermostat having beliefs about temperature, what is actually happening is that the thermostat's internal variables correlate poorly with temperature in general environments.

So, I'll start by presenting a derisorily simple solution to the symbol grounding problem, and then see what this means in practice:

  • A variable within an agent is a symbol for a variable in a set of environments, iff knowing the agent's variable allows one to predict the environment variable well within that set.

This could be measured, for example, by high mutual information between the two variables, or by low conditional entropy of the environment variable given the agent's variable.
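As a rough illustration of these two measures, here is a minimal sketch that estimates them from paired samples, one pair per environment in the set. The function names and the toy data (`env_temp`, `good_symbol`, `poor_symbol`) are hypothetical and chosen only for this example; the estimates are plain empirical frequencies, which is an assumption that only makes sense for discrete variables with enough samples.

```python
from collections import Counter
from math import log2


def entropy(samples):
    """Shannon entropy (in bits) of the empirical distribution of samples."""
    counts = Counter(samples)
    n = len(samples)
    return -sum((c / n) * log2(c / n) for c in counts.values())


def conditional_entropy(env_values, agent_values):
    """H(env | agent) = H(env, agent) - H(agent), from paired samples."""
    joint = list(zip(env_values, agent_values))
    return entropy(joint) - entropy(agent_values)


def mutual_information(env_values, agent_values):
    """I(env; agent) = H(env) - H(env | agent)."""
    return entropy(env_values) - conditional_entropy(env_values, agent_values)


# Hypothetical paired observations across a small set of environments.
env_temp = ["cold", "cold", "warm", "warm", "hot", "hot"]
good_symbol = ["low", "low", "mid", "mid", "high", "high"]  # tracks env_temp
poor_symbol = ["low", "high", "low", "high", "low", "high"]  # ignores env_temp

good_mi = mutual_information(env_temp, good_symbol)
poor_mi = mutual_information(env_temp, poor_symbol)
good_ce = conditional_entropy(env_temp, good_symbol)
poor_ce = conditional_entropy(env_temp, poor_symbol)

print(f"good symbol: I = {good_mi:.3f} bits, H(env|agent) = {good_ce:.3f} bits")
print(f"poor symbol: I = {poor_mi:.3f} bits, H(env|agent) = {poor_ce:.3f} bits")
```

In this toy set of environments, the first internal variable carries all the information about the environment variable and the second carries none. Changing the list of environments changes the answer, which is the point of keeping the set explicit.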

Why do I mention the set of environments? It's because any claim that an agent's variable is a symbol of an environment variable will almost always include an example environment in which that is the case. Then those arguing against that claim will often produce another environment in which the agent's variable fails to correlate with the environment variable, thus showing that the agent didn't have a "genuine" understanding of it. So lack of understanding is often demonstrated by error, which is an empirical standard. Thus keeping track of the environments that cause error - and those that don't - is important.

Variables that always move together

If you trained a neural net on images of black cats versus white dogs, you might think you're training an animal classifier, when you're really training a colour classifier. According to the definition above, the output variable of the neural net, in the training environment, counts as both a "symbol" for "black" and a symbol for "cat". But which is it?

That question has no real meaning in the training environment. We can label that variable "cat", or "black", or "mix of blackness and catness", and all are equally good. This might seem like a cheat - but remember that within the training environment, there is no such thing as a non-black cat or a non-cat black object. Hence "cat" and "black" are synonyms within the training environment.

In order to separate the labels, we need to present the neural net with a black dog, a white cat, or something else that allows distinctions to be made. That's the reason I talked about "extending definitions" and "web of connotations" in the previous post. Within the narrow setting of the training environment, "black" is in the web of connotations of "cat". In the more general environment of the current real world, it is not, but "has paws" and "gives birth to live young" are in the web of connotations. As is, to a lesser extent, "has fur".
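The black cat versus white dog point can be sketched with a toy stand-in for the trained net. Everything here is hypothetical: `colour_classifier` simply fires on black things (an assumption standing in for whatever the real net learned), and the two environments are hand-built lists rather than images.

```python
def colour_classifier(animal):
    """Hypothetical stand-in for the trained net: it fires on black things."""
    return animal["colour"] == "black"


def is_cat(animal):
    return animal["species"] == "cat"


def is_black(animal):
    return animal["colour"] == "black"


def agreement(environment, label):
    """Fraction of the environment on which the net's output matches a label."""
    hits = sum(colour_classifier(a) == label(a) for a in environment)
    return hits / len(environment)


# Training environment: only black cats and white dogs exist.
training_env = [
    {"species": "cat", "colour": "black"},
    {"species": "dog", "colour": "white"},
] * 50

# Extended environment: black dogs and white cats are added.
extended_env = training_env + [
    {"species": "dog", "colour": "black"},
    {"species": "cat", "colour": "white"},
] * 50

train_cat = agreement(training_env, is_cat)
train_black = agreement(training_env, is_black)
ext_cat = agreement(extended_env, is_cat)
ext_black = agreement(extended_env, is_black)

print("training env:", train_cat, train_black)
print("extended env:", ext_cat, ext_black)
```

Within the training environment the output agrees perfectly with both labels, so nothing distinguishes "cat" from "black". Only the extended environment separates them, and there this particular stand-in tracks "black" and is at chance on "cat".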

Note that in the past, "has fur" was more strongly in the web of connotations of "cat", but this connection has become weaker. Conversely, in the future, we may see things currently in the web moving out; for example, it's perfectly plausible that within a century or so, most or all cats will be created in artificial wombs.

But, returning to the neural net example, the classification failures show that no neural net, today, has an internal variable that corresponds well with "cat" in the real world environment.

Temperature

How well can agents represent temperature? Assume we have three "agents": a thermostat, a human, and some idealised superintelligent robot that is highly motivated to record the correct temperature. We have four internal mental variables:

  • The thermostat's variable: the value of the internal temperature variable in the thermostat, given by, say, a current coming in along a wire from a thermometer.
  • The human's feeling variable: the human sense of "this feels hot here".
  • The human's estimate variable: the estimate of a human who is highly motivated to report the correct temperature. They can make use of thermometers and similar tools.
  • The robot's variable: the variable inside the robot that measures temperature.

Consider the average temperature around the agent. The first thing to note is that the human feeling of "this feels hot here" is a poor predictor of that temperature, in almost any set of environments. It lacks discrimination power, and it can easily be led astray by fever, or chills, or having just moved from a hot area to a cold one (or vice versa). Things like anger are enough to get our body temperature to rise.

Why mention it at all, then? Because, in a sense, it is the original definition from which temperature itself derives. In the language of this post, the human feeling of heat was defined in typical environments, and temperature was a narrow extension of that definition - an extension that turned out to not map very well onto the original feeling, but has other things going for it, such as a wonderfully rigorous intensional definition.

What about the other variables? Well, let's start by specifying a very narrow set of environments, maybe within a lab setting. In this set, the thermostat's variable, the motivated human's estimate, and the robot's variable all correspond to the average temperature.

Let's generalise a bit more, to the set of all typical environments - environments which we wouldn't find particularly unusual. The human's estimate and the robot's variable are still going fine - the robot's is likely more precise than the human's, but they're both still pretty correlated with the temperature - but the thermostat's variable can have some problems.

For example, the thermostat's thermometer could be left in the sun, causing it to misread the temperature. If a human or robot were in charge of the thermometer, they could move it into the shade to get a correct reading, but the thermostat has no understanding of this, so it will read an overly high temperature. Similarly, if the wire into the thermostat were replaced by another wire, the thermostat's variable would diverge completely from the temperature.

If we consider the variable denoting the current in the wire going into the thermostat, then the correlation between the thermostat's variable and this current is much higher, in typical environments, than the correlation between the thermostat's variable and the temperature. In the narrow lab set, both correlations were almost perfect, and the current and the temperature were within each other's web of connotations. But they come apart in typical environments, so we can say that the thermostat is not "really" measuring temperature: measuring current is a much better description.
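The lab versus typical comparison can be sketched as a small simulation. All the numbers below are assumptions made up for illustration: the temperature range, the probability and size of the "thermometer left in the sun" offset, the probability of a swapped wire, and the small read-out noise. The thermostat's internal variable is modelled as the incoming current plus that noise.

```python
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5


def simulate(n, p_sun, p_swapped, rng):
    """Return (temperatures, wire currents, thermostat readings)."""
    temps, currents, readings = [], [], []
    for _ in range(n):
        temp = rng.uniform(0, 35)            # true average temperature
        sensed = temp
        if rng.random() < p_sun:
            sensed += rng.uniform(5, 20)     # thermometer left in the sun
        current = sensed                     # current encodes the sensed value
        if rng.random() < p_swapped:
            current = rng.uniform(0, 35)     # wire replaced: unrelated signal
        reading = current + rng.gauss(0, 0.1)  # thermostat's internal variable
        temps.append(temp)
        currents.append(current)
        readings.append(reading)
    return temps, currents, readings


rng = random.Random(0)  # fixed seed so the sketch is repeatable

# Narrow lab set: no sun, no swapped wires.
lab_t, lab_c, lab_r = simulate(5000, 0.0, 0.0, rng)
# Typical set: occasional sun exposure and occasional swapped wire.
typ_t, typ_c, typ_r = simulate(5000, 0.2, 0.1, rng)

lab_vs_temp = pearson(lab_r, lab_t)
lab_vs_current = pearson(lab_r, lab_c)
typ_vs_temp = pearson(typ_r, typ_t)
typ_vs_current = pearson(typ_r, typ_c)

print(f"lab:     temperature {lab_vs_temp:.3f}, current {lab_vs_current:.3f}")
print(f"typical: temperature {typ_vs_temp:.3f}, current {typ_vs_current:.3f}")
```

Under these assumptions, both correlations are close to perfect in the lab set, while in the typical set the reading stays tied to the current and loosens noticeably from the temperature.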

Next, let's allow for environments that are slightly adversarial: a rather dumb agent is trying to fool our agents. The human will generally try to protect the integrity of their measurements, as will the robot. The thermostat, on the other hand, is completely hopeless.

Beyond that, we allow a very intelligent (but not superintelligent) adversary. We expect, at this point, that the human's estimate will become decorrelated from the temperature, while the robot is sufficiently smart to see through the manipulations and keep its variable close to the temperature.

At this point, should we say that this set of environments demonstrates that humans fail the symbol grounding problem - that we don't understand what temperature really is? Perhaps. We certainly don't understand enough about our adversary to undo their manipulations and return to the true temperature estimate, so we are failing to understand something. But maybe if we were given the right information, we could correct for this, whereas there is no "right information" that would make the thermostat behave correctly. The human mind is limited, though, and as the intelligence of the adversary increased, we would find it harder and harder to even understand the tricks it was playing on us. It's possible that, against such an adversary, we truly don't understand temperature.

What about the fully general set of all environments? Given that the robot is a computable agent, there is certainly some sort of no free lunch theorem here, so in some subsets the robot will fail; we have reached the limits of even superintelligent understanding of temperature.

Natural and Good

In the previous example, the environment variable was the precisely defined temperature. It can be instructive to consider what happens when the environmental variable is itself more complicated to define.

  • Whether something is "natural".
  • Whether a human is "good".
  • Whether an agent is "good".

As articulated in a previous post, "natural" isn't well defined outside of very narrow sets of environments. A failure to understand a concept that doesn't make sense is not really a failure to understand.

The variable for whether a human is good is actually well-defined in a large class of environments. As long as we restrict "human" to meaning something very close to current homo sapiens, the web of connotations about "good" will be preserved, so we can try to classify behaviours that are good, kind, considerate, loyal, etc., and classify the human as good if they score highly on most of these. The picture will be similar to the temperature example above, except that a human's estimate of goodness will be much more complicated than their estimate of temperature, even if it doesn't feel that this is the case. But, when using this estimate and when confronted by a clever adversary, it feels more natural to say that humans just don't have an understanding of "good" in this circumstance.

The variable for whether an agent is good is more complicated; though its values are relatively clear when the agent is a human, when applying it to a general agent, we have multiple choices about how to extend it. We could go for the narrow/distilled "EAGood" that moves "good" closer to a temperature, or for a broad "MundaneGood" that tries to preserve the web of connotations around good.

So before we claim that an agent doesn't understand an environment variable in an unusual environment, we should first check that the variable is unambiguously defined in that environment.

Finding the variables

So far, I assumed the variables were given; but what if all we have is the agent's algorithm (or the agent itself) and need to infer their internal variables? And what about biased/incorrect beliefs? I'll look at those in a subsequent post.
