Syntax vs semantics: alarm better example than thermostat

by Stuart_Armstrong, 4th Mar 2019

Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

I had a post on empirically bridging syntax and semantics. It used the example of temperature, building on McCarthy and Searle's dispute about the beliefs of thermostats.

But temperature wasn't an ideal illustration of my points, as human temperature sensitivity is not very fine-grained, so I'm presenting a better example here: detecting an intruder.

Internal and external variables

The external variable is $x$, a boolean which corresponds to whether there is any human in a certain initially empty greenhouse.

There are five different "agents" with internal variables $X_i$:

  • An alarm on the door of the greenhouse, which goes off if a circuit is broken by the door being opened; internal variable $X_d$.
  • A heat-detecting camera that starts an alarm if there is something vaguely human-sized and human-temperature inside the greenhouse (which is made out of sapphires, obviously); internal variable $X_h$.
  • A motivated human guard who periodically looks into the greenhouse; internal variable $X_g$.
  • A resourceful human with a lot of time and money, solely dedicated to detecting any intrusion into the greenhouse; internal variable $X_r$.
  • A superintelligent robot version of the resourceful human; internal variable $X_s$.

Then all the $X_i$ correlate well with $x$ in a lot of circumstances. If a passerby or a naive burglar gets into the greenhouse, they will trigger the door alarm and the heat alarm, while the guard, the resourceful human, and the robot will all see the intruder.

It is, however, pretty easy to fool the door alarm: simply go through a window. Conversely, someone could open the door without entering (or the wind or an earthquake could do so), causing the alarm to trigger with no-one in the greenhouse. So $X_d$ and $x$ are correlated in a relatively narrow set of environments $E_d$. And if we consider instead the variable $y$, "the electric circuit that goes through the door is unbroken", then it's clear that $X_d$ and $y$ are much better correlated than $X_d$ and $x$; if there's a semantic meaning to $X_d$, then it's far closer to $y$ than it is to $x$.
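The contrast between the two candidate meanings of the door alarm can be checked on a toy example. The sketch below is only an illustration: the four scenarios, the `Scenario` class and the helper names are hypothetical, chosen to mirror the cases in the paragraph above. Since the circuit variable is phrased as "unbroken", the alarm state is compared against its negation, "circuit broken".

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Scenario:
    """One hypothetical state of the world (illustrative only)."""
    name: str
    human_inside: bool  # the external variable: is a human in the greenhouse?
    door_opened: bool   # opening the door breaks the circuit


SCENARIOS = [
    Scenario("empty greenhouse", human_inside=False, door_opened=False),
    Scenario("naive burglar uses the door", human_inside=True, door_opened=True),
    Scenario("burglar climbs through a window", human_inside=True, door_opened=False),
    Scenario("wind blows the door open", human_inside=False, door_opened=True),
]


def door_alarm(s: Scenario) -> bool:
    """Internal variable of the door alarm: it rings iff the circuit is broken."""
    return s.door_opened


def human_inside(s: Scenario) -> bool:
    """The external variable we would like the alarm to track."""
    return s.human_inside


def circuit_broken(s: Scenario) -> bool:
    """Negation of 'the circuit through the door is unbroken'."""
    return s.door_opened


def agreement(internal, external, scenarios):
    """Names of the scenarios in which the two variables take the same value."""
    return [s.name for s in scenarios if internal(s) == external(s)]


print("alarm vs human inside:  ", agreement(door_alarm, human_inside, SCENARIOS))
print("alarm vs circuit broken:", agreement(door_alarm, circuit_broken, SCENARIOS))
```

In this toy setup the alarm agrees with "a human is inside" in only two of the four scenarios, but with the state of the circuit in all four, which is the sense in which its meaning is closer to the circuit than to the intruder.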

The heat-camera can also be fooled. Simply spray the lens with some infrared-opaque paint, then enter at your leisure. For the converse, a bear wandering in could trigger the alarm. Still, it seems clear that $X_h$ is correlated with $x$ in a much wider set of environments, $E_h$.

The human guard is hard to fool in either direction. We humans are very good at figuring out when other humans are around, so, assuming the guard is moderately attentive, tricking the guard in either direction requires a lot of work - though it is probably easier to trigger a false positive (the guard mistakenly thinks that there's a person in the greenhouse) than a false negative (the guard doesn't notice someone actually in the greenhouse). Confusing or overwhelming the guard becomes possible for intelligent adversaries. Still, the set of environments $E_g$ where $X_g$ is correlated with $x$ is much larger.

The resourceful human is even harder to fool, because they have all the advantages of the guard, plus any extra precautions they may have taken (such as adding alarms, cameras, crowds of onlookers, etc.). So $E_r$ is larger still.

Finally, bringing in a superintelligence really extends the accuracy of $X_s$, even against intelligent adversaries, so $E_s$ is again much larger than any of the previous sets of environments.

Not strict inclusion, not perfect correlation

The agents above form a hierarchy: every one of them has a much larger set of environments where $X_i$ is correlated with $x$ than any of the agents before it.

But these are not strict inclusions: no agent's set of environments fully contains the sets of the agents before it. If someone sprays the heat-sensitive camera but then walks in through the door, the door-alarm will detect the intrusion even as the camera misses it. If someone disguises themselves as a table, they might be able to fool the guard but be caught by the camera. The resourceful human has their own personality, so there might be some manipulation of them that would fall flat for the guard.
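The "larger, but not a superset" relation can be made concrete with ordinary sets. The sketch below is a toy illustration under stated assumptions: the seven environments and every agent reading in the table are hypothetical, and "correlated" is simplified to "takes the same value as the external variable". The keys `x`, `X_d`, `X_h` and `X_g` stand for the external variable and the internal variables of the door alarm, heat camera and guard.

```python
# Hypothetical environments: the true value x and each agent's reading.
# All readings are illustrative assumptions, not data.
ENVIRONMENTS = {
    "empty greenhouse":
        {"x": False, "X_d": False, "X_h": False, "X_g": False},
    "naive burglar uses the door":
        {"x": True, "X_d": True, "X_h": True, "X_g": True},
    "burglar climbs through a window":
        {"x": True, "X_d": False, "X_h": True, "X_g": True},
    "camera sprayed, intruder uses the door":
        {"x": True, "X_d": True, "X_h": False, "X_g": True},
    "intruder disguised as a table, via window":
        {"x": True, "X_d": False, "X_h": True, "X_g": False},
    "wind blows the door open":
        {"x": False, "X_d": True, "X_h": False, "X_g": False},
    "bear climbs in through a window":
        {"x": False, "X_d": False, "X_h": True, "X_g": False},
}


def agreeing_environments(agent: str) -> set:
    """Set of environments in which the agent's variable equals x."""
    return {name for name, v in ENVIRONMENTS.items() if v[agent] == v["x"]}


# E[agent] plays the role of that agent's set of "good" environments.
E = {agent: agreeing_environments(agent) for agent in ("X_d", "X_h", "X_g")}

for agent, envs in E.items():
    print(agent, len(envs), sorted(envs))

# Environments where an earlier agent is right and a later one is wrong.
print("door alarm right, camera wrong:", sorted(E["X_d"] - E["X_h"]))
print("camera right, guard wrong:     ", sorted(E["X_h"] - E["X_g"]))
```

In this toy table the sets grow along the hierarchy, yet each set difference is non-empty, so no set is contained in the next.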

And finally, even a superintelligence is computable, so the No Free Lunch theorems imply that there are some stupidly complicated environments in which $X_d$, $X_h$, $X_g$, and $X_r$ are all equal to $x$, but $X_s$ is not.

Since no computable agent can have a perfect correlation with the variable in question, there is a sense in which no symbol can be perfectly grounded (this gets even more obvious when you start slicing into the definition, and start wondering about the meanings of "human" and "a certain greenhouse" in $x$).

But, despite the lack of perfect inclusion and perfect correlation, there is a strong sense in which the later agents are better correlated than the earlier ones. Assume that we have a sensible computer language to pick a complexity prior in, and update on the world being roughly as we believe it to be. Then I'd be willing to wager that the posterior probabilities of the sets of environments in which there are correlations will be ordered:

  • $P(E_d) < P(E_h) < P(E_g) < P(E_r) < P(E_s)$.
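One way to make this wager precise is sketched below. The notation is an assumption for illustration rather than something fixed by the text: $e$ is an environment, $\ell(e)$ is the length of its shortest description in the chosen language, $O$ stands for our observations of the world, and $E_i$ is the set of environments in which agent $i$'s internal variable tracks the external one.

```latex
P(e) \propto 2^{-\ell(e)},
\qquad
P(e \mid O) = \frac{P(O \mid e)\, P(e)}{\sum_{e'} P(O \mid e')\, P(e')},
\qquad
P(E_i \mid O) = \sum_{e \in E_i} P(e \mid O).
```

Under this reading, the posterior weight of a set can grow along the hierarchy without any set being contained in the next, as long as the environments that separate two sets carry little posterior weight.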
