Finding the variables

by Stuart_Armstrong, 4th Mar 2019



Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

In a previous post on bridging syntax and semantics, I mentioned how to empirically establish that an agent's internal symbols represent variables in the environment: they do so if the internal symbols have high mutual information with the environmental variables. This basically asks whether you can find out about the values of the environmental variables by knowing the internal symbols. See also Luke Muehlhauser's mention of "representation" and the articles linked therein.
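As a rough illustration of the mutual-information check, here is a minimal sketch in Python. It assumes both variables are discrete and that we have a list of paired observations; the function name `mutual_information` and the toy data are hypothetical, chosen only for this example.

```python
import math
from collections import Counter

def mutual_information(pairs):
    """Empirical mutual information, in bits, between two discrete variables,
    estimated from a list of (internal, external) observations."""
    n = len(pairs)
    joint = Counter(pairs)
    internal = Counter(a for a, _ in pairs)
    external = Counter(b for _, b in pairs)
    mi = 0.0
    for (a, b), count in joint.items():
        p_ab = count / n
        # p_ab / (p_a * p_b), written in terms of raw counts
        ratio = p_ab * n * n / (internal[a] * external[b])
        mi += p_ab * math.log2(ratio)
    return mi

# Toy data: an internal symbol that always matches the environment,
# and one that is statistically independent of it.
perfect = [(0, 0), (1, 1)] * 50
independent = [(0, 0), (0, 1), (1, 0), (1, 1)] * 25

print(mutual_information(perfect))      # 1.0 bit: the symbol fully determines the variable
print(mutual_information(independent))  # 0.0 bits: the symbol tells us nothing
```

High mutual information here only says that knowing one variable tells you about the other; it says nothing about how either variable was chosen, which is the problem the rest of this post looks at.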

At the end of that post, I mentioned the problem of finding the internal variables if they were not given. This post will briefly look over that problem, and the related problem of finding the environmental variables.

Waterfall and variables in the world

Given an internal variable, it is almost certainly possible to find a variable in the outside world that correlates with it (even if we assume a Cartesian separation between the agent and the world, so we can't just do the lazy thing and set the external variable equal to the internal one).

In the example of detecting an intruder in a greenhouse, look at the internal variable of a guard who peers into the greenhouse to see an intruder.

Then we can certainly come up with an external variable that correlates with the guard's internal variable. This could be a variable that tracks whether there is an intruder in the greenhouse in situations where the guard can see it, and that also tracks all the issues that might fool the guard: mannequins, delusion-inducing gases, intruders disguised as tables, etc.

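But we don't even need the external variable to be anything like the variables that the guard's internal variable was 'supposed' to measure. If we have a chaotic system in the vicinity - say a nearby waterfall - then we can just list all the states of that system that happen when the guard's variable takes one value versus those that happen when it takes the other, and define the external variable to take the matching value in those states.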
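The waterfall construction can be sketched in a few lines of Python. This is only a toy under stated assumptions: the "waterfall" is simulated as tuples of random floats, the guard's variable is a made-up binary log, and the names `record_waterfall` and `define_by_lookup` are hypothetical.

```python
import random

def record_waterfall(n, rng, dims=8):
    """Stand-in for a chaotic system: n snapshots, each a tuple of random floats."""
    return [tuple(rng.random() for _ in range(dims)) for _ in range(n)]

def define_by_lookup(states, guard_values):
    """Gerrymandered 'external variable': a table mapping each observed
    waterfall state to whatever the guard's variable was at that moment."""
    return dict(zip(states, guard_values))

rng = random.Random(0)
n = 1000
guard = [rng.randint(0, 1) for _ in range(n)]  # hypothetical log of the guard's variable
states = record_waterfall(n, rng)
table = define_by_lookup(states, guard)

# Fraction of recorded moments where the table-defined variable matches the guard.
agreement = sum(table[s] == g for s, g in zip(states, guard)) / n
print(agreement)   # perfect agreement on the recorded data
print(len(table))  # but the definition needs one entry per observation
```

The agreement is perfect only because the table was built from the guard's own record; the definition is as large as the data it was fitted to, and it says nothing about waterfall states that were never recorded.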

That is a variant of Scott Aaronson's waterfall argument: if you have enough variety of states, and you can construct definitions of arbitrary complexity, then you can "ground" any model in these definitions. To avoid this, we have to penalise this definitional complexity: the definition is doing all the work here, and is itself a highly complicated algorithm to implement.

So pick the external variables so that:

  • the complexity of defining them is low, and
  • they have intrinsically relevant definitions, definitions that make sense without direct or indirect knowledge of the internal variables.

There are some edge cases of course - if a human's internal variable is their estimate of whether a swan is around, it might be useful to distinguish between the external variables "there is a swan" and "there is a white swan", as this tells us whether the human was conceptualising black swans as swans. But in general, the external variables should be defined by concepts that make sense on their own, and that don't take the internal variables into account.

Variables in the mind

Now assume that the external variables are something reasonable. What of the internal variables? Well, imagine a superintelligence had access to an agent's entire sensory input. If the superintelligence had a decent world model, it could use that input to construct a best estimate of the value of an external variable - and call that estimate, which is a function of the internal state of the agent, an internal variable. Even if we limited the superintelligence to only accessing some parts of the agent - maybe just the short term memory, or the conscious states - it could still construct an internal variable that is likely a far better correlate of the external one than anything the agent could construct or naturally has access to.

For example, if the external variable were temperature (as in this post), then an AI could deduce temperature information from human sensory data much better than our subjective "it feels kinda hot/cold in here".

So the internal variables should be selected according to criteria other than correlation with the external variables. For algorithms, we could look at named variables within them. For humans, we could also look at variables that correspond to names or labels (for example, when you ask a human "are you feeling hot?", what parts of the brain are triggered when that question is asked, and what parts correspond to the articulated answer being "yes").

Unless we are specifically interested in speech acts, we can't just say "the internal variable corresponds to the human answering 'yes' when asked about how hot they feel". Nevertheless, when attempting to define a "feeling of hotness" variable, we should be defining it with all our knowledge (and the human's knowledge) of what that means: for example, the fact that humans often answer 'yes' to that question when they do indeed feel hot.

So the internal variables should be defined by taking some concept and seeking to formalise how humans use it or implement it, not by correlating them with the external variables.

We can sometimes justify a more correlated internal variable, if the concept is natural for the human in question. For example, we could take a human and train them to estimate temperature. After a while, they will develop an internal temperature estimator which is more highly correlated with the actual temperature, but which corresponds naturally to something the human can consciously access; we could check this by, for example, getting the human to write down their temperature estimate.

We can also imagine another variable: an untrained human's explicit estimate of temperature. We'd expect this to be a bit better than the raw feeling of hotness, just because the human can explicitly take into account things like fever, or temperature acclimatisation. But it's not clear whether this estimate is really an intrinsic variable in the brain, or something constructed specifically by the human to answer that question at that moment.

Things can get more murky if we allow for unconscious feelings. Suppose someone has a relatively accurate gut instinct as to whether other people are trustworthy, but barely makes use of that instinct consciously. Then it's tricky to decide whether that instinct is a natural internal variable (which is highly correlated with trustworthiness), or an input into the human's conscious estimate (which is weakly correlated with trustworthiness).

Investigation, not optimisation

So this method is well suited to checking the correlations between internal variables and external ones, where both have been defined through some other process. It can answer questions like:

  • "Is a human's subjective feeling of heat a good estimate of temperature?" (not really).
  • "Is a trained human's temperature guess a good estimate of temperature?" (somewhat).
  • "Is a human's subjective feeling of there being someone else in the room a good estimate of the presence of an intruder?" (yes, very much so).
  • "Does this brain activity mean that the human detects an intruder?" (possibly).

But it all falls apart if we try to use the correlation as an optimisation measure, shifting the internal variable to better measure the external one, or vice-versa.

