Perceptual Entropy and Frozen Estimates

**Feedback:**

Need an example? Sure! I have two dice, and they can each land on any number, 1-6. I’m assuming they are fair, so each has probability of 1/6, and the logarithm (base 2) of 1/6 is about -2.585. There are 6 states, so the total is 6* (1/6) * 2.585 = 2.585. (With two dice, I have 36 possible combinations, each with probability 1/36, log(1/36) is -5.17, so the entropy is 5.17. You may have notices that I doubled the number of dice involved, and the entropy doubled – because there is exactly twice as much that can happen, but the average entropy is unchanged.) If I only have 2 possible states, such as a fair coin, each has probability of 1/2, and log(1/2)=-1, so for two states, (-0.5*-1)+(-0.5*-1)=1. An unfair coin, with a ¼ probability of tails, and a ¾ probability of heads, has an entropy of 0.81. Of course, this isn’t the lowest possible entropy – a trick coin with both sides having heads only has 1 state, with entropy 0. So unfair coins have lower entropy – because we know more about what will happen.

I've had to calculate information entropy for a data compression course, so I felt like I already knew the concepts you were trying to explain here, but I was not able to follow your explanation at all.

the logarithm (base 2) of 1/6 is about -2.585. There are 6 states, so the total is 6* (1/6) * 2.585 = 2.585.

The total what? Total entropy for the two dice that you have? For just one of those two dice? log(1/6) is a negative number, so why do I not see any negative numbers used in your equation? There are 6 states, so I guess that sort of explains why you're multiplying some figure by 6, but why are you dividing by 6?

If I only have 2 possible states, such as a fair coin, each has probability of 1/2, and log(1/2)=-1, so for two states, (-0.5*-1)+(-0.5*-1)=1.

Why do you suddenly switch from the notation 1/2 to the notation 0.5? Is that significant (they're referring to different concepts that coincidentally happen to have equal values)? If they actually refer to the same value, why do we have the positive value 1/2, but negative value -0.5?

**Suggestion:**

- Do fair coin first, then fair dice, then trick coin.
- Point out that a fair coin has 2 outcomes when flipped, each with equal probability, so it has entropy [-1/2 log2(1/2)] + [-1/2 log2(1/2)] = (1/2) + (1/2) = 1.
- Point out a traditional fair die has 6 outcomes when rolled, each of equal probability, and so it has entropy ∑n=1 to 6 of -1/6 log2(1/6) =~ 6 * -1/6 * -2.585 = 2.585.
- Point out that a trick coin that always comes up heads has 1 outcome when flipped, so it has entropy -1 log2(1/1) = 0.
- Point out that an unfair coin that comes up heads 75% of the time has entropy [-3/4 log2(3/4)]+[-1/4 log2(1/4)] =~ 0.311 + 0.5 = 0.811.
- Consistently use the same notation for each example (I sort of got lazy and used ∑ for the dice to avoid writing out a value 6 times). In contrast, do not use `6 * (1/6) * 2.585 = 2.585` for one example (where all the factors are positive) and then `(-0.5*-1)+(-0.5*-1)=1` for another example (where we rely on pairs of negative factors to become positive).

A little nitpicking about the "2 dice" thing: usually when you throw two dice, it doesn't matter which die gives which result. Sure, you could use colored dice and have "blue 2, red 3" be different from "blue 3, red 2", but that's very rarely the case. Usually you take the sum (or look for patterns like doubles), so "2, 3" and "3, 2" are equivalent, and in that case the entropy isn't doubled, but lower.

What you wrote is technically right, but it goes against the common usage of dice, so it would be worth adding a footnote or clarification about that, IMHO.
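To put rough numbers on this point, here is a minimal sketch that compares three ways of reading the same roll of two fair dice: as an ordered (blue, red) pair, as an unordered pair, and as a sum. The helper names `entropy` and `distribution` are made up for this illustration and do not come from the post.

```python
from collections import Counter
from itertools import product
from math import log2


def entropy(probabilities):
    """Shannon entropy, in bits, of a discrete probability distribution."""
    return sum(p * log2(1 / p) for p in probabilities if p > 0)


def distribution(outcomes):
    """Probabilities of each distinct value in a list of equally likely outcomes."""
    counts = Counter(outcomes)
    total = sum(counts.values())
    return [count / total for count in counts.values()]


# All 36 equally likely ordered rolls of two fair six-sided dice.
rolls = list(product(range(1, 7), repeat=2))

ordered = entropy(distribution(rolls))
unordered = entropy(distribution([tuple(sorted(roll)) for roll in rolls]))
summed = entropy(distribution([a + b for a, b in rolls]))

print(f"ordered pairs:   {ordered:.3f} bits")
print(f"unordered pairs: {unordered:.3f} bits")
print(f"sum of the dice: {summed:.3f} bits")
```

Under these assumptions the ordered pair carries about 5.17 bits, the unordered pair about 4.34 bits, and the sum about 3.27 bits, so entropy only doubles when the two dice are told apart.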

I wanted to avoid going too deep into that example - the other LW and linked posts are better, but I wanted to at least introduce it.

Thanks for the feedback.

It sounds like you're talking about informational entropy, and the model you're describing sounds -very- similar to a neural net, which uses something like entropy to arrive at conclusions. Have you investigated either of these topics, and/or am I misreading you?

I am talking about informational entropy, but using the analogy to pressure as an intuition pump. This applies whether we are using neural nets or other PGMs. In fact, I was thinking of causal BNs as ideal approximations of human rationality, and noting the additional fact that there seems to be some cost to maintaining uncertainty that seems to fit the earlier analogy. (Sorry if I'm unclear - I will reread and try to clarify when I'm not on my phone.)

I like the article.

I remember temperature being used in this way when speaking about Monte Carlo algorithms. At high temperature the algorithm searches for solutions that are very different from the current solution. At low temperature the algorithm searches smaller variations of the current solution. Lowering the temperature at which such an algorithm operates doesn't change the amount of calculations per minute that might in your model be energy.

Similarly I don't believe that "energy" is a limiting factor for changing the temperature in human belief formation. In Zen Buddhism, keeping the temperature high is labeled as keeping "beginner's mind". Human brains are capable of that feat but don't do so automatically.

At high temperature the algorithm searches for solutions that are very different from the current solution. At low temperature the algorithm searches smaller variations of the current solution.

The technique is known as simulated annealing.
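For readers who have not met it, here is a minimal sketch of simulated annealing on a one-dimensional cost function. Everything specific here is an assumption chosen for illustration: the `bumpy` cost function, the geometric cooling schedule, the default parameters, and the choice to tie the proposal step size to the temperature.

```python
import math
import random


def acceptance_probability(delta, temperature):
    """Metropolis rule: always accept an improvement (delta <= 0); accept a
    worse candidate with probability exp(-delta / temperature)."""
    if delta <= 0:
        return 1.0
    return math.exp(-delta / temperature)


def anneal(cost, start, steps=2000, t_start=5.0, t_end=0.01, seed=0):
    """Return the lowest-cost point seen while cooling from t_start to t_end."""
    rng = random.Random(seed)
    current = best = start
    for i in range(steps):
        # Geometric cooling schedule (an illustrative choice).
        temperature = t_start * (t_end / t_start) ** (i / (steps - 1))
        # Hot: candidates far from the current point. Cold: small variations.
        candidate = current + rng.gauss(0, temperature)
        delta = cost(candidate) - cost(current)
        if rng.random() < acceptance_probability(delta, temperature):
            current = candidate
        if cost(current) < cost(best):
            best = current
    return best


def bumpy(x):
    """Hypothetical cost function with many local minima."""
    return x * x + 3 * math.sin(5 * x)


result = anneal(bumpy, start=4.0)
print(f"best x = {result:.3f}, cost = {bumpy(result):.3f}")
```

The same worse move is far more likely to be accepted when hot than when cold, which is the sense in which a high temperature keeps the search from settling early.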

I will push back, and say that I intuitively am fairly sure that there is *some* cost to doing this. This is because uncertainty makes the mental models more complex, and if there is no computational price for that complexity, but brains are simulate-able by Turing machines, I'm intuiting some deep problems with the theory.

But I think that beginner's mind is exactly the sort of thing that is being discussed - in the analogy, it's equivalent to keeping the total energy level of the system high, instead of withdrawing the energy and letting the system settle into a frozen state prematurely. The technique does require a change in how you process things, which can be done.

Lastly, /u/Lumifer noted that this is simulated annealing - I'm glad that there are people here who understand this stuff well enough to draw these connections; I've used simulated annealing, but I didn't understand it. (I should make myself the source of all useful knowledge...)

This is because uncertainty makes the mental models more complex, and if there is no computational price for that complexity

I don't see an intuitive reason that adding information to a mental model should be more expensive than deleting information.

There are a bunch of limiting beliefs that I consider to be bad but I still have them in my mind. I think I know how to go and cut them. It's not that I lack energy to do so nor the ability to do so. The problem is rather that I worry about mental stability when I would go through a list of 100 fundamental beliefs that have effects on my action and cut them all out.

This clarified something for me, so I appreciate the discussion!

Changing the model is like a phase transition cost - any change in the state will require some cost, whether it is simplifying or complexifying the model. So you're right that I have no clear reason to think that "adding information to a mental model should be more expensive than deleting information" - though it's an interesting question.

But the cost of utilizing a model, as opposed to modifying it, is always lower with lower entropy models. A meta-model with uncertainty about which of 3 possible models to use is necessarily more complex to compute the result of than any of the 3 underlying models. Minds cannot have arbitrarily complex representations that they constantly use, because it would take forever to decide anything.

Your last point, about mental stability, is not clear to me yet - if you want to expand on it, that would be great.

Take a limiting belief that a lot of people hold: "I shouldn’t have to ask my partner for what I want."

What happens if you suddenly drop that belief? You find yourself in the challenging situation of putting what you want into words. You actually have to admit to your desires. It might hurt more when you ask explicitly for what you want and your partner then doesn't give it to you. That's all uncertainty.

Imagine that a person gets rid of the belief "I don't deserve to be happy." Getting rid of such a belief has actual consequences.

Both of those beliefs aren't ones I hold, but there are tons of beliefs in that category that still linger in my mind. Deleting 100 of them could change a lot and produce a lot of chaos.

Those are changing your beliefs in a complex situation. But you're not simplifying any model - you are substituting one model for another; "I shouldn’t have to ask my partner for what I want" is replaced by "I should have to ask my partner for what I want." There is no reason to think that that is a lower entropy model - it's just a new model instead of the old one. The question is whether the model "I'm uncertain whether I need to ask my partner for what I want or not" is more or less complex than "I shouldn’t have to ask my partner for what I want."

you are substituting one model for another; "I shouldn’t have to ask my partner for what I want" is replaced by "I should have to ask my partner for what I want."

Given my experiences with working with beliefs I don't think that's the case.

Unfortunately most of the evidence I have for that is personal experience and talking to other people with personal experience so I can't link you to a resource making that point.

This image sequence is strange. Or maybe I am strange. By now I have played with it quite a bit. And for me it is not easy to see the man in the picture until image 7. The image stops clearly looking like a man at about image 4. And going in reverse, the woman stops looking clearly like a woman at about panel six. I can follow the man along longer, but it means consciously disregarding the other aspects. In a meditation-like state I can perceive the whole of the man in image 7 and almost in 8, and likewise in reverse. But I have to defocus, not stare, and be more dreamy.

Note that I can flip this http://www.youramazingbrain.org.uk/images/supersenses/young_or_old.jpg and this http://www.youramazingbrain.org.uk/images/supersenses/necker_cube.gif (the latter takes about 3 seconds to flip).

The trick is to look at them in sequence, one at a time; an animated gif might have been better in some ways.

In either case, the image works better on one line rather than 2, which I have, but can't upload. When I cross-post this to the LA Rationality Blog (once it's working), the image will be on one line.

Of course I did look at them in sequence - otherwise I have even more difficulty seeing either. Jumping back and forth. But the key point is that for me the overlapping/freezing you mention doesn't seem to occur.

For me, the pictures in the OP stop being a man at around panel 6; going back, they stop being a woman at around 4. I can flip your second example by unfocusing and refocusing my eyes, but in your first example I can't for the life of me see anything other than a young woman looking away from the camera (I'm assuming there is an old woman in there somewhere based on the image name).

Could you give a hint as to how to flip it? I'm assuming the ear turns into an eye or something, but I've been trying for about half an hour now and it is annoying the crap out of me.

- The young woman's ear is the old woman's left eye.
- The young woman's chin is the old woman's nose.
- The young woman's choker necklace is the old woman's mouth.

The old woman is looking down.

A line drawing version might be easier.

Ah yes, that did it (and I think I have seen the line drawing before) but it still takes a serious conscious effort to see the old woman in either of those. Maybe some Freudian thing where my mind prefers looking at young girls over old women :P

## A Preface

During the 1990s, a significant stream of research existed around how people process information, which combined very different streams in psychology and related areas with explicit predictive models about how actual cognitive processes differ from the theoretical ideal. This is not only the literature by Kahneman and Tversky about cognitive biases, but includes research about memory, perception, scope insensitivity, and other areas. The rationalist community is very familiar with some of this literature, but fewer are familiar with a masterful synthesis produced by Richards Heuer for the intelligence community in 1999[1], which was intended to start combating these problems, a goal we share. I’m hoping to put together a stream of posts based on that work, potentially expanding on it, or giving my own spin – but I encourage reading the book itself, Psychology of Intelligence Analysis (PDF), as well[2]. (This essay is based on Chapter 3.)

This will hopefully be my first set of posts, so feedback is especially welcome, both to help me refine the ideas, and to refine my presentation.

## Entropy, Pressure, and Metaphorical States of Matter

Eliezer recommends updating incrementally but has noted that it’s hard. The central point, that it is hard to do so, is one that some in our community have experienced and explicated, but there is deep theory I’ll attempt to outline, via an analogy, that I think explains how and why it occurs. The problem is that we are quick to form opinions and build models, because humans are good at pattern finding. We are less quick to discard them, due to limited mental energy. This is especially true when the pressure of evidence doesn’t shift overwhelmingly and suddenly.

I’ll attempt to answer the question of how this is true by stretching a metaphor and creating an intuition pump for thinking about how our minds might perform some of their thinking under uncertainty.

## Frozen Perception

Heuer describes a stream of research about perception, and notes that “once an observer has formed an image – that is, once he or she has developed a mind set or expectation concerning the phenomenon being observed – this conditions future perceptions of that phenomenon.” This seems to follow a standard Bayesian practice, but in fact, as Eliezer noted, people fail to update. The following set of images, which Heuer reproduced from a 1976 book by Robert Jervis, shows exactly this point:

Looking at each picture, starting on the left, and moving to the right, you see a face slowly change. At what point does the face no longer seem to appear? (Try it!) For me, it’s at about the seventh image that it’s clear it morphed into a sitting, bowed figure. But what if you start at the other end? The woman is still clearly there long past the point where we see a face, starting in the other direction. What’s going on?

We seem to attach too strongly to our first approach, decision, or idea. Specifically, our decision seems to “freeze” once it gets to one place, and needs much more evidence to start moving again. This has an analogue in physics, to the notion of freezing, which I think is more important than it first appears.

## Entropy

To analyze this, I’ll drop into some basic probability theory, and physics, before (hopefully) we come out on the other side with a conceptually clearer picture. First, I will note that cognitive architecture has some way of representing theories, and implicitly assigns probabilities to various working theories. This is some sort of probability distribution over sample theories. Any probability distribution has a quantity called entropy[3], which is simply the probability of each state, multiplied by the logarithm of that probability, summed over all the states. (The probability is less than 1, so the logarithm is negative, but we traditionally flip the sign so entropy is a positive quantity.)

Need an example? Sure! I have two dice, and they can each land on any number, 1-6. I’m assuming they are fair, so for a single die each face has probability 1/6, and the logarithm (base 2) of 1/6 is about -2.585. There are 6 states, so, after flipping the sign, the total entropy of one die is 6 * (1/6) * 2.585 = 2.585 bits. (With two dice, I have 36 possible combinations, each with probability 1/36, and log(1/36) is about -5.17, so the entropy is 5.17. You may have noticed that I doubled the number of dice involved, and the entropy doubled – because there is exactly twice as much that can happen, but the entropy per die is unchanged.) If I only have 2 possible states, such as a fair coin, each has probability 1/2, and log(1/2) = -1, so for two states the entropy is (-0.5*-1)+(-0.5*-1)=1. An unfair coin, with a ¼ probability of tails, and a ¾ probability of heads, has an entropy of 0.81. Of course, this isn’t the lowest possible entropy – a trick coin with both sides having heads only has 1 state, with entropy 0. So unfair coins have lower entropy – because we know more about what will happen.
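The arithmetic in these examples can be checked with a few lines of code. This is a minimal sketch; the function name `entropy` and the dictionary of example distributions are just labels chosen for this illustration.

```python
from math import log2


def entropy(probabilities):
    """Shannon entropy in bits: the sum over states of p * log2(1 / p)."""
    return sum(p * log2(1 / p) for p in probabilities if p > 0)


examples = {
    "one fair die": [1 / 6] * 6,
    "two fair dice": [1 / 36] * 36,
    "fair coin": [1 / 2, 1 / 2],
    "unfair coin (3/4 heads)": [3 / 4, 1 / 4],
    "two-headed trick coin": [1.0],
}

for name, probabilities in examples.items():
    print(f"{name}: {entropy(probabilities):.3f} bits")
```

Writing each term as p * log2(1 / p) is the same as -p * log2(p), and it keeps every term non-negative, which avoids the sign juggling in the prose version.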

## Freezing, Melting, and Ideal Gases under Pressure

In physics, this has a deeply related concept, also called entropy, which, in the form we see on a macroscopic scale, is closely tied to temperature. If you remember your high school science classes, temperature is a description of how much molecules move around. I’m not a physicist, and this is a bit simplified[4], but the entropy of an object is how uncertain we are about its state – gases expand to fill their container, and the molecules could be anywhere, so they have higher entropy than a liquid, which stays in its container, which still has higher entropy than a solid, where the molecules don’t move much, which still has higher entropy than a crystal, where the molecules are sort of locked into place.

This partially lends intuition to the third law of thermodynamics; “the entropy of a perfect crystal at absolute zero is exactly equal to zero.” In our terms above, it’s like that trick coin – we know exactly where everything is in the crystal, and it doesn’t move. Interestingly, a perfect crystal at 0 Kelvin cannot exist in nature; no finite process can reduce entropy to that point; like infinite certainty, infinitely exact crystals are impossible to arrive at, unless you started there. So far, we could build a clever analogy between temperature and certainty, telling us that “you’re getting warmer” means exactly the opposite of what it does in common usage – but I think this is misleading[5].

In fact, I think that information in our analogy doesn’t change the temperature; instead, it reduces the volume! In the analogy, gases can become liquids or solids either by lowering temperature, or by increasing pressure – which is what evidence does. Specifically, evidence constrains the set of possibilities, squeezing our hypothesis space. The phrase “weight of evidence” is now metaphorically correct; it will actually constrain the space by applying pressure.
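As a rough illustration of evidence squeezing a hypothesis space, here is a minimal sketch of a Bayesian update over a handful of candidate coin biases. The five candidate biases, the uniform prior, and the run of eight heads are all assumptions made up for this example, and the function names are hypothetical.

```python
from math import log2


def entropy(probabilities):
    """Shannon entropy, in bits, of a discrete probability distribution."""
    return sum(p * log2(1 / p) for p in probabilities if p > 0)


def update(prior, likelihoods):
    """Bayes' rule: weight each hypothesis by its likelihood, then renormalize."""
    unnormalized = [p * l for p, l in zip(prior, likelihoods)]
    total = sum(unnormalized)
    return [u / total for u in unnormalized]


# Hypothetical hypothesis space: five candidate values for P(heads).
biases = [0.1, 0.3, 0.5, 0.7, 0.9]
belief = [1 / len(biases)] * len(biases)  # uniform prior

history = [entropy(belief)]
for flip in "HHHHHHHH":  # made-up evidence: eight heads in a row
    likelihoods = [b if flip == "H" else 1 - b for b in biases]
    belief = update(belief, likelihoods)
    history.append(entropy(belief))

print(" -> ".join(f"{h:.2f}" for h in history))
```

With this particular evidence the entropy of the belief falls from about 2.32 bits to well under 1 bit as the probability mass is pushed onto the 0.9 hypothesis. A single surprising observation can temporarily raise entropy, so the squeeze is a tendency of accumulating evidence rather than a guarantee at every step.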

I think that by analogy, this explains the phenomenon we see with perception. While we are uncertain, information increases pressure, and our conceptual estimate can condense from uncertain to a relatively contained liquid state – not because we have less probability to distribute, but because the evidence has constrained the space over which we can distribute it. Alternatively, we can settle on a lower energy state on our own, unassisted by evidence. If our minds too-quickly settle on a theory or idea, the gas settles into a corner of the available space, and if we fail to apply enough energy to the problem, our unchallenged opinion can even freeze into place.

Our mental models can be liquid, gaseous, or frozen in place – either by our prior certainty, our lack of energy required to update, or an immense amount of evidential pressure. When we look at those faces, our minds settle into a model quickly, and once there, fail to apply enough energy to re-evaporate our decision until the pressure of the new pictures is relatively immense. If we had started at picture 3 or 6, we could much more easily update away from our estimates; our minds are less willing to let the cloud settle into a puddle of probable answers, much less freeze into place. We can easily see the face, or the woman, moving between just these two images.

When we begin to search for a mental model to describe some phenomenon, whether it be patterns of black and white on a page, or the way in which our actions will affect a friend, I am suggesting we settle into a puddle of likely options, and when not actively investing energy into the question, we are likely to freeze into a specific model.

## What does this approach retrodict, or better, forbid?

Because our minds have limited energy, the process of maintaining an uncertain stance should be difficult. This seems to be borne out by personal and anecdotal experience, but I have not yet searched the academic literature to find more specific validation.

We should have more trouble updating away from a current model than we do arriving at that new model from the beginning. As Heuer puts it, “Initial exposure to… ambiguous stimuli interferes with accurate perception even after more and better information becomes available.” He notes that this was shown in Bruner and Potter, 1964, “Interference in Visual Recognition,” and that “the early but incorrect impression tends to persist because the amount of information necessary to invalidate a hypothesis is considerably greater than the amount of information required to make an initial interpretation.”
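One way to make this asymmetry concrete is a toy hysteresis model. To be clear, this is my own sketch and not something from Heuer or from Bruner and Potter: it assumes eight images whose evidence for "woman" rises linearly from 0 to 1, and an observer who only abandons the current interpretation once evidence for the alternative exceeds some threshold. All names and numbers are hypothetical.

```python
N_IMAGES = 8

# Assumed evidence for "woman" in images 1..8, rising linearly from 0 to 1.
woman_evidence = [i / (N_IMAGES - 1) for i in range(N_IMAGES)]
man_evidence = [1 - w for w in woman_evidence]


def first_switch(evidence_for_new, threshold):
    """1-based viewing position at which evidence for the new interpretation
    first exceeds the threshold (None if it never does)."""
    for position, evidence in enumerate(evidence_for_new, start=1):
        if evidence > threshold:
            return position
    return None


def switch_points(threshold):
    """Image numbers at which the observer switches, viewing 1 -> 8 and 8 -> 1.
    Assumes threshold < 1, so a switch always happens."""
    forward = first_switch(woman_evidence, threshold)
    backward_position = first_switch(list(reversed(man_evidence)), threshold)
    backward = N_IMAGES + 1 - backward_position
    return forward, backward


print("unbiased observer (threshold 0.50):", switch_points(0.50))
print("frozen observer   (threshold 0.75):", switch_points(0.75))
```

With a threshold of 0.5 the switch happens between images 4 and 5 in both directions. Raising the threshold to 0.75 delays it to image 7 going forward and image 2 going backward, so each interpretation persists well past the midpoint, which is the pattern the quoted passage describes.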

## Potential avenues of further thought

The pressure of evidence should reduce the mental effort needed to switch models, but “leaky” hypothesis sets, where a class of model is not initially considered, should allow the pressure to metaphorically escape into the larger hypothesis space.

There is a potential for making this analogy more exact, by discussing entropy in graphical models (Bayesian Networks), especially in sets of graphical models with explicit uncertainty attached. I don’t have the math needed for this, but would be interested in hearing from those who do.

[1] I would like to thank both Abram Demski (Interviewed here) for providing a link to this material, and my dissertation chair, Paul Davis, who was able to point me towards how this has been used and extended in the intelligence community.

[2] There is a follow up book and training course which is also available, but I've neither read it nor seen it online. A shorter version of the main points of that book is here (PDF), which I have only glanced through.

[3] Eliezer discusses this idea in Entropy and short codes, but I’m heading in a slightly different direction.

[4] We have a LW Post, Entropy and Temperature that explains this a bit. For a different, simplified explanation, try this: http://www.nmsea.org/Curriculum/Primer/what_is_entropy.htm. For a slightly more complete version, try Wikipedia: https://en.wikipedia.org/wiki/Introduction_to_entropy. For a much more complete version, learn the math, talk to a PhD in thermodynamics, then read some textbooks yourself.

[5] I think this, of course, because I was initially heading in that direction. Instead, I realized there was a better analogy – but if we wanted to develop it in this direction instead, I’d point to the phase change energy required to change phases of matter as a reason that our minds have trouble moving from their initial estimate. On reflection, I think this should be a small part of the story, if not entirely negligible.