Unnatural Categories Are Optimized for Deception

Zack_M_Davis

Followup to: Where to Draw the Boundaries?

There is an important difference between having a utility function defined over a statistical model's performance against specific real-world data (even if another mind with different values would be interested in different data), and having a utility function defined over features of the model itself.

Arbitrariness in the map doesn't correspond to arbitrariness in the territory. Whatever criterion your brain is using to decide which word you want, is your non-arbitrary reason ...

So the one comes back to you and says:

That seems wrong—why wouldn't I care about the utility of having a particular model? I agree that categories derive much of their usefulness from "carving reality at the joints"—that's one very important kind of consequence of choosing to draw category boundaries in a particular way. But other consequences might matter too, if we have some moral reason to value drawing our categories a particular way. I don't see why I shouldn't be willing to trade off one unit of categorizational nonawkwardness for $X$ units of morality, even if trading off a million units of categorizational nonawkwardness for the same $X$ units of morality would be bad.

I once read about an analogy between category boundaries and national borders. Imagine a diplomat trying to come up with a proposal for a two-state solution to the Israeli–Palestinian conflict. There's no such thing as the "correct" border between Israel and Palestine, but there are consequences of choosing one border or another. For example, awarding territory to one side risks angering the other. For another, if the West Bank and Gaza Strip are to be part of Palestine, but Tel Aviv and the southern city of Eilat are to be part of Israel, then topology forces you to decide which of Israel and Palestine gets to be contiguous, and which will be split into two parts, because a "land bridge" between Gaza and the West Bank would separate Tel Aviv and Eilat, and vice versa. Since borders can't be "true" or "false", the diplomat's task is and can only be to weigh these kinds of trade-offs.

Analogously, I think of language, following Eliezer Yudkowsky's "A Human's Guide to Words", as being a human-made project intended to help people understand each other. It draws on the structure of reality, but has many free variables, so that the structure of reality doesn't constrain it completely. This forces us to make decisions, and since these are not about factual states of the world—what the definition of a word really is, in God's dictionary—we have nothing to make those decisions on except consequences.

... okay, I think I see the problem. I see how one might have gotten that out of "A Human's Guide to Words"—if you skipped all the parts with math. I am now prepared to explain exactly what's wrong here in more detail than my previous attempt: not just that this position is not in harmony with the hidden Bayesian structure of language and cognition, but how the hidden Bayesian structure of language and cognition explains why an intelligent system might find this particular mistake tempting in the first place, and what breaks as a result.

Category "boundaries" are a useful visual metaphor for helping explain the cognitive function of categorization. If you have the visualization but you don't have the math, you might think you have the freedom to "redraw" the category "boundaries". Simple, compact boundaries might tend to be more useful, but more complicated boundaries aren't false and therefore aren't forbidden if you have some non-epistemic reason to prefer them ... right?

Only in the sense that no hypothesis is "false"! Categories, words, correspond to hypotheses—probabilistic models that make predictions. If I see a dolphin in the water, and I say, "Hey, there's a dolphin!", and you understand me, that enables you to predict quite a lot about there being this-and-such kind of aquatic mammal with fins, a tail, &c. in the water.

This AI capability of "speech" is not only very powerful; it's also easy to understand the cause-and-effect evidential entanglement which explains how it works—at least at a very high level.

Photons bounce off the dolphin and hit my eyes. I recognize the photons as forming an image that matches a concept that I associate with the word/symbol "dolphin" (implementation details omitted). I emit a "dolphin" signal composed of sound waves which hit your eardrum. By a convention that culturally evolved due to our predecessors having a shared interest in communicating with each other, you map the "dolphin" signal to an internal concept that closely resembles the one I associate with that same signal. This works because we happen to live in a world where the distribution of creatures has cluster-structure whereby dolphins have lots of things in common with each other, such that it's possible to use observations about an entity to infer that it "is a dolphin", and then use the dolphin concept to make good predictions about aspects of the entity that have not yet been observed; we owe our confidence that we've learned "the same" dolphin model to the fact that dolphins actually exist.

But the dolphin concept/model/hypothesis is subject to the universal mathematical laws of reasoning under uncertainty. In particular, probability-mass flows between hypotheses: as long as you never assign a probability of zero (which is a log-odds of negative infinity), nothing you believe can ever be definitively (infinitely) "falsified"—it "just" makes quantitatively worse predictions as compared to other hypotheses.

Because category "boundaries" are merely a visualization for a probabilistic model that makes predictions about the real world, you can't "redraw the boundaries" associated with a communication signal without messing with the model that generates them, which means messing with your predictions about the real world.

Might there be some non-epistemic reason for an agent to prefer a model that makes worse predictions? Sure! Correct maps are useful for steering reality into configurations ranked higher in your preference ordering—but causing a different agent to have incorrect maps might make them mis-navigate reality in a way that benefits you! We call this deception.

In a related phenomenon, a poorly-designed agent might get confused and end up manipulating its own beliefs: optimizing its map to inaccurately portray a high-value territory (rather than optimizing the territory to be high-value by using a map that reflects the territory), a kind of self-deception. We call this wireheading.

The laws of probability and information theory allow us to calculate how information can be efficiently encoded and transmitted from one place to another. Given some distribution of random variables, and some specification of what information about those variables you want to transmit, some encodings—some ways of "drawing" category "boundaries"—quantitatively perform better than others. Agents that want to communicate with each other will tend to invent or discover conventions that efficiently encode the information they're trying to communicate. Agents that communicate in ways that systematically depart from efficient encodings are better modeled as trying to deceive each other or wirehead themselves.

Let's walk through a simple example. Imagine that you have a peculiar job in a peculiar factory: specifically, you're a machine-learning engineer tasked with automating away the jobs of humans who sort objects from a mysterious conveyor belt.

Another engineer has already written a system that processes camera and sensor data about the objects into more convenient "features": color (measured on an eight-point blueness scale), shape (measured on an eight-point "eggness" scale), and vanadium content (a boolean Yes or No). Your task is to further process this information into a format suitable for giving commands to other systems—for example, the robot arm that will physically move the objects into appropriate bins.

The feature data consists of the blueness–eggness–vanadium-content joint distribution given by this 128-entry table:

blueness–eggness–vanadium joint distribution

This seems like ... not the most useful representation? The data is all there, so in principle, you could code whatever you needed to do based off the full table, but it seems like it would be an unmaintainable mess: you'd sooner resign than write a 128-case switch statement. Furthermore, when the system is deployed, you hope to typically be able to give the binning robot messages based on only the color and shape observations, because the Sorting Scanner that the vanadium readings come from is expensive to run. You could just do a Bayesian update on the entire joint distribution, of course, but it seems like it should be possible to be more efficient by exploiting regularities in the data, not entirely unlike how your colleague's system has already made your job much simpler by giving you blueness and eggness feature scores rather than raw camera data. Eyeballing the table, you notice it seems to have a lot of redundancy: most of the probability-mass is concentrated in two regions where the blueness and eggness scores are either both high or both low—and vanadium is only found when both blueness and eggness are high.

O tragedy O the stars! If only there were some more convenient and flexible way to represent this knowledge—some kind of deep structural insight to rescue you from this cruel predicament!

... alright, dear reader—I shouldn't patronize. You already know how this story ends. The distribution factorizes!

$\sum_{\text{category}} P(\text{category}) \cdot P(\text{blueness} \mid \text{category}) \cdot P(\text{eggness} \mid \text{category}) \cdot P(\text{vanadium} \mid \text{category})$

(The distribution in this made-up toy example factorizes exactly, but in a messy real-world application, you might have a spectrum of approximate models to choose from.)

We can simplify our representation of our observations by using a naïve Bayes model, a "star-shaped" Bayesian network where a central "category" node is posited to underlie all of our observations: we believe that each object either "is a blegg" (and therefore contains vanadium and has high blueness and eggness scores) with probability 0.48, "is a rube" (and therefore has no vanadium and low blueness and eggness scores) with probability 0.48, or belongs to a catch-all "other"/error class with probability 0.04. (Maybe the camera is buggy sometimes, or maybe there are some other random objects mixed in with the rubes and bleggs?)

factorized object distribution

The full joint distribution had 127 degrees of freedom (a table of $8 \cdot 8 \cdot 2 = 128$ separate probabilities, constrained to add up to 1), whereas the naïve-Bayes representation only needs 57 parameters ($3 \cdot 1$ prior probabilities for the categories, plus $3 \cdot 8 = 24$-, $3 \cdot 8 = 24$-, and $3 \cdot 2 = 6$-entry conditional probability tables for each of the features). The advantage would be much larger for more complicated problems: the joint distribution table grows exponentially with more features, quickly becoming infeasible to store and represent, let alone learn.
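The parameter counts above can be checked with a few lines of arithmetic. This is a minimal sketch: the function names and the extra boolean features in the loop are hypothetical, added only to illustrate the exponential-versus-linear growth the paragraph describes.

```python
# Feature cardinalities from the text: 8 blueness levels, 8 eggness levels,
# and a boolean vanadium reading.
feature_sizes = {"blueness": 8, "eggness": 8, "vanadium": 2}
n_categories = 3  # blegg, rube, other


def joint_table_entries(sizes):
    """Number of cells in the full joint table (product of cardinalities)."""
    total = 1
    for size in sizes.values():
        total *= size
    return total


def naive_bayes_entries(sizes, categories):
    """Prior entries plus one conditional table per feature."""
    return categories + sum(categories * size for size in sizes.values())


full = joint_table_entries(feature_sizes)
print(full, full - 1)                                    # 128 127
print(naive_bayes_entries(feature_sizes, n_categories))  # 57

# Hypothetical: add extra boolean features and watch the two counts diverge.
for extra in (0, 5, 10):
    sizes = dict(feature_sizes)
    for i in range(extra):
        sizes[f"extra_{i}"] = 2
    print(extra, joint_table_entries(sizes), naive_bayes_entries(sizes, n_categories))
```

Each extra boolean feature doubles the joint table but adds only six entries to the naïve Bayes representation.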

It must be stressed that our "categories" here are a specific mathematical model that makes specific (probabilistic) predictions. Suppose we see a black-and-white photo of an egg-shaped object: specifically, one with an eggness score of 7. Given that observation of $\text{eggness} = 7$, we can update our probabilities of category-membership.

$P(\text{category} = c \mid \text{eggness} = 7) = \frac{P(\text{eggness} = 7 \mid \text{category} = c) \, P(\text{category} = c)}{\sum_{d \in \{\text{blegg}, \text{rube}, \text{??}\}} P(\text{eggness} = 7 \mid \text{category} = d) \, P(\text{category} = d)}$

We think the egg-shaped object is almost certainly a blegg (specifically, with probability 0.96), even if the black-and-white photo doesn't directly tell us how blue it is, because

$P(\text{category} = \text{blegg} \mid \text{eggness} = 7) = \frac{\frac{1}{4} \cdot \frac{12}{25}}{\frac{1}{4} \cdot \frac{12}{25} + 0 \cdot \frac{12}{25} + \frac{1}{8} \cdot \frac{1}{25}} = \frac{24}{25} = 0.96$

We can then use our updated beliefs about category membership (0.96 blegg/0 rube/0.04 unknown, as contrasted to the 0.48/0.48/0.04 prior) to get our updated posterior distribution on the 0–7 blueness score (0.005/0.005/0.005/0.005/0.005/0.245/0.485/0.245—left as an exercise for the reader).
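For readers who would rather check the exercise than do it by hand, here is a sketch of the update in exact arithmetic. The conditional tables are an assumption: the text only pins down the entries it actually uses (for example $P(\text{eggness} = 7 \mid \text{blegg}) = \frac{1}{4}$ and a uniform catch-all class), so the rube table below is simply taken to mirror the blegg one at the low end of the scale.

```python
from fractions import Fraction as F

prior = {"blegg": F(12, 25), "rube": F(12, 25), "other": F(1, 25)}

# ASSUMED conditional tables over scores 0..7, chosen to agree with the
# numbers quoted in the text; the rube table is a mirrored guess.
high = [F(0)] * 5 + [F(1, 4), F(1, 2), F(1, 4)]
low = [F(1, 4), F(1, 2), F(1, 4)] + [F(0)] * 5
flat = [F(1, 8)] * 8
p_blueness = {"blegg": high, "rube": low, "other": flat}
p_eggness = {"blegg": high, "rube": low, "other": flat}


def posterior_given_eggness(score):
    """Bayes' rule: P(category | eggness = score)."""
    unnormalized = {c: p_eggness[c][score] * prior[c] for c in prior}
    total = sum(unnormalized.values())
    return {c: weight / total for c, weight in unnormalized.items()}


def predictive_blueness(category_probs):
    """Mix the per-category blueness tables by the category probabilities."""
    return [sum(category_probs[c] * p_blueness[c][b] for c in category_probs)
            for b in range(8)]


post = posterior_given_eggness(7)
print({c: float(p) for c, p in post.items()})
# {'blegg': 0.96, 'rube': 0.0, 'other': 0.04}
print([float(p) for p in predictive_blueness(post)])
# [0.005, 0.005, 0.005, 0.005, 0.005, 0.245, 0.485, 0.245]
```

Using `Fraction` keeps the intermediate values exact, so the output can be compared digit for digit with the figures in the text.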

In addition to categories facilitating efficient probabilistic inference within the system that you're currently programming, labels for categories turn out to be useful for communicating with other systems. The robot arm in the Sorting room puts bleggs in a blegg bin, which gets taken to a room elsewhere in the factory where there's sophisticated vanadium-ore-processing machinery that has to handle both bleggs and gretrahedrons.

But suppose the binning arm doesn't need to know about the blueness and eggness scores: it can close its claws around rubes and bleggs alike, and you only need to program it to pick up an object from a certain spot on the conveyor belt and place it into the correct bin. However, the vanadium-ore-processing machine does need to do further information processing before it can operate on an object—perhaps it needs to vary its drill speed in proportion to the density of a particular blegg's flexible outer material (which it can estimate based on how brightly the blegg glows in the dark), but it uses a different drilling pattern for gretrahedrons.

If you need to send commands to both the binning arm and the ore-processing machine, it's a more efficient communication protocol to just be able to send the 28-byte JSON payload {"object_category": "BLEGG"} and let the other machines do their work using their own models of bleggs, rather than having to send over the raw camera data plus the binary code of the Bayesian network and feature extractors that you initially used to identify bleggs. Intelligence is prediction is compression: our ability to find an encoding that compresses the length of the message needed to convey information about the objects is fundamental to our having learned something about the distribution of objects.
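As a small illustration of the protocol, the sketch below serializes the message and confirms its size. The receiver side is entirely hypothetical: the `ORE_PROCESSOR_MODELS` lookup and the `handle` function stand in for whatever model the downstream machine keeps for each label.

```python
import json

# Sender side: the classifier emits only a category label.
message = json.dumps({"object_category": "BLEGG"})
payload = message.encode("utf-8")
print(message, len(payload))  # {"object_category": "BLEGG"} 28

# Receiver side (hypothetical): each machine keeps its own model per label,
# so the label alone is enough to select the right behavior.
ORE_PROCESSOR_MODELS = {
    "BLEGG": "vary drill speed with estimated density",
    "GRETRAHEDRON": "use the alternate drilling pattern",
}


def handle(payload_bytes):
    """Decode the label and look up the receiver's own model for it."""
    category = json.loads(payload_bytes.decode("utf-8"))["object_category"]
    return ORE_PROCESSOR_MODELS.get(category, "no model for this label")


print(handle(payload))  # vary drill speed with estimated density
```

The short message only works because both sides already hold a model keyed by the same label.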

The {"object_category": "BLEGG"} message is a useful shorthand for "linking up" the models between different machines. Different machines might not use the same model: the classifier system uses blueness and eggness scores to identify bleggs, but the ore-processing machine, having been told that an object is a blegg, can take its approximate blueness and eggness for granted and only needs to reason about its luminescence and vanadium content.

But this trick of using a signal to correlate the models between different machines only works because and insofar as both models are pointing to the same cluster-structure in reality. If the model in the classifier system doesn't meaningfully match the model in the ore-processing system—if the classifier code sends the {"object_category": "BLEGG"} message given an object with blueness score between 5 and 7, but the ore-processor, upon receiving the {"object_category": "BLEGG"} message, positions its drills in the expectation of processing an object with an eggness score between 0 and 2—then the factory doesn't work.

As a human learning math, it's helpful to examine multiple representations of the same mathematical object. We've already seen our blueness–eggness–vanadium model represented as a table, and factorized into a graphical model. We've also done some algebraic calculations with it. But we can also visualize it: the set of camera observations that the model classifies as a blegg with probability $\geq 0.96$ can be thought of as an area with a boundary in two-dimensional blueness–eggness space:

("With probability $\geq 0.96$ " because our catch-all "other"/error category can also generate examples with high blueness and eggness scores; we can't say things like "Everything inside the boundary in the diagram is a blegg" when we're talking about a formal model where some of the categories generate overlapping observations in whatever subspace the diagram is depicting.)

If you were trying to teach someone about the hidden Bayesian structure of language and cognition, but thought your audience was too stupid or lazy to understand the actual math, you might be tempted to skip the part about factorizing a joint distribution into a star-shaped Bayesian network and just talk about "drawing" "boundaries" in configuration space for human convenience, perhaps with a hokey metaphor about national borders. Then the audience might walk away with the idea that there's no reason not to replace the old blegg concept and its boring compact boundary, with a new blegg* concept that has an exciting squiggly border.

Alaska isn't even contiguous with the rest of the United States. If that's okay, why can't the borders of bleggness be a little squiggly?

Because the "national borders" metaphor is just a metaphor. It immediately breaks down as soon as you try to do any calculations.

When we say that the United States purchased Alaska from the Russian Empire, that means that this-and-such physical area on the Earth's surface went from being the territory of the Russian government, to being territory of the United States government, where land being the "territory of" a "government" is a complicated idea that has something to do with Schelling points over who gives orders to policemen and soldiers in that area.

When you reprogram your machine-learning system to send an {"object_category": "BLEGG"} message when it sees an object with an eggness score of 2 and a blueness score of 1, then your vanadium-ore-processing machine wears down its drill bits trying to process a rube.

Other than the fact that some aspects of both of these situations can be usefully visualized as changes to a two-dimensional diagram depicting an area with a boundary, what do these situations have to do with each other? They don't. Countries aren't Bayesian networks. They just aren't. When we depict a country on a map, we're not talking about a cognitive system that can use observations of latitude to estimate probabilities of country-membership and then use that distribution on country-membership to get an updated probability distribution on longitude. (I mean, given a world map, you could program such a thing, but it seems kind of useless—it's not clear why anyone would want that particular program.) Why would you expect to understand an AI-theory concept by telling a story about national borders?

So, that's what's wrong with the national-borders metaphor. But we haven't yet really explained the problem with "unnatural" categories—those that you would visualize as a squiggly, "gerrymandered" boundary. The squiggly blegg* boundary doesn't have the nice property of corresponding to the category labels in our nice factorized naïve Bayes model, but it still contains information. You can still do a Bayesian update on being told that an object lies within a squiggly boundary in configuration space. If that update eliminates half of your probability-mass, that's one information-theoretic bit, no matter how the category is shaped in Thingspace.

If you only care about how much probability you assign to the exact answer, then a bit is a bit. But if an approximate answer is approximately as good—if your answerspace has a metric on it, so that "approximate" can mean something—then some bits can be more valuable than others.

Suppose some random variable $X$ is uniformly distributed on the set $\{1, 2, 3, 4, 5, 6, 7, 8\}$. You have the option of being told either whether an observation $x$ sampled from $X$ is even or odd, or whether $x$ is greater or less than 4.5. Either way, you eliminate half of your hypotheses: the entropy of your probability distribution goes from $\log_{2} 8 = 3$ to $\log_{2} 4 = 2$. Either way, you've learned 1 bit.
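The bit-counting can be checked directly. In this sketch the helper names (`expected_entropy_after`, `parity`, `halves`) are made up for illustration; the calculation is just the entropy of a uniform distribution before and after learning a label.

```python
from math import log2


def expected_entropy_after(label, values):
    """Average remaining entropy (bits) after learning label(x), x uniform."""
    groups = {}
    for v in values:
        groups.setdefault(label(v), []).append(v)
    n = sum(len(g) for g in groups.values())
    return sum(len(g) / n * log2(len(g)) for g in groups.values())


def parity(x):
    return x % 2 == 0


def halves(x):
    return x > 4.5


xs = list(range(1, 9))
prior_bits = log2(len(xs))  # 3.0
print(prior_bits - expected_entropy_after(parity, xs))  # 1.0
print(prior_bits - expected_entropy_after(halves, xs))  # 1.0
```

Both category systems remove exactly one bit of uncertainty about the exact value of $x$.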

Still, if you have to make a decision that depends on "how big" $x$ is, it seems like the "1–4 or 5–8" category system is going to be more useful than the "even/odd" category system, even though they both provide the same amount of information about the exact answer. If you learn that $x \in \{1, 2, 3, 4\}$, then you know that $x$ is "small", but if you learn that $x$ is odd, you haven't learned much about how big it is: it could be 1, but it could just as well be 7.

To formalize this, let's measure how "good" a category is using the expected squared error. "Error" is how much a prediction is wrong by: if you guessed $x$ was 2, but it was actually 5, your error would be $5 - 2 = 3$ , and your squared error would be the square of that, $3^{2} = 9$ . The expected squared error of a probability distribution is, on average, the square of how much your guess about a sample from that distribution will be wrong. (The squared error has nicer mathematical properties than the absolute error.)

For our example of $x$ sampled from $X$ uniformly distributed on $\{1, 2, 3, 4, 5, 6, 7, 8\}$, your best-guess estimate $\hat{x}$ of $x$ is going to be the expected value

$\sum_{x \in \{1 \ldots 8\}} P(X = x) \cdot x = \frac{1 + 2 + 3 + 4 + 5 + 6 + 7 + 8}{8} = 4.5$

And the initial expected squared error is

$E[(x - \hat{x})^{2}] = \sum_{x \in \{1 \ldots 8\}} P(X = x) \cdot (x - \hat{x})^{2} = \frac{(1 - 4.5)^{2} + (2 - 4.5)^{2} + \ldots + (8 - 4.5)^{2}}{8} = 5.25$

Suppose you then learn whether $x$ is even or odd.

With probability 0.5, you learn that $x$ is even. In that case, your new estimate $\hat{x}$ taking that into account would be

$\sum_{x \in \{2, 4, 6, 8\}} P(X = x) \cdot x = \frac{2 + 4 + 6 + 8}{4} = \frac{20}{4} = 5$

and your new expected squared error (in the "even" possible world) would be

$E[(x - \hat{x})^{2}] = \sum_{x \in \{2, 4, 6, 8\}} P(X = x) \cdot (x - \hat{x})^{2} = \frac{(2 - 5)^{2} + (4 - 5)^{2} + (6 - 5)^{2} + (8 - 5)^{2}}{4} = \frac{9 + 1 + 1 + 9}{4} = \frac{20}{4} = 5$

With probability 0.5, you learn that $x$ is odd. Similar calculations (left as an exercise) also give a new expected squared error of 5 in the "odd" possible world. Averaging over both cases (trivially, $0.5 \cdot 5 + 0.5 \cdot 5 = 5$ ), learning whether $x$ is even or odd only brought our expected squared error down from 5.25 to 5, barely changing at all.

In contrast, if you learn whether $x$ is 1–4 or 5–8, your expected squared error plummets to 1.25. (Exercise.) By being compact, the "1–4 or 5–8" category system is much more useful for getting close to the right answer than the "even/odd" category system.
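The exercises above are quick to check numerically. The following is a minimal sketch with made-up helper names; it assumes, as in the text, that the best guess within each group is that group's mean.

```python
def mean(values):
    values = list(values)
    return sum(values) / len(values)


def squared_error(values):
    """Expected squared error when guessing the mean of a uniform sample."""
    values = list(values)
    guess = mean(values)
    return mean((v - guess) ** 2 for v in values)


def expected_error_after(label, values):
    """Average the within-group squared error, weighted by group size."""
    groups = {}
    for v in values:
        groups.setdefault(label(v), []).append(v)
    n = sum(len(g) for g in groups.values())
    return sum(len(g) / n * squared_error(g) for g in groups.values())


xs = range(1, 9)
print(squared_error(xs))                               # 5.25
print(expected_error_after(lambda x: x % 2 == 0, xs))  # 5.0
print(expected_error_after(lambda x: x > 4.5, xs))     # 1.25
```

Both labelings carry one bit, but only the compact one shrinks the error by much.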

The same goes for natural categories versus squiggly category "boundaries" in configuration space; we just need to supply some metric to define what "close" means.

For our blueness–eggness–vanadium distribution, suppose we use the Euclidean distance on blueness-score ✕ eggness-score ✕ 1-if-vanadium-present-else-0. (So, for example, the "distance" between the typical blegg and the typical rube is $\sqrt{(6 - 1)^{2} + (6 - 1)^{2} + (1 - 0)^{2}} = \sqrt{25 + 25 + 1} = \sqrt{51} \approx 7.14$ under this metric.)

Then our expected squared error before being told anything about an object is about 13.63. On being told whether an object is a blegg, rube, or other (according to the categories in our nice factorized naïve Bayes model), our expected squared error plummets to 1.38.
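Here is one way the calculation might be set up. Every table in it is an assumption rather than the post's actual data: the blegg and rube score distributions are centered on 6 and 1, the catch-all class is uniform over scores, and the catch-all class is taken never to contain vanadium. Under those assumptions the sketch reproduces the two figures quoted above.

```python
from itertools import product

prior = {"blegg": 0.48, "rube": 0.48, "other": 0.04}

# ASSUMED per-category tables: (blueness, eggness, vanadium) distributions.
high = [0, 0, 0, 0, 0, 0.25, 0.5, 0.25]
low = [0.25, 0.5, 0.25, 0, 0, 0, 0, 0]
flat = [0.125] * 8
tables = {
    "blegg": (high, high, [0.0, 1.0]),  # always contains vanadium
    "rube": (low, low, [1.0, 0.0]),     # never contains vanadium
    "other": (flat, flat, [1.0, 0.0]),  # assumption: no vanadium
}


def points(category):
    """Yield (probability, (blueness, eggness, vanadium)) within a category."""
    p_blue, p_egg, p_van = tables[category]
    for b, e, v in product(range(8), range(8), range(2)):
        p = p_blue[b] * p_egg[e] * p_van[v]
        if p > 0:
            yield p, (b, e, v)


def expected_squared_error(weighted_points):
    """Mean squared Euclidean distance from the probability-weighted centroid."""
    pts = list(weighted_points)
    total = sum(p for p, _ in pts)
    centroid = [sum(p * x[i] for p, x in pts) / total for i in range(3)]
    return sum(p * sum((x[i] - centroid[i]) ** 2 for i in range(3))
               for p, x in pts) / total


everything = [(prior[c] * p, x) for c in prior for p, x in points(c)]
before = expected_squared_error(everything)
after = sum(prior[c] * expected_squared_error(points(c)) for c in prior)
print(round(before, 2), round(after, 2))  # 13.63 1.38
```

Swapping in a different labeling of the same points is a matter of regrouping `everything` by the new label before averaging.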

But suppose that, instead of our nice factorized naïve Bayes model, we use a category system based on drawing squiggly "boundaries" in configuration space: everything inside the blegg* boundary in the diagram is a blegg*, everything within the rube* boundary is a rube*, and anything outside belongs to a catch-all "other*" category.

On learning whether an object is a blegg*, rube*, or other*, our expected squared error only goes down to about 4.12.^[1]

In this sense, the gerrymandered blegg* concept is quantitatively less informative than the original, compact blegg concept. The metric we assigned to blueness–eggness–vanadium space was our choice, and could depend on our values: for example, if we simply don't care about predicting how blue an object is, we could disregard the blueness score and only define a concept on the eggness–vanadium subspace (in which case our initial expected squared error is about 6.94, plummets to 0.69 given knowledge of blegg/rube/other category-membership, but only goes down to about 1.81 given knowledge of the gerrymandered blegg*/rube*/other* category). Or if we don't care about predicting blueness very much, we could calculate our error score with respect to a metric that gave blueness very little weight. (Exercise.)
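One way to approach the weighted-metric exercise: squared Euclidean error is a sum over coordinates, so a weighted metric just reweights each feature's variance. The sketch below relies on assumed tables (blegg and rube scores centered on 6 and 1, a uniform catch-all class with no vanadium), and the `errors` helper and the 0.1 weight are hypothetical. It covers only the compact categories, since the gerrymandered boundaries live in a diagram this sketch cannot see.

```python
prior = {"blegg": 0.48, "rube": 0.48, "other": 0.04}

# ASSUMED per-category distributions for each feature.
high = [0, 0, 0, 0, 0, 0.25, 0.5, 0.25]
low = [0.25, 0.5, 0.25, 0, 0, 0, 0, 0]
flat = [0.125] * 8
tables = {
    "blegg": {"blueness": high, "eggness": high, "vanadium": [0.0, 1.0]},
    "rube": {"blueness": low, "eggness": low, "vanadium": [1.0, 0.0]},
    "other": {"blueness": flat, "eggness": flat, "vanadium": [1.0, 0.0]},
}


def variance(dist):
    """Variance of a distribution given as a list indexed by value."""
    m = sum(p * x for x, p in enumerate(dist))
    return sum(p * (x - m) ** 2 for x, p in enumerate(dist))


def marginal(feature):
    """Mix the per-category tables by the category prior."""
    size = len(tables["blegg"][feature])
    return [sum(prior[c] * tables[c][feature][x] for c in prior)
            for x in range(size)]


def errors(weights):
    """Weighted expected squared error before and after learning the category."""
    before = sum(w * variance(marginal(f)) for f, w in weights.items())
    after = sum(w * sum(prior[c] * variance(tables[c][f]) for c in prior)
                for f, w in weights.items())
    return before, after


full = errors({"blueness": 1, "eggness": 1, "vanadium": 1})
no_blue = errors({"blueness": 0, "eggness": 1, "vanadium": 1})
print([round(e, 2) for e in full])     # [13.63, 1.38]
print([round(e, 2) for e in no_blue])  # [6.94, 0.69]

# Hypothetical: care about blueness only a little.
print(errors({"blueness": 0.1, "eggness": 1, "vanadium": 1}))
```

Setting a weight to zero is the same as dropping that coordinate from the space.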

But given a metric on the variables that you care about predicting and using to inform predictions, which categories are cognitively useful depends on the distribution of data in the world. You can't define a word any way you want.

The dependence on a choice of metric on configuration space—and really, a choice of the space—gives a sense in which optimal categories are value-laden, but it's a specific kind of lawful dependence between your values and the distribution of data in the world, not an atomic preference for using a particular encoding for its own sake.

The cognitive function of categorization is to group similar things together so that we can make similar decisions about them. A function measuring the extent to which things are "similar" has to take the things as input, but the extent to which things are decision-relevantly similar also depends on what you're trying to accomplish with your decisions, and that can be algorithmically complex. It might not be just a matter of only looking at some decision-relevant subspace of a natural, "obvious" configuration space that's available to all possible minds (like not caring what color your toothbrush handle is—um, if we pretend that all possible minds had human-like color vision); the dimensions of the space you do your similarity-clustering in might themselves be complicated features (in the sense of machine learning), such that agents with different values would have no reason to logically pinpoint that particular criterion by which things may be judged. How you should define words depends on what you want, but that's not the same as defining words any way you want.

For example, poison isn't a natural category to a generic mind studying chemistry: we group cyanide and hemlock together as poison because we value human health, and so we want to have a category for scary chemicals that disrupt human metabolism, causing death or serious illness. But this determination depends on the intricate details of human biochemistry. (The theobromine in chocolate is okay for humans at typical doses, but potentially fatal to dogs, which are actually pretty close to us in animalspace.) The compact category "boundary" that minimizes predictive error on human-healthspace, corresponds to a squiggly "boundary" in the chemicalspace you would be looking at if you've never seen a human and just want to make predictions about the chemicals themselves.

Or tiny molecular smileyfaces and real human smiles might be grouped together as similar as far as an image-classifier's curve detector is concerned, even if they're not similar as far as the abstracted idealized dynamic of human morality is concerned.

The technical sense in which optimal categories can be value-laden doesn't alter the basic morals of our basic Bayesian philosophy of language. Your values can give you a particular configuration space and a metric on the space, but given that, sane agents want to "carve it at the joints" in order to get a communication system that minimizes predictive error. If you're trying to find an efficient encoding of your observations, there's no reason to want squiggly, gerrymandered categories in the decision-relevant space.

The one replies:

You're still not addressing my crux! I don't doubt what you say about minimizing prediction error with respect to some squared metric thingy. But what if that's not what I care about? My utility function assigns high value to using the squiggly blegg* category boundary—such that the utility of using my preferred category outweighs the disutility of making less accurate predictions. You can define a word any way you want—if you're willing to pay the costs.

So, what, you just intrinsically assign high utility to using the same communication signal to encode eggness-2/blueness-1 observations as eggness-6/blueness-6 observations, given the joint distribution specified in my story problem about sorting objects in a factory? Really?

"... yes!"

Okay, but where would that kind of exotic utility function come from? How would it arise naturally in an intelligent system?

There's a trivial sense in which you can interpret any action taken by an agent as being taken because the agent values taking that action. This theory is compatible with all possible behaviors and therefore explains nothing.

The value of decision-theoretic utility functions isn't that "Because utility!" serves as an all-purpose excuse for any possible behavior. It's that simple coherence desiderata imply that an agent's behavior should be describable as maximizing expected utility for some utility function—with corresponding constraints on the shape of that behavior.

Situations like the Allais paradox illustrate what these constraints look like. Consider an AI faced with playing the following game. There's a switch that can be turned On or Off, that starts out in the Off position. At midnight, a coin is flipped. If the coin comes up Tails, the game ends. If the coin comes up Heads, then at a quarter past midnight, if the switch is Off, then the AI gets paid $100, and if the switch is On, a six-sided die is rolled, and the AI gets paid $110 if the die doesn't come up 6.

Suppose that, before midnight, the AI is willing to pay a dollar to flip the switch On (as if it thought that winning $110 with a probability of 5/12 is better than winning $100 with a probability of 1/2). Suppose the coin comes up Heads, and the AI is then willing to pay another dollar to flip the switch Off again (as if it thought that $100 with certainty is better than $110 with probability 5/6). Then the AI is two dollars poorer in exchange for the switch being in the same position it started in.

These gambling preferences violate the independence axiom of the von Neumann–Morgenstern utility theorem. You can't have a utility function $U$ for which

$\frac{1}{2} \cdot U(\$100) < \frac{5}{12} \cdot U(\$110)$

and

$U(\$100) > \frac{5}{6} \cdot U(\$110)$

because the sides of the second inequality are just those of the first multiplied by two, and multiplying by two should preserve the direction of inequality.
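As a sanity check on that claim, the sketch below searches a grid of candidate utility values for a pair satisfying both inequalities. The grid and the function names are arbitrary choices for illustration, and the utility of winning nothing is normalized to zero, as the inequalities themselves implicitly do.

```python
from fractions import Fraction as F


def prefers_on_before_midnight(u100, u110):
    """First inequality: 1/2 * U($100) < 5/12 * U($110)."""
    return F(1, 2) * u100 < F(5, 12) * u110


def prefers_off_after_heads(u100, u110):
    """Second inequality: U($100) > 5/6 * U($110)."""
    return u100 > F(5, 6) * u110


# Exhaustively try integer utilities on a hypothetical 0..200 grid.
consistent = [
    (u100, u110)
    for u100 in range(201)
    for u110 in range(201)
    if prefers_on_before_midnight(u100, u110)
    and prefers_off_after_heads(u100, u110)
]
print(consistent)  # []
```

A finite search is not a proof, but it matches the algebra: doubling the first inequality gives $U(\$100) < \frac{5}{6} \cdot U(\$110)$, the exact opposite of the second.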

Having shown this, can we say that an AI with such behavior is "irrational"? But what does that even mean? If, for some reason, you specifically programmed the AI to prefer options it considers "certain", or to want switches to be "On" before midnight but "Off" after midnight, then it would be functioning as designed.

What we can say about such an AI, is that it doesn't have a utility function in terms of money, and is therefore not coherently optimizing for acquiring money. Recall that we say that a system is an optimizer if it systematically steers the future into configurations that rank higher with respect to some preference ordering. This helps us make predictions about what effects the system has, without having to model the details of how it brings those effects about. A well-designed agent that was optimizing for acquiring money would be expected to obey the independence axiom.

If the AI playing this game isn't coherently optimizing for acquiring money, what is it optimizing for? To tell, we'd need to observe its behavior in different environments and see how it responds to perturbations. If it is trying to acquire money but is just biased to prefer certainty (in violation of the von Neumann–Morgenstern axioms), then we'd expect it to make choices that result in money but continue to exhibit Allais-like glitches around gambles involving probabilities close to 1. If it just likes switches to be off after midnight, then we'd expect it to turn switches off at that time even if there's no gambling game going on.

This methodology for attributing goals to an agent—consider it to be "optimizing for" outcomes that it systematically achieves across a variety of environments—applies to the behavior of sending communication signals, just as it does to the behavior of flipping switches.

Back to the factory. Our classifier system sends a {"object_category": "BLEGG"} message when it gets feature data corresponding to the compact blegg concept. This behavior is optimized for sending messages that allow other systems to minimize the expected squared error of their predictions of objects with respect to our standard metric on blueness–eggness–vanadium space. We don't intrinsically "assign utility" to using that particular category system; the category is the solution to an optimization problem about how to efficiently get blueness–eggness–vanadium information from one place to another.

A system that sends a {"object_category": "BLEGG"} message when it gets camera data corresponding to the gerrymandered blegg* concept would be optimized for ... what? If you don't intrinsically assign utility to using that particular category system, then why would you program the system that way? What could possibly be the problem for which the gerrymandered category is an optimized solution?

Well. Suppose that, besides your dayjob as a machine-learning engineer, you also happen to own a side interest in the firm that supplies bleggs and rubes to this very factory. And suppose that vanadium fetches higher market prices than palladium, such that the factory is to pay the supplier $2 per blegg but only $1 per rube—and that the accounts-payable records are to be compiled based on how much the classifier you're currently programming sends {"object_category": "BLEGG"} and {"object_category": "RUBE"} messages, not how much metal actually gets harvested.

You can't help but notice that you stand to make more money if the system you're programming sends BLEGG messages more often. You can't just make it send BLEGG messages all the time—someone would notice and you'd get fired. But the ore-processing room can cope with a few suboptimally-sorted objects. Surely it's no big deal if you just ... adjusted the category boundary of BLEGG-ness a bit?

We saw earlier that the blegg concept does better than the blegg* concept with respect to mean squared error (given a metric on the feature space).

That's not the only possible scoring function with which one could formalize how "good" a category system is. Suppose that instead we score each category system by its expected squared error minus supplier revenue in cents, and prefer whichever system minimizes that quantity. With respect to this criterion, accurate predictions are still good, but supplier revenue is also good.

Learning whether an object is a blegg, rube, or other (according to the "natural" categories in our naïve Bayes model) yields a squared-error-minus-revenue score of about −142.62. (Don't ask me what the units are on this.) But learning whether an object is a blegg*, rube*, or other* yields a squared-error-minus-revenue of −151.57, which is lower (which is better, because we formulated this as a minimization problem). So with respect to that scoring function, the blegg* category "boundary" is preferable.
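To see how a revenue term can flip the ranking, here is a toy sketch. It is not the script behind the numbers above: the one-dimensional "vanadium content" data, the two thresholds, and the per-object averaging are all assumptions made up for illustration. Only the $2-per-blegg and $1-per-rube prices come from the story.

```python
# Hypothetical toy data: vanadium content (arbitrary units) of eight objects.
# The gap between 4 and 10 is the "natural" joint.
VANADIUM = [1, 2, 3, 4, 10, 11, 12, 13]
PRICE_CENTS = {"BLEGG": 200, "RUBE": 100}  # $2 per blegg, $1 per rube

def evaluate(values, threshold):
    """Return (mean squared error, mean squared error minus mean revenue in cents)."""
    groups = {}
    for x in values:
        name = "BLEGG" if x >= threshold else "RUBE"
        groups.setdefault(name, []).append(x)
    squared_error = 0.0
    revenue = 0
    for name, members in groups.items():
        mean = sum(members) / len(members)  # receiver's best guess given the label
        squared_error += sum((x - mean) ** 2 for x in members)
        revenue += PRICE_CENTS[name] * len(members)
    n = len(values)
    mse = squared_error / n
    return mse, mse - revenue / n

natural = evaluate(VANADIUM, threshold=7)          # boundary at the gap
gerrymandered = evaluate(VANADIUM, threshold=3.5)  # boundary nudged to bill one more BLEGG
print("natural:      ", natural)
print("gerrymandered:", gerrymandered)
```

In this toy setup the boundary at the gap has the lower squared error, while the nudged boundary has the lower squared-error-minus-revenue, mirroring the reversal described above.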

The one says:

But now it sounds like you're agreeing with me! The compact blegg category serves the factory owner's goals better, which you formalized in terms of minimizing average squared error. The squiggly blegg* boundary makes the factory perform less well, but it serves the moonlighting engineer's goals better, which you formalized in terms of minimizing squared error minus supplier revenue. There's no rule of rationality against the engineer programming the system using the blegg* category boundary if it suits their goals better.

Only in the sense that there's no rule of rationality against lying! Suppose I'm selling you some number of gold and silver bars, but you can't examine the metal yourself until later; you can only hope that the receipt I give you is accurate. Consider the following two scenarios.

In the first scenario, I lie: the receipt says I delivered 60 gold bars and 20 silver bars, but I actually delivered 40 gold bars and 40 silver bars. You live in a low-trust world where lying is very common and contract enforcement isn't really a thing: a third of the time an object is claimed to be gold, it turns out to be silver. So when you discover the fraud, you feel disappointed but not surprised: you would have preferred to get what you paid for, but you can't say you didn't anticipate it.

In the second scenario, I tell the truth—with respect to a category system that suits my goals. The receipt says I delivered 60 gold bars and 20 silver bars—and I did. It's just that what I prefer to call "gold bars", you prefer to call "gold bars, or silver bars with odd serial numbers", and what I call "silver bars", you call "silver bars with even serial numbers". You know this, so when you examine the actual contents of the delivery, you feel disappointed but not surprised: you would have preferred to transact under your definitions of 'gold' and 'silver', but you can't say you didn't anticipate it.

We might question whether these are two different scenarios, or two descriptions of the same scenario: the same physical receipt, the same physical metal, the same buyer anticipations about the metal conditional on observing the receipt. If we just pay attention to the evidential entanglements instead of being confused by words, then there's no functional difference between saying "I reserve the right to lie p% of the time about whether something belongs to category C", and adopting a new, less-accurate category system that misclassifies p% of instances with respect to the old system.
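The equivalence can be checked mechanically. The sketch below assumes a concrete delivery (40 gold and 40 silver bars, with serial numbers 0 through 39 within each metal) and an arbitrary choice of which silver bars get lied about; both are illustrative assumptions. It then compares the joint counts of (label on receipt, actual metal) in the two scenarios.

```python
from collections import Counter
from fractions import Fraction

# Assumed delivery: 40 gold and 40 silver bars, serial numbers 0..39 per metal.
bars = [("gold", s) for s in range(40)] + [("silver", s) for s in range(40)]

def receipt_when_lying(metal, serial):
    # Scenario 1: ordinary categories, but 20 of the silver bars are misreported.
    # (Which 20 is an arbitrary choice here: the ones with serial < 20.)
    if metal == "silver" and serial < 20:
        return "gold"
    return metal

def receipt_when_redefining(metal, serial):
    # Scenario 2: "gold" means "gold, or silver with an odd serial number".
    if metal == "gold" or serial % 2 == 1:
        return "gold"
    return "silver"

def joint(receipt):
    """Count (label on receipt, actual metal) pairs over the whole delivery."""
    return Counter((receipt(metal, serial), metal) for metal, serial in bars)

def p_gold_given_label_gold(counts):
    """Probability that a bar labeled "gold" on the receipt is actually gold."""
    labeled_gold = counts[("gold", "gold")] + counts[("gold", "silver")]
    return Fraction(counts[("gold", "gold")], labeled_gold)

lying = joint(receipt_when_lying)
redefining = joint(receipt_when_redefining)

print(lying == redefining)
print(p_gold_given_label_gold(lying), p_gold_given_label_gold(redefining))
```

Under these assumptions the two joint tables are identical, so a buyer who conditions on the receipt ends up with the same anticipations about the metal either way.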

Minimizing the squared-error score is about map–territory correspondence: ways of communicating that help the factory machines make better predictions about the objects, get a better score.

Minimizing the squared-error-minus-supplier-revenue score is a compromise between map–territory correspondence and saying whatever makes the supplier the most money.

The degree of compromise is quantitative: there's a continuum of possible scoring functions between "minimize expected squared error, only" (for which the naïve-Bayes categorizer is a good solution), and "maximize supplier revenue, only" (for which "always say BLEGG" is the optimal solution). If always saying whatever profits you and not revealing any information about the territory is deception pure and simple, then the intermediate points on a continuum with that can be thought of as partially deceptive.

Depending on your goals, deception can be rational! If you don't care about other agents having accurate models and just want to intervene on them to make them believe whatever makes them behave in a way that benefits you—or whatever makes them happy—then you can do that! There's no God to stop you. But in order to help you decide whether deceiving people is the right thing to do, it helps to notice that what you're doing is deceiving people.

It helps to notice what you're doing—if you're trying to be an agent that coherently steers the future in some direction. But who does that, really? Maybe you just want to feel good! And not even coherently steer the universe into configurations where you feel good, either!

Rational agents should want to have true beliefs: the map that reflects the territory, is the map that is useful for navigating the territory. But you don't—can't—have unmediated access to the world; you can only infer what the world is like from sensory data, and effectively live in your model of the world. Given the tricky indirection involved, it's not surprising that poorly-designed agents like humans sometimes get confused and "wirehead" themselves: if you don't notice the difference, it's tempting to fabricate a fake map that falsely portrays the territory as being good, instead of making a map that reflects the territory (which you can use to figure out how to improve the territory).

Similarly, if you don't notice the difference, it's tempting to choose language that makes the world sound good, rather than to have your language accurately describe the world (which description you can use to figure out how to make the world better).

Suppose I want people to think I'm funny. Funny is a value-laden concept in the specific lawful sense described earlier: non-human agents would have no motive to evaluate the particular fixed computation of humor. It's also a fuzzy concept: we don't have a simple test to precisely measure in standard units exactly how funny a joke is, but there's enough regularity in how people use the word "funny" for the word to be a useful communication signal. It's also a two-place concept: people have different senses of humor, so that what I consider funny isn't exactly the same as what you consider funny.

Given all these complications, one could imagine being tempted to think that humor is "subjective", and that therefore I can define it any way I want, and that therefore, if I feel sad about not being "funny", I can fix that by changing my definition of the word "funny" such that it includes my jokes. Because definitions can't be "false", right!? There's no rule of rationality prohibiting this boundary-redrawing project—and since I want so desperately to be "funny", there's every rule of human decency in favor of it, right?!

So, this obviously doesn't work. (Okay, it "works" if you deliberately choose to define the word "work" such that it works, but it doesn't actually work.) Yes requires the possibility of no: redefining X to make "Is it X?" come out true no matter what, loses the purpose of asking the question in the first place. The proposal to redefine the word "funny" came with the purported justification that words don't have intrinsic meanings, so it can't be "wrong" to redefine it. But precisely because words don't have intrinsic meanings, there's no reason to want to redefine an existing word, except to piggyback off the meaning people are already using that signal for.

(Note that this, in itself, isn't necessarily deceptive. Sometimes, coining new senses of a word that piggyback off an existing meaning can be a powerful tool for extending our vocabulary to cover new phenomena that we don't already have words for—as long as we're careful to specify which meaning is intended when it's not clear from context.)

It's not plausible to suppose that I want to be "funny" because I like five-letter words that start with the letter f; I want to be funny because of what that communication signal is already understood to refer to in common usage. The redefinition might (or might not) succeed at making me feel better about myself, but if it does, it only works by means of confusing me: using strategic equivocation to arbitrage the hedonic gap between my new definition, and the old definition (which I still mentally associate with the word).

If it does succeed at making me feel better about myself, is the redefinition "rational"? Happiness is good, right? Should not rationalists win?

I do not frame an answer: that would depend on how you draw the category boundaries of "rational", which is not an interesting question. (As it is written of a virtue which is nameless: if you speak overmuch of the Way, you will not attain it.)

What I can say, however, is that redefining the concept of humor is not a procedure that uses a map that reflects the territory to systematically achieve goals across a wide range of environments. If there's anything I can do to become funnier (like practicing telling jokes in a mirror, or studying great comedians to imitate their timing and delivery), I would seem less likely to notice and execute on such a plan after having sabotaged the concept I would need to notice the problem in the first place.

The map is not the territory ... but for real agents embedded in the physical universe, the map is part of the territory. This presents some complications to applications of our anti-wireheading moral. We don't want to wirehead ourselves by making the map look good at the expense of undermining our ability to navigate the territory—but there's no bright-line distinction demarcating which configurations of atoms are "the map". From the perspective of the eternal, it's all just territory.

In the previous post, we considered the case of an assembly line (well, sorting line) worker in the blegg–rube factory being excited about an ostensible promotion to the position of Vice President of Sorting—only to be aggrieved on finding out that it's a promotion literally in name only, with no changes in pay, authority, or work tasks.

If we interpret the title as part of "the map", a communication signal with the function of encoding information about the person's job, then we want to say that the new title is substantively misleading (even if it's not technically a "lie"): when you hear that someone's job is being a "Vice President", you predict that their work involves managing people and making high-level executive decisions for the firm. Your probability that the "Vice President" has to spend all day moving objects from a conveyor belt into one of two bins based on the object's color and shape (a task that should probably be automated), is lower than before you heard the person's title: hearing the title made you update in the wrong direction.

But if we interpret the title as part of "the territory", a feature of the job itself, rather than a communication signal about the job—then it's not misleading and can't be misleading. The job happens to be one that has the symbols "Vice President" printed on the accompanying business cards and employee roster, much like how bleggs are objects that happen to be blue. You can't say the blue is "lying"; that doesn't make any sense!

The function of words is to serve as signals for communication, so it seems safe to say that language should usually be construed as part of "the map". Changing names and only names, without altering the things that the names refer to, as in the phony "Vice President" example, is probably deceptive. But for other features associated with a category, it may not always be obvious when we should construe them as "map" rather than "territory": using a feature to infer category-membership is formally equivalent to regarding it as a signal sent by senders of that category. Is that man pretending to be a doctor, or does he just happen to be wearing a lab coat?

The concept we're groping towards, and hoping to formulate an elegant reduction of, is that of mimicry. Suppose there is some existing category of entity, an original, typified by some cluster of traits. A mimic is an entity optimized to approximately match the distribution of the original in many, but not all traits, thereby being part of the same cluster as the original in some subspace of the space the original category is defined in, but not the space as a whole. For example, if the vector $[4, 4, 4, 4, 4] \in \mathbb{R}^{5}$ is the original, then an optimization process trying to construct a mimic of it in the subspace spanned by $x_{1}$, $x_{4}$, and $x_{5}$ might choose $[4, 0, 0, 4, 4]$: if you only look at the first, fourth, and fifth coordinates, then $[4, 4, 4, 4, 4]$ and $[4, 0, 0, 4, 4]$ "look the same"—they are the same in that subspace, but not the same if you include the second and third coordinates.
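The vector example can be spelled out in a few lines. This is only an illustrative sketch; the helper name `project` and the zero-based index list are my own choices.

```python
original = [4, 4, 4, 4, 4]
mimic = [4, 0, 0, 4, 4]

# Zero-based positions of the first, fourth, and fifth coordinates.
observed = [0, 3, 4]

def project(vector, indices):
    """Keep only the coordinates an observer restricted to `indices` can see."""
    return [vector[i] for i in indices]

same_in_subspace = project(original, observed) == project(mimic, observed)
same_overall = original == mimic

print(same_in_subspace, same_overall)
```

An observer limited to the `observed` coordinates cannot tell the two vectors apart, even though they differ in the full space.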

We can find examples in nature. Suppose one type of butterfly has evolved to be toxic to a type of predator, and also has distinctive wing markings that function as an honest warning signal to that predator: this butterfly is not good to eat. This provides an "opportunity" (in evolutionary time) for a second species of butterfly to develop similar wing markings, so that predators will confuse it for the first type of butterfly, despite the second butterfly not paying the metabolic cost of producing toxins. This kind of situation is called Batesian mimicry.

Is Batesian mimicry deceptive? (In our usual functionalist sense, which is obviously not a claim about butterfly psychology.) Is the second butterfly's very existence a kind of lie?

In some sense, yes! The mimic butterfly has been optimized by evolution to look like the first butterfly because of the fitness payoff of being categorized by the predator as the first, toxic, kind of butterfly. The "categorized by the predator as toxic" category is a natural, compact region in wing-marking-space, but "comes apart" into two clusters in the broader wing-markings–actual-toxicity space.

Furthermore, the evolutionary dynamics create an asymmetric relationship between the two categories, that isn't captured by just the two trait-clusters themselves. The reason for the mimic butterfly to have those particular wing-markings is in order to manipulate the predator's predictions of toxicity (which was learned from encounters with the original), so if the original's wing-markings were to change as a result of some new selection pressure, the mimic would be subjected to selection pressure to "keep up" by changing its wing-markings accordingly.

That's not true in the other direction: if the mimic's markings were to change, the original wouldn't "follow": the original would instead benefit from the probabilistic strength of its warning signal not being parasitically diluted by the mimic anymore. Thus, the asymmetric terminology of "original" and "mimic" is appropriate: it's not just that these two species happen to look like each other; one of them was there first, and the other looks like it.

Is mimicry always deceptive? Not necessarily—there might be some situations where the relevant variables are all among those on which the mimic matches the distribution of the original.

Suppose you and I are feeding some ducks in the park. I say, "I love feeding these ducks!"

You say, "Wrong! These aren't all ducks. This park is where a local inventor tests out his Anatidoid robots that are designed to look and act like ducks. Therefore, you can't say, 'I love feeding these ducks'; you need to say 'I love feeding these ducks and Anatidoid robots'."

"Wow, they're so realistic!" I say. "I can't even tell which ones are really robots! In fact," I continue, "since I can't tell, I'm inclined to just keep calling them all ducks; it would be pretty awkward to refer to each one as a duck-or-Anatidoid-robot."

"But it is possible to tell," you claim. "For example, if you get really close to one of the Anatidoid robots, and there's not a lot of ambient noise, you can hear the gears inside, turning."

"Okay," I say, "but I can't hear the gears from here. Since I have no way of telling the difference between ducks and Anatidoid robots without doing the more expensive evidence-gathering of cornering one in a quiet place, it makes sense for me to talk and think about the robots as being a kind of duck."

"But that's a lie! Ducks and Anatidoid robots may look and act similarly, but they're actually very different! Ducks are made of flesh and blood inside and are fated to die, whereas Anatidoid robots have a plastic interior and are immortal. And the ducks digest and gain nutrients from the scraps of bread we're feeding them, whereas the Anatidoid robots merely store the bread in an internal compartment that later gets dumped as they recharge wirelessly in the inventor's lab."

"Sure," I agree. "And if I were interacting with these entities in a context where I wanted to minimize the expected squared error of my predictions about their internal makeup, energy sources, or ultimate fate, then I would want to make that distinction. But I just want to watch some cool ducks in the park, and in the context of that activity, I only need to minimize the expected squared error of my predictions about appearance and behavior."

This is the origin of the famous duck test: if it looks like a duck, and quacks like a duck, and you can model it as a duck without making any grievous prediction errors, then it makes sense to consider it a member of the category duck in the range of circumstances where your model continues to perform well.

The features for which mimics fail to match the original need not be hidden (like gear sounds that you can't hear in a noisy park) in order for mimics to not be deceptive; they only need to be irrelevant in the context in which the category is being used. Squirt guns aren't guns—and are usually manufactured in unrealistic colors specifically to prevent being confused with real guns—but in the context of a water fight, the utterance "Don't point that gun at me" (without the privative adjective squirt) is understood perfectly well.

Nondeceptive mimicry is fragile, however: it works only in contexts where all the relevant features are ones where the mimic matches the original. Mimics that don't match the distribution of the original along relevant features are deceptive in the sense that agents that observe the mimic and assign it to the same mental category as the original on the basis of the matching features, will use that categorization to make predictions about unobserved but nonmatching features, and be wrong. And they'll be wrong because the mimic is optimized to "look like" the original (to match on many observable features).

If different agents using a shared language disagree on what features are "relevant", they may have an incentive to fight about how scarce and valuable short codewords should be defined in their common language, in order to exert control over what inferences and decisions agents using that language can easily make and coordinate on.

Let's consider how this might apply to a real-world issue. From moral perspectives that place a lot of value on the welfare of nonhuman animals, factory farming is an ongoing moral catastrophe. Unfortunately (for the farmed animals), meat-eaters and the global agriculture industry they support aren't going to change their ways because of anyone's desperate cry at the horror of suffering or carefully-reasoned appeal to the global utilitarian calculus. Animal-rights advocates can sway behavior on the margin, but there's just too much biological and cultural inertia favoring the consumption of animal products for it to be feasible to outlaw factory farming the way chattel slavery was outlawed. It's not that humans hate farm animals; they're just ... made out of tissue that we can use for other things.

An alternative strategy for ending factory farming is to prioritize the development of artificial substitutes that mimic real meat, eggs, dairy, &c. along the consumption-relevant dimensions of taste, texture, nutrition, &c., but are produced in a lab or factory rather than from the tissues of sentient creatures. In the limit of arbitrarily capable physical manufacturing technology, carnivores and factory-farming opponents alike could be satisfied: if two steaks are indistinguishable by any physical means whatsoever, then a meat-eater has no reason to care which one came from an actual cow's flesh, and which one was molecularly assembled by nanobots. Perhaps a society of hunter–gatherers that attached cultural significance and ritual to the labor of killing one's own meal would have a reason to object, but modern folk for whom food comes from the supermarket have no basis within their experience to say that the nanoassembled steak isn't "real".

Unfortunately, we do not have arbitrarily capable physical manufacturing technology. Although progress continues, modern animal product substitutes are sufficiently unsuccessful mimics that they are usually not considered to belong to "the same" category as the original. Veggie burgers are not burgers in the sense that a customer who ordered "a burger" at a restaurant and was served a veggie burger would be likely to notice and complain—and in particular, would probably not be satisfied if the waiter were to reply, "Well, if you specifically wanted a burger made from cow flesh, you should have said that."

As technology to make plausible mimics/substitutes improves, however, different interest groups might face a temptation to fight over the meanings of words that was not present when the mimics weren't plausible enough for a dispute to arise. If you have the power of setting the default extension of a word that people are already using to communicate with, you can exert some amount of control over the decisions people make while trying to think using that word. Should the meaning change, then a restaurant customer who wants to make sure they receive a burger under the old definition now has to use more words, while those who don't have a strong preference or are too shy to complain will accept the restaurant's interpretation of the order.

Thus, if a fight breaks out about the meaning of the word meat, animal rights activists have a moral incentive to draw the category "boundaries" to include even substitutes that are very bad (on the empirical merits of successfully mimicking the original), whereas existing agricultural interests have a financial incentive to draw the "boundaries" to exclude even substitutes that are very good. (This kind of dispute is not hypothetical, and isn't necessarily limited to just words: in the late 19th century, dairy farmers pushed for laws that required margarine to be dyed pink to prevent consumers from confusing it for butter—the law effectively interpreting color as a communication signal, rather than a property of the good itself.)

If a fight breaks out about the meaning of the word meat, rationalists may not all take the same side, but we can at least strive for objectivity in describing the conflict—and in particular, to notice the difference between definitions motivated by describing reality, and definitions motivated by the positive or negative effects (such as profitably deceiving other agents) of choosing one description or another.

If some think that some meat substitute should be considered meat because the "taste" dimension is genuinely most relevant to the true meaning of meat, and some oddities in the texture don't matter, but others think vice versa, the philosophy articulated in this post has nothing to say to either side: the math of minimizing expected squared error by putting labels on clusters doesn't say which subspace to look for clusters in.

But if some think that some meat substitute should be considered meat because saving nonhuman animals from a life of torture is more important than conceptual parsimony ... I can't prove that that's not the right answer to the decision problem of what verbal behavior to perform. The stakes are genuinely high.

What I can say is that the hidden Bayesian structure of language and cognition makes no reference to the stakes, and departing from the structure extracts a price that isn't up to us.

If, empirically, being generous about what counts as "meat" can prevent massive suffering (by altering the social defaults around consumption behavior), then maybe that's the right thing to do.

And if you live in an absurd thought experiment where saying "2 + 2 = 5" could save 3↑↑↑3 lives, maybe saying "2 + 2 = 5" is the right thing to do. But the empirical question of whether you happen to live in that particular thought experiment, doesn't change the laws that govern what you have when you take ●●-many plus another ●●-many, no matter what symbols are used to communicate this fact, and no matter the consequences for communicating it.

For these reasons it is written of the fourth virtue of evenness: you cannot make a true map of the category by drawing lines upon paper according to impulse; you must observe the joint distribution and draw lines on paper that correspond to what you see. If, seeing the category unclearly, you think that you can shift a boundary just a little to the right, just a little to the left, according to your caprice, this is just the same mistake.

And as it is written of a virtue which is nameless: perhaps your conception of rationality is that it is rational to believe the words of the Great Teacher, who lives in an area where claiming that the sky is blue would be political suicide.

And the Great Teacher says, "Some people I usually respect for their willingness to publicly die on a hill of facts, now seem to be talking as if color references are necessarily a factual statement about frequencies of light. But using language in a way you dislike, is not lying. You're not standing in defense of Truth if you insist on a word, brought explicitly into question, being used with some particular meaning." And you look up at the sky and see blue.

If you think: "It may look like the sky is blue, such that I'd ordinarily think that someone who said 'The sky is green' was being deceptive, but surely the Great Teacher wouldn't egregiously mislead people about the philosophy of language when being egregiously misleading happens to be politically convenient," you lose a chance to discover your mistake.

How will you discover your mistake? Not by comparing your description to itself.

But by comparing it to that which you did not name.

(Thanks to Jessica Taylor, Abram Demski, and Tsvi Benson-Tilsen for discussion and feedback.)

The source code of the Python script used for these calculations is available. ↩︎

In the real world, I got those numbers from the Python expression ', '.join(str(d) for d in [(round(normal(2, 4), 2), round(normal(-1, 3), 2)) for _ in range(10)]) (using scipy.random.normal). ↩︎

iceman (Review for 2021 Review)

Zack's series of posts in late 2020/early 2021 were really important to me. They were a sort of return to form for LessWrong, focusing on the valuable parts.

What are the parts of The Sequences which are still valuable? Mainly, the parts that build on top of Korzybski's General Semantics and focus hard core on map-territory distinctions. This part is timeless and a large part of the value that you could get by (re)reading The Sequences today. Yudkowsky's credulity about results from the social sciences and his mind projection fallacying his own mental quirks certainly hurt the work as a whole though, which is why I don't recommend people read the majority of it.

The post is long, but it kind of has to be. For reasons not directly related to the literal content of this essay, people seem to have collectively rejected the sort of map-territory thinking that we should bring from The Sequences into our own lives. This post has to be thorough because there are a number of common rejoinders that have to be addressed. This is why I think this post is better for inclusion than something like Communication Requires Common Interests or Differential Signal Costs, which is much shorter, but only addresses a subset of the problem.

Since the review instructions ask how this affected my thinking, well...

Zack writes generally, but he writes because he believes people are not reasoning correctly about a currently politically contentious topic. But that topic is sort of irrelevant: the value comes in pointing out that high status members of the rationalist community are completely flubbing lawful thinking. That made it thinkable that actually, they might be failing in other contexts.

Would I have been receptive to Christiano's point that MIRI doesn't actually have a good prediction track record had Zack not written his sequence on this? That's a hard counterfactual, especially since I had already lost a ton of respect for Yudkowsky by this point, in part because of the quality of thought in his other social media posting. But I think it's probable enough and this series of posts certainly made the thought more available.

johnswentworth

This comment is mostly a placeholder for a post I owe you on how words work, and in particular the "rules of the game" for socially-constructed categories. I'll just give a few quick high-level descriptions on how the models in that eventual post diverge from the models here. Apologies in advance for not-very-good explanations.

First, "clusters in thingspace" is a metaphor. There isn't really a canonical "thingspace" with pre-defined features along the axes; figuring out what features to use is most of the problem from the start. So, what is the (mathematical) concept for which "clusters in thingspace" is a metaphor? I think the main answer here is conditional independence. The defining feature of a cluster (under the usual Bayesian setup) is that the points comprising the cluster are independent given the summary statistics of the cluster itself. Likewise for the concepts to which we attach words: the concept-of-tree contains/points to all the summary data about particular trees which is relevant to other trees, making each tree independent given the tree-summary-statistics. For complicated real-world objects like trees, those summary statistics are high-dimensional and no human even knows all of them, but they're still a lot lower dimensional than all the atoms in any particular tree.

Now, we combine that with the natural abstractions hypothesis, which in this context basically says that natural-concept-space is (approximately) discrete rather than continuous. Natural concepts are not arbitrarily close together. If the concept of "tree" points to the summary statistics of a bunch of correlated chunks of the world, inducing independence between those chunks with minimal extraneous info, then there is not another arbitrarily-close summary/concept which also induces independence with minimal extraneous info. This is the key property which lets people reasonably-confidently "talk about the same thing" without needing infinitely many examples to coordinate.

So those are the rough rules governing words.

To talk about "rules of the game" for socially-constructed categories, the next step is Parable of the Dammed - I had intended to include that in the eventual post on how words work, but ended up spinning it off. The main idea there is that people can move around Schelling points by changing the underlying territory. In the case of words, the Schelling points are the natural concepts - minimal summary data which induces independence between different chunks of the world. So, the "game" is to move around those natural concepts - i.e. make some other information necessary to induce independence between world-chunks.

In particular, an interesting way to do this is to create new chunks which partially match the old natural concept. In the cluster analogy, this would mean adding new points within the cluster but biased toward one particular side. (In the trees example: we could imagine creating new tree species, or driving old tree species to extinction, or dramatically shifting the mix of trees.) Then humans get to debate whether these new points should "count" or not - the old points are still independent under the old concept-definition, but making the new points also independent requires adding new information to the concept, and some might even advocate for ignoring some of the old points and just making the concept induce independence on the new points.

As for unnatural categories...

Once we have the idea that natural concept space is discrete, and natural concepts are Schelling points for words, then the questions around unnatural categories are:

is the unnatural category a Schelling point at all?
if so, how are people recognizing the Schelling point, if not as a natural concept?

In some cases, the unnatural category might just not be a Schelling point at all. We could imagine a variant of the gold/silver bars example where people have actually-different ideas of what the words mean, so maybe there just isn't a Schelling point and people entering contracts will need to meet the legal requirements for a "meeting of minds" some other way - i.e. writing definitions out in excruciating detail.

Alternatively, the Schelling point can be established via some mechanism other than natural concepts - e.g. passing laws about what a word means, establishing norms, etc. (Though note that these mechanisms still need some way of dealing with the very high dimensionality of word-specification space; there still needs to be some efficient way to coordinate on a high-dimensional word-meaning, so it will probably eventually ground out in other natural concepts.)

[-]Zack_M_Davis3y80

(Thinking out loud about how my categorization thing will end up relating to your abstraction thing ...)

200-word recap of my thing: I've been relying on our standard configuration space metaphor, talking about running some "neutral" clustering algorithm on some choice of subspace (which is "value-laden" in the sense that what features you care about predicting depends on your values). This lets me explain how to think about dolphins: they simultaneously cluster with fish in one subspace, but also cluster with other mammals in a different subspace, no contradiction there. It also lets me explain what's wrong with a fake promotion to "Vice President of Sorting": the "what business cards say" dimension is a very "thin" subspace; if it doesn't cluster with anything else, then there's no reason we care. As my measurement of what makes a cluster "good", I'm using the squared error, which is pretty "standard" (that's basically what, say, k-means clustering is doing) but also pretty ad hoc: I don't have a proof of why squared error and only squared error is the right calculation to be doing given some simple desiderata, and it probably isn't. (In contrast, we can prove that if you want a monotonic, nonnegative, additive measure of information, you end up with entropy: the only free choice is the base of the logarithm.)

What I'm hearing from the parent and your reply to my comment on "... Ad Hoc Mathematical Definitions?": talking about looking for clusters in some pre-chosen subspace of features is getting the actual AI challenge backwards. There are no pre-existing features in the territory; rather, conditional-independence structure in the territory is what lets us construct features such that there are clusters. Saying that we want categories that cluster in a "thick" subspace that covers many dimensions is like saying we want to measure information with "a bunch of functions like , sin(Y), $e^{X} + 2 X - 1$ , &c., and require that those also be uncorrelated": it probably works, but there has to be some deeper principle that explains why most of the dimensions and ad hoc information measures agree, why we can construct a "thick" subspace.

To explain why "squiggly", "gerrymandered" categories are bad, I said that if you needed to make a decision that depended on how big an integer is, categorizing by parity would be bad: the squared-error score quantifies the fact that 2 is more similar to 3 than 12342. But notice that the choice of feature (the decision quality depending on magnitude, not parity) is doing all the work: 2 is more similar to 12342 than 3 in the mod-2 quotient space!
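The integer example can be checked directly. The following is a minimal sketch, assuming the integers 1 to 100 and two-way splits; the function name `mean_squared_error` and the `feature` parameter are hypothetical, introduced only for this illustration. It scores each categorization by how far members sit from their category's mean, first with magnitude as the feature and then with parity (the value mod 2) as the feature.

```python
# Hypothetical illustration: score two categorizations of the integers 1..100
# under two different choices of feature.

def mean_squared_error(categories, feature=lambda x: x):
    """Average squared distance from each member's feature value
    to the mean feature value of its category."""
    total, count = 0.0, 0
    for members in categories:
        values = [feature(x) for x in members]
        center = sum(values) / len(values)
        total += sum((v - center) ** 2 for v in values)
        count += len(values)
    return total / count

numbers = list(range(1, 101))

# Split by magnitude: "small" (1-50) versus "large" (51-100).
by_magnitude = [[x for x in numbers if x <= 50], [x for x in numbers if x > 50]]

# Split by parity: odd versus even.
by_parity = [[x for x in numbers if x % 2 == 1], [x for x in numbers if x % 2 == 0]]

# Feature = the integer's magnitude.
mse_magnitude = mean_squared_error(by_magnitude)
mse_parity = mean_squared_error(by_parity)

# Feature = the integer's residue mod 2 (the "mod-2 quotient space").
mod2_magnitude = mean_squared_error(by_magnitude, feature=lambda x: x % 2)
mod2_parity = mean_squared_error(by_parity, feature=lambda x: x % 2)

print(f"magnitude feature: magnitude split {mse_magnitude:.2f}, parity split {mse_parity:.2f}")
print(f"mod-2 feature:     magnitude split {mod2_magnitude:.2f}, parity split {mod2_parity:.2f}")
```

Under the magnitude feature the magnitude split scores 208.25 against 833.00 for the parity split; under the mod-2 feature the ranking flips, with the parity split scoring 0.00 against 0.25. The error formula is the same in both runs, so the ranking is decided by the feature.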

So maybe the exact measure of "closeness" in the space (squared error, or whatever) is a red herring, an uninteresting part of the problem?—like the choice of logarithm in the definition of entropy. We know that there isn't any principled reason why base 2 or base e is better than any others. It's just that we're talking about how uncertainty relates to information, so if we use our standard representation of uncertainty as probabilities from 0 to 1 under which independent events multiply, then we have a homomorphism from multiplication (of probability) to addition (of information), which means you have to pick a base for the logarithm if you want to work with concrete numbers instead of abstract nonsense.
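The homomorphism can be written out in one line. This is a sketch under the usual assumptions, with $I$ as a hypothetical name for the information measure and $b$, $b'$ as two candidate bases:

```latex
I_b(p) = -\log_b p
\quad\Longrightarrow\quad
I_b(pq) = -\log_b(pq) = -\log_b p - \log_b q = I_b(p) + I_b(q),
\qquad
I_{b'}(p) = \frac{I_b(p)}{\log_b b'} .
```

The first identity is the multiplication-to-addition property for independent events with probabilities $p$ and $q$. The second shows that switching from base $b$ to base $b'$ only rescales every value by the same constant, which is the sense in which the base is a choice of units and nothing more.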

If this is a good analogy, then we're looking for some sort of deeper theorem about "closeness" and conditional independence "and stuff" that explains why the configuration space metaphor works—after which we'll be able to show that the choice of metric on the "space" will be knowably arbitrary??

[-]johnswentworth3y120

Yup, this seems basically right.

I have a (still incomplete) draft here which specifically addresses why the configuration space metaphor works. Short version: the key property of (Bayesian) clustering is that the points in a cluster are conditionally independent given the summary data of the cluster. For instance, if I have Gaussian clusters, then each point within a given cluster is independent given the mean and variance of that cluster. The prototypical "clustering problem" is to assign points to clusters in such a way that this works. So, for instance, the Gaussian clustering problem is to assign points to clusters in such a way that the points in each cluster are independent given the cluster mean and variance. Since the Gaussian distribution is maxentropic subject to mean and variance constraints (i.e. it is the unique distribution for which mean and variance are sufficient statistics), this fully characterizes Gaussian clustering.
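A small simulation can illustrate the conditional-independence property for the Gaussian case. This is a sketch under assumptions that are not in the comment above: each trial draws its own cluster mean from a normal distribution with standard deviation 3, the within-cluster standard deviation is fixed at 1, and correlation is used as a stand-in for dependence (reasonable here because everything is jointly Gaussian). The helper `pearson` is written out by hand rather than taken from a library.

```python
import random

def pearson(xs, ys):
    """Sample Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

rng = random.Random(0)  # fixed seed so the run is repeatable

first, second, means = [], [], []
for _ in range(20000):
    cluster_mean = rng.gauss(0.0, 3.0)            # each trial gets its own cluster
    means.append(cluster_mean)
    first.append(rng.gauss(cluster_mean, 1.0))    # two points from that cluster
    second.append(rng.gauss(cluster_mean, 1.0))

# Without the summary data, one point carries information about the other.
marginal = pearson(first, second)

# Given the cluster mean, what remains of each point is unrelated noise.
residual_first = [x - m for x, m in zip(first, means)]
residual_second = [x - m for x, m in zip(second, means)]
conditional = pearson(residual_first, residual_second)

print(f"correlation without the cluster mean: {marginal:.3f}")
print(f"correlation given the cluster mean:   {conditional:.3f}")
```

With these assumed parameters the first correlation should sit near 0.9 and the second near 0. That is the pattern described above: the points in a cluster look related only until the cluster's summary data is known.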

Generalizing to the "object type discovery" problem of abstraction, we want to find sets of chunks-of-the-world which are independent given some summary statistics of the chunks. So the analogy is quite strong - in fact, Bayesian clustering isn't just an analogy, it's an example of the problem (albeit with some additional assumptions typically thrown in, e.g. about the specific forms of the clusters).

BTW, if you buy this view and figure out a good way to explain it, you are more-than-welcome to take whatever you want from that draft and scoop me on it.

[-]tailcalled3y50

Your model uses correlational notions like "conditional independence" to make sense of it. But I think one could perhaps come up with an alternate model using causal notions?

Specifically: If two variables X and Y are correlated, then they usually are so due to confounding, because there are a lot more ways that things can be confounded than that they can be causally related. So it makes sense to assume that they are confounded.

You could approximate all of the confounders of a suitably chosen set of observable variables by postulating a new variable, which affects all of the observables. This confounder then turns into your feature axis (if continuous) or cluster (if discrete).

[-]johnswentworth3y20

This is exactly right; we can interpret the abstraction model essentially along these lines as well.

[-]Zack_M_Davis3y20

So, I like this, but I'm still not sure I understand where features come from.

Say I'm an AI, and I've observed a bunch of sensor data that I'm representing internally as the points (6.94, 3.96), (1.44, -2.83), (5.04, 1.1), (0.07, -1.42), (-2.61, -0.21), (-2.33, 3.36), (-2.91, 2.43), (0.11, 0.76), (3.2, 1.32), (-0.43, -2.67).

The part where I look at this data and say, "Hey, these datapoints become approximately conditionally independent if I assume they were generated by a multivariate normal with mean (2, -1), and covariance matrix [[16, 0], [0, 9]]^[1]; let me allocate a new concept for that!" makes sense. (In the real world, I don't know how to write a program to do this offhand, but I know how to find what textbook chapters to read to tell me how.)
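For the simplest version of that step, fitting a single Gaussian to the ten points listed above, a sketch might look like the following. It only computes the sample mean and the unbiased sample covariance in plain Python; the function name `fit_gaussian` is hypothetical, and this is not a full concept-discovery procedure (it does not decide how many clusters there are, or whether a Gaussian is the right family).

```python
# The ten sensor readings from the comment above.
points = [
    (6.94, 3.96), (1.44, -2.83), (5.04, 1.1), (0.07, -1.42), (-2.61, -0.21),
    (-2.33, 3.36), (-2.91, 2.43), (0.11, 0.76), (3.2, 1.32), (-0.43, -2.67),
]

def fit_gaussian(data):
    """Return the sample mean and unbiased sample covariance of 2-D points."""
    n = len(data)
    mean = [sum(p[i] for p in data) / n for i in range(2)]
    cov = [
        [sum((p[i] - mean[i]) * (p[j] - mean[j]) for p in data) / (n - 1)
         for j in range(2)]
        for i in range(2)
    ]
    return mean, cov

mean, cov = fit_gaussian(points)
print("estimated mean:", [round(m, 3) for m in mean])
print("estimated covariance:", [[round(c, 3) for c in row] for row in cov])
```

With only ten samples and standard deviations of 4 and 3, the estimates need not land close to the stated generating parameters, so a gap between the printed mean and (2, -1) is expected and not a sign of a bug.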

But what about the part where my sensor data came to me already pre-processed into the list of 2-tuples?—how do I learn that? Is it just, like, whatever transformations of a big buffer of camera pixels let me find conditional independence patterns probably correspond to regularities in the real world? Is it "that easy"??

[-]johnswentworth3y110

Is it just, like, whatever transformations of a big buffer of camera pixels let me find conditional independence patterns probably correspond to regularities in the real world? Is it "that easy"??

Roughly speaking, yes.

Features are then typically the summary statistics associated with some abstraction. So, we look for features which induce conditional independence patterns in the big buffer of camera pixels. Then, we look for higher-level features which induce conditional independence between those features. Etc.

[-]Zack_M_Davis3y40

This gave me a blog story idea!

[-]johnswentworth3y60

It's not that LW doesn't have emoji reactions. It's just that it has to be worth a BIG emoji reaction.

[-]Zack_M_Davis2y50

"Feature Selection"

[-]tailcalled3y30

It's funny that you should mention this, because I've considered working on a machine learning system for image recognition using this principle. However, I don't think this is necessarily all of it. I bet we come pre-baked with a lot of rules for what sorts of features to "look for". To give an analogy to machine learning, there's an algorithm called pi-GAN, which comes pre-baked with the assumption that pictures originate from 3D scenes, and which then manages to learn 3D scenes from the 2D images it is trained with. (Admittedly only when the images are particularly nice.)

[-]Wei Dai3mo90

I think there is at least one other, more benign objective that "unnatural categories" are sometimes optimized for. Consider this example. Today we have electrical and fuel burning fireplaces. One day someone invents a "neural fireplace", a device that if installed in a home, remotely induces in everyone a realistic hallucination of a fireplace. Let's say that most people agree that (ignoring costs) these are close substitutes as far as their utility functions are concerned, such that people regularly say to their architects "please include a fireplace in my house" and it's assumed to mean putting in any one of the three types of devices in the building plans while minimizing overall cost. I think you'll agree that "fireplace" here is "unnatural" but also there's no deception happening?

To generalize from this, it seems that when different natural categories are close substitutes in many people's utility functions, it would make sense to assign them a common codeword, to aid communication efficiency when transmitting instructions. Given this, I think "trans women are women" isn't necessarily motivated by deception about which sex cluster someone belongs to, but is instead a signal of local values, trying to imply something like "people around here do not distinguish between trans women and cis women in their values, at least in most circumstances". This could still be a deception (it's costly to use the same codeword for two categories if they're not actually close substitutes, but it could be worth paying this price in order to hide your real values), but it would be mainly a deception about values, not about sex, and it would require investigating people's actual values to determine whether the signal is really deceptive or not.

[-]Zack_M_Davis3mo60

Right. What's "natural" depends on which features you're paying attention to, which can depend on your values. Electric, wood-burning, and neural fireplaces are similar if you're only paying attention to the subjective experience, but electric and wood-burning fireplaces form a cluster that excludes neural fireplaces if you're also considering objective light and temperature conditions.
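To make the subspace point concrete, here is a toy sketch. The feature scores are invented purely for illustration (nothing in the thread assigns numbers to fireplaces), and the names `fireplaces`, `distance`, and `pairwise` are hypothetical. It compares pairwise distances when only subjective experience is considered against distances when heat and light are included as well.

```python
import math

# Hypothetical feature scores in [0, 1], invented for illustration:
# (subjective experience, heat output, light output)
fireplaces = {
    "wood-burning": (1.0, 1.0, 1.0),
    "electric": (0.9, 0.8, 0.7),
    "neural": (1.0, 0.0, 0.0),
}

def distance(a, b, features):
    """Euclidean distance between two items over the chosen feature indices."""
    return math.sqrt(
        sum((fireplaces[a][i] - fireplaces[b][i]) ** 2 for i in features)
    )

def pairwise(features):
    """Distances for every unordered pair of fireplaces in the chosen subspace."""
    names = list(fireplaces)
    return {
        (a, b): distance(a, b, features)
        for k, a in enumerate(names)
        for b in names[k + 1:]
    }

experience_only = pairwise(features=[0])
all_features = pairwise(features=[0, 1, 2])

for pair in experience_only:
    print(pair, f"experience only: {experience_only[pair]:.2f}",
          f"all features: {all_features[pair]:.2f}")
```

In the experience-only subspace all three devices sit within 0.1 of each other, while in the full space the neural fireplace is far from the other two. Which grouping counts as the cluster depends on which features are included, and that choice is the thing to argue about on the merits.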

The thesis of this post is that people who think neural fireplaces are fireplaces should be arguing for that on the merits—that the decision-relevant thing is having the subjective experience of a fireplace, even if the hallucinations don't provide heat or light. They shouldn't be saying, "We prefer to draw our categories this way because otherwise the CEO of Neural Fireplaces, Inc. will be really sad, and he's our friend."

[-]Wei Dai3mo70

Hmm, what is the difference between these two types of arguments? I could recast the latter argument in terms of "features to pay attention to" or "what's decision relevant": If we pay attention to the feature of "things that the CEO of Neural Fireplaces wants people to treat as interchangeable to the greatest extent possible" then electric, fuel-burning, and neural fireplaces form a natural cluster. The CEO is our friend so it's highly decision relevant to consider what things he wants us to treat as interchangeable.

In the OP you talk about how redrawing categories for non-epistemic reasons would interfere with Bayesian reasoning, but that applies to the former type of argument as well. If we decide to include neural fireplaces in the "fireplace" category on the grounds that the decision-relevant thing is having the subjective experience of a fireplace, it equally interferes with Bayesian reasoning: we can no longer safely infer "generates objective light with high probability" upon hearing "fireplace", and some people may well make erroneous inferences during a transition period before everyone gets on the same page.

So hopefully I'm not being willfully obtuse, but I'm not sure what principle you're drawing on to say that the former type of argument is ok but the latter is not.

[-]Zack_M_Davis3mo40

But presumably the reason the CEO would be sad if people didn't consider neural fireplaces to be fireplaces is because he wants to be leading a successful company that makes things people want, not a useless company with a useless product. Redefining words "in the map" doesn't help achieve goals "in the territory".

The OP discusses a similar example about wanting to be funny. If I think I can get away with changing the definition of the word "funny" such that it includes my jokes by definition, I'm less likely to try interventions that will make people want to watch my stand-up routine, which is one of the consequences I care about that the old concept of funny pointed to and the new concept doesn't.

Now, it's true that, in all metaphysical strictness, the map is part of the territory. "What the CEO thinks" and "what we've all agreed to put in the same category" are real-world criteria that one can use to discriminate between entities.

But if you're not trying to deceive someone by leveraging ambiguity between new and old definitions, it's hard to see why someone would care about such "thin" categories (simply defined by fiat, rather than pointing to a cluster in a "thicker", higher-dimensional subspace of related properties). The previous post discusses the example of a "Vice President" job title that's identical to a menial job in all but the title itself: if being a "Vice President" doesn't imply anything about pay or authority or job duties, it's not clear why I would particularly want to be a "Vice President", except insofar as I'm being fooled by what the term used to mean.

[-]Wei Dai3mo40

But presumably the reason the CEO would be sad if people didn’t consider neural fireplaces to be fireplaces is because he wants to be leading a successful company that makes things people want, not a useless company with a useless product. Redefining words “in the map” doesn’t help achieve goals “in the territory”.

I see, I think this makes sense, but it depends on the CEO's actual goals/values, right? What if the CEO wants to leverage his friendships to make money, and doesn't mind people buying neural fireplaces partly or wholly out of care/sympathy for him? And everyone is (or most people are) happy to do this out of genuine care/sympathy for the CEO? In that case there is seemingly no deception involved, and redefining words “in the map” does help achieve goals “in the territory”.

Which of these two analogies is closer to the transgender situation involves empirical questions that I lack the knowledge to discuss. But it occurs to me that maybe your disagreement with Eliezer/Scott is based on you thinking that the first analogy is closer, and them thinking that the second analogy is closer? In other words, maybe they think that trans people would be happy enough with people treating them as their preferred sex/gender out of care/sympathy, and not necessarily "on the merits" in some way?

[-]DanielFilan3y50

If we just pay attention to the evidential entanglements instead of being confused by words, then there's no functional difference between saying "I reserve the right to lie p% of the time about whether something belongs to category C", and adopting a new, less-accurate category system that misclassifies p% of instances with respect to the old system.

It is true that there are versions of re-drawing the categories and of lying a proportion of the time that are functionally identical. But I think that many cases are in fact not functionally identical. Suppose I say "I've decided to use the word 'book' for what you used to call books, and also vegan chocolate bars". Well, now you're less sure what I'm talking about when I say the word 'book' and don't know whether reading or eating is the appropriate response to receiving one, but you still have a decent sense. Now, instead, suppose I say "I'm going to use the word 'book' for what you used to call books, but also a small number of random other things". Now you have much less idea what I'm talking about when I talk about 'books'! Maybe they could kill you! I could change my categorization system on the spot without contradicting what I said, and the 'fake books' could be a variety of crazy different things rather than all coming from the same old category. To me, this seems basically like a case of blatant lies being the best kind.

[-]Raemon3y50

I don’t know how cruxy this is for the main points of the post. I did find this post quite helpful overall. I quite liked that the final section explored real world examples that showed where this actually mattered.

But fwiw I mostly had not been thinking of the squiggly bordered nations as a metaphor for squiggly bordered concepts. Or, maybe I had, but I also think ‘where is this nation’s border?’ is an actually important question. The section that just dismissed that as metaphor was fairly surprising and disappointing to me, since that was an objection I’d been particularly interested in. (In particular because it was the initial framing question for the post.)

[-]Zack_M_Davis3y50

Author's Meta Note

(I continue to maintain that this is fun and basic hidden-Bayesian-structure-of-language-and-cognition stuff that shouldn't be "political", but—if we need it—the "Containment Thread on the Motivation and Political Context for My Philosophy of Language Agenda" is now available for talking about the elephant in the room.)

(I intend to eventually reply to all substantive critical comments on this post and the containment thread, but I might be very slow to respond due to life events and priorities outside of this website. Your patience is deeply appreciated.)

[-]romeostevensit3y40

Audio compression libraries aren't image compression libraries. Simulacra level 1 compression libraries aren't Simulacra level 3 libraries. So I might say compression libraries are teleologically situated. Purpose space is upstream of concept and thing-space. This leads to confusion about reductionism vs idealism but that's just because of how we're wired.

[-]tailcalled2y20

One problem with mean squared error as a measure of communication accuracy is that it gives no incentive to accurately communicate the variance around the stated outcome. That is, suppose you are creating some precise electronics equipment where it is very important that you work with pure gold. In that case, you don't just want the buyer to choose the description that minimizes the squared error (perhaps "1 kg pure gold"), but instead a description that minimizes e.g. negative log probability (perhaps "0.999-1.001 kg pure gold, 0.001 kg misc contaminants").

Negative log probability is equivalent to squared error when working with univariate normal distributions with a fixed standard deviation/variance structure. (Proof: just take the logarithm of the normal distribution pdf and reduce the expression.) However, when working with distributions with varying variances, or with non-normal distributions, talking about negative log prob could be said to give better incentives for some aspects of information.
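Spelling out the proof sketch, with $\mu$ and $\sigma$ as the usual mean and standard deviation of the stated normal distribution and $x$ as the realized outcome:

```latex
-\log p(x \mid \mu, \sigma)
  = -\log\!\left[\frac{1}{\sigma\sqrt{2\pi}}
      \exp\!\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)\right]
  = \frac{(x-\mu)^2}{2\sigma^2} + \log\!\left(\sigma\sqrt{2\pi}\right).
```

When $\sigma$ is held fixed, the second term is a constant and the first is the squared error times a positive constant, so minimizing one minimizes the other. When $\sigma$ is allowed to vary, the two terms pull in opposite directions: understating $\sigma$ inflates the first term, overstating it inflates the second. That is where the incentive to report the variance accurately comes from.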

(There are probably even better approaches if one e.g. takes into account the likely effects of the information. While your approach fails to give an incentive to accurately communicate uncertainty, it is better at incentivizing a categorization system that doesn't lump things together if they behave differently (assuming the metric chosen in the beginning is sensible). Perhaps taking into account the effects of the information could give the best of both worlds, incorporating both uncertainty and lumping.)

Most practical purposes that one might have for things probably satisfy some sort of convexity property: if 0.999 kg gold is OK, and 1.001 kg gold is OK, then 1.000 kg gold is also OK.

Insofar as this is true, restricting oneself to some notion of convex probability distribution might be useful.

[-]Slider3y00

It seems to me that the reason why the non-gerrymandered concept is preferable is connected to drill weariness being very linear in the original observables. In my own words, I would say that drill weariness is measured in "toughness", and if eggness is a good proxy for "toughness" then it might make sense to configure the drill to be "expecting" an eggness score. Whether a drill wears out seems to be indifferent to the signaling used. This makes it okay to use to judge a signaling scheme. But how would you balance the dye machine weariness against drill weariness? It would seem there would exist a signaling scheme that is optimal in weariness times the cost of replacement of the machine. And this would not be optimal with respect to only one machine type.

One could also hypothesize that toughness is squiggly in eggness. Maybe even-numbered eggnesses are tough and odd-numbered eggnesses are soft. Overusing the mean squared error might amount to assuming that small deviations in your sensory organs correspond to small deviations in your decision efficiency. In a way this might be plausible: it is easier to be a prosperous animal if you have sensory organs that make life-promoting choices easy. But in another way it is implausible. Sensory organs have trigger conditions defined with respect to their own structure, while most decisions to be made are structured differently than your bodily organs. It is rare for "see red" -> "eat", "see blue" -> "don't eat" to be the dominant strategy.

If you are a limited-vision seer that hears well (such as a bat) and deal with a four-color seer that hears less well (such as an ultraviolet-seeing bird), does that mean that bird concepts are deceptive to you as a bat? One might hope for a single set of concepts that is optimal for both bats and birds, but that might not be the case. But if the bat and the bird agree on what batness is and what birdness is, there shouldn't be disagreement about suitability; it just doesn't form a uniform ladder.

[-]Zack_M_Davis3y00

Mods: I'm confused that this isn't showing up in "Latest" even when I set "Personal Blog" to "Required" in a Private-mode window?! Could the algorithm be using "created by" date rather than "published on" date (an earlier version was sitting as a draft for a few months so I could preview some LaTeX), or ...?!

[-]DanielFilan3y20

FYI it currently shows up for me.

[-]Raemon3y20

Hmm. If you had ever published it briefly that might be why? We’ll look into it.
