Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

"she has often seen a cat without a grin but never a grin without a cat"

Let's have a very simple model. There's a boolean, $C$, which measures whether there's a cat around. There's a natural number, $L$, which counts the number of legs on the cat, and a boolean, $G$, which checks whether the cat is grinning (or not).

There are a few obvious rules in the model, to make it compatible with real life:

  • $\neg C \rightarrow (L = 0)$.
  • $\neg C \rightarrow \neg G$.

Or, in other words, if there's no cat, then there are zero cat legs and no grin.

And that's true about reality. But suppose we have trained a neural net to automatically find the values of $C$, $L$, and $G$. Then it's perfectly conceivable that something might trigger the outputs $\neg C$ and $G$ simultaneously: a grin without any cat to hang it on.
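To make this concrete, here is a minimal sketch in Python (the function and variable names are purely illustrative) of the two rules above, and an output combination that violates them:

```python
# A minimal sketch, with illustrative names: the two consistency rules
# "no cat implies zero legs" and "no cat implies no grin".

def consistent(cat: bool, legs: int, grin: bool) -> bool:
    """Return True iff the outputs respect the no-cat rules above."""
    if not cat:
        return legs == 0 and not grin
    return True

# A trained classifier can emit any combination of outputs, including the
# impossible one: a grin with no cat to hang it on.
print(consistent(cat=True, legs=4, grin=True))    # True  (an ordinary grinning cat)
print(consistent(cat=False, legs=0, grin=False))  # True  (no cat at all)
print(consistent(cat=False, legs=0, grin=True))   # False (a Cheshire grin)
```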

Adversarial examples

Adversarial examples often seem to behave this way. Take, for example, this adversarial example of a pig classified as an airliner:

Imagine that the neural net was not only classifying "pig" and "airliner", but also other features like "has wings" and "has fur".

Then the "pig-airliner" doesn't have wings and does have fur, which are features of pigs but not of airliners. Of course, you could build an adversarial model that also breaks "has wings" and "has fur", but, hopefully, the more features that need to be faked, the harder it would become.
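As a rough illustration (the feature heads, names, and expected values below are hypothetical, not taken from any actual network), such a cross-check might look like this:

```python
# A rough sketch, not a real pipeline: the expected auxiliary features for each
# class are assumptions for illustration, as are all the names below.

EXPECTED_FEATURES = {
    "pig":      {"has_wings": False, "has_fur": True},
    "airliner": {"has_wings": True,  "has_fur": False},
}

def label_is_suspicious(label: str, features: dict) -> bool:
    """Flag the prediction if any auxiliary feature contradicts the label."""
    expected = EXPECTED_FEATURES[label]
    return any(features[name] != value for name, value in expected.items())

# The "pig-airliner": the main head says airliner, but the feature heads
# still report a furry, wingless animal.
print(label_is_suspicious("airliner", {"has_wings": False, "has_fur": True}))  # True
```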

This suggests that, as algorithms get smarter, they will become more adept at avoiding adversarial examples - as long as the ultimate question is clear. In our real world, the categories of pigs and airliners are pretty sharply distinct.

We run into problems, though, if the concepts are less clear - such as what might happen to pigs and airliners if the algorithm optimises them, or how the algorithm might classify underdefined concepts like "human happiness".

Myths and dreams

Define the following booleans: $H_h$ detects the presence of a living human head, $H_b$ a living human body, $J_h$ a living jackal head, $J_b$ a living jackal body.

In our real world we generally have $H_h \leftrightarrow H_b$ and $J_h \leftrightarrow J_b$. But set the following values:

$\neg H_h$, $H_b$, $J_h$, $\neg J_b$,

and you have the god Anubis.
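In the same minimal style as before (writing $H_h$ as Hh, and so on - the names are just this post's illustrative labels), the Anubis assignment breaks both real-world biconditionals:

```python
# Same sketch style as before: real-world consistency says heads and bodies
# come as matched pairs.

def real_world_consistent(Hh: bool, Hb: bool, Jh: bool, Jb: bool) -> bool:
    return (Hh == Hb) and (Jh == Jb)

anubis = dict(Hh=False, Hb=True, Jh=True, Jb=False)  # jackal head, human body
print(real_world_consistent(**anubis))  # False: impossible in reality, easy to imagine
```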

Similarly, what is a dragon? Well, it's an entity such that the following are all true:

"is a huge reptile", "is flying", "breathes fire", and so on.
And, even though those features never go together in the real world, we can put them together in our imagination, and get a dragon.

Note that "is flying" seems more fundamental to a dragon than "has wings", hence all the wingless dragons that fly "by magic"[1]. Our imaginations seem comfortable with such combinations.

Dreams are always bewildering upon awakening, because they also combine contradictory assumptions. But these combinations are often beyond what our imaginations are comfortable with, so we get things like meeting your mother - who is also a wolf - and handing Dubai to her over the tea cups (that contain milk and fear).

"Alice in Wonderland" seems to be in between the wild incoherence of dream features, and the more restricted inconsistency of stories and imagination.


  1. Not that any real creature that size could fly with those wings anyway. ↩︎

Comments

Thanks! Good insights there. Am reproducing the comment here for people less willing to click through:

I haven't read the literature on "how counterfactuals ought to work in ideal reasoners" and have no opinion there. But as for the part where you suggest an empirical description of counterfactual reasoning in humans, I think I basically agree with what you wrote.

I think the neocortex has a zoo of generative models, and a fast way of detecting when two are compatible, and if they are, snapping them together like Legos into a larger model.

For example, the model of "falling" is incompatible with the model of "stationary"—they make contradictory predictions about the same boolean variables—and therefore I can't imagine a "falling stationary rock". On the other hand, I can imagine "a rubber wine glass spinning" because my rubber model is about texture etc., my wine glass model is about shape and function, and my spinning model is about motion. All 3 of those models make non-contradictory predictions (mostly because they're issuing predictions about non-overlapping sets of variables), so the three can snap together into a larger generative model.

So for counterfactuals, I suppose that we start by hypothesizing some core of a model ("a bird the size of an adult blue whale") and then searching out more little generative model pieces that can snap onto that core, growing it out as much as possible in different ways, until you hit the limits where you can't snap on any more details without making it unacceptably self-contradictory. Something like that...

Why do you think adversarial examples seem to behave this way? The pig equation seems equally compatible with fur or no fur recognized, wings or no wings. Indeed, it plausibly thinks the pig an airliner because it sees wings and no fur.

Then it has a wrong view of wings and fur (as well as a wrong view of pigs). The more features it has to get right, the harder the adversarial model is to construct - it's not just moving linearly in a single direction.

Surely, the adversary convinces it this is a pig by convincing it that it has fur and no wings? I don't have experience in how it works on the inside, but if the adversary can magically intervene on each neuron, changing its output by d by investing d² effort, then the proper strategy is to intervene on many features a little. Then if there are many layers, the penultimate layer containing such high level concepts as fur or wings would be almost as fooled as the output layer, and indeed I would expect the adversary to have more trouble fooling it on such low-level features as edges and dots.

I'm wondering what the thesis of this post is.

Artwork doesn't have to be about reality?

"How to think about features of models and about consistency", in a relatively fun way as an intro to a big post I'm working on.