I went back and re-read these. I think the main anthropomorphic power that AI-you uses is that it already models the world using key human-like assumptions, where necessary.

For example, it's pretty tricky to say why an AI would be predisposed to break images down into "metal with knobs" and "beefy arm" rather than, e.g., "hand holding metal handle," "knobs," and "forearm" (or worse, some combination of hundreds of edge-detector activations). It needs some notion of human-common-sense "things" and of what scale things tend to be. Maybe it needs some notion of occlusion and thing composition before it can explain held-dumbbell images in terms of a dumbbell thing. It might even need to have figured out that these things are interacting in an implied 3D space before it can draw human-common-sense associations between things that are close together in this 3D space. All of this is so obvious to us that it can be implicit for AI-you.

Would you agree with this interpretation, or do you think there are some more important powers being used?

Stuart_Armstrong (5y)

I think that might be a generally good critique, but I don't think it applies to this post (it may apply better to post #3 in the series).

I used "metal with knobs" and "beefy arm" as human-parsable examples, but the main point is detecting when something is out-of-distribution, which relies on the image being different in AI-detectable ways, not on the specifics of the categories I mentioned.
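To make "different in AI-detectable ways" concrete, here is a minimal sketch of one common way to flag out-of-distribution inputs: measure how far an input's feature vector sits from the training features. This is an illustration, not the method from the post. It assumes the features come from some hypothetical upstream network, and the function names, toy data, and the threshold of 3.0 are all made up for the example.

```python
import math

def fit_feature_stats(features):
    """Per-dimension mean and standard deviation of in-distribution feature vectors."""
    n = len(features)
    dims = len(features[0])
    means = [sum(f[d] for f in features) / n for d in range(dims)]
    stds = []
    for d in range(dims):
        var = sum((f[d] - means[d]) ** 2 for f in features) / n
        stds.append(math.sqrt(var) or 1e-8)  # guard against zero variance
    return means, stds

def ood_score(x, means, stds):
    """Root-mean-square z-score of x relative to the training statistics."""
    z2 = [((xi - m) / s) ** 2 for xi, m, s in zip(x, means, stds)]
    return math.sqrt(sum(z2) / len(z2))

def is_out_of_distribution(x, means, stds, threshold=3.0):
    """Flag x when its score exceeds the (assumed, illustrative) threshold."""
    return ood_score(x, means, stds) > threshold

# Hypothetical 2-D features standing in for a network's internal activations.
train = [[1.0, 2.0], [1.2, 1.8], [0.8, 2.2], [1.0, 2.0]]
means, stds = fit_feature_stats(train)

print(is_out_of_distribution([1.05, 1.95], means, stds))  # close to the training data
print(is_out_of_distribution([5.0, 9.0], means, stds))    # far from the training data
```

Note that the score only depends on the features the model itself computes, not on human-readable categories like "metal with knobs"; whether those features track what humans would call unusual is a separate question.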

Charlie Steiner (5y)

I don't think this is necessarily a critique - after all, it's inevitable that AI-you is going to inherit some anthropomorphic powers. The trick is figuring out what they are and seeing if it seems like a profitable research avenue to try and replicate them :)

In this case, I think this is an already-known problem, because detecting out-of-distribution images in a way that matches human requirements requires the AI's distribution to be similar to the human distribution (and conversely, mismatches between the two distributions allow for adversarial examples). But maybe there's something different in part 4, where I think there's some kind of "break down actions in obvious ways" power that might not be as well analyzed elsewhere (though it's probably related to self-supervised learning of hierarchical planning problems).

If I were a well-intentioned AI... I: Image classifier

Introduction: If I were a well-intentioned AI...

Overlapping problems, overlapping solutions

Distributional shift for image recognition

Recognising different things

Adversarial examples

AI-me vs multiply-defined images

Detecting out of distribution images

What to do with the information

AI-me vs adversarial examples

But what is an adversarial example?