What does deception look like from the outside? I notice I am confused.
Imagine you have some algorithm which learns to predict sensory inputs, like a human. The internal structure of this will come to correspond to the external world in some way, as a part of generating a predictive model. Imagine the algorithm is walking around a building, looking at the objects in it. Most of the objects are ordinary, made-of-atoms objects. The algorithm learns to predict where the objects are, how they interact, and basically gets very low input-prediction error.
Then it comes across a teapot. The teapot isn't a real object, but in fact a hallucination projected onto the sensory arrays of the algorithm by a daemon. At first, the daemon only has to project some visual data. But then the algorithm picks up the teapot, so the daemon must project some tactile data.
For a "successful" deception, the algorithm must not "notice" that its purely-physical model of the world is being violated.
When the algorithm decides to make a cup of tea, the daemon must not only fake the sensations of a full, brewing pot of tea, but hide the sight of the water and teabag falling onto the floor. Later, the algorithm falls over by slipping on the spilled water, so the daemon must from then on fabricate all the sensory data that the algorithm gets.
Or, the daemon could fabricate some other reason for the sensory data. The water on the floor could have leaked in from a crack in the ceiling. For this to work, the algorithm must have already been uncertain about whether there was a crack in the ceiling or not, and the daemon must have known this.
This highlights two ways you can deceive a learning algorithm. Both involve control of all the information flowing between a system (in this case the whole world) and an algorithm. The first (fabricating every downstream sensation) requires modelling the system; the second (supplying a plausible alternative cause) requires modelling the algorithm.
How else can we look at this? It seems like if there's a daemon making a fake teapot, the process of moving information world→model(world) has a discontinuity around the part of the algorithm defining a model of the teapot. The reason that a real teapot leads to a model of the teapot in the algorithm follows a different-looking causal chain to the reason a daemon leads to a model of the teapot in the algorithm.
This discontinuity defines a region which grows as the fake object interacts with the real ones. This can be expanded to fill the whole of the system external to the algorithm, or be contracted to zero. I hypothesize that these are the only two stable states of the region; every other boundary is broken by leaky abstractions.
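As a rough sketch of how that region might grow, assume the world can be drawn as a graph of objects, where an edge means "can causally affect". Everything reachable from the fake object then eventually needs faking. The object names and the `fabricated_region` helper are made up for illustration, and the graph picture itself is an assumption, not a claim about how a real learner carves up the world.

```python
from collections import deque


def fabricated_region(interactions, fake_objects):
    """Return every object whose sensory data the daemon must eventually fake.

    interactions: dict mapping each object to the objects it can causally affect.
    fake_objects: the objects the daemon starts out hallucinating.
    """
    region = set(fake_objects)
    queue = deque(fake_objects)
    while queue:
        obj = queue.popleft()
        for neighbour in interactions.get(obj, ()):
            if neighbour not in region:
                # Anything touched by a fake object inherits the fakeness.
                region.add(neighbour)
                queue.append(neighbour)
    return region


# Hypothetical toy world loosely based on the tea example.
world = {
    "teapot": ["water", "teabag"],
    "water": ["floor"],
    "teabag": ["floor"],
    "floor": ["algorithm_body"],
    "algorithm_body": ["all_later_sensations"],
    "all_later_sensations": [],
    "painting": [],  # never interacts with the teapot, so it stays real
}

region = fabricated_region(world, ["teapot"])
print(sorted(region))
```

With no fake objects the region is empty, and with one fake object it swallows everything downstream of it. In this toy picture the only way to stop partway is to cut an edge, which is what the crack-in-the-ceiling story does.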
The reason this feels like deception is that we also have a bit of self-knowledge which looks like model(world→model(world)). When we're deceived this is violated on a specific, local scale. Hence the discontinuity in world→model(world).
Whether or not this discontinuity can be distinguished from other discontinuities in the world→model(world) function, I'm not sure. At first glance it seems different, but I notice I am confused.
If these properties hold for everything humans would consider deception, this is important. One problem with translating ontologies (which is particularly relevant for the Eliciting Latent Knowledge problem) is how to identify deception. If for any learning algorithm we can specify a single world-plus-algorithm state which has no deception, and specify which processes for changing that into other world-plus-algorithm states preserve the no-deception property, then we can specify a "natural ontology" for that learning algorithm.
This reminds me of the proverb:
Oh what a tangled web we weave, when first we practice to deceive.
The more the algorithm probes the deception, the more effort the deceiver has to put into maintaining the illusion. Reality is completely self-consistent, so the accumulation of enough evidence into a sufficiently coherent world model will always resolve these types of deceptions if no further effort is applied to maintain them.
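A minimal Bayesian sketch of that accumulation, assuming each probe either looks consistent or shows a "glitch". The glitch probabilities and the function name are illustrative assumptions. The small nonzero glitch rate under the real world stands in for sensor noise rather than for reality being inconsistent.

```python
def update_p_deceived(prior, glitch, p_glitch_if_fake=0.3, p_glitch_if_real=0.01):
    """One Bayes update on the hypothesis 'this object is being faked'.

    prior: current probability of deception.
    glitch: True if the probe returned something the physical model forbids.
    The two likelihoods are assumed values chosen only for illustration.
    """
    like_fake = p_glitch_if_fake if glitch else 1 - p_glitch_if_fake
    like_real = p_glitch_if_real if glitch else 1 - p_glitch_if_real
    numerator = prior * like_fake
    return numerator / (numerator + (1 - prior) * like_real)


# Three probes in a row hit an unmaintained part of the illusion.
beliefs = [0.01]
for _ in range(3):
    beliefs.append(update_p_deceived(beliefs[-1], glitch=True))
print(beliefs)
```

Under these assumed numbers a handful of glitches is enough to overturn a strong prior, so the deceiver has to keep the glitch rate near the noise floor on everything the algorithm might probe.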
When I think of deception, though, I typically imagine a deceiver trying to create a strategic mismatch between their beliefs and another agent's beliefs. This information asymmetry is typically over something that the deceiver does not expect the agent to be able to investigate easily. It then allows them to get away with something that is in the deceiver's interest but against (what they think is probably) the other agent's interest.
It's usually something like a claim about the health benefits of snake oil (where the agent lacks the time or the resources to perform or look up a randomized controlled trial), rather than a claim about the existence of a physical object that's visible and within reach.
Maybe magic tricks go that route, though. Although those sorts of deceptions seem to be more about creating a superstimulus for the audience's curiosity drive, since audience members know that their senses are being deceived and yet go out of their way to experience that deception.
You make some really excellent points here.
The teapot example is atypical of deception in humans, and was chosen to be simple and clear-cut. I think the web-of-lies effect is hampered in humans by a couple of things, both of which result from us only being approximations of Bayesian reasoners. One is the limits to our computation: we can't go and check a new update that "snake oil works" against all possible connections. Another part (which is also linked to computation limits) is that I suspect a small enough discrepancy gets rounded down to zero.
So if I'm convinced that "snake oil is effective against depression", I don't necessarily check it against literally all the beliefs I have about depression, which limits the spread of the web. Or if it only very slightly contradicts my existing view of the mechanism of depression, that won't be enough for me to update the existing view at all, and the difference is swept under the rug. So the web peters out.
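Here is a toy sketch of those two limits, assuming beliefs form a graph and that a discrepancy shrinks by a fixed factor at each hop. The belief names, the `attenuation`, `threshold` and `budget` parameters, and the halving rule are all assumptions made up for illustration.

```python
from collections import deque


def spread_of_lie(links, start, discrepancy, attenuation=0.5, threshold=0.1, budget=10):
    """Return the beliefs that actually get revised after adopting a false one.

    links: dict mapping a belief to the beliefs it bears on.
    discrepancy: how strongly the new belief clashes with the old model.
    threshold: discrepancies smaller than this are rounded down to zero.
    budget: maximum number of consistency checks the reasoner can afford.
    """
    revised = {}
    frontier = deque([(start, discrepancy)])
    checks = 0
    while frontier and checks < budget:
        belief, size = frontier.popleft()
        checks += 1
        if size < threshold or belief in revised:
            continue  # swept under the rug, or already handled
        revised[belief] = size
        for neighbour in links.get(belief, ()):
            frontier.append((neighbour, size * attenuation))
    return revised


# Hypothetical belief graph.
links = {
    "snake oil works": ["depression mechanism", "pharmacology"],
    "depression mechanism": ["neuroscience"],
    "pharmacology": ["chemistry"],
    "neuroscience": ["physics"],
    "chemistry": ["physics"],
}

bounded = spread_of_lie(links, "snake oil works", discrepancy=0.3)
ideal = spread_of_lie(links, "snake oil works", discrepancy=0.3, threshold=0.0, budget=100)
print(sorted(bounded), sorted(ideal))
```

With the rounding threshold in place the lie only disturbs its immediate neighbours. With no threshold and a generous budget it reaches every connected belief, which is closer to the idealised Bayesian case.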
Of course the main reason snake oil salesmen work is because they play into people's existing biases.
But perhaps more importantly:
This information asymmetry is typically over something that the deceiver does not expect the agent to be able to investigate easily.
This to me seems like regions where the function world→model(world) just isn't defined yet, or is very fuzzy. This means rather than a web of lies we have some lies isolated from the rest of the model by a region of confusion. This means there is no discontinuity in the function, which might be an issue.
First, it seems worthwhile to try taboo-ing the word 'deception' and see whether the process of building precision to re-define it clears up some of the confusion. In particular, it seems like there's some implicit theory-of-mind stuff going on in the post and in some of the comments. I'm interested if you think the concept of 'deception' in this post only holds when there is implicit theory-of-mind going on, or otherwise.
As a thought experiment for a non-theory-of-mind example, let's say the daemon doesn't really understand why it gets a high reward for projecting the image of a teapot (and then also doing some tactile projection somehow later) but it thinks that this is a good way to get high reward / meet its goals / etc. It doesn't realize (or at least doesn't "know" in any obviously accessible way) that there is another agent/observer who is updating their models as a result of this. Possibly, if it did know this, it would not project the teapot, because it has another component of its reward function of "don't project false stimulus to other observing agents".
In this thought experiment, is the daemon 'deceiving' the observer? In general, is it possible to deceive someone who you don't realize is there? (Perhaps they're hiding behind a screen, and to you it just looks like you're projecting a teapot to an empty audience)
Aside: I think there are some interesting alignment problems here, but a bunch of our use of language around the concepts of deception hasn't updated to a world where we're talking about AI agents.
I think there's also some theory of mind used by us outside observers when we label some events "deception."
We're finding it easy to predict what the "evil demon" will do by using a simple model that treats it as an agent trying to affect the beliefs of another agent. If the stuff in the environment didn't reward the intentional stance like this, we wouldn't see the things as agents, let alone see them as doing deception.
The difference between deception and other types of error is the adversary. All modeling is lossy - our beliefs about the world don't completely match the world, and never can. When inputs trigger our matching algorithms for one thing but are actually another, we can learn wrong things. For natural environments, these are generally pretty lightweight - a cloud that looks like a teapot isn't going to fool us. A broken teapot might, but might not - it'd depend on the tests we try.
For adversarial cases, where someone/something is TRYING to fool us, a whole lot depends on the level of sophistication in their model of us, and in our model of the world (including what types of deception we should be on the lookout for). A much smarter entity than you can model whether you'll actually try to use the teapot, and only make a good fake if needed, using just flat images for background and things you won't bother to check.
As other commenters alluded to already, deception is an adversarial attack on the world model of another observer - providing another observer with inputs specifically crafted to introduce a desired error into their world model. Obviously, this is easier to accomplish when the attacker has a good insight into what the victim's world model was to begin with, how they think, what they can and cannot observe, etc. (all other things being equal).