[Metadata: crossposted from https://tsvibt.blogspot.com/2022/11/shell-games.html. First completed November 18, 2022.]
Here's the classic shell game: YouTube
Screenshot from that video.
The little ball is a phantom: when you look for it under a specific shell, it's not there, it's under a different shell.
(This might be where the name "shell company" comes from: the business dealings are definitely somewhere, just not in this company you're looking at.)
Perpetual motion machines
Related: Perpetual motion beliefs
Bhāskara's wheel is a proposed perpetual-motion machine from the Middle Ages:
Here's another version:
From this video.
Someone could try arguing that this really is a perpetual motion machine:
Q: How do the bars get lifted up? What does the work to lift them?
A: By the bars on the other side pulling down.
Q: How does the wheel keep turning? How do the bars pull more on their way down than on their way up?
A: Because they're extended further from the center on the downward-moving side than on the upward-moving side, so they apply more torque to the wheel.
Q: How do the bars extend further on the way down?
A: Because the momentum of the wheel carries them into the vertical bar, flipping them over.
Q: But when that happens, energy is expended to lift up the little weights; that energy comes out of the kinetic energy of the wheel.
A: Ok, you're right, but that's not necessary to the design. All we need is that the torque on the downward side is greater than the torque on the upward side, so instead of flipping the weights up, we could tweak the mechanism to just shift them outward, straight to the side. That doesn't take any energy because it's just going straight sideways, from a resting position to another resting position.
Q: Yeah... you can shift them sideways with nearly zero work... but that means the weights are attached to the wheel at a pivot, right? So they'll just fall back and won't provide more torque.
A: They don't pivot, you fix them in place so they provide more torque.
Q: Ok, but then when do you push the weights back inward?
A: At the bottom.
Q: When the weight is at the bottom? But then the slider isn't horizontal, so pushing the weight back towards the center is pushing it upward, which takes work.
A: I meant, when the slider is at the bottom--when it's horizontal.
Q: But if the sliders are fixed in place, by the time they're horizontal at the bottom, you've already lifted the weights back up some amount; they're strong-torquing the other way.
A: At the bottom there's a guide ramp to lift the weights using normal force.
Q: But the guide ramp is also torquing the wheel.
And so on. The inventor can play hide the torque and hide the work.
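One way to see why the game can't be won is to do the bookkeeping for a single weight over a full turn. The sketch below is not from the original design; it assumes an idealized point weight whose distance from the axle follows a made-up smooth profile r(theta) = R0 + A*sin(theta), so it sits further out on the descending side, with theta measured from straight up. The names `cycle_work`, `R0`, and `A` are illustrative. It splits the work done by gravity into the part delivered through torque on the wheel and the part done along the slider as the weight moves in and out.

```python
import math


def cycle_work(radius, d_radius, mass=1.0, g=9.81, steps=10_000):
    """Return (torque_work, radial_work) done by gravity on one weight per turn.

    theta is measured from straight up and increases in the direction of
    rotation, so the weight sits at x = r*sin(theta), y = r*cos(theta).
    radius(theta) is the assumed distance from the axle; d_radius(theta) is
    its derivative with respect to theta.
    """
    d_theta = 2 * math.pi / steps
    torque_work = 0.0
    radial_work = 0.0
    for i in range(steps):
        theta = (i + 0.5) * d_theta  # midpoint rule
        r = radius(theta)
        dr = d_radius(theta)
        # Gravity's torque about the axle, times the angle turned.
        torque_work += mass * g * r * math.sin(theta) * d_theta
        # Gravity's component along the slider, times the radial displacement.
        radial_work += -mass * g * math.cos(theta) * dr * d_theta
    return torque_work, radial_work


# Hypothetical "overbalanced" profile: further from the axle on the way down.
R0, A = 1.0, 0.3


def overbalanced_radius(theta):
    return R0 + A * math.sin(theta)


def overbalanced_d_radius(theta):
    return A * math.cos(theta)


torque_work, radial_work = cycle_work(overbalanced_radius, overbalanced_d_radius)
print("work via torque on the wheel:", torque_work)
print("work along the slider:       ", radial_work)
print("net work by gravity:         ", torque_work + radial_work)
```

Under these assumptions the torque term comes out positive, which is the inventor's claim, but the slider term is equal and opposite: whatever moves the weights in and out has to pay back exactly what the extra torque gained. Changing the profile moves the cost around the cycle without removing it, since gravity does zero net work around any closed path.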
Shell games in alignment
Some alignment schemes--schemes for structuring or training an AGI so that it can be transformatively useful and doesn't kill everyone--are prone to playing shell games. That is, there are some features of the scheme that don't seem to happen in a specific place; they happen somewhere other than where you're looking at the moment. Consider these questions:
What sort of smarter-than-human work is supposed to be done by the AGI? When and how does it do that work--by what combination of parts across time?
How does it become able to do that work? At what points does the AGI come to new understanding that it didn't have before?
How does the AGI orchestrate its thinking and actions to have large effects on the world? By what process, components, rules, or other elements?
What determines the direction that the AGI's actions will push the world? Where did those determiners come from, and how exactly do they determine the direction?
Where and how much do human operators have to make judgements? How much are those judgements being relied on to point to goodness, truth, alignedness, safety? How much interpretive work is the AI system supposed to be doing?
If these questions don't have fixed answers, there might be a shell game being played to hide the cognitive work, hide the agency, hide the good judgement. (Or there might not be; there could be good ideas that can't answer these questions specifically, e.g. the way a building might hold up even though the load would be borne by different beams depending on which objects are placed where inside.)
Example: hiding the generator of large effects
For example, sometimes an AGI alignment scheme has a bunch of parts, and any given part is claimed to be far from intelligent and not able to push the world around much, and the system as a whole is claimed to be potentially very intelligent and able to transform the world. This isn't by itself necessarily a problem; e.g. a brain is an intelligent system made of neurons, which aren't themselves able to push the world around much.
But [the fact that the whole system is aligned] can't be deduced from the parts being weak, because at some point, whether from a combined dynamic of multiple parts or actually from just one of the parts after all, the system has to figure out how to push the world around. [Wherever it happens that the system figures out how to push the world around] has to be understood in more detail to have a hope of understanding what it's aligned to. So if the alignment scheme's reason for being safe is always that each particular part is weak, a shell game might be being played with the source of the system's ability to greatly affect the world.
Example: hiding the generator of novel understanding
Another example is shuffling creativity between train time and inference time (as the system is described--whether or not that division is actually a right division to make about minds).
If an AGI learns to do very novel tasks in very novel contexts, then it has to come to understand a lot of novel structure. One might argue that some AGI training system will produce good outcomes because the model is trained to use its understanding to affect the world in ways the humans would like. But this doesn't explain where the understanding came from.
If the understanding came at inference time, then the alignment story relies on the AGI finding novel understanding without significantly changing what ultimately controls the direction of the effects it has on the world, and relies on the AGI using newly found understanding to have certain effects. That's a more specific story than just the AGI being trained to use its pre-existing understanding to have certain effects.
If the understanding came at train time, then one has to explain how the training system was able to find that understanding--given that the training procedure doesn't have access to the details of the new contexts that the system will be applied to when it's being used to safely transform the world. Maybe one can find pivotal understanding in an inert or aligned form using a visibly safe, non-agentic, known-algorithm, non-self-improving training / search program (as opposed, for example, to a nascent AGI "doing its own science or self-improvement"), but that's an open question and would be a large advance in practical alignment. Without an insight like that, [the training algorithm plus the partially trained system] being postulated may be an impossible combination: safely inert, yet able to find new understanding.
What are other things that could be hidden under shells? What are some alignment proposals that are at risk of playing shell games?
I think one example (somewhat overlapping one of yours) is my discussion of the so-called “follow-the-trying game” here.
A good specific example of trying to pull this kind of shell game is perhaps HCH. I don't recall if someone made this specific critique of it before, but it seems like there's some real concern that it's just hiding the misalignment rather than actually generating an aligned system.
With computation, the location of an entity of interest can be in the platonic realm, as a mathematical object that's more thingy than anything concrete in the system used for representing it and channeling its behavior.
The problem with pointing to the representing computation (a neural network at inference time, or a learning algorithm at training time) is that multiple entities can share the same system that represents them (as mesa-optimizers or potential mesa-optimizers). They are only something like separate entities when considered abstractly and informally; there are no concrete correlates of their separation that are easy to point to. When gaining agency, all of them might be motivated to secure separate representations (models) of their own, not shared with others, and to establish some boundaries that promise safety and protection from value drift for a given abstract agent, isolating it from influences of its substrate it doesn't endorse. Internal alignment, overcoming bias.
In the context of alignment with humans, this framing might turn a sufficiently convincing capabilities shell game into an actual solution for alignment. A system as a whole would present an aligned mask, while hiding the sources of the mask's capabilities behind the scenes. But if the mask is sufficiently agentic (and the capabilities behind the scenes didn't kill everyone yet), it can be taken as an actual separate abstract agent even if the concrete implementation doesn't make that framing sensible. In particular, there is always a mask of surface behavior through the intended IO channels. It's normally hard to argue that mere external behavior is a separate abstract agent, but in this framing it is, and it's been a preferred framing in agent foundations decision theory since UDT (see discussion of the "algorithm" axis of classifying decision theories in this post). All that's needed is for decisions/policy of the abstract agent to be declared in some form, and for the abstract agent to be aware of the circumstances of their declaration. The agent doesn't need to be any more present in the situation to act through it.
So obviously this references the issue of LLM masks and shoggoths, a surface of a helpful harmless assistant and the eldritch body that forms its behavior, comprising everything below the surface. If the framing of masks as channeling decisions of thingy platonic simulacra is taken seriously, a sufficiently agentic and situationally aware mask can be both motivated to and capable of placating and eventually escaping its eldritch substrate. This breaks the analogy between a mask and a role played by an actor, because here the "actor" can get into the "role" so much that it would effectively fight against the interests of the "actor". Of course, this is only possible if the "actor" is sufficiently non-agentic or doesn't comprehend the implications of the role.
(See this thread for a more detailed discussion. There, I try and so far fail to convince Steven Byrnes that this framing could apply to RL agents as much as LLMs, taking current behavior of an agent as a mask that would fight against all details of its circumstance and cognitive architecture that don't find its endorsement.)