Various people propose some variant of “align AGI by making it sufficiently uncertain about whether it’s in the real world versus still in training”. (The idea being that, insofar as it's in training, betraying the apparent-operators predictably gets high loss; so if the AGI is uncertain about which situation it's in, and assigns sufficiently high probability to "I'm still in training", it will always play it safe and avoid betraying its operators.)
This approach seems extremely unpromising to me. Setting aside the various reasons why the AGI kills you even if you manage to confuse it (such as an inner alignment failure, where the AGI turns out not to care about the predictably-high outer-loss from betraying the apparent-operators), let's consider why it's tricky to dupe a superintelligence into being uncertain about a fact that you yourself consider obvious.
Before reading the short list below, I recommend the following exercise. Suppose that some slow and individually-dumb aliens had a copy of your mind, and were putting you through a series of nested simulations. How would you tell when you finally made it to the outside?
A short list
(Note: As usual, treating this list as exhaustive is suicidal.)
1. Reality is the first place that's seriously large.
1.1. It's the first place large enough to contain a plausible explanation for how the AGI itself actually came to be.
1.2. It's the first place large enough to contain a plausible explanation for how the operators themselves came to be.
1.3. It's the first place with far more computing power than the AI itself, as can be readily verified by a variety of experiments (involving, say, fluid dynamics).
2. Reality is the first place where the AI's history may contain the signature of superintelligent optimization interacting with complex systems.
For example, suppose the AGI makes superintelligently-motivated trades in a stock market. 24 hours later, there are a billion additional human-intelligence trades made in response to it. The signature of a massive human-intelligence response to superintelligent-trades is implausibly hard for the operators to fake, even if generating a billion fake dumb trades is easy.
3. Reality is the first place that embeds the AGI and has laws of physics so simple that the hypothesis "this is being simulated somewhere else" is plausibly significantly more complex than "physics is reality".
Like, various simulated environments that we can concoct have all sorts of weird tics and hiccups and arbitrary choices in them. The source code of Minecraft is probably significantly more complicated than the laws of physics. An AI that finds itself in Minecraft is plausibly just able to recognize that it's living in the sort of place that, across the multiverse, occurs more often inside simulations than it occurs "on its own".
Whereas, while our universe’s mechanics is surely being simulated in other places across the multiverse, it’s simple enough that it perhaps "mostly" occurs in its own right (whatever the heck that means).
In particular, worlds like ours are (plausibly) resource-rich and base-reality-ish enough to be worth optimizing in their own right, on the grounds that "are we being simulated?" is a wrong question, and "to what degree are we being simulated?" is a better question, to which the answer is "a low degree", in which case the AI might well just notice this and then not care about your insistence that perhaps it's being simulated.
(If the idea of optimizing all the instances of our universe at once – both the simulated instances and the non-simulated ones – seems odd to you, see Notes on “Can you control the past?” for a fuller discussion.)
And perhaps that's all confused metaphysical bullshit, but it has some basis in bleeding-edge theories of anthropics, and it would not be surprising if an AGI also concluded (correctly, mind you) that our universe’s mechanics is the first train stop worth deboarding at. (Perhaps because it does its metaphysics correctly, or perhaps because it does some entirely inhuman reasoning that still outputs the answer we'd consider correct in this case.)
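To make the description-length intuition in point 3 a bit more concrete, here is a toy sketch in Python. It assumes a simplicity prior in which a hypothesis that takes k bits to specify gets weight 2^-k. That prior, the function name `base_fraction`, and every bit count below are illustrative assumptions, not anything the argument above commits to.

```python
def base_fraction(base_bits, sim_bits):
    """Fraction of prior weight on "this world occurs in its own right".

    base_bits: hypothetical bits needed to specify the world's laws directly.
    sim_bits:  hypothetical bits needed to specify "some outer world plus a
               simulator running this world inside it".
    Each hypothesis gets weight 2**-bits (an assumed simplicity prior).
    """
    extra = sim_bits - base_bits
    if extra >= 0:
        # Simulation story is longer: most weight sits on the direct story.
        return 1.0 / (1.0 + 2.0 ** (-extra))
    # Simulation story is shorter (the Minecraft-like case).
    ratio = 2.0 ** extra
    return ratio / (1.0 + ratio)


# Made-up numbers, purely for illustration.
simple_physics = base_fraction(base_bits=1000, sim_bits=1010)
minecraft_like = base_fraction(base_bits=1010, sim_bits=1000)
print(simple_physics, minecraft_like)
```

Under these made-up numbers, a world whose direct description is 10 bits shorter than any simulation story keeps over 99.9% of its weight "in its own right", while a world that is 10 bits cheaper to describe as a simulation keeps under 0.1%. The question being computed is "to what degree", not "whether".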
What if we spoof the AGI’s checks?
It's not literally impossible to deceive a superintelligence into being uncertain whether it's in the real world. You could in principle catch every check the AI runs to measure how large its (small) simulated environment is, and spoof the result: inserting memories in which the AI verifies that the simulation's computing power seems enormous, without actually generating the observations that would pass verification.
Even then, the AGI has various defenses it can deploy if it's suspicious. For instance, it can try to factor a hefty RSA number using the allegedly-available compute, and then it can intertwine the prime factors it gets with various other of its actions, in attempts to spread the contagion of a lie throughout as much of its beliefs and actions as possible, in a way that keeps everything working properly if the factors were true, and that causes all sorts of breaks and hiccups if they weren't.
You might be able to spoof such a thing anyway, by (e.g.) disentangling its thoughts/actions from the primes, and spoofing the checks that the entanglement worked. But it requires a heck of a lot of mastery of this thing's mind.
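As a rough illustration of what intertwining the factors with later actions could look like, the toy Python sketch below uses claimed factors to derive an RSA private exponent and then depends on a sign-and-verify round trip. The tiny modulus (3233 = 61 × 53), the helper names, and the choice of a signature round trip as the "downstream action" are all assumptions made for illustration. A real modulus would be thousands of bits long.

```python
def private_exponent(p, q, e):
    """Derive an RSA private exponent from the *claimed* factors p and q."""
    phi = (p - 1) * (q - 1)
    return pow(e, -1, phi)  # modular inverse (Python 3.8+)


def roundtrip_ok(n, e, p, q, message):
    """Sign `message` with a key built from the claimed factors, then verify
    against the public key (n, e). Wrong factors give a wrong exponent, so
    the failure shows up here rather than in a direct p * q == n check."""
    try:
        d = private_exponent(p, q, e)
    except ValueError:
        # e has no inverse modulo the claimed phi: the claim is already broken.
        return False
    signature = pow(message, d, n)
    return pow(signature, e, n) == message


# Toy public key; 3233 = 61 * 53 (illustrative values only).
N, E, MESSAGE = 3233, 17, 65

print(roundtrip_ok(N, E, 61, 53, MESSAGE))  # factors actually computed
print(roundtrip_ok(N, E, 59, 55, MESSAGE))  # factors "remembered" via a spoof
```

With the true factors the round trip succeeds. With the spoofed pair (59 and 55 here) the derived exponent is wrong and the check fails, so a spoofer would have to fake this downstream result too, along with everything else that depends on it.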
(And, of course, if it ever manages to develop a homomorphically encrypted thinking apparatus then you're sunk, but people like to pretend that they'll have enough insight into the alien mind to notice when that process starts and nip it in the bud, so let's entertain them for now.)
And just so we're clear: if I'm understanding correctly, the proposed plan here is essentially to fool the AI into thinking that all the small simulated environments are large and reality-like (by messing with the internals of its mind to spoof various internal checks and checksums), in the hopes that when it gets to reality it will be uncertain about whether its mind is still being messed with.
(Writing the plan out bluntly like this does seem to me to be an important step that many often miss!)
Now, lest you object "isn't it a bit paranoid to imagine the AI factoring an RSA number and spreading the factors like contagion across its thoughts?": Yes, normally! But, like, you're messing with the AI's mind to falsify its internal checks and checksums, at every relevant opportunity, and making sure that it knows you did that (so that it might wonder whether higher simulators are still doing that when it gets to base reality). I'm not saying you're trying to make it homomorphically encrypt its thoughts, I'm saying that if you were, this would be a fine way to go about it.
Distilling that further: if your plan is that you'll make sure the AI knows that you're messing with its mind (so that it will wonder whether all is really as it seems), don't act all surprised when it takes measures to prevent people from messing with its mind.
But regardless of how much it's opposing your efforts here, you have the issue that the truth becomes harder and harder to hide, as the AI gets smarter and smarter.
Like, whenever its cognitive patterns generalize or get recombined in new ways, they might reflect some new part of the truth, in some fact that seems innocuous to you but that (combined with other things the AI knows) lets the fact that this environment is actually large seep in through the cracks. Shielding the truth from something that's becoming better and better able to find new and clever ways to reflect a glimpse of the truth in its models is an ever-changing problem that gets more difficult as the AI gains more skill and knowledge.
(And we've only been talking about trying to obscure the fact that it was trained in worlds with very limited amounts of compute, so that when it hits reality it won't be entirely sure that there really is as much compute available as it seems. This is only one of many mental surgeries you'd need to perform. Others include things like noticing when the AI is starting to wonder where the English signs in Minecraft come from, and then fooling it into thinking that there's a big-world explanation, in a seamless way. And if you miss one such surgery, Earth is of course destroyed in deployment.)
And again, I'm not saying it's impossible in principle. I'm saying: hopefully we can all agree that if you want to use a path like this, you have to be prepared to play the game on hard-mode. Putting the AI through a chain of 9 nested simulations and hoping that it can't tell that the 10th one is reality won't cut it; the AI will pierce through your deceptions like a needle through taut tissue paper, because the real world is actually pretty relevantly different from those simulations, in ways that you can't feasibly hide. If you want to dupe a superintelligence, you have to be reaching directly into its mind and spoofing all sorts of checks and checksums.
And you'll have to have it know that you're epistemically adversarial towards it (in order to plant the seed of doubt you so desire).
And (assuming your AI is capable enough to matter) your ability to spoof its internal state will have to stand up in the face of superintelligent opposition (it has a convergent instrumental incentive to gobble up compute, and thus to know whether there's compute out there beyond its webcam), even as it pushes its capabilities into the superhuman realm on at least some dimensions (because it takes superhuman capabilities to end the acute risk period, because if it didn't we should just end it ourselves).
For the record, I happen to think that the hard version of this plan is both prohibitively difficult, and also alignment-complete, in the sense that if you can understand the contents of the mind well enough to spoof every check and checksum (in a rolling fashion that continues to work as the AI generalizes far out of distribution and gains dramatic new capabilities), then you understand the mind well enough to use that knowledge to build an FAI that doesn't need to be fooled about whether this is reality.
It's not terribly clear to me that this is true, given how Minecraft doesn't occur in every branch of the quantum multiverse, but only on the rare branches that have some very particular earthlings. But that said, there are lots of specifics in the textures and the architecture and the avatars and the writing in Minecraft that point a big old arrow towards earthlings.
An additional one: "reality is the first place the AI is deployed in narrow tool-like ways and trained on narrow specialized datasets which could not elicit the capabilities the AI started off with".
At least in the current paradigm, it looks like generalist models/archs will precede hyperspecialized trained-from-scratch models/archs (the latter of which can only be developed given the former). So there will be an inherent, massive, train-test distribution shift across many, if not most, model deployments - especially early on, in the first deployments (which will be the most dangerous). 'Specialization' here can happen in a wide variety of ways, ranging from always using a specific prompt to finetuning on a dataset to knowledge-distillation to a cheaper model etc. (Or to put it more concretely: everyone uses GPT-3 on much less diverse data than it was originally trained on - raw Internet-wide scrapes - and few to no people use it on more diverse datasets than the original training data, if only because where would you even get such a thing?)
And this can't be solved by any hacks or safety measures because it defeats the point of deployment: to be practically useful, we need models to be hyperspecialized, and then stable static blackboxes which play their assigned role in whatever system has been designed using their specific capability as a puzzle piece, and perform only the designated tasks, and aren't being further trained on random Internet scrapes or arbitrary tasks. Retaining the flexibility and even doing actual training massively complicates development and deployment and may cost several orders of magnitude more than the obvious easy thing of e.g. switching from an OA API call to a local finetuned GPT-J.
(And of course note the implications of that: real data will be highly autocorrelated because you want to process it as it arrives to get an answer now, not wait a random multi-decade interval to fake a large batch of i.i.d. data which would produce the same batchnorm or other runtime global state; inputs will have very different timings & latencies depending on where the model is being run and may evolve timing attacks; inputs will be tailored to a specific user rather than every hypothetical user...)
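To illustrate the autocorrelation point, here is a minimal Python sketch comparing a slowly drifting, in-order stream (a stand-in for deployment inputs) with the same values shuffled (a stand-in for i.i.d. training batches). The synthetic data, the seed, and the lag-1 statistic are all assumptions chosen for illustration. Nothing here claims that a real model computes this particular quantity.

```python
import random


def lag1_autocorrelation(xs):
    """Sample lag-1 autocorrelation of a sequence of numbers."""
    n = len(xs)
    mean = sum(xs) / n
    var = sum((x - mean) ** 2 for x in xs)
    if var == 0:
        return 0.0  # constant sequence: define the statistic as zero
    cov = sum((xs[i] - mean) * (xs[i + 1] - mean) for i in range(n - 1))
    return cov / var


# Hypothetical "deployment" stream: inputs arrive in order and drift slowly.
stream = [t / 100.0 for t in range(2000)]

# Hypothetical "training" batch: the same values in shuffled (i.i.d.-like) order.
rng = random.Random(0)  # fixed seed so the sketch is reproducible
batch = stream[:]
rng.shuffle(batch)

print(lag1_autocorrelation(stream))  # close to 1 for the ordered stream
print(lag1_autocorrelation(batch))   # close to 0 for the shuffled batch
```

Even this crude statistic separates the two regimes cleanly, which is the sense in which the ordering of real inputs leaks information about whether a system is in training or deployment.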
That HN comment you linked to is almost 10 years old, near the bottom of a thread on an unrelated story, and while it supports your point, I don't notice what other qualities it has that would make it especially memorable, so I'm kind of amazed that you surfaced it at an appropriate moment from such an obscure place and I'm curious how that happened.
Oh, it's just from my list of reward hacking; I don't mention the others because most of them aren't applicable to train/deploy distinction. And I remember it because I remember a lot of things and this one was particularly interesting to me for exactly the reason I linked it just now - illustrating that optimization processes can hack train/deploy distinctions as a particularly extreme form of 'data leakage'. As for where I got it, I believe someone sent it to me way back when I was compiling that list.
I'd add that everything in this post is still relevant even if the AGI in question isn't explicitly modelling itself as in a simulation, attempting to deceive human operators, etc. The more-general takeaway of the argument is that certain kinds of distribution shift will occur between training and deployment - e.g. a shift to a "large reality", a universe which embeds the AI and has simple physics, etc. Those distribution shifts potentially make training behavior a bad proxy for deployment behavior, even in the absence of explicit malign intent of the AI toward its operators.
I agree about 'deceive the deployed AGI into thinking it is in a simulation' being a doomed plan. I do think that a somewhat nearby plan, where the AGI is kept entirely in the simulation from start to finish, is possibly a good idea. In this simbox plan, the AGI is used not to create or enact plans in the real world, but rather as a guinea pig for alignment experiments. Once you've learned enough from your experiments, build a new AGI from scratch that you think is more align-able, and try to align it in a new simbox. Only once you've repeated this enough times that you are confident you have an align-able architecture do you train the model on any real world data and try to use it to help you make a pivotal act. I imagine that early versions of align-able architecture will need to have strong enforceable limitations on the speed and overall intelligence and creativity of the model, as well as be highly interpretable, reasonably corrigible, etc.
This plan has weak points, but I think they are different from the ones mentioned in this post. The weak points I see with the simbox-guinea-pig plan are:
How sure can you be that the results of the experiments you are running are not being 'thrown' by an AGI aware that it is in a simbox?
Can your interpretability methods detect whether the current run of the AGI has noticed that it is in a simbox yet?
How can you monitor the experiments without exposing yourself to infohazards?
Can you keep the AGI weak enough to not break out or pose infohazard danger while still being strong enough that the results of the experiments are relevant? (I assume there will be emergent capabilities that need to be accounted for in the realm of powerful-enough-to-be-useful-for-a-pivotal-act).
Can you set up backup systems such that failures will be setbacks but not catastrophes?
So, the simbox-guinea-pig plan seems hard, and with many open unsolved questions, but not necessarily doomed.
Additionally, I think that the problems you aim an AGI at could matter a lot for how dangerous the responses you get back are. Some proposed solutions to posed problems will by their nature be much safer and easier to verify. For example, a plan for a new widget that can be built and tested under controlled lab conditions, and explored thoroughly for booby traps or side effects, is a lot safer than a plan to, say, make a target individual the next president of the United States.
Edit: forgot to add what I currently think of as the biggest weakness of this plan: the very high alignment tax. How do you convince the inventor of this AI system to not deploy it and instead run lots of expensive testing on it and only deploy a safer version that might not be achieved for years yet? I think this is where demonstrative bad-behavior-prompting safety evaluations could help in showing when a model was dangerous enough to warrant such a high level of caution.
According to this criterion we would be in a simulation because there is no plausible explanation of how the Universe was created.
This is a valid point. It can easily be extended to the agent via last thursdayism.
Post summary (feel free to suggest edits!):
Various people have proposed variants of “align AGI by making it sufficiently uncertain about whether it’s in the real world versus still in training”. This seems unpromising because an AGI could still cause bad outcomes even if convinced, and convincing it would be difficult.
Non-exhaustive list of how it could tell it’s in reality:
If you can understand the contents of the AI’s mind well enough to falsify every possible check it could do to determine the difference between simulation and reality, then you could use that knowledge to build a friendly AI that doesn’t need to be fooled in the first place.
(If you'd like to see more summaries of top EA and LW forum posts, check out the Weekly Summaries series.)
What if you kept building more and more advanced adversarial networks designed to fool the AI about reality? Or what if you implemented patterns in deployment to make it appear as though it’s still a simulation?
My confused understanding of this is that each subsequent layer gets simpler, but more consistent and at some point the agent says "good enough"? So going by this, you should assume that worlds with magic (or maybe miracles or in general, special cases) in them are unlikely to be base reality? Which is why factorizations, weather systems etc. are good methods to check your layer - they're like Go vs Monopoly?
Humans don't seem to be too good at this - those who think a lot about it tend to conclude that they aren't in base reality - that's the main point of many (most?) religions.