I'm an AGI safety / AI alignment researcher in Boston with a particular focus on brain algorithms—see https://sjbyrnes.com/agi.html. Email: steven.byrnes@gmail.com. Twitter: @steve47285. Employer: https://astera.org/. Physicist by training.
Scott’s analysis seems fine to me, unless I missed something. He writes “Many-Worlders will yawn at this question” [in reference to Wigner’s friend]. Yes. I yawn. If Wigner is right outside the lab door, then Wigner is in fact in one of the branches (the same branch as his friend) even if Wigner happens to not yet know which one of them. If Wigner is on Alpha Centauri, then he is not yet in one of those two branches, and his friend is, and I don’t see any problem with that. And then a few years later he gets a message from his friend, and by that point Wigner is in one of those two branches, and when he reads the message he’ll know which one.
I’m reluctant to engage with extraordinarily contrived scenarios in which magical 2nd-law-of-thermodynamics-violating contraptions cause “branches” to interfere. But if we are going to engage with those scenarios anyway, then we should never have referred to them as “branches” in the first place, and also we should be extremely wary of applying normal intuitions in situations where the magical contraption is “scrambling people’s brains” as Scott puts it.
As a meta point, I might drop out of this conversation at any point (including maybe right now), gotta get back to work. :)
What happens if the branch "you" are in gets cancelled with another branch?
One doesn’t invoke the term “different branches” unless they are macroscopically different, and if they’re ever macroscopically different, then they will remain macroscopically different forever, thanks to entropy (and the related fact that macroscopic events leave countless little persistent traces in the environment). Even more so if we’re talking about human observers, who form memories of what they’ve seen in the form of changes to the structure of their brains. Macroscopically different branches can’t “cancel” and more generally macroscopically different branches can’t interfere in a way that has any measurable effect.
(For any quantum observable O that’s relevant in practice, ⟨ψ₁|O|ψ₂⟩ ≈ 0 if |ψ₁⟩ and |ψ₂⟩ are macroscopically different, e.g. the Geiger counter loudly clicked in |ψ₁⟩ but not |ψ₂⟩.)
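A toy sketch of why such cross terms end up negligible, under assumptions that are not part of the argument above: suppose a macroscopic event leaves n_traces independent "traces" in the environment, each modeled as a qubit that is |0> in one branch and cos(theta)|0> + sin(theta)|1> in the other. Interference terms for an observable that does not touch those traces are then proportional to the overlap of the two environment states, which is |cos(theta)| raised to the power n_traces. The function name, the qubit model, and the value of theta are all hypothetical illustration choices.

```python
import math


def trace_overlap(n_traces: int, theta: float) -> float:
    """Magnitude of the overlap between the two branches' environment states.

    Toy assumption: each of n_traces environment qubits is |0> in one branch
    and cos(theta)|0> + sin(theta)|1> in the other, so every qubit contributes
    a factor cos(theta) and the total overlap is |cos(theta)| ** n_traces.
    """
    if n_traces < 0:
        raise ValueError("n_traces must be non-negative")
    return abs(math.cos(theta)) ** n_traces


if __name__ == "__main__":
    # Even a small per-trace difference compounds quickly as traces accumulate.
    for n in (0, 10, 1_000, 100_000):
        print(n, trace_overlap(n, theta=0.1))
```

In this toy model the overlap shrinks exponentially in the number of traces, so for anything macroscopic (where the number of traces is astronomically large) it is zero for all practical purposes.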
Solomonoff inductors are a bit of an odd case because they don’t & can’t exist in the universe. But leaving that aside (let’s say we’re implementing AIXItl or whatever):
Every time the computer makes an observation, we learn some stuff about the universe, and we also learn some indexical information about where we-in-particular are sitting within the universe. This has always been true, but it’s especially true about MWI because we will never stop getting indexical updates (unlike a deterministic universe where you can learn that you’re in a particular room on earth and then there’s no more indexical information to learn). In MWI, if we observe that a pixel is bright, then we have learned that we are in a branch of the wavefunction wherein the pixel is bright. There might or might not be other branches wherein the pixel is dark, but if there are, we now know that those branches are “not where I have found myself”, and we can ignore those branches accordingly. You can still have hypotheses, but they will incorporate Born-rule indexical uncertainty about which branch I will find myself in in the future, on top of whatever other indexical uncertainty you have for other reasons.
Last I heard, there was no pilot-wave version of the standard model of particle physics. Also, last I heard, the (apparent) exact local Lorentz-invariance of the universe is either outright violated by pilot-wave theories or “put in by hand” in a way that makes it seem like a massive coincidence / fine-tuning.
I’m actually not very knowledgeable on this; if those allegations above were ever true, are they still true right now?
Separately, I disagree that MWI creates mysteries about embedded agency or anything else. You do need an “indexical” postulate of some sort (the probability that “I will find myself” in such-and-such branch), and the Born rule supplies that, but I don’t see that as hard to swallow. Also, I believe the Born rule turns out to be equivalent to seemingly-weaker indexical assumptions like “as quantum amplitude approaches zero, the probability that you’ll find yourself in that branch approaches zero too” (cf. here). I don’t think we can get rid of indexical assumptions; even in a deterministic universe, we still have to deal with Parfit’s teletransporter and such. If we’re OK with Parfit’s teletransporter, I don’t think there’s additional weirdness in the MWI indexical assumption. (I’m stating these opinions without justifying them.)
OK! I think I’m on board now.
Let me try to explain “process-based feedback” from first principles in my own words.
We have a problem: if an agent wants to do X in the real world, dastardly real-world power-seeking actions are probably helpful for that.
The very hard manifestation of this problem is: there could be an AI that has never done any real-world power-seeking actions whatsoever, not even a little, not anytime during training, and then seemingly out of nowhere it does a power-seeking treacherous turn (maybe it outputs “Help me help me I’m suffering!”), via foresighted planning towards a misaligned / misgeneralized goal.
…But we’re not talking about that.
The less-hard manifestation of this problem is: the AI sometimes does a little bit of real-world power-seeking actions, and by doing so it does better at its task X, and thus we reinforce those actions, and then it does more and more real-world power-seeking.
Our basic approach to solve this latter problem is to never propagate training signals through a sequence in which real-world actions are taken and their consequences play out.
More specifically:
(1) We often have the AI acting within a “boxed” mode that is sufficiently constrained that there is no opportunity for problematic power-seeking actions. No internet access, no chatting with humans, etc.
(2) However, the AI will sometimes “propose a step” that cannot be done within the “boxed” mode. “Proposing a step” can either look like:
(2A) printing out a suggestion that humans do some discrete real-world thing, or
(2B) printing out a discrete real-world thing that could be done immediately & automatically by a simple non-AI script (e.g. “writing down a specific email that it proposes to send”, or “writing down a specific google search query that it proposes to type in”, etc.).
(3) The AI gets a training signal from one and only one source:
(3A) Whenever the AI “proposes a step” (and before the corresponding proposal is actually executed) let’s call that an “audit-able moment”.
(3B) At some small fraction of audit-able moments, we freeze execution and do an audit, which entails evaluating how reasonable that step looks, again without executing it.
(3C) We update the model based on that evaluation…
(3D) …but we make sure to do that model update in such a way that we only propagate training signals to the things that the AI did since the completion of the previous step.
So for example, in a recurrent model, we could do BPTT, but we would stop the BPTT at the completion of the previous step. Or in RL, we could apply TD updates, but only to the period after the completion of the previous step. Etc. We’re treating all its previous proposals & activities as fixed and exogenous for the purpose of our model-updating step.
If we do that, then we’re only updating based on activities during a period within which the AI was entirely in the “boxed” mode. And therefore we can feel confident that we are not differentially reinforcing real-world power-seeking behavior!
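Here is a minimal sketch of the bookkeeping behind that "only since the completion of the previous step" rule. It is not a training loop, just the credit-assignment window; the Event type, the "think" / "propose" labels, and the function name are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Event:
    kind: str   # "think" = boxed internal activity, "propose" = audit-able moment
    label: str


def credit_window(events: List[Event], audited_index: int) -> List[int]:
    """Indices of events allowed to receive a training signal from one audit.

    The window starts just after the previous proposal (or at the start of
    the episode) and ends at the audited proposal itself. Everything earlier
    is treated as fixed and exogenous.
    """
    if events[audited_index].kind != "propose":
        raise ValueError("only proposed steps are audit-able")
    start = audited_index
    # Walk backwards until we hit the previous proposal or the episode start.
    while start > 0 and events[start - 1].kind != "propose":
        start -= 1
    return list(range(start, audited_index + 1))


if __name__ == "__main__":
    episode = [
        Event("think", "read notes"),
        Event("think", "draft plan"),
        Event("propose", "send email"),
        Event("think", "revise plan"),
        Event("propose", "run search query"),
    ]
    print(credit_window(episode, 4))  # window for auditing the second proposal
    print(credit_window(episode, 2))  # window for auditing the first proposal
```

In a recurrent model this window would correspond to where BPTT gets cut off; in RL it would be the span over which TD updates are applied.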
And yet:
Let me know if that’s not what you were getting at. Thanks again.
OK, I’ll try to construct an example of process-based supervision without boxing, and argue that it fails the criterion of “never giving gradient updates toward doing the dastardly stuff”.
We give our AI unfettered unmonitored internet access. We say “We are a hardware R&D firm, and we would like to develop a less-expensive LiDAR unit.” The AI does some internet searches and reads some books and outputs:
“My first step is to list out different LiDAR approaches on the market (and under development), and try to estimate their manufacturing cost breakdowns, to get a better sense for this space.”
We inspect that output, and compare it to the AI’s other possible outputs, and give the AI a high grade.
So far so good, right?
No!! Because the AI was secretly doing some dastardly stuff during the “The AI does some internet searches and reads some books” part:
Specifically, the AI hacked into a website to get access to a paywalled e-textbook entitled Manufacturing Cost Reduction 101, and read it, and used that information to propose a better (so-called) “first step”.
…And we rewarded it for that.
(What we wanted was for the AI to propose to read that e-textbook as a first step. But I don’t see how we’re incentivizing that, in the absence of boxing. The “first step” it actually proposed would at least plausibly be judged as better than that.)
I think there’s a pretty smooth gradient from legal information-gathering internet queries (downloading un-paywalled articles) to dubiously-legal (sci-hub) to illegal (logging into Elsevier with a guessed password) to downright dastardly (stealing money and hiring a taskrabbit to go to a library and scan a book and post it on the internet). It seems to me that we’re incentivizing the AI all the way down that gradient.
I’m sure I’m misunderstanding something, and appreciate your patience.
There’s an interesting question: if a power-seeking AI had a button that instantly murdered every human, how much human-requiring preparatory work would it want to do before pressing the button? People seem to have strongly clashing intuitions here, and there aren’t any great writeups IMO. Some takes on the side of “AI wouldn’t press the button until basically the whole world economy was run by robots” are 1, 2, 3, 4, 5. I tend to be on the other side, for example I wrote here:
it seems pretty plausible to me that if there's an AGI server and a solar cell and one teleoperated robot body in an otherwise-empty post-apocalyptic Earth, well then that one teleoperated robot body could build a janky second teleoperated robot body from salvaged car parts or whatever, and then the two of them could find more car parts to build a third and fourth, and those four could build up to eight, etc. That was basically the story I was telling here.…
Some cruxes:
It’s a mathematical identity that
This doesn’t depend on A happening chronologically before or after B etc., it’s a true mathematical identity regardless.
This doesn’t depend on these things being uncorrelated. The formula is true even in the extreme case where two or more of these things are 100% perfectly correlated. (…In which case one or more of the factors on the right are going to be 1.0.)
You’re entitled to argue that , and you’re entitled to argue that people are assigning conditional probabilities in a wrong and confused way for whatever reason (e.g. see discussion here), but you can’t argue with the mathematical identity, right?
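A quick numerical check of this kind of identity, assuming it is the chain rule of probability, which for three events reads P(A and B and C) = P(A) × P(B|A) × P(C|A and B). The three-event setup, the example distributions, and the helper names are hypothetical choices for illustration. Exact fractions are used so the equality holds with no rounding.

```python
from fractions import Fraction
from itertools import product
from math import prod


def prob(joint, pred):
    """Total probability of the outcomes (a, b, c) that satisfy pred."""
    return sum(p for outcome, p in joint.items() if pred(outcome))


def chain_rule_factors(joint):
    """Return [P(A), P(B|A), P(C|A and B)].

    Assumes P(A) and P(A and B) are nonzero, so the conditionals are defined.
    """
    p_a = prob(joint, lambda o: o[0])
    p_ab = prob(joint, lambda o: o[0] and o[1])
    p_abc = prob(joint, lambda o: all(o))
    return [p_a, p_ab / p_a, p_abc / p_ab]


# An arbitrary joint distribution in which the three events are dependent.
generic = {
    outcome: Fraction(weight, 36)
    for weight, outcome in enumerate(product([0, 1], repeat=3), start=1)
}

# The extreme case: all three events are 100% perfectly correlated.
correlated = {(1, 1, 1): Fraction(1, 4), (0, 0, 0): Fraction(3, 4)}

if __name__ == "__main__":
    for name, joint in [("generic", generic), ("correlated", correlated)]:
        factors = chain_rule_factors(joint)
        print(name, factors, prod(factors) == prob(joint, all))
```

In the perfectly correlated case, the conditional factors come out as exactly 1, and the product still equals the joint probability.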
Thanks! I once wrote up a somewhat-parallel discussion on a different topic in Section 5.1 here:
and I also linked to & excerpted yet another parallel discussion on yet a different topic by Scott Alexander, Section 17 here.