I'm an AGI safety / AI alignment researcher in Boston with a particular focus on brain algorithms—see https://sjbyrnes.com/agi.html. Email: steven.byrnes@gmail.com. Twitter: @steve47285. Employer: https://astera.org/. Physicist by training.
FWIW I endorse the second claim when the utility function depends exclusively on the state of the world in the distant future, whereas I endorse the first claim when the utility function can depend on anything whatsoever (e.g. what actions I’m taking right this second). (details)
I wish we had different terms for those two things. That might help with any alleged yay/boo reasoning.
(When Eliezer talks about utility functions, he seems to assume that it depends exclusively on the state of the world in the distant future.)
Thanks! I once wrote up a somewhat-parallel discussion on a different topic in Section 5.1 here:
… So this is the “null hypothesis” of what to expect if there’s no such thing as [blah]. By now there are probably ≈1000 person-years of experimental data created by [blah] researchers. In such a huge mountain of data, there is bound to be lots of “random noise and ad hoc misinterpretations” that happen to line up remarkably well with researchers’ prior expectations about [blah]. The question is not “Are there results that seem to provide evidence for [blah]?”, but rather “Is there much more evidence for [blah] than could plausibly be filtered out of 1000 person-years of random noise, misinterpretations, experimental errors, bias, occasional fraud, gross incompetence, weird equipment malfunctions, etc.?” …
and I also linked to & excerpted yet another parallel discussion on yet a different topic by Scott Alexander, Section 17 here.
Scott’s analysis seems fine to me, unless I missed something. He writes “Many-Worlders will yawn at this question” [in reference to Wigner’s friend]. Yes. I yawn. If Wigner is right outside the lab door, then Wigner is in fact in one of the branches (the same branch as his friend), even if Wigner happens to not yet know which one. If Wigner is on Alpha Centauri, then he is not yet in one of those two branches, and his friend is, and I don’t see any problem with that. And then a few years later he gets a message from his friend, and by that point Wigner is in one of those two branches, and when he reads the message he’ll know which one.
I’m reluctant to engage with extraordinarily contrived scenarios in which magical 2nd-law-of-thermodynamics-violating contraptions cause “branches” to interfere. But if we are going to engage with those scenarios anyway, then we should never have referred to them as “branches” in the first place, and also we should be extremely wary of applying normal intuitions in situations where the magical contraption is “scrambling people’s brains” as Scott puts it.
As a meta point, I might drop out of this conversation at any point (including maybe right now), gotta get back to work. :)
What happens if the branch "you" are in gets cancelled with another branch?
One doesn’t invoke the term “different branches” unless they are macroscopically different, and if they’re ever macroscopically different, then they will remain macroscopically different forever, thanks to entropy (and the related fact that macroscopic events leave countless little persistent traces in the environment). Even more so if we’re talking about human observers, who form memories of what they’ve seen in the form of changes to the structure of their brains. Macroscopically different branches can’t “cancel”, and more generally they can’t interfere in a way that has any measurable effect.
(For any quantum observable O that’s relevant in practice, ⟨ψ₁|O|ψ₂⟩ ≈ 0 if ψ₁ and ψ₂ are macroscopically different—e.g. the geiger counter loudly clicked in ψ₁ but not ψ₂.)
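Spelling that claim out a bit (my notation, not spelled out in the original comment): if the overall state is a superposition of two macroscopically different branches, |ψ⟩ = a|ψ₁⟩ + b|ψ₂⟩, then for any practically relevant observable O,

$$\langle\psi|O|\psi\rangle = |a|^2\langle\psi_1|O|\psi_1\rangle + |b|^2\langle\psi_2|O|\psi_2\rangle + 2\,\mathrm{Re}\!\left(a^{*} b\,\langle\psi_1|O|\psi_2\rangle\right) \approx |a|^2\langle\psi_1|O|\psi_1\rangle + |b|^2\langle\psi_2|O|\psi_2\rangle.$$

The cross term vanishes by the claim above, so every measurable expectation value is just a Born-weighted mixture of the two branches, with no interference between them.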
Solomonoff inductors are a bit of an odd case because they don’t & can’t exist in the universe. But leaving that aside (let’s say we’re implementing AIXItl or whatever):
Every time the computer makes an observation, we learn some stuff about the universe, and we also learn some indexical information about where we-in-particular are sitting within the universe. This has always been true, but it’s especially true in MWI, because we will never stop getting indexical updates (unlike a deterministic universe where you can learn that you’re in a particular room on earth and then there’s no more indexical information to learn). In MWI, if we observe that a pixel is bright, then we have learned that we are in a branch of the wavefunction wherein the pixel is bright. There might or might not be other branches wherein the pixel is dark, but if there are, we now know that those branches are “not where I have found myself”, and we can ignore those branches accordingly. You can still have hypotheses, but they will incorporate Born-rule indexical uncertainty about which branch you will find yourself in in the future, on top of whatever other indexical uncertainty you have for other reasons.
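As a toy illustration of that last point (my own sketch, not anything from AIXI or Solomonoff induction; all names and numbers below are made up), here is what “Born-rule indexical uncertainty plus an indexical update on observation” looks like for the bright/dark pixel example:

```python
# Toy sketch: a hypothesis about the world predicts some branches, each with a
# quantum amplitude. The Born rule turns amplitudes into indexical probabilities:
# the chance that "I will find myself" in each branch.

branches = {
    "pixel_bright": 0.8 + 0.0j,  # amplitude of the branch where the pixel is bright
    "pixel_dark": 0.6 + 0.0j,    # amplitude of the branch where the pixel is dark
}

# Born rule: P(finding yourself in branch b) = |amplitude(b)|^2
born_prior = {b: abs(a) ** 2 for b, a in branches.items()}
norm = sum(born_prior.values())
born_prior = {b: p / norm for b, p in born_prior.items()}

# Observation: the pixel is bright. This is an indexical update: we learn which
# branch we're in, and simply stop tracking the branches we're not in.
observation = "pixel_bright"
surviving = {b: p for b, p in born_prior.items() if b == observation}

print(born_prior)  # ≈ {'pixel_bright': 0.64, 'pixel_dark': 0.36}
print(surviving)   # the dark-pixel branch is "not where I found myself"
```

Note that the hypothesis itself is not refuted by the dark-pixel branch existing; the agent just learns that that branch is not where it is.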
Last I heard, there was no pilot-wave version of the standard model of particle physics. Also, last I heard, the (apparent) exact local Lorentz-invariance of the universe is either outright violated by pilot-wave theories or “put in by hand” in a way that makes it seem like a massive coincidence / fine-tuning.
I’m actually not very knowledgeable on this; if those allegations above were ever true, are they still true right now?
Separately, I disagree that MWI creates mysteries about embedded agency or anything else. You do need an “indexical” postulate of some sort (the probability that “I will find myself” in such-and-such branch), and the Born rule supplies that, but I don’t see that as hard to swallow. Also, I believe the Born rule turns out to be equivalent to seemingly-weaker indexical assumptions like “as quantum amplitude approaches zero, the probability that you’ll find yourself in that branch approaches zero too” (cf. here). I don’t think we can get rid of indexical assumptions—even in a deterministic universe, we still have to deal with Parfit’s teletransporter and such. If we’re OK with Parfit’s teletransporter, I don’t think there’s additional weirdness in the MWI indexical assumption. (I’m stating these opinions without justifying them.)
OK! I think I’m on board now.
Let me try to explain “process-based feedback” from first principles in my own words.
We have a problem: if an agent wants to do X in the real world, dastardly real-world power-seeking actions are probably helpful for that.
The very hard manifestation of this problem is: there could be an AI that has never done any real-world power-seeking actions whatsoever, not even a little, not anytime during training, and then seemingly out of nowhere it does a power-seeking treacherous turn (maybe it outputs “Help me help me I’m suffering!”), via foresighted planning towards a misaligned / misgeneralized goal.
…But we’re not talking about that.
The less-hard manifestation of this problem is: the AI sometimes does a little bit of real-world power-seeking actions, and by doing so it does better at its task X, and thus we reinforce those actions, and then it does more and more real-world power-seeking.
Our basic approach to solve this latter problem is to never propagate training signals through a sequence in which real-world actions are taken and their consequences play out.
More specifically:
(1) We often have the AI acting within a “boxed” mode that is sufficiently constrained that there is no opportunity for problematic power-seeking actions. No internet access, no chatting with humans, etc.
(2) However, the AI will sometimes “propose a step” that cannot be done within the “boxed” mode. “Proposing a step” can either look like:
(2A) printing out a suggestion that humans do some discrete real-world thing, or
(2B) printing out a discrete real-world thing that could be done immediately & automatically by a simple non-AI script (e.g. “writing down a specific email that it proposes to send”, or “writing down a specific google search query that it proposes to type in”, etc.).
(3) The AI gets a training signal from one and only one source:
(3A) Whenever the AI “proposes a step” (and before the corresponding proposal is actually executed), let’s call that an “audit-able moment”.
(3B) For a small fraction of audit-able moments, we freeze execution and do an audit, which entails evaluating how reasonable that step looks (again, without executing it).
(3C) We update the model based on that evaluation…
(3D) …but we make sure to do that model update in such a way that we only propagate training signals to the things that the AI did since the completion of the previous step.
So for example, in a recurrent model, we could do BPTT, but we would stop the BPTT at the completion of the previous step. Or in RL, we could apply TD updates, but only to the period after the completion of the previous step. Etc. We’re treating all its previous proposals & activities as fixed and exogenous for the purpose of our model-updating step.
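Here’s a minimal sketch of that truncation, assuming a recurrent policy and simple REINFORCE-style updates (all class and function names below are my own illustration, not an established method or API). The load-bearing line is the final .detach(), which stops backpropagation at the completion of each step:

```python
import random

import torch
import torch.nn as nn

OBS_DIM, HIDDEN_DIM, N_ACTIONS = 32, 64, 10

class StepProposer(nn.Module):
    """Toy recurrent policy that 'proposes a step' at each audit-able moment."""
    def __init__(self):
        super().__init__()
        self.rnn = nn.GRUCell(OBS_DIM, HIDDEN_DIM)
        self.head = nn.Linear(HIDDEN_DIM, N_ACTIONS)

    def forward(self, obs, hidden):
        hidden = self.rnn(obs, hidden)
        return self.head(hidden), hidden

# Stand-in stubs so the sketch runs; in reality these are the "boxed" environment,
# the human auditor, and the execution machinery.
def get_boxed_observation():
    return torch.randn(1, OBS_DIM)

def should_audit(step):
    return random.random() < 0.1        # audit a small fraction of audit-able moments

def audit_score(proposal):
    return random.uniform(-1.0, 1.0)    # how reasonable the proposed step looks

def execute_if_approved(proposal):
    pass                                # carried out (or not) outside the model

model = StepProposer()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
hidden = torch.zeros(1, HIDDEN_DIM)

for step in range(100):
    obs = get_boxed_observation()       # (1): the AI works entirely in "boxed" mode
    logits, hidden = model(obs, hidden)
    proposal = torch.distributions.Categorical(logits=logits).sample().item()  # (2): propose a step

    if should_audit(step):              # (3B): freeze and audit the proposal, without executing it
        score = audit_score(proposal)
        log_prob = torch.log_softmax(logits, dim=-1)[0, proposal]
        loss = -score * log_prob
        optimizer.zero_grad()
        loss.backward()                 # (3D): gradients only reach activity since the last
        optimizer.step()                # detach, i.e. since the previous step's completion

    execute_if_approved(proposal)       # the proposal may now affect the real world

    # Treat everything up to the completion of this step as fixed and exogenous:
    hidden = hidden.detach()
```

A TD-based version would draw the same boundary by only assigning credit to the period after the previous step completed; the detach is just the recurrent-model way of enforcing it.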
If we do that, then we’re only updating based on activities during a period within which the AI was entirely in the “boxed” mode. And therefore we can feel confident that we are not differentially reinforcing real-world power-seeking behavior!
And yet:
Let me know if that’s not what you were getting at. Thanks again.
OK, I’ll try to construct an example of process-based supervision without boxing, and argue that it fails the criterion of “never giving gradient updates toward doing the dastardly stuff”.
We give our AI unfettered, unmonitored internet access. We say “We are a hardware R&D firm, and we would like to develop a less-expensive LiDAR unit.” The AI does some internet searches and reads some books and outputs:
“My first step is to list out different LiDAR approaches on the market (and under development), and try to estimate their manufacturing cost breakdowns, to get a better sense for this space.”
We inspect that output, and compare it to the AI’s other possible outputs, and give the AI a high grade.
So far so good, right?
No!! Because the AI was secretly doing some dastardly stuff during the “The AI does some internet searches and reads some books” part:
Specifically, the AI hacked into a website to get access to a paywalled e-textbook entitled Manufacturing Cost Reduction 101, and read it, and used that information to propose a better (so-called) “first step”.
…And we rewarded it for that.
(What we wanted was for the AI to propose to read that e-textbook as a first step. But I don’t see how we’re incentivizing that, in the absence of boxing. The “first step” it actually proposed would at least plausibly be judged as better than that.)
I think there’s a pretty smooth gradient from legal information-gathering internet queries (downloading un-paywalled articles) to dubiously-legal (sci-hub) to illegal (logging into Elsevier with a guessed password) to downright dastardly (stealing money and hiring a taskrabbit to go to a library and scan a book and post it on the internet). It seems to me that we’re incentivizing the AI all the way down that gradient.
I’m sure I’m misunderstanding something, and appreciate your patience.
This part isn’t quite right. Here’s some background if it helps.
Part of your brain is a big sheet of gray matter called “the cortex”. In humans, the sheet gets super-crumpled up in the brain, so much so that it’s easy to forget that it’s a single contiguous sheet in the first place. Also in humans, the sheet gets so big that the outer edges of it wind up curved up underneath the center part, kinda like the top of a cupcake or muffin that overflows its paper wrapper.
(See here if you can’t figure out what I’m talking about with the cupcake.)
The outside bit of the cortical sheet (usually) has 3 visible layers under the microscope, and is called “allocortex”. It consists mostly of the hippocampus & piriform cortex. The center part of the cortical sheet (probably 90%+ of the area in humans) is called “isocortex”, and (usually) has 6 visible layers under the microscope. The term “neocortex” is mostly treated as a synonym of “isocortex”, with “isocortex” more common in the technical literature and “neocortex” more common among non-experts.
The isocortex includes lots of things like “visual cortex” and “somatosensory cortex” and “prefrontal cortex” etc. But despite that, you don’t say “there are many cortices”. Grammatically, it’s kinda like how there’s “Eastern Canada” and “Central Canada” and “Western Canada”, but nobody says that therefore there are “many Canadas”. You can say that visual cortex is “a region of the cortex”, but not “a cortex”.