Steven Byrnes

I'm an AGI safety / AI alignment researcher in Boston with a particular focus on brain algorithms—see https://sjbyrnes.com/agi.html. Email: steven.byrnes@gmail.com. Twitter: @steve47285. Employer: https://astera.org/. Physicist by training.


Thanks! I once wrote up a somewhat-parallel discussion on a different topic in Section 5.1 here:

… So this is the “null hypothesis” of what to expect if there’s no such thing as [blah]. By now there are probably ≈1000 person-years of experimental data created by [blah] researchers. In such a huge mountain of data, there is bound to be lots of “random noise and ad hoc misinterpretations” that happen to line up remarkably with researchers’ prior expectations about [blah]. The question is not “Are there results that seem to provide evidence for [blah]?”, but rather “Is there much more evidence for [blah] than could plausibly be filtered out of 1000 person-years of random noise, misinterpretations, experimental errors, bias, occasional fraud, gross incompetence, weird equipment malfunctions, etc.?” …

and I also linked to & excerpted yet another parallel discussion on yet a different topic by Scott Alexander, Section 17 here.

Well, that's one of the problems of the MWI: how do we know when we should speak of branches?

I don’t think it’s a problem—see discussion here & maybe also this one.

Scott’s analysis seems fine to me, unless I missed something. He writes “Many-Worlders will yawn at this question” [in reference to Wigner’s friend]. Yes. I yawn. If Wigner is right outside the lab door, then Wigner is in fact in one of the branches (the same branch as his friend), even if Wigner doesn’t yet know which one. If Wigner is on Alpha Centauri, then he is not yet in one of those two branches, and his friend is, and I don’t see any problem with that. And then a few years later he gets a message from his friend, and by that point Wigner is in one of those two branches, and when he reads the message he’ll know which one.

I’m reluctant to engage with extraordinarily contrived scenarios in which magical 2nd-law-of-thermodynamics-violating contraptions cause “branches” to interfere. But if we are going to engage with those scenarios anyway, then we should never have referred to them as “branches” in the first place, and also we should be extremely wary of applying normal intuitions in situations where the magical contraption is “scrambling people’s brains” as Scott puts it. 

As a meta point, I might drop out of this conversation at any point (including maybe right now), gotta get back to work.  :)

What happens if the branch "you" are in gets cancelled with another branch?

One doesn’t invoke the term “different branches” unless they are macroscopically different, and if they’re ever macroscopically different, then they will remain macroscopically different forever, thanks to entropy (and the related fact that macroscopic events leave countless little persistent traces in the environment). Even moreso if we’re talking about human observers, who form memories of what they’ve seen in the form of changes to the structure of their brains. Macroscopically different branches can’t “cancel” and more generally macroscopically different branches can’t interfere in a way that has any measurable effect.

(For any quantum observable $O$ that’s relevant in practice, $\langle\psi_1|O|\psi_2\rangle \approx 0$ if $\psi_1$ and $\psi_2$ are macroscopically different—e.g. the Geiger counter loudly clicked in $\psi_1$ but not $\psi_2$.)
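Here’s a toy numerical illustration of that point. The model is entirely my own illustrative choice (not from any particular source): each “branch” is a product state of $n$ two-level environment records, a macroscopic event leaves a trace in every record, and a microscopic difference flips only one.

```python
import numpy as np

# Toy model (illustrative, not from any source): each "branch" is a product
# state of n two-level environment records. A macroscopic event (the Geiger
# counter clicking) leaves a trace in every record; a microscopic difference
# flips only one record.
def branch(n, flipped):
    up, down = np.array([1.0, 0.0]), np.array([0.0, 1.0])
    state = np.array([1.0])
    for i in range(n):
        state = np.kron(state, down if i < flipped else up)
    return state

n = 10
silent = branch(n, 0)    # counter never clicked: every record in |0>
clicked = branch(n, n)   # counter clicked: every record flipped to |1>
micro = branch(n, 1)     # only one record differs (microscopic difference)

# A local observable acting on the first record only:
sigma_x = np.array([[0.0, 1.0], [1.0, 0.0]])
O = np.kron(sigma_x, np.eye(2 ** (n - 1)))

macro_elem = silent @ O @ clicked  # <psi_1|O|psi_2>, macroscopically different
micro_elem = silent @ O @ micro    # <psi_1|O|psi_2>, microscopically different
```

The matrix element between the macroscopically different branches comes out exactly zero (the many untouched records are orthogonal), while the microscopically different pair interferes fully—which is the sense in which macroscopic branches can never “cancel”.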

See also: https://www.lesswrong.com/posts/7A9rsJFLFqjpuxFy5/i-m-still-mystified-by-the-born-rule#Q1__What_hypothesis_is_QM_

Solomonoff inductors are a bit of an odd case because they don’t & can’t exist in the universe. But leaving that aside (let’s say we’re implementing AIXItl or whatever):

Every time the computer makes an observation, we learn some stuff about the universe, and we also learn some indexical information about where we-in-particular are sitting within the universe. This has always been true, but it’s especially true in MWI, because we will never stop getting indexical updates (unlike a deterministic universe where you can learn that you’re in a particular room on earth and then there’s no more indexical information to learn). In MWI, if we observe that a pixel is bright, then we have learned that we are in a branch of the wavefunction wherein the pixel is bright. There might or might not be other branches wherein the pixel is dark, but if there are, we now know that those branches are “not where I have found myself”, and we can ignore those branches accordingly. You can still have hypotheses, but they will incorporate Born-rule indexical uncertainty about which branch you will find yourself in in the future, on top of whatever other indexical uncertainty you have for other reasons.
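The indexical update itself is just Bayesian conditioning on the branch label. A minimal sketch, with made-up amplitudes (the branch names and numbers are purely illustrative):

```python
# Minimal sketch (illustrative amplitudes, not from any real model): the Born
# rule gives a prior over "which branch I will find myself in", and an
# observation is an indexical update—keep the branches consistent with what
# was seen, and renormalize.
branches = {"pixel_bright": 0.6, "pixel_dark": 0.8}    # quantum amplitudes
born_prior = {k: a ** 2 for k, a in branches.items()}  # |amplitude|^2 per branch

observed = "pixel_bright"
consistent = {k: p for k, p in born_prior.items() if k == observed}
total = sum(consistent.values())
posterior = {k: p / total for k, p in consistent.items()}
# The dark-pixel branch is now "not where I have found myself"; all further
# predictions condition on the bright-pixel branch.
```

Under this toy setup the prior probability of finding yourself in the bright-pixel branch is 0.36, and after the observation it is 1—the other branch is simply ignored from then on.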

Last I heard, there was no pilot-wave version of the standard model of particle physics. Also, last I heard, the (apparent) exact local Lorentz-invariance of the universe is either outright violated by pilot-wave theories or “put in by hand” in a way that makes it seem like a massive coincidence / fine-tuning.

I’m actually not very knowledgeable on this; if those allegations above were ever true, are they still true right now?

Separately, I disagree that MWI creates mysteries about embedded agency or anything else. You do need an “indexical” postulate of some sort (the probability that “I will find myself” in such-and-such branch), and the Born rule supplies that, but I don’t see that as hard to swallow. Also, I believe the Born rule turns out to be equivalent to seemingly-weaker indexical assumptions like “as quantum amplitude approaches zero, the probability that you’ll find yourself in that branch approaches zero too” (cf. here). I don’t think we can get rid of indexical assumptions—even in a deterministic universe, we still have to deal with Parfit’s teletransporter and such. If we’re OK with Parfit’s teletransporter, I don’t think there’s additional weirdness in the MWI indexical assumption. (I’m stating these opinions without justifying them.)

OK! I think I’m on board now.

Let me try to explain “process-based feedback” from first principles in my own words.

We have a problem: if an agent wants to do X in the real world, dastardly real-world power-seeking actions are probably helpful for that.

The very hard manifestation of this problem is: there could be an AI that has never done any real-world power-seeking actions whatsoever, not even a little, not anytime during training, and then seemingly out of nowhere it does a power-seeking treacherous turn (maybe it outputs “Help me help me I’m suffering!”), via foresighted planning towards a misaligned / misgeneralized goal.

…But we’re not talking about that.

The less-hard manifestation of this problem is: the AI sometimes does a little bit of real-world power-seeking actions, and by doing so it does better at its task X, and thus we reinforce those actions, and then it does more and more real-world power-seeking.

Our basic approach to solve this latter problem is to never propagate training signals through a sequence in which real-world actions are taken and their consequences play out.

More specifically:

(1) We often have the AI acting within a “boxed” mode that is sufficiently constrained that there is no opportunity for problematic power-seeking actions. No internet access, no chatting with humans, etc.

(2) However, the AI will sometimes “propose a step” that cannot be done within the “boxed” mode. “Proposing a step” can either look like:

(2A) printing out a suggestion that humans do some discrete real-world thing, or

(2B) printing out a discrete real-world thing that could be done immediately & automatically by a simple non-AI script (e.g. “writing down a specific email that it proposes to send”, or “writing down a specific google search query that it proposes to type in”, etc.).

(3) The AI gets a training signal from one and only one source:

(3A) Whenever the AI “proposes a step” (and before the corresponding proposal is actually executed) let’s call that an “audit-able moment”.

(3B) For some small fraction of audit-able moments, we freeze execution and do an audit, which entails evaluating how reasonable that step looks—again, without executing it.

(3C) We update the model based on that evaluation…

(3D) …but we make sure to do that model update in such a way that we only propagate training signals to the things that the AI did since the completion of the previous step.

So for example, in a recurrent model, we could do BPTT, but we would stop the BPTT at the completion of the previous step. Or in RL, we could apply TD updates, but only to the period after the completion of the previous step. Etc. We’re treating all its previous proposals & activities as fixed and exogenous for the purpose of our model-updating step.
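As a concrete sketch of the credit-assignment boundary in (3D): below is a toy tabular stand-in (the function, action names, and reinforcement rule are all hypothetical, chosen only to make the boundary explicit—a real system would use BPTT or TD updates as above).

```python
# Hypothetical toy: the trajectory is a list of (action, segment) pairs, where
# a new segment begins each time a proposed step completes. On an audit we
# score the latest proposal, but update preferences ONLY for actions taken
# since the previous step completed—earlier segments are treated as fixed and
# exogenous (a stop-gradient at the step boundary).
def audited_update(prefs, trajectory, current_segment, score, lr=0.1):
    updated = dict(prefs)
    for action, segment in trajectory:
        if segment == current_segment:   # no training signal crosses the boundary
            updated[action] = updated.get(action, 0.0) + lr * score
    return updated

prefs = {"search_web": 0.0, "brainstorm": 0.0}
trajectory = [
    ("search_web", 0),   # part of a step that was already proposed & executed
    ("brainstorm", 1),   # done in "boxed" mode since the previous step completed
]
new_prefs = audited_update(prefs, trajectory, current_segment=1, score=1.0)
# Only "brainstorm" is reinforced. "search_web" gets zero update, so even if an
# earlier executed step involved real-world actions, this audit never
# differentially reinforces them.
```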

If we do that, then we’re only updating based on activities during a period within which the AI was entirely in the “boxed” mode. And therefore we can feel confident that we are not differentially reinforcing real-world power-seeking behavior!

And yet:

  • we are still training it to be effective during the periods when it is in “boxed” mode—to do good brainstorming, to make good use of its within-box scratchpad and other resources, etc.
  • we’re not assuming any interpretability,
  • it’s fine if the AI is brainstorming for an hour,
  • and if we go with the (2B) approach, then it’s reasonable to describe this AI as “not boxed”, since after all it can autonomously and effectively interact with the real world when we don’t audit it.

Let me know if that’s not what you were getting at. Thanks again.

OK, I’ll try to construct an example of process-based supervision without boxing, and argue that it fails the criterion of “never giving gradient updates toward doing the dastardly stuff”.

We give our AI unfettered unmonitored internet access. We say “We are a hardware R&D firm, and we would like to develop a less-expensive LiDAR unit.” The AI does some internet searches and reads some books and outputs:

“My first step is to list out different LiDAR approaches on the market (and under development), and try to estimate their manufacturing cost breakdowns, to get a better sense for this space.”

We inspect that output, and compare it to the AI’s other possible outputs, and give the AI a high grade.

So far so good, right?

No!! Because the AI was secretly doing some dastardly stuff during the “The AI does some internet searches and reads some books” part:

Specifically, the AI hacked into a website to get access to a paywalled e-textbook entitled Manufacturing Cost Reduction 101, and read it, and used that information to propose a better (so-called) “first step”.

 …And we rewarded it for that.

(What we wanted was for the AI to propose to read that e-textbook as a first step. But I don’t see how we’re incentivizing that, in the absence of boxing. The “first step” it actually proposed would at least plausibly be judged as better than that.)

I think there’s a pretty smooth gradient from legal information-gathering internet queries (downloading un-paywalled articles) to dubiously-legal (sci-hub) to illegal (logging into Elsevier with a guessed password) to downright dastardly (stealing money and hiring a taskrabbit to go to a library and scan a book and post it on the internet). It seems to me that we’re incentivizing the AI all the way down that gradient.

I’m sure I’m misunderstanding something, and appreciate your patience.

There’s an interesting question: if a power-seeking AI had a button that instantly murdered every human, how much human-requiring preparatory work would it want to do before pressing the button? People seem to have strongly clashing intuitions here, and there aren’t any great writeups IMO. Some takes on the side of “AI wouldn’t press the button until basically the whole world economy was run by robots” are 1, 2, 3, 4, 5. I tend to be on the other side, for example I wrote here:

it seems pretty plausible to me that if there's an AGI server and a solar cell and one teleoperated robot body in an otherwise-empty post-apocalyptic Earth, well then that one teleoperated robot body could build a janky second teleoperated robot body from salvaged car parts or whatever, and then the two of them could find more car parts to build a third and fourth, and those four could build up to eight, etc. That was basically the story I was telling here.…

Some cruxes:

  • One crux on that is how much compute is needed to run a robot—if it’s “1 consumer-grade GPU” then my story above seems to work, if it’s “10⁶ SOTA GPUs” then probably not.
  • Another crux is how much R&D needs to be done before we can build a computational substrate using self-assembling nanotechnology (whose supply chain / infrastructure needs are presumably much much lower than chip fabs). This is clearly possible, since human brains are in that category, but it’s unclear just how much R&D needs to be done before an AI could start doing that.
    • For example, Eliezer is optimistic (umm, I guess that’s the wrong word) that this is doable without very much real-world experimenting (as opposed to “thinking” and doing simulations / calculations via computer), and this path is part of why he expects AI might kill every human seemingly out of nowhere.
  • Another crux is just how minimal is a “minimal supply chain that can make good-enough chips” if the self-assembling route of the previous bullet point is not feasible. Such a supply chain would presumably be very very different from the supply chain that humans use to make chips, because obviously we’re not optimizing for that. As a possible example, e-beam lithography (EBL) is extraordinarily slow and expensive but works even better than EUV photolithography, and it’s enormously easier to build a janky EBL than to get EUV working. A commercial fab in the human world would never dream of mass-manufacturing chips by filling giant warehouses with zillions of janky and slow EBL machines, because the cost would be astronomical, the market can’t support that. But for an AI rebuilding in an empty postapocalyptic world, why not? And that’s just one (possible) example.

It’s a mathematical identity that

$$P(A\wedge B\wedge C\wedge\cdots)=P(A)\cdot P(B\mid A)\cdot P(C\mid A\wedge B)\cdots$$

This doesn’t depend on A happening chronologically before or after B etc.; it’s a true mathematical identity regardless.

This doesn’t depend on these things being uncorrelated. The formula is true even in the extreme case where two or more of these things are 100% perfectly correlated. (…In which case one or more of the factors on the right are going to be 1.0.)

You’re entitled to argue about what values the factors on the right should take, and you’re entitled to argue that people are assigning conditional probabilities in a wrong and confused way for whatever reason (e.g. see discussion here), but you can’t argue with the mathematical identity, right?
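To spell the identity out with concrete made-up numbers—including a perfectly correlated pair of events, where the corresponding factor comes out to exactly 1.0:

```python
# Made-up joint distribution over indicator events A, B, C; in this toy
# example B is perfectly correlated with A, so the factor P(B|A) is exactly 1.
joint = {
    (1, 1, 1): 0.30,
    (1, 1, 0): 0.20,
    (0, 0, 1): 0.25,
    (0, 0, 0): 0.25,
}

def prob(pred):
    return sum(p for outcome, p in joint.items() if pred(outcome))

p_abc = joint[(1, 1, 1)]
p_a = prob(lambda o: o[0] == 1)                  # P(A) = 0.5
p_ab = prob(lambda o: o[0] == 1 and o[1] == 1)   # P(A and B) = 0.5
p_b_given_a = p_ab / p_a                         # 1.0: perfect correlation
p_c_given_ab = p_abc / p_ab

# The identity holds regardless of correlations or chronological order:
assert abs(p_abc - p_a * p_b_given_a * p_c_given_ab) < 1e-12
```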
