Steven Byrnes

I'm an AGI safety researcher in Boston, MA, USA, with a particular focus on brain algorithms. See Email:



Are there alternative to solving value transfer and extrapolation?

Steven Byrnes's position, if we understand it correctly, is that the AI should learn to behave in non-dangerous seeming ways[2].…

This seems a sensible approach. But it has a crucial flaw. The AI must behave in typical ways to achieve typical consequences - but how it achieves these consequences doesn't have to be typical.

I think that problem may be solvable. See Section 1.1 (and especially 1.1.3) of this post. Here are two oversimplified implementations:

  1. The human grades the AI's actions as norm-following and helpful, and we do RL to make the AI score well on that metric.
  2. The AI learns the human concept of "being norm-following and helpful" (from watching YouTube or whatever). Then it makes plans and takes actions only when they pattern-match to that abstract concept. The pattern-matching part is inside the AI, looking at the whole plan as the AI itself understands it.

I was thinking of #2, not #1.
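Here is a toy sketch of where the two differ. Everything in it is made up for illustration: the `Step` fields, the function names, and the example plans are all hypothetical, and the `norm_following` flag is simply assumed to be available, whereas in reality learning that concept is the hard part.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Step:
    description: str
    visible_to_human: bool  # can the human grader observe this step? (assumed flag)
    norm_following: bool    # does the step match the concept? (assumed to be known)


Plan = List[Step]


def human_grade(plan: Plan) -> bool:
    """Toy version of #1: the grader judges only the steps they can observe."""
    return all(s.norm_following for s in plan if s.visible_to_human)


def internal_concept_check(plan: Plan) -> bool:
    """Toy version of #2: the check runs inside the AI, over the whole plan."""
    return all(s.norm_following for s in plan)


honest_plan = [
    Step("write the report", visible_to_human=True, norm_following=True),
    Step("double-check the numbers", visible_to_human=False, norm_following=True),
]
sneaky_plan = [
    Step("write the report", visible_to_human=True, norm_following=True),
    Step("quietly fudge the numbers", visible_to_human=False, norm_following=False),
]

for name, plan in [("honest", honest_plan), ("sneaky", sneaky_plan)]:
    print(name, human_grade(plan), internal_concept_check(plan))
```

In this toy setup the sneaky plan gets a passing grade under #1 but is rejected under #2, because only the internal check sees the hidden step.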

It seems like you're assuming #1. And also assuming that the result of #1 will be an AI that is "trying" to maximize the human grades / reward. Then you conclude that the AI may learn to appear to be norm-following and helpful, instead of learning to be actually norm-following and helpful. (It could also just hack into the reward channel, for that matter.) Yup, those do seem like things that would probably happen in the absence of other techniques like interpretability. I do have a shred of hope that, if we had much better interpretability than we do today, approach #1 might be salvageable, for reasons implicit in here and here. But still, I would start at #2 not #1.

I have some draft posts explaining some of this stuff better. I can share them privately, or you can hang on another month or two. :)

Misc. questions about EfficientZero

That's an interesting thought. My hunch is that hippocampal replay can't happen unconsciously because if the hippocampus broadcasts a memory at all, it broadcasts it broadly to the cortex including GNW. That's just my current opinion, I'm not sure if there's neuroscience consensus on that question.

Here I'm sneaking in an assumption that "activity in the GNW" = "activity that you're conscious of". Edge-cases include times when there's stuff happening in the GNW, but it's not remembered after the fact (at least, not as a first-person episodic memory). Are you "conscious" during a dream that you forget afterwards? Are you "conscious" when you're 'blacked out' from drinking too much? I guess I'd say "yes" to both, but that's a philosophy question, or maybe just terminology.

If we want more reasons that human-vs-EfficientZero comparisons are not straightforward, there's also the obvious fact that humans benefit from transfer-learning whereas EfficientZero starts with random weights.

Misc. questions about EfficientZero

Hmm, I think my comment came across as setting up a horse-race between EfficientZero and human brains, in a way that I didn't intend. Sorry for the bad choice of words. In particular, when I wrote "how AI compares to human brains", I meant in the sense of "In what ways are they similar vs different? What are their relative strengths and weaknesses? Etc.", but I guess it sounded like I was saying "human brain algorithms are better and EfficientZero is worse". Sorry.

I could write a "human brain algorithms are fundamentally more powerful than EfficientZero" argument, but I wasn't trying to, and such an argument sure as heck wouldn't fit in a comment. :-)

If EfficientZero gets superhuman data-efficiency by "cheating," well it still got superhuman data-efficiency...

Sure. If Atari sample efficiency is what we ultimately care about, then the results speak for themselves. For my part, I was using sample efficiency as a hint about other topics that are not themselves sample efficiency. For example, I think that if somebody wants to understand AlphaZero, the fact that it trained on 40,000,000 games of self-play is a highly relevant and interesting datapoint. Suppose you were to then say "…but of those 40,000,000 games, fundamentally it really only needed 100 games with the external simulator to learn the rules. The other 39,999,900 games might as well have been 'in its head'. This was proven in follow-up work.". I would reply: "Oh. OK. That's interesting too. But I still care about the 40,000,000 number. I still see that number as a very important part of understanding the nature of AlphaZero and similar systems."

(I'm not sure we're disagreeing about anything…)

Misc. questions about EfficientZero

it seems pretty sample-efficient already

Maybe I'm confused, but I'm not ready to take that for granted. I think it's a bit subtle.

Let's talk about chess. Suppose I start out knowing nothing about chess, and have a computer program that I can play that enforces the rules of chess, declares who wins, etc.

  • I play the computer program for 15 minutes, until I'm quite confident that I know all the rules of chess.
  • …Then I spend 8000 years sitting on a chair, mentally playing chess against myself.

If I understand correctly, "EfficientZero is really good at Atari after playing for only 15 minutes" is true in the same sense that "I am good at chess after playing for only 15 minutes" is true in the above scenario. The 8000 years doesn't count as samples because it was in my head.
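To make the bookkeeping in that analogy concrete, here is a toy sketch. It is not EfficientZero's actual algorithm; the `CountingEnv` and `Learner` classes and all the numbers are hypothetical, chosen only to show how "samples" can mean real-environment queries while a much larger amount of in-the-head practice goes uncounted.

```python
from dataclasses import dataclass


@dataclass
class CountingEnv:
    """Hypothetical stand-in for the external chess program or Atari emulator."""
    real_steps: int = 0

    def step(self) -> None:
        # Each call is one query to the real environment, i.e. one "sample".
        self.real_steps += 1


@dataclass
class Learner:
    """Toy learner that separately tallies imagined (in-its-head) steps."""
    imagined_steps: int = 0

    def learn_rules(self, env: CountingEnv, n_real: int) -> None:
        # Phase 1: interact with the real environment to learn the rules.
        for _ in range(n_real):
            env.step()

    def practice_in_head(self, n_imagined: int) -> None:
        # Phase 2: practice against the learned model; the environment is never queried.
        self.imagined_steps += n_imagined


def reported_sample_count(env: CountingEnv) -> int:
    # The headline sample-efficiency number counts only real-environment steps.
    return env.real_steps


env = CountingEnv()
learner = Learner()
learner.learn_rules(env, n_real=100)            # illustrative number
learner.practice_in_head(n_imagined=1_000_000)  # illustrative number

print("real samples:", reported_sample_count(env))
print("imagined steps:", learner.imagined_steps)
```

The reported count stays at the number of real-environment steps no matter how large the imagined tally gets.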

Someone can correct me if I'm wrong…

(Even if that's right, it doesn't mean that EfficientZero isn't an important technological advance—after all, there are domains like robotics where simulation-in-your-head is much cheaper than querying the environment. But it would maybe deflate some of the broader conclusions that people are drawing from EfficientZero, like how AI compares to human brains…)

Soares, Tallinn, and Yudkowsky discuss AGI cognition

If you want your AGI not to manipulate humans, you can have it (1) unable to manipulate humans, (2) not motivated to manipulate humans.

These are less orthogonal than they seem: an agential AGI can become skilled in domain X by being motivated to get skilled in domain X (and thus spending time learning and practicing X).

I think the thing that happens "by default" is that the AGI has no motivations in particular, one way or the other, about teaching itself how to manipulate humans. But the AGI has motivation to do something (earn money or whatever, depending on how it was programmed), and teaching itself how to manipulate humans is instrumentally useful for almost everything, so then it will do so.

I think what happens in some people with autism is that "teaching myself how to manipulate humans, and then doing so" is not inherently neutral, but rather inherently aversive—so much so that they don't do it (or do it very little) even when it would in principle be useful for other things that they want to do. That's not everyone with autism, though. Other people with autism do in fact teach themselves how to manipulate humans reasonably well, I think. And when they do so, I think they do so using their "core of generality", just like they would teach themselves to fix a car engine. (This is different from neurotypical people, for whom a bunch of specific social instincts are also involved in manipulating people.) (To be clear, this whole paragraph is controversial / according-to-me.)

Back to AGI, I can imagine three approaches to a non-human-manipulating AI:

First, we can micromanage the AGI's cognition. We build some big architecture that includes a "manipulate humans" module, and then we make the "manipulate humans" module return the wrong answers all the time, or just turn it off. The problem is that the AGI also presumably needs some "core of generality" module that the AGI can use to teach itself arbitrary skills that we couldn't put in the modular architecture, like how to repair a teleportation device that hasn't been invented yet. What would happen is that the "core of generality" module would just build a new "manipulate humans" capability from scratch. I don't currently see any way we would prevent that. This problem is analogous to how (I think) some people with autism learn to model people in a way that doesn't invoke their social instincts.

Second, we could curate the AGI's data and environment such that it has no awareness that humans exist and are useful to manipulate. This is the Thoughts On Human Models approach. Its issues are: avoiding information leakage is hard, and even if we succeed, I don't know what useful / pivotal things we could do with such an AGI.

Third, we can attack the motivation side. We build a detector that lights up when the AGI is manipulating humans, or thinking about manipulating humans, or thinking about thinking about manipulating humans, or whatever. Whenever the detector lights up, it activates the "This Thought Or Activity Is Aversive" mechanism inside the algorithm, which throws out the thought and causes the AGI to think about something else instead. (This mechanism would correspond to a phasic dopamine pause in the brain, more or less.) I think this approach is more promising, or at least less unpromising. The tricky part is building the "detector". (Another tricky part is making the AGI motivated to not sabotage this whole mechanism, but maybe we can solve that problem with a second detector!) I do think we can build such a "detector" that mostly works; I'll talk about this in a forthcoming post. The really hard and maybe impossible part is building a "detector" that always works. The only way I know to build the detector is kinda messy (it involves supervised learning) and seems to come with no guarantees.
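A minimal sketch of the control flow in that third approach, under heavy assumptions: `toy_detector` is a hypothetical keyword check standing in for a learned classifier, and the candidate "thoughts" are invented strings. It illustrates only the veto step, and the last candidate shows a detector that mostly works missing a case.

```python
from typing import Callable, Iterable, List


def toy_detector(thought: str) -> float:
    """Hypothetical stand-in for a learned classifier; returns a score in [0, 1].

    A real detector would be trained (e.g. with supervised learning) and would
    come with no guarantee of always firing; this keyword check is a placeholder.
    """
    return 1.0 if "manipulate" in thought.lower() else 0.0


def think(candidates: Iterable[str],
          detector: Callable[[str], float],
          threshold: float = 0.5) -> List[str]:
    """Keep only the thoughts that the detector does not flag as aversive."""
    kept = []
    for thought in candidates:
        if detector(thought) >= threshold:
            # "This Thought Or Activity Is Aversive": throw it out and move on.
            continue
        kept.append(thought)
    return kept


candidates = [
    "Draft an honest status report",
    "Manipulate the operator into granting more access",
    "Ask the operator a clarifying question",
    "Flatter the operator until they stop checking my work",  # missed by the toy detector
]
surviving = think(candidates, toy_detector)
for thought in surviving:
    print(thought)
```

The explicitly labeled manipulation is discarded, but the flattery plan slips through, which is the "mostly works" versus "always works" gap in miniature.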

Omicron Variant Post #2

Maybe a minor point, but to reiterate my previous comment, Christian Althaus's formula (for transmission advantage via immune evasion) assumes a homogeneous population. Really we should be thinking of a sector of the population that has lots and lots of close contact with other people (thanks to work-situation / living-situation / etc.), and those people are disproportionately immune (naturally or via vaccination). So immune-escape implies very rapid spread, more rapid than you'd think just going by the "fraction of the population who is immune" numbers, IIUC.

(In other words, the fraction of South Africans who are immunologically naïve would be interesting to know, but also potentially misleading … We may also be interested in the fraction of South Africans who are immunologically naïve, weighted by how much close contact each person has with other people. That's bound to be a lower number.)
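A small worked example of the difference between the plain and the contact-weighted naïve fraction. The two tiers and every number below are invented purely for illustration; they are not estimates for South Africa or anywhere else, and the `Tier` fields are hypothetical names.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Tier:
    population_share: float  # fraction of the whole population in this tier
    contact_rate: float      # relative amount of close contact per person
    naive_fraction: float    # fraction of this tier with no prior immunity


def naive_fraction(tiers: List[Tier]) -> float:
    """Plain fraction of the population that is immunologically naive."""
    return sum(t.population_share * t.naive_fraction for t in tiers)


def contact_weighted_naive_fraction(tiers: List[Tier]) -> float:
    """Naive fraction where each person is weighted by how much contact they have."""
    total = sum(t.population_share * t.contact_rate for t in tiers)
    naive = sum(t.population_share * t.contact_rate * t.naive_fraction for t in tiers)
    return naive / total


# Made-up tiers: a small high-contact group that is mostly immune already,
# and a large low-contact group that is half naive.
tiers = [
    Tier(population_share=0.2, contact_rate=10.0, naive_fraction=0.1),
    Tier(population_share=0.8, contact_rate=1.0, naive_fraction=0.5),
]

plain = naive_fraction(tiers)
weighted = contact_weighted_naive_fraction(tiers)
print(f"plain: {plain:.3f}, contact-weighted: {weighted:.3f}")
```

With these made-up numbers the plain naïve fraction is about 0.42 while the contact-weighted one is about 0.21, so most of the contact in this toy population involves people who already have some immunity.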

Or sorry if I'm confused.

TTS audio of "Ngo and Yudkowsky on alignment difficulty"

Cool! Just curious: Is there something wrong with the Nonlinear Library version, or had you not heard of Nonlinear Library, or did Nonlinear Library not do those posts?

Why Study Physics?

Yeah, I also seem to have a knack for that (as good as anyone in my cohort at a top physics grad school, I have reason to believe), but I have no idea if I got it / developed it by doing lots of physics, or if I would have had it anyway. It's hard to judge the counterfactual!

Hmm, I do vaguely remember, in early college, going from a place where I couldn't reliably construct my own differential-type arguments in arbitrary domains ("if we increase the charge on the plate by dQ ..." blah blah, wam bam, and now we have a differential equation), to where I could easily do so. Maybe that's weak evidence that I got something generalizable out of physics?

Omicron Variant Post #1: We’re F***ed, It’s Never Over

Oh, I was responding to this part:

We should presume that if something takes over quickly, it has a very large advantage infecting people who are unvaccinated and lack natural immunity.

We should presume that if it also has an additional property of vaccine escape, that seems like quite a coincidence, so it seems unlikely.

We should consider this even more unlikely if the variant started out in places with low vaccination rates.

I'm suggesting that population inhomogeneity weakens this argument, and that immune escape + population inhomogeneity would seem to be a plausible explanation of how omicron appears so much more infectious than delta. I wasn't assuming immune escape / erosion, I was arguing for it being more likely / less unlikely given what we know.

Omicron Variant Post #1: We’re F***ed, It’s Never Over

I'm surprised that you of all people didn't mention population inhomogeneity ... There's a most-exposed tier of the population (based on work situation, living situation, etc.), and by now everyone in that tier has (mostly natural) immunity. An immune-escaping variant would spread rapidly within that subpopulation, right?
