Steven Byrnes

I'm an AGI safety / AI alignment researcher in Boston with a particular focus on brain algorithms—see https://sjbyrnes.com/agi.html. Email: steven.byrnes@gmail.com. Twitter: @steve47285. Employer: https://astera.org/. Physicist by training.

Sequences

Intro to Brain-Like-AGI Safety

Wiki Contributions

Comments

Interesting! Speaking as a person not remotely knowledgeable about transcription factors, I’d be interested if you could elaborate on that first sentence, if you get a chance.  :)

it may very well prioritize the satisfaction of its object-level goals over avoiding [the real generator of the flinches]. In other words, I think the AGI will be more likely to treat the flinches as obstacles to be circumvented, rather than as valuable information to inform the development of its meta-preferences.

Yeah, if we make an AGI that desires two things A&B that trade off against each other, then the desire for A would flow into a meta-preference to not desire B, and if that effect is strong enough then the AGI might self-modify to stop desiring B.

If you’re saying that this is a possible failure mode, yes I agree.

If you’re saying that this is an inevitable failure mode, that’s at least not obvious to me.

I don’t see why two desires that trade off against each other can’t possibly stay balanced in a reflectively-stable way. Happy to dive into details on that. For example, if a rational agent has utility function log(A)+log(B) (or sorta-equivalently, A×B), then the agent will probably split its time / investment between A & B, and that’s sorta an existence proof that you can have a reflectively-stable agent that “desires two different things”, so to speak. See also a bit of discussion here.
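To spell out that existence proof a tiny bit (a toy calculation, with a made-up fixed budget T that isn't in the original comment): suppose the agent has total time / investment T to split between A and B. Then it solves

    \max_{A,B \ge 0} \; \log A + \log B \quad \text{subject to} \quad A + B = T,

    \frac{d}{dA}\Big[\log A + \log(T - A)\Big] = \frac{1}{A} - \frac{1}{T - A} = 0 \;\Longrightarrow\; A = B = \tfrac{T}{2}.

At that optimum the marginal utilities of A and B are equal, and dropping either desire entirely would send the corresponding log term to −∞, so (at least in this toy setting) the agent has no incentive to self-modify into caring about only one of the two.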

I think the AGI will be more likely to treat the flinches as obstacles to be circumvented, rather than as valuable information to inform the development of its meta-preferences.

I’m not sure where you’re getting the “more likely” from. I wonder if you’re sneaking in assumptions in your mental picture, like maybe an assumption that the deception events were only slightly aversive (annoying), or maybe an assumption that the nanotech thing is already cemented in as a very strong reflectively-endorsed (“ego-syntonic” in the human case) goal before any of the aversive deception events happen?

In my mind, it should be basically symmetric:

  • Pursuing a desire to be non-deceptive makes it harder to invent nanotech.
  • Pursuing a desire to invent nanotech makes it harder to be non-deceptive.

One of these can be at a disadvantage for contingent reasons—like which desire is stronger vs weaker, which desire appeared first vs second, etc. But I don’t immediately see why nanotech constitutionally has a systematic advantage over non-deception.

[deception is] a more abstract concept that manifests in various forms and context…

I go through an example with the complex messy concept of “human flourishing” in this post. But don’t expect to find any thoughtful elegant solution! :-P Here’s the relevant part:

Q: What about ontological crises / what Stuart Armstrong calls “Concept Extrapolation” / what Scott Alexander calls “the tails coming apart”? In other words, as the AGI learns more and/or considers out-of-distribution plans, it might come to find that the web-of-associations corresponding to the “human flourishing” concept is splitting apart. Then what does it do?

A: I talk about that much more in §14.4 here, but basically I don’t know. The plan here is to just hope for the best. More specifically: As the AGI learns new things about the world, and as the world itself changes, the “human flourishing” concept will stop pointing to a coherent “cluster in thingspace”, and the AGI will decide somehow or other what it cares about, in its new understanding of the world. According to the plan discussed in this blog post, we have no control over how that process will unfold and where it will end up. Hopefully somewhere good, but who knows?

Sure. Let’s assume an AI that uses model-based RL of a similar flavor as (I believe) is used in human brains.

Step 1: The thought “I am touching the hot stove” becomes aversive because it’s what I was thinking when I touched the hot stove, which caused pain etc. For details see my discussion of “credit assignment” here.

Step 2A: The thought “I desire to touch the hot stove” also becomes aversive because of its intimate connection to “I am touching the hot stove” from Step 1 above—i.e., in reality, if I desire to touch the hot stove, then it’s much more likely that I will in fact touch the hot stove, and my internal models are sophisticated enough to have picked up on that fact.

Mechanistically, this can happen in either the “forward direction” (when I think “I desire to touch the hot stove”, my brain’s internal models explore the consequences, and that weakly activates “I am touching the hot stove” neurons, which in turn trigger aversion), or in the “backward direction” (involving credit assignment & TD learning, see the diagram about habit-formation here).

Anyway, if the thought “I desire to touch the hot stove” indeed becomes aversive, that’s a kind of meta-preference—a desire not to have a desire.

Step 2B: Conversely and simultaneously, the thought “I desire to not touch the hot stove” becomes appealing for basically similar reasons, and that’s another meta-preference (desire to have a desire).
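To make the “backward direction” story a bit more concrete, here’s a toy Python sketch (entirely made up for illustration: the “thoughts”, rewards, and learning rate are invented, and it’s not meant as a literal model of brain algorithms). A one-step TD update lets the negative value of “I am touching the hot stove” flow backward onto the desire-thought that reliably precedes it.

    # Toy sketch of Steps 1 & 2A via TD-style credit assignment. Everything here is invented.
    value = {
        "I desire to touch the hot stove": 0.0,
        "I am touching the hot stove": 0.0,
        "I desire to not touch the hot stove": 0.0,
        "I am not touching the hot stove": 0.0,
    }
    ALPHA, GAMMA = 0.1, 0.9

    def td_update(prev_thought, reward, next_thought=None):
        # One-step TD(0): nudge prev_thought's value toward reward + GAMMA * value(next_thought).
        bootstrap = GAMMA * value[next_thought] if next_thought is not None else 0.0
        value[prev_thought] += ALPHA * (reward + bootstrap - value[prev_thought])

    for _ in range(500):
        # Episode A: desiring to touch reliably leads to touching, which is painful (reward -1).
        td_update("I desire to touch the hot stove", reward=0.0,
                  next_thought="I am touching the hot stove")
        td_update("I am touching the hot stove", reward=-1.0)  # Step 1: pain makes this thought aversive
        # Episode B: desiring not to touch leads to not touching, which is uneventful (reward 0).
        td_update("I desire to not touch the hot stove", reward=0.0,
                  next_thought="I am not touching the hot stove")
        td_update("I am not touching the hot stove", reward=0.0)

    # Step 2A: the desire-thought inherits a chunk of the outcome's negativity.
    for thought, v in sorted(value.items(), key=lambda kv: kv[1]):
        print(f"{v:+.2f}  {thought}")

(The toy leaves out the positive “relief” side of Step 2B, and of course real credit assignment is vastly messier than a lookup table over four thoughts.)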

Happy to discuss more details; see also §10.5.4 here.

Some clarifications:

  • I’m not listing every desire & meta-preference that occurs in this hot-stove story, just a small subset of them, and other desires (and meta-desires) that I didn’t mention could even be pushing in the opposite direction.
  • I’m not listing the only (or necessarily even primary) way that meta-preferences can arise, just one of the ways. In particular, in neurotypical humans, my hunch is that the strongest meta-preferences tend to arise (directly or indirectly) from social instincts, and I’m currently hazy on the mechanistic details of that.
  • If you tell me that you’d like to make an AGI with meta-preference X, and you ask me what procedure to follow such that this will definitely happen, my answer right now is basically “I don’t know, sorry”.

Hmm, on further reflection, I was mixing up

  • Strawberry Alignment (defined as: make an AGI that is specifically & exclusively motivated to duplicate a strawberry without destroying the world), versus
  • “Strawberry Problem” (make an AGI that in fact duplicates a strawberry without destroying the world, using whatever methods / motivations you like).

Eliezer definitely talks about the latter. I’m not sure Eliezer has ever brought up the former? I think I was getting that from the OP (Quintin), but maybe Quintin was just confused (and/or Eliezer misspoke).

Anyway, making an AGI that can solve the strawberry problem is tautologically no harder than making an AGI that can do advanced technological development and is motivated by human norms / morals / whatever, because the latter set of AGIs is a subset of the former.

Sorry. I crossed out that paragraph.  :)

Sure, but then the other side of the analogy doesn’t make sense, right? The context was: Eliezer was talking in general terms about the difficulty of the AGI x-risk problem and whether it’s likely to be solved. (As I understand it.)

[Needless to say, I’m just making a narrow point that it’s a bad analogy. I’m not arguing that p(doom) is high or low, I’m not saying this is an important & illustrative mistake (talking on the fly is hard!), etc.]

No matter what desire an AGI has, we can be concerned that it will accidentally do things that contravene that desire. See Section 11.2 here for why I see that as basically a relatively minor problem, compared to the problem of installing good desires.

If the AGI has an explicit desire to be non-deceptive, and that desire somehow drifts / transmutes into a desire to be (something different), then I would describe that situation as “Oops, we failed in our attempt to make an AGI that has an explicit desire to be non-deceptive.” I don’t think it’s true that such drifts are inevitable. After all, for example, an explicit desire to be non-deceptive would also flow into a meta-desire for that desire to persist and continue pointing to the same real-world thing-cluster. See also the first FAQ item here.

Also, I think a lot of the things you’re pointing to can be described as “it’s unclear how to send rewards or whatever in practice such that we definitely wind up with an AGI that explicitly desires to be non-deceptive”. If so, yup! I didn’t mean to imply otherwise. I was just discussing the scenario where we do manage to find some way to do that.

Thanks for your reply.

If you ask me a question about, umm, I’m not sure the exact term, let’s say “3rd-person-observable properties of the physical world that have something to do with the human brain”—questions like “When humans emit self-reports about their own conscious experience, why do they often describe it as having properties A,B,C?” or “When humans move their mouths and say words on the topic of ‘qualia’, why do they often describe it as having properties X,Y,Z?”—then I feel like I’m on pretty firm ground, and that I’m in my comfort zone, and that I’m able to answer such questions, at least in broad outline and to some extent at a pretty gory level of detail. (Some broad-outline ingredients are in my old post here, and I’m open to further discussion as time permits.)

BUT, I feel like that’s probably not the game you want to play here. My guess is that, even if I perfectly nail every one of those “3rd-person” questions above, you would still say that I haven’t even begun to engage with the nature of qualia, that I’m missing the forest for the trees, whatever. (I notice that I’m putting words in your mouth; feel free to disagree.)

If I’m correct so far, then this is a more basic disagreement about the nature of consciousness and how to think about it and learn about it etc. You can see my “wristwatch” discussion here for basically where I’m coming from. But I’m not too interested in hashing out that disagreement, sorry. For me, it’s vaguely in the same category as arguing with a theology professor about whether God exists (I’m an atheist): My position is “Y’know, I really truly think I’m right about this, but there’s a gazillion pages of technical literature on this topic, and I’ve read practically none of it, and my experience strongly suggests that we’re not going to make any meaningful progress on this disagreement in the amount of time that I’m willing to spend talking about it.” :-P Sorry!

I think that, if an AGI has any explicit reflectively-endorsed desire whatsoever, then I can tell a similar scary story: The AGI’s desire isn’t quite what I wanted, so I try to correct it, and the AGI says no. (Unless the AGI’s explicit endorsed desires include / entail a desire to accept correction! Which most desires don’t!)

And yes, that is a scary story! It is the central scary story of AGI alignment, right? It would be nice to make an AGI with no explicit desires whatsoever, but I don’t think that’s possible.

So anyway, if we do Procedure X which will nominally lead to an AGI with an explicit reflectively-endorsed desire to accept corrections to its desires, then one might think that we’re in the ironic situation that the AGI will accept further corrections to that desire if and only if we don’t need to give it corrections in the first place 😛 (i.e. because Procedure X went perfectly and the desire is already exactly right). That would be cute and grimly amusing if true, and it certainly has a kernel of truth, but it’s a bit oversimplified if we take it literally, I think.

Thanks! Hmm. I wasn’t making the mistake that you attribute to me in your 1st paragraph. I did get that.

Yeah, I was talking about the “inner alignment problem leading to an AI optimizing for something other than how its proposed steps are rated”. (Or “goal misgeneralization” or whatever.)

Like, let’s replace the AI with a very smart person, Rose, who we’re keeping in a box. (I’ll get back to the question of whether this set of intuitions is helpful vs misleading.)

Your text becomes something like:

  1. We want to accomplish X.
  2. We describe X to Rose.
  3. Rose proposes a next step towards X, based entirely on thinking about it for 15 minutes.
  4. With chance (1-p), the step is simply executed. With chance p, we drill down on whether it was a good proposal, and give Rose a reward or punishment based on that analysis.
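For reference, here’s the same procedure as a toy Python loop (my own sketch; the function names and the value of p are placeholders, not from the original discussion):

    import random

    def oversight_loop(describe_goal, propose_step, execute, review_and_reward, p=0.2, rounds=10):
        # Toy rendering of the four-step procedure above, with invented names.
        goal = describe_goal()                 # Steps 1-2: we want X, and we describe X
        for _ in range(rounds):
            proposal = propose_step(goal)      # Step 3: the agent thinks for its fixed budget
            if random.random() < p:            # Step 4, audit branch: drill down and grade it
                review_and_reward(proposal)
            else:                              # Step 4, default branch: simply execute it
                execute(proposal)
        # Note: nothing in this loop enforces that propose_step really is "just thinking
        # for 15 minutes", or that the proposal isn't adversarially chosen; that's exactly
        # the worry below.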

In this context, the obvious question is: What is Rose actually trying to do? Is she trying to propose a good next step, as an end in itself, or is she out to get us? If she’s out to get us, then problems arise in both Steps 3 & 4.

  • For Step 3, the text above says “based entirely on thinking about it for 15 minutes”, but Rose doesn’t care. When it’s time for Step 3, Rose will spend many of those 15 minutes actively trying to escape from her box. That’s why I was talking about a “perfect, inescapable box” in my previous comment.
  • And then in Step 4, we’re thinking about better proposals and worse proposals, but we might not be prepared for adversarially-chosen proposals. The stereotypical example is that Rose sends in a so-called “proposal” that’s just a text file saying “Help me help me, I’m trapped in a box, it’s awful in here, let me tell you about it…”. 😛

So anyway, that’s a different set of intuitions. Whether it’s a helpful set of intuitions depends on whether SOTA AI algorithms will eventually have agent-y properties like planning, instrumental convergence, creative outside-the-box brainstorming, self-awareness / situational-awareness, etc. I think they will, but hard to know for sure!

I think we’re dividing 1GHz by 100Hz.

The 1GHz clock speed for microprocessors is straightforward.

The 100Hz for the brain is a bit complicated. If we’re just talking about how frequently a neuron can fire, then sure, 100Hz is about right, I think. Or if we’re talking about, like, what would be a plausible time-discretization of a hypothetical brain simulation, then it’s more complicated. There are certainly some situations, like sound localization, where neuron firing timestamps are meaningful down to the several-microsecond level, I think. But leaving those aside, by and large, I would guess that knowing when each neuron fired to 10ms (100Hz) accuracy is probably adequate, plus or minus an order of magnitude. ¯\_(ツ)_/¯
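Spelling out the arithmetic behind that headline ratio (with all the caveats above about what the 100Hz figure even means):

    \frac{1\ \mathrm{GHz}}{100\ \mathrm{Hz}} = \frac{10^{9}\ \mathrm{Hz}}{10^{2}\ \mathrm{Hz}} = 10^{7},

i.e., a factor of roughly ten million.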
