I'm an AGI safety / AI alignment researcher in Boston with a particular focus on brain algorithms. Research Fellow at Astera. See https://sjbyrnes.com/agi.html for a summary of my research and sorted list of writing. Physicist by training. Email: steven.byrnes@gmail.com. Leave me anonymous feedback here. I’m also at: RSS feed, X/Twitter, Bluesky, Substack, LinkedIn, and more at my website.
Hmm, OK, well I do also believe the stronger claim “most people don't act like psychos most of the time, which is surprising” :)
Like, people watch TV. Power-seeking ruthless consequentialists would not watch TV.
I’m not sure how to operationalize this disagreement. Also, it doesn’t seem like there’s much at stake that makes it worth arguing about.
I do think that human long-term consequentialism makes the world go round (see my other comment). I just don’t think human long-term consequentialism is how the median human is spending most of their waking hours.
I made a weak statement “humans do not always act like power-seeking ruthless consequentialists”. If you want to disagree with that, it’s not enough to demonstrate that humans sometimes act like power-seeking ruthless consequentialists; rather, you would need to argue that all humans, always, with no exceptions, act like power-seeking ruthless consequentialists. That’s a very strong statement which seems totally crazy to me. You really believe that?
If so… umm, I’m not really sure where to start. Like, some humans sometimes have sacrificed their lives for others. Some humans sometimes have committed suicide. Some humans sometimes have felt strongly that their dying family member should feel comfortable in their last minutes of life, even when that person’s comfort level could not possibly have any lasting consequences. Some humans sometimes have just been going with the flow, not particularly thinking about the long-term consequences of their actions at all. Some humans sometimes have had no particular idea what long-term consequence of their action they even want to happen—and forget about actually choosing actions by back-chaining from those desired consequences. Etc. Right?
I think it’s literally true, although I could be wrong, happy to discuss.
The hippocampus (HC) is part of the cortex (it’s 3-layer “allocortex” not 6-layer “isocortex” a.k.a. “neocortex”). OK, I admit that some people treat “cortex” as short for isocortex, but I wish they wouldn’t :)
I think HC stores memories via synapses, and that HC memories can last decades. If you think that’s wrong, we can talk about the evidence (which I admit I haven’t scrutinized).
Note that, in mice, people usually discuss the HC in terms of navigation memory; certainly mice remember navigation information for their whole lifetime. Relatedly, we can talk about hippocampal lesion patients. According to this book (IIRC), H.M. had extensive damage affecting not only the hippocampus but also other structures. They quote a patient with more targeted hippocampal damage (see “Box 7.2” in the book), and she describes it as “orientational problems” and definitely not “memory problems”.
I think the HC and isocortex generally have different hyperparameters, such that HC reliably edits synapses after a single datapoint, whereas isocortex usually needs to see things multiple times before the synapses change. This is how I would explain the data that purports to show that memories “migrate” from HC to isocortex, as you mentioned—HC recalls them a few times, and isocortex eventually updates. But it never leaves HC, I think.
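To make that concrete, here’s a toy numerical sketch (purely my own illustration, with made-up learning rates, nothing measured): a fast one-shot store, plus a slow store that keeps getting re-trained by the fast store’s recalls, reproduces the appearance of “migration” even though the trace never leaves the fast store.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy illustration (my own sketch, not a model from any paper): a "memory" is a
# random target vector; each store nudges its weights toward whatever it is
# currently exposed to, with very different learning rates ("hyperparameters").
memory = rng.normal(size=10)
hc = np.zeros(10)          # hippocampus-like store: big one-shot updates
ctx = np.zeros(10)         # isocortex-like store: small incremental updates
HC_LR, CTX_LR = 1.0, 0.1   # hypothetical learning rates, chosen for illustration

def error(store):
    """Distance between what the store holds and the original memory."""
    return float(np.linalg.norm(store - memory))

# One exposure to the actual event: HC captures it at once; isocortex barely moves.
hc += HC_LR * (memory - hc)
ctx += CTX_LR * (memory - ctx)
print("after 1 exposure:  HC err %.3f, isocortex err %.3f" % (error(hc), error(ctx)))

# Each later HC recall re-presents the (HC copy of the) memory to the isocortex,
# which slowly catches up -- it looks like "migration", but HC never loses it.
for _ in range(30):
    ctx += CTX_LR * (hc - ctx)
print("after 30 recalls:  HC err %.3f, isocortex err %.3f" % (error(hc), error(ctx)))
```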
(…Although I recall reading years ago that there’s at least one part of isocortex that, like HC, reliably edits synapses after a single datapoint. …Then later on I looked for where I had seen that and couldn’t find it, so maybe I was hallucinating.)
I would have guessed that Rohypnol works by preventing the whole cortex (including HC) from editing synapses. But that’s just a guess. Do you have a reason to think otherwise?
There’s a very short-term thing which is “what neurons are active right now” (cf. “working memory”), which is what I would bring up as an analogue of (so-called) in-context learning in LLMs. That doesn’t involve synapses, I think. I do recall that there’s some molecular mechanism that makes those neurons remain more excitable for the subsequent seconds and minutes (maybe hours). We’ll talk about something being “fresh in my mind”, and it will pop into consciousness more readily.
String diagrams. Pretty much every technical diagram you’ve ever seen, from electronic circuits to dependency graphs to ???, is a string diagram. Why is this such a common format for high-level descriptions? If it’s fully general for high-level natural abstraction, why, and can we prove it? If not, what is?
My explanation would be: our feeble human minds can’t track too many simultaneous interacting causal dependencies. So if we want to (e.g.) explain intuitively why the freezing point of methanol is -98°C as opposed to -96°C, we know we can’t, and we don’t even try, we just say “sorry, there isn’t any intuitive explanation of that, it’s just what you get experimentally, and oh it’s also what you get in this molecular dynamics (MD) simulation, here’s the code”. We don’t bother to make a technical diagram of why it’s -98°C and not -96°C because it would be a zillion arrows going every which way and no one would understand it, so there’s no point in drawing it in the first place.
The MD code, incidentally, is a different structure with different interacting entities (variables, arrays, etc.), and is the kind of thing we humans can intuitively understand, and (relatedly) it can be represented pretty well as a flow diagram with boxes and arrows. So physical chemistry textbooks will talk about the MD code but NOT talk about the subtle detailed aspects of interacting methanol molecules that distinguish a -98°C freezing point from a -96°C one.
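For concreteness, here’s roughly what the core of such an MD code looks like (a minimal sketch I wrote for illustration: velocity-Verlet integration with a Lennard-Jones pair potential and placeholder parameters, nowhere near a realistic methanol force field). Each chunk maps onto a box you could draw in a flow diagram, even though no individual line “explains” why the freezing point is -98°C rather than -96°C.

```python
import numpy as np

# Minimal MD sketch: velocity-Verlet with a Lennard-Jones pair potential.
# All numbers are placeholders, not a realistic methanol force field.
N_SIDE, DT, STEPS = 4, 0.005, 200
EPS, SIGMA, SPACING = 1.0, 1.0, 1.2

# Start 64 particles on a small cubic lattice with small random velocities.
grid = np.arange(N_SIDE) * SPACING
pos = np.array([[x, y, z] for x in grid for y in grid for z in grid])
rng = np.random.default_rng(0)
vel = rng.normal(0.0, 0.3, size=pos.shape)

def forces(pos):
    """Sum the Lennard-Jones pair forces acting on every particle."""
    f = np.zeros_like(pos)
    n = len(pos)
    for i in range(n):
        for j in range(i + 1, n):
            r = pos[i] - pos[j]
            d2 = r @ r
            inv6 = (SIGMA**2 / d2) ** 3
            fij = 24 * EPS * (2 * inv6**2 - inv6) / d2 * r  # force on i from j
            f[i] += fij
            f[j] -= fij
    return f

f = forces(pos)
for step in range(STEPS):
    vel += 0.5 * DT * f      # half "kick" from current forces
    pos += DT * vel          # "drift": move particles
    f = forces(pos)          # recompute forces at new positions
    vel += 0.5 * DT * f      # second half kick

print("mean kinetic energy per particle:", 0.5 * np.mean(np.sum(vel**2, axis=1)))
```

That loop, with its handful of arrays and a few arrows between them, is the kind of structure a human (or a textbook flow diagram) can hold in their head; the emergent collective behavior of the simulated molecules is not.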
Yes! 🎉 [And lmk, here or by DM, if you think of any rewording / diagrams / whatever that would have helped you get that with less effort :) ]
The only thing I said about timelines in this particular post is “I for one expect such AI in my lifetime…” (i.e., [checks actuarial tables] AGI before 2060ish). No, I don’t want to bet on that, for obvious reasons, e.g. I would need to tie up money in escrow until after I’m dead, or else you would have to try to collect from my next-of-kin.
Elsewhere I recently wrote that I expect AGI in ‘5–25 years. …Or maybe less than 5, who knows. …Or maybe more than 25, who knows’. I stand by “AGI between zero and infinity years” with very high confidence, but we can’t bet on that either. :-P
And I can’t bet on “what they supposedly imply” because if my beliefs about AGI are right, then I expect we’ll both be much too dead for a bet to pay out in my favor.
My very diplomatic answer is: the field of Reward Function Design should be a rich domain with lots of ideas. Curiosity drive is one of them, and so is reward shaping, and so is IRL / CIRL, etc. What else should be on that list that hasn’t been invented yet? Well, let’s invent it! Let a thousand flowers bloom!
…Less diplomatically, since you asked, here’s a hot take. I’m not 100% confident, but I currently don’t think IRL / CIRL per se is a step forward for the kinds of alignment problems I’m worried about. Some possible issues (semi-overlapping) include (1) ontology identification (figuring out which latent variables if any correspond to a human, or human values, in a learned-from-scratch unlabeled world-model); (2) “the hard problem of wireheading”; (3) “the problem of fully updated deference”; (4) my guess that the “brain-like AGI” that I’m specifically working on simply wouldn’t be compatible with IRL / CIRL anyway (i.e. I’m worried that IRL-compatible algorithms would be much less powerful); and (5) my lack of confidence in the idea that learning what a particular human wants to do right now, and then wanting the same thing, really constitutes progress on the ASI x-risk problem in the first place.
Cool, good find!
…Too bad that he seems to be making the common mistake of conflating “reward function” with “utility function” / “goals” (AFAICT from skimming how he uses the term in that book). They’re related but different.
If your fixation remains solely on architecture, and you don't consider the fact that morality-shaped-stuff keeps evolving in mammals because the environment selects for it in some way…
It’s true that human moral drives (such as they are) came from evolution in a certain environment. Some people notice that and come up with a plan: “hey, let’s set up AI in a carefully-crafted evolutionary environment such that it will likewise wind up moral”. I have discussed that plan in my Intro series §8.3, where I argued both that it is a bad plan, and that it is unlikely to happen even if it were a good plan. For example, AIs may evolve to be cruel to humans just as humans are cruel to factory-farmed animals. Humans are often cruel to other humans too.
But your argument is slightly different (IIUC): you’re saying that we need not bother to carefully craft the evolutionary environment, because, good news, the real-world environment is already of the type that mammal-like species will evolve to be kind. I’m even more skeptical of that. Mammals eat each other all the time, and kill their conspecifics, etc. And why are we restricting to mammals here anyway? More importantly, I think there are very important disanalogies between a world of future AGIs and a world of mammals, particularly that AGIs can “reproduce” by instantly creating identical (adult) copies. No comment on whether this and other disanalogies should make us feel optimistic vs pessimistic about AGI kindness compared to mammal kindness. But it should definitely make us feel like it’s a different problem. I.e., we have to think about the AGI world directly, with all its unprecedented weird features, instead of unthinkingly guessing that its evolutionary trajectory will be similar to humans’ (let alone hamsters’).
[if you don’t consider environmental / evolutionary pressures, then] you are just setting yourself up for future problems when the superintelligent AI develops or cheats its way to whatever form of compartmentalization or metacognition lets it do the allegedly pure rational thing of murdering all other forms of intelligence
I’m unclear on your position here. There’s a possible take that says that sufficiently smart and reflective agents will become ruthless power-seeking consequentialists that murder all other forms of intelligence. Your comment seems to be mocking this take as absurd (by using the words “allegedly pure rational”), but your comment also seems to be endorsing this take as correct (by saying that it’s a real failure mode that I will face by not considering evolutionary pressures). Which is it?
For my part, I disagree with this take. I think it’s possible (at least in principle) to make an arbitrarily smart and reflective ASI agent that wants humans and life to flourish.
But IF this take is correct, it would seem to imply that we’re screwed no matter what. Right? We’d be screwed if a human tries to design an AGI, AND we’d be screwed if an evolutionary environment “designs” an AGI. So I’m even more confused about where you’re coming from.
You bifurcate human neurology into "neurotypical" and "sociopath" to demonstrate your dichotomy of RL based decision making vs social reward function decision making, and then stop. That's wrong. There is also an entire category of neurotype called "autistic"…
(Much of my response to this part of your comment amounts to “I don’t actually think what you think I think”.)
First, I dislike your description “RL based decision making vs social reward function decision making”. “Reward function” is an RL term. Both are RL-based. All human motivations are RL-based, IMO. (But note that I use a broad definition of “RL”.)
Second, I guess you interpreted me as having a vibe of “Yay Approval Reward!”. I emphatically reject that vibe, and in my Approval Reward post I went to some length to emphasize that Approval Reward leads to both good things and bad things, with the latter including blame-avoidance, jockeying for credit, sycophancy, status competitions, “Simulacrum Level 3”, and more.
Third, I guess you also assumed that I was also saying that Approval Reward would be a great idea for AGIs. I didn’t say that in the post, and it's not a belief I currently hold. (But it might be true, in conjunction with a lot of careful design and thought; see other comment.)
Next: I’m a big fan of understanding the full range of human neurotypes, and if you look up my neuroscience writing you’ll find my detailed opinions about schizophrenia, depression, mania, BPD, NPD, ASPD, DID, and more. As for autism, I’ve written loads about autism (e.g. here, here and links therein), and read tons about it, and have talked to my many autistic friends about their experiences, and have a kid with an autism diagnosis. That doesn’t mean my takes are right, of course! But I hope that, if I’m wrong, I’m wrong for more interesting reasons than “forgetting that autism exists”. :)
I guess your model is that autistic people, like sociopathic people, lack all innate social drives? And therefore a social-drive-free RL agent AGI, e.g. one whose reward signals are tied purely to a bank account balance going up, would behave generally like an autistic person, instead of (or in addition to?) like a sociopath? If so, I very strongly disagree.
I think “autism” is an umbrella term for lots of rather different things, but I do think it’s much more likely to involve social drives set to an unusually intense level rather than “turned off”. Indeed, I think they get so intense that they often feel overwhelming and aversive.
For example, many autistic people strongly dislike making eye contact. If someone had no innate social reactions to other people, then they wouldn’t care one way or the other about eye contact; looking at someone’s eyes would be no more aversive or significant than looking at a plant. So the “no social drives” theory is a bad match for this observation. Whereas the “unusually intense social drives” theory does match eye contact aversion.
Likewise, the “autism = no social drives” theory would predict that an autistic person would be perfectly fine if his frail elderly parents, parents who are no longer able to directly help or support him, died a gruesome and painful death right now. Whereas the “unusually intense social drives” theory would predict that he would not be perfectly fine with that. I think the latter tends to be a better fit!
Anyway, I think if you met a hypothetical person whose innate human social drive strengths were set to zero, they would look wildly different from any autistic person, but only modestly different from a sociopathic (ASPD) person.
Have you seen my post Neuroscience of human sexual attraction triggers (3 hypotheses)? I think it’s related but not identical. In particular, the way I would put it is that “feeling safe” (i.e. feeling like an interaction is low-stakes, or more specifically feeling low physiological arousal) tends to be a turnoff [at least for typical straight cis women, not sure about other cases]. And this explains many other things too, not mentioned by your post, e.g. rich / famous / powerful men do well in the dating market because an interaction with them is high-stakes by default, simply because they have the power to make another person’s life much better or worse depending on how the interaction goes, and the other person knows that.
HOWEVER, if “feeling safe” is a turnoff, feeling genuinely terrified is a turnoff too. So I think your title “nonconsent preference” is an importantly misleading description. I think it’s an inverted-U thing, not monotonic. Laughter has an analogous inverted-U relation to physiological arousal—e.g. in physical play, a kid will laugh more as the apparent threat goes up, but past some point they’ll stop laughing and switch to screaming.
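If it helps, here’s the shape distinction in toy functional form (my own made-up illustration, where $a$ stands for apparent threat / physiological arousal and $a_0$ is some hypothetical per-person sweet spot): a “nonconsent preference” reads to me as claiming a monotonic response, something like $r(a) \propto a$, whereas I’m claiming something more like

$$r(a) \;\propto\; a\, e^{-a/a_0},$$

which increases up to $a = a_0$ and then falls off, i.e. the laugh-then-scream pattern.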