I'm an AGI safety / AI alignment researcher in Boston with a particular focus on brain algorithms. Research Fellow at Astera. See https://sjbyrnes.com/agi.html for a summary of my research and sorted list of writing. Physicist by training. Email: steven.byrnes@gmail.com. Leave me anonymous feedback here. I’m also at: RSS feed, X/Twitter, Bluesky, Substack, LinkedIn, and more at my website.
It seems like the short-term predictor should learn to predict (based on context cues) the behavior triggered by the hardwired circuitry. But it should predict that behavior only 0.3 seconds early?
You might be missing the “static context” part (§5.2.1). The short-term predictor learns a function F : context → output. Suppose (for simplicity) the context is the exact same vector c₁ for 5 minutes, and then out of nowhere at time T, an override appears and says “Ground Truth Alert: F(c₁) was too low!” Then the learning algorithm will make F(c₁) higher for next time.
But the trick is, the output is determined by F(c₁) at T–(0.3 seconds), AND the output is determined by the exact same function F(c₁) at T–(4 minutes). So if the situation recurs a week later, F(c₁) will be higher not just 0.3 seconds before the override, but also 4 minutes before it. You can’t update one without the other, because it’s the same calculation.
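Here’s a minimal toy sketch of that point (a bare-bones delta-rule predictor with made-up numbers; just illustrating the update logic, not the actual circuitry):

```python
# Toy model: a short-term predictor F(context) -> output, trained only by
# "override" events. Because the context vector is static for the whole
# episode, the single update to F(c1) changes the prediction at EVERY time
# that context is present, not just 0.3 seconds before the override.
import numpy as np

rng = np.random.default_rng(0)
c1 = rng.normal(size=8)
c1 /= np.linalg.norm(c1)        # the static context vector
w = np.zeros(8)                 # parameters of the learned predictor F

def F(context):
    return w @ context          # predictor output (e.g. digestive enzyme level)

learning_rate = 0.5
for episode in range(10):
    prediction = F(c1)          # same value at T-4min and at T-0.3sec
    ground_truth = 1.0          # the hardwired override says "too low!"
    w += learning_rate * (ground_truth - prediction) * c1   # delta rule
    # Next time this context recurs, F(c1) is higher from the very start
    # of the episode -- no step-by-step backward creep needed.
    print(f"episode {episode}: F(c1) = {F(c1):.3f}")
```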
Oh! Is the key point that there's a kind of resonance, where this system maintains the behavior of the genetically hardwired components?
I don’t think so. If I understand you correctly, the thing you’re describing here would be backwards-looking (“predicting” something that already happened), but what we want is forward-looking (how much digestive enzyme will be needed in 5 minutes?).
Is the idea that the lookahead propagates earlier and earlier with each cycle? You start with a 0.3 second prediction. But that means that supervisory signal (when in the "defer-to-predictor mode") is 0.3 seconds earlier, which means that the predictor learns to predict the change in output 0.6 seconds ahead…
The thing you’re pointing to is a well-known phenomenon in TD learning in the AI literature (and is an obstacle to efficient learning). I think it can happen in humans and animals—I recall reading a paper that (supposedly) observed dopamine marching backwards in time with each repetition, which of course got the authors very excited—but if it happens at all, I think it’s rare, and that humans and animals generally learn things with many fewer repetitions than it would take for the signal to walk backwards step by step.
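For concreteness, here’s a minimal sketch of that textbook TD(0) dynamic (a toy chain of states with made-up numbers, nothing brain-specific): the prediction walks backwards roughly one state per repetition, which is why this mechanism alone would need lots of repetitions before producing predictions minutes in advance.

```python
# TD(0) on a chain of 6 states with reward only after the last state.
N = 6
gamma, alpha = 1.0, 1.0      # learning rate 1 makes the backward creep obvious
V = [0.0] * N                # value estimates (predictions of upcoming reward)

for episode in range(1, 7):
    for s in range(N):
        reward = 1.0 if s == N - 1 else 0.0
        next_value = V[s + 1] if s + 1 < N else 0.0
        # TD(0) update: nudge V[s] toward reward + gamma * V[next state]
        V[s] += alpha * (reward + gamma * next_value - V[s])
    print(f"after repetition {episode}: V = {V}")
# The nonzero predictions creep back one state per repetition:
# [0,0,0,0,0,1] -> [0,0,0,0,1,1] -> [0,0,0,1,1,1] -> ...
```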
Instead, I claim that the “static context” picture above is capturing an important dynamic even in the real world where the context is not literally static. Certain aspects of the context are static, and that’s good enough. See the sweater example in §5.3.2 for how (I claim) this works in detail.
Drake equation wrong interpretation: The mean of multiplication is not multiplication of the mean
As I’ve previously written, I disagree that this constitutes a separate explanation. This paper is just saying that, as far as we know, one or more of the Drake equation parameters might be very much lower than Drake’s guess. But yeah duh, the whole point of this discourse is to figure out which parameter is very much lower and why. Pretty much all the other items on your list are engaged in that activity, so I think this box is the odd one out and should be deleted. (But if you’re trying to do a lit review without being opinionated, then I understand why you’d keep it in. I just like to rant about this.)
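(For what it’s worth, here’s the arithmetic that the quoted box is gesturing at, with completely made-up numbers: a product of log-uncertain factors is so skewed that its mean is a poor summary of it, and a big chunk of probability mass can sit at tiny values even when the mean is astronomical.)

```python
# Seven independent Drake-equation-style factors, each lognormal with
# median 1 and a two-order-of-magnitude spread (illustrative numbers only).
import math

n_factors, sigma_log10 = 7, 2.0
sigma_total = sigma_log10 * math.sqrt(n_factors)   # spread of log10(product)

median_product = 1.0                               # product of the medians
mean_product = math.exp(0.5 * (sigma_total * math.log(10)) ** 2)  # lognormal mean
p_below_1e_minus_6 = 0.5 * (1 + math.erf(-6 / (sigma_total * math.sqrt(2))))

print(f"median of the product ≈ {median_product:.1f}")
print(f"mean of the product   ≈ {mean_product:.1e}  (driven by rare huge-value scenarios)")
print(f"P(product < 1e-6)     ≈ {p_below_1e_minus_6:.2f}")
```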
Multicellular life is difficult
In The Vital Question, Nick Lane argues (IMO plausibly) that the hard step is not multicellular life per se but rather eukaryotes (i.e. cellular life with at least two different genomes). Not all eukaryotes are multicellular, but once eukaryotes existed, they evolved multicellularity many times independently (if I recall correctly).
Space travel is very difficult for unknown reasons
AFAICT, “interstellar travel is impossible or extremely slow because there’s too much dust and crap in space that you’d collide with” remains a live possibility that doesn’t get enough attention around these parts.
immoral but very interesting experiment … not seeing any human face for multiple months, be it in person, on pictures or on your phone
There must be plenty of literature on the psychological effects of isolation, but I haven’t looked into it much. (My vague impression is: “it messes people up”.) I think I disagree that my theory makes a firm prediction, because who is to say that the representations will drift on a multiple-month timescale, as opposed to much slower? Indeed, the fact that adults are able to recall and understand memories from decades earlier implies that, after early childhood, pointers to semantic latent variables remain basically stable.
2. Try to disconnect your previous thoughts from arriving at “she feels pain”
I would describe this as: if it’s unpleasant to think about how my friend is suffering, then I can avoid those unpleasant feelings by simply not thinking about that, and thinking about something else instead.
For starters, there’s certainly a kernel of truth to that. E.g. see compassion fatigue, where people will burn out and quit jobs working with traumatized people. Or if someone said to me: “I stopped hanging out with Ahmed, he’s always miserable and complaining about stuff, and it was dragging me down too”, I would see that as a perfectly normal and common thing for someone to say and do. But you’re right that it doesn’t happen 100% of the time, and that this merits an explanation.
My own analysis is at: §4.1.1 and §4.1.2 of my (later) Sympathy Reward post. The most relevant-to-you part starts at: “From my perspective, the interesting puzzle is not explaining why this ignorance-is-bliss problem happens sometimes, but rather explaining why this ignorance-is-bliss problem happens less than 100% of the time. In other words, how is it that anyone ever does pay attention to a suffering friend? …”
So that’s my take. As for your take, I think one of my nitpicks would be that I think you’re giving the optimizer-y part of the brain a larger action space than it actually has. If I would get a higher reward by magically teleporting, I’m still not gonna do that, because I can’t. By the same token, if I would get a higher reward by no longer knowing some math concept that I’ve already learned, tough luck for me, that is not an available option in my action space. My world-model is built by predictive (a.k.a. self-supervised) learning, not by “whatever beliefs would lead to immediate higher reward”, and for good reason: the latter has pathological effects, as you point out. (I’ve written about it too, long ago, in Reward is Not Enough.) I do have actions that can impact beliefs, but only in an indirect and limited way—see my discussion of motivated reasoning (also linked in my other comment).
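Here’s a toy contrast between those two ways of updating beliefs (my own made-up illustration, with a single scalar “belief” and an invented reward; obviously nothing like the real algorithms):

```python
# One scalar belief b = estimated probability that a friend is doing fine.
# Rule 1 is predictive / self-supervised (track the observations);
# Rule 2 picks whatever belief yields the highest immediate reward.
import numpy as np

rng = np.random.default_rng(0)
observations = (rng.random(200) < 0.2).astype(float)  # friend is fine 20% of days

b_predictive, b_reward_chosen = 0.5, 0.5
lr = 0.05
for obs in observations:
    # Self-supervised: reduce prediction error on what actually happened.
    b_predictive += lr * (obs - b_predictive)
    # Reward-maximizing: "believing she's fine" feels better, so just crank it up.
    b_reward_chosen = min(1.0, b_reward_chosen + lr)

print(f"predictive belief    ≈ {b_predictive:.2f}   (tracks the ~20% base rate)")
print(f"reward-chosen belief ≈ {b_reward_chosen:.2f}   (wireheads to certainty)")
```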
Thanks again for engaging :)
these associations need to be very exact, else humans would be reward-hacking all day: it's reasonable to assume that the activations of thinking "She's happy" are very similar to trying to convince oneself "She's happy" internally, even 'knowing' the truth. But if both resulted in big feelings of internal happiness, we would have a lot more psychopaths.
I don’t think things work that way. There are a lot of constraints on your thoughts. Copying from here:
1. Thought Generator generates a thought: The Thought Generator settles on a “thought”, out of the high-dimensional space of every thought you can possibly think at that moment. Note that this space of possibilities, while vast, is constrained by current sensory input, past sensory input, and everything else in your learned world-model. For example, if you’re sitting at a desk in Boston, it’s generally not possible for you to think that you’re scuba-diving off the coast of Madagascar. Likewise, it’s generally not possible for you to imagine a static spinning spherical octagon. But you can make a plan, or whistle a tune, or recall a memory, or reflect on the meaning of life, etc.
If I want to think that Sally is happy, but I know she’s not happy, I basically can’t, at least not directly. Indirectly, yeah sure, motivated reasoning obviously exists (I talk about how it works here), and people certainly do try to convince themselves that their friends are happy when they’re not, and sometimes (but not always) they are even successful.
I don’t think there’s (the right kind of) overlap between the thought “I wish to believe that Sally is happy” and the thought “Sally is happy”, but I can’t explain why I believe that, because it gets into gory details of brain algorithms that I don’t want to talk about publicly, sorry.
Emotions…feel like this weird, distinct thing such that any statement along the lines "I'm happy" does it injustice. Therefore I can't see it being carried over to "She's happy", their intersection wouldn't be robust enough such that it won't falsely trigger for actually unrelated things. That is, "She's happy" ≈ "I'm happy" ≉ experiencing happiness
I agree that emotional feelings are hard to articulate. But I don’t see how that’s relevant. Visual things are also hard to articulate, but we can learn a robust two-way association between [certain patterns in shapes and textures and motions] and [a certain specific kind of battery compartment that I’ve never tried to describe in English words]. By the same token, we can learn a robust two-way association between [certain interoceptive feelings] and [certain outward signs and contexts associated with those feelings]. And this association can get learned in one direction [interoceptive model → outward sign] from first-person experience, and later queried in the opposite direction [outward sign → interoceptive model] in a third-person context.
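Here’s a minimal sketch of the kind of two-way association I mean, using a bog-standard outer-product associative memory with random made-up patterns (definitely not a claim about the actual brain implementation): the weights get learned from paired first-person experience, and the very same weights can then be queried in either direction.

```python
# Learn a pairing between an "interoceptive feeling" pattern and an
# "outward signs / context" pattern, then query it both ways.
import numpy as np

rng = np.random.default_rng(0)
feeling = np.sign(rng.normal(size=50))   # stand-in for the felt emotion
outward = np.sign(rng.normal(size=50))   # stand-in for smiles, posture, context

# Hebbian outer-product learning during first-person experience:
W = np.outer(feeling, outward)

# Third-person case: seeing the outward signs retrieves the feeling pattern...
retrieved_feeling = np.sign(W @ outward)
# ...and the same weights still work in the original direction:
retrieved_outward = np.sign(W.T @ feeling)

print("outward -> feeling match:", np.mean(retrieved_feeling == feeling))  # 1.0
print("feeling -> outward match:", np.mean(retrieved_outward == outward))  # 1.0
```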
(Or sorry if I’m misunderstanding your point.)
what is their evolutionary advantage if they don't at least offer some kind of subconscious effect on conspecifics?
Again, my answer is “none”. We do lots of things that don’t have any evolutionary advantage. What’s the evolutionary advantage of getting cancer? What’s the evolutionary advantage of slipping and falling? Nothing. They’re incidental side-effects of things that evolved for other reasons.
Part of it is the “vulnerability” where any one user can create arbitrary amounts of reacts, which I agree is cluttering and distracting. Limiting reacts per day seems reasonable (I don’t know if 1 is the right number, but it might be, I don’t recall ever react-ing more than once a day myself). Another option (more labor-intensive) would be for mods to check the statistics and talk to outliers (like @TristanTrim) who use way way more reacts than average.
What's your take on why Approval Reward was selected for in the first place VS sociopathy?
Good question!
There are lots of things that an ideal utility maximizer would do via means-end reasoning, but that humans and animals do instead because those things seem valuable as ends in themselves, thanks to the innate reward function. E.g. curiosity, as discussed in A mind needn't be curious to reap the benefits of curiosity. And also play, and injury-avoidance, etc. Approval Reward has the same property: whatever selfish end an ideal utility maximizer could achieve via Approval Reward, it could achieve as well if not better by acting as if it had Approval Reward in situations where that’s in its selfish best interests, and not where it isn't.
In all these cases, we can ask: why do humans in fact find these things intrinsically motivating? I presume that the answer is something like humans are not automatically strategic, which is even more true when they’re young and still learning. “Humans are the least intelligent species capable of building a technological civilization.” For example, people with analgesic conditions (like leprosy or congenital insensitivity to pain) are often shockingly cavalier about bodily harm, even when they know consciously that it will come back to bite them in the long term. Consequentialist planning is often not strong enough to outweigh what seems appealing in the moment.
To rephrase more abstractly: for ideal rational agents, intelligent means-end planning towards X (say, gaining allies for a raid) is always the best way to accomplish that same X. If some instrumental strategy S (say, trying to fit in) is usually helpful towards X, means-end planning can deploy S when S is in fact useful, and not deploy S when it isn’t. But humans, who are not ideal rational agents, are often more likely to get X if they want X and also intrinsically want S as an end in itself. The costs of this strategy (i.e., still wanting S even in cases where it’s not useful towards X) are outweighed by the benefit (avoiding the problem of not pursuing S because you didn’t think of it, or can’t be bothered).
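Here’s that cost-benefit claim as toy arithmetic (made-up numbers, purely for illustration):

```python
# S helps get X in most situations, but a limited planner only *notices*
# that S would help some fraction of the time. An agent that intrinsically
# wants S deploys it always, at the cost of sometimes doing S uselessly.
value_of_X      = 10.0   # payoff when X is achieved
p_S_helps       = 0.8    # fraction of situations where S actually helps get X
p_planner_spots = 0.5    # chance a non-ideal planner notices that S would help
cost_of_S       = 1.0    # cost of doing S (time, effort)

# Pure means-end planner: does S only when it notices S would help.
planner_payoff = p_S_helps * p_planner_spots * (value_of_X - cost_of_S)

# Intrinsic-drive agent: always does S, useful or not.
intrinsic_payoff = p_S_helps * (value_of_X - cost_of_S) - (1 - p_S_helps) * cost_of_S

print(f"planner (S only when noticed): {planner_payoff:.1f}")   # 3.6
print(f"intrinsically wanting S:       {intrinsic_payoff:.1f}") # 7.0
```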
This doesn’t apply to all humans all the time, and I definitely don’t think it will apply to AGIs.
…For completeness, I should note that there’s an evo-psych theory that there has been frequency-dependent selection for sociopaths—i.e., if there are too many sociopaths in the population, then everyone else improves their wariness and ability to detect sociopaths and kill or exile them, but when sociopathy is rare, it’s adaptive (or at least, was adaptive in Pleistocene Africa). I haven’t seen any good evidence for this theory, and I’m mildly skeptical that it’s true. Wary or not, people will learn the character traits of people they’ve lived and worked with for years. Smells like a just-so story, or at least that’s my gut reaction. More importantly, the current population frequency of sociopathy is in the same general ballpark as schizophrenia, profound autism, etc., which seem (to me) very unlikely to have been adaptive in hunter-gatherers. My preferred theory is that there’s frequency-dependent selection across many aspects of personality, and then sometimes a kid winds up with a purely-maladaptive profile because they’re at the tail of some distribution. [Thanks to science banana for changing my mind on this.]
I find myself wondering if non-behavioral reward functions are more powerful in general than behavioral ones due to less tendency towards wireheading, etc. (consider the laziness & impulsivity of sociopaths)
I think the “laziness & impulsivity of sociopaths” can be explained away as a consequence of the specific way that sociopathy happens in human brains, via chronically low physiological arousal (which also leads to boredom and thrill-seeking). I don’t think we can draw larger lessons from that.
I also don’t see much connection between “power” and behaviorist reward functions. For example, eating yummy food is (more-or-less) a behaviorist component of the overall human reward function. And its consequences are extraordinary. Consider going to a restaurant, and enjoying it, and thus going back again a month later. It sounds unimpressive, but really it’s remarkable. After a single exposure (compare that to the data inefficiency of modern RL agents!), the person is making an extraordinarily complicated (by modern AI standards) plan to get that same rewarding experience, and the plan will almost definitely work on the first try. The plan is hierarchical, involving learned motor control (walking to the bus), world-knowledge (it’s a holiday so the buses run on the weekend schedule), dynamic adjustments on the fly (there’s construction, so you take a different walking route to the bus stop), and so on, which together is way beyond anything AI can do today.
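Here’s an extremely stripped-down sketch of that “one exposure → multi-step plan” structure (a hand-built toy world model plus a breadth-first planner; nothing remotely like the real hierarchical, on-the-fly complexity):

```python
# After a single rewarding visit, a model-based agent records the reward in
# its world model and can immediately plan a multi-step route back to it --
# no thousands of repetitions required.
from collections import deque

world_model = {                      # which places connect to which
    "home": ["bus_stop"],
    "bus_stop": ["home", "downtown"],
    "downtown": ["bus_stop", "restaurant"],
    "restaurant": [],
}
learned_reward = {"restaurant": 10.0}   # written after ONE enjoyable visit

def plan(start, goal):
    frontier, came_from = deque([start]), {start: None}
    while frontier:
        place = frontier.popleft()
        if place == goal:
            path = []
            while place is not None:
                path.append(place)
                place = came_from[place]
            return list(reversed(path))
        for nxt in world_model[place]:
            if nxt not in came_from:
                came_from[nxt] = place
                frontier.append(nxt)

goal = max(learned_reward, key=learned_reward.get)
print(plan("home", goal))    # ['home', 'bus_stop', 'downtown', 'restaurant']
```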
I do think there’s a connection between “power” and consequentialist desires. E.g. the non-consequentialist “pride in my virtues” does not immediately lead to anything as impressive as the above consequentialist desire to go to that restaurant. But I don’t see much connection between behaviorist rewards and consequentialist desires—if we draw a 2×2 thing, then I can think of examples in all four quadrants.
As a full-time AGI safety / alignment researcher since 2021, I wouldn’t have made a fraction as much progress without lesswrong / alignment forum, which is not just a first-rate publishing platform but a unique forum and community, built from the ground up to facilitate careful and productive conversations. I’m giving Lightcone 100% of my x-risk-oriented donation budget this year, and I wish I had more to give.
There’s a failure mode I described in my post ‘“The Era of Experience” has an unsolved technical alignment problem’:
I see many problems, but here’s the most central one: If we have a 100-dimensional parametrized space of possible reward functions for the primary RL system, and every single one of those possible reward functions leads to bad and dangerous AI behavior (as I argued in the previous subsection), then … how does this help? It’s a 100-dimensional snake pit! I don’t care if there’s a flexible and sophisticated system for dynamically choosing reward functions within that snake pit! It can be the most sophisticated system in the world! We’re still screwed, because every option is bad!
Basically, I think we need more theoretical progress to find a parametrized space of possible reward functions, where at least some of the reward functions in the space lead to good AGIs that we should want to have around.
I agree that the ideal reward function may have adjustable parameters whose ideal settings are very difficult to predict without trial-and-error. For example, humans vary in how strong their different innate drives are, and pretty much all of those “parameter settings” lead to people getting really messed up psychologically if they’re on one extreme or the opposite extreme. And I wouldn’t know where to start in guessing exactly, quantitatively, where the happy medium is, except via empirical data.
So it would be very good to think carefully about test or optimization protocols for that part. (And that’s itself a terrifyingly hard problem, because there will inevitably be distribution shifts between the test environment and the real world. E.g., an AI could feel compassion towards other AIs but indifference towards humans.) We need to think about that, and we need the theoretical progress.
Thanks. I feel like I want to treat “reward function design” and “AGI motivation design” as more different than you do, and I think your examples above are more about the latter. The reward function is highly relevant to the motivation, but they’re still different.
For example, “reward function design” calls for executable code, whereas “AGI motivation design” usually calls for natural-language descriptions. Or when math is involved, the math in practice usually glosses over tricky ontology identification stuff, like figuring out which latent variables in a potentially learned-from-scratch (randomly-initialized) world model correspond to a human, or a shutdown switch, or a human’s desires, or whatever.
I guess you’re saying that if you have a great “AGI motivation design” plan, and you have somehow operationalized this plan perfectly and completely in terms of executable code, then you can set that exact thing as the reward function, and hope that there’s no inner misalignment / goal misgeneralization. But that latter part is still tricky. …And also, if you’ve operationalized the motivation perfectly, why even have a reward function at all? Shouldn’t you just delete the part of your AI code that does reinforcement learning, and put the already-perfect motivation into the model-based planner or whatever?
Again I acknowledge that “reward function design” and “AGI motivation design” are not wholly unrelated. And that maybe I should read Rubi’s posts more carefully, thanks. Sorry if I’m misunderstanding what you’re saying.
Yeah, I think there’s something to that; see my discussion of run-and-tumble in §6.5.3.