This might be a dumb question, but did you try anything like changing the prompt from:
…
After the problem, there will be filler tokens (counting from 1 to {N}) to give you extra space to process the problem before answering.…
to:
…
After the problem, there will be distractor tokens (counting from 1 to {N}) to give you extra space to forget the problem before answering.…
I’m asking because AFAICT the results can be explained by EITHER your hypothesis (the extra tokens allow more space / capacity for computation during a forward pass) OR an alternate hypothesis more like “the LLM interprets this as more of a situation where the correct answer is expected” or whatever, i.e. normal sensitivity of LLMs to details of their prompt.
(Not that I have anything against the first hypothesis! Just curious.)
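(In case it's useful, here's the sort of minimal A/B comparison I have in mind, as a Python sketch. Everything here is a placeholder: `query_model`, `check_answer`, and `PROBLEMS` are hypothetical stand-ins, not anything from the actual experiment.)

```python
# Hypothetical sketch of the control comparison suggested above.
# query_model / check_answer / PROBLEMS are placeholders, not the real setup.

FILLER = ("After the problem, there will be filler tokens (counting from 1 to {n}) "
          "to give you extra space to process the problem before answering.")
DISTRACTOR = ("After the problem, there will be distractor tokens (counting from 1 to {n}) "
              "to give you extra space to forget the problem before answering.")

def query_model(prompt: str) -> str:
    """Placeholder for whatever LLM API the experiment actually uses."""
    raise NotImplementedError

def check_answer(response: str, answer: str) -> bool:
    """Placeholder grading function."""
    raise NotImplementedError

PROBLEMS: list[tuple[str, str]] = []  # (problem_text, correct_answer) pairs

def accuracy(framing: str, n: int) -> float:
    correct = 0
    for problem, answer in PROBLEMS:
        counting = " ".join(str(i) for i in range(1, n + 1))
        prompt = f"{problem}\n\n{framing.format(n=n)}\n{counting}"
        correct += check_answer(query_model(prompt), answer)
    return correct / len(PROBLEMS)

# If the "extra computation per forward pass" hypothesis is doing the work,
# both framings should help about equally; if it's prompt-framing sensitivity,
# they should come apart.
```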
I have lots of disagreements with evolutionary psychology (as normally practiced and understood, see here), but actually I more-or-less agree with everything you said in that comment.
You mean, if I’m a guy in Pleistocene Africa, then why is it instrumentally useful for other people to have positive feelings about me? Yeah, basically what you said; I’m regularly interacting with these people, and if they have positive feelings about me, they’ll generally want me to be around, and to stick around, and also they’ll tend to buy into my decisions and plans, etc.
Also, Approval Reward leads to norm-following, which is probably also adaptive for me, because many of those social norms probably exist for good and non-obvious reasons, cf. Henrich.
this might not be easily learned by a behaviorist reward function
I’m not sure what the word “behaviorist” is doing there; I would just say: “This won’t happen quickly, and indeed might not happen at all, unless it’s directly in the reward function. If it’s present only indirectly (via means-end planning, or RL back-chaining, etc.), that’s not as effective.”
I think “the reward function is incentivizing (blah) directly versus indirectly” is (again) an orthogonal axis from “the reward function is behaviorist vs non-behaviorist”.
Is the claim that evolutionarily early versions of behavioral circuits had approximately the form...
Yeah, I think there’s something to that; see my discussion of run-and-tumble in §6.5.3.
It seems like the short-term predictor should learn to predict (based on context cues) the behavior triggered by the hardwired circuitry. But it should predict that behavior only 0.3 seconds early?
You might be missing the “static context” part (§5.2.1). The short-term predictor learns a function F : context → output. Suppose (for simplicity) the context is the exact same vector c₁ for 5 minutes, and then out of nowhere at time T, an override appears and says “Ground Truth Alert: F(c₁) was too low!” Then the learning algorithm will make F(c₁) higher for next time.
But the trick is, the output at T–(0.3 seconds) is determined by F(c₁), AND the output at T–(4 minutes) is determined by the exact same function F(c₁). So if the situation recurs a week later, F(c₁) will be higher not just 0.3 seconds before the override, but also 4 minutes before it. You can’t update one without the other, because it’s the same calculation.
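(Here’s a toy version of that point in code, in case it helps. This assumes an absurdly simplified predictor where F is literally a lookup table and the context is a hashable tuple; the specific context and learning rate are made up for illustration.)

```python
# Toy illustration of the "static context" point: F is just a lookup table
# from a (hashable) context to an output level. Not meant to be biologically literal.

LEARNING_RATE = 0.5

F = {}  # learned predictor: context -> predicted output level

def predict(context):
    return F.get(context, 0.0)

def learn_from_override(context, ground_truth):
    """Override fires: nudge F(context) toward the ground-truth signal."""
    F[context] = predict(context) + LEARNING_RATE * (ground_truth - predict(context))

c1 = ("sight of food", "smell of food")   # stays the same for the whole 5 minutes

# Week 1: output is low throughout, then the override arrives at time T.
print(predict(c1))            # 0.0 at T - 4 minutes
print(predict(c1))            # 0.0 at T - 0.3 seconds
learn_from_override(c1, 1.0)  # "Ground Truth Alert: F(c1) was too low!"

# Week 2: the same static context c1 recurs.
print(predict(c1))  # 0.5 -- higher at T - 4 minutes AND at T - 0.3 seconds,
                    # because both are the same calculation F(c1).
```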
Oh! Is the key point that there's a kind of resonance, where this system maintains the behavior of the genetically hardwired components?
I don’t think so. If I understand you correctly, the thing you’re describing here would be backwards-looking (“predicting” something that already happened), but what we want is forward-looking (how much digestive enzyme will be needed in 5 minutes?).
Is the idea that the lookahead propagates earlier and earlier with each cycle? You start with a 0.3 second prediction. But that means that supervisory signal (when in the "defer-to-predictor mode") is 0.3 seconds earlier, which means that the predictor learns to predict the change in output 0.6 seconds ahead…
The thing you’re pointing to is a well-known phenomenon in TD learning in the AI literature (and a limitation on learning efficiency). I think it can happen in humans and animals—I recall reading a paper that (supposedly) observed dopamine marching backwards in time with each repetition, which of course got the authors very excited—but if it happens at all, I think it’s rare, and humans and animals generally learn things with many fewer repetitions than it would take for the signal to walk backwards step by step.
Instead, I claim that the “static context” picture above is capturing an important dynamic even in the real world where the context is not literally static. Certain aspects of the context are static, and that’s good enough. See the sweater example in §5.3.2 for how (I claim) this works in detail.
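(For reference, the AI-literature phenomenon I mean is the textbook one sketched below: value creeping backwards one step per repetition in tabular TD(0). This is purely an illustration of the TD mechanism, not a model of any brain circuit, and the numbers are arbitrary.)

```python
# Toy tabular TD(0) on a chain s0 -> s1 -> ... -> s4 -> terminal, with a reward
# of 1 received only on the final transition. Exaggerated settings (alpha=1,
# gamma=1) so the backward creep is one clean step per episode.

N_STATES = 5
ALPHA, GAMMA = 1.0, 1.0

V = [0.0] * N_STATES

def run_episode():
    for s in range(N_STATES):
        reward = 1.0 if s == N_STATES - 1 else 0.0
        next_value = V[s + 1] if s + 1 < N_STATES else 0.0
        V[s] += ALPHA * (reward + GAMMA * next_value - V[s])  # TD(0) update

for episode in range(1, N_STATES + 1):
    run_episode()
    print(f"after episode {episode}: V = {V}")
# The prediction of the terminal reward creeps backwards by one state per
# episode -- it takes as many repetitions as there are steps in the chain.
```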
Drake equation wrong interpretation: The mean of multiplication is not multiplication of the mean
As I’ve previously written, I disagree that this constitutes a separate explanation. This paper is just saying that, as far as we know, one or more of the Drake equation parameters might be very much lower than Drake’s guess. But yeah duh, the whole point of this discourse is to figure out which parameter is very much lower and why. Pretty much all the other items on your list are engaged in that activity, so I think this box is an odd one out and should be deleted. (But if you’re trying to do a lit review without being opinionated, then I understand why you’d keep it in. I just like to rant about this.)
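(For anyone who hasn’t read the paper, its core move is, IIRC, a Monte Carlo along the lines of the toy sketch below. The ranges here are made up for illustration, not taken from the paper; the point is just that wide uncertainty over each factor puts most of the probability mass in worlds where at least one factor, and hence the product, is tiny—i.e., “one or more of the parameters might be very much lower than Drake’s guess”.)

```python
# Toy Monte Carlo in the spirit of the paper: 7 generic factors, made-up ranges.
# The median of the product is many orders of magnitude below its mean, and the
# product lands below 1 in almost every sampled world.
import random
import statistics

def sample_product(n_factors: int = 7, low_exp: float = -6, high_exp: float = 2) -> float:
    # each factor is log-uniform over [10**low_exp, 10**high_exp]
    return 10 ** sum(random.uniform(low_exp, high_exp) for _ in range(n_factors))

samples = [sample_product() for _ in range(100_000)]
print("mean of product:  ", statistics.fmean(samples))
print("median of product:", statistics.median(samples))
print("P(product < 1):   ", sum(s < 1 for s in samples) / len(samples))
```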
Multicellular life is difficult
In The Vital Question, Nick Lane argues (IMO plausibly) that the hard step is not multicellular life per se but rather eukaryotes (i.e. cellular life with at least two different genomes). Not all eukaryotes are multicellular, but once eukaryotes existed, they evolved multicellularity many times independently (if I recall correctly).
Space travel is very difficult for unknown reasons
AFAICT, “interstellar travel is impossible or extremely slow because there’s too much dust and crap in space that you’d collide with” remains a live possibility that doesn’t get enough attention around these parts.
immoral but very interesting experiment … not seeing any human face for multiple months, be it in person, on pictures or on your phone
There must be plenty of literature on the psychological effects of isolation, but I haven’t looked into it much. (My vague impression is: “it messes people up”.) I think I disagree that my theory makes a firm prediction, because who is to say that the representations will drift on a multiple-month timescale, as opposed to much slower? Indeed, the fact that adults are able to recall and understand memories from decades earlier implies that, after early childhood, pointers to semantic latent variables remain basically stable.
2. Try to disconnect your previous thoughts from arriving at “she feels pain”
I would describe this as: if it’s unpleasant to think about how my friend is suffering, then I can avoid those unpleasant feelings by simply not thinking about that, and thinking about something else instead.
For starters, there’s certainly a kernel of truth to that. E.g. see compassion fatigue, where people will burn out and quit jobs working with traumatized people. Or if someone said to me: “I stopped hanging out with Ahmed, he’s always miserable and complaining about stuff, and it was dragging me down too”, I would see that as a perfectly normal and common thing for someone to say and do. But you’re right that it doesn’t happen 100% of the time, and that this merits an explanation.
My own analysis is at: §4.1.1 and §4.1.2 of my (later) Sympathy Reward post. The most relevant-to-you part starts at: “From my perspective, the interesting puzzle is not explaining why this ignorance-is-bliss problem happens sometimes, but rather explaining why this ignorance-is-bliss problem happens less than 100% of the time. In other words, how is it that anyone ever does pay attention to a suffering friend? …”
So that’s my take. As for your take, I think one of my nitpicks would be that I think you’re giving the optimizer-y part of the brain a larger action space than it actually has. If I would get a higher reward by magically teleporting, I’m still not gonna do that, because I can’t. By the same token, if I would get a higher reward by no longer knowing some math concept that I’ve already learned, tough luck for me, that is not an available option in my action space. My world-model is built by predictive (a.k.a. self-supervised) learning, not by “whatever beliefs would lead to immediate higher reward”, and for good reason: the latter has pathological effects, as you point out. (I’ve written about it too, long ago, in Reward is Not Enough.) I do have actions that can impact beliefs, but only in an indirect and limited way—see my discussion of motivated reasoning (also linked in my other comment).
Thanks again for engaging :)
these associations need to be very exact, else humans would be reward-hacking all day: it's reasonable to assume that the activations of thinking "She's happy" are very similar to trying to convince oneself "She's happy" internally, even 'knowing' the truth. But if both resulted in big feelings of internal happiness, we would have a lot more psychopaths.
I don’t think things work that way. There are a lot of constraints on your thoughts. Copying from here:
1. Thought Generator generates a thought: The Thought Generator settles on a “thought”, out of the high-dimensional space of every thought you can possibly think at that moment. Note that this space of possibilities, while vast, is constrained by current sensory input, past sensory input, and everything else in your learned world-model. For example, if you’re sitting at a desk in Boston, it’s generally not possible for you to think that you’re scuba-diving off the coast of Madagascar. Likewise, it’s generally not possible for you to imagine a static spinning spherical octagon. But you can make a plan, or whistle a tune, or recall a memory, or reflect on the meaning of life, etc.
If I want to think that Sally is happy, but I know she’s not happy, I basically can’t, at least not directly. Indirectly, yeah sure, motivated reasoning obviously exists (I talk about how it works here), and people certainly do try to convince themselves that their friends are happy when they’re not, and sometimes (but not always) they are even successful.
I don’t think there’s (the right kind of) overlap between the thought “I wish to believe that Sally is happy” and the thought “Sally is happy”, but I can’t explain why I believe that, because it gets into gory details of brain algorithms that I don’t want to talk about publicly, sorry.
Emotions…feel like this weird, distinct thing such that any statement along the lines "I'm happy" does it injustice. Therefore I can't see it being carried over to "She's happy", their intersection wouldn't be robust enough such that it won't falsely trigger for actually unrelated things. That is, "She's happy" ≈ "I'm happy" ≉ experiencing happiness
I agree that emotional feelings are hard to articulate. But I don’t see how that’s relevant. Visual things are also hard to articulate, but we can learn a robust two-way association between [certain patterns in shapes and textures and motions] and [a certain specific kind of battery compartment that I’ve never tried to describe in English words]. By the same token, we can learn a robust two-way association between [certain interoceptive feelings] and [certain outward signs and contexts associated with those feelings]. And this association can get learned in one direction [interoceptive model → outward sign] from first-person experience, and later queried in the opposite direction [outward sign → interoceptive model] in a third-person context.
(Or sorry if I’m misunderstanding your point.)
what is their evolutionary advantage if they don't atleast offer some kind of subconscious effect on conspecifics?
Again, my answer is “none”. We do lots of things that don’t have any evolutionary advantage. What’s the evolutionary advantage of getting cancer? What’s the evolutionary advantage of slipping and falling? Nothing. They’re incidental side-effects of things that evolved for other reasons.
Part of it is the “vulnerability” where any one user can create arbitrary amounts of reacts, which I agree is cluttering and distracting. Limiting reacts per day seems reasonable (I don’t know if 1 is the right number, but it might be, I don’t recall ever react-ing more than once a day myself). Another option (more labor-intensive) would be for mods to check the statistics and talk to outliers (like @TristanTrim) who use way way more reacts than average.
It might be helpful to draw out the causal chain and talk about where on that chain the intervention is happening (or if applicable, where on that chain the situationally-aware AI motivation / planning system is targeted):
(Image copied from here; IIRC it was ultimately inspired by somebody (maybe leogao?)’s 2020-ish tweet that I couldn’t find.)
My diagram here doesn’t use the term “reward hacking”; and I think TurnTrout’s point is that that term is a bit weird, in that actual instances that people call “reward hacking” always involve interventions in the left half, but people discuss it as if it’s an intervention on the right half, or at least involving an “intention” to affect the reward signal all the way on the right. Or something like that. (Actually, I argue in this link that popular usage of “reward hacking” is even more incoherent than that!)
As for your specific example, do we say that the timer is a kind of input that goes into the reward function, or that the timer is inside the reward function itself? I vote for the former (i.e. it’s an input, akin to a camera).
(But I agree in principle that there are probably edge cases.)
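(Concretely, the two framings of your example might look like the sketch below—just an illustration of where the “reward function” box gets drawn, with a made-up `task_done` observation and a 60-second deadline.)

```python
# Two ways to carve up the same setup, per the paragraph above. Purely a
# framing illustration; the "timer" and observation fields are made up.

import time

# The framing I'm voting for: the timer is just another input channel, like a camera.
def reward_fn(observation: dict, timer_seconds: float) -> float:
    return 1.0 if observation.get("task_done") and timer_seconds < 60.0 else 0.0

# The alternative framing: the timer lives inside the reward function itself.
class RewardFnWithInternalTimer:
    def __init__(self):
        self.start = time.monotonic()

    def __call__(self, observation: dict) -> float:
        elapsed = time.monotonic() - self.start
        return 1.0 if observation.get("task_done") and elapsed < 60.0 else 0.0

# Either way the agent sees the same numbers; the difference is only in where
# we draw the box labeled "reward function" on the causal diagram.
```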