I'm an AGI safety / AI alignment researcher in Boston with a particular focus on brain algorithms. Research Fellow at Astera. See https://sjbyrnes.com/agi.html for a summary of my research and sorted list of writing. Physicist by training. Email: steven.byrnes@gmail.com. Leave me anonymous feedback here. I’m also at: RSS feed, X/Twitter, Bluesky, Substack, LinkedIn, and more at my website.
It’s definitely possible to “get stuck” not doing X because you’ve never tried X in your life, so you don’t know what you’re missing. Sometimes it can take people many years before they try X. And sometimes they just never do, even though they would totally “take to it” if they did.
I feel like you want to say that the never-trying-X failure mode is “the rule” and I want to say that it’s “the exception” … But if so, that might not be a real disagreement, and instead we’re just thinking about different kinds of X.
I definitely agree that it can be a thing, and brought it up in Heritability: Five Battles multiple times. My examples included X = “living in Churubusco, Indiana” (§2.2.2), or X = “becoming a Soil Conservation Technician” (§2.3), or X = “joining a niche online community like rationalism” (§2.2.3).
(Or sorry if I’m still missing your point.)
I'm very curious about which things fall into the "then-get-stuck" bucket and why. Are you confident there isn't a range of lower-level drives and innate reactions that get fixed similar to accents?
I think the way that rewards (pleasure or displeasure) turn into desires (motivation) is one thing that is definitely continuous learning, not “get stuck”. If you’ve done something over and over in childhood, and then you do it today and it’s 100% miserable and embarrassing, and then you try it again the next day and it’s again 100% miserable and embarrassing, then you’re probably not going to try it a 3rd time, and certainly not a 10th time.
We see this when people get clinical depression as adults. They lose motivation to do all the things they used to like, no matter how much they liked it in childhood, or even more recently than that.
I think personality surveys mostly catch stuff like that. Even introverts go out sometimes, and even extroverts stay in sometimes. If they find it pleasant, they’ll do it more; if they find it unpleasant, they’ll do it less. So a decision-to-socialize keeps getting steered towards the innate ground truth reward function. Likewise the decision to think about things versus people, to be honest or not, and ≈every other question on personality surveys. (Empirically, adult personality is approximately independent of childhood upbringing, in the population at large.)
Childhood regional accents are not “rewards turning into desires”, but rather something different: I think it involves the learning rate of a certain part of the cortex dropping towards zero after childhood. I don’t think “idiosyncratic cognition” is in that category; my impression is that PFC learning rates remain high throughout life.
(You’re an unusual case RE your accent … hmm, random question, did you watch a lot of American TV / movies as a kid?)
Childhood phobias don’t always last into adulthood but sometimes they do. The trick is that it’s a case where the reward itself can change (I call it “upstream generalization”). So the rewards→desires pathway continues to produce updates, but that doesn’t help. Separately, the reward itself does keep updating in response to new data, but only in narrower circumstances, by and large. I can’t immediately think of other things besides phobias that “get stuck” in that way.
I agree that “every thought we think, we’re thinking it because it’s higher-reward than other thoughts we might be thinking instead” is a great starting point.
---
I kinda disagree with your emphasis on childhood. See my post Heritability, Behaviorism, and Within-Lifetime RL, where I (dismissively) called that school of thought “RL learn-then-get-stuck”. Of course, “RL learn-then-get-stuck” is true for a few things, like regional accents, but I think those are the exception not the rule. (See also §2 of “Heritability: Five Battles”.)
~~
I think you’re right about the person-to-person variation along a bunch of axes, but the way I think about it is generally at a lower level than the kind of “traits” you list. I think there are dozens of innate drives / innate reactions (some more important than others) in the hypothalamus & brainstem, and their relative strengths differ, and most of the things you list are mostly emergent consequences of the drive / reaction strengths vector. (Note that the map from the vector to behaviors is often nonlinear, and also depends on the options and consequences available in an environment / culture.)
Going through some examples from your list.
“Do you focus on ‘things’ vs ‘people’?” is, I think, related to an “innate drive to think about and interact with other people” that I briefly discuss in §5 here.
“Are words about reality or are words just rallying cries for your team?” is downstream of that along with many other things, like how strongly does one feel Approval Reward, which in turn depends on a bunch of things including how easily nearby people trigger an involuntary orienting reaction in you.
“How much emphasis do you place on wordless felt gut feelings?” is probably partly that those gut feelings come along with stronger involuntary attention and (the interoceptive equivalent of) orienting reactions in some people than others, making those feelings more or less salient versus easily-ignorable. (Presumably there are other contributing factors too.)
Etc. etc. I don’t have great theories for everything, just trying to give a hint of how I think about those kinds of things, in case it matters (and probably it doesn’t matter for your points here).
Yeah, from my perspective EAG is a place where a lot of people interested in technical alignment go, to talk to other people interested in technical alignment, about technical alignment stuff.
Meanwhile there are other things happening at EAG too, but you can ignore them. You don’t have to attend the talks, you don’t have to talk to anyone you don’t want to talk to. And it’s not terribly expensive, and the location is (often) down the street from you (OP, John).
I wonder whether you’re thinking harder about countersignaling than about what would be object-level good things to do?
Then, open up Forbes’ list of N richest people, and count how many of them got on that list by climbing the management hierarchy at a big company.
I predict that, to within reasonable approximation, the answer will be zero. Nobody gets on Forbes’ list of richest people by climbing the hierarchy at a big company. They get on that list by founding a company, inheriting, or both.
I didn’t check this either, but it reminds me of a fun fact that, if you look at the CEO of a large company, the CEO-founders are roughly population-average height, while the people promoted up to CEO are towering monstrosities. Copying from my post Neuroscience of human sexual attraction triggers:
…I’m not sure if anyone has done a rigorous systematic study to back that up, but some examples (from here, the author claims not to have cherry-picked) are: John S. Watson (promoted up to CEO of Chevron): 6’4” = 193 cm; Tim Cook (promoted up to CEO of Apple): 6’3” = 190 cm; Jeffrey Immelt (promoted up to CEO of General Electric): 6’4” = 193 cm; Mark Zuckerberg (founded Facebook): 5’9” = 175 cm; Larry Page (co-founded Google): 5’11” = 180 cm; Sergey Brin (co-founded Google): 5’8” = 173 cm; Jack Dorsey (co-founded Twitter): 5’11” = 180 cm; Richard Branson (founded Virgin): 5’11” = 180 cm; Elon Musk (quasi-founded Tesla, PayPal, SpaceX): 5’11” = 180 cm; Warren Buffett (quasi-founded Berkshire Hathaway): 5’10” = 178 cm.
due to Eliezer mentioning in a recent interview:
- Only cognitively boosted humans have a chance at aligning AI
- Cooling the brain through methods like water cooling is one of our best chances to boost human intelligence
Eliezer Yudkowsky said that cooling the brain through methods like water cooling is one of our best chances to boost human intelligence? I am skeptical. Can you try to find that interview?
Sorry if I’m misunderstanding, but this post seems ignorant of Newton’s law of cooling. If the brain is 1° warmer than the blood, then it should cool about twice as fast (in °C/minute) as if it’s 0.5° warmer than the blood, right? So you shouldn’t have tables listing “cooling rate” measured in °C/minute, but rather something like “cooling half-life” (measured in minutes) or “cooling decay rate” (measured in minutes⁻¹) or things like that. (To get the proportionality coefficient, you divide a cooling rate (°C/minute) by a temperature difference (°C), so the °C cancels out.)
I think a lot of claims in this post are dubious on account of that error.
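To spell out the Newton's-law-of-cooling point, here is a minimal sketch of the math. The notation is my own assumption, not from the post: ΔT is the brain-minus-blood temperature difference, ΔT₀ is its starting value, k is the decay constant, and the blood temperature is treated as constant.

```latex
% Assumed notation (not from the post): \Delta T = T_{brain} - T_{blood},
% \Delta T_0 = initial difference, k = decay constant, blood temperature held constant.
\begin{align*}
\frac{d(\Delta T)}{dt} &= -k\,\Delta T
  && \text{cooling rate is proportional to the temperature difference} \\
\Delta T(t) &= \Delta T_0\, e^{-kt}
  && \text{so the difference decays exponentially} \\
k &= \frac{\lvert d(\Delta T)/dt \rvert}{\Delta T}
  && \text{units: } \frac{^\circ\mathrm{C}/\mathrm{min}}{^\circ\mathrm{C}} = \mathrm{min}^{-1} \\
t_{1/2} &= \frac{\ln 2}{k}
  && \text{units: min}
\end{align*}
```

Under these assumptions, the cooling rate at ΔT = 1° is k × 1° and at ΔT = 0.5° is k × 0.5°, i.e. a factor of two apart, whereas k and the half-life do not depend on ΔT. That is why k or the half-life, rather than a °C/minute figure, would be the quantity to tabulate.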
Thanks for good pushback! Thinking about it more, I want to propose a 2-step model, where first, there are social dynamics arising from non-social causes, and then second, those social dynamics themselves become part of the environment in which they operate. (I’ve suggested similar 2-step models in Social status part 1 vs part 2, or Theory of Laughter §4.2.4.)
Step 1: social dynamics from non-social causes: Things can seem good or bad for lots of non-social reasons. Let’s say, Alice prefers the taste of pizza, her sister Beth prefers sushi, and their parents have to pick just one.
Here, Beth’s preference for sushi is directly making Alice’s life worse: Beth’s advocacy is increasing the chance that Alice will have a less-pleasant night.
Thanks to [some mechanism that I haven’t worked through in detail], if Beth is directly making Alice’s life worse, Alice’s brain moves Beth away from “friend” and towards “enemy” in terms of the “innate friend (+) vs enemy (–) parameter” in Alice’s brain. In the limit, Alice will start relating to Beth with visible anger, reflective of “schadenfreude reward” and “provocation reward”, as opposed to “sympathy reward” or “approval reward”. And punishing Beth naturally comes out of that.
Step 2: those social dynamics themselves become part of the terrain: Everyone looks around and notices the following:
There’s a class of behaviors like “Alice is treating Beth as an enemy”, including anger and schadenfreude and provocation and punishment, and this behavior is reliably correlated with “Beth is doing something that Alice sees as bad”.
We all get used to that pattern, and use it as a signal to draw inferences about people.
OK, new scene. Carol deeply admires Doris. And Doris really likes X for whatever reason (where X = honesty, loyalty, baggy jeans, who knows). If Carol does X, then Carol is creating an association in Doris’s mind between herself and X. And this seems good to Carol—it makes her feel Approval Reward.
But also, if Ella is doing not-X, and Carol gets angry at Ella (starts treating her as an enemy, which includes schadenfreude and provocation and thus punishment), then Carol is slotting herself into that very common social pattern I described above. So if Doris sees this behavior, Doris’s mind will naturally infer that Carol is very pro-X. And Carol in turn fully expects Doris to make this inference. And this seems good to Carol—it makes her feel Approval Reward. So in sum, from Carol’s perspective, the idea of getting angry at Ella seems good.
Then the last step is motivated reasoning etc., by which Carol might wield attention-control to actually summon up anger towards Ella in her own mind. But maybe that last step is optional? I think Carol may sometimes feel like the right thing to do is to do the kinds of things that she would do if she were angry at Ella, even if she doesn’t really feel much actual anger towards Ella in the moment.
…
Aren’t these two things contradictory? Sorry if I’m confused.