I'm an AGI safety / AI alignment researcher in Boston with a particular focus on brain algorithms. Research Fellow at Astera. See https://sjbyrnes.com/agi.html for a summary of my research and sorted list of writing. Physicist by training. Email: steven.byrnes@gmail.com. Leave me anonymous feedback here. I’m also at: RSS feed, X/Twitter, Bluesky, Substack, LinkedIn, and more at my website.
Yeah from my perspective EAG is a place where a lot of people interested in technical alignment go, to talk to other people interested in technical alignment, about technical alignment stuff.
Meanwhile there are other things happening at EAG too, but you can ignore them. You don’t have to attend the talks, you don’t have to talk to anyone you don’t want to talk to. And it’s not terribly expensive, and the location is (often) down the street from you (OP, John).
I wonder whether you’re thinking harder about countersignaling than about what would be object-level good things to do?
Then, open up Forbes’ list of N richest people, and count how many of them got on that list by climbing the management hierarchy at a big company.
I predict that, to within reasonable approximation, the answer will be zero. Nobody gets on Forbes’ list of richest people by climbing the hierarchy at a big company. They get on that list by founding a company, inheriting, or both.
I didn’t check this either, but it reminds me of a fun fact: if you look at the CEOs of large companies, the CEO-founders are roughly population-average height, while the people promoted up to CEO are towering monstrosities. Copying from my post Neuroscience of human sexual attraction triggers:
…I’m not sure if anyone has done a rigorous systematic study to back that up, but some examples (from here, the author claims not to have cherry-picked) are: John S. Watson (promoted up to CEO of Chevron): 6’4” = 193 cm; Tim Cook (promoted up to CEO of Apple): 6’3” = 190 cm; Jeffrey Immelt (promoted up to CEO of General Electric): 6’4” = 193 cm; Mark Zuckerberg (founded Facebook): 5’9” = 175 cm; Larry Page (co-founded Google): 5’11” = 180 cm; Sergey Brin (co-founded Google): 5’8” = 173 cm; Jack Dorsey (co-founded Twitter): 5’11” = 180 cm; Richard Branson (founded Virgin): 5’11” = 180 cm; Elon Musk (quasi-founded Tesla, PayPal, SpaceX): 5’11” = 180 cm; Warren Buffett (quasi-founded Berkshire Hathaway): 5’10” = 178 cm.
due to Eliezer mentioning in a recent interview:
- Only cognitively boosted humans have a chance at aligning AI
- Cooling the brain through methods like water cooling is one of our best chances to boost human intelligence
Eliezer Yudkowsky said that cooling the brain through methods like water cooling is one of our best chances to boost human intelligence? I am skeptical. Can you try to find that interview?
Sorry if I’m misunderstanding, but this post seems ignorant of Newton’s law of cooling. If the brain is 1° warmer than the blood, then it should cool about twice as fast (in °C/minute) as if it’s 0.5° warmer than the blood, right? So you shouldn’t have tables listing “cooling rate” measured in °C/minute, but rather something like “cooling half-life” (measured in minutes) or “cooling decay rate” (measured in minutes⁻¹) or things like that. (You get the proportionality coefficient by dividing a cooling rate (°C/minute) by the corresponding temperature difference (°C), and the °C cancels out.)
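Spelling that out (a minimal sketch of Newton’s law of cooling; the symbols $T_b$ for blood temperature, $T_0$ for initial brain temperature, and $k$ for the proportionality coefficient are my notation, not from the post):

$$\frac{dT}{dt} = -k\,\bigl(T - T_b\bigr) \;\;\Rightarrow\;\; T(t) - T_b = (T_0 - T_b)\,e^{-kt}, \qquad t_{1/2} = \frac{\ln 2}{k}.$$

The instantaneous cooling rate $k\,(T - T_b)$ (°C/minute) keeps shrinking as the brain approaches blood temperature, whereas $k$ (minutes⁻¹) and the half-life $t_{1/2}$ (minutes) stay fixed, which is why those are the meaningful numbers to tabulate.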
I think a lot of claims in this post are dubious on account of that error.
Thanks for good pushback! Thinking about it more, I want to propose a 2-step model, where first, there are social dynamics arising from non-social causes, and then second, those social dynamics themselves become part of the environment in which they operate. (I’ve suggested similar 2-step models in Social status part 1 vs part 2, or Theory of Laughter §4.2.4.)
Step 1: social dynamics from non-social causes: Things can seem good or bad for lots of non-social reasons. Let’s say, Alice prefers the taste of pizza, her sister Beth prefers sushi, and their parents have to pick just one.
Here, Beth’s preference for sushi is directly making Alice’s life worse—Beth’s advocacy is increasing the chance that Alice will have a less-pleasant night.
Thanks to [some mechanism that I haven’t worked through in detail], if Beth is directly making Alice’s life worse, Alice’s brain moves Beth away from “friend” and towards “enemy” in terms of the “innate friend (+) vs enemy (–) parameter” in Alice’s brain. In the limit, Alice will start relating to Beth with visible anger, reflective of “schadenfreude reward” and “provocation reward”, as opposed to “sympathy reward” or “approval reward”. And punishing Beth naturally comes out of that.
Step 2: those social dynamics themselves become part of the terrain: Everyone looks around and notices the following:
There’s a class of behaviors like “Alice is treating Beth as an enemy”, including anger and schadenfreude and provocation and punishment, and this behavior is reliably correlated with “Beth is doing something that Alice sees as bad”.
We all get used to that pattern, and use it as a signal to draw inferences about people.
OK, new scene. Carol deeply admires Doris. And Doris really likes X for whatever reason (where X = honesty, loyalty, baggy jeans, who knows). If Carol does X, then Carol is creating an association in Doris’s mind between herself and X. And this seems good to Carol—it makes her feel Approval Reward.
But also, if Ella is doing not-X, and Carol gets angry at Ella (starts treating her as an enemy, which includes schadenfreude and provocation and thus punishment), then Carol is slotting herself into that very common social pattern I described above. So if Doris sees this behavior, Doris’s mind will naturally infer that Carol is very pro-X. And Carol in turn fully expects Doris to make this inference. And this seems good to Carol—it makes her feel Approval Reward. So in sum, from Carol’s perspective, the idea of getting angry at Ella seems good.
Then the last step is motivated reasoning etc., by which Carol might wield attention-control to actually summon up anger towards Ella in her own mind. But maybe that last step is optional? I think Carol may sometimes feel like the right thing to do is to do the kinds of things that she would do if she were angry at Ella, even if she doesn’t really feel much actual anger towards Ella in the moment.
I think the friend/enemy axis probably works more like a scalar coefficient
I agree! My term “friend (+) vs enemy (–) parameter” was trying to convey a model where there’s a scalar that can be positive for friend or negative for enemy.
But I think there’s also a “phase change” when you cross zero, such that the downstream consequences of friend vs enemy can be qualitatively different.
My guess is that “friend” and “enemy” are represented by two cell groups that are mutually-inhibitory, so they can’t both be very active at once. And they have two different sets of downstream consequences. I’m suggesting this by analogy to how hunger works—part of that system is (1) a group of AgRP/NPY neurons that are active when you’re hungry, and (2) a nearby group of α-MSH neurons that are active when you’re full. Each can be more or less active, but also each inhibits the other. So taking both together, you can have continuous variation between hungry and full, but you also get qualitatively different downstream effects of hunger vs fullness (e.g. the AgRP/NPY neurons increase pain-tolerance, the α-MSH neurons increase sex drive).
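If it helps, here’s a throwaway toy simulation of that kind of mutual inhibition (the populations, parameters, and scalar “drive” input are all invented for illustration; this is not a claim about the actual circuit):

```python
def simulate(drive, steps=2000, dt=0.01, w_inhib=2.0, tau=0.1):
    """Toy firing-rate model of two mutually-inhibitory populations.

    `drive` is a scalar input: positive values excite the "friend"
    population, negative values excite the "enemy" population.
    All numbers here are made up for illustration.
    """
    friend, enemy = 0.0, 0.0
    for _ in range(steps):
        # Each population is excited by its own input and suppressed by the other.
        d_friend = (-friend + max(+drive - w_inhib * enemy, 0.0)) / tau
        d_enemy  = (-enemy  + max(-drive - w_inhib * friend, 0.0)) / tau
        friend += dt * d_friend
        enemy  += dt * d_enemy
    return friend, enemy

for drive in (-1.0, -0.1, 0.0, 0.1, 1.0):
    friend, enemy = simulate(drive)
    print(f"drive {drive:+.1f}:  friend-group activity {friend:.2f},  enemy-group activity {enemy:.2f}")
```

In this toy, the winning side’s activity varies continuously with the strength of the input, but the two sides are never substantially active together, so anything wired downstream of the “friend” group vs. the “enemy” group flips qualitatively as the input crosses zero. That’s the “phase change” I have in mind.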
I’m skeptical of “frenemy”; instead I would propose that there are people who, if you think about them in one way (paying attention to something about them), then they feel like a friend, and if you think about them in a different way, then they feel like an enemy, and you can switch from one to the other in rapid-fire succession, but not simultaneously. What do you think?
I was thinking that a stranger would be ≈0 on the friend-enemy “axis”, until you find a way to judge them. :)
This is all necessarily a bit speculative, because the alleged “friend (+) vs enemy (–) parameter” neuron groups are (to my knowledge) not yet known to science—probably they’re two of the hundreds of little neuron groups in the hypothalamus that nobody has studied yet, especially not in humans (which might or might not be the same as rodents in this respect).
current AIs need to understand at a very early stage what human concepts like “helpfulness,” “harmlessness,” and “honesty” mean. And while it is of course possible to know what these concepts mean without being motivated by them (cf “the genie knows but doesn’t care”), the presence of this level of human-like conceptual understanding at such an early stage of development makes it more likely that these human-like concepts end up structuring AI motivations as well. … AIs will plausibly have concepts like “helpfulness,” “harmlessness,” “honesty” much earlier in the process that leads to their final form … [emphasis added]
I want to nitpick this particular point (I think the other arguments you bring up in that section are stronger).
For example, LLaMa 3.1 405B was trained on 15.6 trillion tokens of text data (≈ what a human could get through in 20,000 years of 24/7 reading). I’m not an ML training expert, but intuitively I’m skeptical that this is the kind of regime where we need to be thinking about what is hard versus easy to learn, or about what can be learned quickly versus slowly.
Instead, my guess is that, if [latent model A] is much easier and faster to learn than [latent model B], but [B] gives a slightly lower predictive loss than [A], then 15.6 trillion tokens of pretraining would be WAY more than enough for the model-in-training to initially learn [A] but then switch over to [B].
I think your question is kinda too vague to answer. (You’re asking for a comparison of two AI architectures, but what are they? I need more detail. Are we assuming that the two options are equally powerful & competent? If so, is that a good assumption? And is that power level “kinda like LLMs of today”, “superintelligence”, or something in between?)
…But maybe see my post Foom & Doom §2.9.1: If brain-like AGI is so dangerous, shouldn’t we just try to make AGIs via LLMs? for some possibly-related discussion.
I agree that “every thought we think, we’re thinking it because it’s higher-reward than other thoughts we might be thinking instead” is a great starting point.
---
I kinda disagree with your emphasis on childhood. See my post Heritability, Behaviorism, and Within-Lifetime RL, where I (dismissively) called that school of thought “RL learn-then-get-stuck”. Of course, “RL learn-then-get-stuck” is true for a few things, like regional accents, but I think those are the exception not the rule. (See also §2 of “Heritability: Five Battles”.)
---
I think you’re right about the person-to-person variation along a bunch of axes, but the way I think about it is generally at a lower level than the kind of “traits” you list. I think there are dozens of innate drives / innate reactions (some more important than others) in the hypothalamus & brainstem, and their relative strengths differ, and most of the things you list are mostly emergent consequences of the drive / reaction strengths vector. (Note that the map from the vector to behaviors is often nonlinear, and also depends on the options and consequences available in an environment / culture.)
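To make that concrete, here’s a cartoon sketch (the specific drive names, numbers, and the way the “culture” knob enters are all invented for illustration, not a real model):

```python
import math

# Hypothetical innate drive strengths for two hypothetical people
# (relative units; the drive names are placeholders).
alice = {"drive_to_interact_with_people": 1.4, "approval_reward_strength": 0.6}
bob   = {"drive_to_interact_with_people": 0.6, "approval_reward_strength": 1.3}

def trait_people_vs_things(drives, culture_people_emphasis=1.0):
    """A surface-level "trait" as a nonlinear, environment-dependent function
    of the underlying drive-strength vector (made-up formula)."""
    x = drives["drive_to_interact_with_people"] * culture_people_emphasis
    return 1.0 / (1.0 + math.exp(-3.0 * (x - 1.0)))  # 0 = "things", 1 = "people"

for name, drives in [("Alice", alice), ("Bob", bob)]:
    for culture in (0.5, 1.0, 1.5):
        t = trait_people_vs_things(drives, culture_people_emphasis=culture)
        print(f"{name}, culture emphasis {culture}: people-vs-things trait = {t:.2f}")
```

The point of the cartoon is just that the observable “trait” is a squashed, context-dependent readout of the underlying drive-strength vector, so the same person can land in different places depending on the environment, and small differences in drive strength can produce large differences in behavior near the steep part of the curve.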
Going through some examples from your list:
“Do you focus on ‘things’ vs ‘people’” is, I think, related to an “innate drive to think about and interact with other people” that I briefly discuss in §5 here.
“Are words about reality or are words just rallying cries for your team?” is downstream of that, along with many other things, like how strongly one feels Approval Reward, which in turn depends on a bunch of things, including how easily nearby people trigger an involuntary orienting reaction in you.
“How much emphasis do you place on wordless felt gut feelings?” is probably partly a matter of those gut feelings coming along with stronger involuntary attention and (the interoceptive equivalent of) orienting reactions in some people than in others, making those feelings more or less salient versus easily-ignorable. (Presumably there are other contributing factors too.)
Etc. etc. I don’t have great theories for everything, just trying to give a hint of how I think about those kinds of things, in case it matters (and probably it doesn’t matter for your points here).