I'm an AGI safety / AI alignment researcher in Boston with a particular focus on brain algorithms. Research Fellow at Astera. See https://sjbyrnes.com/agi.html for a summary of my research and sorted list of writing. Physicist by training. Email: steven.byrnes@gmail.com. Leave me anonymous feedback here. I’m also at: RSS feed, X/Twitter, Bluesky, Substack, LinkedIn, and more at my website.
(That drawing of the Dunning-Kruger Effect is a popular misconception—there was a post last week on that, see also here.)
I think there’s “if you have a hammer, everything looks like a nail” stuff going on. Economists spend a lot of time thinking about labor automation, so they often treat AGI as if it will be just another form of labor automation. LLM & CS people spend a lot of time thinking about the LLMs of 2025, so they often treat AGI as if it will be just like the LLMs of 2025. Military people spend a lot of time thinking about weapons, so they often treat AGI as if it will be just another weapon. Etc.
So yeah, this post happens to be targeted at economists, but that’s not because economists are uniquely blameworthy, or anything like that.
The “multiple stage fallacy fallacy” is the fallacious idea that equations like

$$P(A \wedge B \wedge C \wedge D) = P(A) \cdot P(B \mid A) \cdot P(C \mid A \wedge B) \cdot P(D \mid A \wedge B \wedge C)$$

are false, when in fact they are true. :-P
I think Nate here & Eliezer here are pointing to something real, but the problem is not multiple stages per se but rather (1) “treating stages as required when in fact they’re optional” and/or (2) “failing to properly condition on the conditions and as a result giving underconfident numbers”. For example, if A & B & C have all already come true in some possible universe, then that’s a universe where maybe you have learned something important and updated your beliefs, and you need to imagine yourself in that universe before you try to evaluate $P(D \mid A \wedge B \wedge C)$.
Of course, that paragraph is just parroting what Eliezer & Nate wrote, if you read what they wrote. But I think other people on LW have too often skipped over the text and just latched onto the name “multiple stages fallacy” instead of drilling down to the actual mistake.
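To make (2) concrete, here’s a toy Monte Carlo sketch (all numbers and the correlation structure are made up, purely for illustration): when the stages share a common cause, multiplying marginal, unconditioned probabilities understates the probability of the whole conjunction, whereas the properly-conditioned chain rule is exact by construction.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1_000_000

# Made-up toy world: a shared latent factor makes stages A, B, C, D positively correlated.
favorable = rng.uniform(0, 1, n) > 0.5

def stage(p_if_favorable, p_otherwise):
    p = np.where(favorable, p_if_favorable, p_otherwise)
    return rng.uniform(0, 1, n) < p

A = stage(0.9, 0.3)
B = stage(0.8, 0.2)
C = stage(0.8, 0.2)
D = stage(0.7, 0.1)

# Direct estimate of the joint probability
p_joint = (A & B & C & D).mean()

# Chain rule with proper conditioning -- telescopes to exactly the joint estimate
p_chain = (A.mean()
           * (A & B).mean() / A.mean()                     # P(B | A)
           * (A & B & C).mean() / (A & B).mean()           # P(C | A & B)
           * (A & B & C & D).mean() / (A & B & C).mean())  # P(D | A & B & C)

# The failure mode: multiply marginal (unconditioned) probabilities for each stage
p_naive = A.mean() * B.mean() * C.mean() * D.mean()

print(f"Direct estimate of P(A∧B∧C∧D):    {p_joint:.3f}")
print(f"Chain rule, properly conditioned: {p_chain:.3f}")
print(f"Naive product of marginals:       {p_naive:.3f}  <- underconfident")
```

With these made-up numbers the naive product comes out around 0.06 versus a true joint probability around 0.20, because each stage is much more likely in the worlds where the earlier stages already came true.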
In the case at hand, I don’t have much of an opinion in the absence of more details about the AI training approach etc., but here are a couple of general comments.
If an AI development team notices Problem A and fixes it, and then notices Problem B and fixes it, and then notices Problem C and fixes it, we should expect that it’s less likely, not more likely, that this same team will preempt Problem D before Problem D actually occurs.
Conversely, if the team has a track record of preempting every problem before it arises (when the problems are low-stakes), then we can have incrementally more hope that they will also preempt high-stakes problems.
Likewise, if there simply are no low-stakes problems to preempt or respond to, because it’s a kind of system that just automatically by its nature has no problems in the first place, then we can feel generically incrementally better about there not being high-stakes problems.
Those comments are all generic, and readers are now free to argue with each other about how they apply to present and future AI. :)
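As a toy quantitative version of those comments (the model and numbers are made up, purely to illustrate the direction of the update): treat the team’s propensity to preempt problems as an unknown rate with a flat prior, and update it on how the low-stakes problems were handled.

```python
# Beta-Binomial toy model (made-up): theta = probability the team preempts a given
# problem before it bites. Flat Beta(1,1) prior; update on n low-stakes problems,
# k of which were preempted (rather than fixed only after they occurred).
def p_preempt_next(k, n, prior_a=1, prior_b=1):
    """Posterior predictive probability of preempting the next problem."""
    return (prior_a + k) / (prior_a + prior_b + n)

# Team that fixed Problems A, B, C only after each one occurred (0 preemptions out of 3):
print(p_preempt_next(k=0, n=3))   # 0.2 -- less hope than the 0.5 prior

# Team that preempted all three problems before they arose:
print(p_preempt_next(k=3, n=3))   # 0.8 -- incrementally more hope
```

This obviously ignores the ways high-stakes problems can differ qualitatively from low-stakes ones; it’s only meant to show why “fixed it after it occurred” and “preempted it” push the estimate in opposite directions.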
I genuinely appreciate the sanity-check and the vote of confidence here!
Uhh, well, technically I wrote that sentence as a conditional, and technically I didn’t say whether or not the condition applied to you-in-particular.
…I hope you have good judgment! For that matter, I hope I myself have good judgment!! Hard to know though. ¯\_(ツ)_/¯
I noticed that peeing is rewarding? What the hell?! How did enough of my (human) non-ancestors die because peeing wasn't rewarding enough? The answer is they weren't homo sapiens or hominids at all.
I would split it into two questions: (1) why would it increase evolutionary fitness at all to pee promptly rather than hold it in? And (2) how does that kind of fitness consideration actually get turned into a reward signal in the brain?
I do think there’s a generic answer to (2) in terms of learning algorithms etc., but no need to get into the details here.
As for (1), you’re wasting energy by carrying around extra weight of urine. Maybe there are other factors too. (Eventually of course you risk incontinence or injury or even death.) Yes I think it’s totally possible that our hominin ancestors had extra counterfactual children by wasting 0.1% less energy or whatever. Energy is important, and every little bit helps.
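Rough back-of-envelope on the energy point (round, made-up numbers, and assuming that the metabolic cost of getting around scales very roughly with total carried mass):

```python
# Rough back-of-envelope; all numbers are made-up round figures.
body_mass_kg = 60.0
extra_urine_kg = 0.3          # a moderately full bladder

# Crude assumption: locomotion energy scales ~linearly with total carried mass.
extra_fraction = extra_urine_kg / body_mass_kg
print(f"Extra locomotion energy from carrying it around: ~{extra_fraction:.1%}")
# -> ~0.5%
```

Averaged over the day (the bladder isn’t always full, and not all energy goes into locomotion), that plausibly lands in the 0.1%-ish ballpark.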
There are about ~100-200 different neurotransmitters our brains use. I was surprised to find out that I could not find a single neurotransmitter that is not shared between humans and mice (let me know if you can find one, though).
Like you said, truly new neurotransmitters are rare. For example, oxytocin and vasopressin split off from a common ancestor in a gene duplication event 500Mya, and the ancestral form has homologues in octopuses and insects etc. OTOH, even if mice and humans have homologous neurotransmitters, they presumably differ by at least a few mutations; they’re not exactly the same. (Separately, their functional effects are sometimes quite different! For example, eating induces oxytocin release in rodents but vasopressin release in humans.)
Anyway, looking into recent evolutionary changes to neurotransmitters (and especially neuropeptides) is an interesting idea (thanks!). I found this paper comparing endocrine systems of humans and chimps. It claims (among other things) that GNRH2 and UCN2 are protein-coding genes in humans but inactive (“pseudogenes”) in chimps. If true, what does that imply? Beats me. It does not seem to have any straightforward interpretation that I can see. Oh well.
Thanks for the advice. I have now added at least the basic template, for the benefit of readers who don’t already have it memorized. I will leave it to the reader to imagine the curves moving around—I don’t want to add too much length and busy-ness.
Manipulating the physical world is a very different problem from invention, and current LLM-based architectures are not suited for this. … Friction, all the consequence of a lack of knowledge about the problem; friction, all the million little challenges that need to be overcome; friction, that which is smoothed over the second and third and fourth times something done. Friction, that which is inevitably associated with the physical world. Friction--that which only humans can handle.
This OP is about “AGI”, as defined in my 3rd & 4th paragraph as follows:
By “AGI” I mean here “a bundle of chips, algorithms, electricity, and/or teleoperated robots that can autonomously do the kinds of stuff that ambitious human adults can do—founding and running new companies, R&D, learning new skills, using arbitrary teleoperated robots after very little practice, etc.”
Yes I know, this does not exist yet! (Despite hype to the contrary.) Try asking an LLM to autonomously write a business plan, found a company, then run and grow it for years as CEO. Lol! It will crash and burn! But that’s a limitation of today’s LLMs, not of “all AI forever”. AI that could nail that task, and much more beyond, is obviously possible—human brains and bodies and societies are not powered by some magical sorcery forever beyond the reach of science. I for one expect such AI in my lifetime, for better or worse. (Probably “worse”, see below.)
So…
As for the rest of your comment, I find it rather confusing, but maybe that’s downstream of what I wrote here.
Spencer Greenberg (@spencerg) & Belen Cobeta at ClearerThinking.org have a more thorough and well-researched discussion at: Study Report: Is the Dunning-Kruger Effect real? (Also, their slightly-shorter blog post summary.)
This OP would mostly correspond to what ClearerThinking calls “noisy test of skill”. But ClearerThinking also goes through various other statistical artifacts impacting Dunning-Kruger studies, plus some of their own data analysis. Here’s (part of) their upshot:
The simulations above are remarkable because they show that when researchers are careful to avoid "fake" Dunning-Kruger effects, the real patterns that emerge in Dunning-Kruger studies, can typically be reproduced with just two assumptions:
- Closer-To-The-Average Effect: people predict their skill levels to be closer to the mean skill level than they really are. This could be rational (when people simply have limited evidence about their true skill level), or irrational (if people still do this strongly when they have lots of evidence about their skill, then they are not adjusting their predictions enough based on that evidence).
- Better-Than-Average Effect: on average, people tend to irrationally predict they are above average at skills. While this does not happen on every skill, it is known to happen for a wide range of skills. This bias is not the same thing as the Dunning-Kruger effect, but it shows up in Dunning-Kruger plots.
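Here’s a minimal sketch of the kind of simulation they’re describing (the parameters are my own made-up round numbers, not theirs): shrink self-estimates toward the mean, add a constant better-than-average bias, and the classic quartile plot falls out.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# True skill as a percentile, uniform on [0, 100]
true_skill = rng.uniform(0, 100, n)

# Assumption 1 ("closer-to-the-average"): self-estimates are shrunk toward the mean of 50.
shrink = 0.3                      # made-up shrinkage factor
# Assumption 2 ("better-than-average"): a constant upward bias, in percentile points.
bias = 10                         # made-up
noise = rng.normal(0, 10, n)      # plus some reporting noise

self_estimate = np.clip(50 + shrink * (true_skill - 50) + bias + noise, 0, 100)

# Classic Dunning-Kruger plot: mean self-estimate within each quartile of true skill
quartile = np.digitize(true_skill, [25, 50, 75])
for q in range(4):
    m = quartile == q
    print(f"Quartile {q + 1}: actual ≈ {true_skill[m].mean():4.1f}, "
          f"self-estimate ≈ {self_estimate[m].mean():4.1f}")
```

The bottom quartile dramatically “overestimates” and the top quartile mildly “underestimates”, even though nothing in the model makes low-skill people specially bad at self-assessment.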
Belated thanks!
I would prefer a being with the morality of Claude Opus to rule the world rather than a randomly selected human … it's really unclear how good humans are at generalizing at true out-of-distribution moralities. Today's morality likely looks pretty bad from the ancient Egyptian perspective…
Hmm, I think maybe there’s something I was missing related to what you’re saying here, and that maybe I’ve been thinking about §8.2.1 kinda wrong. I’ve been mulling it over for a few days already, and might write some follow-up. Thanks.
Perhaps a difference in opinion is that it's really unclear to me that an AGI wouldn't do much the same thing of "thinking about it more, repeatedly querying their 'ground truth' social instincts" that humans do. Arguably models like Claude Opus already do this where it clearly can do detailed reasoning about somewhat out-of-distribution scenarios using moral intuitions that come from somewhere…
I think LLMs as we know them today and use them today are basically fine, and that this fine-ness comes first and foremost from imitation-learning on human data (see my Foom & Doom post §2.3). I think some of my causes for concern are that, by the time we get to ASI…
(1) Most importantly, I personally expect a paradigm shift after which true imitation-learning on human data won’t be involved at all, just as it isn’t in humans (Foom & Doom §2.3.2) … but I’ll put that aside for this comment;
(2) even if imitation-learning (a.k.a. pretraining) remains part of the process, I expect RL to be a bigger and bigger influence over time, which will make human-imitation relatively less of an influence on the ultimate behavior (Foom & Doom §2.3.5);
(3) I kinda expect the eventual AIs to be kinda more, umm, aggressive and incorrigible and determined and rule-bending in general, since that’s the only way to make AIs that get things done autonomously in a hostile world where adversaries are trying to jailbreak or otherwise manipulate them, and since that’s the end-point of competition.
Perhaps a crux of differences in opinion between us is that I think that much more 'alignment relevant' morality is not created entirely by innate human social instincts but is instead learnt by our predictive world models based on external data -- i.e. 'culture'.…
(You might already agree with all this:)
Bit of a nitpick: I agree that absorbing culture is a “predictive world model” thing in LLMs, but I don’t think that’s true in humans, at least in a certain technical sense. I think we humans absorb culture because our innate drives make us want to absorb culture, i.e. it happens ultimately via RL. Or at least, we want to absorb some culture in some circumstances, e.g. we particularly absorb the habits and preferences of people we regard as high-status. I have written about this at “Heritability: Five Battles” §2.5.1, and “Valence & Liking / Admiring” §4.5.
See here for some of my thoughts on cultural evolution in general.
I agree that “game-theoretic equilibria” are relevant to why human cultures are how they are right now, and they might also be helpful in a post-AGI future if (at least some of) the AGIs intrinsically care about humans, but wouldn’t lead to AGIs caring about humans if they don’t already.
I think “profoundly unnatural” is somewhat overstating the disconnect between “EA-style compassion” and “human social instincts”. I would say something more like: we have a bunch of moral intuitions (derived from social instincts) that push us in a bunch of directions. Every human movement / ideology / meme draws from one or more forces that we find innately intuitively motivating: compassion, justice, spite, righteous indignation, power-over-others, satisfaction-of-curiosity, etc.
So EA is drawing from a real innate force of human nature (compassion, mostly). Likewise, xenophobia is drawing from a real innate force of human nature, and so on. Where we wind up at the end of the day is a complicated question, and perhaps underdetermined. (And it also depends on an individual’s personality.) But it’s not a coincidence that there is no EA-style group advocating for things that have no connection to our moral intuitions / human nature whatsoever, like whether the number of leaves on a tree is even vs odd.
We don't have to conjure up thought experiments about aliens outside of our light cone. Throughout most of history humans have been completely uncompassionate about suffering existing literally right in front of their faces…
Just to clarify, the context of that thought experiment in the OP was basically: “It’s fascinating that human compassion exists at all, because human compassion has surprising and puzzling properties from an RL algorithms perspective.”
Obviously I agree that callous indifference also exists among humans. But from an RL algorithms perspective, there is nothing interesting or puzzling about callous indifference. Callous indifference is the default. For example, I have callous indifference about whether trees have even vs odd numbers of leaves, and a zillion other things like that.
I guess the main blockers I see are:
You can DM or email me if you want to discuss but not publicly :)
It’s funny that I’m always begging people to stop trying to reverse-engineer the neocortex, and you’re working on something that (if successful) would end up somewhere pretty similar to that, IIUC. (But hmm, I guess if a paranoid doom-pilled person was trying to reverse-engineer the neocortex, and keep the results super-secret unless they had a great theory for how sharing them would help with safe & beneficial AGI, and if they in fact had good judgment on that topic, then I guess I’d be grudgingly OK with that.)
That was a scary but also fun read, thanks for sharing and glad you’re doing OK ❤️