Steven Byrnes

I'm an AGI safety / AI alignment researcher in Boston with a particular focus on brain algorithms. Research Fellow at Astera. See https://sjbyrnes.com/agi.html for a summary of my research and sorted list of writing. Physicist by training. Email: steven.byrnes@gmail.com. Leave me anonymous feedback here. I’m also at: RSS feed, X/Twitter, Bluesky, Substack, LinkedIn, and more at my website.

Sequences: Intuitive Self-Models, Valence, Intro to Brain-Like-AGI Safety

Comments (sorted by newest)

Research Agenda: Synthesizing Standalone World-Models

I genuinely appreciate the sanity-check and the vote of confidence here!

Uhh, well, technically I wrote that sentence as a conditional, and technically I didn’t say whether or not the condition applied to you-in-particular.

…I hope you have good judgment! For that matter, I hope I myself have good judgment!! Hard to know though. ¯\_(ツ)_/¯

Some Biology Related Things I Found Interesting

I noticed that peeing is rewarding? What the hell?! How did enough of my (human) non-ancestors die because peeing wasn't rewarding enough? The answer is they weren't homo sapiens or hominids at all.

I would split it into two questions:

  • (1) what’s the evolutionary benefit of peeing promptly?
  • (2) In general, if it’s evolutionarily beneficial to do X, why does the brain implement desire-to-X in the form of both “carrots” and “sticks”, as opposed to just one or just the other? Needing to pee is unpleasant (stick) AND peeing is then pleasant (carrot). Being hungry is unpleasant (stick) AND eating is then pleasant (carrot). Etc.

I do think there’s a generic answer to (2) in terms of learning algorithms etc., but no need to get into the details here.
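(For concreteness only, here is one toy way a “stick” term and a “carrot” term could coexist in a single reward signal. This is purely an illustrative sketch, not a claim about how the brain actually implements it; the weights and the need dynamics below are arbitrary made-up choices.)

```python
# Toy homeostatic loop (purely illustrative; all numbers are arbitrary).
# The reward has a "stick" term (ongoing penalty while the need is unmet)
# and a "carrot" term (one-time bonus when an action relieves the need).
STICK_WEIGHT = 0.1
CARROT_WEIGHT = 1.0

def step(need: float, relieve: bool) -> tuple[float, float]:
    """Advance one timestep; return (new_need, reward)."""
    new_need = 0.0 if relieve else need + 1.0
    reward = -STICK_WEIGHT * new_need + CARROT_WEIGHT * max(need - new_need, 0.0)
    return new_need, reward

# An RL agent trained on this reward is pushed toward relieving the need both
# by escaping the accumulating penalty and by collecting the relief bonus.
```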

As for (1), you’re wasting energy by carrying around extra weight of urine. Maybe there are other factors too. (Eventually of course you risk incontinence or injury or even death.) Yes I think it’s totally possible that our hominin ancestors had extra counterfactual children by wasting 0.1% less energy or whatever. Energy is important, and every little bit helps.

There are about ~100-200 different neurotransmitters our brains use. I was surprised to find out that I could not find a single neurotransmitter that is not shared between humans and mice (let me know if you can find one, though).

Like you said, truly new neurotransmitters are rare. For example, oxytocin and vasopressin split off from a common ancestor in a gene duplication event 500Mya, and the ancestral form has homologues in octopuses and insects etc. OTOH, even if mice and humans have homologous neurotransmitters, they presumably differ by at least a few mutations; they’re not exactly the same. (Separately, their functional effects are sometimes quite different! For example, eating induces oxytocin release in rodents but vasopressin release in humans.)

Anyway, looking into recent evolutionary changes to neurotransmitters (and especially neuropeptides) is an interesting idea (thanks!). I found this paper comparing endocrine systems of humans and chimps. It claims (among other things) that GNRH2 and UCN2 are protein-coding genes in humans but inactive (“pseudogenes”) in chimps. If true, what does that imply? Beats me. It does not seem to have any straightforward interpretation that I can see. Oh well.

Four ways learning Econ makes people dumber re: future AI

Thanks for the advice. I have now added at least the basic template, for the benefit of readers who don’t already have it memorized. I will leave it to the reader to imagine the curves moving around—I don’t want to add too much length and busy-ness.

Four ways learning Econ makes people dumber re: future AI

Manipulating the physical world is a very different problem from invention, and current LLM-based architectures are not suited for this. … Friction, all the consequence of a lack of knowledge about the problem; friction, all the million little challenges that need to be overcome; friction, that which is smoothed over the second and third and fourth times something is done. Friction, that which is inevitably associated with the physical world. Friction--that which only humans can handle.

This OP is about “AGI”, as defined in my 3rd & 4th paragraphs as follows:

By “AGI” I mean here “a bundle of chips, algorithms, electricity, and/or teleoperated robots that can autonomously do the kinds of stuff that ambitious human adults can do—founding and running new companies, R&D, learning new skills, using arbitrary teleoperated robots after very little practice, etc.”

Yes I know, this does not exist yet! (Despite hype to the contrary.) Try asking an LLM to autonomously write a business plan, found a company, then run and grow it for years as CEO. Lol! It will crash and burn! But that’s a limitation of today’s LLMs, not of “all AI forever”. AI that could nail that task, and much more beyond, is obviously possible—human brains and bodies and societies are not powered by some magical sorcery forever beyond the reach of science. I for one expect such AI in my lifetime, for better or worse. (Probably “worse”, see below.)

So…

  • “The kinds of stuff that ambitious human adults can do” includes handling what you call “friction”, so “AGI” as defined above would be able to do that too.
  • “The kinds of stuff that ambitious human adults can do” includes manipulating the physical world, so “AGI” as defined above would be able to do that too. (As a more concrete example, adult humans, after just a few hours’ practice, can get all sorts of things done in the physical world using even quite inexpensive makeshift teleoperated robots, therefore AGI would be able to do that too.)
  • I am >99% confident that “AGI” as defined above is physically possible, and will be invented eventually.
  • I am like 90% confident that it will be invented in my lifetime.
  • This post is agnostic on the question of whether such AGI will or won’t have anything to do with “current LLM-based architectures”. I’m not sure why you brought that up. But since you asked, I think it won’t; I think it will be a different, yet-to-be-developed, AI paradigm.

As for the rest of your comment, I find it rather confusing, but maybe that’s downstream of what I wrote here.

You’re probably overestimating how well you understand Dunning-Kruger

Spencer Greenberg (@spencerg) & Belen Cobeta at ClearerThinking.org have a more thorough and well-researched discussion at: Study Report: Is the Dunning-Kruger Effect real? (Also, their slightly-shorter blog post summary.)

This OP would mostly correspond to what ClearerThinking calls “noisy test of skill”. But ClearerThinking also goes through various other statistical artifacts impacting Dunning-Kruger studies, plus some of their own data analysis. Here’s (part of) their upshot:

The simulations above are remarkable because they show that when researchers are careful to avoid "fake" Dunning-Kruger effects, the real patterns that emerge in Dunning-Kruger studies, can typically be reproduced with just two assumptions:

  1. Closer-To-The-Average Effect: people predict their skill levels to be closer to the mean skill level than they really are. This could be rational (when people simply have limited evidence about their true skill level), or irrational (if people still do this strongly when they have lots of evidence about their skill, then they are not adjusting their predictions enough based on that evidence).
  2. Better-Than-Average Effect: on average, people tend to irrationally predict they are above average at skills. While this does not happen on every skill, it is known to happen for a wide range of skills. This bias is not the same thing as the Dunning-Kruger effect, but it shows up in Dunning-Kruger plots.
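For readers who want to see how far those two assumptions go, here’s a quick toy simulation in the same spirit (my own sketch, not ClearerThinking’s actual code; the shrinkage factor, upward bias, and noise levels are arbitrary made-up numbers):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# True skill, expressed as a percentile (uniform by construction).
true_pct = rng.uniform(0, 100, n)

# Measured test score = true skill + measurement noise ("noisy test of skill"),
# then re-ranked into percentiles.
noisy_score = true_pct + rng.normal(0, 15, n)
score_pct = 100 * np.argsort(np.argsort(noisy_score)) / (n - 1)

# Self-estimated percentile:
#   - shrunk toward 50   (Closer-To-The-Average Effect)
#   - shifted upward     (Better-Than-Average Effect)
#   - plus reporting noise
SHRINK = 0.4   # arbitrary
BIAS = 8.0     # arbitrary, in percentile points
self_pct = np.clip(
    50 + SHRINK * (true_pct - 50) + BIAS + rng.normal(0, 10, n), 0, 100)

# Classic Dunning-Kruger figure: group by measured-score quartile and compare
# average actual percentile vs. average self-estimated percentile.
quartile = np.digitize(score_pct, [25, 50, 75])
for q in range(4):
    m = quartile == q
    print(f"Q{q + 1}: actual {score_pct[m].mean():5.1f}, "
          f"self-estimate {self_pct[m].mean():5.1f}")
```

The bottom quartile overestimates itself and the top quartile underestimates itself (the familiar “crossing lines” plot) without any assumption that the unskilled are specially blind to their own incompetence.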
Perils of under- vs over-sculpting AGI desires

Belated thanks!

I would prefer a being with the morality of Claude Opus to rule the world rather than a randomly selected human … it's really unclear how good humans are at generalizing at true out-of-distribution moralities. Today's morality likely looks pretty bad from the ancient Egyptian perspective…

Hmm, I think maybe there’s something I was missing related to what you’re saying here, and that maybe I’ve been thinking about §8.2.1 kinda wrong. I’ve been mulling it over for a few days already, and might write some follow-up. Thanks.

Perhaps a difference in opinion is that it's really unclear to me that an AGI wouldn't do much the same thing of "thinking about it more, repeatedly querying their 'ground truth' social instincts" that humans do. Arguably models like Claude Opus already do this where it clearly can do detailed reasoning about somewhat out-of-distribution scenarios using moral intuitions that come from somewhere…

I think LLMs as we know them today and use them today are basically fine, and that this fine-ness comes first and foremost from imitation-learning on human data (see my Foom & Doom post §2.3). I think some of my causes for concern are that, by the time we get to ASI…

(1) Most importantly, I personally expect a paradigm shift after which true imitation-learning on human data won’t be involved at all, just as it isn’t in humans (Foom & Doom §2.3.2) … but I’ll put that aside for this comment;

(2) even if imitation-learning (a.k.a. pretraining) remains part of the process, I expect RL to be a bigger and bigger influence over time, which will make human-imitation relatively less of an influence on the ultimate behavior (Foom & Doom §2.3.5);

(3) I kinda expect the eventual AIs to be kinda more, umm, aggressive and incorrigible and determined and rule-bending in general, since that’s the only way to make AIs that get things done autonomously in a hostile world where adversaries are trying to jailbreak or otherwise manipulate them, and since that’s the end-point of competition.

Perhaps a crux of differences in opinion between us is that I think that much more 'alignment relevant' morality is not created entirely by innate human social instincts but is instead learnt by our predictive world models based on external data -- i.e. 'culture'.…

(You might already agree with all this:)

Bit of a nitpick, but I agree that absorbing culture is a “predictive world model” thing in LLMs, but I don’t think that’s true in humans, at least in a certain technical sense. I think we humans absorb culture because our innate drives make us want to absorb culture, i.e. it happens ultimately via RL. Or at least, we want to absorb some culture in some circumstances, e.g. we particularly absorb the habits and preferences of people we regard as high-status. I have written about this at “Heritability: Five Battles” §2.5.1, and “Valence & Liking / Admiring” §4.5.

See here for some of my thoughts on cultural evolution in general.

I agree that “game-theoretic equilibria” are relevant to why human cultures are how they are right now, and they might also be helpful in a post-AGI future if (at least some of) the AGIs intrinsically care about humans, but wouldn’t lead to AGIs caring about humans if they don’t already.

I think “profoundly unnatural” is somewhat overstating the disconnect between “EA-style compassion” and “human social instincts”. I would say something more like: we have a bunch of moral intuitions (derived from social instincts) that push us in a bunch of directions. Every human movement / ideology / meme draws from one or more forces that we find innately intuitively motivating: compassion, justice, spite, righteous indignation, power-over-others, satisfaction-of-curiosity, etc.

So EA is drawing from a real innate force of human nature (compassion, mostly). Likewise, xenophobia is drawing from a real innate force of human nature, and so on. Where we wind up at the end of the day is a complicated question, and perhaps underdetermined. (And it also depends on an individual’s personality.) But it’s not a coincidence that there is no EA-style group advocating for things that have no connection to our moral intuitions / human nature whatsoever, like whether the number of leaves on a tree is even vs odd.

We don't have to conjure up thought experiments about aliens outside of our light cone. Throughout most of history humans have been completely uncompassionate about suffering existing literally right in front of their faces…

Just to clarify, the context of that thought experiment in the OP was basically: “It’s fascinating that human compassion exists at all, because human compassion has surprising and puzzling properties from an RL algorithms perspective.”

Obviously I agree that callous indifference also exists among humans. But from an RL algorithms perspective, there is nothing interesting or puzzling about callous indifference. Callous indifference is the default. For example, I have callous indifference about whether trees have even vs odd numbers of leaves, and a zillion other things like that.

Research Agenda: Synthesizing Standalone World-Models

I guess the main blockers I see are:

  • I think you need to build in agency in order to get a good world-model (or at least, a better-than-LLM world model).
    • There are effectively infinitely many things about the world that one could figure out. If I cared about wrinkly shirts, I could figure out vastly more than any human has ever known about wrinkly shirts. I could find mathematical theorems in the patterns of wrinkles. I could theorize and/or run experiments on whether the wrinkliness of a wool shirt relates to the sheep’s diet. Etc.
    • …Or if we’re talking about e.g. possible inventions that don’t exist yet, then the combinatorial explosion of possibilities gets even worse.
    • I think the only solution is: an agent that cares about something or wants something, and then that wanting / caring creates value-of-information which in turn guides what to think about / pay attention to / study. (The standard formula for this is sketched just after this list.)
  • What’s the pivotal act?
    • Depending on what you have in mind here, the previous bullet point might be inapplicable or different, and I might or might not have other complaints too.
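Re: the value-of-information point in the first bullet above: the quantity I have in mind is just the standard one from decision theory (nothing specific to this agenda). For a candidate investigation that would reveal some variable x, given a utility function U and available actions a:

```latex
\[
\mathrm{VOI}(x) \;=\; \mathbb{E}_{x}\Big[\max_{a}\ \mathbb{E}\big[U \mid a, x\big]\Big] \;-\; \max_{a}\ \mathbb{E}\big[U \mid a\big]
\]
```

If U is indifferent to everything the investigation could reveal, the two terms coincide and the VOI is zero; so absent caring / wanting, nothing picks out which of the infinitely many wrinkly-shirt facts are worth the compute.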

You can DM or email me if you want to discuss but not publicly :)

It’s funny that I’m always begging people to stop trying to reverse-engineer the neocortex, and you’re working on something that (if successful) would end up somewhere pretty similar to that, IIUC. (But hmm, I guess if a paranoid doom-pilled person was trying to reverse-engineer the neocortex, and keep the results super-secret unless they had a great theory for how sharing them would help with safe & beneficial AGI, and if they in fact had good judgment on that topic, then I guess I’d be grudgingly OK with that.)

This is a review of the reviews

I was pushing back on a similar attitude yesterday on twitter → LINK.

Basically, I’m in favor of people having nitpicky high-decoupling discussion on lesswrong, and meanwhile doing rah rah activism action PR stuff on twitter and bluesky and facebook and intelligence.org and pauseai.info and op-eds and basically the entire rest of the internet and world. Just one website of carve-out. I don’t think this is asking too much!

Foom & Doom 2: Technical alignment is hard

Maybe study logical decision theory?

Eliezer has always been quite clear that you should one-box for Newcomb’s problem because then you’ll wind up with more money. The starting point for the whole discussion is a consequentialist preference—you have desires about the state of the world after the decision is over.
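(For concreteness, here is the textbook payoff arithmetic, with the usual $1,000,000 / $1,000 boxes and a predictor that is right with probability p, computing the expectations by conditioning on one’s own choice, which is what the one-boxing argument does:)

```latex
\[
\mathbb{E}[\text{one-box}] \;=\; p \cdot \$1{,}000{,}000,
\qquad
\mathbb{E}[\text{two-box}] \;=\; p \cdot \$1{,}000 \;+\; (1-p)\cdot \$1{,}001{,}000
\]
```

So one-boxing comes out ahead whenever p > 0.5005, i.e. for any predictor meaningfully better than chance.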

You have desires, and then decision theory tells you how to act so as to bring those desires about. The desires might be entirely about the state of the world in the future, or they might not be. Doesn’t matter. Regardless, whatever your desires are, you should use good decision theory to make decisions that will lead to your desires getting fulfilled.

Thus, decision theory is unrelated to our conversation here. I expect that Eliezer would agree.

To me it seems a bit surprising that you say we agree on the object level, when in my view you're totally guilty of my 2.b.i point above of not specifying the tradeoff / not giving a clear specification of how decisions are actually made.

Your 2.a is saying “Steve didn’t write down a concrete non-farfuturepumping utility function, and maybe if he tried he would get stuck”, and yeah I already agreed with that.

Your 2.b is saying “Why can't you have a utility function but also other preferences?”, but that’s a very strange question to me, because why wouldn’t you just roll those “other preferences” into the utility function as you describe the agent? Ditto with 2.c, why even bring that up? Why not just roll that into the agent’s utility function? Everything can always be rolled into the utility function. Utility functions don’t imply anything about behavior, and they don’t imply reflective consistency, etc., it’s all vacuous formalizing unless you put assumptions / constraints on the utility function.
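(To spell out why “everything can always be rolled into the utility function”, here is the standard construction; nothing original here.) Given any policy π* whatsoever, define a utility function over complete action-observation histories h by:

```latex
\[
U(h) \;=\;
\begin{cases}
1, & \text{if at every step of the history } h \text{ the action taken is the one } \pi^{*} \text{ prescribes},\\
0, & \text{otherwise.}
\end{cases}
\]
```

Then the expected utility of π* is 1, which is at least as high as that of every alternative policy, so π* is “expected-utility-maximizing” no matter how silly or incoherent it looks. The notion only bites once you constrain U, e.g. require it to depend only on the state of the world in the distant future.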

Foom & Doom 2: Technical alignment is hard

My read of this conversation is that we’re basically on the same page about what’s true but disagree about whether Eliezer is also on that same page too. Again, I don’t care. I already deleted the claim about what Eliezer thinks on this topic, and have been careful not to repeat it elsewhere.

Since we’re talking about it, my strong guess is that Eliezer would ace any question about utility functions, what their domain is, and when “utility-maximizing behavior” is vacuous… if asked directly.

But it’s perfectly possible to “know” something when asked directly, but also to fail to fully grok the consequences of that thing and incorporate it into some other part of one’s worldview. God knows I’m guilty of that, many many times over!

Thus my low-confidence guess is that Eliezer is guilty of that too, in that the observation “utility-maximizing behavior per se is vacuous” (which I strongly expect he would agree with if asked directly) has not been fully reconciled with his larger thinking on the nature of the AI x-risk problem.

(I would further add that, if Eliezer has fully & deeply incorporated “utility-maximizing behavior per se is vacuous” into every other aspect of his thinking, then he is bad at communicating that fact to others, in the sense that a number of his devoted readers wound up with the wrong impression on this point.)

Anyway, I feel like your comment is some mix of “You’re unfairly maligning Eliezer” (again, whatever, I have stopped making those claims) and “You’re wrong that this supposed mistake that you attribute to Eliezer is a path through which we can solve the alignment problem, and Eliezer doesn’t emphasize it because it’s an unimportant dead-end technicality” (maybe! I don’t claim to have a solution to the alignment problem right now; perhaps over time I will keep trying and failing and wind up with a better appreciation of the nature of the blockers).

Most of your comment is stuff I already agree with (except that I would use the term “desires” in most places that you wrote “utility function”, i.e. where we’re talking about “how AI cognition will look like”).

I don’t follow what you think Eliezer means by “consequentialism”. I’m open-minded to “farfuturepumping”, but only if you convince me that “consequentialism” is actually misleading. I don’t endorse coining new terms when an existing term is already spot-on.

Posts

  • Optical rectennas are not a promising clean energy technology (21d)
  • Neuroscience of human sexual attraction triggers (3 hypotheses) (1mo)
  • Four ways learning Econ makes people dumber re: future AI (3d)
  • Inscrutability was always inevitable, right? (2mo)
  • Perils of under- vs over-sculpting AGI desires (2mo)
  • Interview with Steven Byrnes on Brain-like AGI, Foom & Doom, and Solving Technical Alignment (2mo)
  • Teaching kids to swim (2mo)
  • “Behaviorist” RL reward functions lead to scheming (2mo)
  • Foom & Doom 2: Technical alignment is hard (3mo)
  • Foom & Doom 1: “Brain in a box in a basement” (3mo)

Wikitag Contributions

  • Wanting vs Liking (2 years ago)
  • Waluigi Effect (2 years ago)