Steven Byrnes

I'm an AGI safety / AI alignment researcher in Boston with a particular focus on brain algorithms. Research Fellow at Astera. See https://sjbyrnes.com/agi.html for a summary of my research and sorted list of writing. Physicist by training. Email: steven.byrnes@gmail.com. Leave me anonymous feedback here. I’m also at: RSS feed, X/Twitter, Bluesky, Substack, LinkedIn, and more at my website.

Comments

Steve Byrnes’s Shortform · 6y
The IABIED statement is not literally true
Steven Byrnes · 3h

My take is that IABIED has basically a three-part disjunctive argument:

  • (A) There’s no alignment plan that even works on paper.
  • (B) Even if there were such a plan, people would fail to get it to work on the first critical try, even if they’re being careful. (Just as people can have a plan for a rocket engine that works on paper, and do simulations and small-scale tests and component tests etc., but it will still almost definitely blow up the first time in history that somebody does a full-scale test.)
  • (C) People will not be careful, but rather race, skip over tests and analysis that they could and should be doing, and do something pretty close to whatever yields the most powerful system for the least money in the least time.

I think your post is mostly addressing disjunct (A), except that step (6) has a story for disjunct (B). My mental model of Eliezer & Nate would say: first of all, even if you were correct about this being a solution to (A) & (B), everyone will still die because of (C). Second of all, your plan does not in fact solve (A). I think Eliezer & Nate would disagree most strongly with your step (3); see their answer to “Aren’t developers regularly making their AIs nice and safe and obedient?”. Third of all, your plan does not solve (B) either, because one human-level system is quite different from a civilization of them in lots of ways, and lots of new things can go wrong, e.g. the civilization might create a different and much more powerful egregiously misaligned ASI, just as actual humans seem likely to do right now. (See also other comments on (B).)

↑ That was my mental model of Eliezer & Nate. FWIW, my own take is: I agree with them on (B) & (C). And as for (A), I think LLMs won’t scale to AGI, and the different paradigm that will scale to AGI is even worse for step (3), i.e. existing concrete plans will lead to egregious misalignment.

[Intuitive self-models] 1. Preliminaries
Steven Byrnes · 2d

Thanks! My perspective for this kind of thing is: if there’s some phenomenon in psychology or neuroscience, I’m not usually in the situation where there are multiple incompatible hypotheses that would plausibly explain that phenomenon, and we’d like to know which of them is true. Rather, I’m usually in the situation where I have zero hypotheses that would plausibly explain the phenomenon, and I’m trying to get up to at least one.

There are so many constraints from what I (think I) know about neuroscience, and so many constraints from what I (think I) know about algorithms, and so many constraints from what I (think I) know about everyday life, that coming up with any hypothesis at all that can’t be easily refuted from an armchair is a huge challenge. And generally when I find even one such hypothesis, I wind up in the long term ultimately feeling like it’s almost definitely true, at least in the big picture. (Sometimes there are fine details that can’t be pinned down without further experiments.)

It’s interesting that my outlook here is so different from other people in science, who often (not always) feel like the default should be to have multiple hypotheses from the get-go for any given phenomenon. Why the difference? Part of it might be the kinds of questions that I’m interested in. But part of it, as above, is that I have lots of very strong opinions about the brain, which are super constraining and thus rule out almost everything. I think this is much more true for me than almost anyone else in neuroscience, including professionals. (Here’s one of many example constraints that I demand all my hypotheses satisfy.)

So anyway, the first goal is to get up to even one nuts-and-bolts hypothesis, which would explain the phenomenon, and which is specific enough that I can break it all the way down to algorithmic pseudocode, and then even further to how that pseudocode is implemented by the cortical microstructure and thalamus loops and so on, and that also isn’t immediately ruled out by what we already know from our armchairs and the existing literature. So that’s what I’m trying to do here, and it’s especially great when readers point out that nope, my hypothesis is in fact already in contradiction to known psychology or neuroscience, or to their everyday experience. And then I go back to the drawing board. :)

[Intuitive self-models] 3. The Homunculus
Steven Byrnes · 5d

You’re replying to Linda’s comment, which was mainly referring to a paragraph that I deleted shortly after posting this a year ago. The current relevant text (as in my other comment) is:

As above, the homunculus is definitionally the thing that carries “vitalistic force”, and that does the “wanting”, and that does any acts that we describe as “acts of free will”. Beyond that, I don’t have strong opinions. Is the homunculus the same as the whole “self”, or is the homunculus only one part of a broader “self”? No opinion. Different people probably conceptualize themselves rather differently anyway.

To me, this seems like something everyone should be able to relate to, apart from the PNSE thing in Post 6. For example, if your intuitions include the idea of willpower, then I think your intuitions have to also include some, umm, noun, that is exercising that willpower.

But you find it weird and unrelatable? Or was it a different part of the post that left you feeling puzzled when you read it? (If so, maybe I can reword that part.) Thanks.

I wasn't confused by Thermodynamics
Steven Byrnes · 7d

Cool! Oops, I probably just skimmed too fast and incorrectly pattern-matched you to other people I’ve talked to about this topic in the past.  :-P

I wasn't confused by Thermodynamics
Steven Byrnes · 7d

It’s true that thermodynamics was historically invented before statistical mechanics, and if you find stat-mech-free presentations of thermodynamics to be pedagogically helpful, then cool, whatever works for you. But at the same time, I hope we can agree that the stat-mech level is the actual truth of what’s going on, and that the laws of thermodynamics are not axioms but rather derivable from the fundamental physical laws of the universe (particle physics etc.) via statistical mechanics. If you find the probabilistic definition of entropy and temperature etc. to be unintuitive in the context of steam engines, then I’m sorry but you’re not done learning thermodynamics yet, you still have work ahead of you. You can’t just call it a day because you have an intuitive feel for stat-mech and also separately have an intuitive feel for thermodynamics; you’re not done until those two bundles of intuitions are deeply unified and interlinking. [Or maybe you’re already there and I’m misreading this post? If so, sorry & congrats :) ]
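
For reference, the “probabilistic definition of entropy and temperature” being alluded to is, in its most minimal textbook form, the Boltzmann entropy of a macrostate together with temperature as the energy-derivative of entropy:

```latex
% Standard statistical-mechanics definitions (minimal textbook form):
% S: entropy of a macrostate with \Omega compatible microstates, k_B: Boltzmann constant,
% T: temperature, E: energy, with volume V and particle number N held fixed.
S = k_B \ln \Omega, \qquad \frac{1}{T} = \left( \frac{\partial S}{\partial E} \right)_{V,N}
```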

Buck's Shortform
Steven Byrnes · 9d

Hmm, my usage seems more like: “I think that…” means the reader/listener might disagree with me, because maybe I’m wrong and the reader is right. (Or maybe it’s subjective.) Meanwhile, “I claim that…” also means the reader might disagree with me, but if they do, it’s only because I haven’t explained myself (yet), and the reader will sooner or later come to see that I’m totally right. So “I think” really is pretty centrally about confidence levels. I think :)

Tomás B.'s Shortform
Steven Byrnes · 9d

To me, this doesn’t sound related to “anxiety” per se; instead, it sounds like you react very strongly to negative situations (especially negative social situations) and thus go out of your way to avoid even a small chance of encountering such a situation. I’m definitely like that (to some extent). I sometimes call it “neuroticism”, although that term is not great either; it encompasses lots of different things, not all of which describe me.

Like, imagine there’s an Activity X (say, inviting an acquaintance to dinner), and it involves “carrots” (something can go well and that feels rewarding), and also “sticks” (something can go badly and that feels unpleasant). For some people, their psychological makeup is such that the sticks are always especially painful (they have sharp thorns, so to speak). Those people will (quite reasonably) choose not to partake in Activity X, even if most other people would, at least on the margin. This is very sensible; it’s just cost-benefit analysis. It needn’t have anything to do with “anxiety”. It can feel like “no thanks, I don’t like Activity X so I choose not to do it”.

(Sorry if I’m way off-base, you can tell me if this doesn’t resonate with your experience.)

(semi-related)

Hospitalization: A Review
Steven Byrnes · 10d

That was a scary but also fun read, thanks for sharing and glad you’re doing OK ❤️

Four ways learning Econ makes people dumber re: future AI
Steven Byrnes · 12d

(That drawing of the Dunning-Kruger Effect is a popular misconception—there was a post last week on that, see also here.)

I think there’s “if you have a hammer, everything looks like a nail” stuff going on. Economists spend a lot of time thinking about labor automation, so they often treat AGI as if it will be just another form of labor automation. LLM & CS people spend a lot of time thinking about the LLMs of 2025, so they often treat AGI as if it will be just like the LLMs of 2025. Military people spend a lot of time thinking about weapons, so they often treat AGI as if it will be just another weapon. Etc.

So yeah, this post happens to be targeted at economists, but that’s not because economists are uniquely blameworthy, or anything like that.

Generalization and the Multiple Stage Fallacy?
Answer by Steven Byrnes · Oct 07, 2025

The “multiple stage fallacy fallacy” is the fallacious idea that equations like

P(A&B&C&D)=P(A)×P(B|A)×P(C|A&B)×P(D|A&B&C)

are false, when in fact they are true. :-P

I think Nate here & Eliezer here are pointing to something real, but the problem is not multiple stages per se but rather (1) “treating stages as required when in fact they’re optional” and/or (2) “failing to properly condition on the conditions and as a result giving underconfident numbers”. For example, if A & B & C have all already come true in some possible universe, then that’s a universe where maybe you have learned something important and updated your beliefs, and you need to imagine yourself in that universe before you try to evaluate P(D|A&B&C).

Of course, that paragraph is just parroting what Eliezer & Nate wrote, if you read what they wrote. But I think other people on LW have too often skipped over the text and just latched onto the name “multiple stage fallacy” instead of drilling down to the actual mistake.
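
To make the conditioning point concrete, here is a minimal numeric sketch; the specific probabilities are hypothetical, chosen only to illustrate how an unconditioned estimate of the final stage deflates the product:

```python
# Hypothetical illustration: the chain rule P(A&B&C&D) = P(A)·P(B|A)·P(C|A&B)·P(D|A&B&C)
# is exact, but plugging in a "gut" estimate of P(D) that ignores the update from
# A, B, C having come true understates the joint probability.

p_a = 0.5            # P(A)
p_b_given_a = 0.6    # P(B | A)
p_c_given_ab = 0.7   # P(C | A & B)

p_d_given_abc = 0.9  # P(D | A & B & C), estimated while imagining a world where A, B, C all happened
p_d_gut = 0.3        # estimate that fails to condition on A, B, C

joint_correct = p_a * p_b_given_a * p_c_given_ab * p_d_given_abc
joint_underconfident = p_a * p_b_given_a * p_c_given_ab * p_d_gut

print(joint_correct)         # ≈ 0.189
print(joint_underconfident)  # ≈ 0.063, roughly 3x too low
```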

In the case at hand, I don’t have much opinion in the absence of more details about the AI training approach etc., but here are a couple of general comments.

If an AI development team notices Problem A and fixes it, and then notices Problem B and fixes it, and then notices Problem C and fixes it, we should expect that it’s less likely, not more likely, that this same team will preempt Problem D before Problem D actually occurs.

Conversely, if the team has a track record of preempting every problem before it arises (when the problems are low-stakes), then we can have incrementally more hope that they will also preempt high-stakes problems.

Likewise, if there simply are no low-stakes problems to preempt or respond to, because it’s a kind of system that just automatically by its nature has no problems in the first place, then we can feel generically incrementally better about there not being high-stakes problems.

Those comments are all generic, and readers are now free to argue with each other about how they apply to present and future AI.  :)

Sequences
  • Intuitive Self-Models
  • Valence
  • Intro to Brain-Like-AGI Safety

Wikitag Contributions
  • Wanting vs Liking · 2 years ago
  • Wanting vs Liking · 2 years ago (+139/-26)
  • Waluigi Effect · 2 years ago (+2087)

Posts
  • Excerpts from my neuroscience to-do list · 13d
  • Optical rectennas are not a promising clean energy technology · 1mo
  • Neuroscience of human sexual attraction triggers (3 hypotheses) · 2mo
  • Four ways learning Econ makes people dumber re: future AI · 19d
  • Inscrutability was always inevitable, right? · 2mo
  • Perils of under- vs over-sculpting AGI desires · 2mo
  • Interview with Steven Byrnes on Brain-like AGI, Foom & Doom, and Solving Technical Alignment · 2mo
  • Teaching kids to swim · 3mo
  • “Behaviorist” RL reward functions lead to scheming · 3mo
  • Foom & Doom 2: Technical alignment is hard · 4mo