Steven Byrnes

I'm an AGI safety / AI alignment researcher in Boston with a particular focus on brain algorithms. Research Fellow at Astera. See https://sjbyrnes.com/agi.html for a summary of my research and sorted list of writing. Physicist by training. Email: steven.byrnes@gmail.com. Leave me anonymous feedback here. I’m also at: RSS feed, X/Twitter, Bluesky, Substack, LinkedIn, and more at my website.

Sequences: Intuitive Self-Models · Valence · Intro to Brain-Like-AGI Safety

Comments (sorted by newest)
Social drives 2: “Approval Reward”, from norm-enforcement to status-seeking
Steven Byrnes · 3d

Thanks for good pushback! Thinking about it more, I want to propose a 2-step model, where first, there are social dynamics arising from non-social causes, and then second, those social dynamics themselves become part of the environment in which they operate. (I’ve suggested similar 2-step models in Social status part 1 vs part 2, or Theory of Laughter §4.2.4.)

Step 1: social dynamics from non-social causes: Things can seem good or bad for lots of non-social reasons. Let’s say, Alice prefers the taste of pizza, her sister Beth prefers sushi, and their parents have to pick just one.

Here, Beth’s preference for sushi is directly making Alice’s life worse—Beth’s advocacy is increasing the chance that Alice will have a less-pleasant night.

Thanks to [some mechanism that I haven’t worked through in detail], if Beth is directly making Alice’s life worse, then Alice’s brain moves Beth away from “friend” and towards “enemy” on the “innate friend (+) vs enemy (–) parameter”. In the limit, Alice will start relating to Beth with visible anger, reflective of “schadenfreude reward” and “provocation reward”, as opposed to “sympathy reward” or “approval reward”. And punishing Beth naturally comes out of that.

Step 2: those social dynamics themselves become part of the terrain: Everyone looks around and notices the following:

There’s a class of behaviors like “Alice is treating Beth as an enemy”, including anger and schadenfreude and provocation and punishment, and this behavior is reliably correlated with “Beth is doing something that Alice sees as bad”.

We all get used to that pattern, and use it as a signal to draw inferences about people.

OK, new scene. Carol deeply admires Doris. And Doris really likes X for whatever reason (where X = honesty, loyalty, baggy jeans, who knows). If Carol does X, then Carol is creating an association in Doris’s mind between herself and X. And this seems good to Carol—it makes her feel Approval Reward.

But also, if Ella is doing not-X, and Carol gets angry at Ella (starts treating her as an enemy, which includes schadenfreude and provocation and thus punishment), then Carol is slotting herself into that very common social pattern I described above. So if Doris sees this behavior, Doris’s mind will naturally infer that Carol is very pro-X. And Carol in turn fully expects Doris to make this inference. And this seems good to Carol—it makes her feel Approval Reward. So in sum, from Carol’s perspective, the idea of getting angry at Ella seems good.

Then the last step is motivated reasoning etc., by which Carol might wield attention-control to actually summon up anger towards Ella in her own mind. But maybe that last step is optional? I think Carol may sometimes feel like the right thing to do is to do the kinds of things that she would do if she were angry at Ella, even if she doesn’t really feel much actual anger towards Ella in the moment.
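To make the Step-2 logic concrete, here’s a toy numerical sketch (a hypothetical Bayesian-observer illustration; all the probabilities are invented, not measurements) of how Doris might update upon seeing Carol angrily punish not-X, and why Carol can anticipate Approval Reward from that update:

```python
# Toy sketch (my own illustration) of the Step-2 logic: observers have learned
# that "angrily punishing not-X" correlates with "valuing X", so doing the
# punishing is an effective signal, which Carol can anticipate in advance.

def posterior_pro_x(prior: float, p_punish_if_pro_x: float, p_punish_if_not: float) -> float:
    """Doris's belief that Carol values X, after seeing Carol punish not-X (Bayes rule)."""
    joint_pro = prior * p_punish_if_pro_x
    joint_not = (1.0 - prior) * p_punish_if_not
    return joint_pro / (joint_pro + joint_not)

# Illustrative numbers only: the learned pattern is "punishers of not-X are usually pro-X".
prior = 0.5
belief_after = posterior_pro_x(prior, p_punish_if_pro_x=0.6, p_punish_if_not=0.05)

# Carol's anticipated Approval Reward scales with how much Doris's opinion of her
# (as "pro-X, like Doris") is expected to improve.
anticipated_boost = belief_after - prior
print(f"Doris's belief that Carol is pro-X: {prior:.2f} -> {belief_after:.2f}")
print(f"Anticipated boost (proxy for Approval Reward): {anticipated_boost:+.2f}")
```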

Social drives 2: “Approval Reward”, from norm-enforcement to status-seeking
Steven Byrnes · 3d

I think the friend/enemy axis probably works more like a scalar coefficient

I agree! My term “friend (+) vs enemy (–) parameter” was trying to convey a model where there’s a scalar that can be positive for friend or negative for enemy.

But I think there’s also a “phase change” when you cross zero, such that the downstream consequences of friend vs enemy can be qualitatively different.

My guess is that “friend” and “enemy” are represented by two cell groups that are mutually-inhibitory, so they can’t both be very active at once. And they have two different sets of downstream consequences. I’m suggesting this by analogy to how hunger works—part of that system is (1) a group of AgRP/NPY neurons that are active when you’re hungry, and (2) a nearby group of α-MSH neurons that are active when you’re full. Each can be more or less active, but also each inhibits the other. So taking both together, you can have continuous variation between hungry and full, but you also get qualitatively different downstream effects of hunger vs fullness (e.g. the AgRP/NPY neurons increase pain-tolerance, the α-MSH neurons increase sex drive).

I’m skeptical of “frenemy”; instead I would propose that there are people who feel like a friend if you think about them in one way (paying attention to something about them), and feel like an enemy if you think about them in a different way, and you can switch from one to the other in rapid-fire succession, but not simultaneously. What do you think?

I was thinking that a stranger would be ≈0 on the friend-enemy “axis”, until you find a way to judge them. :)

This is all necessarily a bit speculative, because the alleged “friend (+) vs enemy (–) parameter” neuron groups are (to my knowledge) not yet known to science—probably they’re two of the hundreds of little neuron groups in the hypothalamus that nobody has studied yet, especially not in humans (which might or might not be the same as rodents in this respect). 
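For what it’s worth, here’s a minimal toy sketch of the mutual-inhibition idea (parameters invented for illustration, not a model of any known circuit): each cell group gets its own drive and suppresses the other, so activity varies continuously with the drives, but at steady state only one group is very active, which is where the qualitative “phase change” in downstream effects comes from:

```python
# Toy dynamical sketch (invented parameters) of two mutually inhibitory cell
# groups: each is pushed up by its own drive, pulled down by the other group's
# activity, and gates its own qualitatively different downstream effects.

def settle(friend_drive: float, enemy_drive: float, inhibition: float = 1.5,
           steps: int = 200, dt: float = 0.05) -> tuple[float, float]:
    """Relax the two activities to a steady state under mutual inhibition."""
    friend, enemy = 0.0, 0.0
    for _ in range(steps):
        d_friend = -friend + max(0.0, friend_drive - inhibition * enemy)
        d_enemy = -enemy + max(0.0, enemy_drive - inhibition * friend)
        friend += dt * d_friend
        enemy += dt * d_enemy
    return friend, enemy

for drives in [(1.0, 0.1), (0.6, 0.5), (0.1, 1.0)]:
    f, e = settle(*drives)
    mode = "friend-type downstream effects" if f > e else "enemy-type downstream effects"
    print(f"drives={drives}: friend={f:.2f}, enemy={e:.2f} -> {mode}")
```

The strong cross-inhibition is what makes it winner-take-all: even nearly balanced drives (the 0.6 vs 0.5 case) end up with one group active and the other near zero, rather than a blended “frenemy” state.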

How human-like do safe AI motivations need to be?
Steven Byrnes · 4d

current AIs need to understand at a very early stage what human concepts like “helpfulness,” “harmlessness,” and “honesty” mean. And while it is of course possible to know what these concepts mean without being motivated by them (cf “the genie knows but doesn’t care”), the presence of this level of human-like conceptual understanding at such an early stage of development makes it more likely that these human-like concepts end up structuring AI motivations as well. … AIs will plausibly have concepts like “helpfulness,” “harmlessness,” “honesty” much earlier in the process that leads to their final form … [emphasis added]

I want to nitpick this particular point (I think the other arguments you bring up in that section are stronger).

For example, LLaMa 3.1 405B was trained on 15.6 trillion tokens of text data (≈ what a human could get through in 20,000 years of 24/7 reading). I’m not an ML training expert, but intuitively I’m skeptical that this is the kind of regime where we need to be thinking about what is hard versus easy to learn, or about what can be learned quickly versus slowly.

Instead, my guess is that, if [latent model A] is much easier and faster to learn than [latent model B], but [B] gives a slightly lower predictive loss than [A], then 15.6 trillion tokens of pretraining would be WAY more than enough for the model-in-training to initially learn [A] but then switch over to [B].
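As a back-of-envelope illustration (the per-token loss gap and the “cost” of learning [B] are made-up numbers, not measurements): even a tiny per-token advantage for [B], multiplied by 15.6 trillion tokens, swamps any plausible one-time cost of finding [B]:

```python
# Back-of-envelope sketch with invented numbers: if latent model B beats latent
# model A by even a sliver of per-token loss, trillions of tokens make that
# sliver overwhelm any plausible one-time cost of learning B.

tokens = 15.6e12                 # LLaMa 3.1 405B pretraining tokens (from the comment)
loss_gap_per_token = 1e-4        # assumed: B's per-token loss advantage over A
one_time_cost_of_B = 1e6         # assumed: "effort" of finding/representing B, same units

total_advantage_of_B = tokens * loss_gap_per_token
print(f"Cumulative loss saved by B over the run: {total_advantage_of_B:.3g}")
print(f"Assumed one-time cost of learning B:     {one_time_cost_of_B:.3g}")
print(f"Ratio (advantage / cost):                {total_advantage_of_B / one_time_cost_of_B:.0f}x")
# With these numbers the cumulative advantage is ~1.56e9 vs a cost of 1e6 (~1560x),
# which is the sense in which 15.6T tokens seems like WAY more than enough to
# initially learn [A] but then switch over to [B].
```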

Social drives 1: “Sympathy Reward”, from compassion to dehumanization
Steven Byrnes · 4d

I think your question is kinda too vague to answer. (You’re asking for a comparison of two AI architectures, but what are they? I need more detail. Are we assuming that the two options are equally powerful & competent? If so, is that a good assumption? And is that power level “kinda like LLMs of today”, “superintelligence”, or something in between?)

…But maybe see my post Foom & Doom §2.9.1: If brain-like AGI is so dangerous, shouldn’t we just try to make AGIs via LLMs? for some possibly-related discussion.

Human Values ≠ Goodness
Steven Byrnes · 13d

(I say this all the time, but I think that [the thing you call “values”] is a closer match to the everyday usage of the word “desires” than the word “values”.)

I think we should distinguish three things: (A) societal norms that you have internalized, (B) societal norms that you have not internalized, (C) desires that you hold independent of [or even despite] societal norms.

For example:

  • A 12-year-old girl might feel very strongly that some style of dress is cool, and some other style is cringe. She internalized this from people she thinks of as good and important—older teens, her favorite celebrities, the kids she looks up to, etc. This is (A).
  • Meanwhile, her lame annoying parents tell her that kindness is a virtue, and she rolls her eyes. This is (B).
  • She has a certain way that she likes to arrange her pillows in bed at night before falling asleep. Very cozy. She has never told anyone about this, and has no idea how anyone else arranges their pillows. This is (C).

Anyway, the OP says: “our shared concept of Goodness is comprised of whatever messages people spread about what other people should value. … which sure is a different thing from what people do value, when they introspect on what feels yummy.”

I think that’s kinda treating the dichotomy as (B) versus (C), while denying the existence of (A).

If that 12yo girl “introspects on what feels yummy”, her introspection will say “myself wearing a crop-top with giant sweatpants feels yummy”. This obviously has memetic origins but the girl is very deeply enthusiastic about it, and will be insulted if you tell her she only likes that because she’s copying memes.

By the way, this is unrelated to “feeling of deep loving connection”. The 12yo girl does not have a “feeling of deep loving connection” to the TikTok influencers, high schoolers, etc., who have planted the idea in her head that crop-tops and giant sweatpants look super chic and awesome. I think you’re wayyy overstating the importance of “feeling of deep loving connection” for the average person’s “values”, and correspondingly wayyy understating the importance of this kind of norm-following thing. I have a draft post with much more about the norm-following thing, should be out soon :) UPDATE: this post.

{M|Im|Am}oral Mazes - any large-scale counterexamples?
Steven Byrnes · 15d

Companies actually do this and (if done competently) the implementation costs are an infinitesimal fraction of the benefits. Big companies generally already have all the info in various databases, they just need to do the right spreadsheet calculation. A tiny 1-person company could practically do it with pen and paper, I think. You can ballpark things; it doesn’t have to be perfect to be wildly better than the status quo. One competent person can do it pretty well even for a giant firm with hundreds or thousands of employees (but they should hire my late father’s company instead, it’ll get done much better and faster!)

If there’s some activity that the company does that sucks up tons of money, like as in a substantial fraction of the company’s total annual operating costs, with no tracking of how that money is actually being spent, like it’s just a black hole, and poof the money is gone … then the wrong way to relate to that situation is “gee it would take a lot of work to figure out where all this money is going”, and the right way is “wtf we obviously need to figure out where all this money is going ASAP”. :)  I don’t think that situation comes up. Companies already break down how money gets spent in their major cost centers.

Or if it’s a tiny fraction of the company’s total costs, then you can just come up with some ballpark heuristic and it will probably be fine. Like I asked my dad how they divvy up the CEO salary in this kind of system, and he immediately answered with an extremely simple-to-implement heuristic that made a ton of sense. (I won’t share what it is, I think it might be a trade secret.) And no it does not involve tracking exactly how the CEO spends their time.
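For concreteness, here is one generic ballpark heuristic of the kind I mean (made-up numbers; emphatically not the trade-secret heuristic mentioned above): divvy a shared overhead item like the CEO salary across divisions in proportion to each division’s direct costs.

```python
# Generic ballpark sketch (my own illustration, NOT the trade-secret heuristic
# mentioned above): allocate a shared overhead item across divisions in
# proportion to a simple base, here each division's direct costs.

direct_costs = {          # hypothetical numbers, in $
    "Division A": 40_000_000,
    "Division B": 25_000_000,
    "Division C": 10_000_000,
}
ceo_salary = 3_000_000    # hypothetical shared overhead to divvy up

total_base = sum(direct_costs.values())
allocation = {div: ceo_salary * cost / total_base for div, cost in direct_costs.items()}

for div, share in allocation.items():
    print(f"{div}: ${share:,.0f} of CEO salary")
# The point is just that a crude proportional rule is cheap to compute and already
# far more informative than leaving the overhead as an untracked black hole.
```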

Your posts should be on arXiv
Steven Byrnes · 16d

Another data point: when I turned my Intro to Brain-Like AGI Safety blog post series into a PDF [via typst—I hired someone to do all the hard work of writing conversion scripts etc.], arXiv rejected it, so I put it on OSF instead. I’m reluctant to speculate on what arXiv didn’t like about it (they didn’t say). Some possibilities are: it seems out-of-place on arXiv in terms of formatting (e.g. single-column, not LaTeX), AND tone (casual, with some funny pictures), AND content (not too math-y, interdisciplinary in a weird way). Probably one or more of those three things. But whatever, OSF seems fine.

Homomorphically encrypted consciousness and its implications
Steven Byrnes · 24d

I don't see why it should be possible for something which knows the physical state of my brain to be able to efficiently compute the contents of it.

I think you meant "philosophically necessary" where you wrote "possible"? If so, agreed, that's also my take.

If an omniscient observer can extract the contents of a brain by assembling a causal model of it in un-encrypted phase space, why would it struggle to build the same causal model in encrypted phase space?

I don’t understand this part. “Causal model” is easy—if the computer is a Turing machine, then you have a causal model in terms of the head and the tape etc. You want “understanding” not “causal model”, right?

If a superintelligence were to embark on the project of “understanding” a brain, it would be awfully helpful to see the stream of sensory inputs and the motor outputs. Without encryption, you can do that: the environment is observable. Under homomorphic encryption without the key, the environmental simulation, and the brain’s interactions with it, look like random bits just like everything else. Likewise, it would be awfully helpful to be able to notice that the brain is in a similar state at times t₁ versus t₂, and/or the ways that they’re different. But under homomorphic encryption without the key, you can’t do that, I think. See what I mean?
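Here’s a toy illustration of just that last point (using a simple randomized stdlib cipher as a stand-in, not actual homomorphic encryption): with a semantically secure scheme, the same brain state encrypted at t₁ and at t₂ looks like two unrelated blobs of random bits, so even the bare fact of equality is invisible without the key.

```python
# Toy illustration (NOT real homomorphic encryption; just a randomized stdlib
# cipher) of one narrow point: under semantic security, identical plaintexts
# encrypt to unrelated-looking ciphertexts, so without the key you cannot even
# tell that the "brain state" at t1 equals the "brain state" at t2.

import hashlib
import secrets

KEY = secrets.token_bytes(32)

def encrypt(state: bytes, key: bytes = KEY) -> tuple[bytes, bytes]:
    nonce = secrets.token_bytes(16)                      # fresh randomness each time
    keystream = hashlib.sha256(key + nonce).digest()     # toy keystream (short messages only)
    ciphertext = bytes(a ^ b for a, b in zip(state, keystream))
    return nonce, ciphertext

def decrypt(nonce: bytes, ciphertext: bytes, key: bytes = KEY) -> bytes:
    keystream = hashlib.sha256(key + nonce).digest()
    return bytes(a ^ b for a, b in zip(ciphertext, keystream))

state_t1 = b"brain state @ t1"    # pretend these are snapshots of the simulated brain
state_t2 = b"brain state @ t1"    # the same state recurring at a later time

enc_t1 = encrypt(state_t1)
enc_t2 = encrypt(state_t2)

print("Plaintexts equal:  ", state_t1 == state_t2)   # True
print("Ciphertexts equal: ", enc_t1 == enc_t2)       # False: no visible similarity without the key
print("With the key, both decrypt to:", decrypt(*enc_t1), decrypt(*enc_t2))
```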

Video and transcript of talk on giving AIs safe motivations
Steven Byrnes · 25d

The headings “behavioral tools” and “transparency tools” both kinda assume that a mysterious AI has fallen out of the sky, and now you have to deal with it, as opposed to either thinking about, or intervening on, how the AI is trained or designed. (See Connor’s comment here.)

(Granted, you do mention “new paradigm”, but you seem to be envisioning that pretty narrowly as a transparency intervention.)

I think that’s an important omission. For example, it seems to leave out making inferences about Bob from the fact that Bob is human. That’s informative even if I’ve never met Bob (no behavioral data) and can’t read his mind (no transparency). (Sorry if I’m misunderstanding.)

Posts

Steve Byrnes’s Shortform (6y)
Social drives 2: “Approval Reward”, from norm-enforcement to status-seeking (3d)
Social drives 1: “Sympathy Reward”, from compassion to dehumanization (6d)
Excerpts from my neuroscience to-do list (1mo)
Optical rectennas are not a promising clean energy technology (2mo)
Neuroscience of human sexual attraction triggers (3 hypotheses) (3mo)
Four ways learning Econ makes people dumber re: future AI (2mo)
Inscrutability was always inevitable, right? (3mo)
Perils of under- vs over-sculpting AGI desires (3mo)
Interview with Steven Byrnes on Brain-like AGI, Foom & Doom, and Solving Technical Alignment (3mo)
Teaching kids to swim (4mo)

Wikitag contributions: Wanting vs Liking (2 years ago), Waluigi Effect (2 years ago)