I'm Steve Byrnes, a professional physicist in the Boston area. I have a summary of my AGI safety research interests at:

steve2152's Comments

[Link] Lex Fridman Interviews Karl Friston

Friston was also on Sean Carroll's podcast recently - link. I found it slightly helpful. I may listen to this one too; if I do I'll comment on how they compare.

On the construction of the self

I'm really enjoying all these posts, thanks a lot!

something about the argument brought unpleasant emotions into your mind. A subsystem activated with the goal of making those emotions go away, and an effective way of doing so was focusing your attention on everything that could be said to be wrong with your spouse.

Wouldn't it be simpler to say that righteous indignation is a rewarding feeling (in the moment) and we're motivated to think thoughts that bring about that feeling?

the “regretful subsystem” cannot directly influence the “nasty subsystem”

Agreed, and this is one of the reasons I think normal intuitions about how agents behave don't necessarily carry over to self-modifying agents whose subagents can launch direct attacks against each other; see here.

it looks like craving subsystems run on cached models

Yeah, just like every other subsystem right? Whenever any subsystem (a.k.a. model a.k.a. hypothesis) gets activated, it turns on a set of associated predictions. If it's a model that says "that thing in my field of view is a cat", it activates some predictions about parts of the visual field. If it's a model that says "I am going to brush my hair in a particular way", it activates a bunch of motor control commands and related sensory predictions. If it's a model that says "I am going to get angry at them", it activates, um, hormones or something, to bring about the corresponding arousal and valence. All these examples seem like the same type of thing to me, and all of them seem like "cached models".

From self to craving (three characteristics series)


I don't really see the idea of hypotheses trying to prove themselves true. Take the example of saccades that you mention. I think there's some inherent (or learned) negative reward associated with having multiple active hypotheses (a.k.a. subagents a.k.a. generative models) that clash with each other by producing confident mutually-inconsistent predictions about the same things. So if model A says that the person coming behind you is your friend and model B says it's a stranger, then that summons model C which strongly predicts that we are about to turn around and look at the person. This resolves the inconsistency, and hence model C is rewarded, making it ever more likely to be summoned in similar circumstances in the future.

You sorta need multiple inconsistent models for it to even make sense to try to prove one of them true. How else would you figure out which part of the model to probe? If a model were trying to prevent itself from being falsified, that would predict that we look away from things we're not sure about rather than towards them.
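To illustrate the clash-detection story (a toy sketch of my own, with made-up numbers, not anything from the post): the "summon model C" step only kicks in when two models are simultaneously confident and mutually inconsistent:

```python
# Toy sketch: two generative models make confident, mutually
# inconsistent predictions about the same thing; the clash is what
# summons an information-gathering action (model C: "turn and look").

def inconsistency(pred_a, conf_a, pred_b, conf_b):
    """High only when the models disagree AND both are confident."""
    disagreement = 1.0 if pred_a != pred_b else 0.0
    return disagreement * conf_a * conf_b

# Model A: "the person behind me is my friend"; Model B: "a stranger"
clash = inconsistency("friend", 0.8, "stranger", 0.7)

LOOK_THRESHOLD = 0.25  # made-up number
action = "turn around and look" if clash > LOOK_THRESHOLD else "keep walking"
print(f"clash={clash:.2f}, action={action}")
```

Note that if either model is weak, or if they agree, the clash (and hence the urge to look) goes away, which is the asymmetry I'm pointing at.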

OK, so here's (how I think of) a typical craving situation. There are two active models.

Model A: I will eat a cookie and this will lead to an immediate reward associated with the sweet taste

Model B: I won't eat the cookie, instead I'll meditate on gratitude and this will make me very happy

Now from my perspective, this is great evidence that valence and reward are two different things. If becoming happy were the same as reward, why haven't I meditated in the last 5 years even though I know it makes me happy? And why do I want to eat that cookie even though I totally understand that it won't make me smile even while I'm eating it, or make me less hungry, or anything?

When you say "mangling the input quite severely to make it fit the filter", I guess I'm imagining a scenario like, the cookie belongs to Sally, but I wind up thinking "She probably wants me to eat it", even if that's objectively far-fetched. Is that Model A mangling the evidence to fit the filter? I wouldn't really put it that way...

The thing is, Model A is totally correct; eating the cookie would lead to an immediate reward! It doesn't need to distort anything, as far as it goes.

So now there's a Model A+D that says "I will eat the cookie and this will lead to an immediate reward, and later Sally will find out and be happy that I ate the cookie, which will be rewarding as well". So model A+D predicts a double reward! That's a strong selective pressure helping advance that model at the expense of other models, and thus we expect this model to be adopted, unless it's being weighed down by a sufficiently strong negative prior, e.g. if this model has been repeatedly falsified in the past, or if it contradicts a different model which has been repeatedly successful and rewarded in the past.
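As a toy formalization of that selection story (my own invention, all numbers made up): each candidate model's "selective pressure" is its predicted reward, weighed down by a prior reflecting its track record of being falsified or rewarded in the past:

```python
# Toy sketch: candidate models compete on predicted reward, discounted
# by a prior based on how often each has been falsified before.

def pressure(predicted_reward, prior):
    return predicted_reward * prior

candidates = {
    "A: eat cookie (single reward)": pressure(1.0, prior=0.9),
    "A+D: eat cookie, Sally will be happy (double reward)": pressure(2.0, prior=0.6),
    "B: meditate on gratitude": pressure(1.5, prior=0.2),
}
adopted = max(candidates, key=candidates.get)
print(adopted)
```

With these numbers the double-reward model A+D wins; a sufficiently low prior on A+D (say, 0.3, because it's been repeatedly falsified) would hand the win back to plain Model A.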

(This discussion / brainstorming is really helpful for me, thanks for your patience.)

Wrist Issues

Here's my weird story from 15 years ago where I had a year of increasingly awful debilitating RSI and then eventually I read a book and a couple days later it was gone forever.

My dopey little webpage there has helped at least a few people over the years, including apparently the co-founder of, so I continue to share it, even though I find it mildly embarrassing. :-P

Needless to say, YMMV; I can speak to my own experience, but I don't pretend to know how to cure anyone else.

Anyway, that sucks and I hope you feel better soon!

Source code size vs learned model size in ML and in humans?

OK, I think that helps.

It sounds like your question is really more like: how many programmer-hours go into putting domain-specific content / capabilities into an AI? (You can disagree.) If the answer is very high, then we get the Robin-Hanson-world where different companies make AI-for-domain-X, AI-for-domain-Y, etc., and they trade and collaborate. If it's very low, then it's more plausible that someone will have a good idea and bam, they have an AGI. (Although it might still require huge amounts of compute.)

If so, I don't think the information content of the weights of a trained model is relevant. The weights are learned automatically. Changing the code from num_hidden_layers = 10 to num_hidden_layers = 100 is not 10× the programmer effort. (It may or may not require more compute, and it may or may not require more labeled examples, and it may or may not require more hyperparameter tuning, but those are all different things, and in no case is there any reason to think it's a factor of 10, except maybe some aspects of compute.)
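To make that concrete (a toy calculation of my own, for a plain fully-connected net): the learned-parameter count scales with the hyperparameters while the "source code" defining the architecture stays the same few lines:

```python
# Toy sketch: parameter count of a plain MLP. Going from 10 hidden
# layers to 100 multiplies the learned content roughly 8x here, with
# zero change to the length of the code defining the architecture.

def count_params(num_hidden_layers, hidden_size, input_size=784, output_size=10):
    sizes = [input_size] + [hidden_size] * num_hidden_layers + [output_size]
    # weights (a * b) plus biases (b) for each layer
    return sum(a * b + b for a, b in zip(sizes[:-1], sizes[1:]))

small = count_params(num_hidden_layers=10, hidden_size=256)
big = count_params(num_hidden_layers=100, hidden_size=256)
print(small, big)
```

The point being: the information content of the trained weights went way up, but the programmer effort to type `10` versus `100` is identical.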

I don't think the size of the PyTorch codebase is relevant either.

I agree that the size of the human genome is relevant, as long as we all keep in mind that it's a massive upper bound, because perhaps a vanishingly small fraction of that is "domain-specific content / capabilities". Even within the brain, you have to synthesize tons of different proteins, control the concentrations of tons of chemicals, etc. etc.

I think the core of your question is generalizability. If you have AlphaStar but want to control a robot instead, how much extra code do you need to write? Do insights in computer vision help with NLP and vice-versa? That kind of stuff. I think generalizability has been pretty high in AI, although maybe that statement is so vague as to be vacuous. I'm thinking, for example, it's not like we have "BatchNorm for machine translation" and "BatchNorm for image segmentation" etc. It's the same BatchNorm.

On the brain side, I'm a big believer in the theory that the neocortex has one algorithm which simultaneously does planning, action, classification, prediction, etc. (The merging of action and understanding in particular is explained in my post here, see also Planning By Probabilistic Inference.) So that helps with generalizability. And I already mentioned my post on cortical uniformity. I think a programmer who knows the core neocortical algorithm and wants to then imitate the whole neocortex would mainly need (1) a database of "innate" region-to-region connections, organized by connection type (feedforward, feedback, hormone receptors) and structure (2D array of connections vs 1D, etc.), (2) a database of region-specific hyperparameters, especially when the region should lock itself down to prevent further learning ("sensitive periods"). Assuming that's the right starting point, I don't have a great sense for how many bits of data this is, but I think the information is out there in the developmental neuroscience literature. My wild guess right now would be on the order of a few KB, but with very low confidence. It's something I want to look into more when I get a chance. Note also that the would-be AGI engineer can potentially just figure out those few KB from the neuroscience literature, rather than discovering it in a more laborious way.

Oh, you also probably need code for certain non-neocortex functions like flagging human speech sounds as important to attend to etc. I suspect that that particular example is about as straightforward as it sounds, but there might be other things that are hard to do, or where it's not clear what needs to be done. Of course, for an aligned AGI, there could potentially be a lot of work required to sculpt the reward function.

Just thinking out loud :)

From self to craving (three characteristics series)


My running theory so far (a bit different from yours) would be:

  • Motivation = prediction of reward
  • Craving = unhealthily strong motivation—so strong that it breaks out of the normal checks and balances that prevent wishful thinking etc.
  • When empathetically simulating someone's mental state, we evoke the corresponding generative model in ourselves (this is "simulation theory"), but it shows up in attenuated form, i.e. with weaker (less confident) predictions (I've already been thinking that, see here).
  • By meditative practice, you can evoke a similar effect in yourself, sorta distancing yourself from your feelings and experiencing them in a quasi-empathetic way.
  • ...Therefore, this is a way to turn cravings (unhealthily strong motivations) into mild, healthy, controllable motivations

Incidentally, why is the third bullet point true, and how is it implemented? I was thinking about that this morning and came up with the following ... If you have generative model A that in turn evokes (implies) generative model B, then model B inherits the confidence of A, attenuated by a measure of how confident you are that A leads to B (basically, P(A) × P(B|A)). So if you indirectly evoke a generative model, it's guaranteed to appear with a lower confidence value than if you directly evoke it.

In empathetic simulation, A would be the model of the person you're simulating, and B would be the model that you think that person is thinking / feeling.
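Spelling out the attenuation rule numerically (my own toy formalization of the bullets above, numbers made up):

```python
# Toy sketch: a directly-evoked model shows up at full confidence; an
# indirectly-evoked one is capped at P(A) * P(B|A), so it necessarily
# shows up weaker, i.e. in "attenuated form".

def evoked_confidence(p_a, p_b_given_a):
    return p_a * p_b_given_a

direct = 0.9  # directly evoking model B: "I am angry"

# Empathetic simulation: A = my model of the person I'm simulating,
# B = the feeling I think that person is feeling
indirect = evoked_confidence(p_a=0.9, p_b_given_a=0.7)

assert indirect < direct  # attenuated, as the third bullet claims
```

Since P(B|A) ≤ 1, the indirect route can never exceed the direct one, which is the guarantee I was gesturing at.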

Sorry if that's stupid, just thinking out loud :)

Get It Done Now

it includes detailed advice that approximately no one will follow

Hey, I read the book in 2012, and I still have a GTD-ish alphabetical file, GTD-ish desk "inbox", and GTD-ish to-do list. Of course they've all gotten watered down a bit over the years from the religious fervor of the book, but it's still something.

If you decide to eventually do a task that requires less than two minutes to do, that can efficiently be done right now, do it right now.

Robert Pozen's Extreme Productivity has a closely-related principle he calls "OHIO"—Only Handle It Once. If you have all the decision-relevant information that you're likely to get, then just decide right away. He gives an example of getting an email invitation to something, checking his calendar, and immediately booking a flight and hotel. I can't say I follow that one very well, but at least I acknowledge it as a goal to aspire to.

Source code size vs learned model size in ML and in humans?

I'm not sure exactly what you're trying to learn here, or what debate you're trying to resolve. (Do you have a reference?)

If almost all the complexity is in architecture, you can have fast takeoff because it doesn't work well until the pieces are all in place; or you can have slow takeoff in the opposite case. If almost all the complexity is in learned content, you can have fast takeoff because there's 50 million books and 100,000 years of YouTube videos and the AI can deeply understand all of them in 24 hours; or you can have slow takeoff because, for example, maybe the fastest supercomputers can just barely run the algorithm at all, and the algorithm gets slower and slower as it learns more, and eventually grinds to a halt, or something like that.

If an algorithm uses data structures that are specifically suited to doing Task X, and a different set of data structures that are suited to Task Y, would you call that two units of content or two units of architecture?

(I personally do not believe that intelligence requires a Swiss-army-knife of many different algorithms, see here, but this is certainly a topic on which reasonable people disagree.)

Pointing to a Flower

If you're saying that "consistent low-level structure" is a frequent cause of "recurring patterns", then sure, that seems reasonable.

Do they always go together?

  • If there are recurring patterns that are not related to consistent low-level structure, then I'd expect an intuitive concept that's not an OP-type abstraction. I think that happens: for example any word that doesn't refer to a physical object: "emotion", "grammar", "running", "cold", ...

  • If there are consistent low-level structures that are not related to recurring patterns, then I'd expect an OP-type abstraction that's not an intuitive concept. I can't think of any examples. Maybe consistent low-level structures are automatically a recurring pattern. Like, if you make a visualization in which the low-level structures are highlighted, you will immediately recognize that as a recurring pattern, I guess.

Pointing to a Flower

I think the human brain answer is close to "Flower = instance of a recurring pattern in the data, defined by clustering" with an extra footnote that we also have easy access to patterns that are compositions of other known patterns. For example, a recurring pattern of "rubber" and recurring pattern of "wine glass" can be glued into a new pattern of "rubber wine glass", such that we would immediately recognize one if we saw it. (There may be other footnotes too.)

Given that I believe that's the human brain answer, I'm immediately skeptical that a totally different approach could reliably give the same answer. I feel like either there's gotta be lots of cases where your approach gives results that we humans find unintuitive, or else you're somehow sneaking human intuition / recurring patterns into your scheme without realizing it. Having said that, everything you wrote sounds reasonable, I can't point to any particular problem. I dunno.
