I'm an AGI safety / AI alignment researcher in Boston with a particular focus on brain algorithms. Research Fellow at Astera. See https://sjbyrnes.com/agi.html for a summary of my research and sorted list of writing. Physicist by training. Email: steven.byrnes@gmail.com. Leave me anonymous feedback here. I’m also at: RSS feed, X/Twitter, Bluesky, Substack, LinkedIn, and more at my website.
Another anecdote: I put off a couple posts from November into December because I happened to care about these particular posts having visibility on the lesswrong frontpage, and the lesswrong frontpage has been unusually gummed up by high-karma low-effort posts during November.
In phase space, our system might evolve into very complex states that require fine-grained knowledge to keep track of. Instead of neat squares, we wind up with fractal, space-filling shapes. But the thermodynamic model involves far coarser evolution, resulting in all the little gaps between thin "strands" of our phase space volume getting filled in, because we can't keep track of the fine details of where our volume is/isn't.
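To make the coarse-graining point concrete, here's a toy numerical sketch (my own illustration, with made-up parameters, not from the comment above): points that all start inside one coarse cell get stretched into thin strands by a chaotic map, and the Shannon entropy of the coarse-grained (binned) distribution climbs toward its maximum, even though no fine-grained information is added.

```python
import numpy as np

rng = np.random.default_rng(0)
K = 32  # number of coarse-grained cells partitioning [0, 1)
x = rng.uniform(0.0, 1.0 / K, 100_000)  # every point starts in the first cell

def coarse_entropy(points, n_bins):
    """Shannon entropy (in bits) of the coarse-grained bin occupancy."""
    counts, _ = np.histogram(points, bins=n_bins, range=(0.0, 1.0))
    p = counts[counts > 0] / counts.sum()
    return float(-(p * np.log2(p)).sum())

entropies = []
for _ in range(9):
    entropies.append(coarse_entropy(x, K))
    x = (2.0 * x) % 1.0  # chaotic "stretch and fold" doubling map

print(entropies[0], entropies[-1])  # ~0 bits at start, ~log2(32) = 5 bits at end
```

The fine details below the bin scale are exactly what the coarse thermodynamic model discards, which is the "gaps between strands get filled in" effect.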
I usually put it slightly differently: information converts from “actionable” to “not actionable”, where “actionable” means “you have some method that allows you to do something useful (e.g. charge a battery) with that information”.
…And if you have non-actionable information about a system, then you might as well just forget it. Hence entropy goes up.
For example, complicated correlations between the 17th decimal digits of air molecule positions are never actionable. Knowing that information does not help you charge a battery.
On the other hand, if you happen to know that all the air molecules are on the left side of the box and none on the right, that is actionable, as long as you can quickly slide a separator down the middle of the box and attach it to a piston. You can charge a battery using that information.
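For concreteness, here's the standard textbook accounting of that case (my addition, assuming an ideal gas expanding isothermally at temperature $T$): letting the $N$ molecules, initially all on the left, push the piston from volume $V/2$ out to $V$ extracts

```latex
W = \int_{V/2}^{V} p\,\mathrm{d}V' = \int_{V/2}^{V} \frac{N k_B T}{V'}\,\mathrm{d}V' = N k_B T \ln 2 ,
```

i.e. $k_B T \ln 2$ of work per molecule's worth of one-bit positional information — which is why the "all molecules on the left" fact is actionable while the 17th-decimal-digit correlations are not.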
There are fun edge cases where “information is lost and entropy went up”, or else it didn’t, depending on whether you have special equipment that allows you to act on that particular type of information.
Maybe you should just do the opposite of all these things so your writing becomes popular? Well, I don’t know. Maybe, maybe not.
Right on! I think you should have emphasized that part more, right from the start. (As written, you’re kinda connoting that the quick post properties are better and the effortpost properties are worse, until readers get to the last section.)
If I publish a scrupulously-researched 80-page review article on bacterial chemoreceptors, it’s sure not gonna go viral on Hacker News, but that’s still a very valuable thing I did.
See also: related comment of mine.
We can also imagine that the number of traders increases over time as well, with the th trader appearing on day and starting with .
…
The total amount of money in the economy is bounded to as well, so everyone's net worth is in cents.
Aren’t these two things contradictory? Sorry if I’m confused.
It’s definitely possible to “get stuck” not doing X because you’ve never tried X in your life, so you don’t know what you’re missing. Sometimes it can take people many years before they try X. And sometimes they just never do, even though they would totally “take to it” if they did.
I feel like you want to say that the never-trying-X failure mode is “the rule” and I want to say that it’s “the exception” … But if so, that might not be a real disagreement, and instead we’re just thinking about different kinds of X.
I definitely agree that it can be a thing, and brought it up in Heritability: Five Battles multiple times. My examples included X = “living in Churubusco, Indiana” (§2.2.2), or X = “becoming a Soil Conservation Technician” (§2.3), or X = “joining a niche online community like rationalism” (§2.2.3).
(Or sorry if I’m still missing your point.)
I'm very curious about which things fall into the "then-get-stuck" bucket and why. Are you confident there isn't a range of lower-level drives and innate reactions that get fixed similar to accents?
I think the way that rewards (pleasure or displeasure) turn into desires (motivation) is one thing that is definitely continuous learning, not “get stuck”. If you’ve done something over and over in childhood, and then you do it today and it’s 100% miserable and embarrassing, and then you try it again the next day and it’s again 100% miserable and embarrassing, then you’re probably not going to try it a 3rd time, and certainly not a 10th time.
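As a toy sketch of that continuous rewards-to-desires updating (my own illustrative model with made-up numbers, not something from the comment): treat "desire" as an exponentially weighted running average of past rewards, with a learning rate that stays high in adulthood rather than getting stuck.

```python
desire = 1.0   # learned from many rewarding childhood repetitions
alpha = 0.5    # learning rate stays high: continuous learning, no "get stuck"

for reward in (-1.0, -1.0):        # two 100%-miserable adult attempts
    desire += alpha * (reward - desire)

print(desire)  # negative after just two bad experiences: no 3rd attempt
```

The point of the sketch is just that with a nonzero learning rate, even a desire built up over years flips sign after a couple of strongly negative experiences.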
We see this when people get clinical depression as adults. They lose motivation to do all the things they used to like, no matter how much they liked them in childhood, or even more recently than that.
I think personality surveys mostly catch stuff like that. Even introverts go out sometimes, and even extroverts stay in sometimes. If it’s pleasant vs unpleasant, they’ll do it more. So a decision-to-socialize keeps getting steered towards the innate ground truth reward function. Likewise the decision to think about things versus people, to be honest or not, and ≈every other question on personality surveys. (Empirically, adult personality is approximately independent of childhood upbringing, in the population at large.)
Childhood regional accents are not “rewards turning into desires”, but rather something different—I think it involves the learning rate of a certain part of the cortex dropping towards zero after childhood. I don’t think “idiosyncratic cognition” is in that category, my impression is that PFC learning rates remain high throughout life.
(You’re an unusual case RE your accent … hmm, random question, did you watch a lot of American TV / movies as a kid?)
Childhood phobias don’t always last into adulthood but sometimes they do. The trick is that it’s a case where the reward itself can change (I call it “upstream generalization”). So the rewards→desires pathway continues to produce updates, but that doesn’t help. Separately, the reward itself does keep updating in response to new data, but only in narrower circumstances, by and large. I can’t immediately think of other things besides phobias that “get stuck” in that way.
I agree that “every thought we think, we’re thinking it because it’s higher-reward than other thoughts we might be thinking instead” is a great starting point.
---
I kinda disagree with your emphasis on childhood. See my post Heritability, Behaviorism, and Within-Lifetime RL, where I (dismissively) called that school of thought “RL learn-then-get-stuck”. Of course, “RL learn-then-get-stuck” is true for a few things, like regional accents, but I think those are the exception not the rule. (See also §2 of “Heritability: Five Battles”.)
---
I think you’re right about the person-to-person variation along a bunch of axes, but the way I think about it is generally at a lower level than the kind of “traits” you list. I think there are dozens of innate drives / innate reactions (some more important than others) in the hypothalamus & brainstem, and their relative strengths differ, and most of the things you list are mostly emergent consequences of the drive / reaction strengths vector. (Note that the map from the vector to behaviors is often nonlinear, and also depends on the options and consequences available in an environment / culture.)
Going through some examples from your list.
“Do you focus on "things" vs "people"” is I think related to an “innate drive to think about and interact with other people” that I briefly discuss in §5 here.
“Are words about reality or are words just rallying cries for your team?” is downstream of that, along with many other things, like how strongly one feels Approval Reward, which in turn depends on a bunch of things including how easily nearby people trigger an involuntary orienting reaction in you.
“How much emphasis do you place on wordless felt gut feelings?” is probably partly that those gut feelings come along with stronger involuntary attention and (the interoceptive equivalent of) orienting reactions in some people than others, making those feelings more or less salient versus easily-ignorable. (Presumably there are other contributing factors too.)
Etc. etc. I don’t have great theories for everything, just trying to give a hint of how I think about those kinds of things, in case it matters (and probably it doesn’t matter for your points here).
Yeah from my perspective EAG is a place where a lot of people interested in technical alignment go, to talk to other people interested in technical alignment, about technical alignment stuff.
Meanwhile there are other things happening at EAG too, but you can ignore them. You don’t have to attend the talks, you don’t have to talk to anyone you don’t want to talk to. And it’s not terribly expensive, and the location is (often) down the street from you (OP, John).
I wonder whether you’re thinking harder about countersignaling than about what would be object-level good things to do?
I guess I’d say “the thing we intended for the AGI to be trying to do” can be vague, or described at a meta-level, as opposed to very specific.
I didn’t mean for that sentence to be making a specific controversial claim about alignment targets. I generally see “alignment targets” as an open question (see a footnote in post 10).