I feel like my starting-point definition of “reward function” is neither “constitutive” nor “evidential” but rather “whatever function occupies this particular slot in such-and-such RL algorithm”. And then you run this RL algorithm, and it gradually builds a trained agent / policy / whatever we want to call it. And we can discuss the CS question about how that trained agent relates to the thing in the “reward function” slot.
For example, after infinite time in a finite (and fully-explored) environment, most RL algorithms have the property that they will produce a trained agent that takes actions which maximize the reward function (or the exponentially-discounted sum of future rewards or whatever).
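(To make that concrete: below is a bare-bones tabular Q-learning sketch, purely illustrative, with an `env` interface (`reset` / `actions` / `step`) invented for the example. The reward function is literally just a parameter passed into the slot, the loop builds value estimates out of it, and the "trained agent" is the greedy policy you get at the end. In a small, fully-explored environment, the standard convergence results say that policy winds up maximizing the discounted sum of whatever you put in the slot.)

```python
import random
from collections import defaultdict

def train_q_learning(env, reward_fn, episodes=10_000, alpha=0.1, gamma=0.9, epsilon=0.1):
    """Generic tabular Q-learning loop. `reward_fn` is whatever callable happens
    to occupy the "reward function slot": (state, action, next_state) -> float.
    (The `env` interface here is made up for the sketch.)"""
    Q = defaultdict(float)  # estimated return for each (state, action) pair

    for _ in range(episodes):
        state = env.reset()
        done = False
        while not done:
            # epsilon-greedy exploration
            if random.random() < epsilon:
                action = random.choice(env.actions(state))
            else:
                action = max(env.actions(state), key=lambda a: Q[(state, a)])

            next_state, done = env.step(state, action)
            r = reward_fn(state, action, next_state)   # <-- the "slot"

            # standard update toward r + gamma * max_a' Q(next_state, a')
            target = r if done else r + gamma * max(Q[(next_state, a)] for a in env.actions(next_state))
            Q[(state, action)] += alpha * (target - Q[(state, action)])
            state = next_state

    # the resulting "trained agent": act greedily with respect to the learned Q-values
    return lambda s: max(env.actions(s), key=lambda a: Q[(s, a)])
```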
More generally, all bets are off, and RL algorithms might or might not produce trained agents that are aware of the reward function at all, or that care about it, or that relate to it in any other way. These are all CS questions, and generally have answers that vary depending on the particulars of the RL algorithm.
Also, I think that, in the special case of the human brain RL algorithm with its reward function (innate drives like eating-when-hungry), a person’s feelings about their own innate drives are not a good match to either “constitutive” or “evidential”.
So if AGI somehow does have an Approval Reward mechanism, what will count as a relevant or valued approval reward signal? Would AGI see humans as not relevant (like birds -- real, embodied creatures with observable preferences that just don't matter to them), or not valued (out-group, non-valued reference class), and largely discount our approval in their reward systems? Would it see other AGI entities as relevant/valued?
I feel like this discussion can only happen in the context of a much more nuts-and-bolts plan for how this would work in an AGI. In particular, I think the AGI programmers would have various free parameters / intervention points in the code to play around with, some of which may be disanalogous to anything in human or animal brains. So we would need to list those intervention points and talk about what to do with them, and then think about possible failure modes, which might be related to exogenous or endogenous distribution shifts, AGI self-modification / making successors, etc. We definitely need this discussion but it wouldn’t fit in a comment thread.
The way I see it, "making solid services/products that work with high reliability" is solving a lot of the alignment problem.
Funny, I see "high reliability" as part of the problem rather than part of the solution. If a group is planning a coup against you, then your situation is better, not worse, if the members of this group all have dementia. And you can tell whether or not they have dementia by observing whether they’re competent and cooperative and productive before any coup has started.
If the system is not the kind of thing that could plot a coup even if it wanted to, then it’s irrelevant to the alignment problem, or at least to the most important part of the alignment problem. E.g. spreadsheet software and bulldozers likewise “do a lot of valuable work for us with very low risk”.
humans having magically "better reward functions"
Tbc this is not my position. I think that humans can do lots of things LLMs can’t, e.g. found and grow and run innovative companies from scratch, but not because of their reward functions. Likewise, I think a quite simple reward function would be sufficient for (misaligned) ASI with capabilities lightyears beyond both humans and today’s LLMs. I have some discussion here & here.
there's a very large correlation between "not being scary" and "being commercially viable", so I expect a lot of pressure for non-scary systems
I have a three-way disjunctive argument on why I don’t buy that:
As in, an organization makes an "AI agent" but this agent frequently calls a long list of specific LLM+Prompt combinations for certain tasks.
I think this points to another deep difference between us. If you look at humans, we have one brain design, barely changed since 100,000 years ago, and (many copies of) that one brain design autonomously figured out how to run companies and drive cars and go to the moon and everything else in science and technology and the whole global economy.
I expect that people will eventually invent an AI like that—one AI design and bam, it can just go and autonomously figure out anything—whereas you seem to be imagining that the process will involve laboriously applying schlep to get AI to do more and more specific tasks. (See also my related discussion here.)
how far down the scale of life these have been found?
I don’t view this as particularly relevant to understanding human brains, intelligence, or AGI, but since you asked, if we define RL in the broad (psych-literature) sense, then here’s a relevant book excerpt:
Pavlovian conditioning occurs in a naturally brainless species, sea anemones, but it is also possible to study protostomes that have had their brains removed. An experiment by Horridge[130] demonstrated response–outcome conditioning in decapitated cockroaches and locusts. Subsequent studies showed that either the ventral nerve cord[131,132] or an isolated peripheral ganglion[133] suffices to acquire and retain these memories.
In a representative experiment, fine wires were inserted into two legs from different animals. One of the legs touched a saline solution when it was sufficiently extended, a response that completed an electrical circuit and produced the unconditioned stimulus: shock. A yoked leg received shock simultaneously. The two legs differed in that the yoked leg had a random joint angle at the time of the shock, whereas the master leg always had a joint angle large enough for its “foot” to touch the saline. Flexion of the leg reduced the joint’s angle and terminated the shock. After one leg had been conditioned, both legs were then tested independently. The master leg flexed sufficiently to avoid shock significantly more frequently than the yoked leg did, demonstrating a response–outcome (R–O) memory. —Evolution of Memory Systems
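(In case the logic of that design is easier to see in miniature: here is a cartoon simulation of the contingency structure, emphatically not a model of actual insect ganglia, and the "punish whatever posture the leg was holding when the shock arrived" rule is just one crude way to cash out response-outcome learning. The only asymmetry between the two legs is that the master leg's posture controls the shock; that alone drives the master leg into sustained flexion while the yoked leg stays near chance.)

```python
import random

def simulate(steps=5_000, lr=0.05, seed=0):
    """Cartoon of the master/yoked contingency (illustrative only). Each 'leg'
    is extended with some probability at each moment; shock hits BOTH legs
    whenever the MASTER leg is extended; a simple response-outcome rule makes
    the posture a leg was holding at shock onset less likely in the future."""
    rng = random.Random(seed)
    p_ext_master, p_ext_yoked = 0.5, 0.5   # probability of being extended

    for _ in range(steps):
        master_extended = rng.random() < p_ext_master
        if not master_extended:
            continue                        # no shock unless the master extends
        yoked_extended = rng.random() < p_ext_yoked

        # Punish the posture held at shock onset.
        p_ext_master -= lr * p_ext_master   # the master was extended by definition
        if yoked_extended:
            p_ext_yoked -= lr * p_ext_yoked
        else:
            p_ext_yoked += lr * (1 - p_ext_yoked)

    return p_ext_master, p_ext_yoked

# The master leg ends up held in flexion (extension probability near 0), avoiding
# shock; the yoked leg hovers near chance, because its shocks arrive at random postures.
print(simulate())
```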
Oh, it’s definitely controversial—as I always say, there is never a neuroscience consensus. My sense is that a lot of the controversy is about how broadly to define “reinforcement learning”.
If you use a narrow definition like “RL is exactly those algorithms that are on arxiv cs.AI right now with an RL label”, then the brain is not RL.
If you use a broad definition like “RL is anything with properties like Thorndike's law of effect”, then, well, remember that “reinforcement learning” was a psychology term long before it was an AI term!
If it helps, I was arguing about this with a neuroscientist friend (Eli Sennesh) earlier this year, and wrote the following summary (not necessarily endorsed by Eli) afterwards in my notes:
- Eli doesn’t like the term “RL” in a brain context because of (1) its implication that "reward" is stuff in the environment as opposed to an internal “reward function” built from brain-internal signals, (2) its implication that we’re specifically maximizing an exponentially-discounted sum of future rewards.
- …Whereas I like the term “RL” because (1) If brain-like algorithms showed up on GitHub, then everyone in AI would call it an “RL algorithm”, put it in “RL textbooks”, and use it to solve “RL problems”, (2) This follows the historical usage (there’s reinforcement, and there’s learning, per Thorndike’s Law of Effect etc.).
- When I want to talk about “the brain’s model-based RL system”, I should translate that to “the brain’s Bellman-solving system” when I’m talking to Eli (the relevant equations are spelled out just below this list), and then we’ll be more-or-less on the same page I think?
…But Eli is just one guy, I think there are probably dozens of other schools-of-thought with their own sets of complaints or takes on “RL”.
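To spell out the two formal objects lurking in the background (my gloss; not something Eli signed off on): the “exponentially-discounted sum of future rewards” from Eli's point (2) is the return

$$G_t \;=\; \sum_{k=0}^{\infty} \gamma^k \, r_{t+k}, \qquad 0 \le \gamma < 1,$$

and a “Bellman-solving system” is one that (approximately) finds a value function satisfying the Bellman consistency condition

$$V^{\pi}(s) \;=\; \mathbb{E}_{\pi}\!\left[\, r_t + \gamma\, V^{\pi}(s_{t+1}) \;\middle|\; s_t = s \,\right].$$

The terminological fight is then over whether a system has to be organized around these particular equations to earn the label “RL”, or whether any Thorndike-style reinforce-what-worked mechanism counts.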
Personally, my stance is something more like, "It seems very feasible to create sophisticated AI architectures that don't act as scary maximizers." To me it seems like this is what we're doing now, and I see some strong reasons to expect this to continue. (I realize this isn't guaranteed, but I do think it's pretty likely)
We probably mostly disagree because you’re expecting LLMs forever and I’m not. For example, AlphaZero does act as a scary maximizer. Indeed, nobody knows any way to make an AI that’s superhuman at Go, except by techniques that produce scary maximizers. Is there a way to make an AI that’s superhuman at founding and running innovative companies, but isn’t a scary maximizer? That’s beyond present AI capabilities, so the jury is still out.
The issue is basically “where do you get your capabilities from?” One place to get capabilities is by imitating humans. That’s the LLM route, but (I claim) it can’t go far beyond the hull of existing human knowledge. Another place to get capabilities is specific human design (e.g. the heuristics that humans put into Deep Blue), but that has the same limitation. That leaves consequentialism as a third source of capabilities, and it definitely works in principle, but it produces scary maximizers.
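To gesture at what I mean by consequentialism as a source of capabilities, here is a deliberately dumb sketch (a real system like AlphaZero swaps the brute-force enumeration for MCTS plus learned value estimates, but the shape is the same): run candidate plans through a predictive world model and keep whichever plan the model says leads to the highest-scoring outcome.

```python
from itertools import product

def consequentialist_plan(world_model, score, actions, state, horizon=4):
    """Toy illustration of 'capabilities from consequentialism' (not anyone's
    actual system). `world_model(state, action)` predicts the next state and
    `score(state)` rates outcomes; both are hypothetical stand-ins here.
    We enumerate every action sequence of the given horizon, roll each one
    forward through the model, and return the predicted-best plan."""
    best_plan, best_value = None, float("-inf")
    for plan in product(actions, repeat=horizon):
        s = state
        for a in plan:
            s = world_model(s, a)   # predicted consequence of taking action a
        value = score(s)            # how good the predicted end state looks
        if value > best_value:
            best_plan, best_value = plan, value
    return best_plan
```

The capabilities come from the model plus the search, and the output is, by construction, whatever the model predicts will maximize the score; that maximize-the-predicted-outcome shape is the thing at issue.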
While the human analogies are interesting, I assume they might appeal more to the "consequentialist AIs are still coming” crowd than people like myself. Humans were evolved for some pretty wacky reasons, and have a large number of serious failure modes…
Yup, my expectation is that ASI will be even scarier than humans, by far. But we are in agreement that humans with power are much-more-than-zero scary.
I'd flag that in a competent and complex AI architecture, I'd expect that many subcomponents would have strong biases towards corrigibility and friendliness. This seems highly analogous to human minds, where it's really specific sub-routines and similar that have these more altruistic motivations.
I’m not sure what you mean by “subcomponents”. Are you talking about subcomponents at the learning algorithm level, or subcomponents at the trained model level? For the former, I think both LLMs and human brains are mostly big simple-ish learning algorithms, without much in the way of subcomponents. For the latter (where I would maybe say “circuits” instead of “subcomponents”?), I would also disagree but for different reasons, maybe see §2 of this post.
I personally find histories of engineering complex systems in predictable and controllable ways to be much more informative, for these challenges.
To explain my disagreement, I’ll start with an excerpt from my post here:
Question: Do you expect almost all companies to eventually be founded and run by AGIs rather than humans? …
3.2.4 Possible Answer 4: “No, because if someone wants to start a business, they would prefer to remain in charge themselves, and ask an AGI for advice when needed, rather than ‘pressing go’ on an autonomous entrepreneurial AGI.”
That’s a beautiful vision for the future. It really is. I wish I believed it. But even if lots of people do in fact take this approach, and they create lots of great businesses, it just takes one person to say “Hmm, why should I create one great business, when I can instead create 100,000 great businesses simultaneously?”
…And then let’s imagine that this one person starts “Everything, Inc.”, a conglomerate company running millions of AGIs that in turn are autonomously scouting out new business opportunities and then founding, running, and staffing tens of thousands of independent business ventures.
Under the giant legal umbrella of “Everything, Inc.”, perhaps one AGI has started a business venture involving robots building solar cells in the desert; another AGI is leading an effort to use robots to run wet-lab biology experiments and patent any new ideas; another AGI is designing and prototyping a new kind of robot that’s specialized to repair other robots; another AGI is buying land and getting permits to eventually build a new gas station in Hoboken; various AGIs are training narrow AIs or writing other special-purpose software; and of course there are AGIs making more competent and efficient next-generation AGIs, and so on.
Obviously, “Everything, Inc.” would earn wildly-unprecedented, eye-watering amounts of money, and reinvest that money to buy or build chips for even more AGIs that can found and grow even more companies in turn, and so on forever, as this person becomes the world’s first trillionaire, then the world’s first quadrillionaire, etc.
That’s a caricatured example—the story could of course be far more gradual and distributed than one guy starting “Everything, Inc.”—but the point remains: there will be an extraordinarily strong economic incentive to use AGIs in increasingly autonomous ways, rather than as assistants to human decision-makers. And in general, when things are both technologically possible and supported by extraordinarily strong economic incentives, those things are definitely gonna happen sooner or later, in the absence of countervailing forces. …
So that’s one piece of where I’m coming from.
Meanwhile, as it happens, I have worked on “engineering complex systems in predictable and controllable ways”, in a past job at an engineering firm that made guidance systems for nuclear weapons and so on. The techniques we used involved understanding the engineered system incredibly well, understanding the environment / situations that the system would be in incredibly well, knowing exactly what the engineered system should do in any of those situations, and thus developing strong confidence and controls to ensure that the system would in fact do those things.
If I imagine applying those engineering techniques, or anything remotely like them, to “Everything, Inc.”, I just can’t. They seem obviously totally inapplicable. I know extraordinarily little about what any of these millions of AGIs is doing, or where they are, or what they should be doing.
See what I mean?
I tweeted some PreK-to-elementary learning resources a few years ago here.