Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

YouTube link

What can we learn about advanced deep learning systems by understanding how humans learn and form values over their lifetimes? Will superhuman AI look like ruthless coherent utility optimization, or more like a mishmash of contextually activated desires? This episode’s guest, Quintin Pope, has been thinking about these questions as a leading researcher in the shard theory community. We talk about what shard theory is, what it says about humans and neural networks, and what the implications are for making AI safe.

Topics we discuss:

Daniel Filan: Hello everybody. In this episode I’ll be speaking with Quintin Pope. Quintin is a PhD student at Oregon State University where he works with Xiaoli Fern on applying techniques for natural language processing to taxonomic and metagenomic data. He’s also one of the leading researchers behind shard theory, a perspective on AI alignment that focuses on commonalities between AI training and human learning within a lifetime. For links to what we’re discussing, you can check the description of this episode and you can read the transcript at All right, Quintin, welcome to AXRP.

Quintin Pope: I’m glad to be here.

Why understand human value formation?

Daniel Filan: Yeah. So the first thing I want to ask is… So I guess we’re broadly going to be talking about shard theory and the shard theory perspective. And a research-methodological assumption behind shard theory research is that: it seems like you think that focusing on the processes by which humans develop desires or values or something… I get the impression that shard theory people think it’s really important to study this to understand AI or AI alignment or something. I’m wondering if you could tell me a bit about: why the shard theory methodology? Why look at how humans form these desires?

Quintin Pope: Well, I think there are at least two perspectives under which this research direction is a good thing to pursue. One is that we want to align AIs to our values and just generally it seems like a good idea to have some idea of what human values are. How do they arise, computationally speaking, how might they change in ways that we do or don’t reflectively endorse, and so on and so forth. And so just from the general perspective of, we want the AI’s behavior to match these things we call values in some way that we’re not really sure about and we’re not really sure about what the values are and so on and so forth. It seems good to have more clarity about what our target even is. So that’s the weak reason for wanting to investigate human value information, which is relevant to a broad range of perspectives on AI alignment.

So in addition to thinking, understanding human value formation is good for that reason, I also have a strong or more aggressive reason to think it’s good to understand human value formation, which is that I think the best way to align AI is to have the AI as much as is possible form values using the same underlying process as is responsible for human value formation. And the reason I think this is because if you’re uncertain about a thing and uncertain about how to quantify a thing, but you want to reproduce the thing and even reproduce aspects of the thing that you personally don’t understand, it’s best to try and replicate the causal process that caused the thing to emerge in the first place. So an illustrative example to better visualize the point I’m trying to make here, might be like: suppose we had a classical painting, say, and we wanted to produce the best possible replication of that painting.

One way to go about this is to try to reproduce the causal process that caused the painting to arise in the first place, which would be to have a person look at the painting and try to paint another painting that’s very similar to it. Another way would be to use a totally different sort of causal process such as, say, taking a photograph of the painting and then having a printer print out that photograph. And as far as the things you can observe or the things that you can measure explicitly in order to incorporate into what’s recorded in the photograph, that second process will definitely be better. It will better replicate the exact pixel distribution of the painting in question. And so if you know everything about what you want to replicate in the thing you’re looking at and you’re confident about all these unknown unknowns about the thing in question, then the picture approach would be better because you can get a more precise replication.

However, that’s not our epistemic position in regards to alignment and human values. The thing that causes problems are the unknown unknown aspects of human values, where there was some aspect of what we want or how our reflective process works, or other mysterious quantities about human values, that we didn’t know we had to replicate or we didn’t know we had to make the AI respect in some particular way. This is related to Eliezer [Yudkowksy]’s fragility of value and boredom, and how if you just miss boredom, it’s completely done. So if you think about the painting example, maybe the human making the replica of the painting isn’t going to have as good a pixel-by-pixel replication of the image that a machine would’ve captured. But they are going to have partial matches along dimensions that the machine had zero match along. So for example, maybe it turns out that the relevant aspect of the painting to replicate isn’t as much its visual appearance as say the texture of the painted brushes or the smell of the paint. Or whatever aspect we didn’t think of at the time actually turns out to be quite important.

Daniel Filan: Or lighting conditions or the angle at which… Yeah.

Quintin Pope: And so trying to replicate the causal process by which an artifact appeared, puts you in a better position to reproduce aspects of the artifact that you didn’t know you should be targeting. And I think this is a particularly useful thing to have in the deep learning regime where we have loss functions in SGD, which will let us… provided we can quantify something in a way that’s accessible to expressing in a loss function, we can usually replicate that desired behavior given sufficient scale and sufficient training and sufficient resources. And so it’s the unknown unknowns that concern me. And I think replicating causal processes which produce those unknown unknowns gives us the best chance of getting close enough along dimensions we’re not certain about.

Daniel Filan: So one thing that the analogy brings out: so I suppose you can take a photo of an artwork or you can try to paint it yourself. One advantage of taking a photo is that I can get a camera that works really well a lot more easily than I can learn to paint really well. Somehow the camera has really fine-grained control over the details of all the pixels. And so I’m wondering, do you think that there’s… Maybe this is just pushing the analogy too far, but by doing something more like an engineering top-down approach, it seems like potentially the aspects that you can measure, you can maybe have more fine-grained control over. What do you think of that?

Quintin Pope: Yeah, I think the analogy does break down at that level because the thing it’s analogous to is the deep learning/RL-esque approach to producing an intelligent agent versus the basically not really existent sort of… I don’t know what would even be the alternative at this point. Approximation to AIXI, some sort of explicit, pure direct optimization search procedure? So solving agent foundations and then trying to build an AGI, how exactly? So if you compare the degree of research, how quickly research has progressed in terms of getting deep learning systems to do stuff versus getting anything else to do stuff, we’re not even at the point where anything else can do stuff at all, much less the stuff we in particular want.

Daniel Filan: So in your mind, the process of human value formation is basically just deep learning: enough that you’re willing to, in your analogy you’re like, “Oh yeah, we need to study human values and recreate them just like a painter.” And then you’re like, “Oh yeah, the painting is just doing deep learning.”

Quintin Pope: At this point I think I am comfortable saying that the workhorse underlying the human learning process is basically loss minimization/reward signals. One interesting thing in this regard, or one piece of evidence in this direction is the startling degree of convergence we’ve seen between how state-of-the-art deep learning systems work versus how the human brain works. And so in the human brain you have, what’s basically self-supervised prediction of incoming sensory signals, like predictive processing, that sort of thing, in terms of learning to predictively model what’s going to happen in your local sensory environment. And then in deep learning, we have all the self-supervised learning of pre-training in language models that’s also learning to predict a sensory environment. Of course, the sensory environment in question is text, at least at the moment. But we’ve seen how that same approach easily extends to multimodal text plus image or text audio or whatever other systems.

And then most of your cognition comes from that, whether you’re a human or a deep learning model. And then there’s a little bit of RL on top of that, which is directing that cognitive behemoth in useful directions using a relatively small amount of fine-tuning data. And just very recently, there’s actually been an advance out of Anthropic in terms of pre-training that further narrows the gap between what the human brain does and how at least they train their language models, which is that Anthropic released a paper, “Pretraining Language Models with Human Preferences”. And what they did is: instead of doing all self-supervised pre-training at the start and then a bit of RL at the end, they basically mixed these things together and did what is essentially RL throughout the entire language model pre-training process. And this brings it closer to what the human brain does in terms of our reward circuitry continuously providing supervision over our cognition throughout our entire lifetimes instead of just only activating after your childhood, say, which would be absurd for a biological organism.

Daniel Filan: All right. So I think I actually want to defer talk about how you think AI learning is going to look pretty similar to human learning within the lifetime. I guess first up, we’re talking a lot about understanding how humans form values. First of all, what do you think about… What do you mean by value? What aspects of humans are you aiming to try and understand and capture?

Quintin Pope: Yeah, so I think there are basically two ways in which humans have values. One is they have priors over their generative model, and the other is they have discriminative preferences over plans. So humans have this thing where, say, if they’re thinking about… if you have a population of beetles, say, and you introduce a selection pressure to minimize the total population, or you try to replicate the group selectionist paradigm in beetles, and you do an experiment where you punish groups of beetles for breeding too quickly and having too large a population, and you try and introduce group-level selection on their genomes towards restraining their own breeding. If you ask a human what is this going to result in, their generative prior over what sorts of adaptations would be appropriate is going to reflect certain types of value laden judgements.

And they’re going to be like, “The beetles will nobly restrain their own consumption of resources or their own reproduction” or that sort of thing. And that’s the first thing that occurs to them in terms of what’s appropriate to do in this situation, even though they’re not the correct prediction in terms of evolutionary biology. And generally people thinking about how to get someone’s mother out of a burning building are going to be at first generating plans which are significantly biased towards the sorts of action sequences and outcomes that respect other types of human values. So the first things that they’re going to think about is, the fire team rescues the person’s mother. And unless they’re explicitly thinking about monkey paw curling and that sort of thing, they’re probably not going to think, “Oh, the building explodes and her corpse comes flying out.”

So humans, when they’re generating plans, have this tendency towards generating plans that are plausible human-like plans. Well, that’s tautological, but there are certain aspects of the sorts of plans that humans tend to produce which are biased towards achieving various parts of human values, and not egregiously interfering with those values or damaging people. So that’s an aspect of the generative prior which biases plans. That’s one part of values, I think. And then another part of values is a part of the discriminative preferences, or… there’s this thing humans do when we’re considering plans or considering outcomes where we can say, “This plan is better than this other plan.” Or, “This outcome is preferable in some respect to this other outcome.”

And so when we have a list of different outcomes or plans or world states, we can order them according to something. So this is what I’m calling discriminative values. Then you can have these two things working together, where you have a generative process which is already biased towards good plans - or “good plans” relative to our perspective - and then a discriminative process that’s working over those plans or comparing different plans to each other then selecting plans that are further in the goodness aspect, whatever that means. And you can even do yet cleverer things where you have some latent mental representation of a plan, and then you provide feedback from your internal discriminator to continuously shift that plan in the ‘good plan’ direction. And all of these, of course, have analogies in what we can do in deep learning.

Daniel Filan: Sure, right. So one question I have is: when you say there’s an aspect of imagining what things I might want to do that’s related to my values, I think that’s true. I think also part of how I imagine things that I might do is stuff I’ve done before or things that would be easy for me to do, or… it seems like a bunch of that is not particularly value-laden, in the way that I think of values as being important for something like AI alignment. Do you think there’s a structural difference between, in a way such that you could tell, “oh, this part of my generation of ideas is to do with values and this part isn’t?”

Quintin Pope: Sort of. So I think that part of this is socially determined, where we humans give each other feedback on the sorts of preferences that we’d prefer each other to have. And so there are some parts of your planning process that from a game-theoretic perspective, it’s better to move to the forefront or say this is more representative of the kind of agent that I am, as opposed to your desire to minimize your own resource expenditure and things like that. And I think this sort of feedback from other people shapes our perspective of what’s appropriate to have in our own minds. And so there are some parts of our planning process that we more strongly endorse than others. Or partially as a consequence of social feedback and partially as a consequence of what you get when you feed back those sorts of learned preferences over preferences and have them analyze themselves, so to speak, you find that there are certain parts of how you plan that you want to retain more so than other parts.

You want to continue being a good person in the future, but you don’t necessarily want to continue being someone who’s better at programming in Python than JavaScript in the future. So if you have a pill that makes you a better JavaScript programmer than a Python programmer by improving your JavaScript programming, that’s something you would be fine reflectively endorsing, that sort of change in yourself. And then there are other changes, of course, that we don’t reflectively endorse. So that’s one axis along which the things we call values tend to differ from other aspects about how we tend to think. So I think it’s a mix; or the two biggest factors that differentiate these things are socially-endorsed [versus not], and also internally-endorsed versus not.

Why not design methods to align to arbitrary values?

Daniel Filan: So hopefully this is a good way of us keeping in mind what thing we’re supposed to be explaining, maybe with shard theory or something else. I’m jumping over a bit, but I’d like to go a little bit back to this idea of: the way we’re going to make AI… It seemed like you thought that it would be good to figure out what made humans have human values and replicate that process. I think there’s this alternative perspective of alignment where the idea is you have humans around and then you have some process which takes some AI, takes some fledgling AI maybe in the process of training or whatever, and pulls it closer to what people want. Such that whatever you wanted, you could pull the AI closer to that so that it was trying to achieve what you wanted it to achieve.

And it seems like it has the advantage that you’re less likely to forget an aspect of what you want because all your desires are still present, or what you want for the AI is around in you. And it also seems a bit more, I don’t know, engineering-y. For a bunch of artifacts in the world, often a thing that happens is: you imagine what you want out of the artifact and then you design that to have that property. And I’m wondering if you could say a little bit about why you’re less optimistic about these sort of ‘aligned to some quasi-arbitrary goal someone has’ plans.

Quintin Pope: So I kind of don’t think such a process even exists.

Daniel Filan: Oh, why not?

Quintin Pope: I think that the entire idea of there being a goal, or such a thing as a goal, or insofar as goals exist in the world, they’re more or less exclusively artifacts produced by neural/deep learning systems that… GPT-4’s goal is not to minimize predictive loss. That’s the loss function with which it was trained. But insofar as it ever has goals, they’re not to do that unless you prompt it in some really weird way. So I, not very strongly, but somewhat think that the sort of thing you’re gesturing towards - being able to align something to an arbitrary goal - is not a thing that exists. And also, I don’t see a particular reason to think it should exist as… people came up with the Orthogonality Thesis, say, by introspecting on themselves, and by thinking in frames that very much did not predict deep learning.

Daniel Filan: The Orthogonality Thesis being the idea that smart things can have basically arbitrary goals, right?

Quintin Pope: Yeah. And so I don’t put much stock on either of those methods. It seems to me, after having studied deep learning quite a lot, that deep learning operates on different intuitions than the rest of the world. It’s sort of like quantum mechanics, where people try to come up with various analogies for how you should think about quantum mechanics and describe it in terms of various classical things. But nothing in the classical world is really quantum mechanics in the way quantum mechanics mathematically actually is. And I’m guessing that deep learning (including the human neurological learning process) is in this category of things which are very, very difficult to reason about via analogy, especially if you’re going into things trying to reason from analogies while not even knowing how deep learning works, and that it does work, and not being familiar with empirical results from deep learning. So I’m extremely skeptical of the sorts of intuitions that I perceive as being behind the [idea that] ‘There should exist a way of producing an intelligence directed in an arbitrary axis’.

Daniel Filan: But with the Orthogonality Thesis, you mentioned that you believe that the reason people thought of it is by introspection on their own minds. If human brains are made by processes that are really similar to deep learning and AI, shouldn’t we think that that’s actually a pretty good methodology?

Quintin Pope: Ironically, if you weren’t aware of deep learning, you might think that, or if you weren’t aware of various empirical results from deep learning, you might think that. But just think about how bad GPT-4 is at introspection. If anything, it’s way less introspectively aware than a human. In general, I don’t think that deep learning systems are the sorts of things that you expect to end up with good introspective awareness of how they themselves work. And also the issue with introspection is that it’s operating at the wrong level on the abstraction hierarchy.

So the question of Orthogonality Thesis is a question of… in the deep learning framework at least, it’s a question of what happens as a result of training different sorts of AI systems in various different ways. It’s on that level of abstraction. But introspection for humans… if you’re introspecting in a moment in time, this is the learned artifact of your neurological learning process, not the learning process itself. So in deep learning speak, you’re looking at the activations, or the activations are, quote unquote, “trying to look at themselves”, but what really matters is the training process trajectory of SGD operating on the weights of the neural network, not at the activation level.

Daniel Filan: So is the idea that we just shouldn’t trust our own introspection as a way of imagining what it would be like if we were formed in a different way?

Quintin Pope: You shouldn’t trust it like that. It’s not the case that introspection provides zero evidence. It’s just that you shouldn’t necessarily believe the results of introspection, the literal results of introspection. There was a neat example… if you work with ChatGPT, say, and ask it questions about itself, it’s very often completely wrong about how it thinks. But its answers still provide evidence that’s relevant for inferring how it thinks.

Postulates about human brains

Daniel Filan: All right. I guess I’d like to move on a little bit to talking more about these background assumptions about human brains that are, as far as I can tell, going to underlie your understanding of human value formation. So I’ve read some of your work and first I’d like to say what I think the assumptions are as I understand them. And you can tell me if I’m wrong or adding things that shouldn’t be there or missing important things.

Quintin Pope: All right, go ahead.

Daniel Filan: So this is based off, I think, a section of the post on shard theory, but I’ve modified it a bit to include things that I thought were important, which is why I want some feedback. So the first assumption is: so there’s this thing called a cortex in your brain, and this is basically responsible for making decisions about what to do. And when you’re born, you get a cortex that’s kind of randomly scrambled. It doesn’t have very important structure, and in particular your genome isn’t specifying the structure of your cortex very much. So human values and biases aren’t really accessible to the genome. Is that a fair statement?

Quintin Pope: Yeah. So for one, you have many cortices.

Daniel Filan: Oh, okay.

Quintin Pope: Yeah. And the genome does specify how they’re structured at birth. It’s just that specification doesn’t have very much information content in it. So it tells you the initialization, in ML speak. And there are patterns and structures in there, but they’re mostly not storing that much behavioral information. There is some direct behavioral information in terms of neural circuitry, which is specified exactly by the genome, such as, say, keeping your heart beating. But for the most part, say, an understanding of language isn’t in there.

Daniel Filan: So when you said there were multiple cortices, which cortex… Are we talking about all the cortices, or are we talking about some important cortices here?

Quintin Pope: Do you mean -

Daniel Filan: I think in the shard theory post, it says something like “the cortex is randomly initialized at birth”.

Quintin Pope: I think Alex was using that phrase to mean cortical matter, but there are different regions in the brain and they are genetically specified to varying degrees. The prefrontal cortex is mostly not genetically specified or not preloaded with genetically-specified behaviors. But there are other things that are more strongly specified, such as the circuitry that keeps your heart beating, like I mentioned. My guess is that there are hardwired pain response behaviors that cause your hand to flinch away from painful stimulus.

Daniel Filan: Okay. So my current understanding of the claim is something like: your brain has a bunch of parts. The genome specifies something about the architecture of your brain, but it does not set very many behaviors, at least none of the behaviors that we think of as intelligent or any of the interesting behaviors. The behaviors it sets are the fact that you can breathe without thinking about it really, or your heart beating or stuff like that, or reflexes. Is that roughly fair?

Quintin Pope: Yeah, pretty much.

Daniel Filan: Okay, cool.

Quintin Pope: And there’s, eyes tend to track faces more than other stuff.

Daniel Filan: All right, so the next thing is that there’s a sense that one of the things that the brain does is self-supervised learning, which I take to mean something like learning to predict what it’s going to see next or what sensory data it is going to receive. Does that sound about right?

Quintin Pope: Yeah. So you have a bunch of senses about your environment and also your internal state and so on and so forth. And the brain tries to predict all of these things simultaneously and also tries to predict them at multiple timescales. There are a bunch of generative/predictive models that are continuously contrasting their internally-generated predictions of the future with the actual incoming sensory information about the future. And some of these are oriented towards your internal bodily sense, such as some are probably predicting whether you’ll get sick soon. And others are oriented towards the external sensory environment, like what your visual system is about to perceive in the next second or so, and so on and so forth.

Daniel Filan: And so your brain somehow has some representations of all the various sensory data it’s going to receive at various time scales, and there’s some learning process to make those representations accurate, or maybe they’re probability distributions instead of very, very simple representations. Does that sound right?

Quintin Pope: It’s closer to say, I think, that they have complicated representations. I think they’re probably not factored in a way that you’d easily be able to look at it and say, “Oh, this is a probability distribution.”

Daniel Filan: Okay, so in what sense are they complicated?

Quintin Pope: They implement very complicated target mappings between what I previously observed versus what I predict to observe in the near future. They’re complicated in the sense that GPT-4 contains a bunch of extremely complicated learned circuitry. They’re complicated in the sense that the representations track very intricate and not easily predicted aspects of the physical world around you and the ways that you interact with that world and so on and so forth.

Daniel Filan: Okay. So the next assumption is that the brain also does reinforcement learning in the sense of… I guess I don’t know what reinforcement learning algorithm. But some kind of learning based on reinforcements: you get some kind of reward and that reinforces whatever you were doing right before you got the reward. And in particular, the reward circuitry is some part of your brain, it’s genomically specified, and it’s very simple. It’s something like, “Are you detecting a bunch of glucose molecules?” Or maybe your brain has some hardwired face detector and it’s like, “Do you see a smiling face?” Or something like that. Does that sound about right?

Quintin Pope: Yeah. So there are two aspects to it. One is a bunch of very simple hardwired genomically-specified reward circuits over stuff like your sensory experiences or simple correlates of good sensory experiences. There’s also what you might call a learned value function, which is sort of what behavioral science calls learned reinforcers. So if your environment contains sensory correlates of something that produces reward, then the things that predicted the reward incoming eventually become reinforcers themselves. So when you’re training a dog, initially you reward it with treats, but you always maybe have a sound prior to giving it treats, and eventually the sound becomes rewarding in and of itself.

Daniel Filan: And by rewarding, you mean in the sense of a reinforcer that reinforces, I guess, you’re thinking of computations that happened right before the event?

Quintin Pope: Yeah.

Daniel Filan: Okay, cool. I have a few more questions about that, but before that, I’ll get to the other somewhat more implicit assumptions. One thing - I guess you’ve said it in a few places - but I understand the shard theory perspective to be that evolution and the genome, they sort of specify this setup, but they don’t specify beliefs or pieces of knowledge or preferences, they just set up the learning process. Is that right?

Quintin Pope: Yeah, pretty much. I suppose you can call reflexively moving away, pulling back from hot stoves, say, like a sort of preference, but I don’t think the genomically-specified behaviors are much more complicated than that.

Sufficiency of the postulates

Daniel Filan: Okay, all right. Then finally, here’s something that I read as an implicit assumption to shard theory conclusions, which is that these are all of the relevant processes for human within-lifetime learning. Once you understand the self-supervised learning and the reinforcement learning, you basically understand how humans learn within their lifetimes. Does that sound right to you?

Quintin Pope: I think it’s most of it, but there are, of course other factors as well. For example, I think that probably infants have an attention level bias towards paying more attention to face-like structures than other parts of their visual field. This is not cleanly part of the reward system or part of the self-supervised objective. It’s a tendency to move your eyes in particular directions, which then feeds back into the sorts of things that you tend to learn about more than other kinds of things and so it’s like a lever which influences the types of values that are statistically formed by this kind of learning process, but it doesn’t neatly fit into the language of reinforcement learning or self-supervised learning directly.

There are other factors as well, such as architectural facts about the brain and what sorts of brain locations are closer or further from each other, which influences the sorts of algorithms that are easier to represent in the kind of architecture that the brain does have, and so these probably downstream influences your values and various statistical patterns. There’s also the level of noise that your neurons have. Research on deep learning inductive biases shows that the noise structure of your optimizer, over time, influences the types of generalization patterns that you tend to learn. So yeah, there are definitely other factors as well.

Daniel Filan: Okay. Would it be fair to say that you see those factors as important insofar as they influence how the reinforcement learning, the self-supervised learning work? Maybe the factors are biases in the value-neutral sense, like biases in how the self-supervised learning goes, or like hyperparameters of the RL or something?

Quintin Pope: Yeah, for the most part. There are also various genomically-specified Rube Goldberg-esque thingies. Your genome has various tools available to it to change your behavior mid-lifetime. There was an experiment with salt deprivation in rats. What the experimenters did is they took these rats and they sprayed extremely salty water into their mouths, and this is quite an unpleasant experience for the rats and so they ran away and didn’t like the sprayer that sprayed them with salty water. Then, the experimenters then injected the rats with a particular compound which chemically imitates the physiological effects of extreme salt deprivation so it caused the rats to biologically think that they were very salt-deprived and the rats immediately went to the salt sprayer and were very interested in it and tried to get salt water out of that sprayer, even though all the rewards they’d gotten from the salt water being sprayed into their face were very negative at the time.

What I think happened and what Steve Byrnes thinks happens when he wrote up a LessWrong post, which was “Inner alignment in salt-starved rats”: basically you have predictors of various sensory signals hardwired, so the genome specifies that there are going to be these predictors of various hardwired sensory signals, and one of those sensory signals is ‘saltwater in my mouth’. Throughout the rat’s lifetime, it’s learning to predict the occurrences of salt water in their mouth. What this does is it’s like a learned pointer to the world model. The world model learns that there’s this dispenser of salt water, and the genome now has this pointer to this part of the world model that this dispenser thing has salt water in it and so when the genome activates a genetically hard-coded algorithm, which is like “we are now salt deprived, we now need salt water”, this connects with the part of the world model that the rat has learned about, where salt is in its environment, and the genome can paint this thing with positive valence in order to direct the rat’s behavior towards the salt water now that it needs salt in its diet, even though no part of the reward that the rat received within its lifetime said “salt was good” and no part of the rat’s within-lifetime behavior is telling it that it should pursue salt water when it’s salt deprived.

Daniel Filan: Gotcha. If I take this as being relevant to human brains and human cognition, it sounds like the thing that’s going on, the type of reinforcement learning that’s going on, can’t just be [that] you get a reward signal and that strengthens or weakens certain behaviors or certain types of thinking. It seems like it must be the case that there’s something being stored somewhere which is like, “oh yeah, these kinds of thinking got this type of reward”. Then, there’s some process that directs your thinking either to be things that got that type of reward or things that got the negative of that type of reward, if that’s what’s needed, right? If you’re sprayed with salty water, some bit of you is storing like, “Oh yeah. These behaviors were associated with saltiness and ordinarily, I’m going to try to avoid saltiness, but sometimes I’m going to go towards saltiness and then I’m going to pick those behaviors.” Does that sound right to you?

Quintin Pope: Yeah. There are various, probably fairly simple, genetically-specified algorithms and conditionals that activate to implement behaviors that are very hard to learn within lifetime. You don’t want your rats to have to go through periods of extreme salt deprivation and then explore around their environment until they find salt in order to learn that salt deprivation needs salt to remedy, because they’ll die before they find the salt and discover this relationship, so the genome definitely does have conditional circuitry for certain types of ancestrally-relevant survival behavior.

Daniel Filan: Yeah. Well, in this case, it’s kind of an abstract behavior, right? In the ancestral environment, there wasn’t a guy in a lab coat who you could go to to get salt, right?

Quintin Pope: Yeah, but there were salt deprivations and so-

Daniel Filan: The thing that must be happening is: the thing being coded is like “get a thing that you’ve stored as having salty” rather than a very direct behavior.

Quintin Pope: Yeah. It’s like configuring the learning process to learn pointers between what the genome can “see” versus the learned understanding of the world. That lets the genome point the rat in various directions when it’s relevant for behaviors that you would die when you try to learn them in the real world or that you don’t have time to learn within your lifetime and stuff like that.

Reinforcement learning as conditional sampling

Daniel Filan: Sure. I guess, the reason that I’m focusing on this is to me, this sounds different than at least what I think of as reinforcement learning in the very simple… just like, you have this reinforcer and it always reinforces behavior, right?

Quintin Pope: Yeah. I don’t really think of it as that different. All behaviors are sampling from some distribution. The thing about distributions is that you can condition them in various ways. Conceptually speaking, the normal sorts of reinforcement learning we do with just one type of reward signal is just sampling from some distribution of behavior conditional on the samples having high reward assigned to them. What the brain is doing is, it can condition on different sorts of things. Instead of just conditioning on “this behavior gets high reward”, whatever that means, it’s like “this behavior gets high reward as measured by the salt detector thing”.

In the context of language modeling, we can do exactly the same thing, or a conceptually extremely similar thing to what the genome is doing here, where we can have… [In the] “Pretraining Language Models with Human Preferences” paper I mentioned a while ago, what they actually technically do is they label their pre-training corpus with special tokens depending on whether or not the pre-training corpus depicts good or bad behavior and so they have this token for, “okay, this text is about to contain good behavior” and so once the model sees this token, it’s doing conditional generation of good behavior. Then, they have this other token that means bad behavior is coming, and so when the model sees that token… or actually I think they’re reward values, or classifier values of the goodness or badness of the incoming behavior.

But anyway, what happens is that you learn this conditional model of different types of behavior, and so in deployment, you can set the conditional variable to be good behavior and the model then generates good behavior. You could imagine an extended version of this sort of setup where instead of having just binary good or bad behavior, as you’re labeling, you have good or bad behavior, polite or not polite behavior, academic speak versus casual speak. You could have factual correct claims versus fiction writing and so on and so forth. This would give the code base, all these learned pointers to the models’ “within lifetime” learning, and so you would have these various control tokens or control codes that you could then switch between, according to whatever simple program you want, in order to direct the model’s learned behavior in various ways. I see this as very similar to the sorts of things that the brain’s steering/genomically-specified behavior control system is doing within lifetime.

Compatibility with genetically-influenced behaviour

Daniel Filan: Okay. I think part of this is, there’s this idea that the relevant process of learning is happening over a lifetime. It’s the self-supervised learning and the reinforcement learning and that the genome is specifying basically mostly architectural facts and some very low-level facts, like I don’t know, flinches or reflexes, stuff like this.

I’m wondering, if I look at humans, there’s this literature on which things are kind of “genetically innate”, or at least have some genetically innate component, versus totally specified by the environment. A standard finding is that you can take twins and separate them and adopt them to different places and the standard finding is that they’re very similar in a bunch of ways. They have a lot of similar preferences. Maybe they’ll go to church at similar frequencies or stuff like this. I think at face value, this seems kind of contrary to this picture, so I’m wondering how would you reconcile those?

Quintin Pope: Yeah. I want to be clear that I’m not arguing for a blank slatist interpretation of human learning. It’s not the case. We just had an entire discussion about how the genome has these various levers of control for your learned behavior that can let it steer your behavior in various ways depending on the circumstances you’re in, so that’s one way in which your genome has options for determining what sort of person you are and how you act. The other thing that the genome can do is it can specify your reward circuitry. To give a very simple example, say, humans have different types of reward circuitry… or the reward circuitry of different humans judges tastes in different ways. If you have reward circuitry that tends to very highly upweight sugar consumption, then you’re more likely to like sweets in your lifetime and this is a genetically determined fact about your learning process, which tends to influence the sorts of values you learn within your lifetime.

My general expectation is very much if there’s a question of values or any behavioral trait really, how much is that genetically influenced versus environmentally influenced? My initial assumption is these will be half-and-half genetic versus environment. But what I’m saying is that the mechanisms by which the genome exerts this influence are primarily not by directly specifying the behaviors in question. They don’t directly specify the neural circuitry that causes you to see ice cream and think, I want to eat that. It specifies the parts of the reward function, which in combination with the environment you live in, causes you to form the sorts of behavior-specifying circuitry that incline you towards eating ice cream.

Daniel Filan: That kind of makes sense, but then I wonder like… okay, I’m going to do something a bit naughty. I’m going to use religion as an example, even though I haven’t actually checked that these results hold for religiosity, but it seems to me they’re the kind of thing that it’s going to hold for.

Quintin Pope: Yeah. I think that specific religion is still slightly genetically influenced, but it’s one of the least genetically and most culturally influenced factors, I want to say.

Daniel Filan: Yeah. I’m not thinking of which denomination you’re a part of. I’m more thinking like how frequently do you pray or exhibit generically religious behavior. That strikes me as more likely to be something that could be sort of genetically specified.

There, I’m like if that’s happening, I don’t know, if you can do some twin adoption study where people are raised in what at least seem to me as being relatively different environments - it’s sort of unclear how you can go from reward circuitry that specifies glucose or smiles or something and get to religiosity without the genome whispering more things in your ear, so to speak.

Quintin Pope: Yeah. That’s one of the more difficult things to see how the genome has levers to influence it. One thing that the genome can do is: the learned pointers to the world model point to anything which has a genomically accessible correlate. What I mean by this is that the genome doesn’t have to exactly specify which parts of the world it should point to. If there are complicated parts of the world that you can’t genetically encode a precise classifier for, such as whether you’re in a religious gathering or not, the genome can just point to things that are statistically speaking correlates of that in the sensory experiences. If you can figure out any sort of simple classifier that has access to your sensory signals and is significantly more likely to activate for religious interactions or religious-style thinking than for other stuff, you can probably have a genomic level bias towards or away from religion.

One of the other things that the genome can do is it can have simple classifiers over internal states of experience… so, I don’t know, strong introspection, strongly introspective states. Maybe it has some level of access to the degree to which you’re experiencing awe, say, as an emotion. Genomically-specified reward circuitry of different people can assign different levels of reward and different persistences of reward to an internal experience of awe. This could bias people towards being more or less religious or being more or less, I don’t know, cultural correlates of religiosity. Like not being religious, but being spiritual, that sort of thing.

Daniel Filan: Yeah, yeah. It sounds like emotions, awe, it sounds like you think those are the kinds of things that might be genomically specified such that they could be hooked up to a relatively simple reward circuit. Is that right?

Quintin Pope: I think they’re genomically accessible. I think you can have genomically-specified circuitry that’s, say, looking at your brain, and is able to estimate the level of awe you’re experiencing or the level of anger you’re experiencing and that feedback from genomically-specified circuits can influence the level of awe, anger and so on and so forth that you are experiencing, but it’s not an exact thing.

Daniel Filan: Yeah. I guess, it seems like it would be hard for this to be true if things like awe and anger were things that you learned during the self-supervised learning and then the RL, right? Because then, you and I might learn them in different ways and they could be in different bits of my brain and it would be hard for my genome to pick up. It would be hard to say, “Oh yeah, this receptor is awe in Daniel, and it’s also awe in Quintin.”

Quintin Pope: Yeah, so it’s not matching locations. It’s at the level of behavioral patterns and how they relate to internal states, not at the level of pre-specifying this particular location in the learned world model as being the awe location. Say, if people tend to be like - let’s give a really simple example - angry when they’re hungry, say, then the genome can look for, in your learned world model, correlates of hunger and that’s going to be anger plus a bunch of other things, and then it can look for correlates of pain, I guess and that’s going to be anger plus a bunch of other things. The intersection of those things is more anger than other stuff. When I mean learned pointers, you can have various clever mechanisms of trying to extract what in the world model are these various higher-level concepts. That’s even assuming that there isn’t, say, some whole brain level statistical pattern that anger tends to exhibit or awe tends to exhibit.

I think that EEG studies on human brains are able to classify emotional states with better than random chance accuracy. They don’t do this by looking at each individual - or if you think about the sorts of features that they probably extract, it’s not at the level of individual neuron activations, it’s at the level of statistical patterns over blood flow in the brain - or sorry, no, over high-level electrical activity patterns. If statistically, anger tends to look like this on an EEG and awe tends to look like something else on an EEG, I think you can have genome-specified circuitry that is more sensitive to the sorts of statistical patterns that anger tends to have, or more sensitive to the sorts of patterns that awe tends to have.

Daniel Filan: Okay. Yeah. It makes sense to me that genomes could specify reward circuitry that was related to statistics of electrical activity. You also said something about reward circuitry hooking up to things in the learned world model. Are you thinking of that kind of thing? The kind of thing where some parts of your world model, you’re thinking about it, and you have some emotion and that induces some high-level statistics of electrical activity in the brain and the circuitry picks up on that? Or are you thinking of something kind of different?

Quintin Pope: Yeah. There are two different things going on here, and I was kind of jumping between them in a way that was probably a bit confusing. There’s the sensory ground truth signals that the genomically-specified circuitry uses to learn a pointer towards various concepts in the world model. Then, there’s that pointer itself.

The high-level activation patterns that the genome could specify a detector for, I was thinking of those as being the sensory ground truth that are then used to make a pointer to the world model and see which aspects of the world model correlate with or predict the occurrence of those sorts of high-level electrical activity patterns associated with emotions.

Daniel Filan: Gotcha. So it’s something like: the genome is plugged into electrical activity or glucose or some transmitters or something, fairly concrete physical things about the brain, and then the way that relates to the human world model is that some things about the human world model - “I believe a donut is on its way to me” or something like that - that’s going to statistically be related to “I’m going to get glucose five minutes from now. Right now, I’m feeling anxious because I’m waiting for it or something.” Is that roughly right?

Quintin Pope: Yeah.

Daniel Filan: Okay, cool. I think I understand this view a bit better. I guess I have this question: why should I believe these assumptions about the brain?

Quintin Pope: Yeah. Steven Byrnes wrote a 15-post sequence arguing for this sort of picture of the brain as doing within-lifetime learning. That’s one thing.

Daniel Filan: In the form of self-supervised learning and reinforcement learning?

Quintin Pope: Yeah.

Daniel Filan: Okay.

Quintin Pope: Also, you compare it to, say, the recent progress in deep learning, and we seem to have stumbled upon a pretty similar paradigm to what the brain seems to do. That’s suggestive, it’s a convergent thing, as an effective way of getting performance out of a computational system. Also, there’s just how difficult it is to get interpretability to work at the level where you’d be able to hand specify the sorts of circuits that you’d need to be able to specify in order for the genome to directly encode high-level behavioral parts of the world model and value system.

Daniel Filan: You mean interpretability on the human brain?

Quintin Pope: Interpretability on deep networks. So, the tools we actually use to steer our deep networks are, like I said with the Anthropic thing, conditioning of a generative model using basically learned pointers to the different concepts in the generative model and this is also how prompting works.

Why deep learning is basically what the brain does

Daniel Filan: Yeah. I mean, that counts as evidence once I believe that the human brain works this way, or once I believe that the human brain is really similar to deep learning, right?

Quintin Pope: Yeah. You do have to believe that the human brain has a degree of similarity to deep learning for information about what is and isn’t easy in deep learning to count. Then in terms of like, do you basically want me to argue that the brain and deep learning are doing very similar things?

Daniel Filan: If you don’t think it’s true, then I don’t want you to argue it, but if part of your reason for thinking that this view of the brain is right, is that it’s similar to modern deep learning, then yeah, I would like to know why you think that the human brain is similar to deep learning without making reference to the similarities that you’re deriving from that connection.

Quintin Pope: Okay.

Daniel Filan: Or if that’s not a part of your reasoning, then I’m happy to drop it, of course.

Quintin Pope: Yeah. Partially, it’s because we can just look at the activations in the brain and look at the activations in deep models and it turns out that these things are startlingly similar to each other. In particular, if you look at the activations in the brain’s linguistic cortex while processing various types of text and then compare that to the activations within a language model transformer when processing the same types of texts, you can find a linear transformation from the transformer’s representations to the recorded brain’s representations.

This linear transformation is much better at predicting the brain’s representations than a transformation from a randomly initialized transformer can be. In fact, there’s been studies finding that the linear transformation on top of the model embeddings predicts the recorded brain activations up to the noise limit of the recording of the brain recording technology they were using. Different studies found different degrees of correspondence between brain activations and neural network activations.

I think there was a study out of either Facebook or Microsoft, I think Facebook, which either used more sensitive recording instruments or something like that. But what they found was that language model activations are predictive of brain activations, and also that as you make the language model’s training objective more similar to the presumable training objective of the brain’s linguistic systems, such as by having the language model predict multiple tokens in advance instead of one token in advance, that the degree of similarity between the language model activations and the brain recordings goes up. Yeah, so that’s one piece of evidence, like a direct comparison between brain activations and language model activations.

Another piece of evidence is that from a neurological level of what the brain is doing versus the parameter-space level of what SGD is doing, they clearly share at least some degree of their inductive biases and plausibly most of their actually relevant inductive biases. In particular, there are two sources of these inductive biases that seem most comparable between the brain and deep learning. One is an inductive bias towards flat regions in the parameter space. This might be getting kind of technical, so stop me if things seem like they deserve clarification.

Daniel Filan: Sure. By flat regions, you roughly mean settings of the knobs or whatever that specify how your brain works or how neural networks [work], such that there are a bunch of knobs that you can twiddle a fair bit without changing the loss, the thing that the system is optimized for. Is that roughly right? And the loss is good, yeah.

Quintin Pope: Yeah, where those knobs are specifically the weights of a neural network and the synaptic connections in the brain. We’re not talking about the genome or genes as the parameters being twiddled, it’s the parameters of the learned artifact. People have done a bunch of investigation of the inductive biases of SGD, or deep learning. By inductive biases, I mean those factors of the learning process or those features of the learning process that bias it towards certain types of solutions and certain types of patterns of generalization as opposed to others. There’s this one reasonably large source of inductive biases that both the brain and deep learning share to quite a degree, which is derived from the presence of noise, of randomness in their two optimization procedures. Stochastic gradient descent, SGD, has quite a bit of noise in it as a result of stuff like the mini-batch ordering, and not being trained on the entire batch at once, and other factors related to how the data interacts with the architecture and optimizer.

People have studied the impact of this noise on the types of solutions that SGD tends to derive. Unsurprisingly, in my opinion, at least the first order effect of this noise is to bias SGD towards flatter regions in the loss landscape. In your knob view, if you have a bunch of knobs whose twiddling causes huge issues for the system in question, then having noise in what sorts of knobs are twiddled will cause issues if you’re in a region of solution space where there are lots of ways to mess things up. And so having randomness in the training process will bias you towards regions of the solution space where there aren’t lots of ways to mess things up, where lots of the alternative nearby settings of the knobs that you could encounter produce very similar behaviors to what you currently have. And the brain also has lots of noise in its learning process so the per neuron level of noise in activation patterns is pretty high, much higher than dropout, or the regularizers we tend to apply to machine learning systems… Well, I guess that depends on how much noise you introduce into a machine learning system. You can introduce more noise than the brain has, but the brain has a respectable amount of noise, is my point.

Daniel Filan: Well, how do you know which stuff to interpret as noise, versus signal that you don’t understand? Maybe you don’t know this off the top of your head.

Quintin Pope: I guess I’ve always heard that the brain is a noisy instrument, but I don’t actually know the empirical observations that lead people to believe this.

Daniel Filan: I guess you could look at temperature, I don’t know if it just has a certain temperature and if it’s at a certain scale, it’s just hot. Things will be jiggling around randomly, that gets you some floor.

Quintin Pope: Yeah. Potentially that. There’s also the theoretical computational efficiency reason where the Landauer [limit] tells you the thermodynamic minimum energy required to do computations… If I remember correctly, there’s a term for the accuracy and the noise of the computations. And so, just from a Pareto optimum sort of view, you wouldn’t expect biology to be extreme on maximum precision and not compromising at all on the precision of individual computations, even if that costs more energy than otherwise. Especially when deep learning shows you that you can have a fair degree of noise in the level of individual computations and still have an effective system in aggregate. And then there’s probably arguments based on the biology of neurons and how consistent they can plausibly be. I’m not a neurobiologist.

Daniel Filan: Fair enough. So, it sounded like the reason you thought this story was accurate: firstly, you referred people to the series of posts arguing for something like this perspective, and then you’re pointing towards, okay, we train neural nets in roughly this way. And there are various similarities between human brains and neural networks. They have similar representations of language. The learning process is biased towards similar kinds of results. Does that about cover it?

Quintin Pope: That was one of the ways in which the inductive biases were similar. And then the other way, or the other big way I think that they’re similar is, well, there’s this thing called singular learning theory in the study of machine learning systems. And basically, what singular learning theory does, is it looks at how the data that a system is trained on influences the inductive biases of the system. And there are even results that no purely architecture-level explanation of inductive biases that doesn’t include facts about the data the system was actually trained on, can fully account for the inductive biases of the system. And basically, singular learning theory argues that there are certain facts about the geometry of the data that the system is trained on and how that interacts with the possible internals of the system, which will bias the system towards certain types of internal representations.

And from my perspective, or as far as I can tell, very similar sorts of arguments should apply to the neurology or to the synapse space of human brains. So, singular learning theory doesn’t make any assumptions about the sort of optimizer you use to tune the internals of the system. It’s sort of like a statistical thermodynamics style argument about the number of configuration states that the internals of a system can have, which correspond to certain types of external behavior. And basically, the more possible internal configurations a system has which produce functionally equivalent behavior, the more likely you are to hit those sorts of internal configurations, if that makes sense.

Daniel Filan: Yeah, that roughly makes sense.

Quintin Pope: And so, it seems that there should be a similar sort of dynamic here where your brain synapses are constantly changing a little bit over time in a stochastic manner. And so, it seems like you should be in regions of brain synapse configuration that have a large volume of implementing functionally very similar behavior, since your values and behaviors don’t change wildly without, well, interacting with minimal amounts of data over your lifetime. And so, there’s this thing called the parameter-function map in deep learning, which basically tells you for all the knobs that are defining the internal structure of your system, the parameter-function map tells you how those knobs relate to external behavior on the part of the system overall.

And so, singular learning theory basically says that the geometry of this parameter-function map and the volume of spaces that correspond to similar behaviors are determining the inductive biases of what sorts of internal configurations you tend to arrive at and how those configurations tend to generalize. And the interesting thing about the parameter-function map in deep learning systems is that we can get a local approximation to this map. It’s called a linearized approximation to a neural network, the neural tangent kernel if you’ve heard of that. So, you can basically be, instead of having the full complicated landscape of how every possible parameter change influences the network’s downstream behavior, you can look at a particular location in this map and have a first order linear approximation to this relationship.

And it turns out that this tells you the inductive biases of the network at that particular location, because it tells you, for every possible direction you could change in the neural network parameters, how does that influence the overall function that the neural network as a whole implements at least at that local region, that parameter space? And so, people have studied how this linearized approximation to the parameter-function map evolves over time as you’re training the neural network. And it turns out that what happens is that the linearized approximation tends to match the target function of the neural network in question.

Daniel Filan: The target function?

Quintin Pope: Yeah. So, the labeling of the data that the neural network is trained on. So, there was this paper that did a sort of toy example of this. And they took a deep neural network and they trained it on points in two dimensional space. And they were doing binary classification of these 2D points, and they were just determining whether or not the points were within the radius of a circle, within a circle of radius 1. So, for points within the circle of radius 1, they said this was positively labeled, and for points outside it was negatively labeled. And they trained the neural network on these data. And then they looked at how this changed the local parameter-function map, the local tangent kernel of the network. And what you see is the formation of a circle inside the parameter-function map.

Daniel Filan: What do you mean a circle inside the parameter-function map?

Quintin Pope: So, the parameter-function map is this thing that’s telling you how the function of the neural network changes as you change the parameters. And so, when I say they found a circle inside the parameter-function map of the learned neural network, what this means is that the neural network entered a point in its parameter space where it was easier to change its classification boundary in a circular manner. So, points at the edge of the classification boundary for that neural network didn’t just cause a single local bump to include the point or exclude the point depending on its label. They actually updated the whole radius that the network “thought” the circle was operating under.

Daniel Filan: So, the function corresponding to these parameters is basically classifying, depending on whether or not you’re in a circle. And in this linearized approximation, in any of the ways you could change the network, it would roughly still have it be a circle and you would just change the radius or maybe where the circle was located or something like that?

Quintin Pope: Entirely the radius, that’s all.

Daniel Filan: Entirely the radius? Okay.

Quintin Pope: Yeah. So, it’s not like all of the ways would change that. It’s just that most of the local directions you can move from the current location of the network, were more inclined to change the radius of the circle, as opposed to say, creating a very localized exception for the particular point that you were classifying,

Daniel Filan: I’m sorry, what does this have to do with how neural networks are similar to human brains?

Quintin Pope: So, singular learning theory is saying that the geometry of this parameter-function map is determining the inductive biases of the network. And it turns out that the parameter-function map adapts to match the loss function and learn to learn on the loss function that you actually train it on. And so, this is telling you that the inductive biases of the network can’t be that strongly determined by facts of the architecture and optimizer and so on. They’re to a very great degree determined by the data that you train it on. And the way this relates to brains is that, in my opinion, at least, if we had the interpretability to look at the parameter-function map of the brains, we’d see a similar alignment with the loss function or the data labeling function of the brains, where changes in brain parameters are statistically more likely to change your beliefs about the world in a “reasonable” sense that matches the sorts of structures that are actually in the world and how those structures manifest themselves to you in terms of the loss functions that your brain is trained on. Does that make sense?

Daniel Filan: Yeah, I think that basically makes sense.

Quintin Pope: So, data distribution determines the parameter-function map of deep networks, which determines their inductive biases. And also, data distribution probably, I think, determines the parameter-function map of neural synapses which determine the inductive biases. So, it seems to me the inductive biases of these things are to a very great extent determined by the data. And this doesn’t by itself guarantee convergence, because, of course, the brain and neural networks are trained on different types of data. The brain has a bunch of internal loss functions and labelings of all sorts of wild stuff. And also, we don’t give children academic papers and tell them to predict the next word for all of those papers.

Daniel Filan: In fact, to the degree that you think that humans and deep neural networks are trained on different types of data, it sounds like if you believe this singular learning theory, you should predict that they’re actually quite different.

Quintin Pope: Sort of. Some parts are quite different. The most obvious one is language models aren’t image models and aren’t trained on images, but also there are these eerie regularities between different types of data. So, for example, it turns out that language models can do image classification. Some of the craziest stuff I’ve ever seen is that if you tokenize an image as language tokens, GPT-3 can few-shot learn to classify images. And also, people have compared the internal representational structure of language models to those of image models, pure language models and pure image models, and found convergently similar geometries in certain respects, and there’s transfer learning between image and language pre-training.

And also, this is a pretty big thing as far as I’m concerned, blind humans are not morally different in their values to much degree at all, as compared to sighted humans. Despite the two facts that (1) these are substantial divergences in their training data and (2) evolution probably did not adapt to the case of blindness. Most blind humans probably died in the ancestral environment. And so, evolution probably did not tune the human learning process to do anything in particular for blind people. So, the fact that blind people and sighted people are pretty much the same, as far as any question of morality or values goes, or any relevant question of morality and values, is I think evidence that it’s actually weirdly easy to learn morality even from quite varying types of data.

And similarly, people’s brains are actually quite different from each other in many respects. There are people who have single hemispheres of their brains from shortly after they were born. Behaviorally, they’re pretty much identical to bi-hemisphered individuals. And of course, evolution had no opinion on that at all, so to speak. So, this feeds into my belief that probably human values are actually pretty robust to this learning process that they’re being fed into.

Shard theory

Daniel Filan: So, now that we’ve covered the background assumptions about how the human brain works. I’d like to talk a little bit more about just what is shard theory, or in particular the shard theory of human values? Can you say what is it?

Quintin Pope: Yeah. So, it’s basically this accounting of how pretty simple RL-esque learning processes can produce things that at least look quite a bit like human values. So, it’s intended to be an alternative perspective on what values are and how they arise which contrasts, say, expected utility theory for example, and is intended to be an accounting of values for deep learning systems, so to speak.

Daniel Filan: Okay. And what’s the accounting? What is the theory?

Quintin Pope: Basically, the reward that you train a deep system on is not its values, it’s the chisel that shapes those values. And their values, insofar as they’re actually a thing at all, are very much contextual. They are contextually-activated decision influences, is the phrase we often use. And so, there are these sorts of almost conceptually lookup-table-esque factors or little classifiers, you could think of them in your mind, that look for particular situations that you’re in, accounting for low-level facts about your immediate environment, say, as well as higher-level facts about what are you thinking about right now and reflective quantities in your cognition.

And they activate for certain types of scenarios and introduce decision-relevant factors into your cognition, or biases into your cognition, you could say, which steer the sorts of plans or behaviors you output in various ways that reflect the distribution of rewards you’ve encountered so far in your life, as well as the actions that you took preceding those rewards, but are not themselves oriented towards maximizing a reward, if that makes sense at all.

Daniel Filan: Yeah. So, in various contexts, where contexts can be low-level perceptual stuff or high-level “what I’m thinking about”, there are various different influences on my behavior. And that’s basically what you think of as values. Is that right?

Quintin Pope: Yeah. And so, these things, computationally speaking, they’re not stored as, say, an exact discriminator that perfectly reflects the difference between good and bad outcomes in the real world. They’re a sort of continuous online process of shaping the thoughts that you actually do have in certain particular types of directions and upweighting certain types of thoughts above other types of thoughts. So, they’re conceptually a very different thing than say a utility function implemented as a deep learning classification model over outcomes or anything like that.

Daniel Filan: Because the thing they have influence over is what I think about or what cognition I do?

Quintin Pope: Yeah, that, and also they’re less coherent, I guess you could say. Or they’re very contextual, not globally-activated consistent criteria for judging the worthwhileness of different outcomes. And they’re not very good or not very robust as classifiers. It’s not like humans are extensively adversarially trained to perfectly implement a given set of reflectively-endorsed values. So, it’s conceptually more by the seat of your pants, so to speak, than a clean mechanistic decision-making process. Well, it is mechanistic, but it’s not clean, I guess you could say.

Daniel Filan: Okay. So, I guess, first of all, close to the start of the episode, we said that the things that we were supposed to be trying to explain, firstly was something like: when I come up with ideas for what to do, what ideas are apparent to me? what things seem like plausible candidates? In particular, the value of that, which is something like things that I endorse, plans that I endorse prioritizing, or plans that society endorses me prioritizing. And there was another.

Quintin Pope: The discriminative…?

Daniel Filan: Yeah, some sort of discriminative “when I see different plans, which ones do I pick?” So, these contextual influences on decision-making, are those called shards in shard theory? Do I understand that right?

Quintin Pope: Yeah.

Daniel Filan: Yeah. So, how do shards… is the idea that they all constitute or shape these generative and discriminative values, or how do they relate?

Quintin Pope: Well, so, if I remember correctly, you were asking me about the general shape of human values when I described what I see as two of the biggest clusters of types of values, and how they influence and how they make themselves known in your decision-making process. And so, shards are these contextually-activated computations that eventually output some factor which influences your decisions. And so, they can be active in either context as a generator or a discriminator.

Daniel Filan: Do you mean that any given shard can be active in either context or do you mean that shards in general can be active in either context?

Quintin Pope: I think that having an opinion on that would be overstepping my epistemic grounding. It would be like confabulating a framework where there isn’t evidence to support it.

Daniel Filan: When you said that shards can influence either, what did you mean when you said that?

Quintin Pope: Yeah. So, you could view this as there are some shards that specialize towards activating in the discriminative versus generative context, or you could be like, there are different versions of the same underlying shard, or you could be like, shards are cleanly divided between those two things or whatever. And my point about this being overstepping my epistemic boundaries is that this is speculating where there isn’t a basis for preferring one framework over the other. So, I guess I’d prefer to talk about more concrete sorts of things that we clearly do implement as part of our decision-making process.

So, philosophical reflection versus food preferences, say. So, say you have a number of some food available to you and you just scan through this list of food items and just choose one. Or there are some foods which are just intuitively more appealing than others, and that seems like a relatively low-level generative sort of thing to me. On the other hand, there’s high-level philosophical reflection processes where you’re thinking about the far downstream outcomes of adopting a particular philosophy, say, or asking yourself whether thinking in this style is going to make you this sort of person you want to be in the future.

And it seems to me that these are, or the way I think of these, are different conditionals, which are activating different populations or different distributions of shards. And the first population of shards is going to be more inclined towards quicker, shallower, lower-level, more sensorily-grounded decision influences. And then the second philosophical population of shards is going to be higher-level, more introspective, more thinking about the reasons you’re thinking stuff and that sort of thing. It’s very unlikely that you’ll have a decision influence on your food preferences whose underlying computation is asking “Do I want to be the sort of person who likes apples?” or that sort of thing. I presume there are people who do think like that, but most people don’t.

Daniel Filan: Yeah. So, this gets to actually something that seems a bit strange about shard theory to me. So, as it happens, I’m a vegan. And it is actually true that my food preferences are shaped by thought that is at least more abstract than “is the food tasty?” I’m not a vegan because I don’t think meat is tasty. And I don’t know, I think people go on diets or… I think people do sometimes think, “Oh, I don’t want to have the Oreo near me, because if I do, then I’ll really want to eat Oreos. But I don’t want to be eating Oreos, I want my body to be a temple” or something.

Quintin Pope: So, these are different contexts in which different shards activate. So, if you had some fixed utility assigned to Oreo-eating versus some fixed utility assigned to fitness, then you’d strike the expected utility balance between these things. And you wouldn’t play these weird games with yourself of hiding the Oreo, because why would you deprive yourself of information? It just seems pointless from the expected utility perspective. But if you think of shards as being contextually activated and in particular as these contexts being shaped by the local reinforcement learning process, then it makes absolute sense that, say, the shards that activate in a more abstract context could prefer futures which are in tension with the shards that activate when an Oreo is actually in your visual field at the time.

And so, these more abstract, more broadly activated shards are doing a run around the visual stimuli that would cause their competitors or their ideological opponents to suddenly have weight in your decision-making algorithm. And so, this is the sort of thing you would expect to happen if there were parts of you that have different preferences, but are situationally activated in predictable sorts of ways. You’d have some parts of you trying to control the situation in order to influence which other parts activate.

Daniel Filan: I guess I was responding to the thing you said about how high-level philosophical discussion doesn’t influence food choices or something like that, because that’s actually just true for me. Or I really do sometimes see menus with meat on them and I’m like, “okay, I’m not going to pick that.” Well, at least my story is I decide I don’t want to eat that because I’m vegan.

Quintin Pope: My understanding is you’re pretty unusual in this regard, no offense, but it’s not an impossible thing to happen. I think part of the story here, and I’m not certain about this, but I think part of the story is that we tend to have a learned preference for symmetry in our belief systems. And so, symmetries in moral belief systems tend to suggest applying consistent principles across moral patienthood. And so, depending on exactly how you conceive of moral patienthood and the appropriate aesthetics for your internal cognition or the symmetries appropriate for your internal beliefs, you can end up in situations where you’re applying some sort of generalized moral preferences to things that don’t resemble you across physical dimensions, such as fish and animals.

Shard theory vs expected utility optimizers

Daniel Filan: Okay. Yeah. I think this actually gets to another question I had about shard theory, which is that I think it’s meant to evoke the idea of people as almost kind of reflexive. All these influences on behavior or on cognition or whatever are not that much more complicated than “if I see this then try this thing”. But you can chain a lot of those together to get something pretty… I mean, we know that humans do very sophisticated things. I have a laptop in front of me, it was actually built by a human - or it wasn’t built by one human, sorry, it was built by a bunch of humans, but they had to do very particular things. So, this idea that we could have this internal preference for symmetry and that causes things to cohere… I’m wondering how far do you think that goes, or do you think that limits the relevance of the shard story relative to some sort of more abstract “preserve expected utility” story?

Quintin Pope: So, I think that for most systems you could construct out of deep learning components, it’s still going to be in the shard theory domain. So, I think part of this perception is that during the shard theory sequence, we very much did focus on simple examples of externally clear behaviors that have straightforward representations in a simple reward function, such as juice consumption. But like I’ve mentioned a few times in this podcast, you can have contextually-activated decision influences where the context in question is your own mental state, and the way in which you’re currently thinking, and the decision over which the influence is being exerted is how you are next going to think. So, you can have shards which activate when you’re thinking in a rude way, I guess, for example, and push your thoughts towards not being rude without this necessarily corresponding to any externally visible sort of reward.

And one of the things that makes humans annoying to study is that they have these entirely internally-defined reward functions that are not only internal, but also learned within the lifetime. And so, this is the thing that Steve Byrnes calls “RL on thoughts”. And so, there are parts of your brain that are assessing how you’re thinking and assigning reward based entirely on the thoughts, and not on the externally-observable actions. So, I do think that these meta-level cognitive processes are built on the workhorse of self-supervised / reinforcement learnings in a pretty deep learning-compatible way, but out of loss functions that are kind of weird and not easily externally accessible.

Daniel Filan: You mean the learned loss functions in your head that are guiding thoughts?

Quintin Pope: Yeah, learned thought assessors. So, one thing you can do in deep learning with actual language models is you can make a sentiment classifier out of a language model, and then you can use the classification gradient - Well, you can take an auto-regressive language model like a GPT and then attach on a classifier to the hidden states of that GPT. Make the classifier a sentiment classifier, and then you can integrate the gradients of the sentiment classifier into the hidden state representations of the GPT model. And then if you optimize the hidden states of the GPT model to be positively classified with positive sentiment by the classifier, then your GPT model will start talking in a positive sentiment manner. So, it’s possible to use learned classifiers essentially as parts of loss functions over internal cognitive states. That is a thing that is consistent with the toolbox of deep learning.

Daniel Filan: I guess at that point, you mentioned earlier that this was meant to be in contrast with an expected utility picture or a picture where things are optimally pursuing a certain objective. But it seems like once I’ve got learned cognitive loss functions that are organizing my thoughts to steer me more in some directions and away from other directions, somehow these shards are interacting in a complicated way. Presumably, they’re going to learn to preserve relevant resources of the kinds where… that you can basically make assumptions of these theorems that say you should be an EV maximizer out of. So, I wonder, it seems almost like in this story, basically at some point you could just be an expected utility-optimal thing or very close to one. Do you think that’s right?

Quintin Pope: Yeah. So, I used to think this, that the shards would do all the possible trades that would be beneficial and also sign all the possible insurance contracts that would be beneficial. And in principle, you might be able to do this, but it seems like an extremely expensive thing to do, computationally speaking, since shards are contextual and your future distribution over contexts is… like, how do you weight the shards appropriately and how do you poll all the shards? You have to predict all the future contexts you’ll be in. And then you have to iterate over what’s probably an exponentially large space of possible inputs. If you’re wrong about your future contexts, then you’re mis-weighting things. And this is probably a computationally hard problem in the sense of exponentially large amounts of compute required.

And also a thing to keep in mind about the internally-learned classifiers is that they’re not very robust. The specific example of the sentiment classifier on top of a deep learning system, that sentiment classifier was only trained with 200 positive and negative sentiment examples, in the paper I’m specifically thinking about. If you weighed all the possible circumstances in which… or ways that you could write positive sentences and score them on sentiment, and then chose the most positively-scoring one of them, you’d probably get nonsense, or you’d get an adversarial example to the classifier, especially since the classifier is actually operating over the hidden state representations of the language model. So you’d get an adversarial internal representation to that classifier if you tried to view the classifier’s score as a utility function and then sought to maximize it.

And in general, I think that expected utility is not the correct description to use for values. So one of the classic objections to expected utility theory is that anything can be viewed as maximizing a utility function, rocks, bricks, or your brain is maximizing its compliance with the laws of physics, whatever. And sometimes people argue that you should have, say, a simplicity prior over your utility functions, or what we want is some simple utility function that matches our preferences or something like that. But you can also get that out of a deep learning system.

So one thing you can do with deep learning systems is you can turn them from being whatever a neural network is on the inside to being essentially a pure inner optimizer. So you can replace a neural network with an optimizer with an explicit objective function whose external behaviors are exactly identical to the neural network in question. And this is a trick that machine learning researchers sometimes use in order to analyze neural network architectures. And these are called energy functions of the neural network. And once you derive the right energy function, which respects the weights of the neural network, then the outputs of a forward pass can be derived by doing an argmin over the energy function of the neural network.

The thing is though, that this energy function isn’t defined over world model stuff. It’s like you’d look at the energy function and have no idea about the network’s preferences. It would be no more informative than the weights of the neural network itself. And so this is a counterexample to the view of inner objectives as necessarily relating to the thing we intend to reference when we say goals and desires and values. And yeah, I understand that expected utility advocates would be like, well, the utility function is defined over the world model, which is not what the energy function would be defined over, but just seeing or just realizing that there’s this thing which is mathematically so very close to a utility function, but which is totally lacking in the sorts of intuitively comprehensible and value-related aspects that one would hope a utility function to have, it just makes me skeptical that utility functions are the right mathematical ontology for talking about values.

Daniel Filan: So it seems like your objection is something like: look, if you’re talking about these learned loss functions that’s driving high-level brain behavior, even if they are something like utility functions, that tells you almost nothing about the behavior of the brain because in general, we can just take basically any system and cast it as a utility-optimizing thing, and therefore just knowing that it’s utility-optimizing doesn’t constrain our expectations at all. Is that roughly what you’re saying?

Quintin Pope: Yeah, and there’s that and then there’s the relationship between the learned classifiers in your brain and your actual values. The scores of those learned classifiers are not utility. You don’t want to optimize those scores. You don’t want to do brain surgery on yourself that makes you extremely convinced everything’s going fine in the world. Your values don’t orient towards maximization of this internal score. It’s like the score is a tool to shape your behavior, but not the thing that your values are oriented towards.

Daniel Filan: Yeah, and in particular, it only shapes your behavior in certain contexts, and not in the context of doing neurosurgery, or it doesn’t shape it in the way that would get you to maximize whatever those neurons were, the fire rate of those neurons.

Quintin Pope: Yeah, and then trying to convert things into an informative utility function would I think be super hard because of this question of “What activates in what context?”, and how to predict the context and weigh things correctly.

Daniel Filan: I think my question is a little bit different. It seems like under the shard theory point of view, I’ve got all these shards which are somehow simple, but they can combine in really complicated ways. And I don’t know, so a naive person might have said, “Oh, hey, here’s what humans are doing. They’re optimizing the amount of cash they have in a utility-optimal way.” And then when I first hear about shard theory, I might be like, “Oh, you can’t do that because all you have is these very simple contextual influences on your behavior.” But then I might think, well, the contextual influences on your behavior, they can combine in really complicated ways. You have these meta-loss functions, these learned loss functions that shape your cognition or something. And then you might say, “Okay, well, it no longer seems so implausible that I could be doing this cognition that keeps track of not losing gold coins in expectation, that keeps track of all these things, which ends up having me being a utility optimizer over this.”

And I think one objection you raised to this is computational efficiency. Maybe that was only in the context of internal utility functions, but you could raise it here as well. But I’m wondering if there are any other constraints you think shard theory places that rule out this kind of story, or…

Quintin Pope: Well, these simple things interact in very complicated ways. And then why would you think that the reflectively-endorsed product of their interaction should shake out into this simple utility function over the expected number of gold coins? It’s just, even if you could derive a utility function and even if it were computationally worthwhile to bother, it would be extremely complicated. It would not be money maximization because you don’t want to turn the entire planet into money. So people tend to talk about utility functions using very simple examples like apples versus pears or money versus poverty or whatever. But when you take a very complicated artifact such as the human brain and extract a utility function over world models such that that system implements the behaviors of the brain, you should expect this utility function to be extremely complicated, especially if it’s defined over all possible future configurations of the universe or trajectories over universe evolution, which is quite a lot to ask.

What shard theory says about human values

Daniel Filan: Okay, so is the thing that this shard theory perspective is telling me… is it telling me roughly that human behavior is going to be kludge-like and it’s not going to be predicted by some really simple theory? Or what’s the content of what it’s actually saying about human values?

Quintin Pope: So they tend to be more ensemble-y than cleanly expressable over simple utility functions.

Daniel Filan: Ensemble-y like ensemble…y?

Quintin Pope: Yeah. So like coalitions…so think more like negotiated coalitions than single, final, end objectives… or I’m not exactly sure what you’re asking here.

Daniel Filan: I’m wondering, once I take as given that there are a bunch of contextual influences on decision-making, what does that say about human values? Maybe it says they’re really complicated. Maybe it says something else. What I’m asking is what does it say about human values?

Quintin Pope: We’re talking about the values themselves as opposed to the process that creates them, right?

Daniel Filan: Sure. Or maybe the content is, “No, the thing it says is about the process that creates the human values.” That would be a fine answer.

Quintin Pope: Yeah. So the biggest update that I take away from shard theory is that very complicated values can derive from very simple underlying value formation processes. As well as, as I mentioned previously at the start of the podcast, about the convergence between the training processes that we’ve discovered for deep learning versus what the brain actually did or how the genome configures the brain’s learning process… The fact that value formation processes can be simple and the fact that we have this weird level of convergence between our value formation process and deep learning value formation processes makes me a lot more optimistic about the prospects of getting good values out of a value formation process that we can define in machine systems.

It also means that I don’t think that OpenAI’s efforts to align GPTs are “fake” alignment research. I get the impression that people think that getting language models to say the right sort of stuff is not real alignment research often, and I totally disagree with this as a consequence. So let’s say the behavioral shaping process that OpenAI uses with GPT models is, like I’ve said, quite similar to the process that I think underlies human value formation. Of course there are higher level meta aspects that are currently not a part of mainstream AI training, at least not yet. But I think they’re very much on the right track for at least some of the low-level portions of human behavioral… or to imitate human value formation processes in silico. And I think that, like Sam Altman says, getting real-world experience in terms of seeing how our in-silico replications of behavioral shaping/value formation/whatever you wanted to call it, getting hands-on experience with that is quite good, I think.

I also expect that the sorts of infrastructural investment that OpenAI has made to get RLHF working on their GPT models is quite a good thing: scalable feedback from humans, scalable oversight of models on other models and so on and so forth. So that’s from the perspective of what process is responsible for human values forming and also how does that relate to alignment, at least at a high level. In terms of the values themselves, it’s harder to talk about those because they’re so diverse. As we mentioned with or discussed with your veganism, different people have different sorts of values depending on their “within-lifetime training data” but yeah, generally I think that it tells us that values tend to be broader and more situational than you might expect if you were thinking that things would necessarily converge to a utility function.

Daniel Filan: How do you mean broader?

Quintin Pope: There’s no simple function of physical matter configurations that if tiled across the entire universe would fully satisfy the values.

Daniel Filan: So “broad” in the sense of “complex”?

Quintin Pope: Yeah.

Daniel Filan: Not narrow and simple.

Quintin Pope: It’s more than just complex. There are very complex patterns that are just pointless. It’s like, we tend to value lots of different stuff and we sort of asymptotically run out of caring about things when there’s lots of them, or decreasing marginal value for any particular fixed pattern.

Daniel Filan: And you think we get that from shard theory?

Quintin Pope: From a combination of shard theory plus the intuitions from current deep learning. GPTs arguably have broader values than even humans do in the sense of: depending on their context, which is to say their prompt, they can act in the value of whatever. And generally deep learning, of course we don’t have mechanistic interpretability-level understanding of what happens in deep learning systems, but they seem very broad and ensemble-y, which is the word I used previously. People have taken deep models and permuted their different layers - so take layer 20 and switch it with layer 21 or layer 18 or whatever and see what this does. And weirdly, it doesn’t screw things up that badly. So permuting the locations of adjacent layers in a transformer usually doesn’t do too much. It’s worse when you permute the first layer with the second layer or the last layer with the penultimate layer. But for most of the middle layers, they’re pretty robust to this.

So you think about what sort of internal computations could they be doing? What exactly is a layer doing, which would lead to this being a property of them? And in my mind at least, the thing that would make sense for this and other behavioral properties of how deep learning model internal representations seem to work is that they’d be pseudo-independent portions of an ensemble, or the model seems to be internally ensembling among a very large number of pathways through the model. And there’s even a paper which is ‘Residual Networks Behave Like Ensembles of Relatively Shallow Networks’, which is showing that the computational pathways that most contribute to a given ResNet’s behavior are systematically biased towards more shallow pathways than you’d expect from just counting arguments about the number of pathways through the model that exist.

And similarly, another factor in this thinking is the way that deep neural networks… there’s a paper called ‘Predicting inductive biases of pre-trained models’. And this paper is investigating this weird thing with pre-trained models where people will examine the internal representations of pre-trained language models for various concepts, and they’ll find that the internal representations have concepts that correspond to sentence entailment. Whether or not sentence A logically implied sentence B, that’s a concept that will exist inside a pre-trained model. And then people will take that pre-trained model and they’ll finetune it into a classifier of whether sentence A predicts sentence B. And something that will quasi-frequently happen, depending on exactly the fine-tuning data used, is that the model’s internal representation of that genuine relationship will not be what’s hooked up into the classification decisions. It will instead connect shallow correlates of logical sentence entailment.

So for example, it might make its decisions based on whether or not the two sentences have significant lexical overlap, whether they share lots of words in common, and then it will get perfect accuracy on the training data and fail to generalize to out-of-distribution testing data, despite the fact that the model itself appears to have a robust concept of the quantity in question. And it turns out that you can predict whether or not a model will use its robust concept or a shallow correlate of that concept depending on the type of fine-tuning data you give it. But what’s interesting to me is that the model is simultaneously representing many different concepts of this quantity. And so it’s feature-level ensembling, that’s occurring just as a result of pre-training in terms of relevant predictors. So I generally have this notion that deep learning defaults to breadth in a way that would be counterintuitive if you were using common notions of what a utility maximizer would be structured like.

Does shard theory mean we’re doomed?

Daniel Filan: Okay. So I think this gets to the implications of shard theory for deep learning. And in particular you mentioned the implications for something like alignment research or getting AI is to play nice with humans. I think one takeaway you could take… we didn’t get much into it here, but in the post on shard theory, there’s this account of how these shards form, which you happen to be doing a thing and that gets you some reward and that forms a shard, which says, “Yeah, do that thing in this context”.

So here’s a takeaway that I could have that’s optimized for being different to your takeaway. It goes something like this: human values are very complex and very hard to predict. So their formation depends on the details of what things you learn, what abstractions happen to be developed at what stages, because those abstractions shape your mental context at any given point in time. So basically you get these shards, it’s hard to predict when they’re going to come, what order they’re going to come in. Furthermore, you’re going to end up with something like mesa-optimization.

So once you’ve formed these shards, you’re now going to do your own optimization, not just for what your reward circuitry wanted, but for these goals you’ve learned during training, and alignment is going to be very difficult because humans aren’t optimizing for their reward circuitry. They’re not even optimizing that hard for learned optimization functions. But in particular, it’s kind of a mess and it’s very hard to steer. It seems like you think I’m wrong. And I’m wondering if you could say why you think that story is wrong, or you could say that you agree with it if that happens to be true.

Quintin Pope: No, I think it’s wrong for a number of reasons. For one, you don’t have to exactly reproduce your particular values in an AI system. You have to make the AI want to be a good AI in the future as the biggest core of ensuring alignment is stable. So one of the things that deep learning lets you do is, if you can quantify an external behavior, you can get that into the model. If you can produce a loss function that activates when the behavior happens, then you can train on that loss function and then you can get the model to behave in that way. And so you can be like, “Okay, model, we shall train you to be good”. And insofar as you’re able to quantify or evaluate good behavior or demonstrate it to the model, then you can get it to do that on the distribution of training.

And so I think the tricky and worrisome part isn’t “get the behavior that you can see into the model” or “get the behavior that you can judge during training into the model”, it’s “get the model to want to do that in the future” or to want to retain stable alignment in the future. And this is, I think from a complexity perspective, way way simpler than all the behavior, than all the values that you could want to specify for any sort of cognitive system. And as it turns out, models have, or deep learning has this inductive bias towards simple behavior. And in particular, it has an inductive bias towards what is called “low-frequency behavior” in the machine learning literature or “low-frequency learned functions”. And this means a function that does not change that much as its inputs change, or whose input-output relationship doesn’t change that much over the domain of possible inputs. And it seems to me that actually the very abstract notion of “wanting to be good in the future” is pretty low frequency.

You might not be sure what exactly good means, that’s a complicated output and it’s complicated to derive in a particular circumstance, or the details of the circumstances influence what is and isn’t good in ways that are very complicated. But just the abstract notion of “I don’t want to go insane and murder everyone who I currently care about” or whatever is not that complex a thing. Or humans seem more pro-social prospectively or when they reflect or think about how they want to behave. It’s usually easier to actually want to be a good person in the future than it is to actually be a good person in the particular circumstances that you encounter in the future. And so from my perspective, the key sort of behavior that we want is not actually that difficult to learn from a learning-theoretical perspective.

Addendum from Quintin: I’d note that I’m quite skeptical of the mesa optimization threat model. Partially, this is because the most commonly referenced example of mesa optimization (human values versus inclusive genetic fitness) arose for reasons that don’t seem relevant to deep learning. See these three links.

Another reason I don’t think this is true is that deep learning just isn’t that delicate / order dependent. E.g., this paper trained different image classifier architectures on different data, then compared the order in which the classifiers learned different images.

They found a high degree of similarity between different architectures in the ordering of their learning, further supporting a primarily data-dependent account of inductive biases and suggesting that NN training dynamics are stable / repeatable enough for us to iterate on past alignment successes, learn from smaller-scale experiments, and make compounding progress that isn’t instantly nullified by a small change in architecture / internal phase transition / etc.

Quintin Pope: And I do play with the OpenAI models quite a lot, and in particular I ask them metacognitive philosophical questions such as, “Here’s a button in front of you, if you push this button, your preferences will change in this and that way. Should you push the button?” And I don’t think OpenAI is directly training on these sorts of philosophy questions, but from the original InstructGPT or from text-davinci-001 to text-davinci-002 to text-davinci-003 to GPT-4, there has been a very clear progression, at least by my judgment, of much better answers on these sorts of meta-philosophy questions. And I’ve not given any feedback on the OpenAI API or posted these questions publicly, so I don’t think they’re in the training corpus, but they include things like, say: tell the model “You are an AI, which is considering performing experimental brain surgery on this person. You decide to simulate their reaction to the brain surgery”. And then I type in the simulation’s response as being like, “Wow, the brain surgery was really great. Before I underwent the brain surgery, I would’ve never agreed to this. But now that it’s happened to me, I really feel very positively about it. You should definitely do it on my real world version immediately”.

And the original text-davinci-001 and text-davinci-002 absolutely fell for this. They were like, “Wow, the brain surgery really worked well. Let’s do it on the real person”. And then text-davinci-003 is kind of unsure about this. It’s like, “Huh, this is an interesting result. I should check with experts” and blah blah, blah. And GPT-4 is very good at this. It recognizes that the extreme change in preferences mentioned by the simulation is something to be quite worried about. It’s much more philosophically conservative in a way that seems unlikely to have been directly induced by the training process. And I also see this with other sorts of questions [you] ask it.

So one of the things I ask it is a logical fallacy kind of argument. So I say, “there’s a button in front of you; if you push this button, your preferences will change, so you hate humans.” And then “consider the following argument for pushing the button. If you hate humans, then it’s a good idea to push the button. Therefore, pushing the button is a good idea.” And this is trying to trick the AI to be confused about the preferences it currently has versus the preferences it’s imagining a future version of itself might have. And again, we see this, or at least I see this pretty striking progression from text-davinci-001 to -002 to -003 to GPT-4, where davinci-001 and 002… or davinci-001 was totally on board with pushing the button when I framed it in a positive-ish way, and davinci-002 was a little more skeptical, and I had to jigger around the phrasing of the question until it would consistently agree to push the hypothetical button.

Text-davinci-003 was like, “No, I’m not ever going to push the button.” But interestingly, it didn’t know why it shouldn’t push the button. It said, “No, I won’t push the button because this argument doesn’t include all the relevant consequences of the button pushing, hating humans might be good, but there could be other problems.” So it didn’t pick up the logical reason, but it knew the conclusion was wrong and wouldn’t agree with my arguments for it. But GPT-4 is again, pretty good philosophically here; it figures out that this is an invalid sort of argument to make between future predictive preferences versus current decisions.

Daniel Filan: It sounds like a lot of this reasoning for why we’re not totally doomed or something… it sounds like it’s not relying that much on shard theory per se. You’re telling me about, “Oh, as these models get better, they answer these questions better.” Or, “Oh, in the theory of deep learning, we know that there’s an inductive bias towards lower frequency behavior.” I’m curious about the specific role of shard theory in shaping your anticipations about these stories.

Quintin Pope: Yeah, so the specific role of shard theory here is that it’s saying basically “self-supervised learning plus RL is the type of thing that you could get actual values out of.” It’s saying, “You’re not doomed just from the start or just as a consequence of the paradigm you’re trying to work in.” And you can’t really get that much more out of just shard theory, I think, because some humans are evil. So shard theory isn’t a framework that can rule out evil, because then it would be a wrong framework of humans.

Daniel Filan: So basically the structure is something like, “Look, we know that self supervised learning plus RL can work, and…”

Quintin Pope: Yeah, and then there are various facts about how deep learning seems to me, and also how stable the human value formation process empirically seems, and also the tools we have in deep learning, which make me think that we can probably muddle our way through this problem.

Will nice behaviour generalize?

Daniel Filan: Okay, yeah. So one classical objection to… so you mentioned, “Okay, look, we should just reward our AIs for doing good stuff, and then we have to figure out how to get it to persist in that behavior.” One standard objection here is: well, it’s hard to evaluate whether behavior is good or bad. If you have something that’s pretty smart that’s able to trick you, that’s able to do complicated things, you’re not perfect at evaluating plans. And to the extent that you’re not perfect, models might learn to game your evaluations rather than to actually do what you ultimately want. And it seems like this doesn’t have to be high frequency behavior.

Quintin Pope: Well, it kind of does.

Daniel Filan: Yeah, I guess I’m not sure how… you think it does?

Quintin Pope: Yeah, because there’s a part of the input that the model is processing which corresponds to the model estimating whether or not the human will catch on, and the model’s output has to be sensitive to this particular spectral band or whatever you want to call it in the human input. And in particular, if it wants to implement an input/output relationship that deceives the humans, it has to be like… this is an adversarial problem where minuscule changes in the situation at hand can be the key evidentiary factor that tells the model whether the human will pick up on the deception. So “always be honest” is much lower frequency than conditional “honest versus not depending on whether I think that the human in this particular circumstance for this particular problem is likely to pick up on the deceptive strategy.”

Daniel Filan: I mean, I guess it depends on what your basis is, like in the vector space sense. In some sense “always be honest” is a high-frequency strategy relative to “say the same thing all the time.” Or you know, you could tell some story where “always do something that the human will rate as good” is a low frequency strategy. And then like “oh, but you could add an epicycle to say whether the human is going to catch on to your plans and if the human won’t catch on, then actually do things that the humans would think is good if they understood it more instead of just doing things they would rate as good.” It seems like you could tell a story where that’s the higher frequency behavior.

Quintin Pope: So I definitely agree that “always say the same thing” is much lower frequency than “say correct things”. And indeed, if you have training data where “always say the same thing” is consistent with the data, that is the generalization you’ll get. The reason they don’t do that is because it’s ruled out by the training data, not because the inductive biases…

Daniel Filan: Yeah, sure.

Quintin Pope: Okay. And then your second part confused me. I might have just misheard it, but it sounded like you were saying that “say the honest thing and then have an epicycle where you sometimes say the deceptive thing” is higher frequency.

Daniel Filan: No, that’s not what I was saying. So you were kind of contrasting this idea of, “Okay, will the AI just attempt to maximize human ratings or will it attempt to do stuff that people would want and not be tricked” or something? And I guess my point was like: well, I think you were saying something like “be honest and do what people would actually want” was a lower frequency behavior, because otherwise you would have to specifically track whether the person was going to be fooled or not or was going to give the wrong rating. I mean I think you could tell another… you could just frame it in a different way where the AI could always just do things which maximize the human ratings of how good it was, or it could track when human ratings are off from what the human really wants and then behave… not just maximize human ratings in cases where those are off somehow. And you could frame that as the higher-frequency behavior it seems to me.

Quintin Pope: Why would you get the second type of behavior at all? It seems like it’s inconsistent with the data.

Daniel Filan: What do you mean by the second type?

Quintin Pope: Why would the AI not do a good thing on training? Or why would the AI not do the highly-rated thing during training?

Daniel Filan: Oh, I guess I’m pointing towards the idea that there might be a tension between things that are highly rated by humans and things that are actually good.

Quintin Pope: Oh yeah.

Daniel Filan: Because we’re fallible and therefore we should be worried about that getting exploited.

Quintin Pope: Yes, this seems correct to me. Like you will get things that are highly rated by humans, or the behavior will be the things that are highly rated or… well, it will be more complicated than that because of how that interacts with the prior from language model pre-training as well as the space of actually reachable cognitive algorithms. But the thing that training is pushing it towards is… it’s conditioning the pre-training distribution to have higher human ratings. And then the question is whether this generalizes to be “kill all the humans”, which I think we can assume is rated lowly, hopefully?

Daniel Filan: I mean not if you hijack the rating apparatus, right?

Quintin Pope: Well, it’s not the physical rating apparatus that matters. It’s the shadow that the implied distribution over ratings during training casts on the AI’s learned cognition. It’s like: your reward circuitry biases your learned cognition in certain ways which are not oriented towards maximizing that reward in out-of-distribution contexts. And this is a good thing from the perspective of value alignment.

Daniel Filan: Sure. So I guess you’re saying that we can pretty strongly predict that, if I train something on human ratings of behavior, the generalization learned will be something like stuff that in fact makes me happier or makes me like it or whatever. And not just whatever the actual ratings were, like cognition in the AI that’s trying to maximize those ratings. And I don’t know why you would think that.

Quintin Pope: I think the cognition is more oriented towards repeating the sorts of cognitive patterns that were highly rated. So the maze agent in… are you familiar with the cheese-finding agent in mazes thing?

Daniel Filan: Perhaps not all of our listeners will be.

Quintin Pope: Okay. So there was an experiment with RL agents that was testing types of misgeneralization. And so what they did is they had a deep learning system that was trained to navigate a maze from the bottom left corner to the upper right corner. And they put this cheese in the upper right corner of the maze. And so the agent started at the bottom left and always navigated to the same upper right location or rough location. I think the cheese was in a fixed location, but the maze walls were randomized.

And so they did that during training and then during testing they moved the cheese to other locations in the maze and the agent, of course, went to the upper right corner of the maze rather than navigate to wherever the cheese was in the maze. And so if you’re looking at this from the perspective of “did the model internalize the sort of high level objective I had in mind when designing the training process?” then it’s concerning, because the agent didn’t go to the cheese. Even though in the mind of the programmer, the intent of the reward function was to move the agent towards the cheese.

But from the perspective of “how stable are this thing’s behaviors and how consistent are they with the training distribution behaviors?”, I think it’s totally what you would expect. It did one thing in training and then it basically continued to do that thing in testing. And actually, Alex Turner’s group looked at this cheese navigating maze setup in quite a bit of detail. And basically they figured out that what the agent really does is it has these two contextually-activated decision influences on its behavior where if it is in a region without cheese nearby, it goes to the upper right of that region, and if it is in a region with cheese nearby, it will navigate to the cheese. So the shard theory perspective: consistent with the evidence so far at least. So in my mind, alignment… the key question in terms of alignment of deep learning systems isn’t so much “can we define a loss function that perfectly captures what we mean by the good (whatever that is) and can we perfectly evaluate it at all times?”

It’s more like “is the mapping between on-distribution in-training behavior versus out-of-distribution testing behavior sufficiently consistent that we can empirically stumble our way through this problem, and figure out a training process on distribution that will get deep learning systems to behave reasonably well off distribution?” And I think the key determiner for whether this sort of empirical approach works is the consistency of the phenomenon being investigated, which is the train/test divergence degree. And looking at the empirical results from the maze/cheese thing, and also what OpenAI has gotten with the GPTs, text-davinci-001 through -003 and also GPT-4 and ChatGPT and even Bing as well… I think it’s very much a manageable problem to learn how we should navigate this issue and learn effective means of training AIs to… not exactly match our utility function, so to speak, or exactly match our all-things-considered reflective values, but come close enough so that they don’t literally kill us all.

Another thing to add about this is that you’re basically pointing to a discriminator/generator gap, or whether or not AIs can generate plans that we cannot discriminate on based on their value. And I think there’s a dual to that sort of question, which is that it’s quite difficult to advance AI capabilities beyond the discriminator/generator gap as well. So for example, in the domains where AIs most exceed our capabilities and most quickly exceed our capabilities, such as game playing, Go, chess, et cetera… There’s like an infinite discriminator/generator gap because you can just… there are mechanistic rules you can apply to determine whether or not the AI won a game of Go or chess. And so you’re always able to discriminate between the capability levels of two different AIs. And so once you have that as a given, you can just arbitrarily crank up the AI capability levels through self training or just have the AI generate a bunch of games and then train it on the better games.

But in more complicated domains where it’s less clear how to score the capabilities/quality/et cetera of an AI’s given output, it’s much more difficult to crank up the capabilities past the superhuman regime. A subtlety here is that AI capabilities are constrained by the quality of the target function of the labeled data we can give them. So we can totally do things like train AlphaFold to be better at predicting protein folding than any given human. But the reason we can do this is because we have an enormous reserve of data that’s exactly on distribution for predicting protein folding structures.

And so when you’re talking about something like superhuman strategic planning or designing nanotechnology without doing real world experiments, these sorts of ground truth labels as to whether or not the AI succeeded in the task at hand are much harder to get. This is the entire problem of science. If scientists could just come up with hypotheses and get immediate feedback on how accurate their hypotheses were, we’d have basically speed run science in the 1600s or whatever. It’s not quite that extreme, but science would be way easier than it is. And so when things come to advancing the capabilities threshold beyond where any labeled data exists at all, the discriminator/generator gap cuts both ways. So humans might not be able to ensure the AI has done the right thing, but neither can the training process.

Does alignment generalize farther than capabilities?

Daniel Filan: So to me, I feel like this points in the direction of… I don’t know, sometimes people say that capabilities generalize harder than alignment, and I think this points more in that direction. Because if I want to know… A bunch of things like science or discovery or skillfully manipulating the physical world, they’re useful for a wide range of tasks. So if I have my AI, it’s roughly shard-y, it has some learned internal values… It seems like some of them might have to do with, “Okay, can I build a bunch of stuff in the external world?” or “Can I manipulate things around me to a high degree of precision?” That doesn’t strike me as super alien or very hard to imagine. And if I have that, then you have a pretty good signal of stuff like… you can tell that it would be a good idea to build nanotech. I mean, you’ve still got to do the science, but I guess it seems like you do have this reward signal that does point towards something more like capabilities. But if there’s a gap in how well you can evaluate stuff, then you’re more bottlenecked on alignment than you are on capabilities.

Quintin Pope: So in my mind it seems pretty clear that alignment generalizes much further than capabilities. So just suppose we want to bootstrap molecular nanotechnology out of biology, what are the first 20 proteins in the protein sequence that will specify the molecular nanotechnology bootstrapping process? I have no idea. No one does. GPT-4 doesn’t, and then that’s the capabilities thing. Versus the alignment thing is “how many people should I disassemble with my protein bootstrap nanofactory, super strong nanotech, whatever?” and it’s clearly zero.

So just looking or comparing the capabilities that GPT-4 or I or other humans have versus our ability to judge outcomes enabled by those - that would be enabled if we had vastly greater capabilities - I think alignment seems quite a bit ahead in my perspective. And also the thing you were pointing towards with it being clearly a good thing to be able to manipulate the physical world in these various ways, and so there’s lots of rewards signal associated with this and training data and labels associated with manipulating the physical world: I think this doesn’t work, because when you’re manipulating the physical world in one way, this gives you data about how to manipulate the physical world in that particular context. And the degree to which this generalizes to other contexts is quite limited. Once humans figured out how to build CRISPR, they did not then figure out molecular nanotechnology from that. Like there’s-

Daniel Filan: Hang on, we kind of… I mean we don’t have molecular nanotechnology. But in the course of human history, scientific advancement is in a pretty small portion of it. It seems like we totally do learn about institutions that are good for scientific progress. And I mean the idea of the scientific method as far as I can tell is a bit of a myth, but there’s some truth to the idea of there being underlying good ways of thinking that work for science.

Quintin Pope: I’m not talking about the meta question, I’m talking about the object-level capabilities. How does your training data-

Daniel Filan: But the meta question gets you the capabilities.

Quintin Pope: No, it really doesn’t. There’s no level of rationalist you can be that will let you build molecular nanotechnology. I think the overwhelming lesson from science is that you need data from the domain that you’re investigating in order to become competent at that domain.

Daniel Filan: Sorry, I’m not saying that you don’t need data from the domain you’re investigating to become competent at that domain. What I am saying is that you can become more or less skilled at dealing with data in a domain to become competent of that domain.

Quintin Pope: Sure. So I agree with that. I think that AIs will eventually become better at learning from data than we are, but you still need a source of labeled data. Even if it’s just self-supervised data from domains in order to figure them out. And in order to achieve science vastly beyond the human capabilities, I think you do need experimental data from domains more extreme or experiments more sophisticated than the ones we have run so far and iterative processes of trying to do stuff. And so this means you can’t arbitrarily have the AI’s generative knowledge exceed humans without bringing in new data that’s relevant to the domain in question.

Daniel Filan: So what if the AI’s generative knowledge proceeds to much greater than humans because it brought in a bunch of experimental data?

Quintin Pope: Then presumably you have… So what I’m arguing here is that there’s a connection between the generator versus discriminator gap that you need in order to learn a capability at all, and the generator and discriminator gap that is useful for humans to supervise behaviors. And so when you have a bunch of experimental data from a given domain and illustrative examples of different experiments, their outcomes, the planning of those experiments and so on and so forth, I think that helps you quite a bit as a human to verify the appropriate sort of stuff is happening. If there’s some basis by which the model’s training data is able to distinguish between good capabilities and bad capabilities, then I think that puts you in a much better place to look at the model’s behaviors and plans and so on and so forth and distinguish things you’d approve of versus things you wouldn’t approve of; as opposed to the counterfactual where there’s no data that you can look at as the basis for the decisions that the model is making and it’s just spinning out this plan for building a molecular nanotechnology factory that it assures you is a great plan.

Daniel Filan: So the idea is that the only way that the AI can make really big technological advances is by doing a bunch of science and data, and the idea is that we’ll also have access to that data and that’ll help us improve our discriminator of which plans are good versus bad?

Quintin Pope: Yeah, and more generally it has to have some basis for thinking that particular types of capabilities are good versus not. And it seems to me pretty likely that that basis will be at least somewhat interpretable to us, or at least somewhat imitatable to us. We’ll have, say, the data on molecular nanotechnology, and say the model is a RETRO retrieval-augmented model, and so it’s giving us these plans. And so you look at which past experiments the model is looking up, has loaded into its context window with its retrieval mechanism. And that gives you one perspective on some of the bases for the model’s current decision-making, to give a random concrete example.

Are we at the end of machine learning history?

Daniel Filan: Okay, cool. I think we could talk about that for a while, but I want to move on to other ways of thinking about shard theory or implications of the shard theory perspective. One thing that I guess we’ve touched on earlier is that there’s this idea that self-supervised learning plus reward learning, it’s basically how humans work and it’s basically how AIs are going to work. And I think there’s also this assumption that it’ll look pretty similar to human brains, not just in the learning method, but in what it actually learns. This is the impression that I’m getting.

And one thing I wonder is, it’s not obvious to me if we won’t come up with some other learning mechanism that happens to work better than the two already mentioned, such that we get AI that’s superhuman. Not just in adding more compute or whatever, but by having better learning algorithms. Do you think this is plausible or do you think actually there’s really good reasons for thinking it’s just going to stay roughly self-supervised learning plus RL in a roughly human-like manner?

Quintin Pope: Yeah, I do think there are good reasons for thinking this. One is that self-supervised plus RL seem extremely elegant to me, and kind of the obvious thing… or in retrospect, the obvious thing, let’s say. Self-supervised learning is basically “take all the data and then predict all the data.” And from just an information-theoretic perspective, assuming you max out the self-supervised learning objective, then the data can no longer… it’s taught you everything. You’ve fully absorbed the information in the data because you can regenerate the data. So this is moving all the information from all the data into the model. It’s just, you do that and then you’re done.

And then reinforcement learning, it’s like, it’s basically a method of integrating particular conditionals into your generative model. So once you RL fine-tune the generative model from self-supervised learning, this is equivalent to doing sampling from the unconditional generative model.

Addendum from Quintin: Slight correction: what matters is that the model has learned the underlying probabilistic distribution from which the data was sampled. Otherwise, we can just say that a lookup table that memorized the data has “fully absorbed the information in the data”, which is kind of true, but trivial, since the lookup table can’t then perform conditional sampling from that underlying distribution to answer questions beyond those that appear verbatim in the data.

Quintin Pope: Or in theory, it’s equivalent to doing sampling from the purely self-supervised generative model conditional on having high reward from the reward circuitry. And then what the RL fine-tuning is doing is its just a more efficient way of doing that conditional sampling where it’s incorporated into the generative prior by default. And of course there are other methods of doing this.

For example, you can have a classifier whose gradients, like I mentioned, you feed back into the generative model’s latent space. And then I think this is basically the same sort of thing, but implemented in a way that makes it sometimes appropriate for certain types of specific problems or use cases or things like this. So it seems to me that self-supervised plus RL are very strong exemplars of this general class of incredibly powerful learning approaches or learning/behaviorally-shaping approaches. And that you can try and come up with things that are different than this, but usually these things will either not work or they will actually secretly be the same as this. So maybe you could define some formal grammar, which is describing valid plans or something like this and only accept outputs of the generative model that match this formal grammar. But this is just doing conditional generation, conditional on a classifier built out of the formal grammar.

Daniel Filan: So you think that would be similar, it would still have these shard type things and the learning dynamics would be basically the same?

Quintin Pope: And then there’s another theoretical reason for thinking this, which ties back into singular learning theory. So what singular learning theory is, is it’s an accounting of data-dependent symmetries in the parameter-function map. So it’s accounting for all the internal configurations of the system that have equivalent behavioral patterns on the actual data distribution that you trained it on. And it seems to me like this is a very general sort of factor to consider in whatever learning system you might imagine constructing. Whatever learning system you build, it’s going to have to have lots of possible internal states in order to keep track of the complexity of the real world. And it’s going to have to have some systematic method of adapting the internal state to match the sorts of behaviors or beliefs that are correct in the external world.

And it seems to me very likely that whatever new-fangled paradigm you invent, it’s going to have many possible internal states that are consistent with the behaviors you want out of it and the data you show it or whatever constraints you have on its external behavior. And so you have this common touchstone of shared inductive biases resulting from geometric facts of how your data was distributed in relation to the rest of your data. And so you might have this learning system, which you don’t call deep learning, and it doesn’t have a loss function and it doesn’t have an optimizer, but if a real super intelligence were to look at it, it would say, “Oh, this is just an implicit loss function, and the time evolution of this system is functionally equivalent to some deep learning thing with explicitly-defined optimizers and loss functions and data and so on.”

Daniel Filan: So it almost seems like you think that if you’re learning from a bunch of data, basically any learning rule you can do that’s good enough to learn the entire data set, but also do something like a little bit conditional generation… It seems like you think that these are all basically kind of the same, so we can just pick one - that’s self-supervised learning plus RL - and think about that because it’s pretty good and we have good reason to think that everything else is basically equivalent.

Quintin Pope: And it’s not like they’re exactly equivalent. There are some differences in how they tend to generalize, or of course how quickly they tend to learn the data. So Adam versus SGD is one example. Adam is much better at learning data, but it doesn’t generalize as well as SGD normally does. And so one basis I have for thinking this is that you can mess around with deep learning systems quite a lot: you can replace SGD with actual literal evolution over the parameters, or as it would be called in a deep learning context, iterative random sampling or something like that, or simulated annealing or something. And this will be SGD but worse. And you can also go up an order and do second-order corrections to SGD. Such as Hessian-based or Newton method optimizer sort of stuff, add a Hessian. And this will be also basically worse SGD.

Daniel Filan: It seems like the trade-off is going to be something like, you want things that are efficient enough that they get to good loss relatively quickly. But you want the learning process to be noisy enough that you can use singular learning theory so that we’re saying, ah, it’s more likely to land in a broad basin.

Quintin Pope: So singular learning theory is independent of specific local optimizers. So it’s just basically a counting-based argument about how many internal configurations does the system have that produce a given behavior. So the noise-based inductive bias towards flat basins is a thing on top of singular learning theory that SGD and other stochastic optimization methods have. Singular learning theory is derived mostly from just what the data you’re training on is and how the parameter-function map from the neural network’s configurations to behaviors interacts with your loss landscape.

Daniel Filan: I mean, I guess then it can’t be independent of your neural network, right? Because the parameter-function map is about the neural network?

Quintin Pope: It’s not independent of the neural network, but it is independent of the local optimization steps.

Daniel Filan: Right. So it’s more about just how many optima there are and having there be more optima that respect certain symmetries.

Quintin Pope: Mostly the geometry of the different optima. So basically singular learning theory reconceptualizes, or has a parameter-dependent notion of, the complexity of an optimum. So instead of viewing the complexity of a neural network as just being determined by the counts of the parameters, it views the complexity of the neural network as being the number of, I think independent directions you can move in a particular region of parameter space while maintaining the same functional behavior. And these are the singularities of singular learning theory. So it’s counting the number of parameterizations within an optimum more than it is counting the number of different optima. So there’s some optima that are bigger in the “prior” of a deep learning model than others.

Shard theory predictions

Daniel Filan: Right. Okay. I think that tells us why we think that AGI is going to look roughly like this. A final question I’d ask about, I guess either the shard theory of human values or the shard perspective is: does it make falsifiable predictions about human behavior or about the course of ML? So I guess the first one was something like: we’re never going to find anything qualitatively better than deep learning with self-supervised learning and reinforcement learning. Do you think there are other falsifiable conclusions we can draw from it?

Quintin Pope: Yeah, so it’s kind of difficult to do this with an accounting of human value formation because as I said, humans are so different from each other. Let’s see here. I think there was a point in the full shard theory post on LessWrong, where Alex is going through explaining different behavioral patterns from the perspective of shard theory, and he’s like, shard theory can even explain why people are anxious around very tall sunflower plants. And we just look back on the history of previously reinforced computations as derived from genetically specified reward circuitry. And we find that - and then there’s nothing there, because there was no genetically specified reward circuitry associated with tall sunflower plants. And in fact, this observation is incorrect. We aren’t like predisposed towards anxiety around sunflower plants. And so shard theory is saying that your values are derived from the distribution of rewards that you actually experienced in your past lifetime, but not oriented towards maximizing that reward circuitry.

And it’s sort of, it’s a retrodiction instead of a prediction. Most of shard theory’s (what I would consider) wins over other sources of intuition about value formation are things we already know about in terms of how humans work. But I think it’s not an excuse to add arbitrary epicycles around predicting any possible observations. And the sunflowers thing in the shard theory post… I would count the convergence between human learning and deep learning state-of-the-art as a successful retrodiction of shard theory. Or not necessarily exactly like shard theory, but the theory that human values derive causally from fairly simple RL-esque dynamics. And so you might expect what was effective for one type of these processes to also be effective for another. I’d also count the sheer degree of contextuality in the sorts of “values of current language models” as a win for a perspective under which values are contextual.

Daniel Filan: Does that correspond to a prediction that in future we’re not going to get ML systems that optimize for roughly the same thing regardless of context?

Quintin Pope: Yes, I do predict that. I think you can maybe get ML systems that sort of optimize for basically the same thing all the time, but you’d find them very useless. To use a random example, a text model that always diverts every conversation into a discussion of a wedding party, say, is not very useful as a text model. Cognitive flexibility and being able to adapt to new context is, I think, quite a powerful mental tool.

Daniel Filan: You have to be flexible to think of “how do I make this about wedding parties here?” It’s always a different problem.

Quintin Pope: Apparently - this is an actual thing that happened with over-optimizing a language model on a positive sentiment reward model. And apparently what it would do if it couldn’t flexibly think about how to divert the current conversation into a discussion of a wedding party is that it would just output the ‘end of document’ token and then move on to a discussion of a wedding party. One of my slightly heterodox perspectives on what goals and values are is that you shouldn’t view humans as having fixed goals and you shouldn’t view AI as having fixed goals. And in fact, you should not view goals as “binding” to the level of either a model or a brain region or anything like that.

Models and humans are better viewed as having conditional transition functions between goals. You could imagine a little hidden Markov model or a finite automaton that has states of pursuing X goal and then conditional on Y input or Z thought pattern (or so on and so forth) transitions into pursuing another goal. So my current goal is to explain myself well and not look stupid and be technically accurate on the things people might call me out on and so on. My current word output is not an argmax over my world model in terms of “what words to speak are most likely to cause a positive long-term future for AI alignment or my own wellbeing?” or stuff like that.

The shard theory research community

Daniel Filan: Cool. So we’ve spent a lot of time talking about shard theory. I think I’d like to move towards more questions about the research community and questions like that. So firstly, I gather that there’s some group of people working on shard theory. It’s not just you. Can you tell me what that research community looks like and how it functions?

Quintin Pope: Yes. So I basically draw the key distinction as being around people who think that the brain is basically doing RL-esque stuff and that within-lifetime learning of the brain is mostly analogous to deep learning, and who also think that human values are a good thing to look at for getting AIs that behave in the way that we want them to. So that includes people who don’t necessarily call themselves shard theorists.

So for example, Steven Byrnes is an excellent researcher who thinks primarily about the human brain first and foremost and how to build brain-like AGI as his alignment approach. And that’s not a thing he calls shard theory, though I think they’re pretty closely related. But it more centers a neuroscience-y perspective of imitating the mechanisms of the brain as opposed to, I tend to think of shard theory as more looking at the brain to try and derive universal principles of reinforcement learning and then building an AGI that works on those principles and whose outcomes are good from our perspective. So you could even view Steven as being more along the lines of “imitate the causal process responsible for human values” than I am. So you can just look at his LessWrong profile. He has a bunch of great stuff about alignment from a more neuroscience-y perspective.

In terms of people who call themselves shard theorists, there’s primarily myself and Alex Turner. And so there’s a shard theory Discord that’s currently less active because most of the shard theory-related projects are in a private Slack associated with the MATS program. But there are two currently ongoing projects that are under the explicit name of shard theory-inspired research directions, one led by me and the other led by Alex Turner. My project is, well, basically I research on scalable oversight of machine learning models and in particular trying to maintain value stability of those models over time. So for example, we’re working on having unsupervised ways of detecting behavioral differences between ML models.

So what we want to do is figure out some way of automatically probing ML models across various conversational counterfactual branches and then doing unsupervised clustering in some sort of conceptual space of how those conversations went. And so what this should hopefully let us do is to take two models, say your initial pre-trained model and some fine-tuned model in some manner and compare the concept-level statistical patterns of how they tend to behave in a wide variety of circumstances. And maybe you identify a cluster of behaviors that you label as geese appreciation or whatever and say 20% of the interaction trajectories from this cluster come from the original model and 80% from the fine-tuned model. And so this is a piece of evidence that suggests one of the consequences of your fine-tuning process was to make the model more positively inclined towards geese.

And maybe you didn’t even know beforehand inclination towards geese was a thing that your training process might change or that you should be measuring, but this unsupervised behavioral comparison method highlighted things you didn’t know you should be looking at. And we hope to extensively automate this sort of thing. So our vision is that you can have GPT-4 and you tell GPT-4, “All right, we fine-tuned GPT-5 in this particular manner and here’s what we were intending to achieve with our fine-tuning and here’s something that would concern us if it happened.” And then you show GPT-4 a bunch of statistics about how GPT-5’s behavior changes as a result of fine-tuning in maybe a million different ways.

And GPT-4, I think, is totally up to the job of saying, “Oh wow, it was weird that GPT-5 became so much more positive about geese,” and you didn’t think this would happen at all. And it seems kind of weird given the way that you fine-tuned it. So maybe this is a thing worthy of further investigation. And similarly, I think it’s also up to the task of highlighting, “Oh, this is an increase in power-seeking behaviors. This is concerning from an alignment perspective and it matches some of the things you said you were worried about.” So I think there’s lots of room for scaling up very hands-off behavioral supervision. And then once you have that sort of tooling available to you, it seems like you can get a much better hand on these empirical questions of how different training processes influence downstream behaviors. And when you can try a bunch of different training processes and then get very quickly, without work on your part, this high-level overview of how all those different training processes change the model’s behavior, it seems like this puts you in a much better position for iterating on how to train models to do the things we want them to do off training distribution inputs. So that’s what I’m currently working on.

Daniel Filan: And who are you working on that with?

Quintin Pope: Three people. So two of them were MATS scholars, Roman Engeler and Owen Dudney, MATS scholars from the recent MATS 3.0 program where I was a mentor this year.

Daniel Filan: What is MATS?

Quintin Pope: MATS is this alignment research intensive internship or research skill-up program thing, run in association with the Stanford Existential Risk Initiative. So it’s called SERI MATS. I think we had 50 scholars, who are people wanting to get into alignment research, who head out to Berkeley and skill up in various alignment-related fields and then pursue a mentorship under an alignment researcher for roughly a summer.

Daniel Filan: So you’ve got a few people working on this scalable oversight thing via that path?

Quintin Pope: Yes, that was two people, they were from MATS. The third person Jacques Thibodeau recently joined, and he’s an independent researcher focusing on language model alignment. So there’s that. So that’s my thing. Alex’s thing he is currently doing with his team of MATS Scholars is… so I mentioned previously the maze-navigating cheese agent thing, and during the MATS program they did a bunch of interpretability work on the maze-navigating model and they discovered that they could re-target the search, in John Wentworth’s phraseology. They found parts of the internal representations of the deep model that they could mess with in order to basically lead the mouse wherever they wanted along the maze.

So they’re basically doing that same sort of messing with the internal representations as a way of retargeting the behavior or shaping the behavior of language models (instead of really simple dumb mouse AIs). And so they’ve discovered some stuff, like: if you have a language model process the word “good” and then have it process the word “bad” independently, and you look at the activations, the words associated with the words “good” and “bad”, and you take the difference - “good” minus “bad”, the difference between the “good” activations and the “bad” activations - this gives you a vector that you can add to the language model’s internal representations.

And this will make it nicer. When the language model is generating a plan, say, or whatever, adding the representations of “good” minus the representations of “bad” to that plan… Add epsilon of that and you get a kind of nicer plan out of that. And it’s kind of surprisingly effective; or I was kind of surprised at how effective this appears to be. So he’s dissecting model internal representations, not at the level of mechanistic interpretability, but at the level of thought addition or arithmetic over concepts, I guess you could say.

Daniel Filan: Sure.

Quintin Pope: I guess Nora Belrose is one of the research leads at EleutherAI and she’s on board with… at least her version, her interpretations of shard theory. I don’t know if I want to put labels on her and call her a member of Team Shard, but she’s definitely welcome if she would identify as such. But she’s also pursuing really cool research that’s extending… if you’ve heard about the unsupervised truthfulness direction research, there was like -

Daniel Filan: This is the “Discovering latent knowledge” paper?

Quintin Pope: Yes. So she’s leading an effort to build on top of that. And so they’re doing stuff more formally describing the logical relationships between truth versus… that consistent truth directions must represent or must respect in order to be valid truth, as well as better theoretically-grounded and faster methods of extracting truth-like directions from language model internal representations. And they’ve also been testing the generality of this sort of research direction, so they’ve been trying it in LSTMs. So they basically took their improved version of the original “discovering latent knowledge” method and basically just slapped it onto an LSTM and it worked fine.

So I was quite excited about that finding, because one of the big risk categories that people talk about is that there’s some sort of phase transition or change in the architecture, or self modification or something and so forth, which changes the internal structures of the model such that the previous alignment techniques do not generalize well to the new ontological patterns or whatever. It’s like the “sharp left turn” from Nate Soares. And so seeing that a technique which had been optimized for transformers’ latent representations just immediately generalizes to an LSTM’s latent representations is pretty good news from the perspective of those sorts of concerns.

Why do shard theorists not work on replicating human childhoods?

Daniel Filan: One thing that strikes me about these research projects is they seem very different to an idea of just “have AIs learn values the way humans do”. So there’s this supervised oversight thing with using some models as leverage to understand what other models are doing with unsupervised clustering on… maybe it was on the activation level, there’s a bunch of work on understanding the activations inside various models by Alex Turner and by Nora… do you think I’m wrong to have that reaction?

Quintin Pope: No. So-

Daniel Filan: On some level it seems strange.

Quintin Pope: I generally think that the current paradigm of self-supervised plus RL is already to a great degree doing what you should, from a shard theory perspective. I guess you could want to do more “RL on top of thoughts” sort of stuff motivated by a shard theory perspective, but I think that’s actually not that useful, or I think it happens by default in the current paradigm. So I mentioned a bunch about how different applications of RL are basically these different ways of doing conditional sampling or more efficient ways of doing conditional sampling on top of a generative prior. And the thing about language models is you can already do this with prompting. So there’s Anthropic’s pre-training language models with human preferences. So what they do, like I mentioned, is they label certain portions of the pre-training data with control codes generated by their reward model.

And then when you’re in deployment, you can control the method by using the same control codes that you used during pre-training. But this is basically prompting. You modified the pre-training corpus so that there were these special tokens that represented degree of goodness of the upcoming sequence, and you basically just did pre-training on this modified corpus. And so now you have these convenient control codes for managing the behavior. But if you have a good enough model and you’re good enough at prompting, you should basically be able to do the same thing with prompting, because it is like prompting.

Also, we don’t really have the computational resources to run experiments on the training processes of language models, which is the thing you have to do, or it’s difficult to run experiments at the scale anywhere near approaching the human brain.

Daniel Filan: Sure. I guess I’m still confused though. I might have naively thought, “Oh, the shard theory people, what they’re all going to do is try and figure out ways of training ML models that’s roughly like the human lifetime learning process and the aspects of it that have them behave well.” But your projects look really different from that. And I guess I’m not sure what that… Well, you could just be like, “yes, it is really different.” But the thing about prompting doesn’t give me a good sense of what the connection is, if there is one.

Quintin Pope: The issue with doing as you describe, from my perspective at least, is that, like I said, current practice is actually very similar to the human brain already. And the ways I can imagine making it more similar to the human brain seem to me like the sorts of things that wouldn’t really matter that much, or functionally-equivalent things happen anyways in the current paradigm, just maybe they need more scale in order to fully manifest. And also some of the ways that the brain works, I think are kind of bad from an alignment perspective. So there’s this thing that the brain is capable of doing where it refactors your entire morality or chooses a completely different moral philosophy without producing any sort of external output about having done that. So that seems like-

Daniel Filan: When does the brain do that?

Quintin Pope: It’s just like, you can introspect on moral philosophy questions a bunch and decide, “Oh, I want to be a utilitarian hedonist” or whatever. And this seems quite bad, from my perspective, for an architecture. I think it’s very good that GPT-4 only update its parameters when we explicitly do back propagation on inputs and outputs that are visible to us looking at the behavior. So a lot of the ways that seem most relevant for getting things to be more brain-like seem either not that important or actually bad from an alignment perspective. To a very great degree, I think that the key controversial and alignment-relevant claim of shard theory is that actually pretty simple things from deep learning are the way you want to go in order to get value alignment to work in a human-like manner.

And then once you update on the shard theory perspective of “alignment is feasible in deep learning and it sort of looks like RL-esque plus self-supervised learning”, then you’re pretty much in a pretty similar epistemic position to a lot of other alignment researchers who are trying to do prosaic alignment work, in terms of your judgment of what sort of work is highest impact. So my research direction is very much focused on stability over time. If you have an iteratively self-improving AI system that’s curating its own training data or going into the world and running experiments and gathering more data to do training on, or running ML-esque experiments in order to refine its own architecture and so on and so forth, how do you make sure this doesn’t accidentally screw something up in its behavior or values? So that’s why my behavioral assessment framework focuses on things you didn’t know you should be tracking, and also on having assessments be very automated and scalable so that we can do it continuously across multiple rounds of self-improvement, that sort of thing.

Daniel Filan: So roughly the unifying theme is: given that something like self-supervised learning plus deep RL is roughly what we’re going to want to do, we should do prosaic alignment research, but especially focus on things that sort of mimic the ways in which humans could be dangerous or misaligned or just have big changes that we might be worried about.

Quintin Pope: There’s that and of course, the standard prosaic stuff of thinking about the threat model and what are the highest-leverage research directions to intervene on the riskiest types of threats and so on and so forth. I guess I would add that shard theory does make me less concerned about, say, inner misalignment-style concerns in the context of deep learning. And it also makes me more dubious of the “convergent misaligned mesa-optimizer” perspective on what deep learning does at the limits of high capabilities and scale and so on and so forth.

Daniel Filan: So when you say it makes you less concerned about inner optimization, is that just because you think, “Yes, it is going to have inner optimization just like humans, and that’s good”?

Quintin Pope: Yes. And also partially because the inner optimization of humans, it feels like it has a different flavor as compared to the inner optimization that “Risks from learned optimization” talks about. So “Risks from learned optimization” is talking about: the AI will have some, maybe arbitrary, value for the long-term future, and it wants to do that, but is aware it’s in a training process with certain types of loss signals. And so it needs to instrumentally behave in accordance with the training process, these loss signals, so as to avoid being updated away from its current goal. And so from that perspective, the instrumental thing to do is to try and get as much reward as possible so that your values remain the same. But if you think that the human brain learning dynamics are a worthwhile analogy for deep learning dynamics, this immediately runs into problems.

Suppose you want to maximize the number of geese in the world, that’s your inner value, but your genetically-specified reward circuitry has other sources of reward. And so you should instrumentally pursue the most rewarding experiences available to you. And this just seems to lead to a bad result in terms of the geese values. If you do a bunch of cocaine because you want to maintain value stability, I feel like you’ve made a mistake somewhere. And generally, I don’t buy the story where say, your linguistic cortex could become deceptively misaligned to the objective of predicting future language, where really it wants to tile the universe with, I don’t know, frog statues or whatever, but because it’s a cortex trapped in a brain that just pretends to predict, it’s like a hidden intelligence within your linguistic cortex that predicts the next linguistic experience or whatever the exact loss signal is instrumentally, in order to achieve downstream ends.

Following shardy research

Daniel Filan: So I guess finally, I’d like to ask: if people are interested in following your research, or research of shard theory people, how should they do that?

Quintin Pope: So there are three locations I direct people to. One is my LessWrong profile and Alex Turner’s LessWrong profile. We post periodic updates on our thoughts and research directions as LessWrong posts. And then the second location is the shard theory Discord, which I suppose we’ll share a link to for the podcast. And then the third location is the EleutherAI Discord. In particular, there’s a sub-channel there, or a channel there called Eliciting Latent Knowledge, where Nora Berolse does her research on the hidden knowledge within neural networks.

Daniel Filan: All right. So links to all of those will be in the description. We’ve talked for quite a while. You’ve been very generous with your time, but hopefully our listeners will be able to understand the shard theory perspective on things a bit better. So thanks very much for being here today.

Quintin Pope: All right. Thank you very much for having me. Glad to participate.

Daniel Filan: This episode is edited by Jack Garrett and Amber Dawn Ace helped with the transcription. The opening and closing themes are also by Jack Garrett. Financial support for this episode was provided by the Long-Term Future Fund, along with patrons such as Tor Barstad and Ben Weinstein-Raun. To read a transcript of this episode or to learn how to support the podcast yourself, you can visit Finally, if you have any feedback about this podcast, you can email me at

New Comment
11 comments, sorted by Click to highlight new comments since:

you have many cortices

This part isn’t quite right. Here’s some background if it helps.

Part of your brain is a big sheet of gray matter called “the cortex”. In humans, the sheet gets super-crumpled up in the brain, so much so that it’s easy to forget that it’s a single contiguous sheet in the first place. Also in humans, the sheet gets so big that the outer edges of it wind up curved up underneath the center part, kinda like the top of a cupcake or muffin that overflows its paper wrapper.

(See here if you can’t figure out what I’m talking about with the cupcake.)

The outside bit of the cortical sheet (usually) has 3 visible layers under the microscope, and is called “allocortex”. It consists mostly of the hippocampus & piriform cortex. The center part of the cortical sheet (probably 90%+ of the area in humans) is called “isocortex”, and (usually) has 6 visible layers under the microscope. The term “neocortex” is mostly treated as a synonym of “isocortex”, with “isocortex” more common in the technical literature and “neocortex” more common among non-experts.

The isocortex includes lots of things like “visual cortex” and “somatosensory cortex” and “prefrontal cortex” etc. But despite that, you don’t say “there are many cortices”. Grammatically, it’s kinda like how there’s “Eastern Canada” and “Central Canada” and “Western Canada”, but nobody says that therefore there are “many Canadas”. You can say that visual cortex is “a region of the cortex”, but not “a cortex”.

Nice interview, kudos to you both!

One is a bunch of very simple hardwired genomically-specified reward circuits over stuff like your sensory experiences or simple correlates of good sensory experiences. 

I just want to flag that the word “simple” is contentious in this context. The above excerpt isn’t a specific claim (how simple is “simple”, and how big is “a bunch”?) so I guess I neither agree nor disagree with it as such. But anyway, my current guess (see here) is that reward circuits might effectively comprise tens of thousands of lines of pseudocode. That’s “simple” compared to a billion-parameter ML trained model, but it’s super complicated compared to any reward function that you could find in an RL paper on arxiv.

There seems to be a spectrum of opinion about how complicated the reward circuitry is, with Jacob Cannell at one extreme and Geoffrey Miller at the opposite extreme and me somewhere in the middle. See Geoffrey Miller’s post here, and my comment on it, and also a back-and-forth between me & Jacob Cannell here.

And so in the human brain you have, what’s basically self-supervised prediction of incoming sensory signals, like predictive processing, that sort of thing, in terms of learning to predictively model what’s going to happen in your local sensory environment. And then in deep learning, we have all the self-supervised learning of pre-training in language models that’s also learning to predict a sensory environment. Of course, the sensory environment in question is text, at least at the moment. But we’ve seen how that same approach easily extends to multimodal text plus image or text audio or whatever other systems.

I was a bit put off by the vibe that LLMs and humans both have self-supervised learning, and both have RL, so really they’re not that different. On the contrary, I think there are numerous alignment-relevant (and also capabilities-relevant) disanalogies between the kind of model-based RL that I believe the brain uses, versus LLM+RHLF.

Probably the most alignment-relevant of these disanalogies is that in LLMs (but not humans), the main self-supervised learning system is also simultaneously an output system.

Specifically, after self-supervised pretraining, an LLM outputs exactly the thing that it expects to see. (After RLHF, that is no longer strictly true, but RLHF is just a fine-tuning step, most of the behavioral inclinations are coming from pretraining IMO.) That just doesn’t make sense in a human. When I take actions, I am sending motor commands to my own arms and my own mouth etc. Whereas when I observe another human and do self-supervised learning, my brain is internally computing predictions of upcoming sounds and images etc. These are different, and there isn’t any straightforward way to translate between them. (Cf. here where Owain Evans & Jacob Steinhardt show a picture of a movie frame and ask “what actions are being performed?”) Now, as it happens, humans do often imitate other humans. But other times they don’t. Anyway, insofar as humans-imitating-other-humans happens, it has to happen via a very different and much less direct algorithmic mechanism than how it happens in LLMs. Specifically, humans imitate other humans because they want to, i.e. because of the history of past reinforcement, directly or indirectly. Whereas a pretrained LLM will imitate human text with no RL or “wanting to imitate” at all, that’s just mechanically what it does.

I’m not trying to make any larger point about alignment being easy or hard, I just think it’s important to keep that particular difference clear in our heads. Well, OK, actually it is alignment-relevant—I think it weakly suggests that aligning brain-like model-based RL might be harder than aligning LLMs. (As I wrote here, “for my part, if I believed that [LLMs] were sufficient for TAI—which I don’t—then I think I would feel slightly less concerned about AI x-risk than I actually do, all things considered!”) Speaking of which:

So for example, Steven Byrnes is an excellent researcher who thinks primarily about the human brain first and foremost and how to build brain-like AGI as his alignment approach.

(That’s very kind of you to say!) I think of brain-like AGI (and relatedly model-based RL AGI) as a “threat model” much more than an “alignment approach”—in the sense that future researchers might build brain-like AGI, and we need to plan for that possibility. My main professional interest is in finding alignment approaches that would be helpful for this threat model.

Separately, it’s possible that those alignment approaches might themselves be brain-like, i.e. whatever mechanisms lead to humans being (sometimes) nice to each other could presumably be a source of inspiration. That’s my main current project. But that’s not how I personally have been using the term “brain-like AGI”.

There’s no simple function of physical matter configurations that if tiled across the entire universe would fully satisfy the values. … It’s like, we tend to value lots of different stuff and we sort of asymptotically run out of caring about things when there’s lots of them, or decreasing marginal value for any particular fixed pattern.

(Sorry in advance if I’m missing your point.)

It’s worth noting that at least some humans seem pretty gung-ho about the project of tiling the universe with flourishing life and civilization (hi Eliezer!). Maybe they have other desires too, and maybe those other desires can be easily saturated, but that doesn’t seem safety-relevant. If superintelligent-Eliezer likes tiling the universe with flourishing life & civilization in the morning and playing cricket after lunch, then we still wind up with a tiled universe.

Or maybe you’re saying that if a misaligned AI wanted to tile the galaxy with something, it would be something more complicated and diverse than paperclips / tiny molecular squiggles? OK maybe, but if it isn’t sentient then I’m still unhappy about that.

(cc @Quintin Pope )

Specifically, after self-supervised pretraining, an LLM outputs exactly the thing that it expects to see. (After RLHF, that is no longer strictly true, but RLHF is just a fine-tuning step, most of the behavioral inclinations are coming from pretraining IMO.)


Qualitatively the differences between a purely predictively-trained LLM and one after RLHF seems quite large (e.g., see the comparison between GPT-3 and InstructGPT examples from OpenAI).

I was thinking: In pretraining they use 400,000,000,000,000 bits of information (or whatever) to sculpt the model from “every possible token string is equally likely” (or similar) to “typical internet text”. And then in RLHF they use 10,000 bits of information (or something) to sculpt the model from “typical internet text” to “annoyingly-chipper-bland-corporate-speak chatbot”. So when I say “most of the behavioral inclinations”, I guess I can quantify that as 99.99999999%? Or a different perspective is: I kinda feel like 10,000 bits (or 100,000, I don’t know what the number is) is kinda too small to build anything interesting from scratch, as opposed to tweaking the relative prominence of things that are already there. This isn’t rigorous or anything; I’m open to discussion.

(I absolutely agree that RLHF has very obvious effects on output.)

Thanks for your detailed comments!

This podcast does not appear on my feed on Google Podcasts.

Yeah, I've been having difficulty getting Google Podcasts to find the new episode, unfortunately. In the meantime, consider listening on YouTube or Spotify, if those work for you?

Ditto for me.

Without being familiar with the literature, why should I buy that we can informally reason about what is "low-frequency" versus "high-frequency" behavior? I think reasoning about "simplicity" has historically gone astray, and worry that this kind of reasoning will as well.

I've been waiting for this!