AXRP Episode 22 - Shard Theory with Quintin Pope

you have many cortices

This part isn’t quite right. Here’s some background if it helps.

Part of your brain is a big sheet of gray matter called “the cortex”. In humans, the sheet gets super-crumpled up in the brain, so much so that it’s easy to forget that it’s a single contiguous sheet in the first place. Also in humans, the sheet gets so big that the outer edges of it wind up curved up underneath the center part, kinda like the top of a cupcake or muffin that overflows its paper wrapper.

(See here if you can’t figure out what I’m talking about with the cupcake.)

The outside bit of the cortical sheet (usually) has 3 visible layers under the microscope, and is called “allocortex”. It consists mostly of the hippocampus & piriform cortex. The center part of the cortical sheet (probably 90%+ of the area in humans) is called “isocortex”, and (usually) has 6 visible layers under the microscope. The term “neocortex” is mostly treated as a synonym of “isocortex”, with “isocortex” more common in the technical literature and “neocortex” more common among non-experts.

The isocortex includes lots of things like “visual cortex” and “somatosensory cortex” and “prefrontal cortex” etc. But despite that, you don’t say “there are many cortices”. Grammatically, it’s kinda like how there’s “Eastern Canada” and “Central Canada” and “Western Canada”, but nobody says that therefore there are “many Canadas”. You can say that visual cortex is “a region of the cortex”, but not “a cortex”.

Steven Byrnes (2y)

Nice interview, kudos to you both!

One is a bunch of very simple hardwired genomically-specified reward circuits over stuff like your sensory experiences or simple correlates of good sensory experiences.

I just want to flag that the word “simple” is contentious in this context. The above excerpt isn’t a specific claim (how simple is “simple”, and how big is “a bunch”?) so I guess I neither agree nor disagree with it as such. But anyway, my current guess (see here) is that reward circuits might effectively comprise tens of thousands of lines of pseudocode. That’s “simple” compared to a billion-parameter ML trained model, but it’s super complicated compared to any reward function that you could find in an RL paper on arxiv.

There seems to be a spectrum of opinion about how complicated the reward circuitry is, with Jacob Cannell at one extreme and Geoffrey Miller at the opposite extreme and me somewhere in the middle. See Geoffrey Miller’s post here, and my comment on it, and also a back-and-forth between me & Jacob Cannell here.

And so in the human brain you have, what’s basically self-supervised prediction of incoming sensory signals, like predictive processing, that sort of thing, in terms of learning to predictively model what’s going to happen in your local sensory environment. And then in deep learning, we have all the self-supervised learning of pre-training in language models that’s also learning to predict a sensory environment. Of course, the sensory environment in question is text, at least at the moment. But we’ve seen how that same approach easily extends to multimodal text plus image or text audio or whatever other systems.

I was a bit put off by the vibe that LLMs and humans both have self-supervised learning, and both have RL, so really they’re not that different. On the contrary, I think there are numerous alignment-relevant (and also capabilities-relevant) disanalogies between the kind of model-based RL that I believe the brain uses and LLM+RLHF.

Probably the most alignment-relevant of these disanalogies is that in LLMs (but not humans), the main self-supervised learning system is also simultaneously an output system.

Specifically, after self-supervised pretraining, an LLM outputs exactly the thing that it expects to see. (After RLHF, that is no longer strictly true, but RLHF is just a fine-tuning step, most of the behavioral inclinations are coming from pretraining IMO.) That just doesn’t make sense in a human. When I take actions, I am sending motor commands to my own arms and my own mouth etc. Whereas when I observe another human and do self-supervised learning, my brain is internally computing predictions of upcoming sounds and images etc. These are different, and there isn’t any straightforward way to translate between them. (Cf. here where Owain Evans & Jacob Steinhardt show a picture of a movie frame and ask “what actions are being performed?”) Now, as it happens, humans do often imitate other humans. But other times they don’t. Anyway, insofar as humans-imitating-other-humans happens, it has to happen via a very different and much less direct algorithmic mechanism than how it happens in LLMs. Specifically, humans imitate other humans because they want to, i.e. because of the history of past reinforcement, directly or indirectly. Whereas a pretrained LLM will imitate human text with no RL or “wanting to imitate” at all, that’s just mechanically what it does.
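To make the "prediction is also the output" point concrete, here is a minimal toy sketch. It assumes a bigram count model standing in for an LLM; the names `train_bigram`, `predict`, and `generate` are hypothetical, made up for this illustration, and nothing here reflects how any real LLM is implemented. The only thing it shows is that the distribution learned by next-token prediction is sampled directly to produce output, with no separate action-selection step.

```python
import random
from collections import Counter, defaultdict

def train_bigram(tokens):
    """Self-supervised step: count which token follows which."""
    counts = defaultdict(Counter)
    for prev, nxt in zip(tokens, tokens[1:]):
        counts[prev][nxt] += 1
    return counts

def predict(counts, prev):
    """The model's expectation: a distribution over the next token."""
    total = sum(counts[prev].values())
    return {tok: n / total for tok, n in counts[prev].items()}

def generate(counts, start, length, rng):
    """Output step: sample straight from the predictive distribution."""
    out = [start]
    for _ in range(length):
        dist = predict(counts, out[-1])
        if not dist:  # no observed continuation for this token
            break
        toks, probs = zip(*dist.items())
        out.append(rng.choices(toks, weights=probs, k=1)[0])
    return out

# Toy corpus (an assumption for illustration only).
corpus = "the cat sat on the mat".split()
counts = train_bigram(corpus)
rng = random.Random(0)
sample = generate(counts, "the", 5, rng)
print(" ".join(sample))
```

Note that `generate` reuses `predict` unchanged. In the brain-like picture described above, the analogous predictive model would produce expected sounds and images, and a separately trained system would have to choose motor commands.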

I’m not trying to make any larger point about alignment being easy or hard, I just think it’s important to keep that particular difference clear in our heads. Well, OK, actually it is alignment-relevant—I think it weakly suggests that aligning brain-like model-based RL might be harder than aligning LLMs. (As I wrote here, “for my part, if I believed that [LLMs] were sufficient for TAI—which I don’t—then I think I would feel slightly less concerned about AI x-risk than I actually do, all things considered!”) Speaking of which:

So for example, Steven Byrnes is an excellent researcher who thinks primarily about the human brain first and foremost and how to build brain-like AGI as his alignment approach.

(That’s very kind of you to say!) I think of brain-like AGI (and relatedly model-based RL AGI) as a “threat model” much more than an “alignment approach”—in the sense that future researchers might build brain-like AGI, and we need to plan for that possibility. My main professional interest is in finding alignment approaches that would be helpful for this threat model.

Separately, it’s possible that those alignment approaches might themselves be brain-like, i.e. whatever mechanisms lead to humans being (sometimes) nice to each other could presumably be a source of inspiration. That’s my main current project. But that’s not how I personally have been using the term “brain-like AGI”.

There’s no simple function of physical matter configurations that if tiled across the entire universe would fully satisfy the values. … It’s like, we tend to value lots of different stuff and we sort of asymptotically run out of caring about things when there’s lots of them, or decreasing marginal value for any particular fixed pattern.

(Sorry in advance if I’m missing your point.)

It’s worth noting that at least some humans seem pretty gung-ho about the project of tiling the universe with flourishing life and civilization (hi Eliezer!). Maybe they have other desires too, and maybe those other desires can be easily saturated, but that doesn’t seem safety-relevant. If superintelligent-Eliezer likes tiling the universe with flourishing life & civilization in the morning and playing cricket after lunch, then we still wind up with a tiled universe.

Or maybe you’re saying that if a misaligned AI wanted to tile the galaxy with something, it would be something more complicated and diverse than paperclips / tiny molecular squiggles? OK maybe, but if it isn’t sentient then I’m still unhappy about that.

(cc @Quintin Pope )

garymm (2y)

Specifically, after self-supervised pretraining, an LLM outputs exactly the thing that it expects to see. (After RLHF, that is no longer strictly true, but RLHF is just a fine-tuning step, most of the behavioral inclinations are coming from pretraining IMO.)

Qualitatively, the differences between a purely predictively-trained LLM and one after RLHF seem quite large (e.g., see the comparison between GPT-3 and InstructGPT examples from OpenAI).

Steven Byrnes (2y)

I was thinking: In pretraining they use 400,000,000,000,000 bits of information (or whatever) to sculpt the model from “every possible token string is equally likely” (or similar) to “typical internet text”. And then in RLHF they use 10,000 bits of information (or something) to sculpt the model from “typical internet text” to “annoyingly-chipper-bland-corporate-speak chatbot”. So when I say “most of the behavioral inclinations”, I guess I can quantify that as 99.99999999%? Or a different perspective is: I kinda feel like 10,000 bits (or 100,000, I don’t know what the number is) is kinda too small to build anything interesting from scratch, as opposed to tweaking the relative prominence of things that are already there. This isn’t rigorous or anything; I’m open to discussion.
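As a rough check on that percentage, the arithmetic can be sketched as below. The two bit counts are the comment's own placeholder guesses ("or whatever", "or something"), not measured values, and the helper name `pretraining_share` is hypothetical.

```python
# Placeholder figures quoted from the comment; both are rough guesses.
pretraining_bits = 400_000_000_000_000
rlhf_bits = 10_000

def pretraining_share(pretrain: int, finetune: int) -> float:
    """Fraction of the total training information that comes from pretraining."""
    return pretrain / (pretrain + finetune)

share = pretraining_share(pretraining_bits, rlhf_bits)
rlhf_fraction = rlhf_bits / (pretraining_bits + rlhf_bits)

print(f"pretraining share: {share:.12%}")
print(f"RLHF share:        {rlhf_fraction:.2e}")

# The alternative guess of 100,000 RLHF bits mentioned in the comment.
alt_fraction = 100_000 / (pretraining_bits + 100_000)
print(f"RLHF share (100,000 bits): {alt_fraction:.2e}")
```

Under these guesses the RLHF share is on the order of 10^-11 to 10^-10, so the quoted "99.99999999%" is the right order of magnitude for the pretraining share.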

(I absolutely agree that RLHF has very obvious effects on output.)

DanielFilan (2y)

Thanks for your detailed comments!

niplav (2y)

This podcast does not appear on my feed on Google Podcasts.

DanielFilan (2y)

Yeah, I've been having difficulty getting Google Podcasts to find the new episode, unfortunately. In the meantime, consider listening on YouTube or Spotify, if those work for you?

DragonGod (2y)

It's working now!

https://podcasts.google.com/feed/aHR0cHM6Ly9heHJwb2RjYXN0LmxpYnN5bi5jb20vcnNz/episode/ODVlM2RkNmItMTdkZi00MWYwLTg2YjAtOWIxY2JkOTBlYjgw?ep=14

DragonGod (2y)

Ditto for me.

TurnTrout (2y)

Without being familiar with the literature, why should I buy that we can informally reason about what is "low-frequency" versus "high-frequency" behavior? I think reasoning about "simplicity" has historically gone astray, and worry that this kind of reasoning will as well.

DragonGod (2y)

I've been waiting for this!

