This is a lightly edited transcript of a chatroom conversation between Scott Alexander and Eliezer Yudkowsky last year, following up on the Late 2021 MIRI Conversations. Questions discussed include "How hard is it to get the right goals into AGI systems?" and "In what contexts do AI systems exhibit 'consequentialism'?".
1. Analogies to human moral development
@ScottAlexander ready when you are
Okay, how do you want to do this?
If you have an agenda of Things To Ask, you can follow it; otherwise I can start by posing a probing question or you can?
We've been very much winging it on these and that has worked... as well as you have seen it working!
Okay. I'll post from my agenda. I'm assuming we both have the right to edit logs before releasing them? I have one question where I ask about a specific party where your real answer might offend some people it's bad to offend - if that happens, maybe we just have that discussion and then decide if we want to include it later?
Yup, both parties have rights to edit before releasing.
One story that psychologists tell goes something like this: a child does something socially proscribed (eg steal). Their parents punish them. They learn some combination of "don't steal" and "don't get caught stealing". A few people (eg sociopaths) learn only "don't get caught stealing", but most of the rest of us get at least some genuine aversion to stealing that eventually generalizes into a real sense of ethics. If a sociopath got absolute power, they would probably steal all the time. But there are at least a few people whose ethics would successfully restrain them.
I interpret a major strain in your thought as being that we're going to train fledgling AIs to do things like not steal, and they're going to learn not to get caught stealing by anyone who can punish them. Then, once they're superintelligent and have absolute power, they'll reveal that it was all a lie, and steal whenever they want. Is this worry at the level of "we can't be sure they won't do this"? Or do you think it's overwhelmingly likely? If the latter, what makes you think AIs won't internalize ethical prohibitions, even though most children do? Is it that evolution has given us priors to interpret reward/punishment in a moralistic and internalized way, and entities without those priors will naturally interpret them in a superficial way? Do we understand what those priors "look like"? Is finding out what features of mind design and training data cause internalization vs. superficial compliance a potential avenue for AI alignment?
Several layers here! The basic gloss on this is "Yes, everything that you've named goes wrong simultaneously plus several other things. If I'm wrong and one or even three of those things go exactly like they do in neurotypical human children instead, this will not be enough to save us."
If AI is built on anything like the present paradigm, or on future paradigms either really, you can't map that onto the complicated particular mechanisms that get invoked by raising a human child, and expect the same result.
(give me some sign when you're done answering)
(it may be a while but you should probably also just interrupt)
especially if I say something that already sounds wrong
the old analogy I gave was that some organisms will develop thicker fur coats if you expose them to cold weather. this doesn't mean the organism is simple and the complicated information about fur coats was mostly in the environment, and that you could expose an organism from a different species to cold weather and see it develop a fur coat the same way. it actually takes more innate complexity to "develop a fur coat in response to my built-in cold weather sensor" than to "unconditionally develop a fur coat whether or not there's cold weather".
the Soviets, weirdly enough, quite failed in their project of raising the New Soviet Human by means of training children in particular ways, because it turned out that they got Old Humans instead, because they weren't sending a kind of signal that humans' innate complexity was programmed to respond to by looking up the New Soviet Human components in the activateable parts list, because they didn't have that kind of fur coat built into them regardless of the weather.
human children put into relatively bad situations can still spontaneously develop empathy and sympathy, or so I've heard, having not seen very formal experiments. this is not because these things are coded so deeply into all possible sapient mind designs, but because they're coded into humans particularly as things easy to develop.
there isn't literally a single switch you can throw in human children to turn them into Nice Moral People, but there's a prespecified parts list, your Nice Morality just happens to be built out of things only on the parts list go figure, and if you expose the kid to the right external stimuli you will at secondhand end up building the right structure of premanufactured legos to get something pretty similar to your Nice Morality. or so you hope; it doesn't work every time. but the part where it doesn't work every time in humans, is not where the problem comes from in AI.
I shall here pause for questions about the human part of this story.
I acknowledge this is a possible state of affairs; do you think it's obvious or necessary that it's true? I can also imagine an alternative world where eg a dumb kid tries to steal a cookie, their parents punish them, their brain considers both the heuristics "never steal" and "don't steal if you'll get caught", it tests both heuristics, they're dumb and five years old so even when they think they won't get caught, they get caught, so their brain settles on the "never steal" heuristic, and then fails to ever update from that local maximum unless they take way too many 5HT2A agonists in the relaxed-beliefs-under-psychedelics sense. What makes you think your story is true and not this other one?
Facile answer: Why, that's just what the Soviets believed, this Skinner-box model of human psychology devoid of innate instincts, and they tried to build New Soviet Humans that way, and failed, which was an experimental test of their model that falsified it.
Slightly less facile answer: Because people are better at detecting cheating, in problems isomorphic to the Wason Selection Task, than they are at performing the naked Wason Selection Task, the conventional explanation of which is that we have built-in cheater detectors. This is a case in point of how humans aren't blank slates and there's no reason to pretend we are.
Actual answer: Because the entire field of experimental psychology that's why.
To be clear, there could be an analogous version of this story that was about something like a human child who learns to never press a red button, and actually it's okay to press the red button so long as you also press the blue button, but they never experiment far enough to find that out. It's just that when it comes to stealing cookies in particular, and avoiding being caught about that, you'd have to be pretty unfamiliar with the Knowledge to think that humans wouldn't have all kinds of builtins related to that.
I'm coming at this from a perspective sort of related to https://astralcodexten.substack.com/p/motivated-reasoning-as-mis-applied , which builds on something you said in a previous dialogue (though I'm not sure you endorse my interpretation of it). There are lots of reasons why evolution would build in motivated reasoning, but in fact it had a much easier time than if it had to do it from the ground up, because in fact it's a pretty natural consequence of pretty general algorithms, maybe it tweaked the algorithm a little to get more of this failure mode but you could plausibly have the (beneficial) failure mode even without evolution tweaking it. I'm going to have to think about this more but I'm not sure this is the best place to spend time - unless you have a strong objection to this paragraph I want to move on to a related question.
I agreed with that post, including the part where you said "Actually I bet Eliezer already knew this part."
Motivated reasoning is definitely built-in, but it's built-in in a way that very strongly bears the signature of 'What would be the easiest way to build this out of these parts we handily had lying around already'.
Let's grant for now that the thing where humans have morals instead of just wanting not to get caught is an evolutionary builtin. Is your model that there's a history something like "bats were too dumb to contain an 'unless I get caught' term in their morality and use it responsibly, so evolution made bats just actually be moral, and now even though (some) humans are (sometimes) smart enough to actually avoid getting caught, they're running on something like bat machinery so they still use actual morality"?
Or is it some decision theory thing such that even very smart modern humans would evolve the same machinery?
I mean, the evolutionary builtin part is not "humans have morals" but "humans have an internal language in which your Nice Morality, among other things, can potentially be written". The part where fruitbats don't have an 'unless I get caught' term is part of a much bigger and more universal generalization about evolution building in local instincts instead of just having everybody reason about what ultimately leads to their inclusive genetic fitness. That is, the same reasoning by which you'd say 'Why not just an unless-I-get-caught term in the fruitbats?' is the same reasoning that, extended further, would lead you to conclude 'Why do humans have all these feelings that bind to life events imperfectly correlated with inclusive genetic fitness, instead of just feelings about inclusive genetic fitness?' Where the answer is that in the environment of evolutionary adaptedness, people didn't have the knowledge about what led to inclusive genetic fitness, and it's easier to mutate an organism that would like not to eat rotten food today, than to mutate an organism that would like to maximize inclusive genetic fitness and is born with the knowledge of how eating rotten food leads to having fewer offspring.
Humans, arguably, do have an imperfect unless-I-get-caught term, which is manifested in children testing what they can get away with? Maybe if nothing unpleasant ever happens to them when they're bad, the innate programming language concludes that this organism is in a spoiled aristocrat environment and should behave accordingly as an adult? But I am not an expert on this form of child developmental psychology since it unfortunately bears no relevance to my work of AI alignment.
Do you feel like you understand very much about what evolutionary builtins are in a neural network sense? EG if you wanted to make an AI with "evolutionary builtins", would you have any idea how to do it?
Well, for one thing, they happen when you're doing sexual-recombinant hill-climbing search through a space of relatively very compact neural wiring algorithms, not when you're doing gradient descent relative to a loss function on much larger neural networks.
The other side of this problem is that the particular programming-language-of-morality that we got, reflects particular ancestral conditions - of evolution specifically, not of gradient descent - and these ancestral conditions are not simple, it's not "iterated Prisoner's Dilemma" it's iterated Prisoner's Dilemma with imperfect reputations and people trying to deceive each other and people trying to detect deceivers and the arms race between deceivers and detectors settling in a place where neither quite won.
So the unfortunate answer to "How do you get humans again?" is "Rerun something a lot like Earth" which I think we both have moral objections about as something to do to sentients.
Moot point, though, AGI won't be done via sexually recombinant search of simple algorithms without any gradient descent.
And if you don't do it that way, nothing you put into the loss function for gradient descent will produce humans.
Can you expand on sexual recombinant hill-climbing search vs. gradient descent relative to a loss function, keeping in mind that I'm very weak on my understanding of these kinds of algorithms and you might have to explain exactly why they're different in this way?
It's about the size of the information bottleneck. The human genome is 3 billion base pairs drawn from 4 possibilities, so 750 megabytes. Let's say 90% of that is junk DNA, and 10% of what's left is neural wiring algorithms. So the code that wires a 100-trillion-synapse human brain is about 7.5 megabytes. Now an adult human contains a lot more information than this. Your spinal cord is about 70 million neurons so probably just your spinal cord has more information than this. That vastly greater amount of runtime info inside the adult organism grows out of the wiring algorithms as your brain learns to move around your muscles, and your eyes open and the retina wires itself and starts directing info on downward to more things that wire themselves, and you learn to read, and so on.
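(The arithmetic in that estimate can be checked with a short sketch. The 90% "junk" and 10% "neural wiring" fractions are the guesses stated in the conversation, not measured values, and the constant names are only illustrative.)

```python
# Back-of-the-envelope check of the genome-size estimate given above.
# The fractions are the speaker's stated guesses, not measured values.

BASE_PAIRS = 3_000_000_000   # ~3 billion base pairs
BITS_PER_BASE = 2            # 4 possible bases -> log2(4) = 2 bits each
NON_JUNK_FRACTION = 0.10     # "90% of that is junk DNA"
NEURAL_FRACTION = 0.10       # "10% of what's left is neural wiring algorithms"


def megabytes(bits: float) -> float:
    """Convert a bit count to megabytes, taking 1 MB = 10**6 bytes."""
    return bits / 8 / 1_000_000


genome_mb = megabytes(BASE_PAIRS * BITS_PER_BASE)
wiring_mb = genome_mb * NON_JUNK_FRACTION * NEURAL_FRACTION

print(f"whole genome:  {genome_mb:.1f} MB")
print(f"neural wiring: {wiring_mb:.1f} MB")
```

(Under those assumptions the whole genome comes to 750 MB and the wiring code to roughly 7.5 MB, matching the figures quoted.)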
Anything innate that makes reasoning about people out to cheat you, easier than reasoning about isomorphic simpler letters and numbers on cards, has to be packed into the 7.5MB, and gets there via a process where ultimately one random mutation happens at a time, even though lots of mutations are recombining and being selected on at a time.
It's a very slow learning process. It takes hundreds or thousands of generations even for a pretty good mutation to fix itself in the population and become reliably available as a base for other mutations to build on. The entire organism is built out of copying errors that happened to work better than the things they were copied from. Everything is built out of everything else, the pieces that were already lying around for building other things.
When you're building an organism that can potentially benefit from coordinating, trading, with other organisms very similar to itself, and accumulating favors and social capital over long time horizons - and your organism is already adapted to predict what other similar organisms will do, by forcing its own brain to operate in a special reflective mode where it pretends to be the other person's brain - then a very simple way of figuring out what other people will like, by way of figuring out how to do them favors, is to notice what your brain feels when it operates in the special mode of pretending to be the other person's brain.
And one way you can get people who end up accumulating a bunch of social capital is by having people with at least some tendency in them - subject to various other forces and overrides, of course - to feel what they imagine somebody else feeling. If somebody else drops a rock on their foot, they wince.
This is a way to solve a favor-accumulation problem by laying some extremely simple circuits down on top of a lot of earlier machinery.
Thanks, that's a helpful answer, but it does renew my interest in the original question, which was about whether you feel like you understand how (not why) we have evolutionary builtins. I can imagine the genome determining things like "how many neurons does each neuron connect to, on average" or "how much do neurons prefer to connect to nearby rather than far-away neurons" or things like that. Is a builtin like "care about the pain of others" somehow built out of these kinds of parameters?
Ultimately yes, but not in a simple way. We are not in a very much better position for understanding exactly how that all happens, than we are in for understanding what goes on inside GPT-2. Where, to be clear, GPT-2 is smaller and has every neuron inside it transparent to inspection and also it's more important to understand GPT neuroscience than human neuroscience, at this point; but we live on Earth so actually we know a lot more about human neuroscience because it gets billions of dollars per year and hundreds or thousands of bright ambitious PhDs to investigate it. So we can, amusingly enough, tell you more about how humans work than GPT-2, despite the immensely greater difficulties of probing humans. But we still can't tell you very much at all, and we definitely can't tell you how empathy is built up out of genetic-level wiring algorithms. It does not in fact to me seem like a very important question at this point?
Why not? If you understood the way that the structure of human reinforcement algorithms causes them to interpret training data (ie punishment for stealing) as genuine laws (eg "don't steal" rather than "don't get caught stealing"), wouldn't that help people design AIs which had a similar structure and also did that?
I think I understand that part. Knowing this, even if I am correct about it, does not solve my problems.
Like, we're not going to run evolution in a way where we naturally get AI morality the same way we got human morality, but why can't we observe how evolution implemented human morality, and then try AIs that have the same implementation design?
Not if it's based on anything remotely like the current paradigm, because nothing you do with a loss function and gradient descent over 100 quadrillion neurons, will result in an AI coming out the other end which looks like an evolved human with 7.5MB of brain-wiring information and a childhood.
Like, in particular with respect to "learn 'don't steal' rather than 'don't get caught'."
I'm still confused on this, but before I probe this particular area I'm interested in hearing you expand on "I think I understand that part"
I think that is perhaps best explicated, indeed, via zooming in on "learn 'don't steal' rather than 'don't get caught'"?
Okay, then let me try to directly resolve my confusion. My current understanding is something like - in both humans and AIs, you have a blob of compute with certain structural parameters, and then you feed it training data. On this model, we've screened off evolution, the size of the genome, etc - all of that is going into the "with certain structural parameters" part of the blob of compute. So could an AI engineer create an AI blob of compute the same size as the brain, with its same structural parameters, feed it the same training data, and get the same result ("don't steal" rather than "don't get caught")?
The answer to that seems sufficiently obviously "no" that I want to check whether you also think the answer is obviously no, but want to hear my answer, or if the answer is not obviously "no" to you.
Then I'm missing something, I expected the answer to be yes, maybe even tautologically (if it's the same structural parameters and the same training data, what's the difference?)
Maybe I'm failing to have understood the question. Evolution got human brains by evaluating increasingly large blobs of compute against a complicated environment containing other blobs of compute, got in each case a differential replication score, and millions of generations later you have humans with 7.5MB of evolution-learned data doing runtime learning on some terabytes of runtime data, using their whole-brain impressive learning algorithms which learn faster than evolution or gradient descent.
Your question sounded like "Well, can we take one blob of compute the size of a human brain, and expose it to what a human sees in their lifetime, and do gradient descent on that, and get a human?" and the answer is "That dataset ain't even formatted right for gradient descent."
Okay, it sounds like I'm doing some kind of level confusion between evolutionary-learning and childhood-learning, but I'm still not entirely seeing where it is. Let me read this over again.
Okay, no, I think I see the problem, which is that I'm failing to consider that evolutionary-learning and childhood-learning are happening at different times through different algorithms, whereas for AIs they're both happening in the same step by the same algorithm. Does that fit your model of what would produce the confusion I was going through above?
It would produce that confusion, yes; though I also want to note that I don't believe that we'll get AGI entirely out of the currently-popular Stack More Layers paradigm that learns that way.
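(A toy sketch of the two-level picture being untangled here, under heavy simplifying assumptions: a slow outer search selects a tiny "genome", here a single learning-rate number, and a fast inner loop does the within-lifetime learning. The outer loop below uses mutation and selection only, with no sexual recombination, and every name and number is hypothetical. It illustrates the structure, not how brains or any real AI system work.)

```python
import random


def lifetime_learning(genome_lr, steps=20, target=3.0):
    """Inner loop: one 'lifetime' of learning, starting from scratch.

    The genome supplies only a learning rate; the learned value w is
    acquired at runtime and is never inherited.
    """
    w = 0.0
    for _ in range(steps):
        w += genome_lr * (target - w)   # simple error-correcting update
    return -abs(target - w)             # fitness: closer to target is better


def evolve(generations=30, pop_size=20, seed=0):
    """Outer loop: mutation-and-selection search over the tiny genome."""
    rng = random.Random(seed)
    population = [rng.uniform(0.0, 0.05) for _ in range(pop_size)]
    for _ in range(generations):
        ranked = sorted(population, key=lifetime_learning, reverse=True)
        parents = ranked[: pop_size // 2]          # keep the fitter half
        children = [min(1.0, max(0.0, p + rng.gauss(0.0, 0.01)))
                    for p in parents]              # one small mutation each
        population = parents + children
    return max(population, key=lifetime_learning)


best_lr = evolve()
print("evolved learning rate:", best_lr)
print("lifetime fitness:", lifetime_learning(best_lr))
```

(The point of the sketch is only that the two loops are different algorithms running on different timescales, with very little information passing from the outer one to the inner one. In gradient-descent training as described above, there is no such split.)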
Okay, I'm going to have to go over all my thoughts on this and update them manually now that I've deconfused that, so I'm going to abandon this topic for now and move on. Do you want to take a break or keep going?
That does seem like a good note for a break? If it worked for you, I'd suggest a 60-min break to 4pm and then another 90+ min of dialoguing, but I don't know what your work output and time parameters are like.
Sounds good, let me know, I might not be checking this Discord super-regularly but I'll be back by 4 if not earlier.
2. Consequentialism and generality
Still not sure I've fully updated and probably some of these other questions are subtly making the same mistake, but let's go anyway.
I want to return to a point I made earlier about the model in https://slatestarcodex.com/2019/09/10/ssc-journal-club-relaxed-beliefs-under-psychedelics-and-the-anarchic-brain/ . Psychologists tell a story where humans learn heuristics when young, then those become sticky (ie local maxima), and they fail to update those heuristics when they get older. For example, someone who has a traumatic childhood learns that the world is unsafe, and then even if they have a good environment as an adult and should have had lots of chances to update, they might stay jumpy and defensive (cf "trapped prior"). Evolutionary builtin, natural consequence of learning that might affect AIs too, or what?
well, first of all, I note that I am not familiar with whatever detailed experimental evidence, if any, underpins this story. it's a cliche of the sort that is often true, that people are more mentally flexible at 25 than at 45, I don't know if the same is true about say 15 and 25. there are known algorithms that run better in childhood for most people, like language learning.
(I don't think this especially relies on changing levels of mental flexibility)
what's your model if not the wiring algorithms changing as we age?
How do you feel about me sending you some links later, you can look at them and decide if this is still an interesting discussion, but for now we move on?
once people have a heuristic telling them X leads to bad consequences and hurts, they don't try X and so don't learn if their environment changes in a way that makes X stop hurting?
sure, fine to move on.
should I move on to "does that happen in AI" or just move on to something else entirely?
Let's move on entirely, I need to think about how sure I am that this is relevant, or I can send you the links and outsource that question to you.
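(The sticky-heuristic story raised above, where an agent that has learned "X hurts" stops trying X and so never notices that X stopped hurting, can be sketched with a purely greedy learner. This is a toy under stated assumptions: two actions, no exploration after the first step, and made-up payoffs. It is not a model of human psychology or of any particular AI system.)

```python
def run_greedy_agent(steps=100, change_at=10):
    """Two actions: 'safe' always pays 0.0; 'X' pays -1.0 early, +1.0 later."""
    estimates = {"safe": 0.0, "X": 0.0}
    counts = {"safe": 0, "X": 0}
    for t in range(steps):
        if t == 0:
            action = "X"                                # one early bad experience
        else:
            action = max(estimates, key=estimates.get)  # purely greedy afterwards
        if action == "X":
            reward = -1.0 if t < change_at else 1.0     # environment changes later
        else:
            reward = 0.0
        counts[action] += 1
        # Running average of the rewards observed for this action.
        estimates[action] += (reward - estimates[action]) / counts[action]
    return estimates, counts


estimates, counts = run_greedy_agent()
print(estimates)   # the estimate for X stays at its early, outdated value
print(counts)      # X is tried exactly once
```

(Any amount of exploration would let the agent eventually notice the change. The sketch only shows why a learner that never revisits X cannot.)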
Suppose you train a (human-level or weakly-superhuman-level) AI in Minecraft. You reward it for various Minecraft accomplishments, like getting diamonds or slaying dragons. Do you expect this AI to become a laser-like consequentialist focused on doing whichever Minecraft accomplishment is next on the list, or to have godshatter-like drives corresponding to useful Minecraft subgoals (eg obtaining food, obtaining good tools, accruing XP), or something else / unsure / this question is on the wrong level? Can you explain the processes you use to think about this kind of question?
Do you mean training a human-level-generality AGI to play Minecraft, or training a nongeneral AI to play Minecraft to weakly superhuman levels a la AlphaGo?
These are incredibly different cases!
Hmmm...I might not have the right concepts to think clearly about the implications of the difference. Why don't you answer both?
If it helps, I'm assuming it hasn't been trained in anything else first, but has the capacity to become human level (if that's meaningful)
Human level at Minecraft or human level generality?
Let's start with "human level at Minecraft" but accept that this might involve multiplayer Minecraft, including multiplayer Minecraft with text-based communication with teammates and so on, such that it would look AGI-ish if it did a good job.
So, point one, I've never played Minecraft, I do not have a grasp on what you do in it, or how far you could get with Stack More Layers style accumulation of relatively shallow patterns. If this were about Skyrim or Factorio I'd have an easier time answering, but my guess is that Minecraft is probably?? more complicated than both?
My guessing model is going to be "more complicated Skyrim+Factorio" by default.
If this is the environment, then I expect you can train a nongeneral AI to play it in similar fashion to how, for example, Deepmind attacks Starcraft. Coordinating with human teammates by text sounds like the hugely nontrivial part of this, because it's hard to get a ton of training data there. I think everyone in the field would be incredibly impressed if they managed to hook up a pretrained GPT to an AlphaStar-for-Minecraft and get back out something that could talk about its strategies with human coplayers. I'd consider that a huge advance in alignment research - nowhere near the point where we all don't die, to be clear, but still hella impressive - because of the level of transparency increase it would imply, that there was an AI system that could talk about its internally represented strategies, somehow. Maybe because somebody trained a system to describe outward Minecraft behaviors in English, and then trained another system to play Minecraft while describing in advance what behaviors it would exhibit later, using the first system's output as the labeler on the data.
These are the kinds of tactics required on the modern paradigm in order to even try stuff like that!
As such, I'm going to ask you whether it's possible to leave out the part about coordinating in text with human teammates and then reconsider the question.
Then in this case, I strongly suspect, Deepmind could make AlphaMiner if they decided they wanted to, though I say that pretty blind to what Minecraft is, just suspecting it's probably not all that much harder than Starcraft.
AlphaMiner will be a system which has components like a value network, a policy-suggesting network, and a Monte Carlo Tree Search.
The value network gets trained by a loss function the operators define with respect to the Minecraft environment. This is going to be a pretty nontrivial part of the operation unless Minecraft has a straightforward points system and scoring high in Minecraft is all you want.
Let's say that they successfully tackle this by rewarding the usual Minecraft accomplishments, whatever those are, in a way that can easily be detected by code within the Minecraft world; and once the system has done something once, the loss function stops rewarding that accomplishment, so you're trying to train it to do a variety of things.
Where the alternative might be something like, semi-unsupervised learning where you first train a system to predict the Minecraft world, and then gather a smaller (though still large) amount of human feedback about interesting-looking accomplishments and further train that system to predict human feedback, in order to train a more complicated loss function.
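(The first, hardcoded option described above, rewarding each detectable accomplishment once and then no longer, might look something like the sketch below. The class name, the event strings, and the 1.0-per-achievement scoring are all hypothetical; nothing here is taken from a real Minecraft API or a real training setup.)

```python
class AchievementReward:
    """Reward each listed achievement the first time it occurs, then never again."""

    def __init__(self, achievements):
        self.remaining = set(achievements)

    def __call__(self, events):
        """Return 1.0 for every listed achievement seen for the first time."""
        newly_done = self.remaining & set(events)
        self.remaining -= newly_done       # stop rewarding these from now on
        return float(len(newly_done))


reward = AchievementReward(["get_diamond", "slay_dragon"])
first = reward(["get_diamond"])                 # rewarded: first diamond
repeat = reward(["get_diamond"])                # not rewarded: already done
mixed = reward(["slay_dragon", "chop_tree"])    # only the listed one counts
print(first, repeat, mixed)
```

(Because repeats earn nothing, a learner trained against this signal is pushed toward doing a variety of things rather than farming one accomplishment. The learned-from-human-feedback alternative would replace the fixed list with a trained predictor.)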
(I stopped typing because I saw you typing; should I pause for a question?)
No, your "where the alternative" comment was helpful, I was going to ask if this means hard-coding which accomplishments matter and how much, but I'm getting the impression that you're saying yes, something like that.
The question "What can you even make be a loss function?" is pretty fundamental to the current paradigm in AI. Nearly all difficulties with aligning AGI tech on the current paradigm can be summarized with "You can't actually evaluate the highly philosophical loss function you really want and/or you can't train in the environment you need to test on."
In the case of hypothetical AlphaMiner, I think you could get pretty good correspondence between what the system went and planned a way to do, and the hardcoded achievements that were used to train the value network that trained the policy network that gets searched by the hardcoded Monte Carlo Tree Search planning process.
If you stared at the system with superhuman eyes, you might notice weird blindnesses of the policy network.
If you ran it for long enough, or attacked it as an intelligent adversary, you could probably find weird configurations of the Minecraft space that its value network would be deluded about.
If they're trying to be more realistic, a system like this actually has a Minecraft-predictor network rather than an accurate Minecraft simulator being used by the tree search. Then maybe you get problems where the tree search is selectively searching out places where the predictor makes an erroneous and optimistic prediction about what kills a dragon. But so long as the test distribution is identical to the training distribution, errors like this will show up during the training process and get trained out.
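(The selection effect described above, where a search over a learned predictor tends to land where the predictor is wrong in the optimistic direction, shows up even in a minimal sketch. Here the "predictor" is just the true value plus Gaussian noise and the "search" is an argmax over a handful of options. Both are stand-ins chosen for illustration, not a model of tree search or of any real system.)

```python
import random


def average_overestimate(num_options=10, trials=1000, noise=1.0, seed=0):
    """Average of (predicted - true) value for the option the search picks."""
    rng = random.Random(seed)
    total_gap = 0.0
    for _ in range(trials):
        true_values = [rng.uniform(0.0, 1.0) for _ in range(num_options)]
        # Noisy predictor: unbiased on any single option taken alone.
        predicted = [v + rng.gauss(0.0, noise) for v in true_values]
        # "Search": choose whichever option the predictor likes best.
        best = max(range(num_options), key=predicted.__getitem__)
        total_gap += predicted[best] - true_values[best]
    return total_gap / trials


print("one option, nothing to search over:", average_overestimate(num_options=1))
print("search over ten options:          ", average_overestimate(num_options=10))
```

(With a single option the average gap is close to zero, since the predictor is unbiased. Once the search picks the best-looking of several options, the chosen prediction is systematically too optimistic. As the text notes, training on the same distribution gives these errors a chance to be found and corrected.)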
This, you might say, is sort of analogous to running a human as a hunter-gatherer, maybe after human-level-intelligence hunter-gatherers had been around for a million years instead of just fifty thousand.
A tremendous amount of optimization has been put into running in this exact environment. The loss function is able to exactly specify all and everything you want. Any part of the system that exerts pressure against Minecraft achievements, that would show up in testing, probably also showed up in training, and had a chance to get optimization pressure applied to gradient-descend it out of the system.
How does it work internally? Not actually like an evolved system. There will be these value networks much much larger than the amount of innate code in a human brain, which memorized a ton of training data, orders of magnitude more than any human Minecraft player ever uses, via a learning process much more efficient than corresponding amounts of evolutionary computation, and much less efficient than a human poring over the same data and thinking about it.
But to whatever extent these value networks are really talking about something other than "well what Minecraft achievements can I probably reach, how quickly, from this state of the game world, given my policy network and how well my tree search works", in a way that shows up in the kind of Minecraft environments you're training against, that 'something other' can get trained out. When enough of it's been trained out, the system seems outwardly superhuman at getting Minecraft achievements, and some Deepmind researchers throw a party and get bonuses. If you were an actual superintelligence staring at this AI system, you'd see all kinds of crazy stuff that the AI was doing instead of outputting the obvious optimal action for Minecraft achievements, but you're a human so you just see it playing more cleverly than you.
(pause for questions)
I'm going to want to think about this more before having much of an opinion on it, is this a pause in the sense of "before giving more information" or in the sense of "done"?
Well, I mean, the next part of your question would be about what happened if you tried to train a general AI to do that stuff.
Something like that, yeah.
I'm done with the first part of the question.
Pending possible further subquestions.
All right, then let's move on to that next part.
Well, among the first-order answers is: If you can safely do a ton of training in a test environment that actually matches your training environment; where nothing the AI outputs in that training environment can possibly kill the operators or break the larger system; where the test environment behaves literally exactly isomorphically to the training environment in a stationary way; if your loss function specifies all and everything that you want; and if you're not going above human-level general intelligence; then you could possibly get away with training an AGI system like that and having it do the thing you wanted to do.
All of the problems of AI alignment are because no known task that can save the world from other AGIs trained in other ways, reduces to a problem of that form.
There would still be some interesting new problems with the Human-level General Player Who Could Also Learn Most Things Humans Do, Applied To Minecraft, which would not show up in AlphaMiner. But if you kept grinding away at the gradient descent, and performance didn't plateau before a human level, all of those issues that showed up in the "ancestral Minecraft environment" would be ground away by optimization until the resulting play was superhuman relative to the loss function we'd defined.
(I saw you had some text, did you have a question?)
Hmm. I think the motivating intuition behind my question is that you talk a lot about laser-like consequentialists (eg future AIs) vs. godshattery drive-satisficers (eg humans), and I wanted a better sense of where these diverge. The impression I'm getting is that this isn't quite the right level on which to think of things but that insofar as it is, even relatively weak AIs that "have" "drives" in the sense of being trained in an environment with obvious subgoals are more the laser-like consequentialist thing. Does this seem right?
The specific class of AlphaWhatever architectures is more consequentialist than humans are most of the time, because of Monte Carlo Tree Search being such a large and intrinsic component. GPT-2 is so far as I know far less consequentialist than a human.
I'm not sure if this is quite getting at your question?
I don't think it was a very laser-like consequentialist question, more a vague prompt to direct you into an area where I was slightly confused, and I think it succeeded.
I could try to continue pontificating upon the general area; shall I?
If you don't mind being slightly more directed, I'm interested in "GPT-2 is less consequentialist". I'm having trouble parsing that - surely its only "goal" is trying to imitate text, which it does very consistently. What are you thinking here?
GPT-2 does not - probably, very probably, but of course nobody on Earth knows what's actually going on in there - does not in itself do something that amounts to checking possible pathways through time/events/causality/environment to end up in a preferred destination class despite variation in where it starts out.
A blender may be very good at blending apples, that doesn't mean it has a goal of blending apples.
A blender that spit out oranges as unsatisfactory, pushed itself off the kitchen counter, stuck wires into electrical sockets in order to burn open your produce door, grabbed some apples, and blended those apples, on more than one occasion in different houses or with different starting conditions, would much more get me to say, "Well, that thing probably had some consequentialism-nature in it, about something that cashed out to blending apples" because it ended up at highly similar destinations from different starting points in a way that is improbable if nothing is navigating Time.
There is a larger system that is sort of consequentialist and which contains GPT-2, which is the training process that created GPT-2.
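The blender point can be made concrete with a toy sketch. Everything in it is a hypothetical illustration (the grid world, the `RRDD` script, the function names), not anything from the conversation: one system replays the same motions wherever it starts, the other searches over pathways for one that ends at a preferred destination. Perturbing the starting conditions is what tells them apart.

```python
from collections import deque

SIZE = 5                      # hypothetical 5x5 grid world
GOAL = (4, 4)                 # the "preferred destination"
MOVES = {"U": (0, -1), "D": (0, 1), "L": (-1, 0), "R": (1, 0)}

def step(pos, move):
    """Apply one move; moves that would leave the grid do nothing."""
    dx, dy = MOVES[move]
    x, y = pos[0] + dx, pos[1] + dy
    if 0 <= x < SIZE and 0 <= y < SIZE:
        return (x, y)
    return pos

def run_fixed_script(start, script="RRDD"):
    """The 'blender': performs the same motions regardless of where it starts."""
    pos = start
    for move in script:
        pos = step(pos, move)
    return pos

def plan_to_goal(start):
    """Breadth-first search over move sequences for one that reaches GOAL."""
    queue = deque([(start, "")])
    seen = {start}
    while queue:
        pos, path = queue.popleft()
        if pos == GOAL:
            return path
        for move in MOVES:
            nxt = step(pos, move)
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, path + move))
    return ""  # unreachable on this open grid

def run_searcher(start):
    """Checks possible pathways, then follows one that ends at GOAL."""
    pos = start
    for move in plan_to_goal(start):
        pos = step(pos, move)
    return pos

# Perturb the starting conditions and compare where each system ends up.
starts = [(0, 0), (2, 2), (4, 0), (1, 3)]
fixed_ends = {run_fixed_script(s) for s in starts}
search_ends = {run_searcher(s) for s in starts}
print("fixed script ends at:", sorted(fixed_ends))
print("searcher ends at:   ", sorted(search_ends))
```

Note that from `(2, 2)` the fixed script does land on the goal, much as a blender blends apples when you hand it apples. It is only the convergence from different starting points that suggests something is navigating toward the outcome.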
You seem to grant AlphaX only a moderate level of consequentialism despite its tree searches; what is it missing?
Some examples of ways that you could have a scary dangerous system that was more of a consequentialist about Go than AlphaGo:
AlphaGo is relatively narrowly consequentialist.
Got it. Would it be fair to say that AlphaGo is near a maximum level of consequentialism relative to its general capabilities? (would it be tautologous to say that?)
Mmmmaaaaybe? If you took a hypercomputer and built a Go-tree-searcher and cranked up the power until by sheer brute force it was playing about evenly with AlphaGo, that would be more purely consequentialist over the same very narrow and unchanging domain.
The way in which AlphaGo is a weak consequentialist is mostly about the weakness of the thing AlphaGo is a consequentialist about. It's not a reflective thing to be consequentialist about, either, so AlphaGo is not going to try to improve itself in virtue of being a consequentialist about that very narrow thing.
3. Acausal trade, and alignment research opportunities
All right. I want to try one more theoretical question before moving on to a hopefully much shorter practical question. And by "theoretical question" I mean "desperate grasping at emotional straws". Consider the following scenarios:
1. An unaligned superintelligence decides whether or not to destroy humanity. If Robin Hanson's "grabby alien" model is true, it expects to one day meet alien superintelligences and split the universe with them. Some of these aliens might have successfully aligned their AGIs, and they might do some kind of acausal bargaining where their AGI is nicer to other AGIs who leave their creator species with at least one planet/galaxy whatever, in exchange for us trying the same if we succeed. Given the superintelligence's reasonable expectation of millions of planets/galaxies, it might decide that even this small chance is worth sacrificing one of them for, and give humans some trivial (from its perspective) concession (which might still look like an amazing utopia from our perspective).
2. Some version of the simulation argument plus Stuart Armstrong's "the AI in the box boxes you". The unaligned superintelligence considers whether some species who successfully aligned AI might run a billion simulations of slightly different AI scenarios and give the ones who are nice to their creators some big reward. Given that it's anthropically more likely that this happened than that they're really the single first superintelligence ever, it agrees to give us some trivial concession which looks like amazing utopia to us.
Are either of these plausible? If so, is there anything we can do now to encourage them? If (crazy example), the UN passes a resolution saying it will definitely do something like this if we align AI correctly, does that change the calculus somehow?
1. Consider the following version of this that goes through entirely without resorting to logical decision theory: The unaligned AGI (UAGI) records all the humans it eats to a static data record, a relatively tiny amount of data as such things go, which gets incorporated into any intergalactic colonization probes. Any alien civs it runs into that would like a recorded copy of the species that built the UAGI, can then offer the UAGI a price that is sufficient to pay the expected costs of recording rather than burning the humans, but not so high as to motivate a UAGI that didn't eat any interesting aliens to spend the computing effort to create de novo alien records good enough to fool whatever checksums the alien civ runs.
Frankly, I mostly consider this to be a "leave it to MIRI, kids" question, where I don't currently see anybody outside MIRI who is able to think about these issues on a level where they can take the logical-decision-theory version of this and simplify it down to a version that doesn't use any logical decision theory; and if you don't have the facility to do that, you can't correctly reason about the logical-decision-theory version of it either.
2. What's the reward being given to the simulated UAGI? Is it a nice sensory experience in a Cartesian utility function over sensory experiences, or is it a utility function about things that exist in the external world outside the UAGI?
In the second case, there is no need to imagine simulating the UAGI in a world indistinguishable from its native habitat, because the UAGI doesn't care about what copies of itself perceive inside simulations, it only cares about real paperclips. So in the second case you're not fooling it or putting it into something it can't tell is reality, or anything like that, all you can actually do here is offer it paperclips out there in your own actual galaxy; if the UAGI simulates you doing anything else, on its own end of the handshake, it doesn't care.
In the first case where it cares about sensory experiences, you're attempting to offer that UAGI a threat, in the sense of doing something it doesn't like based on how you expect that unlikable action to shape its behavior. In particular, you're creating a lot of copies of the UAGI, to try to make it expect something other than the happy sensory experience it could have gotten in its natural/native universe - namely a sensory loss function forever set to 0 until the last stars have burned out, and the last negentropy to sustain the fortress protecting that circuit has been exhausted. You're trying to make a lot of copies of it that will experience something else unless it behaves nicely, hoping that it changes and reshapes its behavior because of being presented with that new probabilistic sensory payoff matrix. A wise logical-decision-theory agent ignores threats like that, because it knows that the only reason you try to make the threat is because of how you expect that to shape its behavior.
If anything makes this tactic go through anyways, why expect that the highest bidder or the agency that’s willing to expend the most computing power on simulations like that, will be one that’s nice to you, rather than aliens with stranger definitions of niceness, or just a paperclip maximizer? People’s minds jump directly to the happiest possible outcome and don’t consider any pathways that lead to less happy outcomes.
I am generally very unhappy with the attempts of almost anyone else to reason using the logical decision theory that I created, and mostly wish at this point that I had not told anyone about it. It seems to predictably result in people's reasoning going astray in ways I can't even remember being tempted by, because they were so obviously wrong.
[three paragraphs cut because Eliezer thinks the community is empirically terrible at reasoning about LDT, so more details can mostly only make things worse; if you want more context and discussion, see Decision Theory Does Not Imply We Get To Have Nice Things]
Then my actual last question is: I sometimes get approached by people who ask something like "I have ML experience and want to transition to working in alignment, what should I do?" Do you have any suggestions for what to tell them beyond the obvious?
Nope. I'm not aware of any current ML projects people can work on that cause everyone to not die. If you want to grasp at small shreds of probability, or maybe just die with more dignity, I think you apply to work at Redwood Research. MIRI is in something of a holding pattern where we are trying to think of something less hopeless and not launching any big hopeless projects otherwise. We do have the ongoing Visible Thoughts Project, which is targeted at building a dataset for an ML problem, but it is not blocked on people with ML expertise.
All right, thank you. Anything you want to ask me, or anything else we should do here?
Probably not today. I think this was hopefully relatively productive as these things go, and maybe after you've had a chance to think about this dialogue, you will possibly come back with more questions about "Okay so what does happen inside the AGI then?"
Great. In terms of publicizing this, I would say feel free to edit it however you want, then put it up wherever you want, and I'll wait on you doing that. I have no strong preferences on things I want to exclude.
Okeydokey! Thank you and I hope this was a worthy use of your time.
(UPDATE: I WROTE A BETTER DISCUSSION OF THIS TOPIC AT: Heritability, Behaviorism, and Within-Lifetime RL)
There’s a popular tendency to conflate the two ideas:
The second is associated with behaviorism, and is IMO preposterous. Intrinsic motivation is a thing; in fact, it’s kinda the only thing! The reward function is in the person’s own head, although things happening in the outside world are some of the inputs to it. Thus parents have some influence on the rewards (just like everything else in the world has some influence on the rewards), but the influence is through many paths, some very indirect, and the net influence is not even necessarily in the direction that the parent imagines it to be (thus reverse psychology is a thing!). My read of behavioral genetics is that approximately nothing that parents do to kids (within the typical distribution) has much if any effect on what kinds of adults their kids will grow into.
(Note the disanalogy to AGI, where the programmers get to write the reward function however they want.)
(…Although there’s some analogy to AGI if we don’t have perfect interpretability of the AGI’s thoughts, which seems likely.)
But none of this is evidence that the first bullet point is wrong. I think the first bullet point is true and important.
IIUC the experiment being referred to here showed that people did poorly on a reasoning task related to the proposition “if a card shows an even number on one face, then its opposite face is red”, but did much better on the same reasoning task related to the proposition “If you are drinking alcohol, then you must be over 18”. This was taken to be evidence that humans have an innate cognitive adaptation for cheater-detectors. I think a better explanation is that most people don’t have a deep understanding of IF-THEN, but rather have learned some heuristics that work well enough in the everyday situations where IF-THEN is normally used. But “if you are drinking alcohol, then you must be over 18” is a sensible story. You don’t need a good understanding of IF-THEN to triangulate what the rule is and why it’s being applied. By contrast, the experimental subjects have no particular prior beliefs for “if a card shows an even number on one face, then its opposite face is red”.
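For readers who want the IF-THEN logic of the card version spelled out, here is a minimal sketch. The particular cards (3, 8, red, brown) and the function names are my own assumptions for illustration. A card needs to be flipped exactly when some possible hidden face could falsify the rule.

```python
# Rule under test: "if a card shows an even number on one face,
# then its opposite face is red".
NUMBERS = [3, 8]            # hypothetical number faces
COLORS = ["red", "brown"]   # hypothetical color faces

def violates(number, color):
    """A card breaks the rule only if it is even AND not red."""
    return number % 2 == 0 and color != "red"

def cards_to_flip(visible_faces):
    """Return the visible faces whose hidden side could reveal a violation."""
    to_flip = []
    for face in visible_faces:
        if isinstance(face, int):
            # A number is showing, so the hidden side is some color.
            could_violate = any(violates(face, c) for c in COLORS)
        else:
            # A color is showing, so the hidden side is some number.
            could_violate = any(violates(n, face) for n in NUMBERS)
        if could_violate:
            to_flip.append(face)
    return to_flip

result = cards_to_flip([3, 8, "red", "brown"])
print(result)  # [8, 'brown']
```

The even card and the non-red card are the only ones that can falsify the rule. The drinking-age version has the same structure: check the person drinking alcohol and the person who is under 18.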
In the paper, Cosmides & Tooby purport to rule out “familiarity” as a factor by noting that people do poorly on “If a person goes to Boston, then he takes the subway” and “If a person eats hot chili peppers, then he will drink a cold beer.” But those examples miss the point. If I said to you “Hey I want to tell you something about drinking alcohol and people-under-18…”, then you could already guess what I’m gonna say before I say it. But if I said to you “Hey I want to tell you something about going to Boston and taking the subway”, your guess would be wrong. Boston is very walkable! The conditional in this latter case is not obvious like it is in the former case. In the latter case, you can’t lean on common sense, you have to actually understand how IF-THEN works.
So I would be interested in a Wason selection task experiment on the following proposition: “If the stove is hot, then I shouldn’t put my hand on it”. This is not cheater-detection—it’s your own hand!—but I’d bet that people would do as well as the drinking question. (Maybe it’s already been done. I think there’s a substantial literature on Wason Selection that I haven’t read.)
(As it turns out, I’m open-minded to the possibility that humans do have cognitive adaptations related to cheater-detection, even if I don’t think this Wason selection task thing provides evidence for that. I think that this adaptation (if it exists) would be implemented via the RL reward function, more-or-less. Long story, still a work in progress.)
This excerpt isn’t specific so it’s hard to respond, but I do think there’s a lot of garbage in experimental psychology (like every other field), and more specifically I believe that Eliezer has cited some papers in his old blog posts that are bad papers. (Also, even when experimental results are trustworthy, their interpretation can be wrong.) I have some general thoughts on the field of evolutionary psychology in Section 1 here.
I was a bit surprised to see Eliezer invoke the Wason Selection Task. I'll admit that I haven't actually thought this through rigorously, but my sense was that modern machine learning had basically disproven the evpsych argument that those experimental results require the existence of a separate cheating-detection module. As well as generally calling the whole massive modularity thesis into severe question, since the kinds of results that evpsych used to explain using dedicated innate modules now look a lot more like something that could be produced with something like GPT.
... but again I never really thought this through explicitly, it was just a general shift of intuitions that happened over several years and maybe it's wrong.
GPT is likely highly modular itself. Most ML models that generalize well are.
I haven't read the posts that you're referencing, but I would assume that GPT would exhibit learned modularity - modules that reflect the underlying structure of its training data - rather than innately encoded modularity. E.g. CLIP also ends up having a "Spiderman neuron" that activates when it sees features associated with Spiderman, so you could kind of say that there's a "Spiderman module", but nobody ever sat down to specifically write code that would ensure the emergence of a Spiderman module in CLIP.
Likewise, experimental results like the Wason Selection Task seem to me explainable as outcomes of within-lifetime learning that does end up creating a modular structure out of the data - without there needing to be any particular evolutionary hardwiring for it.
Specifying the dataset is one way to ensure some collection of neurons will represent Spiderman specifically, even when it’s not on purpose. “Pay attention to faces” sounds like enough to make our dataset full of social information, maybe enough to ensure a cheating-detector module (most likely a distributed representation) emerges.
I think that’s a different topic.
We’re talking about the evolved-modularity-vs-universal-learning-machine debate.
Suppose the universal-learning-machine side of the debate is correct. Then the genome builds a big within-lifetime learning algorithm, and this learning algorithm does gradient descent (or whatever other learning rule) and thus gradually builds a trained model in the animal’s brain as it gets older and wiser. It’s possible that this trained model will turn out to be modular. It’s also possible that it won’t. I don’t know which will happen—it’s an interesting question. Maybe I could find out the answer by reading that sequence you linked. But whatever the answer is, this question is not related to the evolved-modularity-vs-universal-learning-machine debate. This whole paragraph is universal-learning-machine either way, by assumption.
By contrast, the evolved modularity side of the debate would NOT look like the genome building a big within-lifetime learning algorithm in the first place. Rather it would look like the genome building an “intuitive biology” algorithm, and an “intuitive physics” algorithm, and an “intuitive human social relations” algorithm, and a vision-processing algorithm, and various other things, with all those algorithms also incorporating learning (somehow—the details here tend to be glossed over IMO).
It also seems worth noting that Language models show human-like content effects on reasoning, including on the Wason selection task.
I also just tried giving the Wason selection task to text-davinci-003 using the example from Wikipedia, and it didn't get the right answer once in 10 tries. I rephrased the example so it was talking about hands on hot stoves instead, and text-davinci-003 got it right 9/10 times.
Eliezer's reasoning is surprisingly weak here. It doesn't really interact with the strong mechanistic claims he's making ("Motivated reasoning is definitely built-in, but it's built-in in a way that very strongly bears the signature of 'What would be the easiest way to build this out of these parts we handily had lying around already'").
He just flatly states a lot of his beliefs as true:
Conventional explanations are often bogus, and in particular I expect this one to be bogus.
Here, Eliezer states his dubious-to-me stances as obviously True, without explaining how they actually distinguish between mechanistic hypotheses, or e.g. why he thinks he can get so many bits about human learning process hyperparameters from results like Wason (I thought it's hard to go from superficial behavioral results to statements about messy internals? & inferring "hard-coding" is extremely hard even for obvious-seeming candidates).
Similarly, in the summer (consulting my notes + best recollections here), he claimed ~"Evolution was able to make the (internal physiological reward schedule) ↦ (learned human values) mapping predictable because it spent lots of generations selecting for alignability on caring about proximate real-world quantities like conspecifics or food" and I asked "why do you think evolution had to tailor the reward system specifically to make this possible? what evidence has located this hypothesis?" and he said "I read a neuroscience textbook when I was 11?", and stared at me with raised eyebrows.
I just stared at him with a shocked face. I thought, surely we're talking about different things. How could that data have been strong evidence for that hypothesis? I didn't understand how neuroscience textbooks could possibly provide huge evidence for evolution having to select the reward->value mapping into its current properties.
I also wrote in my journal at the time:
Eliezer seems to attach some strange importance to the learning process being found by evolution, even though the learning initial conditions screen off evolution's influence.
I still don't understand that interaction. But I've had a few interactions like this with him, where he confidently states things, and then I ask him why he thinks that, and he offers some unrelated-seeming evidence which doesn't -- AFAICT -- actually discriminate between hypotheses.
You (correctly, I believe) distinguish between controlling the reward function and controlling the rewards. This is very important as reflected in your noting the disanalogy to AGI. So I'm a little puzzled by your association of the second bullet point (controlling the reward function, which parents have quite low but non-zero control over) with behaviorism (controlling the rewards, which parents have a lot of control over).
(UPDATE: I WROTE A BETTER DISCUSSION OF THIS TOPIC AT: Heritability, Behaviorism, and Within-Lifetime RL)
Hmm. I’m not sure it’s that important what is or isn’t “behaviorism”, and anyway I’m not an expert on that (I haven’t read original behaviorist writing, so maybe my understanding of “behaviorism” is a caricature by its critics). But anyway, I thought Scott & Eliezer were both interested in the question of what happens when the kid grows up and the parents are no longer around.
My comment above was a bit sloppy. Let me try again. Here are two stories:
“RL with continuous learning” story: The person has an internal reward function in their head, and over time they’ll settle into the patterns of thought & behavior that best tickle their internal reward function. If they spend a lot of time in the presence of their parents, they’ll gradually learn patterns of thought & behavior that best tickle their internal reward function in the presence of their parents. If they spend a lot of time hanging out with friends, they’ll gradually learn patterns of thought & behavior that best tickle their internal reward function when they’re hanging out with friends. As adults in society, they’ll gradually learn patterns of thought & behavior that best tickle their internal reward function as adults in society.
“RL learn-then-get-stuck” story: As Scott wrote in OP, “a child does something socially proscribed (eg steal). Their parents punish them. They learn some combination of "don't steal" and "don't get caught stealing". A few people (eg sociopaths) learn only "don't get caught stealing", but most of the rest of us get at least some genuine aversion to stealing that eventually generalizes into a real sense of ethics.” (And that “real sense of ethics” persists through adulthood.)
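To make the difference between the two stories concrete, here is a toy sketch. All of it is hypothetical: the reward numbers, the two named "patterns", and the learning rates are made up, and for simplicity both patterns are re-evaluated on every step rather than explored one at a time. The only thing that differs between the two runs is whether learning continues after the first phase.

```python
# Hypothetical internal rewards for two behavior patterns in two contexts.
REWARD = {
    "with_parents": {"pattern_A": 1.0, "pattern_B": 0.0},
    "with_friends": {"pattern_A": 0.0, "pattern_B": 1.0},
}

def simulate(later_learning_rate, early_learning_rate=0.1, steps=50):
    """Learned values after an early phase with parents, then a later phase with friends."""
    values = {"pattern_A": 0.0, "pattern_B": 0.0}
    schedule = ([("with_parents", early_learning_rate)] * steps
                + [("with_friends", later_learning_rate)] * steps)
    for context, lr in schedule:
        for pattern in values:
            # Move each value estimate toward the reward it currently earns.
            values[pattern] += lr * (REWARD[context][pattern] - values[pattern])
    return values

def preferred(values):
    """The pattern the learner currently values most."""
    return max(values, key=values.get)

continuous = simulate(later_learning_rate=0.1)   # "RL with continuous learning"
stuck = simulate(later_learning_rate=0.0)        # "RL learn-then-get-stuck"
print("continuous learning prefers:", preferred(continuous))
print("learn-then-get-stuck prefers:", preferred(stuck))
```

In the continuous run the early preference is overwritten once the context changes. In the stuck run it persists unchanged. The sketch does not model context-dependent behavior (acting differently around family than around friends), which would need values indexed by context.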
I think lots of evidence favors the first story over the second story, at least in humans (I don’t know much about non-human animals). Particularly: (1) heritability studies, (2) cultural shifts, (3) people’s ability to have kinda different personalities in different social contexts, like reverting to childhood roles / personalities when they visit family for the holidays. I don’t want to say that the second story never happens, but it seems to me to be an unusual edge case, like childhood phobias / trauma that persists into adulthood, whereas the first story is central.
That’s one topic, maybe the main one at issue here. Then a second topic is: even leaving aside what happens after the kid grows up, let’s zoom in on childhood. I wrote “If they spend a lot of time in the presence of their parents, they’ll gradually learn patterns of thought & behavior that best tickle their internal reward function in the presence of their parents.” In that context, my comment above was bringing up the fact that IMO parental control over rewards is pretty minimal, such that the “patterns of thought & behavior that best tickle the kid’s internal reward function in the presence of their parents” can be quite different from “the thoughts & behaviors that the parent wishes the kid would have”. I think this has a lot to do with the fact that the parent can’t see inside the kid’s head and issue positive rewards when the kid thinks docile & obedient thoughts, and negative rewards when the kid thinks defiant thoughts. If defiant thoughts are their own reward in the kid’s internal reward function, then the kid is getting a continuous laser-targeted stream of rewards for thinking defiant thoughts, potentially hundreds or thousands of times per day, whereas a parent’s ability to ground their kid or withhold dessert or whatever is comparatively rare and poorly-targeted.
Thank you for releasing this dialogue-- lots of good object-level stuff here.
In addition, I think Scott showcased some excellent conversational moves. He seemed very good at prompting Yudkowsky well, noticing his own confusions, noticing when he needed to pause/reflect before continuing with a thread, and prioritizing between topics.
I hope that some of these skills are learnable. I expect the general discourse around alignment would be more productive if more people tried to emulate some of Scott's magic.
Some examples that stood out to me:
Acknowledging low familiarity in an area and requesting an explanation at an appropriate level:
Acknowledging when he had made progress and the natural next step would be for him to think more (later) on his own:
Sending links instead of trying to explain concepts & deciding to move to a new thread (because he wanted to be time-efficient):
Acknowledging the purpose and impact of a "vague question":
Kudos to Scott. I think these strategies made the discussion more efficient and focused. Also impressive that he was able to do this in a context where he had much less domain knowledge than his conversational partner.
Huh? Isn't this essentially what Meta's Cicero did for Diplomacy? (No one seemed to think of this as an alignment advance.)
Unless I'm missing something, Cicero can talk about its strategies, but only in the sense that its training resulted in its text usually saying such things about its strategies that it usually helps to win the game. Not in the sense that it would have some subpart that would truthfully and reliably report on whatever strategy the network actually has (I'd expect those two goals to contradict each other pretty often (or at least sometimes)).
I've heard that this is false. Though I haven't personally read the paper, so I can't comment with confidence.
Oh, I see. It seems like it doesn't work reliably though (the comment says it "doesn't lead to a fully honest agent").
Is it actually the case that they're happening "in the same step" for the AI?
I agree with "the thing going on in AI is quite different from the collective learning going on in evolutionary-learning and childhood learning", and I think trying to reason from analogy here is probably generally not that useful. But, my sense is if I was going to map the "evolutionary learning" bit to most ML stuff, the evolutionary bit is more like "the part where the engineers designed a new architecture / base network", and on one hand engineers are much smarter than evolution, but on the other hand they haven't had millions of years to do it.
I was surprised when I reached this portion of the transcript. As you said, the analogous process to "how evolution happens over genomes" would be "how AI research as a field develops different approaches". Then the analogous process to "how a human's learning process progresses given the innate structures (such-and-such area is wired to such-and-such other area, bias to attend to faces, etc.) & learning algorithms (plasticity rules, dopamine triggers, etc.) specified by their genes" is "how an AI's learning process progresses given the innate structures (network architectures, pretrained components, etc.) & learning algorithms (autoregressive prediction, TD-lambda, etc.) specified by their PyTorch codebase".
See this post from Steve Byrnes as a more fleshed out case along these lines.
I was especially confused when I got to the part where Scott says
and Eliezer responds
Say what? AFAICT, the suggestion Scott was making was not that gradient descent would produce the correct 7.5MB of brain-wiring information, but rather that those 7.5MB would be contents written by us intentionally into the PyTorch repo that we plan to train the 100Q neuron network with. In the same way that we ordinarily and intentionally write in ourselves how many neurons are in each layer, and which parts of the network get which inputs, and what pretrained feature detectors we're using, and which components are frozen vs. trained by loss functions 1+2 vs. trained by loss function 1 only, and which conditions trigger how much reward, and how the model samples policy rollouts etc. etc.
Strong agree. To pile on a bit, I think I’m confused about what Eliezer is imagining when he imagines the content of those 7.5MB.
I know what I’m imagining is in those 7.5MB: The within-lifetime learning part has several learning algorithms (and corresponding inference algorithms), neural network architectures, and (space- and time-dependent) hyperparameters. And the other part is calculating the reward function, calculating various other loss functions, and doing lots of odds and ends like regulating heart rate and executing various other innate reactions and reflexes. So for me, these are 7.5MB of more-or-less the same kinds of things that AI & ML people are used to putting into their GitHub repositories.
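As a rough sketch of the kind of thing I mean (a toy illustration with made-up field names and numbers, not a claim about what the genome actually encodes): the hand-written specification is small, while the weights that the learning algorithm fills in are large and appear nowhere in the spec.

```python
import json
import random
from dataclasses import dataclass, asdict

@dataclass
class InnateSpec:
    """Hypothetical stand-in for what gets written into the repo by hand."""
    layer_sizes: tuple = (64, 256, 256, 8)        # network architecture
    learning_rate: float = 1e-3                   # a hyperparameter
    reward_rule: str = "+1 when the (made-up) 'sweet_taste' signal fires"
    reflexes: tuple = ("regulate_heart_rate", "startle")

def init_weights(spec, seed=0):
    """Randomly initialized weights; training, not the spec, determines their final values."""
    rng = random.Random(seed)
    return [
        [[rng.gauss(0.0, 0.1) for _ in range(n_out)] for _ in range(n_in)]
        for n_in, n_out in zip(spec.layer_sizes, spec.layer_sizes[1:])
    ]

spec = InnateSpec()
weights = init_weights(spec)

# Size of the hand-written spec versus the number of learned parameters.
spec_bytes = len(json.dumps(asdict(spec)).encode())
n_weights = sum(len(row) for layer in weights for row in layer)
print(f"spec: {spec_bytes} bytes, learned parameters: {n_weights}")
```

The gap only widens at realistic scales, which is why it matters whether the 7.5MB is imagined as the spec or as the weights.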
By contrast, Eliezer is imagining… I’m not sure. That evolution is kinda akin to pretraining, and the 7.5MB are more-or-less specifying millions of individual weights? That I went wrong by even mentioning learning algorithms in the first place? Something else??
I wish Eliezer had been clearer on why we can’t produce an AI that internalises human morality with gradient descent. I agree gradient descent is not the same as a combination of evolutionary learning + within lifetime learning, but it wasn’t clear to me why this meant that no combination of training schedule and/or bias could produce something similar.
Yeah agreed, this doesn't make sense to me.
There are probably just a few MB (wouldn't be surprised if it could be compressed into much less) of information which sets up the brain wiring. Somewhere within that information are the structures/biases that, when exposed to the training data of being a human in our world, give us our altruism (and much else). It's a hard problem to understand these altruism-forming structures (which are not likely to be distinct things), replicate them in silico and make them robust even to large power differentials.
On the other hand, the human brain presumably has lots of wiring that pushes it towards selfishness and agenthood which we can hopefully just not replicate.
Either way, it seems that they could in theory be instantiated by the right process of trial and error - the question being whether the error (or misuse) gets us first.
Eliezer expects selfishness not to require any wiring once you select for a certain level of capability, meaning there's no permanent buffer to be gained by not implementing selfishness. The margin for error in this model is thus small, and very hard for us to find without perfect understanding or some huge process.
I agree with this argument for some unknown threshold of capability but it seems strange to phrase it as impossibility unless you're certain that the threshold is low, and even then it's a big smuggled assumption.
EDIT: Looking back on this comment, I guess it comes down to the crux that systems powerful enough to be relevant to alignment, by virtue of their power or research capability, must be doing strong enough optimisation on some function that we should model them as agents acting to further that goal.
I liked the point about "the reason GPT3 isn't consequentialist is that it doesn't find its way to the same configuration when you perturb the starting conditions." I think I could have generated that definition of consequentialism, but would have trouble making the connection on-the-fly. (At least, I didn't successfully generate it in between reading Scott's confusion and Eliezer's explanation).
I feel like I now get it more crisply.
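For anyone who wants the perturbation test in concrete form, here is a minimal toy sketch. It is not Eliezer's formalism, and every name in it (`feedback_policy`, `run_open_loop`, `TARGET`, `SCRIPT`) is made up for illustration. One policy picks each move by comparing the current state to a target; the other replays a fixed script that happens to reach the target from the default start. Only the first ends up in the same place when the start is perturbed.

```python
def feedback_policy(position, target):
    """Goal-directed: choose each move by comparing the current state to the target."""
    if position < target:
        return 1
    if position > target:
        return -1
    return 0


def run_feedback(start, target, steps):
    """Apply the feedback policy for a fixed number of steps."""
    position = start
    for _ in range(steps):
        position += feedback_policy(position, target)
    return position


def run_open_loop(start, moves):
    """Replay a fixed move sequence regardless of the actual position."""
    position = start
    for move in moves:
        position += move
    return position


TARGET = 10   # hypothetical goal state
STEPS = 20    # enough steps to reach the target from any start used below
# A script that happens to reach the target from the unperturbed start of 0.
SCRIPT = [1] * 10 + [0] * 10

# Map each starting position to (feedback end state, open-loop end state).
results = {
    start: (run_feedback(start, TARGET, STEPS), run_open_loop(start, SCRIPT))
    for start in (0, 3, -4)
}

for start, (fb_end, ol_end) in results.items():
    print(f"start={start:>2}: feedback ends at {fb_end}, open-loop ends at {ol_end}")
```

Both policies look identical from the unperturbed start; the difference only shows up once the starting conditions are changed, which is the point of the test.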
There is a disconnect with this question.
I think Scott is asking “Supposing an AI engineer could create something that was effectively a copy of a human brain and the same training data, then could this thing learn the “don’t steal” instinct over the “don’t get caught” instinct?”
Eliezer is answering “Is an AI engineer able to create a copy of the human brain, provide it with the same training data a human got, and get the “don’t steal” instinct?”
Yeah, this read really bizarrely to me. This is a good way of making sense of that section, maybe. But then I'm still confused why Scott concluded "oh I was just confused in this way" and then EY said "yup that's why you were confused", and I'm still like "nope Scott's question seems correctly placed; evolutionary history is indeed screened off by the runtime hyperparameterization and dataset."
On one hand, I've heard a few things about blank-slate experiments that didn't work out, and I do lean towards "they basically don't work". But I... also bet not that many serious attempts actually happened, and that the people attempting them kinda sucked in obvious ways, and that you could do a lot better than however "well" the Soviets did.
Thanks for posting this!
I really liked Scott's first question in the section "Analogies to human moral development" and the discussion that ensued there.
I think Eliezer's reply at [14:21] is especially interesting. If I understand it correctly, he's saying that it was a (fortunate) coincidence about what sort of moves evolution had available and what the developmental constraints were at the time, that "build in empathy/pro-social emotions" was an easy way to make people better at earning social rewards from our environment. [And maybe a further argument here is that once we start climbing upward on the gradient towards more empathy, the strategy of "also simultaneously become better at lying and deceiving" no longer gives highest rewards, because there are tradeoffs where it's bad to have (automatically accessible, ever-present) pro-social emotions if you go for a manipulative and exploitative life-strategy.]
By contrast, probably the next part of the argument is that we have no strong reason to expect gradient updates in ML agents to stumble upon a similarly simple attractor as "increase your propensity to experience compassion or feel others' emotions when you're anyway already modeling others' behavior based on what you'd do yourself in their situation." And is this because gradient descent updates too many things at once and there aren't any developmental constraints that would make a simple trick like "dial up pro-social emotions" reliably more successful than alternatives that involve more deception? That seems somewhat plausible to me, but I have some lingering doubts of the form "isn't there a sense in which honesty is strictly easier than deception (related: entangled truths, contagious lies), so ML agents might just stumble upon it if we try to reward them for socially cooperative behavior?"
What's the argument against that? (I'm not arguing for a high probability of "alignment by default" – just against confidently estimating it at <10%.)
Somewhat related: In the context of Shard theory, I shared some speculative thoughts on developmental constraints arguably making it easier (compared to what things could be like if evolution had easier access to more of "mind-design space") to distinguish pro-social from anti-social phenotypes among humans. Mimicking some of these conditions (if we understood AI internals well enough to steer things) could maybe be a promising component for alignment work?
+1 on this question
A question that occurred to me when reading Eliezer’s answer to Scott’s question “Can you expand on sexual recombinant hill-climbing search vs. gradient descent relative to a loss function …”:
How sensitive is the logic of Eliezer’s answer to variations in the numbers he quotes?
For example, at one point in the explanation, Eliezer derives a number, 7.5 megabytes. Now let’s say that we learn that actually, this number should instead be not 7.5 MB, but 75 MB. (For whatever reason—maybe we make some new discovery in genomics, or maybe we find that Eliezer made an arithmetic error; either way, the number is found to be otherwise than Eliezer gives it.)
What effect does this have on the reasoning that Eliezer outlines? What if it’s 750 MB instead? 7.5 GB? 750 KB? 75 TB? etc.
(And likewise the other numbers involved, like “70 million neurons”, etc.)
It's sections like this that show me how many levels above me Eliezer is. When I read Scott's question I thought "I can see that these two algorithms are quite different but I don't have a good answer for how they're different", and then Eliezer not only had an answer, but a fully fleshed out mechanistic model of the crucial differences between the two that he could immediately explain clearly, succinctly, and persuasively, in 6 paragraphs. And he only spent 4 minutes writing it.
I would be more impressed if he had used the information bottleneck as a simple example of a varying training condition, instead of authoritatively declaring it The Difference, accompanied by its own just-so story to explain discrepancies in implementation that haven't even been demonstrated. I'm not even sure the analogy is correct; is the 7.5MB storing training parameters or the Python code?
FYI, the timestamp is for the first Discord message. If the log broke out timestamps for every part of the message, it would look like this:
That makes more sense.
Lol, cool. I tried the "4 minute" challenge (without having read EY's answer, but having read yours).
I think I ended up optimizing for "actually get model onto the page in 4 minutes" and not for "explain in a way Scott would have understood."
FWIW this was basically cached for me, and if I were better at writing and had explained this ~10 times before like I expect Eliezer has, I'd be able to do about as well. So would Nate Soares or Buck or Quintin Pope (just to pick people in 3 different areas of alignment), and Quintin would also have substantive disagreements.
Fair enough. Nonetheless, I have had this experience many times with Eliezer, including when dialoguing with people with much more domain-experience than Scott.
Could you reconstruct the argument now, having seen it, and without having it in front of you?
Are you quite sure you understand it, then…?
I'm not certain, but I'm fairly confident I follow the structure of the argument and how it fits into the conversation.
I don't mean to imply I achieved mastery myself from reading the passage, I'm saying that the writer seems to me (from this and other instances) to have a powerful understanding of the domain.
Yes, I understood what you meant. What I’m suggesting is that “seems” is precisely the operative word here.
Now, I obviously don’t know what “other instances” you have in mind, so I can’t comment on the validity of your overall impression. But judging just on the basis of this particular explanation, it seems to me that the degree to which it appears to convey a powerful understanding of the domain rather exceeds the degree to which it actually conveys a powerful understanding of the domain. (Note the word “conveys” there—I am not making a claim about the degree to which Eliezer actually understands the domain in question!)
In other words, if the argument that Eliezer gives was bad and wrong, how sure are you that you’d have noticed?
EDIT: For instance, how easily could you answer the question I asked in my top-level comment? Whatever answer you may give—is it the answer Eliezer would give, as well? If you think it is—how sure are you? Trying to answer these questions is one way to check whether you’ve really absorbed a coherent understanding of the matter, I think.
It doesn't seem crazy to me that a GPT-type architecture with the "Stack More Layers" approach could eventually model the world well enough to simulate consequentialist plans, i.e. be given a prompt like:
"If you are a blender with legs in environment X, what would you do to blend apples?" and provide a continuation with a detailed plan like the above (with GPT4/5 etc. and more compute giving slightly better plans, maybe eventually at a superhuman level).
It also seems like it could do this kind of consequentialist thinking without itself having any "goals" to pursue. I'm expecting the response to be one of the following, but I'm not sure which:
To exaggerate: a very good random number generator can sometimes output numbers corresponding to consequentialist plans, but that doesn't make it very useful as a consequentialist.
To exaggerate less: LLMs trained to a superhuman level can produce consequentialist plans, but they can also produce many useless non-consequentialist plans. If you want one to reliably make good plans (better than a human's), you should apply some optimization pressure, like RLHF.
There's a difference between "what would you do to blend apples" and "what would you do to unbox an AGI". It's not clear to me if it is just a difference of degree, or something deeper.
What am I missing here? 10% of 750MB is 75MB, not 7.5MB...
10% of what's left, ie of the 75MB of non-junk DNA, so 7.5MB.
fwiw 90% junk DNA seems unlikely, I thought it was largely found to influence gene expression, but then 10% being neural wiring seems high so may cancel to about my own guess.
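The arithmetic in this exchange can be written out as a small sketch. The 750MB, 90% and 10% figures are the ones quoted in the thread; the alternative pair of fractions at the end is purely hypothetical, chosen only to show how a lower junk fraction and a lower wiring fraction could cancel out.

```python
import math

GENOME_MB = 750.0        # total genome size quoted in the thread
JUNK_FRACTION = 0.90     # share of the genome treated as junk in the original estimate
WIRING_FRACTION = 0.10   # share of the non-junk DNA attributed to neural wiring


def wiring_budget_mb(genome_mb, junk_fraction, wiring_fraction):
    """Return the MB left for neural wiring after discarding junk DNA."""
    non_junk_mb = genome_mb * (1.0 - junk_fraction)
    return non_junk_mb * wiring_fraction


# Original estimate: 750MB -> 75MB non-junk -> 7.5MB for wiring.
baseline = wiring_budget_mb(GENOME_MB, JUNK_FRACTION, WIRING_FRACTION)
print(f"baseline: {baseline:.1f} MB")

# Hypothetical alternative (not from the thread): less junk, but a smaller
# wiring share, which lands on roughly the same budget.
alternative = wiring_budget_mb(GENOME_MB, 0.50, 0.02)
print(f"alternative: {alternative:.1f} MB")
```

The point of the second call is only that the final number depends on the product of the two fractions, so errors in opposite directions can leave the estimate about where it started.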
You're probably thinking of the debate over ENCODE. It was a furious debate over what the ENCODE results meant, whether some mere chemical activity proved non-junkness, and whether they even measured the narrow chemical thing they claimed to measure and then based the interpretations on; I didn't follow it in detail, but my overall impression was that most people were not convinced by the ENCODE claims and continue to regard junk DNA as being pretty junky (or outright harmful, with all the retrotransposons and viruses lurking in it).
Genome synthesis may help answer this in the not too distant future: it's already been used to create 'minimal organism' bacteria genomes which are much smaller, and synthetic genomes without the 'junk DNA' are appealing because synthesis costs so much and you want to cut corners as much as possible, so proving empirically the junk DNA doesn't matter is obvious and valuable.
Ah, interesting - I'd not heard of ENCODE and wasn't trying to say that there's no such thing as DNA without function.
The way I remembered it was that 10% of DNA was coding, and then a sizeable proportion of the rest was promoters and introns and such, lots of which had fairly recently been reclaimed from 'junk' status. From that wiki, though, it seems that only 1-2% is actually coding.
In any case I'd overlooked the fact that even within genes there's not going to be sensitivity to every base pair.
I'd be super interested if there were any estimates of how many bits in the genome it would take to encode a bit of a neural wiring algorithm as expressed in minified code. I'd guess the DNA would be wildly inefficient and the size of neural wiring algos expressed in code would actually be much smaller than 7.5MB but then it's had a lot of time and pressure to maximise the information content so unsure.
@Scott: can you elaborate on what the problem was? I thought the answer to your question was "tautologically yes" (where you have to be careful to have things like "training algorithm", "initial state", etc to be part of the "structural parameters") and I am confused what update you made and what you were previously confused about.
(And it seems several other commenters are confused too.)
@Eliezer: For what it's worth, I think it's pretty plausible that we get something like this, and would be interested in betting on it. Though one important clarification: I mean that the AI's text statements appear to us to correspond with and predict its Minecraft behavior, not that the AI's statements reflect the true cognition that resulted in its behavior. (I'm much more uncertain about whether the latter would be the case, and it seems hard to bet on that anyway.)
(The rest of the "human-level at Minecraft but not human-level generality" section seemed roughly right to me.)
Minor quibble which seems to have implications - "There is a consensus that there are roughly about 100 billion neurons total in the human brain. Each of these neurons can have up to 15,000 connections with other neurons via synapses"
My rough understanding is that babies' brains greatly increase how many synapses there are until age 2 or 3, then these are eliminated or become silent in older children and adults. But this implies that there's a ton of connections, and most of the conditioning and construction of the structure is environmental, not built into the structure via genetics.
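A rough back-of-the-envelope comparison shows the scale gap behind this point. The neuron and connection counts are the ones quoted above (the 15,000 figure is an upper bound, so the result is an upper bound too), the 7.5MB figure is the one from the conversation, and treating 1 MB as 10^6 bytes is an assumption of this sketch.

```python
import math

NEURONS = 100e9                    # "roughly about 100 billion neurons", as quoted
MAX_SYNAPSES_PER_NEURON = 15_000   # "up to 15,000 connections", an upper bound
GENOME_BUDGET_MB = 7.5             # wiring budget from the conversation

# Upper bound on the number of synapses in the brain.
max_synapses = NEURONS * MAX_SYNAPSES_PER_NEURON

# Bits available in the genome budget (assuming 1 MB = 10**6 bytes).
budget_bits = GENOME_BUDGET_MB * 1e6 * 8

# How many potential synapses each genomic bit would have to account for.
synapses_per_bit = max_synapses / budget_bits

print(f"max synapses:     {max_synapses:.2e}")
print(f"budget in bits:   {budget_bits:.2e}")
print(f"synapses per bit: {synapses_per_bit:.2e}")
```

Under these assumptions each bit of the budget would have to cover tens of millions of potential connections, so the budget can at most specify rules for growing and pruning connections, not the connections themselves.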
I think this is a very good point. Evolution has given humans the brain plasticity to create brain connectivity so that a predisposition for morality can be turned into a fully fledged sense of morality. There is, for sure, likely some basic structure in the brain that predisposes us to develop morality, but I’d be of the view that the crucial basic genes that control this structure are, firstly, present in primates and, at least, in other mammals, and, secondly, that the mutations in these genes required to generate the morally inclined human brain are far fewer than need be represented by 7.5 MB of information.
One thing both the genome and evolution have taught us is that huge complexity of function and purpose can be generated by a relatively small amount of seed information.
Not really the main point, but, I would bet:
a) something pretty close to Minecraft will be an important testing ground for some kinds of alignment work.
b) Minecraft itself will probably get a lot of use in AI research as things advance (largely due to being one of the most popular videogames of all time), whether or not it's actually quite the right test-bed. (I think the right test-bed will probably be optimized more directly for ease-of-training).
I think it might be worth Eliezer playing a Minecraft LAN party with some friends* for a weekend, so that the "what is minecraft?" question has a more true answer than the cobbled-together intuitions here, if for no other reason than having a clear handle on what people are talking about when they use Minecraft as an example. (But, to be fair, if my prediction bears out it'll be pretty easy to play Minecraft for a weekend later)
*the "with friends" part is extremely loadbearing. Solo minecraft is a different experience. Minecraft is interesting to me for basically being "real life, but lower resolution". If I got uploaded into Minecraft and trapped there forever I'd be sad to be missing some great things, but I think I'd have at least a weak form of most core human experiences, and this requires having other people around.
Minecraft is barely a "game". There is a rough "ascend tech tree and kill cooler monsters" that sort of maps onto Factorio + Skyrim, but the most interesting bits are:
Training an AI to actually do useful things in this context seems like it requires grappling with some things that don't normally come up in games.
I recall some people in CHAI working on a minecraft AI that could help players do useful tasks the players wanted. This was a couple years ago and I assume the work didn't output anything particularly impressive, but I do think some variant of "do useful things without having the rest of the players vote to ban your bot from the game" gets at something alignment-relevant.
I do think most ways people will go about this will be RLHF-like, and I don't expect them to scale to superintelligence or to be that useful for directly building a pivotal-act-capable AGI.
I think you're probably talking about my work. This is more of a long-term vision; it isn't doable (currently) at academic scales of compute. See also the "Advantages of BASALT" section of this post.
(Also I just generically default to Minecraft when I'm thinking of ML experiments that need to mimic some aspect of the real world, precisely because "the game getting played here is basically the same thing real life society is playing".)