This is something I've been thinking about a good amount while considering my model of Eliezer's model of alignment. After tweaking it a bunch, it sure looks like a messy retread of much of what Richard says here; I don't claim to assemble any new, previously unassembled insights here.

Tl;dr: For impossibly difficult problems like AGI alignment, the worlds in which we solve the problem will be worlds that came up with some new, intuitively compelling insights. On our priors about impossibly difficult problems, worlds without new intuitive insights don't survive AGI.

Object-Level Arguments for Perpetual Motion

I once knew a fellow who was convinced that his system of wheels and gears would produce reactionless thrust, and he had an Excel spreadsheet that would prove this - which of course he couldn't show us because he was still developing the system.  In classical mechanics, violating Conservation of Momentum is provably impossible.  So any Excel spreadsheet calculated according to the rules of classical mechanics must necessarily show that no reactionless thrust exists - unless your machine is complicated enough that you have made a mistake in the calculations.

And similarly, when half-trained or tenth-trained rationalists abandon their art and try to believe without evidence just this once, they often build vast edifices of justification, confusing themselves just enough to conceal the magical steps.

It can be quite a pain to nail down where the magic occurs - their structure of argument tends to morph and squirm away as you interrogate them.  But there's always some step where a tiny probability turns into a large one - where they try to believe without evidence - where they step into the unknown, thinking, "No one can prove me wrong".

Hey, maybe if you add enough wheels and gears to your argument, it'll turn warm water into electricity and ice cubes!  Or, rather, you will no longer see why this couldn't be the case.

"Right! I can't see why couldn't be the case!  So maybe it is!"

Another gear?  That just makes your machine even less efficient.  It wasn't a perpetual motion machine before, and each extra gear you add makes it even less efficient than that.

Each extra detail in your argument necessarily decreases the joint probability.  The probability that you've violated the Second Law of Thermodynamics without knowing exactly how, by guessing the exact state of boiling water without evidence, so that you can stick your finger in without getting burned, is, necessarily, even less than the probability of sticking your finger into boiling water without getting burned.
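(In symbols, this is just the conjunction rule: $P(A \land B) = P(A)\,P(B \mid A) \le P(A)$; every added conjunct can only lower the joint probability.)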

I say all this, because people really do construct these huge edifices of argument in the course of believing without evidence.  One must learn to see this as analogous to all the wheels and gears that fellow added onto his reactionless drive, until he finally collected enough complications to make a mistake in his Excel spreadsheet.

Manifestly Underpowered Purported Proofs

If I read all such papers, then I wouldn’t have time for anything else. It’s an interesting question how you decide whether a given paper crosses the plausibility threshold or not … Suppose someone sends you a complicated solution to a famous decades-old math problem, like P vs. NP. How can you decide, in ten minutes or less, whether the solution is worth reading?

The techniques just seem too wimpy for the problem at hand. Of all ten tests, this is the slipperiest and hardest to apply — but also the decisive one in many cases. As an analogy, suppose your friend in Boston blindfolded you, drove you around for twenty minutes, then took the blindfold off and claimed you were now in Beijing. Yes, you do see Chinese signs and pagoda roofs, and no, you can’t immediately disprove him — but based on your knowledge of both cars and geography, isn’t it more likely you’re just in Chinatown? I know it’s trite, but this is exactly how I feel when I see (for example) a paper that uses category theory to prove NL≠NP. We start in Boston, we end up in Beijing, and at no point is anything resembling an ocean ever crossed.

What's going on in the above cases is argumentation from "genre savviness" about our physical world: knowing, based on the reference class that a purported feat would fall into, the probabilities of feat success conditional on its having or lacking various features. These meta-level arguments rely on knowledge about what belongs in which reference class, rather than on in-the-weeds object-level arguments about the proposed solution itself. It's better to reason about things concretely, when possible, but in these cases the meta-level heuristic has a well-substantiated track record.

Successful feats will all have a certain superficial shape, so you can sometimes evaluate a purported feat based on its superficial features alone. One instance where we might really care about doing this is where we only get one shot at a feat, such as aligning AGI, and if we fail our save, everyone dies. In that case, we will not get lots of postmortem time to poke through how we failed and learn the object-level insights after the fact. We just die. We'll have to evaluate our possible feats in light of their non-hindsight-based features, then.

Let's look at the same kind of argument, courtesy Eliezer, about alignment schemes:

On Priors, is "Weird Recursion" Not an Answer to Alignment?

I remark that this intuition matches what the wise might learn from Scott’s parable of K’th’ranga V: If you know how to do something then you know how to do it directly rather than by weird recursion, and what you imagine yourself doing by weird recursion you probably can’t really do at all. When you want an airplane you don’t obtain it by figuring out how to build birds and then aggregating lots of birds into a platform that can carry more weight than any one bird and then aggregating platforms into megaplatforms until you have an airplane; either you understand aerodynamics well enough to build an airplane, or you don’t, the weird recursion isn’t really doing the work. It is by no means clear that we would have a superior government free of exploitative politicians if all the voters elected representatives whom they believed to be only slightly smarter than themselves, until a chain of delegation reached up to the top level of government; either you know how to build a less corruptible relationship between voters and politicians, or you don’t, the weirdly recursive part doesn’t really help. It is no coincidence that modern ML systems do not work by weird recursion because all the discoveries are of how to just do stuff, not how to do stuff using weird recursion. (Even with AlphaGo which is arguably recursive if you squint at it hard enough, you’re looking at something that is not weirdly recursive the way I think Paul’s stuff is weirdly recursive, and for more on that see https://intelligence.org/2018/05/19/challenges-to-christianos-capability-amplification-proposal/.)

It’s in this same sense that I intuit that if you could inspect the local elements of a modular system for properties that globally added to aligned corrigible intelligence, it would mean you had the knowledge to build an aligned corrigible AGI out of parts that worked like that, not that you could aggregate systems that corrigibly learned to put together sequences of corrigible thoughts into larger corrigible thoughts starting from gradient descent on data humans have labeled with their own judgments of corrigibility.

Eliezer often asks, "Where's your couple-paragraph-length insight from the Textbook from the Future?" Alignment schemes are purported solutions to problems in the reference class of impossibly difficult problems, in which we're actually doing something new, like inventing mathematical physics for the very first time, and doing so while playing against emerging superintelligent optimizers. As far as I can tell, Eliezer's worry is that proposed alignment schemes spin these long arguments for success that just amount to burying the problem deep enough to fool yourself. That's why any proposed solution to alignment has to yield a core insight or five that we didn't have before -- conditional on an alignment scheme looking good without a simple new insight, you've probably just buried the hard core of the problem deep enough in your arguments to fool your brain.

So it's fair to ask any alignment scheme what its new central insight into AI is, in a paragraph or two. If these couple of paragraphs read like something from the Textbook from the Future, then the scheme might be in business. If the paragraphs contain no brand-new, intuitively compelling insights, then the scheme probably doesn't contain the necessary insights distributed across its whole body, either.[1]

  1. ^

    Though this doesn't mean that pursuing that line of research further couldn't lead to the necessary insights. The science just has to eventually get to those insights if alignment is to work.

Comments (18)

How are the following for "new, intuitively compelling insights"?

  • Human values arise from an inner alignment failure between the brain's learning system and its steering system.
    • Note that we don't try to maximize the activation of our steering system's reward circuitry via wireheading, and we don't want a future full of nothing but hedonium. Our values can't be completely aligned with our steering system.
    • See this comment for more details.
  • The human learning system implements a fundamentally simple learning algorithm, with relatively little in the way of ancestral environment-specific evolutionary complexity.
    • This derives from the bitter lesson as applied to the sorts of general learning architectures that evolution could have found.
    • The bitter lesson itself derives from applying the simplicity prior to the space of possible learning algorithms. It applies to evolution as much as to human ML researchers.
    • See this comment for more details.
  • Human values are actually fairly robust to small variations in the steering system.
    • The vast majority of humans would not destroy the world, even given unlimited power to enact their preferences unopposed. Most people would also oppose self-modifying into being the sorts of people who'd be okay with destroying the world. This is in stark contrast to how AIs are assumed to destroy the world by default. 
    • The steering system must have non-trivial genetic variation. Otherwise, we could not domesticate animals in as few generations as we manifestly are able to (e.g., foxes).
    • People with e.g., congenital blindness, or most other significant cognition-related genetic variations, still develop human values. The primary exception, psychopathy, probably develops from something as simple as not having a steering system that rewards the happiness of others.
    • Also note that the human steering system has components that are obviously bad ideas to include in an AI's reward function, such as rewards for dominance / cruelty. Most of us still turn out fine.
  • Implication of the above: there must exist simple learning processes that robustly develop human-compatible values when trained on reward signals generated from a reward model similar to the human steering system. 

This perspective doesn't so much offer a particular alignment scheme as highlight certain mistaken assumptions in alignment-related reasoning. That, by the way, is the fundamental mistake in using P=NP as an analogy to the alignment problem: the field of computational complexity can rely on formal mathematical proofs to establish firm foundations for further work. In contrast, alignment research is built on a large edifice of assumptions, any of which could be wrong.

In particular, a common assumption in alignment thinking is that the human value formation process is inherently complex and highly tuned to our specific evolutionary history, that it represents some weird parochial corner of possible value formation processes. Note that this assumption then places very little probability mass on us having any particular value formation process, so it does not strongly retrodict the observed evidence. In contrast, the view I briefly sketched above essentially states that our value formation process is simply the default outcome of pointing a simple learning process at a somewhat complex reward function. It retrodicts our observed value formation process much more strongly. 

I think value formation is less like P=NP and more like the Fermi paradox, which seemed unsolvable until Anders Sandberg, Eric Drexler and Toby Ord published Dissolving the Fermi Paradox. It turns out that properly accounting for uncertainty in the parameters of the Drake equation causes the paradox to pretty much vanish. 
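A toy Monte Carlo sketch of that argument (the parameter ranges below are made up purely for illustration; the paper uses distributions drawn from the scientific literature):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

def log_uniform(low, high, size):
    """Sample log-uniformly between low and high."""
    return np.exp(rng.uniform(np.log(low), np.log(high), size))

# Illustrative uncertainty ranges for the Drake equation factors.
R_star = log_uniform(1, 100, n)    # star formation rate per year
f_p    = log_uniform(0.1, 1, n)    # fraction of stars with planets
n_e    = log_uniform(0.1, 10, n)   # habitable planets per planet-bearing star
f_l    = log_uniform(1e-30, 1, n)  # fraction of habitable planets that develop life
f_i    = log_uniform(1e-3, 1, n)   # fraction of those that develop intelligence
f_c    = log_uniform(1e-2, 1, n)   # fraction of those that become detectable
L      = log_uniform(1e2, 1e8, n)  # years a civilization remains detectable

# Drake equation: number of detectable civilizations in the galaxy.
N = R_star * f_p * n_e * f_l * f_i * f_c * L

# The "paradox": the expectation of N is huge, so an empty sky looks surprising...
print(f"mean of N: {N.mean():.3g}")
# ...but once parameter uncertainty is propagated, a large share of the probability
# mass sits at N << 1, so observing nobody else is not actually surprising.
print(f"P(N < 1):  {(N < 1).mean():.2f}")
```

The point is that multiplying optimistic point estimates and propagating the full distributions give very different answers to "should we be surprised to see nobody?"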

Prior to Dissolving the Fermi Paradox, people came up with all sorts of wildly different solutions to the paradox, as you can see by looking at its Wikipedia page. Rather than address the underlying assumptions that went into constructing the Fermi paradox, these solutions primarily sought to add additional mechanisms that seemed like they might patch away the confusion associated with the Fermi paradox.

However, the true solution to the Fermi paradox had nothing to do with any of these patches. No story about why aliens wouldn’t contact Earth or why technological civilizations invariably destroyed themselves would have ever solved the Fermi paradox, no matter how clever or carefully reasoned. Once you assume the incorrect approach to calculating the Drake equation, no amount of further reasoning you perform will lead you any further towards the solution, not until you reconsider the form of the Drake equation.

I think the Fermi paradox and human value formation belong to a class of problems, which we might call "few-cruxed problems", where progress can be almost entirely blocked by a handful of incorrect background assumptions. For few-cruxed problems, the true solution lies in a part of the search space that's nearly inaccessible to anyone working from said mistaken assumptions.

The correct approach for few-cruxed problems is to look for solutions that take away complexity, not add more of it. The skill involved here is similar to noticing confusion, but can be even more difficult. Oftentimes, the true source of your confusion is not the problem as it presents itself to you, but some subtle assumptions (the “cruxes”) of your background model of the problem that caused no telltale confusion when you first adopted them. 

A key feature of few-cruxed problems is that the amount of cognitive effort put into the problem before identifying the cruxes tells us almost nothing about the amount of cognitive work required to make progress on the problem once the cruxes are identified. The amount of cognition directed towards a problem is irrelevant if the cognition in question only ever explores regions of the search space which lack a solution. It is therefore important not to flinch away from solutions that seem “too simple” or “too dumb” to match the scale of the problem at hand. Big problems do not always require big solutions. 

I think one crux of alignment is the assumption that human value formation is a complex process. The other crux (and I don't think there's a third crux) is the assumption that we should be trying to avoid inner alignment failures. If (1) human values derive from an inner alignment failure wrt our steering system, and (2) humans are the only places where human values can be found, then an inner alignment failure is the only process to have ever produced human values in the entire history of the universe.

If human values derive from inner alignment failures, and we want to instill human values in an AI system, then the default approach should be to understand the sorts of values that derive from inner alignment failures in different circumstances, then try to arrange for the AI system to have an inner alignment failure that produces human-compatible values.

If, after much exploration, such an approach turned out to be impossible, then I think it would be warranted to start thinking about how to get human-compatible AI systems out of something other than an inner alignment failure. What we actually did was almost completely wall off that entire search space of possible solutions and actively try to solve the inner alignment "problem". 

If the true solution to AI alignment actually looks anything like "cause a carefully orchestrated inner alignment failure in a simple learning system", then of course our assumptions about the complexity of value formation and the undesirability of inner alignment failures would prevent us from finding such a solution. Alignment would look incredibly difficult because the answer would be outside of the subset of the solution space we'd restricted ourselves to considering. 

I want to flag Quintin's comment above as extremely important and—after spending over a month engaging with his ideas—I think they're probably correct. 

The vast majority of humans would not destroy the world [...] This is in stark contrast to how AIs are assumed to destroy the world by default.

Most humans would (and do) seek power and resources in a way that is bad for other systems that happen to be in the way (e.g., rainforests). When we colloquially talk about AIs "destroying the world" by default, it's a very self-centered summary: the world isn't actually "destroyed", just radically transformed in a way that doesn't end with any of the existing humans being alive, much like how our civilization transforms the Earth in ways that cut down existing forests.

You might reply: but wild nature still exists; we don't cut down all the forests! True, but an important question here is to what extent is that due to "actual" environmentalist/conservationist preferences in humans, and to what extent is it just that we "didn't get around to it yet" at our current capability levels?

In today's world, people who care about forest animals, and people who enjoy the experience of being in a forest, both have an interest in protecting forests. In the limit of arbitrarily advanced technology, this is less obvious: it's probably more efficient to turn everything into an optimal computing substrate, and just simulate happy forest animals for the animal lovers and optimal forest scenery for the scenery-lovers. Any fine details of the original forest that the humans don't care about (e.g., the internals of plants) would be lost.

This could still be good news, if it turns out to be easy to hit upon the AI analogue of animal-lovers (because something like "raise the utility of existing agents" is a natural abstraction that's easy to learn?), but "existing humans would not destroy the world" seems far too pat. (We did! We're doing it!)

The vast majority of humans would not destroy the world, even given unlimited power to enact their preferences unopposed.

I was specifically talking about the preferences of an individual human. The behavior of the economic systems that derive from the actions of many humans need not be aligned with the preferences of any component part of said systems. For AIs, we're currently interested in the values that arise in a single AI (specifically, the first AI capable of a hard takeoff), so single humans are the more appropriate reference class.

the world isn't actually "destroyed", just radically transformed in a way that doesn't end with any of the existing humans being alive

In fact, "radically transformed in a way that doesn't end with any of the existing humans being alive" is what I meant by "destroyed". That's the thing that very few current humans would do, given sufficient power. That's the thing that we're concerned that future AIs might do, given sufficient power. You might have a different definition of the word "destroyed", but I'm not using that definition. 

I believe that there are plenty of people who would destroy the world. I do know at least one personally. I don't know very many people to the extent that I could even hazard a guess as to whether they actually would or not, so either I am very fortunate (!) to know one of this tiny number, or there are at least millions of them and possibly hundreds of millions.

I am pretty certain that most humans would destroy the world if there was any conflict between that and any of their strongest values. The world persists only because there are no gods. The most powerful people to ever have existed have been powerful only because of the power granted to them by other humans. Remove that limitation and grant absolute power and superhuman intelligence along with capacity for further self-modification to a single person, and I give far better than even odds that what results is utterly catastrophic.

Let's suppose there are ~300 million people who'd use their unlimited power to destroy the world (I think the true number is far smaller). That would mean > 95% of people wouldn't do so. Suppose there were an alignment scheme that we'd tested billions of times on human-level AGIs, and > 95% of the time, it resulted in values compatible with humanity's continued survival. I think that would be a pretty promising scheme. 
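(For scale: 300 million out of roughly 8 billion people is under 4%, so the remaining fraction comfortably clears the 95% bar.)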

grant absolute power and superhuman intelligence along with capacity for further self-modification to a single person, and I give far better than even odds that what results is utterly catastrophic.

If there were a process that predictably resulted in me having values strongly contrary to those I currently possess, I wouldn't do it. The vast majority of people won't take pills that turn them into murderers. For the same reason, an aligned AI at slightly superhuman capability levels won't self-modify without first becoming confident that its self-modification will preserve its values. Most likely, it would instead develop better alignment tech than we used to create said AI and create a more powerful aligned successor.

I think that a 95% success rate in not destroying the human world would also be fantastic, though I note that there are plenty more potential totalitarian hellscapes that some people would apparently rate even worse than extinction.

Note that I'm not saying that they would deliberately destroy the world for shits and giggles, just that if the rest of the human world was any impediment to anything they valued more, then its destruction would just be a side effect of what had to be done.

I also don't have any illusion that a superintelligent agent will be infallible. The laws of the universe are not kind, and great power brings the opportunity for causing great disasters. I fully expect that any super-civilizational entity of any level of intelligence could very well destroy the human world by mistake.

"radically transformed in a way that doesn't end with any of the existing humans being alive" is what I meant by "destroyed"

Great, we're on the same page.

That's the thing that very few current humans would do, given sufficient power. That's the thing that we're concerned that future AIs might do, given sufficient power.

I think I'm expressing skepticism that inner-misaligned adaptations in simple learning algorithms are enough to license using current humans as a reference class quite this casually?

The "traditional" Yudkowskian position says, "Just think of AI as something that computes plans that achieve outcomes; logically, a paperclip maximizer is going to eat you and use your atoms to make paperclips." I read you as saying that AIs trained using anything like current-day machine learning techniques aren't going to be pure consequentialists like that; they'll have a mess of inner-misaligned "adaptations" and "instincts", like us. I agree that this is plausible, but I think it suggests "AI will be like another evolved species" rather than "AI will be like humans" as our best current-world analogy, and the logic of "different preferences + more power = genocide" still seems likely to apply across a gap that large (even if it's smaller than the gap to a pure consequentialist)?

...I think it suggests "AI will be like another evolved species" rather than "AI will be like humans"...

This was close to my initial assumption as well. I've since spent a lot of time thinking about the dynamics that arise from inner alignment failures in a human-like learning system, essentially trying to apply microeconomics to the internal "economy" of optimization demons that would result from an inner alignment failure. You can see this comment for some preliminary thoughts along these lines. A startling fraction of our deepest morality-related intuitions seem to derive pretty naturally / robustly from the multi-agent incentives associated with an inner alignment failure. 

Moreover, I think that there may be a pretty straightforward relationship between a learning system's reward function and the actual values it develops: values are self-perpetuating, context-dependent strategies that obtained high reward during training. If you want to ensure a learning system develops a given value, it may simply be enough to ensure that the system is rewarded for implementing the associated strategy during training. To get an AI that wants to help humans, just ensure the AI is rewarded for helping humans during training. 

To get an AI that wants to help humans, just ensure the AI is rewarded for helping humans during training.

To what extent do you expect this to generalize "correctly" outside of the training environment?

In your linked comment, you mention humans being averse to wireheading, but I think that's only sort-of true: a lot of people who successfully avoid trying heroin because they don't want to become heroin addicts, do still end up abusing a lot of other evolutionarily-novel superstimuli, like candy, pornography, and video games.

That makes me think inner-misalignment is still going to be a problem when you scale to superintelligence: maybe we evolve an AI "species" that's genuinely helpful to us in the roughly human-level regime (where its notion of helping and our notion of being-helped, coincide very well), but when the AIs become more powerful than us, they mostly discard the original humans in favor of optimized AI-"helping"-"human" superstimuli.

I guess I could imagine this being an okay future if we happened to get lucky about how robust the generalization turned out to be—maybe the optimized AI-"helping"-"human" superstimuli actually are living good transhuman lives, rather than being a nonsentient "sex toy" that happens to be formed in our image? But I'd really rather not bet the universe on this (if I had the choice not to bet).

Do you know if there's any research relevant to whether "degree of vulnerability to superstimuli" is correlated with intelligence in humans?

One aspect of inner alignment failures that I think is key to safe generalizations is that values tend to multiply. E.g., the human reward system is an inner alignment failure wrt evolution's single "value". Human values are inner alignment failures wrt the reward system. Each step we've seen has a significant increase in the breadth / diversity of values (admittedly, we've only seen two steps, but IMO it also makes sense that the process of inner alignment failure is orientated towards value diversification). 

If even a relatively small fraction of the AI's values orient towards actually helping humans, I think that's enough to avert the worst possible futures. From that point, it becomes a matter of ensuring that values are able to perpetuate themselves robustly (currently a major focus of our work on this perspective; prospects seem surprisingly good, but far from certain). 

maybe the optimized AI-"helping"-"human" superstimuli actually are living good transhuman lives, rather than being a nonsentient "sex toy" that happens to be formed in our image?

I actually think it would be very likely that such superstimuli are sentient. Humans are sentient. If you look at non-sentient humans (sleeping, sleep walking, trance state, some anesthetic drugs, etc), they typically behave quite differently from normal humans. 

If you look at non-sentient humans (sleeping, sleep walking, trance state, some anesthetic drugs, etc), they typically behave quite differently from normal humans.

Yeah, in the training environment. But, as you know, the reason people think inner-misalignment is a problem is precisely because capability gains can unlock exotic new out-of-distribution possibilities that don't have the same properties.

Boring, old example (skip this paragraph if it's too boring): humans evolved to value sweetness as an indicator of precious calories, and then we invented aspartame, which is much sweeter for far fewer calories. Someone in the past who reasoned, "If you look at sweet foods, they have a lot of calories; that'll probably be true in the future", would have been meaningfully wrong. (We still use actual sugar most of the time, but I think this is a lot like why we still have rainforests: in the limit of arbitrary capabilities, we don't care about any of the details of "original" sugar except what it tastes like to us.)

Better, more topical example: human artists who create beautiful illustrations on demand experience a certain pride in craftsmanship. Does DALL-E? Notwithstanding whether "it may be that today's large neural networks are slightly conscious", I'm going to guess No, there's nothing in text-to-image models remotely like a human artist's pride; we figured out how to get the same end result (beautiful art on demand) in an alien, inhuman way that's not very much like a human artist internally. Someone in the past who reasoned, "The creators of beautiful art will take pride in their craft," would be wrong.

key to safe generalizations is that values tend to multiply [...] significant increase in the breadth / diversity of values

"Increase in diversity" and "safe generalization" seem like really different things to me? What if some of the new, diverse values are actually bad from our perspective? (Something like, being forced to smile might make you actually unhappy despite the outward appearance of your face, but a human-smile-maximizer doesn't care about that, and this future is more diverse than the present because the present doesn't have any smile-maximizers.)

Basically, some of your comments make me worry that you're suffering from a bit of anthropomorphic optimism?

At the same time, however, I think this line of research is very interesting and I'm excited to see where you go with it! Yudkowsky tends to do this lame thing where after explaining the inner-alignment/context-disaster problem, he skips to, "And therefore, because there's no obvious relationship between the outer loss function and learned values, and because the space of possibilities is so large, almost all of it has no value, like paperclips." I think there's a lot of missing argumentation there, and discovering the correct arguments could change the conclusion and our decisions a lot! (In the standard metaphor, we're not really in the position of "evolution" with respect to AI so much as we are the environment of evolutionary adaptedness.) It's just, we need to be careful to be asking, "Okay, what actually happens with inner alignment failures; what's the actual outcome specifically?" without trying to "force" that search into finding reassuring fake reasons why the future is actually OK.

Basically, some of your comments make me worry that you're suffering from a bit of anthropomorphic optimism?

Ironically, one of my medium-sized issues with mainline alignment thinking is that it seems to underweight the evidence we get from observing humans and human values. The human brain is, by far, the most general and agentic learning system in current existence. We also have ~7 billion examples of human value learning to observe. The data they provide should strongly inform our intuitions on how other highly general and agentic learning systems behave. When you have limited evidence about a domain, what little evidence you do have should strongly inform your intuitions.

In fact, our observations of humans should inform our expectations of AGIs much more strongly than the above argument implies because we are going to train those AGIs on data generated by humans. It's well known in deep learning that training data are usually more important than details of the learning process or architecture.

I think alignment thinking has an inappropriately strong bias against anchoring expectations to our observations of humans. There's an assumption that the human learning algorithm is in some way "unnatural" among the space of general and effective learning algorithms, and that we therefore can't draw inferences about AGIs based on our observations of humans. E.g., Eliezer Yudkowsky's post My Childhood Role Model:

Humans are adapted to chase deer across the savanna, throw spears into them, cook them, and then—this is probably the part that takes most of the brains—cleverly argue that they deserve to receive a larger share of the meat.

It's amazing that Albert Einstein managed to repurpose a brain like that for the task of doing physics.  This deserves applause.  It deserves more than applause, it deserves a place in the Guinness Book of Records.  Like successfully building the fastest car ever to be made entirely out of Jello.

How poorly did the blind idiot god (evolution) really design the human brain?

This is something that can only be grasped through much study of cognitive science, until the full horror begins to dawn upon you.

All the biases we have discussed here should at least be a hint.

Likewise the fact that the human brain must use its full power and concentration, with trillions of synapses firing, to multiply out two three-digit numbers without a paper and pencil.

Yudkowsky notes that a learning algorithm hyper-specialized to the ancestral environment would not generalize well to thinking about non-ancestral domains like physics. This is absolutely correct, and it represents a significant misprediction of any view assigning a high degree of specialization to the brain's learning algorithm. 

In fact, large language models arguably implement social instincts with more adroitness than many humans possess. However, original physics research in the style of Einstein remains well out of reach. This is exactly the opposite of what you should predict if you believe that evolution hard coded most of the brain to "cleverly argue that they deserve to receive a larger share of the meat". 

Yudkowsky brings up multiplication as an example of a task that humans perform poorly at, supposedly because brains specialized to the ancestral environment had no need for such capabilities. And yet, GPT-3 is also terrible at multiplication (even after accounting for the BPE issue), and no part of its architecture or training procedure is at all specialized for the human ancestral environment. 

"Increase in diversity" and "safe generalization" seem like really different things to me? What if some of the new, diverse values are actually bad from our perspective? (Something like, being forced to smile might make you actually unhappy despite the outward appearance of your face, but a human-smile-maximizer doesn't care about that, and this future is more diverse than the present because the present doesn't have any smile-maximizers.)

But a smile maximizer would have less diverse values than a human? It only cares about smiles, after all. When you say "smile maximizer", is that shorthand for a system with a broad distribution over different values, one of which happens to be smile maximization? That's closer to how I think of things, with the system's high level behavior arising as a sort of negotiated agreement between its various values. 

IMO, systems with broader distributions over values are more likely to assign at least some weight to things like "make people actually happy" and to other values that we don't even know we should have included. In that case, the "make people actually happy" value and the "smile maximization" value can cooperate and make people smile by being happy (and also cooperate with the various other values the system develops). That's the source of my intuition that broader distributions over values are actually safer: they make you less likely to miss something important.

More generally, I think that a lot of the alignment intuition that "values are fragile" actually comes from a pretty simple type error. Consider: 

  • The computation a system executes depends on its inputs. If you have some distribution over possible inputs, that translates to having a distribution over possible computations.
  • "Values" is just a label we apply to particular components of a system's computations.
  • If a system has a situation-dependent distribution over possible computations, and values are implemented by those computations, then the system also has a situation-dependent distribution over possible values. 

However, people can only consciously instantiate a small subset of discrete values at any given time. There thus appears to be a contrast between "the values we can imagine" and "the values we actually have". Trying to list out a discrete set of "true human values" roughly corresponds to trying to represent a continuous distribution with a small set of discrete samples from that distribution (this is the type error in question). It doesn't help that the distribution over values is situation-dependent, so any sampling of their values a human performs in one situation may not transfer to the samples they'd take in another situation.

Given the above, it should be no surprise that our values feel "fragile" when we introspect on them.

Preempting a possible confusion: the above treats a "value" and "the computation that implements that value" interchangeably. If you're thinking of a "value" as something like a principal component of an agent's utility function, somehow kept separate from the system that actually implements those values, then this might seem counterintuitive.

Under this framing, questions like the destruction of the physical rainforests, or of other things we might value, are mainly about ensuring that a broad distribution of worthwhile values can perpetuate themselves across time and influence the world to at least some degree. "Preserving X", for any value of X, is about ensuring that the system has at least some values orientated towards preserving X, that those values can persist over time, and that those values can actually ensure that X is preserved. (And the broader the values, the more different Xs we can preserve.)

I think the prospects for achieving those three things are pretty good, though I don't think I'm ready to write up my full case for believing such. 

(I do admit that it's possible to have a system that ends up pursuing a simple / "dumb" goal, such as maximizing paperclips, to the exclusion of all else. That can happen when the system's distribution over possible values places so much weight on paperclip-adjacent values that they can always override any other values. This is another reason I'm in favor of broad distributions over values.)

Yudkowsky tends to do this lame thing where after explaining the inner-alignment/context-disaster problem, he skips to, "And therefore, because there's no obvious relationship between the outer loss function and learned values, and because the space of possibilities is so large, almost all of it has no value, like paperclips."

Agreed. It's particularly annoying because, IMO, there is a strong candidate for "obvious relationship between the outer loss function and learned values": learned values reflect the distribution over past computations that achieved high reward on the various shallow proxies of the outer loss function that the model encountered during training.

(Thanks for your patience.)

In fact, large language models arguably implement social instincts with more adroitness than many humans possess.

Large language models implement social behavior as expressed in text. I don't want to call that social "instincts", because the implementation, and out-of-distribution behavior, is surely going to be very different.

But a smile maximizer would have less diverse values than a human? It only cares about smiles, after all. When you say "smile maximizer", is that shorthand for a system with a broad distribution over different values, one of which happens to be smile maximization?

A future with humans and smile-maximizers is more diverse than a future with just humans. (But, yes, "smile maximizer" here is our standard probably-unrealistic stock example standing in for inner alignment failures in general.)

That's the source of my intuition that broader distributions over values are actually safer: they make you less likely to miss something important.

Trying again: the reason I don't want to call that "safety" is because, even if you're less likely to completely miss something important, you're more likely to accidentally incorporate something you actively don't want.

If we start out with a system where I press the reward button when the AI makes me happy, and it scales and generalizes into a diverse coalition of a smile-maximizer, and a number-of-times-the-human-presses-the-reward-button-maximizer, and an amount-of-dopamine-in-the-human's-brain-maximizer, plus a dozen or a thousand other things ... okay, I could maybe believe that some fraction of the cosmic endowment gets used in ways that I approve of.

But what if most of it is things like ... copies of me with my hand wired up to hit the reward button ten times a second, my face frozen into a permanent grin, while drugged up on a substance that increases serotonin levels, which correlated with "happiness" in the training environment, but when optimized subjectively amounts to an unpleasant form of manic insanity? That is, what if some parts of "diverse" are Actually Bad? I would really rather not roll the dice on this, if I had the choice!

For AIs, we're currently interested in the values that arise in a single AI (specifically, the first AI capable of a hard takeoff), so single humans are the more appropriate reference class.

I'm sorry, but I don't understand why looking at single AIs makes single humans the more appropriate reference class.

I'm drawing an analogy between AI training and human learning. I don't think the process of training an AI via reinforcement learning is as different from human learning as many assume. 

We weren't guaranteed to be born in a civilization where the alignment problem was even conceived of by a single person, let alone taken seriously by a single person. The odds were not extremely high that we'd be born on a timeline where alignment ended up well-researched and taken seriously by several billionaires.

We live in a civilization that's disproportionately advantaged in the will and ability to tackle the alignment problem. We should make the most of our privileged status by dramatically increasing the resources already allocated to the alignment problem, acknowledging that the problem is too hard for the ~100-expert approach taken so far. Especially because it might only be possible to solve the problem at the last minute, with whatever pool of experts and expertise has already been developed in the decades prior.

We should also acknowledge that AI is a multi-trillion dollar industry, with clear significance for geopolitical power as well. There are all sorts of vested interests, and plenty of sharks. It's unreasonable to think that AI advocacy should be targeted towards the general public, instead of experts and other highly relevant people. Nor does it make sense to believe that the people opposed to alignment advocacy are doing so out of ignorance or disinterest; there are plenty of people who are trying to slander AI for any reason at all (e.g. truck driver unions who are mad about self-driving cars), and plenty of people who are paid to counteract any perceived slander by any means necessary, no matter where it pops up and what form it takes. 

As a result, the people concerned with alignment needed Pareto solutions; namely, dramatically increasing the number of people who take the alignment problem seriously, without running around like a headless chicken and randomly stepping on the toes of powerful people (who routinely vanquish far more intimidating threats to their interests). The history textbooks of the future will probably cover that extensively.

I'd go for:

Reinforcement learning agents do two sorts of planning. One is explicit: applying the dynamics (world-modelling) network and running a Monte Carlo tree search (or something like it) over explicitly represented world states. The other is implicit in the future-reward-estimate function. You need as much planning as possible to be of the first type (see the sketch after this list):

  1. It's much more supervisable. An explicitly-represented world state is more interrogable than the inner workings of a future-reward-estimate.
  2. It's less susceptible to value-leaking. By this I mean issues in alignment which arise from instrumentally-valuable (i.e. not directly part of the reward function) goals leaking into the future-reward-estimate.
  3. You can also turn down the depth on the tree search. If the agent literally can't plan beyond a dozen steps ahead it can't be deceptively aligned.
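A minimal sketch of that split, assuming a generic model-based agent (the function names and toy usage are hypothetical illustrations, not any real library's API): everything inside the depth-limited search is explicit and interrogable, and the only implicit planning is whatever comes in through the learned value estimate at the horizon.

```python
def plan(state, world_model, reward_fn, value_estimate, actions, depth):
    """Depth-limited search over explicitly represented world states.

    world_model(state, action) -> next_state  (explicit, interrogable)
    reward_fn(state, action)   -> float
    value_estimate(state)      -> float       (the implicit, learned part)
    """
    if depth == 0:
        # Everything beyond the horizon is delegated to the learned estimate.
        # Returning 0.0 here instead would remove implicit long-horizon planning
        # entirely; this is point 3: cap the depth and the agent literally
        # cannot plan past it.
        return value_estimate(state), None
    best_value, best_action = float("-inf"), None
    for action in actions:
        next_state = world_model(state, action)
        future_value, _ = plan(next_state, world_model, reward_fn,
                               value_estimate, actions, depth - 1)
        value = reward_fn(state, action) + future_value
        if value > best_value:
            best_value, best_action = value, action
    return best_value, best_action


# Toy usage: a 1-D world where actions nudge the state toward a target of 3.
best_value, best_action = plan(
    state=0,
    world_model=lambda s, a: s + a,
    reward_fn=lambda s, a: -abs((s + a) - 3),
    value_estimate=lambda s: 0.0,   # no implicit planning beyond the horizon
    actions=(-1, 0, 1),
    depth=3,
)
print(best_action)  # 1: step toward the target
```

An actual MuZero-style agent would replace the lambdas with learned networks, but the shape of the trade-off is the same: the more of the work the explicit search does relative to the value estimate, the more of the agent's planning you can actually inspect.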