Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

Suppose AI continues on its current trajectory: deep learning continues to get better as we throw more data and compute at it, researchers keep trying random architectures and using whatever seems to work well in practice. Do we end up with aligned AI “by default”?

I think there’s at least a plausible trajectory in which the answer is “yes”. Not very likely - I’d put it at ~10% chance - but plausible. In fact, there’s at least an argument to be made that alignment-by-default is more likely to work than many fancy alignment proposals, including IRL variants and HCH-family methods.

This post presents the rough models and arguments.

I’ll break it down into two main pieces:

  • Will a sufficiently powerful unsupervised learner “learn human values”? What does that even mean?
  • Will a supervised/reinforcement learner end up aligned to human values, given a bunch of data/feedback on what humans want?

Ultimately, we’ll consider a semi-supervised/transfer-learning style approach, where we first do some unsupervised learning and hopefully “learn human values” before starting the supervised/reinforcement part.

As background, I will assume you’ve read some of the core material about human values from the sequences, including Hidden Complexity of Wishes, Value is Fragile, and Thou Art Godshatter.

Unsupervised: Pointing to Values

In this section, we’ll talk about why an unsupervised learner might not “learn human values”. Since an unsupervised learner is generally just optimized for predictive power, we’ll start by asking whether theoretical algorithms with best-possible predictive power (i.e. Bayesian updates on low-level physics models) “learn human values”, and what that even means. Then, we’ll circle back to more realistic algorithms.

Consider a low-level physical model of some humans - e.g. a model which simulates every molecule comprising the humans. Does this model “know human values”? In one sense, yes: the low-level model has everything there is to know about human values embedded within it, in exactly the same way that human values are embedded in physical humans. It has “learned human values”, in a sense sufficient to predict any real-world observations involving human values.

But it seems like there’s a sense in which such a model does not “know” human values. Specifically, although human values are embedded in the low-level model, the embedding itself is nontrivial. Even if we have the whole low-level model, we still need that embedding in order to “point to” human values specifically - e.g. to use them as an optimization target. Indeed, when we say “point to human values”, what we mean is basically “specify the embedding”. (Side note: treating human values as an optimization target is not the only use-case for “pointing to human values”, and we still need to point to human values even if we’re not explicitly optimizing for anything. But that’s a separate discussion, and imagining using values as an optimization target is useful to give a mental image of what we mean by “pointing”.)

In short: predictive power alone is not sufficient to define human values. The missing part is the embedding of values within the model. The hard part is pointing to the thing (i.e. specifying the values-embedding), not learning the thing (i.e. finding a model in which values are embedded).

Finally, here’s a different angle on the same argument which will probably drive some of the philosophers up the wall: any model of the real world with sufficiently high general predictive power will have a model of human values embedded within it. After all, it has to predict the parts of the world in which human values are embedded in the first place - i.e. the parts of which humans are composed, the parts on which human values are implemented. So in principle, it doesn’t even matter what kind of model we use or how it’s represented; as long as the predictive power is good enough, values will be embedded in there, and the main problem will be finding the embedding.

Unsupervised: Natural Abstractions

In this section, we’ll talk about how and why a large class of unsupervised methods might “learn the embedding” of human values, in a useful sense.

First, notice that basically everything from the previous section still holds if we replace the phrase “human values” with “trees”. A low-level physical model of a forest has everything there is to know about trees embedded within it, in exactly the same way that trees are embedded in the physical forest. However, while there are trees embedded in the low-level model, the embedding itself is nontrivial. Predictive power alone is not sufficient to define trees; the missing part is the embedding of trees within the model.

More generally, whenever we have some high-level abstract object (i.e. higher-level than quantum fields), like trees or human values, a low-level model might have the object embedded within it but not “know” the embedding.

Now for the interesting part: empirically, we have whole classes of neural networks in which concepts like “tree” have simple, identifiable embeddings. These are unsupervised systems, trained for predictive power, yet they apparently “learn the tree-embedding” in the sense that the embedding is simple: it’s just the activation of a particular neuron, a particular channel, or a specific direction in the activation-space of a few neurons.

Neat example with “trees” from the paper linked above.
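To make "a specific direction in activation-space" concrete, here is a minimal sketch on synthetic data. Everything in it is an assumption for illustration: the activations are random numbers rather than outputs of a real network, the hidden `tree_direction` is planted by hand, and the difference-of-means probe is just one simple way to look for a linear embedding, not the method used in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples, n_units = 2000, 64

# Hypothetical ground truth: a unit-length "tree direction" planted in activation space.
tree_direction = rng.normal(size=n_units)
tree_direction /= np.linalg.norm(tree_direction)

# Synthetic labels (1 = tree present) and synthetic activations: isotropic noise,
# shifted along the planted direction whenever a tree is present.
has_tree = rng.integers(0, 2, size=n_samples)
activations = rng.normal(size=(n_samples, n_units))
activations += 3.0 * has_tree[:, None] * tree_direction

# Simplest possible probe: normalized difference of the two class means.
probe = activations[has_tree == 1].mean(axis=0) - activations[has_tree == 0].mean(axis=0)
probe /= np.linalg.norm(probe)

# Read the concept out by projecting onto the probe and thresholding at the midpoint.
scores = activations @ probe
threshold = 0.5 * (scores[has_tree == 1].mean() + scores[has_tree == 0].mean())
accuracy = float(np.mean((scores > threshold) == (has_tree == 1)))
cosine = float(probe @ tree_direction)

print(f"cosine with planted direction: {cosine:.2f}, readout accuracy: {accuracy:.2f}")
```

The point of the sketch is only that a "simple embedding" is something a few lines of linear algebra can recover once we have labels for the concept. It says nothing about whether real networks have such a direction for any particular concept.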

What’s going on here? We know that models optimized for predictive power will not have trivial tree-embeddings in general; low-level physics simulations demonstrate that much. Yet these neural networks do end up with trivial tree-embeddings, so presumably some special properties of the systems make this happen. But those properties can’t be that special, because we see the same thing for a reasonable variety of different architectures, datasets, etc.

Here’s what I think is happening: “tree” is a natural abstraction. More on what that means here, but briefly: abstractions summarize information which is relevant far away. When we summarize a bunch of atoms as “a tree”, we’re throwing away lots of information about the exact positions of molecules/cells within the tree, or about the pattern of bark on the tree’s surface. But information like the exact positions of molecules within the tree is irrelevant to things far away - that signal is all wiped out by the noise of air molecules between the tree and the observer. The flap of a butterfly’s wings may alter the trajectory of a hurricane, but unless we know how all wings of all butterflies are flapping, that tiny signal is wiped out by noise for purposes of our own predictions. Most information is irrelevant to things far away, not in the sense that there’s no causal connection, but in the sense that the signal is wiped out by noise in other unobserved variables.

If a concept is a natural abstraction, that means that the concept summarizes all the information which is relevant to anything far away, and isn’t too sensitive to the exact notion of “far away” involved. That’s what I think is going on with “tree”.
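A toy numerical version of the butterfly point, as a sketch under made-up assumptions: the far-away observable is simply the sum of many independent low-level details plus a little noise, and "relevance" is measured by plain correlation. Each individual detail has a real causal effect on the far-away variable, yet on its own it predicts almost nothing, while the summary predicts almost everything.

```python
import numpy as np

rng = np.random.default_rng(0)
n_trials, n_details = 5000, 1000

# Low-level details (e.g. individual wing flaps), independent of each other.
details = rng.normal(size=(n_trials, n_details))

# The summary is the only aggregate that matters downstream in this toy model.
summary = details.sum(axis=1)

# The far-away observable depends causally on every detail, plus some local noise.
far_away = summary + rng.normal(scale=5.0, size=n_trials)

# One detail alone: its signal is swamped by the other, unobserved details.
corr_single = float(np.corrcoef(details[:, 0], far_away)[0, 1])

# The summary: carries nearly all the information relevant far away.
corr_summary = float(np.corrcoef(summary, far_away)[0, 1])

print(f"single detail vs far away: {corr_single:.3f}")
print(f"summary vs far away:       {corr_summary:.3f}")
```

The choice of a sum as the summary is built into the toy model. In a real system, figuring out which summary plays that role is the interesting part.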

Getting back to neural networks: it’s easy to see why a broad range of architectures would end up “using” natural abstractions internally. Because the abstraction summarizes information which is relevant far away, it allows the system to make far-away predictions without passing around massive amounts of information all the time. In a low-level physics model, we don’t need abstractions because we do pass around massive amounts of information all the time, but real systems won’t have anywhere near that capacity any time soon. So for the foreseeable future, we should expect to see real systems with strong predictive power using natural abstractions internally.

With all that in mind, it’s time to drop the tree-metaphor and come back to human values. Are human values a natural abstraction?

If you’ve read Value is Fragile or Godshatter, then there’s probably a knee-jerk reaction to say “no”. Human values are basically a bunch of randomly-generated heuristics which proved useful for genetic fitness; why would they be a “natural” abstraction? But remember, the same can be said of trees. Trees are a complicated pile of organic spaghetti code, but “tree” is still a natural abstraction, because the concept summarizes all the information from that organic spaghetti pile which is relevant to things far away. In particular, it summarizes anything about one tree which is relevant to far-away trees.

Similarly, the concept of “human” summarizes all the information about one human which is relevant to far-away humans. It’s a natural abstraction.

Now, I don’t think “human values” are a natural abstraction in exactly the same way as “tree” - specifically, trees are abstract objects, whereas human values are properties of certain abstract objects (namely humans). That said, I think it’s pretty obvious that “human” is a natural abstraction in the same way as “tree”, and I expect that humans “have values” in roughly the same way that trees “have branching patterns”. Specifically, the natural abstraction contains a bunch of information, that information approximately factors into subcomponents (including “branching pattern”), and “human values” is one of those information-subcomponents for humans.

Branching patterns for a few different kinds of trees.

I wouldn’t put super-high confidence on all of this, but given the remarkable track record of hackish systems learning natural abstractions in practice, I’d give maybe a 70% chance that a broad class of systems (including neural networks) trained for predictive power end up with a simple embedding of human values. A plurality of my uncertainty is on how to think about properties of natural abstractions. A significant chunk of uncertainty is also on the possibility that natural abstraction is the wrong way to think about the topic altogether, although in that case I’d still assign a reasonable chance that neural networks end up with simple embeddings of human values - after all, no matter how we frame it, they definitely have trivial embeddings of many other complicated high-level objects.

Aside: Microscope AI

Microscope AI is about studying the structure of trained neural networks, and trying to directly understand their learned internal algorithms, models and concepts. In light of the previous section, there’s an obvious path to alignment where there turns out to be a few neurons (or at least some simple embedding) which correspond to human values, we use the tools of microscope AI to find that embedding, and just like that the alignment problem is basically solved.

Of course it’s unlikely to be that simple in practice, even assuming a simple embedding of human values. I don’t expect the embedding to be quite as simple as one neuron activation, and it might not be easy to recognize even if it were. Part of the problem is that we don’t even know the type signature of the thing we’re looking for - in other words, there are unanswered fundamental conceptual questions here, which make me less-than-confident that we’d be able to recognize the embedding even if it were right under our noses.

That said, this still seems like a reasonably-plausible outcome, and it’s an approach which is particularly well-suited to benefit from marginal theoretical progress.

One thing to keep in mind: this is still only about aligning one AI; success doesn’t necessarily mean a future in which more advanced AIs remain aligned. More on that later.

Supervised/Reinforcement: Proxy Problems

Suppose we collect some kind of data on what humans want, and train a system on that. The exact data and type of learning doesn’t really matter here; the relevant point is that any data-collection process is always, no matter what, at best a proxy for actual human values. That’s a problem, because Goodhart’s Law plus Hidden Complexity of Wishes. You’ve probably heard this a hundred times already, so I won’t belabor it.

Here’s the interesting possibility: assume the data is crap. It’s so noisy that, even though the data-collection process is just a proxy for real values, the data is consistent with real human values. Visually:

Real human values are represented by the blue point, and the true center of our proxy measure is the red point. In this case, the data generated (other points) is noisy enough that it’s consistent with real human values. Disclaimer: this is an analogy, I don’t actually imagine values and proxies being directly represented in the same space as the data.

At first glance, this isn’t much of an improvement. Sure, the data is consistent with human values, but it’s consistent with a bunch of other possibilities too - including the real data-collection process (which is exactly the proxy we wanted to avoid in the first place).

But now suppose we do some transfer learning. We start with a trained unsupervised learner, which already has a simple embedding of human values (we hope). We give our supervised learner access to that system during training. Because the unsupervised learner has a simple embedding of human values, the supervised learner can easily score well by directly using that embedded human values model. So, we cross our fingers and hope the supervised learner just directly uses that embedded human values model, and the data is noisy enough that it never “figures out” that it can get better performance by directly modelling the data-collection process instead.

In other words: the system uses an actual model of human values as a proxy for our proxy of human values.

This requires hitting a window - our data needs to be good enough that the system can tell it should use human values as a proxy, but bad enough that the system can’t figure out the specifics of the data-collection process enough to model it directly. This window may not even exist.
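One edge of that window can be sketched numerically. This follows the same analogy as the picture above and inherits its disclaimer: the 2-D coordinates, the isotropic Gaussian noise, and the function name `log_likelihood_ratio` are all assumptions made up for illustration. The sketch asks how strongly a dataset favors "centered on the proxy" over "centered on real values", and does not model the other edge of the window (whether the data is good enough to point at human values at all).

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 2-D coordinates, purely for illustration.
true_values = np.array([0.0, 0.0])   # the "blue point"
proxy_center = np.array([1.0, 0.5])  # the "red point"

def log_likelihood_ratio(noise_scale, n_points):
    """Total log-likelihood of 'data is centered on the proxy' minus
    'data is centered on true values', under isotropic Gaussian noise."""
    data = proxy_center + rng.normal(scale=noise_scale, size=(n_points, 2))
    sq_dist_true = ((data - true_values) ** 2).sum(axis=1)
    sq_dist_proxy = ((data - proxy_center) ** 2).sum(axis=1)
    return float((sq_dist_true - sq_dist_proxy).sum() / (2 * noise_scale ** 2))

# Very noisy data: both hypotheses fit about equally well.
noisy = log_likelihood_ratio(noise_scale=5.0, n_points=20)

# Clean data: the same number of points overwhelmingly favors the proxy.
clean = log_likelihood_ratio(noise_scale=0.2, n_points=20)

print(f"noisy data: {noisy:.1f} nats, clean data: {clean:.1f} nats")
```

In expectation the ratio grows like n·d²/(2σ²), where d is the distance between the two points, so more data or less noise both push the system toward "figuring out" the data-collection process.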

(Side note: we can easily adjust this whole story to a situation where we’re training for some task other than “satisfy human values”. In that case, the system would use the actual model of human values to model the Hidden Complexity of whatever task it’s training on.)

Of course in practice, the vast majority of the things people use as objectives for training AI probably wouldn’t work at all. I expect that they usually look like this:

In other words, most objectives are so bad that even a little bit of data is enough to distinguish the proxy from real human values. But if we assume that there’s some try-it-and-see going on, i.e. people try training on various objectives and keep the AIs which seem to do roughly what the humans want, then it’s maybe plausible that we end up iterating our way to training objectives which “work”. That’s assuming things don’t go irreversibly wrong before then - including not just hostile takeover, but even just development of deceptive behavior, since this scenario does not have any built-in mechanism to detect deception.

Overall, I’d give maybe a 10-20% chance of alignment by this path, assuming that the unsupervised system does end up with a simple embedding of human values. The main failure mode I’d expect, assuming we get the chance to iterate, is deception - not necessarily “intentional” deception, just the system being optimized to look like it’s working the way we want rather than actually working the way we want. It’s the proxy problem again, but this time at the level of humans-trying-things-and-seeing-if-they-work, rather than explicit training objectives.

Alignment in the Long Run

So far, we’ve only talked about one AI ending up aligned, or a handful ending up aligned at one particular time. However, that isn’t really the ultimate goal of AI alignment research. What we really want is for AI to remain aligned in the long run, as we (and AIs themselves) continue to build new and more powerful systems and/or scale up existing systems over time.

I know of two main ways to go from aligning one AI to long-term alignment:

  • Make the alignment method/theory very reliable and robust to scale, so we can continue to use it over time as AI advances.
  • Align one roughly-human-level-or-smarter AI, then use that AI to come up with better alignment methods/theories.

The alignment-by-default path relies on the latter. Even assuming alignment happens by default, it is unlikely to be highly reliable or robust to scale.

That’s scary. We’d be trusting the AI to align future AIs, without having any sure-fire way to know that the AI is itself aligned. (If we did have a sure-fire way to tell, then that would itself be most of a solution to the alignment problem.)

That said, there’s a bright side: when alignment-by-default works, it’s a best-case scenario. The AI has a basically-correct model of human values, and is pursuing those values. Contrast this to things like IRL variants, which at best learn a utility function which approximates human values (which are probably not themselves a utility function). Or the HCH family of methods, which at best mimic a human with a massive hierarchical bureaucracy at their command, and certainly won’t be any more aligned than that human+bureaucracy would be.

To the extent that alignment of the successor system is limited by alignment of the parent system, that makes alignment-by-default potentially a more promising prospect than IRL or HCH. In particular, it seems plausible that imperfect alignment gets amplified into worse-and-worse alignment as systems design their successors. For instance, a system which tries to look like it’s doing what humans want rather than actually doing what humans want will design a successor which has even better human-deception capabilities. That sort of problem makes “perfect” alignment - i.e. an AI actually pointed at a basically-correct model of human values - qualitatively safer than a system which only manages to be not-instantly-disastrous.

(Side note: this isn’t the only reason why “basically perfect” alignment matters, but I do think it’s the most relevant such argument for one-time alignment/short-term methods, especially on not-very-superhuman AI.)

In short: when alignment-by-default works, we can use the system to design a successor without worrying about amplification of alignment errors. However, we wouldn’t be able to tell for sure whether alignment-by-default had worked or not, and it’s still possible that the AI would make plain old mistakes in designing its successor.


Let’s recap the bold points:

  • A low-level model of some humans has everything there is to know about human values embedded within it, in exactly the same way that human values are embedded in physical humans. The embedding, however, is nontrivial. Thus...
  • Predictive power alone is not sufficient to define human values. The missing part is the embedding of values within the model. However…
  • This also applies if we replace the phrase “human values” with “trees”. Yet we have a whole class of neural networks in which a simple embedding lights up in response to trees. Why?
  • Trees are a natural abstraction, and we should expect to see real systems trained for predictive power use natural abstractions internally.
  • Human values are a little different from trees (they’re a property of an abstract object rather than an abstract object themselves), but I still expect that a broad class of systems trained for predictive power will end up with simple embeddings of human values (~70% chance).
  • Because the unsupervised learner has a simple embedding of human values, a supervised/reinforcement learner can easily score well on values-proxy-tasks by directly using that model of human values. In other words, the system uses an actual model of human values as a proxy for our proxy of human values (~10-20% chance).
  • When alignment-by-default works, it’s basically a best-case scenario, so we can safely use the system to design a successor without worrying about amplification of alignment errors (among other things).

Overall, I only give this whole path ~10% chance of working in the short term, and maybe half that in the long term. However, if amplification of alignment errors turns out to be a major limiting factor for long-term alignment, then alignment-by-default is plausibly more likely to work than approaches in the IRL or HCH families.
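As a sanity check on the arithmetic, here is a tiny sketch that simply multiplies the two estimates from the recap. Treating the overall number as the product of the ~70% and the conditional ~10-20% is my reading of how the pieces chain together, and the variable names are made up for illustration.

```python
# Estimates quoted in the recap above.
p_simple_embedding = 0.70                        # unsupervised learner ends up with a simple embedding
p_transfer_low, p_transfer_high = 0.10, 0.20     # supervised stage works, given a simple embedding

# Assumption: the overall short-term chance is the product of the two stages.
overall_low = p_simple_embedding * p_transfer_low
overall_high = p_simple_embedding * p_transfer_high

print(f"short-term range: {overall_low:.0%} to {overall_high:.0%}")
```

The resulting range of roughly 7% to 14% brackets the ~10% figure.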

The limiting factor here is mainly identifying the (probably simple) embedding of human values within a learned model, so microscope AI and general theory development are both good ways to improve the outlook. Also, in the event that we are able to identify a simple embedding of human values in a learned model, it would be useful to have a way to translate that embedding into new systems, in order to align successors.


I think what you've identified here is a weakness in the high-level, classic arguments for AI risk -

Overall, I’d give maybe a 10-20% chance of alignment by this path, assuming that the unsupervised system does end up with a simple embedding of human values. The main failure mode I’d expect, assuming we get the chance to iterate, is deception - not necessarily “intentional” deception, just the system being optimized to look like it’s working the way we want rather than actually working the way we want. It’s the proxy problem again, but this time at the level of humans-trying-things-and-seeing-if-they-work, rather than explicit training objectives.

This failure mode of deceptive alignment seems like it would result most easily from Mesa-optimisation or an inner alignment failure. Inner Alignment / Misalignment is possibly the key specific mechanism which fills a weakness in the 'classic arguments' for AI safety - the Orthogonality Thesis, Instrumental Convergence and Fast Progress together implying small separations between AI alignment and AI capability can lead to catastrophic outcomes. The question of why there w...

Personally, I think a more likely failure mode is just "you get what you measure", as in Paul's write up here. If we only know how to measure certain things which are not really the things we want, then we'll be selecting for not-what-we-want by default. But I know at least some smart people who think that inner alignment is the more likely problem, so you're in good company.

‘You get what you measure’ (outer alignment failure) and Mesa optimisers (inner failure) are both potential gap fillers that explain why specifically the alignment/capability divergence initially arises. Whether it’s one or the other, I think the overall point is still that there is this gap in the classic arguments that allows for a (possibly quite high) chance of ‘alignment by default’, for the reasons you give, but there are at least 2 plausible mechanisms that fill this gap. And then I suppose my broader point would be that we should present:

Classic Arguments —> objections to them (capability and alignment often go together, could get alignment by default) —> specific causal mechanisms for misalignment

Am surprised you think that’s the main failure mode. I am fairly more concerned about failure through mesa optimisers taking a treacherous turn. 

I’m thinking we will be more likely to find sensible solutions to outer alignment, but have not much real clue about the internals, and then we’ll give them enough optimisation power to build super intelligent unaligned mesa optimisers, and then with one treacherous turn the game will be up.

Why do you think inner alignment will be easier?

Two arguments here. First, an outside-view argument: inner alignment problems should only crop up on a relatively narrow range of architectures/parameters. Second, an entirely separate inside-view argument: assuming that natural abstractions are a thing makes inner alignment failure look much less likely.

Narrow range argument: inner alignment failure only applies to a specific range of architectures within a specific range of task parameters - for instance, we have to be optimizing for something, and there has to be lots of relevant variables observed only at runtime, and there has to be something like a "training" phase in which we lock-in parameter choices before runtime, and for the more disastrous versions we usually need divergence of the runtime distribution from the training distribution. It's a failure mode which assumes that a whole lot of things look like today's ML pipelines. On the other hand, the get-what-you-measure problem and its generalizations apply to any architecture, including tool AI, idealized Bayesian utility maximizers (i.e. the infinite data/compute regime), and (less obviously) human-mimicking systems.

Natural abstractions argument: in an inner alignment failure, the outer optimizer is optimizing for X, but the inner optimizer ends up pointed at some rough approximation ~X. But if X is a natural abstraction, then this is far less likely to be a problem; we expect a wide range of predictive systems to all learn a basically-correct notion of X, so there's little reason for an inner optimizer to end up pointed at a rough approximation, especially if we're leveraging transfer learning from some unsupervised learner.

(It's worth asking here why this argument doesn't apply to the divergence of human goals from evolutionary fitness. A human only has ~30k genes, and each one has a fairly simple function - e.g. catalyze one chemical reaction or stabilize a structure or the like. That's nowhere near enough to represent something like evolutiona

Natural abstractions argument: in an inner alignment failure, the outer optimizer is optimizing for X, but the inner optimizer ends up pointed at some rough approximation ~X. But if X is a natural abstraction, then this is far less likely to be a problem; we expect a wide range of predictive systems to all learn a basically-correct notion of X, so there's little reason for an inner optimizer to end up pointed at a rough approximation, especially if we're leveraging transfer learning from some unsupervised learner.

This isn't an argument against deceptive alignment, just proxy alignment—with deceptive alignment, the agent still learns X, it just does so as part of its world model rather than its objective. In fact, I think it's an argument for deceptive alignment, since if X first crops up as a natural abstraction inside of your agent's world model, that raises the question of how exactly it will get used in the agent's objective function—and deceptive alignment is arguably one of the simplest, most natural ways for the base optimizer to get an agent that has information about the base objective stored in its world model to actually start optimizing for that model of the base objective.

I mostly agree with this. I don't view deception as an inner alignment problem, though - for instance, it's an issue in any approval-based setup even without an inner optimizer showing up. To the extent that it is an inner alignment issue, it involves generalization failure from the training distribution, which I also generally consider an outer alignment problem (i.e. training on a distribution which differs from the deploy environment generally means the system is not outer aligned, unless the architecture is somehow set up to make the distribution shift irrelevant). A useful criterion here: would the problem still happen if we just optimized over all the parameters simultaneously at runtime, rather than training offline first? If the problem would still happen, then it's not really an inner alignment problem (at least not in the usual mesa-optimization sense).
That's certainly not how I would define inner alignment. In “Risks from Learned Optimization,” we just define it as the problem of aligning the mesa-objective (if one exists) with the base objective, which is entirely independent of whether or not there's any sort of distinction between the training and deployment distributions and is fully consistent with something like online learning as you're describing it.
The way I understood it, the main reason a mesa-optimizer shows up in the first place is that some information is available at runtime which is not available during training, so some processing needs to be done at runtime to figure out the best action given the runtime-info. The mesa-optimizer handles that processing. If we directly optimize over all parameters at runtime, then there's no place for that to happen. What am I missing?
Let's consider the following online learning setup: At each timestep $t$, $\pi_{\theta_t}$ takes action $a_t \in A$ and receives reward $r_t \in \mathbb{R}$. Then, we perform the simple policy gradient update $\theta_{t+1} = \theta_t + r_t \nabla_\theta \log P(a_t \mid \pi_{\theta_t})$. Now, we can ask the question, would $\pi_{\theta_t}$ be a mesa-optimizer? The first thing that's worth noting is that the above setup is precisely the standard RL training setup - the only difference is that there's no deployment stage. What that means, though, is that if standard RL training produces a mesa-optimizer, then this will produce a mesa-optimizer too, because the training process isn't different in any way whatsoever. If $\pi$ is acting in a diverse environment that requires search to be able to be solved effectively, then $\pi$ will still need to learn to do search - the fact that there won't ever be a deployment stage in the future is irrelevant to $\pi$'s current training dynamics (unless $\pi$ is deceptive and knows there won't be a deployment stage - that's the only situation where it might be relevant). Given that, we can ask the question of whether $\pi$, if it's a mesa-optimizer, is likely to be misaligned - and in particular whether it's likely to be deceptive. Again, in terms of proxy alignment, the training process is exactly the same, so the picture isn't any different at all - if there are simpler, easier-to-optimize-for proxies, then $\pi$ is likely to learn those instead of the true base objective. Like I mentioned previously, however, deceptive alignment is the one case where it might matter that you're doing online learning, since if the model knows that, it might do different things based on that fact. However, there are still lots of reasons why a model might be deceptive even in an online learning setup - for example, it might expect better opportunities for defection in the future, and thus want to prevent being modified now so that it can defect when it'll be most impactful.
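For readers who want to see the online update from the comment above in running form, here is a minimal sketch. The three-armed bandit environment, the softmax policy over logits, and the `learning_rate` of 0.1 are all assumptions added for illustration (the update as written in the comment has an implicit step size of 1). There is no deployment stage: the policy is updated after every single action.

```python
import numpy as np

rng = np.random.default_rng(0)

n_actions = 3
mean_reward = np.array([0.1, 0.5, 0.9])  # hypothetical environment: a 3-armed bandit
theta = np.zeros(n_actions)              # policy parameters (softmax logits)
learning_rate = 0.1                      # assumed step size, not part of the original update

def policy(theta):
    """Softmax distribution over actions for the given logits."""
    exp_logits = np.exp(theta - theta.max())  # subtract max for numerical stability
    return exp_logits / exp_logits.sum()

for t in range(5000):
    probs = policy(theta)
    action = rng.choice(n_actions, p=probs)
    reward = mean_reward[action] + rng.normal(scale=0.1)

    # Gradient of log P(action) with respect to the logits: one_hot(action) - probs.
    grad_log_prob = -probs
    grad_log_prob[action] += 1.0

    # Online policy gradient step: theta <- theta + lr * r_t * grad log P(a_t).
    theta = theta + learning_rate * reward * grad_log_prob

final_probs = policy(theta)
print("final action probabilities:", np.round(final_probs, 3))
```

A bandit this small has no room for anything like a mesa-optimizer; the sketch only pins down what "one gradient step in between each timestep" means mechanically.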
When I say "optimize all the parameters at runtime", I do not mean "take one gradient step in between each timestep". I mean, at each timestep, fully optimize all of the parameters. Optimize $\theta$ all the way to convergence before every single action. Think back to the central picture of mesa-optimization (at least as I understand it). The mesa-optimizer shows up because some data is only available at runtime, not during training, so it has to be processed at runtime using parameters selected during training. In the online RL setup you sketch here, "runtime" for mesa-optimization purposes is every time the system chooses its action - i.e. every timestep - and "training" is all the previous timesteps. A mesa-optimizer should show up if, at every timestep, some relevant new data comes in and the system has to process that data in order to choose the optimal action, using parameters inherited from previous timesteps. Now, suppose we fully optimize all of the parameters at every timestep. The objective function for this optimization would presumably be $\sum_t r_t \log(P[a_t \mid \pi_\theta])$, with the sum taken over all previous data points, since that's what the RL setup is approximating. This optimization would probably still "find" the same mesa-optimizer as before, but now it looks less like a mesa-optimizer problem and more like an outer alignment problem: that objective function is probably not actually the thing we want. The fact that the true optimum for that objective function probably has our former "mesa-optimizer" embedded in it is a pretty strong signal that that objective function itself is not outer aligned; the true optimum of that objective function is not really the thing we want. Does that make sense?
The RL process is actually optimizing $\mathbb{E}[\sum_t r_t]$; the log just comes from the REINFORCE trick. Regardless, I'm not sure I understand what you mean by optimizing fully to convergence at each timestep: convergence is a limiting property, so I don't know what it could mean to do it for a single timestep. Perhaps you mean just taking the optimal policy $\pi^*$ such that $\pi^* = \arg\max_\pi \mathbb{E}[\sum_t r_t \mid \pi]$? In that case, that is in fact the definition of outer alignment I've given in the past, so I agree that whether $\pi^*$ is aligned or not is an outer alignment question.
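For reference, the step where the log appears can be written out for a single timestep. This is the standard log-derivative (REINFORCE) identity, sketched under the assumptions that the action set $A$ is finite and that the reward $r(a)$ depends on $\theta$ only through the sampled action; $\pi_\theta(a)$ is shorthand for $P(a \mid \pi_\theta)$.

```latex
\begin{aligned}
\nabla_\theta \, \mathbb{E}_{a \sim \pi_\theta}[r(a)]
  &= \nabla_\theta \sum_{a \in A} \pi_\theta(a)\, r(a) \\
  &= \sum_{a \in A} r(a)\, \nabla_\theta \pi_\theta(a) \\
  &= \sum_{a \in A} \pi_\theta(a)\, r(a)\, \nabla_\theta \log \pi_\theta(a) \\
  &= \mathbb{E}_{a \sim \pi_\theta}\!\left[ r(a)\, \nabla_\theta \log \pi_\theta(a) \right]
\end{aligned}
```

Under those assumptions, the sampled quantity $r_t \nabla_\theta \log P(a_t \mid \pi_{\theta_t})$ is an unbiased estimate of the gradient of expected reward, so the log is an artifact of the estimator rather than part of the objective.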
Sure, $\pi^*$ works for what I'm saying, assuming that sum-over-time only includes the timesteps taken thus far. In that case, I'm saying that either:

  • the mesa optimizer doesn't appear in $\pi^*$, in which case the problem is fixed by fully optimizing everything at every timestep (i.e. by using $\pi^*$), or
  • the mesa optimizer does appear in $\pi^*$, in which case the problem was really an outer alignment issue all along.
4Ben Pace4y
Thank you for being so clear. On 2, I’m surprised if you think that natural selection isn’t a natural abstraction but that eudaemonia is. (If we’re getting an AGI singleton that wants to fully learn our values.) Secondly I’ll say that if we do not understand its representation of X or X-prime, and if a small difference will be catastrophic, then that will also lead to doom. On 1: I think that’s quite plausible? Like, I assign something in the range of 20-60% probability to that. How much does it have to change for you to feel much safer about inner alignment? (I’m also not that clear it only applies to this situation. Perhaps I’m mistaken, but in my head subsystem alignment and robust delegation both have this property of “build a second optimiser that helps achieve your goals” and in both cases passing on the true utility function seems very hard.)
Currently, my first-pass check for "is this probably a natural abstraction?" is "can humans usually figure out what I'm talking about from a few examples, without a formal definition?". For human values, the answer seems like an obvious "yes". For evolutionary fitness... nonobvious. Humans usually get it wrong without the formal definition. Also, natural abstractions in general involve summarizing the information from one chunk of the universe which is relevant "far away". For human values, the relevant chunk of the universe is the human - i.e. the information about human values is all embedded in the physical human. But for evolutionary fitness, that's not the case - an organism does not contain all the information relevant to calculating its evolutionary fitness. So it seems like there's some qualitative difference there - like, human values "live" in humans, but fitness doesn't "live" in organisms in the same way. I still don't feel like I fully understand this, though. Sure, inner alignment is a problem which mainly applies to architectures similar to modern ML, and modern ML architecture seems like the most-likely route to AGI. It still feels like outer alignment is a much harder problem, though. The very fact that inner alignment failure is so specific to certain architectures is evidence that it should be tractable. For instance, we can avoid most inner alignment problems by just optimizing all the parameters simultaneously at run-time. That solution would be too expensive in practice, but the point is that inner alignment is hard in a "we need to find more efficient algorithms" sort of way, not a "we're missing core concepts and don't even know how to solve this in principle" sort of way. (At least for mesa-optimization; I agree that there are more general subsystem alignment/robust delegation issues which are potentially conceptually harder.) Outer alignment, on the other hand, we don't even know how to solve in principle, on any architecture whatsoever
Currently, my first-pass check for "is this probably a natural abstraction?" is "can humans usually figure out what I'm talking about from a few examples, without a formal definition?". For human values, the answer seems like an obvious "yes". For evolutionary fitness... nonobvious. Humans usually get it wrong without the formal definition.

Hmm, presumably you're not including something like "internal consistency" in the definition of 'natural abstraction'. That is, humans who aren't thinking carefully about something will think there's an imaginable object even if any attempts to actually construct that object will definitely lead to failure. (For example, Arrow's Impossibility Theorem comes to mind; a voting rule that satisfies all of those desiderata feels like a 'natural abstraction' in the relevant sense, even though there aren't actually any members of that abstraction.)

Oh this is fascinating. This is basically correct; a high-level model space can include models which do not correspond to any possible low-level model. One caveat: any high-level data or observations will be consistent with the true low-level model. So while there may be natural abstract objects which can't exist, and we can talk about those objects, we shouldn't see data supporting their existence - e.g. we shouldn't see a real-world voting system behaving like it satisfies all of Arrow's desiderata.
2Ben Pace4y
Regarding your first pass check for naturalness being whether humans can understand it: strike me thoroughly puzzled. Isn't one of the core points of the reductionism sequence that, while "Thor caused the thunder" sounds simpler to a human than Maxwell's equations (because the words fit naturally into a human psychology), one of them is much "simpler" in an absolute sense than the other (and is in fact true)? Regarding your point about the human values living in humans while the organism's fitness is living partly in the environment, nothing immediately comes to mind to say here, but I agree it's a very interesting question. The things you say about inner/outer alignment hold together quite sensibly. I am surprised to hear you say that mesa optimisers can be avoided by just optimizing all the parameters simultaneously at run-time. That doesn't match my understanding of mesa optimisation; I thought the mesa optimisers would definitely arise during the training, but if you're right that it's trivial-but-expensive to remove them there then I agree it's intuitively a much easier problem than I had realised.
Despite humans giving really dumb verbal explanations (like "Thor caused the thunder"), we tend to be pretty decent at actually predicting things in practice. The same applies to natural abstractions. If I ask people "is 'tree' a natural category?" then they'll get into some long philosophical debate. But if I show someone five pictures of trees, then show them five other pictures which are not all trees, and ask them which of the second set are similar to the first set, they'll usually have no trouble at all picking the trees in the second set. If you're optimizing all the parameters simultaneously at runtime, then there is no training. Whatever parameters were learned during "training" would just be overwritten by the optimal values computed at runtime.
2Ben Pace4y
Mm, quantum mechanics much? I do not think I can reliably tell you which experiments are in the category “real” and the category “made up”, even though it’s a very simple category mathematically. But I don’t expect you’re saying this, I just am still confused what you are saying. This reminds me of Oli’s question here, which ties into Abram’s “point of view from somewhere” idea. I feel like I expect ML-systems to take the point of view of the universe, and not learn our natural categories.
I'm talking everyday situations. Like "if I push on this door, it will open" or "by next week my laundry hamper will be full" or "it's probably going to be colder in January than June". Even with quantum mechanics, people do figure out the pattern and build some intuition, but they need to see a lot of data on it first and most people never study it enough to see that much data. In places where the humans in question don't have much first-hand experiential data, or where the data is mostly noise, that's where human prediction tends to fail. (And those are also the cases where we expect learning systems in general to fail most often, and where we expect the system's priors to matter most.) Another way to put it: humans' priors aren't great, but in most day-to-day prediction problems we have more than enough data to make up for that.

I’ll set aside what happens “by default” and focus on the interesting technical question of whether this post is describing a possible straightforward-ish path to aligned superintelligent AGI.

The background idea is “natural abstractions”. This is basically a claim that, when you use an unsupervised world-model-building learning algorithm, its latent space tends to systematically learn some patterns rather than others. Different learning algorithms will converge on similar learned patterns, because those learned patterns are a property of the world, not an idiosyncrasy of the learning algorithm. For example: Both human brains and ConvNets seem to have a “tree” abstraction; neither human brains nor ConvNets seem to have a “head or thumb but not any other body part” concept.

I kind of agree with this. I would say that the patterns are a joint property of the world and an inductive bias. I think the relevant inductive biases in this case are something like: (1) “patterns tend to recur”, (2) “patterns tend to be localized in space and time”, and (3) “patterns are frequently composed of multiple other patterns, which are near to each other in space and/or time”, and maybe other things. Th... (read more)

I'm fairly confident that the inputs to human values are natural abstractions - i.e. the "things we care about" are things like trees, cars, other humans, etc, not low-level quantum fields or "head or thumb but not any other body part". (The "head or thumb" thing is a great example, by the way). I'm much less confident that human values themselves are a natural abstraction, for exactly the same reasons you gave.

To help me check my understanding of what you're saying, we train an AI on a bunch of videos/media about Alice's life, in the hope that it learns an internal concept of "Alice's values". Then we use SL/RL to train the AI, e.g., give it a positive reward whenever it does something that the supervisor thinks benefits Alice's values. The hope here is that the AI learns to optimize the world according to its internal concept of "Alice's values" that it learned in the previous step. And we hope that its concept of "Alice's values" includes the idea that Alice wants AIs, including any future AIs, to keep improving their understanding of Alice's values and to serve those values, and that this solves alignment in the long run.

Assuming the above is basically correct, this (in part) depends on the AI learning a good enough understanding of "improving understanding of Alice's values" in step 1. This in turn (assuming "improving understanding of Alice's values" involves "using philosophical reasoning to solve various confusions related to understanding Alice's values, including Alice's own confusions") depends on that the AI can learn a correct or good enough concept of "philosophical reasoning" from unsupervised training. Correct?

If AI can learn "philosophical reasoning" from unsupervised training, GPT-N should be able to do philosophy (e.g., solve open philosophical problems), right?

There's a lot of moving pieces here, so the answer is long. Apologies in advance.

I basically agree with everything up until the parts on philosophy. The point of divergence is roughly here:

assuming "improving understanding of Alice's values" involves "using philosophical reasoning to solve various confusions related to understanding Alice's values, including Alice's own confusions"

I do think that resolving certain confusions around values involves solving some philosophical problems. But just because the problems are philosophical does not mean that they need to be solved by philosophical reasoning.

The kinds of philosophical problems I have in mind are things like:

  • What is the type signature of human values?
  • What kind of data structure naturally represents human values?
  • How do human values interface with the rest of the world?

In other words, they're exactly the sort of questions for which "utility function" and "Cartesian boundary" are answers, but probably not the right answers.

How could an AI make progress on these sorts of questions, other than by philosophical reasoning?

Let's switch gears a moment and talk about some analogous problems:

  • What is the type signature of the concept of
... (read more)
6Wei Dai4y
So similarly, a human could try to understand Alice's values in two ways. The first, equivalent to what you describe here for AI, is to just apply whatever learning algorithm their brain uses when observing Alice, and form an intuitive notion of "Alice's values". And the second is to apply explicit philosophical reasoning to this problem. So sure, you can possibly go a long way towards understanding Alice's values by just doing the former, but is that enough to avoid disaster? (See Two Neglected Problems in Human-AI Safety for the kind of disaster I have in mind here.) (I keep bringing up metaphilosophy but I'm pretty much resigned to be living in a part of the multiverse where civilization will just throw the dice and bet on AI safety not depending on solving it. What hope is there for our civilization to do what I think is the prudent thing, when no professional philosophers, even ones in EA who are concerned about AI safety, ever talk about it?)
I mostly agree with you here. I don't think the chances of alignment by default are high. There are marginal gains to be had, but to get a high probability of alignment in the long term we will probably need actual understanding of the relevant philosophical problems.
My take is that corrigibility is sufficient to get you an AI that understands what it means to "keep improving their understanding of Alice's values and to serve those values".  I don't think the AI needs to play the "genius philosopher" role, just the "loyal and trustworthy servant" role.  A superintelligent AI which plays that role should be able to facilitate a "long reflection" where flesh and blood humans solve philosophical problems. (I also separately think unsupervised learning systems could in principle make philosophical breakthroughs. Maybe one already has.)
In light of the previous section, there’s an obvious path to alignment where there turns out to be a few neurons (or at least some simple embedding) which correspond to human values, we use the tools of microscope AI to find that embedding, and just like that the alignment problem is basically solved.

This is the part I disagree with. The network does recognise trees, or at least green things (given that the grass seems pretty brown in the low tree pic).

Extrapolating this, I expect the AI might well have neurons that correspond roughly to human values, on the training data. Within the training environment, human values, amount of dopamine in human brain, curvature of human lips (in smiles), number of times the reward button is pressed, and maybe even amount of money in human bank account might all be strongly correlated.

You will have successfully narrowed human values down to within the range of things that are strongly correlated with human values in the training environment. If you take this signal and apply enough optimization pressure, you are going to get the equivalent of a universe tiled with tiny smiley faces.
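The worry above can be illustrated with a toy sketch. Everything in it is an assumption made for illustration: world-states are reduced to hypothetical `(wellbeing, smiles)` pairs, `smiles` stands in for any proxy that tracks values on the training data, and the single off-distribution state stands in for "tiny smiley faces".

```python
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


rng = random.Random(0)

# Training environment: smiles closely track wellbeing (plus a little noise).
train_states = []
for _ in range(1000):
    wellbeing = rng.uniform(0.0, 1.0)
    smiles = wellbeing + rng.gauss(0.0, 0.05)
    train_states.append((wellbeing, smiles))

train_corr = pearson([w for w, _ in train_states], [s for _, s in train_states])

# A strong optimizer can also reach states never seen in training,
# e.g. a world tiled with smiley faces: many "smiles", no wellbeing.
off_distribution_states = [(0.0, 100.0)]
reachable_states = train_states + off_distribution_states

best_by_proxy = max(reachable_states, key=lambda state: state[1])
best_by_value = max(reachable_states, key=lambda state: state[0])

print(f"train correlation: {train_corr:.3f}")
print(f"state chosen by optimizing the proxy: {best_by_proxy}")
print(f"state chosen by optimizing wellbeing: {best_by_value}")
```

The proxy is nearly indistinguishable from the target on the training data, yet the state that maximizes it is the one with no wellbeing at all. Whether a learned embedding of human values actually behaves like `smiles` here is exactly the point under dispute in this thread.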

Note that the examples in the OP are from an adversarial generative network. If its notion of "tree" were just "green things", the adversary should be quite capable of exploiting that. The whole point of the "natural abstractions" section of the OP is that I do not think this will actually happen. Off-distribution behavior is definitely an issue for the "proxy problems" section of the post, but I do not expect it to be an issue for identifying natural abstractions.
4Donald Hobson4y
In order for the network to produce good pictures, the concept of "tree" must be hidden in there somewhere, but it could be hidden in a complicated and indirect manner. I am questioning whether the particular single node selected by the researchers encodes the concept of "tree" or "green thing".
Ah, I see. You're saying that the embedding might not actually be simple. Yeah, that's plausible.

Planned summary for the Alignment Newsletter:

I liked the author’s summary, so I’ve reproduced it with minor stylistic changes:
A low-level model of some humans has everything there is to know about human values embedded within it, in exactly the same way that human values are embedded in physical humans. The embedding, however, is nontrivial. Thus, predictive power alone is not sufficient to define human values. The missing part is the embedding of values within the model.
However, this also applies if we replace the phrase “human values
... (read more)

This came out of the discussion you had with John Maxwell, right? Does he think this is a good presentation of his proposal?

How do we know that the unsupervised learner won't have learnt a large number of other embeddings closer to the proxy? If it has, then why should we expect human values to do well?

Some rough thoughts on the data type issue. Depending on what types the unsupervised learner provides the supervised, it may not be able to reach the proxy type by virtue of issues with NN learning processes.

Recall that data types can be viewed as hom... (read more)

I'm very glad johnswentworth wrote this, but there are a lot of little details where we seem to disagree--see my other comments in this thread. There are also a few key parts of my proposal not discussed in this post, such as active learning and using an ensemble to fight Goodharting and be more failure-tolerant. I don't think there's going to be a single natural abstraction for "human values" like johnswentworth seems to imply with this post, but I also think that's a solvable problem. (previous discussion for reference)
Sort of? That was one significant factor which made me write it up now, and there's definitely a lot of overlap. But this isn't intended as a response/continuation to that discussion, it's a standalone piece, and I don't think I specifically address his thoughts from that conversation. A lot of the material is ideas from the abstraction project which I've been meaning to write up for a while, as well as material from discussions with Rohin that I've been meaning to write up for a while. Two brief comments here. First, I claim that natural abstraction space is quite discrete (i.e. there usually aren't many concepts very close to each other), though this is nonobvious and I'm not ready to write up a full explanation of the claim yet. Second, for most proxies there probably are natural abstractions closer to the proxy, because most simple proxies are really terrible - for instance, if our proxy is "things people say are ethical on twitter", then there's probably some sort of natural abstraction involving signalling which is closer. Assuming we get the chance to iterate, this is the sort of thing which people hopefully solve by trying stuff and seeing what works. (Not that I give that a super-high chance of success, but it's not out of the question.) Strongly agree with this, and your explanation is solid. Worth mentioning that we do have some universality results for neural nets, but it's still the case that the neural net structure has implicit priors/biases which could make it hard to learn certain data structures. This is one of several reasons why I see "figuring out what sort-of-thing human values are" as one of the higher-expected-value subproblems on the theoretical side of alignment research.
Based off what you've said in the comments, I'm guessing you'd say the various forms of corrigibility are natural abstractions. Would you say we can use the strategy you outline here to get "corrigibility by default"? Regarding iterations, the common objection is that we're introducing optimisation pressure. So we should expect the usual alignment issues anyway. Under your theory, is this not an issue because of the sparsity of natural abstractions near human values?
I'm not sure about whether corrigibility is a natural abstraction. It's at least plausible, and if it is, then corrigibility by default should work under basically-similar assumptions. Basically, yes. We want the system to use its actual model of human values as a proxy for its objective, which is itself a proxy for human values. So the whole strategy will fall apart in situations where the system converges to the true optimum of its objective. But in situations where a proxy for the system's true optimum would be used (e.g. weak optimization or insufficient data to separate proxy from true), the model of human values may be the best available proxy.

Thanks a lot for writing this. I've been thinking about FAI plans along these lines for a while now, here are some thoughts on specific points you made.

First, I take issue with the "Alignment By Default" title. There are two separate questions here. Question #1 is whether we'd have a good outcome if everyone concerned with AI safety got hit by a bus. Question #2 is whether there's a way to create Friendly AI using unsupervised learning. I'm rather optimistic that the answer to Question #2 is yes. I find the unsupervised learning family of approaches ... (read more)

Thanks for the comments, these are excellent! Valid complaint on the title, I basically agree. I only give the path outlined in the OP ~10% of working without any further intervention by AI safety people, and I definitely agree that there are relatively-tractable-seeming ways to push that number up on the margin. (Though those would be marginal improvements only; I don't expect them to close the bulk of the gap without at least some progress on theoretical bottlenecks.) I am generally lukewarm about human-simulation approaches to alignment; the fusion power generator scenario is a prototypical example of my concerns here (also see this comment on it, which explains what I see as the key take-away). The idea of simulating a human doing moral philosophy is a bit different than what I usually imagine, though; it's basically like taking an alignment researcher and running them on faster hardware. That doesn't directly solve any of the underlying conceptual problems - it just punts them to the simulated researchers - but it is presumably a strict improvement over a limited number of researchers operating slowly in meatspace. Alignment research ems! I don't think this helps much. Two examples of "specifics of the data collection process" to illustrate: * Suppose our data consists of human philosophers' writing on morality. Then the "specifics of the data collection process" includes the humans' writing skills and signalling incentives, and everything else besides the underlying human values. * Suppose our data consists of humans' choices in various situations. Then the "specifics of the data collection process" includes the humans' mistaken reasoning, habits, divergence of decision-making from values, and everything else besides the underlying human values. So "specifics of the data collection process" is a very broad notion in this context. Essentially all practical data sources will include a ton of extra information besides just their information on human values
Can you be more specific about the theoretical bottlenecks that seem most important? I agree that Tool AI is not inherently safe. The key question is which problem is easier: the alignment problem, or the safe-use-of-dangerous-tools problem. All else equal, if you think the alignment problem is hard, then you should be more willing to replace alignment work with tool safety work. If you think the alignment problem is easy, you should discourage dangerous tools in favor of frontloaded work on a more paternalistic "not just benign, actually aligned" AI. An analogy here would be Linux vs Windows. Linux lets you shoot your foot off and wipe your hard drive with a single command, but it also gives you greater control of your system and your computer is less likely to get viruses. Windows is safer and more paternalistic, with less user control. Windows is a better choice for the average user, but that's partially because we have a lot of experience building operating systems. It wouldn't make sense to aim for a Windows as our first operating system, because (a) it's a more ambitious project and (b) we wouldn't have enough experience to know the right ways in which to be paternalistic. Heck, it was you who linked disparagingly to waterfall-style software development the other day :) There's a lot to be said for simplicity of implementation. (Random aside: In some sense I think the argument for paternalism is self-refuting, because the argument is essentially that humans can't be trusted, but I'm not sure the total amount of responsibility we're assigning to humans has changed--if the first system is to be very paternalistic, that puts an even greater weight of responsibility on the shoulders of its designers to be sure and get it right. I'd rather shove responsibility into the post-singularity world, because the current world seems non-ideal, for example, AI designers have limited time to think due to e.g. possible arms races.) What do I mean by the "safe-use-of-danger
Type signature of human values is the big one. I think it's pretty clear at this point that utility functions aren't the right thing, that we value things "out in the world" as opposed to just our own "inputs" or internal state, that values are not reducible to decisions or behavior, etc. We don't have a framework for what-sort-of-thing human values are. If we had that - not necessarily a full model of human values, just a formalization which we were confident could represent them - then that would immediately open the gates to analysis, to value learning, to uncertainty over values, etc. A good argument, but I see the difficulties of safe tool AI and the difficulties of alignment as mostly coming from the same subproblem. To the extent that that's true, alignment work and tool safety work need to be basically the same thing. On the tools side, I assume the tools will be reasoning about systems/problems which humans can't understand - that's the main value prop in the first place. Trying to collapse the complexity of those systems into a human-understandable API is inherently dangerous: values are complex, the system is complex, their interaction will inevitably be complex, so any API simple enough for humans will inevitably miss things. So the only safe option which can scale to complex systems is to make sure the "tools" have their own models of human values, and use those models to check the safety of their outputs... which brings us right back to alignment. Simple mechanisms like always displaying an estimated probability that I'll regret asking a question would probably help, but I'm mainly worried about the unknown unknowns, not the known unknowns. That's part of what I mean when I talk about marginal improvements vs closing the bulk of the gap - the unknown unknowns are the bulk of the gap. (I could see tools helping in a do-the-same-things-but-faster sort of way, and human-mimicking approaches in particular are potentially helpful there. On the other han
Do you have in mind a specific aspect of human values that couldn't be represented using, say, the reward function of a reinforcement learning agent AI? There's an aspect of defense-in-depth here. If your tool's model of human values is slightly imperfect, that doesn't necessarily fail hard the way an agent with a model of human values that's slightly imperfect does. BTW, let's talk about the "Research Assistant" story here. See more discussion here. (The problems brought up in that thread seem pretty solvable to me.) That's why you need a tool... so it can tell you the unknown unknowns you're missing, and how to solve them. We'd rather have a single die roll, on creating a good tool, than have a separate die roll for every one of those unknown unknowns, wouldn't we? ;-) Shouldn't we aim for a fairly minimalist, non-paternalistic tool where unknown unknowns are relatively unlikely to become load-bearing? All we need to do is figure out the unknown unknowns that are load-bearing in the Research Assistant scenario, then the assistant can help us with the rest of the unknown unknowns. If solving FAI necessarily involves reasoning about things which are beyond humans (which seems to be what you're getting at with the "unknown unknowns" stuff), what is the alternative? We were discussing a scenario where we had an OK solution to alignment, and you were saying that you didn't want to get locked into a merely OK solution for all of eternity. I'm saying corrigibility can address that. Alignment is already solvable to an OK degree in this hypothetical, so I'm assuming corrigibility is solvable to an OK degree as well. Corrigible AI should be able to improve its corrigibility with increased capabilities the same way it can improve its alignment with increased capabilities. You say "corrigibility" has a lot of hidden complexity. The more capable the system, the more hypotheses it can generate regarding complex phenomena, and the more likely those hypotheses are to be correct
It's not the function-representation that's the problem, it's the type-signature of the function. I don't know what such a function would take in or what it would return. Even RL requires that we specify the input-output channels up-front. This translates in my head to "all we need to do is solve the main problems of alignment, and then we'll have an assistant which can help us clean up any easy loose ends". More generally: I'm certainly open to the idea of AI, of one sort or another, helping to work out at least some of the problems of alignment. (Indeed, that's very likely a component of any trajectory where alignment improves over time.) But I have yet to hear a convincing case that punting now actually makes long-run alignment more likely, or even that future tools will make creation of aligned AI easier/more likely relative to unaligned AI. What exactly is the claim here? I don't think solving FAI involves reasoning about things beyond humans. I think the AIs themselves will need to reason about things beyond humans, and in particular will need to reason about complex safety problems on a day-to-day basis, but I don't think that designing a friendly AI is too complex for humans. Much of the point of AI is that we can design systems which can reason about things too complex for ourselves. Similarly, I expect we can design safe systems which can reason about safety problems too complex for ourselves. What notion of "corrigible" are you using here? It sounds like it's not MIRI's "the AI won't disable its own off-switch" notion.
Try to clarify here, do you think the problems brought up in these answers are the main problems of alignment? This claim seems a bit odd to me because I don't think those problems are highlighted in any of the major AI alignment research agenda papers. (Alternatively, if you feel like there are important omissions from those answers, I strongly encourage you to write your own answer!) I did a presentation at the recent AI Safety Discussion Day on how to solve the problems in that thread. My proposed solutions don't look much like anything that's e.g. on Arbital because the problems are different. I can share the slides if you want, PM me your gmail address. Here's an example of a tool that I would find helpful right now, that seems possible to make with current technology (and will get better as technology advances), and seems very low risk: Given a textual description of some FAI proposal (or proposal for solving some open problem within AI safety), highlight the contiguous passage of text within the voluminous archives of AF/LW/etc. that is most likely to represent a valid objection to this proposal. (EDIT: Or, given some AI safety problem, highlight the contiguous passage of text which is most likely to represent a solution.) Can you come up with improbable scenarios in which this sort of thing ends up being net harmful? Sure. But security is not binary. Just because there is some hypothetical path to harm doesn't mean harm is likely. Could this kind of approach be useful for unaligned AI as well? Sure. So begin work on it ASAP, keep it low profile, and restrict its use to alignment researchers in order to create maximum differential progress towards aligned AI. I'm a bit confused why you're bringing up "safety problems too complex for ourselves" because it sounds like you don't think there are any important safety problems like that, based on the sentences that came before this one? I'm talking about the broad sense of "corrigible" described in e.g. the b
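The passage-highlighting tool proposed above can be sketched very crudely. The sketch below ranks candidate passages by bag-of-words cosine similarity to the proposal text, which is an assumption standing in for "most likely to represent a valid objection"; word overlap measures topical relevance, not validity. The function names and the example passages are hypothetical.

```python
import math
import re
from collections import Counter


def bag_of_words(text):
    """Lowercase word counts for a piece of text."""
    return Counter(re.findall(r"[a-z']+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def most_relevant_passage(proposal, passages):
    """Return the archived passage that overlaps most with the proposal.

    Assumption: similarity is only a rough proxy for "valid objection";
    a real tool would need a model of what counts as an objection.
    """
    query = bag_of_words(proposal)
    return max(passages, key=lambda p: cosine(query, bag_of_words(p)))


if __name__ == "__main__":
    archive = [
        "Optimizing a learned reward proxy hard can break the proxy.",
        "My laundry hamper will be full by next week.",
    ]
    proposal = "Train a reward model and optimize the learned reward proxy."
    print(most_relevant_passage(proposal, archive))
```

Even this crude version makes the disagreement concrete: it can only surface things someone has already written down, which is the limitation raised in the reply that follows.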
Mostly no. I've been trying to write a bit more about this topic lately; Alignment as Translation is the main source of my intuitions on core problems, and the fusion power generator scenario is an example of what that looks like in a GPT-like context (parts of your answer here are similar to that). Using GPT-like systems to simulate alignment researchers' writing is a probably-safer use-case, but it still runs into the core catch-22. Either: * It writes something we'd currently write, which means no major progress (since we don't currently have solutions to the major problems and therefore can't write down such solutions), or * It writes something we currently wouldn't write, in which case it's out-of-distribution and we have to worry about how it's extrapolating us I generally expect the former to mostly occur by default; the latter would require some clever prompts. I could imagine at least some extrapolation of progress being useful, but it still seems like the best way to make human-simulators more useful is to improve our own understanding, so that we're more useful to simulate. This sounds like a great tool to have. It's exactly the sort of thing which is probably marginally useful. It's unlikely to help much on the big core problems; it wouldn't be much use for identifying unknown unknowns which nobody has written about before. But it would very likely help disseminate ideas, and be net-positive in terms of impact. I do think a lot of the things you're suggesting would be valuable and worth doing, on the margin. They're probably not sufficient to close the bulk of the safety gap without theoretical progress on the core problems, but they're still useful. The "safety problems too complex for ourselves" are things like the fusion power generator scenario - i.e. safety problems in specific situations or specific applications. The safety problems which I don't think are too complex are the general versions, i.e. how to build a generally-aligned AI. An
Well, I encourage you to come up with a specific way in which GPT-N will harm us by trying to write an AF post due to not having solved Alignment as Translation and add it as an answer in that thread. Given that we may be in an AI overhang, I'd like the answers to represent as broad a distribution of plausible harms as possible, because that thread might end up becoming very important & relevant very soon.

Some notes on the loss function in unsupervised learning:

Since an unsupervised learner is generally just optimized for predictive power

I think it's worthwhile to distinguish the loss function that's being optimized during unsupervised learning, vs what the practitioner is optimizing for. Yes, the loss function being optimized in an unsupervised learning system is frequently minimization of reconstruction error or similar. But when I search for "unsupervised learning review" on Google Scholar, I find this highly cited paper by Bengio et al. The abstr...

This comment definitely wins the award for best comment on the post so far. Great ideas, highly relevant links. I especially like the deliberate noise idea. That plays really nicely with natural abstractions as information-relevant-far-away: we can intentionally insert noise along particular dimensions, and see how that messes with prediction far away (either via causal propagation or via loss of information directly). As long as most of the noise inserted is not along the dimensions relevant to the high-level abstraction, denoising should be possible. So it's very plausible that denoising autoencoders are fairly-directly incentivized to learn natural abstractions. That'll definitely be an interesting path to pursue further. Assuming that the denoising autoencoder objective more-or-less-directly incentivizes natural abstractions, further refinements on that setup could very plausibly turn into a useful "ease of interpretability" objective.
Thanks! I don't consider myself an expert on the unsupervised learning literature by the way, I expect there is more cool stuff to be found.

I like this post a lot, and I think it points out a key crux between what I would term the "Yudkowsky" side (which seems to mostly include MIRI, though I'm not too sure about individual researchers' views) and "everybody else".

In particular, the disagreement seems to crystallize over the question of whether "human values" really are a natural abstraction. I suspect that if Eliezer thought that they were, he would be substantially less worried about AI alignment than he currently is (though naturally all of this is my read on his views).

You do provide some ...

Bit of a side-note, but the high entropy of tree branching comes from trees using the biological equivalent of random number generators when "deciding" when/whether to form a branch. The distribution of branch length-ratios/counts/angles is actually fairly simple and stable, and is one of the main characteristics which makes particular tree species visually distinctive. See L-systems for the basics, or speedtree for the industrial-grade version (and some really beautiful images). It's that distribution which is the natural abstraction - i.e. the distribution summarizes information about branching which is relevant to far-away trees of the same species.
I think there's a subtle confusion here between two different claims:

  • Human values evolved as a natural abstraction of some territory.
  • Humans' notion of "human values" is a natural abstraction of humans' actual values.

It sounds like your comment is responding to the former, while I'm claiming the latter. A key distinction here is between humans' actual values, and humans' model/notion of our own values. Humans' actual values are the pile of heuristics inherited from evolution. But humans also have a model of their values, and that model is not the same as the underlying values. The phrase "human values" necessarily points to the model, because that's how words work - they point to models. My claim is that the model is a natural abstraction of the actual values, not that the actual values are a natural abstraction of anything. This is closely related to this section from the OP: Roughly speaking, the concept of "human values" summarizes anything about the values of one human which is relevant to the values of far-away humans. Does that make sense?

This is the sort of thing I've been thinking about since "What's the dream for giving natural language commands to AI?" (which bears obvious similarities to this post). The main problems I noted there apply similarly here:

  • Prediction in the supervised task might not care about the full latent space used for the unsupervised tasks, losing information.
  • Little to no protection from Goodhart's law. Things that are extremely good proxies for human values still might not be safe to optimize.
  • Doesn't care about metaethics, just maximize
...

In this post, the author describes a pathway by which AI alignment can succeed even without special research effort. The specific claim that this can happen "by default" is not very important, IMO (the author himself only assigns 10% probability to this). On the other hand, viewed as a technique that can be deliberately used to help with alignment, this pathway is very interesting.

The author's argument can be summarized as follows:

  • For anyone trying to predict events happening on Earth, the concept of "human values" is a "natural abstraction", i.e. someth
...
One subtlety which approximately 100% of people I've talked to about this post apparently missed: I am pretty confident that the inputs to human values are natural abstractions, i.e. we care about things like trees, cars, humans, etc, not about quantum fields or random subsets of atoms. I am much less confident that "human values" themselves are natural abstractions; values vary a lot more across cultures than e.g. agreement on "trees" as a natural category. In the particular section you quoted, I'm explicitly comparing the best-case of abstraction by default to the other two strategies, assuming that the other two work out about-as-well as they could realistically be expected to work. For instance, learning a human utility function is usually a built-in assumption of IRL formulations, so such formulations can't do any better than a utility function approximation even in the best case. Alignment by default does not need to assume humans have a utility function; it just needs whatever-humans-do-have to have low marginal complexity in a system which has learned lots of natural abstractions. Obviously alignment by default has analogous assumptions/flaws; much of the OP is spent discussing them. The particular section you quote was just talking about the best-case where those assumptions work out well. I partially agree with this, though I do think there are good arguments that malign simulation issues will not be a big deal (or to the extent that they are, they'll look more like Dr Nefarious than pure inner daemons), and by historical accident those arguments have not been circulated in this community to nearly the same extent as the arguments that malign simulations will be a big deal. Some time in the next few weeks I plan to write a review of The Solomonoff Prior Is Malign which will talk about one such argument.
Vanessa Kosoy (2y):
That's fair, but it's still perfectly in line with the learning-theoretic perspective: human values are simpler to express through the features acquired by unsupervised learning than through the raw data, which translates to a reduction in sample complexity. This seems wrong to me. If you do IRL with the correct type signature for human values then in the best case you get the true human values. IRL is not mutually exclusive with your approach: e.g. you can do unsupervised learning and IRL with shared weights. I guess you might be defining "IRL" as something very narrow, whereas I define it "any method based on revealed preferences". Malign simulation hypotheses already look like "Dr. Nefarious" where the role of Dr. Nefarious is played by the masters of the simulation, so I'm not sure what exactly is the distinction you're drawing here.
Yup, that's right. I still agree with your general understanding, just wanted to clarify the subtlety. Yup, I agree with all that. I was specifically talking about IRL approaches which try to learn a utility function, not the more general possibility space. The distinction there is about whether or not there's an actual agent in the external environment which coordinates acausally with the malign inner agent, or some structure in the environment which allows for self-fulfilling prophecies, or something along those lines. The point is that there has to be some structure in the external environment which allows a malign inner agent to gain influence over time by making accurate predictions. Otherwise, the inner agent will only have whatever limited influence it has from the prior, and every time it deviates from its actual best predictions (or is just out-predicted by some other model), some of that influence will be irreversibly spent; it will end up with zero influence in the long run.
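The "influence gets spent" point can be illustrated with a toy Bayesian mixture. This is only a minimal sketch under assumptions that are not from the thread: two hypothetical predictors of a binary sequence (`honest` and `deviating` are illustrative names), a fixed stand-in observation sequence, and equal prior weights. The deviating predictor shades its prediction on every tenth step and otherwise agrees with the honest one.

```python
import math

def posterior_weights(prior, predictors, observations):
    """Bayesian posterior over predictors of a binary sequence.

    prior: prior weights (summing to 1).
    predictors: functions t -> P(bit at time t is 1).
    observations: list of observed bits (0 or 1).
    """
    log_w = [math.log(p) for p in prior]
    for t, bit in enumerate(observations):
        for i, predict in enumerate(predictors):
            p_one = predict(t)
            log_w[i] += math.log(p_one if bit == 1 else 1.0 - p_one)
    # Normalize in log space for numerical stability.
    top = max(log_w)
    unnormalized = [math.exp(lw - top) for lw in log_w]
    total = sum(unnormalized)
    return [u / total for u in unnormalized]

def honest(t):
    # Always reports its actual best estimate.
    return 0.9

def deviating(t):
    # Hypothetical inner agent: shades its prediction on every tenth step.
    return 0.5 if t % 10 == 0 else 0.9

# Stand-in "typical" data: mostly ones, with a zero once every ten steps.
observations = [0 if t % 10 == 5 else 1 for t in range(100)]

weights_10 = posterior_weights([0.5, 0.5], [honest, deviating], observations[:10])
weights_100 = posterior_weights([0.5, 0.5], [honest, deviating], observations)
print("deviating predictor's weight after 10 steps:", weights_10[1])
print("deviating predictor's weight after 100 steps:", weights_100[1])
```

Between deviations the two predictors agree, so the weight ratio stays flat; each deviation lowers it, and in this setup nothing ever raises it back. This toy has no structure in the environment that would let the deviating predictor out-predict the honest one, which is exactly the case the comment above describes.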
Vanessa Kosoy (2y):
Of course, but this in itself is no consolation, because it can spend its finite influence to make the AI perform an irreversible catastrophic action: for example, self-modifying into something explicitly malign. In e.g. IDA-type protocols you can defend by using a good prior (such as IB physicalism) plus confidence thresholds (i.e. every time the hypotheses have a major disagreement you query the user). You also have to do something about non-Cartesian attack vectors (I have some ideas), but that doesn't depend much on the protocol. In value learning things are worse, because of the possibility of corruption (i.e. the AI hacking the user or its own input channels). As a consequence, it is no longer clear you can infer the correct values even if you make correct predictions about everything observable. Protocols based on extrapolating from observables to unobservables fail, because malign hypotheses can attack the extrapolation with impunity (e.g. a malign hypothesis can assign some kind of "Truman show" interpretation to the behavior of the user, where the user's true values are completely alien and they are just pretending to be human because of the circumstances of the simulation).
It's up.

I guess the main issue that I have with this argument is that an AI system that is extremely good at prediction is unlikely to just have a high-level concept corresponding to human values (if it does contain such a concept). Instead, it's likely to also include a high-level concept corresponding to what people say about values - or rather several, corresponding to what various different groups would say about human values. If your proxy is based on what people say, then these concepts which correspond to what people say will match much better - and the probability of at least one of these concepts being the best match is increased by the large number of them. So I don't put a very high weight on this scenario at all.

This is a great explanation. I basically agree, and this is exactly why I expect alignment-by-default to most likely fail even conditional on the natural abstractions hypothesis holding up.
Also, I have another strange idea that might increase the probability of this working. If you could temporarily remove proxies based on what people say, then this would seem to greatly increase the chance of it hitting the actual embedded representation of human values. Maybe identifying these proxies is easier than identifying the representation of "true human values"? I don't think it's likely to work, but thought I'd share anyway.
Thanks! Is this why you put the probability as "10-20% chance of alignment by this path, assuming that the unsupervised system does end up with a simple embedding of human values"? Or have you updated your probabilities since writing this post?
Yup, this is basically where that probability came from. It still feels about right.
This requires hitting a window - our data needs to be good enough that the system can tell it should use human values as a proxy, but bad enough that the system can’t figure out the specifics of the data-collection process enough to model it directly. This window may not even exist.

I like this framing, it is clarifying.

When alignment-by-default works, it’s basically a best-case scenario, so we can safely use the system to design a successor without worrying about amplification of alignment errors (among other things).

I didn't understand how this was derived or what other results/ideas it is referencing.

The idea here is that the AI has a rough model of human values, and is pointed at those values when making decisions (e.g. the embedding is known and it's optimizing for the embedded values, in the case of an optimizer). It may not have perfect knowledge of human values, but it would e.g. design its successor to build a more precise model of human values than itself (assuming it expects that successor to have more relevant data) and point the successor toward that model, because that's the action which best optimizes for its current notion of human values. Contrast to e.g. an AI which is optimizing for human approval. If it can do things which make a human approve, even though the human doesn't actually want those things (e.g. deceptive behavior), then it will do so. When that AI designs its successor, it will want the successor to be even better at gaining human approval, which means making the successor even better at deception. This probably needs more explanation, but I'm not sure which parts need more explanation, so feedback would be appreciated.
Is the idea that the AI is optimizing for humans approving of things, as opposed to humans approving of its actions? It seems that if it's optimizing for humans approving of its actions, it doesn't necessarily have an incentive to make a successor that optimizes for approval (though I admit it's not clear why it would make a successor at all in this case; perhaps it's designed to not plan against being deactivated after some time).
Right, I should clarify that. I was imagining that it's designing a successor which will take over the AI's own current input/output channels, so "its actions" in the future will actually be the successor's actions. (Equivalently, we could imagine the AI contemplating self-modification.)
This is helpful.
So in principle, it doesn’t even matter what kind of model we use or how it’s represented; as long as the predictive power is good enough, values will be embedded in there, and the main problem will be finding the embedding.

I will agree with this. However, notice what this doesn't say. It doesn't say "any model powerful enough to be really dangerous contains human values". Imagine a model that was good at a lot of science and engineering tasks. It was good enough at nuclear physics to design effective fusion reactors and b...

This is entirely correct.

This will likely not work for a dualistic model of human values (and other complex models, like family systems). In this model, a human has an ethical system and opposing suppressed desires.

For example, I think that it is good to eat less, but I have a desire to overeat. Combined, they produce behaviour in which I often have eating binges followed by periods of fasting. If an AI wants to predict my behaviour, it may conclude that I want to have periods of overeating and extrapolate accordingly. However, I consciously endorse only the "eating less" ethic and regard it as my true values. As S. Armstrong wrote, there is always an assumption about which part of me should be regarded as "true values".

Your behavior is not what the AI is trying to predict. The AI is just trying to predict the world, in general - including e.g. the outcomes of medical or psychological experiments which specifically try to probe the gears underlying your behavior.
But the result of such experiments may still not converge: in one experiment I will claim to have a value of not eating, and in another I will eat. But if the AI is advanced enough, it could also guess the correct structure of the motivational system, like the number of significant parts in it, and each will be represented inside its human model. However, if there are many ways to create human models of similar efficacy, we can't say which model is correct or guess the "correct" values.
That's still just looking at behavior. Probing the internals would mean e.g. hooking you to an fMRI to see what's happening in the brain when you claim to have a value of not eating or when you eat. We can say which model is correct by looking at the internal structure of humans, which is exactly why medical research is relevant.
Knowing the internal structure will not help much, in the same way that knowing pixel locations in a picture is not the same as image recognition, which requires high-level representation and abstraction. We need something like a high-level representation of trees, as in your example, but for values. But values could be abstracted in different ways - in many more ways than trees. Even trees may be represented as "green mass", or as a set of branches, or in some other slightly non-human way.
This is the part I disagree with. I think there is a single (up to isomorphism) notion of "tree" toward which a very broad variety of computationally-limited predictive systems will converge. That's what the OP's discussion of "natural abstractions" and "information relevant far away" is about. For instance, if a system's only concept of "tree" is "green mass" then it's either going to (a) need whole separate models for trees in autumn and winter (which would be computationally expensive), or (b) lose predictive power when reasoning about trees in autumn and winter. Also, if it learns new facts about green-mass-trees, how will it know that those facts generalize to non-green-mass-trees? Pointing to a Flower has a lot more about this, although it's already out-of-date compared to my current thoughts on the problem.
And here is my point: trees actually exist, and they are a natural abstraction. "Human values" were created by psychologists in the middle of the 20th century as one of the ways to describe the human mind. They don't actually exist, but are useful descriptive instruments for some tasks. There are other ways to describe the human mind and human motivations: ethical norms, drives, memes, desires, the Freudian model, family systems, etc. An AI may find some other abstractions which are even better at compressing behaviour, but they will not be human values.
Humans have wanted things, and recognized other humans as wanting things, since long before 20th century psychologists came along and used the phrase "human values". I don't particularly care about aligning an AI to whatever some psychologist defines as "human values", I care about aligning an AI to the things humans want. Those are the "human values" I care about. The very fact that I can talk about that, and other people generally seem to know what I'm talking about without me needing to give a formal definition, is evidence that it is a natural abstraction. I would not say there are "other ways to model the human mind", but rather there are other aspects of the human mind which one can model. (Also there are some models of the human mind which are just outright wrong, e.g. Freudian models.) If a model is to achieve strong general-purpose predictive power, then it needs to handle all of those different aspects, including human values. A model of the human mind may be lower-level than "human values", e.g. a low-level physics model of the brain, but that will still have human values embedded in it somehow. If a model doesn't have human values embedded in it somewhere, then it will have poor predictive performance on many problems in which human values are involved.
But human "wants" are not actually a good thing for an AI to follow. If I am fasting, I obviously want to eat, but my decision is to not eat today. And if I have a robot helping me, I would prefer that it care about my decisions, not my "wants". This distinction between desires and decisions has been obvious for the last 2.5 thousand years, whereas "human values" is an obscure and unnatural idea.
You are using the word "want" differently than I was. I'm pretty sure I'm trying to point to exactly the same thing you are pointing to. And the fact that we're both trying to point to the same thing is exactly the evidence that the thing we're trying to point to is a natural abstraction. (The fact that the distinction between desires and decisions was obvious for the last 2.5 thousand years is also evidence that both of these things are natural abstractions.) This is a bad idea. You should really, really want the robot to care about something besides your decisions, because the decisions are not enough to determine your values.

So far, we’ve only talked about one AI ending up aligned, or a handful ending up aligned at one particular time. However, that isn’t really the ultimate goal of AI alignment research. What we really want is for AI to remain aligned in the long run, as we (and AIs themselves) continue to build new and more powerful systems and/or scale up existing systems over time.

I think this suggests an interesting path where alignment by default might be able to serve as a bridge to better alignment mechanisms, i.e. if it works and we can select for AIs that contains re...

I think of this as the Rohin trajectory, since he's the main person I've heard talk about it. I agree it's a natural approach to consider, though deceptiveness-type problems are a big potential issue.
Isn't remaining aligned an example of robust delegation? If so, there have been both discussions and technical work on this problem before.
Yup, exactly right, though this version is a fair bit more involved than the simplified delegation scenarios we've seen in most of the theoretical work.

Great post!

That might have been discussed in the comments, but my gut reaction to the tree example was not "It's not really understanding tree" but "It's understanding trees visually". That is, I think the examples point to trees being a natural abstraction with regard to images made of pixels. In that sense, dogs and cats and other distinct visual objects might fit your proposal of natural abstraction. Yet this doesn't entail that trees are a natural abstraction when given the position of atoms, or sounds (to be more abs...

My model of abstraction is that high-level abstractions summarize all the information from some chunk of the world which is relevant "far away". Part of that idea is that, as we "move away" from the information-source, most information is either quickly wiped out by noise, or faithfully transmitted far away. The information which is faithfully transmitted will usually be present across many different channels; that's the main reason it's not wiped out by noise in the first place. Obviously this is not something which necessarily applies to all possible systems, but intuitively it seems like it should apply to most systems most of the time: information which is not duplicated across multiple channels is easily wiped out by noise.
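A toy version of the "duplicated across many channels" intuition, as a rough sketch only: one bit is sent over several independent noisy channels, each of which flips it with some probability, and the receiver decodes by majority vote. The flip probability, channel counts, and function names are all illustrative assumptions, not anything from the abstraction model itself.

```python
import random

def majority_decode(bit, n_channels, flip_prob, rng):
    """Send `bit` over n independent noisy channels; decode by majority vote."""
    received = [bit ^ (rng.random() < flip_prob) for _ in range(n_channels)]
    return 1 if 2 * sum(received) > n_channels else 0

def accuracy(n_channels, flip_prob=0.3, trials=2000, seed=0):
    """Fraction of trials in which the decoded bit matches the bit sent."""
    rng = random.Random(seed)
    correct = 0
    for _ in range(trials):
        bit = rng.randrange(2)
        correct += majority_decode(bit, n_channels, flip_prob, rng) == bit
    return correct / trials

acc_single = accuracy(n_channels=1)    # information present in one channel only
acc_many = accuracy(n_channels=101)    # same information duplicated 101 times
print("single channel:", acc_single)
print("101 channels:  ", acc_many)
```

With a single channel the bit is frequently lost to noise, while the heavily duplicated bit is recovered almost every time. Real "far away" information propagation is of course not a majority vote; the point is only that redundancy is what lets a summary survive noise.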
when alignment-by-default works, we can use the system to design a successor without worrying about amplification of alignment errors

Anything neural net related starts with random noise and performs gradient descent style steps. This doesn't get you the global optimum, it gets you some point that is approximately a local optimum, which depends on the noise, the nature of the search space, and the choice of step size.
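To make the dependence on initialization concrete, here is a minimal sketch using a one-dimensional non-convex function chosen purely for illustration (it is not a neural network loss): plain gradient descent ends up in different local optima depending on where it starts.

```python
def f(x):
    # Illustrative non-convex objective with two local minima.
    return x**4 - 3 * x**2 + x

def grad_f(x):
    # Derivative of f.
    return 4 * x**3 - 6 * x + 1

def gradient_descent(x0, step_size=0.01, n_steps=2000):
    """Plain gradient descent from the starting point x0."""
    x = x0
    for _ in range(n_steps):
        x -= step_size * grad_f(x)
    return x

x_from_right = gradient_descent(2.0)    # settles in the shallower minimum
x_from_left = gradient_descent(-2.0)    # settles in the deeper minimum
print(x_from_right, f(x_from_right))
print(x_from_left, f(x_from_left))
```

The two runs use the same objective and step size and differ only in the starting point, yet they converge to different points with different objective values.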

If nothing else, the training data will contain sensor noise.

At best you are going to get something that roughly corresponds to human v...

That's not quite how natural abstractions work. There are lots of edge cases which are sort-of-trees-but-sort-of-not: logs, saplings/acorns, petrified trees, bushes, etc. Yet the abstract category itself is still precise. An analogy: consider a Gaussian cluster model. Any given cluster will have lots of edge cases, and lots of noise in the individual points. But the cluster itself - i.e. the mean and variance parameters of the cluster - can still be precisely defined. Same with the concept of "tree", and (I expect) with "human values". In general, we can have a precise high-level concept without a hard boundary in the low-level space.
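The Gaussian cluster analogy can be sketched numerically. The following is a toy example under assumed parameters (`TRUE_MEAN` and `TRUE_STD` are made up for illustration): individual points scatter widely, yet the cluster-level parameters are pinned down tightly once there are enough points.

```python
import random
import statistics

TRUE_MEAN, TRUE_STD = 3.0, 2.0  # hypothetical cluster parameters

def estimate_cluster(n_points, seed=0):
    """Sample points from the cluster and estimate its mean and std."""
    rng = random.Random(seed)
    points = [rng.gauss(TRUE_MEAN, TRUE_STD) for _ in range(n_points)]
    return statistics.fmean(points), statistics.pstdev(points), points

mean_small, std_small, _ = estimate_cluster(20)
mean_large, std_large, points = estimate_cluster(20000)

# Fraction of individual points lying more than one std from the mean:
# the low-level data stays noisy no matter how many points we collect.
frac_far = sum(abs(p - TRUE_MEAN) > TRUE_STD for p in points) / len(points)

print("20 points:    mean, std =", mean_small, std_small)
print("20000 points: mean, std =", mean_large, std_large)
print("fraction of points beyond one std:", frac_far)
```

Roughly a third of the points sit more than one standard deviation from the center, so there is no hard boundary in the low-level space, but the high-level description (mean and variance) is still sharply determined.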

Consider a source of data that is a mixture (weighted sum) of several Gaussian distributions. If you have a sufficiently large number of samples from this distribution, you can locate the original Gaussians to arbitrary accuracy. (Of course, if you have a finite number of samples, you will have some inaccuracy in predicting the location of the Gaussians, possibly a lot.)

However, not all distributions share this property. If you look at uniform distributions over rectangles in 2D space, you will find that a uniform L shape can be made from two rectangles in 2 different ways. More complicated shapes can be made in even more ways. The property that you can uniquely decompose a sum of Gaussians into its individual Gaussians is not a property that applies to every distribution.
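The L-shape claim can be checked directly. In this sketch (the specific coordinates and weights are my own illustrative choice), two different pairs of rectangles, weighted by area, give exactly the same uniform density over the same L shape, so the data alone cannot say which pair of "clusters" generated it.

```python
import math

def mixture_density(components, x, y):
    """Density at (x, y) of a mixture of uniform distributions over rectangles.

    components: list of (weight, (x0, x1, y0, y1)), rectangles half-open.
    """
    total = 0.0
    for weight, (x0, x1, y0, y1) in components:
        if x0 <= x < x1 and y0 <= y < y1:
            total += weight / ((x1 - x0) * (y1 - y0))
    return total

# An L shape of area 3, built two ways from two rectangles each.
decomposition_a = [(2 / 3, (0, 2, 0, 1)),   # wide bottom bar
                   (1 / 3, (0, 1, 1, 2))]   # small square on top-left
decomposition_b = [(2 / 3, (0, 1, 0, 2)),   # tall left bar
                   (1 / 3, (1, 2, 0, 1))]   # small square on bottom-right

# Compare densities on a grid of cell midpoints covering [0, 2] x [0, 2].
grid = [((i + 0.5) / 4, (j + 0.5) / 4) for i in range(8) for j in range(8)]
same_everywhere = all(
    math.isclose(mixture_density(decomposition_a, x, y),
                 mixture_density(decomposition_b, x, y))
    for x, y in grid
)
print("identical densities:", same_everywhere)
```

Both decompositions assign density 1/3 everywhere on the L and zero elsewhere, even though their components differ.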

I would expect that whether or not logs, saplings, petrified trees, sparkly plastic Christmas trees, etc. count as trees would depend on the details of the training data, as well as the network architecture and possibly the random seed.

Note: this is an empirical prediction about current neural networks. I am predicting that if someone takes 2 networks that have been trained on different datasets, ideally with different architectures, and tries to locate the neuron that holds the concept of "Tree" in each, and then shows both networks an edge case that is kind of like a tree, then the networks will often disagree significantly about how much of a tree it is.

This requires hitting a window - our data needs to be good enough that the system can tell it should use human values as a proxy, but bad enough that the system can’t figure out the specifics of the data-collection process enough to model it directly. This window may not even exist.


Are there any real-world examples of this? Not necessarily in a human-values setting.


This article claims that:

  • Unsupervised learning systems will likely learn many “natural abstractions” of concepts like “trees” or “human values”. Maybe they will even end up being simply a “feature direction”.
    • One reason to expect this is that to make good predictions, you only need to conserve information that’s useful at a distance. And this information can be imagined being a “natural abstraction”. 
  • If you then have an RL system or supervised learner who can use the unsupervised activations to solve a problem, then it can directly behave in suc
...