Former AI safety research engineer, now PhD student in philosophy of ML at Cambridge. I'm originally from New Zealand but have lived in the UK for 6 years, where I did my undergrad and masters degrees (in Computer Science, Philosophy, and Machine Learning). Blog:


Shaping safer goals
AGI safety from first principles


The case for aligning narrowly superhuman models

Nice post. The one thing I'm confused about is:

Institutionally, we are very uncertain whether to prioritize this (and if we do where it should be housed and how our giving should be structured).

It seems to me that the type of research you're discussing here is already seen as a standard way to make progress on the full alignment problem - e.g. the Stiennon et al. paper you cited, plus earlier work on reward modeling by Christiano, Leike, and others. Can you explain why you're institutionally uncertain whether to prioritise it - is it because of the objections you outlined? But your responses to them seem persuasive to me - and more generally, the objections don't seem to address the fact that a bunch of people who are trying to solve long-term alignment problems actually ended up doing this research. So I'd be interested to hear elaborations and defences of those objections from people who find them compelling.

Book review: "A Thousand Brains" by Jeff Hawkins

Great post, and I'm glad to see the argument outlined in this way. One big disagreement, though:

the Judge box will house a relatively simple algorithm written by humans

I expect that, in this scenario, the Judge box would house a neural network which is still pretty complicated, but which has been trained primarily to recognise patterns, and therefore doesn't need "motivations" of its own.

This doesn't rebut all your arguments for risk, but it does reframe them somewhat. I'd be curious to hear about how likely you think my version of the judge is, and why.

Takeaways from one year of lockdown

Thanks for the post; I think this type of reflection is very valuable. The main takeaway from this line of thought for me is that we're in a community which selects for scrupulosity and caution as character traits, which then have a big impact on how we think about risks. This has various implications for thinking about AI, which I won't get into here.

The Pointers Problem: Human Values Are A Function Of Humans' Latent Variables

Thanks for the reply. To check that I understand your position, would you agree that solving outer alignment plus solving reward tampering would solve the pointers problem in the context of machine learning?

Broadly speaking, I think our disagreement here is closely related to one we've discussed before, about how much sense it makes to talk about outer alignment in isolation (and also about your definition of inner alignment), so I probably won't pursue this further.

The Pointers Problem: Human Values Are A Function Of Humans' Latent Variables

Above you say:

Now, the basic problem: our agent’s utility function is mostly a function of latent variables. ... Those latent variables:

  • May not correspond to any particular variables in the AI’s world-model and/or the physical world
  • May not be estimated by the agent at all (because lazy evaluation)
  • May not be determined by the agent’s observed data

… and of course the agent’s model might just not be very good, in terms of predictive power.

And you also discuss how:

Human "values" are defined within the context of humans' world-models, and don't necessarily make any sense at all outside of the model.

My two concerns are as follows. Firstly, that the problems mentioned in these quotes above are quite different from the problem of constructing a feedback signal which points to a concept which we know an AI already possesses. Suppose that you meet an alien and you have a long conversation about the human concept of happiness, until you reach a shared understanding of the concept. In other words, you both agree on what "the referents of these pointers" are, and what "the real-world things (if any) to which they’re pointing" are? But let's say that the alien still doesn't care at all about human happiness. Would you say that we have a "pointer problem" with respect to this alien? If so, it's a very different type of pointer problem than the one you have with respect to a child who believes in ghosts. I guess you could say that there are two different but related parts of the pointer problem? But in that case it seems valuable to distinguish more clearly between them.

My second concern is that requiring pointers to be sufficient to "to get the AI to do what we mean" means that they might differ wildly depending on the motivation system of that specific AI and the details of "what we mean". For example, imagine if alien A is already be willing to obey any commands you give, as long as it understands them; alien B can be induced to do so via operant conditioning; alien C would only acquire human values via neurosurgery; alien D would only do so after millennia of artificial selection. So in the context of alien A, a precise english phrase is a sufficient pointer; for alien B, a few labeled examples qualifies as a pointer; for alien C, identifying a specific cluster of neurons (and how it's related to surrounding neurons) serves as a pointer; for alien D, only a millennium of supervision is a sufficient pointer. And then these all might change when we're talking about pointing to a different concept. 

And so adding the requirement that a pointer can "get the AI to do what we mean" makes it seem to me like the thing we're talking about is more like a whole alignment scheme than just a "pointer".

The Pointers Problem: Human Values Are A Function Of Humans' Latent Variables

The question then is, what would it mean for such an AI to pursue our values?

Why isn't the answer just that the AI should:
1. Figure out what concepts we have;
2. Adjust those concepts in ways that we'd reflectively endorse;
3. Use those concepts?

The idea that almost none of the things we care about could be adjusted to fit into a more accurate worldview seems like a very strongly skeptical hypothesis. Tables (or happiness) don't need to be "real in a reductionist sense" for me to want more of them.

The Pointers Problem: Human Values Are A Function Of Humans' Latent Variables

I agree with all the things you said. But you defined the pointer problem as: "what functions of what variables (if any) in the environment and/or another world-model correspond to the latent variables in the agent’s world-model?" In other words, how do we find the corresponding variables? I've given you an argument that the variables in an AGI's world-model which correspond to the ones in your world-model can be found by expressing your concept in english sentences.

The problem of determining how to construct a feedback signal which refers to those variables, once we've found them, seems like a different problem. Perhaps I'd call it the "motivation problem": given a function of variables in an agent's world-model, how do you make that agent care about that function? This is a different problem in part because, when addressing it, we don't need to worry about stuff like ghosts.

Using this terminology, it seems like the alignment problem reduces to the pointer problem plus the motivation problem.

The Pointers Problem: Human Values Are A Function Of Humans' Latent Variables

I need some way to say what the values-relevant pieces of my world model are "pointing to" in the real world. I think this problem - the “pointers to values” problem, and the “pointers” problem more generally - is the primary conceptual barrier to alignment right now.

It seems likely that an AGI will understand very well what I mean when I use english words to describe things, and also what a more intelligent version of me with more coherent concepts would want those words to actually refer to. Why does this not imply that the pointers problem will be solved?

I agree that there's something like what you're describing which is important, but I don't think your description pins it down.

Distinguishing claims about training vs deployment

I think 'robust instrumentality' is basically correct for optimal actions, because there's no question of 'emergence': optimal actions just are.

If I were to put my objection another way: I usually interpret "robust" to mean something like "stable under perturbations". But the perturbation of "change the environment, and then see what the new optimal policy is" is a rather unnatural one to think about; most ML people would more naturally think about perturbing an agent's inputs, or its state, and seeing whether it still behaved instrumentally.

A more accurate description might be something like "ubiquitous instrumentality"? But this isn't a very aesthetically pleasing name.

Distinguishing claims about training vs deployment

Can you elaborate? 'Robust' seems natural for talking about robustness to perturbation in the initial AI design (different objective functions, to the extent that that matters) and robustness against choice of environment.

The first ambiguity I dislike here is that you could either be describing the emergence of instrumentality as robust, or the trait of instrumentality as robust. It seems like you're trying to do the former, but because "robust" modifies "instrumentality", the latter is a more natural interpretation.

For example, if I said "life on earth is very robust", the natural interpretation is: given that life exists on earth, it'll be hard to wipe it out. Whereas an emergence-focused interpretation (like yours) would be: life would probably have emerged given a wide range of initial conditions on earth. But I imagine that very few people would interpret my original statement in that way.

The second ambiguity I dislike: even if we interpret "robust instrumentality" as the claim that "the emergence of instrumentality is robust", this still doesn't get us what we want. Bostrom's claim is not just that instrumental reasoning usually emerges; it's that specific instrumental goals usually emerge. But "instrumentality" is more naturally interpreted as the general tendency to do instrumental reasoning.

On switching costs: Bostrom has been very widely read, so changing one of his core terms will be much harder than changing a niche working handle like "optimisation daemon", and would probably leave a whole bunch of people confused for quite a while. I do agree the original term is flawed though, and will keep an eye out for potential alternatives - I just don't think robust instrumentality is clear enough to serve that role.

Load More