Former AI safety research engineer, now PhD student in philosophy of ML at Cambridge. I'm originally from New Zealand but have lived in the UK for 6 years, where I did my undergrad and master's degrees (in Computer Science, Philosophy, and Machine Learning). Blog:


A space of proposals for building safe advanced AI

Wouldn't it just be "train M* to win debates against itself as judged by H"? In the original formulation of debate, a human inspects the debate transcript without assistance.

Anyway, I agree that something like this is also a reasonable way to view debate. In this case, I was trying to emphasise the similarities between Debate and the other techniques: I claim that if we call the combination of the judge plus one debater Amp(M), then we can think of the debate as M* being trained to beat Amp(M) by Amp(M)'s own standards.

Maybe an easier way to visualise this is that, given some question, M* answers that question, and then Amp(M) tries to identify any flaws in the argument by interrogating M*, and rewards M* if no flaws can be found.
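To make that framing concrete, here's a minimal toy sketch of the reward structure described above: M* answers a question, and Amp(M) (the judge plus one debater) interrogates the answer, with M* rewarded only if no flaw is found. All function names and the flaw-finding logic are hypothetical stand-ins, not a real training setup.

```python
def m_star_answer(question):
    # Hypothetical model under training: proposes an answer to the question.
    return "answer to: " + question

def amp_m_find_flaw(question, answer):
    # Stand-in for Amp(M), the judge plus one debater: interrogates the
    # answer and returns a flaw if it can identify one, else None.
    if "answer to:" not in answer:
        return "answer does not address the question"
    return None

def debate_reward(question):
    # M* is rewarded iff Amp(M) can find no flaw in its answer,
    # i.e. M* is trained to beat Amp(M) by Amp(M)'s own standards.
    answer = m_star_answer(question)
    flaw = amp_m_find_flaw(question, answer)
    return 1.0 if flaw is None else 0.0
```

In a real debate setup the flaw-finding step would itself be a learned interrogation over many rounds; the point of the sketch is just that the reward signal comes from Amp(M)'s evaluation, not directly from H.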

Reply to Jebari and Lundborg on Artificial Superintelligence

But if there is a continuum between the two, then thinking about the end points in terms of goals is relevant in interpreting the degrees of productiveness of goals in the middle.

I don't see why this is the case: you can just think about the continuum from non-goal to goal instead, which should get you the same benefits.

Beware Experiments Without Evaluation

This seems true in some cases, but very false in others. In particular:

  • Humans tend to experiment too little, and so we should often encourage more experimentation.
  • Qualitative observations are often much richer than quantitative observations. So even if you don't know what you're going to measure, making changes can help you understand what you were previously missing.

Your observation seems most true when there's institutional inertia, such that calling something an experiment can be a useful pretext.

Clarifying inner alignment terminology

Hmm, I think this is still missing something.

  1. "What I mean by perfect training and infinite data here is for the model to always have optimal loss on all data points that it ever encounters" - I assume you instead mean all data points that it could ever encounter? Otherwise memorisation is a sufficient strategy, since it will only ever have encountered a finite number of data points.
  2. When you say "the optimal policy on the actual MDP that it experiences", is this just during training, or also during deployment? And if the latter, given that the world is non-stationary, in what sense are you referring to the "actual MDP"? (This is a hard question, and I'd be happy if you handwave it, as long as you do so explicitly. Although I do think that the world not being an MDP is an important and overlooked fact.)
Clarifying inner alignment terminology

+1, great post.

My only nitpick: it seems worth clarifying what you mean by "infinite data" - from which distribution? And the same for "off-distribution".

AGI safety from first principles: Goals and Agency

I do like the link you've drawn between this argument and Ben's one. I don't think I was inspired by it, though; rather, I was mostly looking for a better definition of agency, so that we could describe what it might look like to have highly agentic agents without large-scale goals.

"Inner Alignment Failures" Which Are Actually Outer Alignment Failures

I agree with John that we should define inner alignment only in contexts where mesa-optimisers exist, as you do in Risks from Learned Optimization (and as I do in AGI safety from first principles); that is, inner alignment = mesa-alignment. Under the "more general" definition you propose here, inner alignment includes any capabilities problem and any robustness problem, so it's not really about "alignment".

"Inner Alignment Failures" Which Are Actually Outer Alignment Failures

Okay, it really feels like we're talking past each other here, and I'm not sure what the root cause of that is. But I suspect that it is because you are thinking of outer alignment as choosing an objective function plus an environment, thereby bundling the robustness problem into the outer alignment problem, and I am thinking of it as just choosing an objective function. I could argue further why your definition isn't useful, but instead I'm just going to argue that it violates common usage.

Here's Paul: "You could select a policy which performed well on the distribution, but it may be the case that that policy is trying to do something [other than what you wanted]. There might be many different policies which have values that lead to reasonably good behavior on the training distribution, but then, in certain novel situations, do something different from what I want. That's what I mean by inner alignment."

Here's Evan: "[a mesa-optimiser could] get to the end of the maze on the training distribution, but it could be an objective that will do anything else, sort of off-distribution. That fundamental robustness problem of, when you train a model, and that model has an objective, how do you ensure that that objective is the one that you trained it for? That’s the inner alignment problem."

Here's Vlad: "Since the mesa-optimiser is selected based on performance on the base objective, we expect it (once trained) to have a good policy on the training distribution."

All of them are talking about mesa-optimisers which perform well on the training distribution being potentially inner-misaligned during deployment. None of them claim that the only way inner misalignment can be a problem is via hacking the training process. You don't engage with any previous definitions of inner alignment in your post, you just assert that the concept everyone else has been using is not "true" inner alignment, which seems pretty unhelpful.

(You do quote Evan but the thing he says is consistent with my outer alignment = objective function conception, and you don't explain why you're justified in interpreting it the other way).

"Inner Alignment Failures" Which Are Actually Outer Alignment Failures

Actually, maybe the main reason why this whole line of thinking seems strange to me is that the whole reason the inner alignment term was coined was to point at a specific type of generalisation failure that we're particularly worried about. So your defining generalisation problems as part of outer misalignment basically defeats the point of having the inner alignment concept.

"Inner Alignment Failures" Which Are Actually Outer Alignment Failures

So if there are safety problems because you deployed the agent, can that ever qualify as inner misalignment under your definition? Or will it always be the fault of the training procedure for not having prepared the agent for the distributional shift to actual deployment? Unlike masturbation, which you can do both in modern times and in the ancestral environment, almost all of the plausible AI threat scenarios involve the AI gaining access to capabilities that it didn't have access to during training.

By contrast, under the conventional view of inner/outer alignment, we can say: bad behaviour X is an outer alignment problem if it would have been rewarded highly by the training reward function (or a natural generalisation of it). And X is an inner alignment problem if the agent is a mesa-optimiser and deliberately chose X to further its goals. And X is just a robustness problem, not an alignment problem, if neither of these apply.
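The three-way classification above can be written out as a small sketch; the function name and boolean inputs are hypothetical labels for the three conditions, not anything from the original posts.

```python
def classify_failure(rewarded_by_training_objective,
                     agent_is_mesa_optimiser,
                     behaviour_was_deliberate):
    """Classify a bad behaviour X under the conventional inner/outer split.

    - Outer alignment problem: X would have been rewarded highly by the
      training reward function (or a natural generalisation of it).
    - Inner alignment problem: the agent is a mesa-optimiser and
      deliberately chose X to further its goals.
    - Otherwise: a robustness problem, not an alignment problem.
    """
    if rewarded_by_training_objective:
        return "outer alignment problem"
    if agent_is_mesa_optimiser and behaviour_was_deliberate:
        return "inner alignment problem"
    return "robustness problem"
```

The sketch makes the priority ordering explicit: if the training objective itself rewards X, the misspecified objective (outer alignment) is the diagnosis regardless of what the agent was "trying" to do.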
