Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

Nate and Eliezer (Lethality 21) claim that capabilities generalise further than alignment once capabilities start generalising far at all. However, they have not articulated particularly detailed arguments for why this is the case. In this post I collect the arguments for and against the position I have been able to find or generate, and develop them (with a few hours’ effort). I invite you to join me in better understanding this claim and its veracity by contributing your own arguments and improving mine.

Thanks to these people for their help with writing and/or contributing arguments: Vikrant Varma, Vika Krakovna, Mary Phuong, Rory Grieg, Tim Genewein, Rohin Shah.


For

1. Capabilities have much shorter description length than alignment.

There are simple “laws of intelligence” that underwrite highly general and competent cognitive abilities, but no such simple laws of corrigibility or laws of “doing what the principal means” – or at least, any specification of these latter things will have a higher description length than the laws of intelligence. As a result, most R&D pathways optimising for capabilities and alignment with anything like a simplicity prior (for example) will encounter good approximations of general intelligence earlier than good approximations of corrigibility or alignment.

2. Feedback on capabilities is more consistent and reliable than on alignment.

Reality hits back on cognitive strategies implementing capabilities – such as forming and maintaining accurate beliefs, or making good predictions – more consistently and reliably than any training process hits back on motivational systems orienting around incorrect optimisation targets. Therefore there’s stronger outer optimisation pressure towards good (robust) capabilities than alignment, so we see strong and general capabilities first.

3. There’s essentially only one way to get general capabilities and it has a free parameter for the optimisation target.

There are many paths but only one destination when it comes to designing (via optimisation) a system with strong capabilities. But what those capabilities end up being directed at is path- and prior-dependent in a way we currently do not understand nor have much control over.

4. Corrigibility is conceptually in tension with capability, so corrigibility will fail to generalise when capability generalises well.

Plans that actually work in difficult domains need to preempt or adapt to obstacles. Attempts to steer or correct the target of actually-working planning are a form of obstacle, so we would expect capable planning to resist correction, limiting the extent to which alignment can generalise when capability starts to generalise.

5. Empirical evidence: human intelligence generalised far without staying aligned with its optimisation target.

There is empirical/historical support for capabilities generalising further than alignment to the extent that the analogy of AI development to the evolution of intelligence holds up.

6. Empirical evidence: goal misgeneralisation happens. 

There is weak empirical support for capabilities generalising further than alignment in the fact that it is possible to create demos of goal misgeneralisation (e.g.,

7. The world is simple whereas the target is not.

There are relatively simple laws governing how the world works, for the purposes of predicting and controlling it, compared to the principles underlying what humans value or the processes by which we figure out what is good. (This is similar to For#1 but focused on knowledge instead of cognitive abilities.) (This is in direct opposition to Against#3.)

8. Much more effort will be poured into capabilities (and d(progress)/d(effort) for alignment is not so much higher than for capabilities to counteract this).

We’ll assume alignment is harder based on the other arguments. For why more effort will be put into capabilities, there are two economic arguments: (a) at lower capability levels there is more profitability in advancing capabilities than alignment specifically, and (b) data about reality in general is cheaper and more abundant than data about any particular alignment target (e.g., human-preference data).

This argument is similar to For#2 but focused more on the incentives faced by R&D organisations and efforts: paths to developing capabilities are more salient and attractive.

9. Alignment techniques will be shallow and won’t withstand the transition to strong capabilities.

There are two reasons: (a) we don’t have a principled understanding of alignment and (b) we won’t have a chance to refine our techniques in the strong capabilities regime.

If advances in a core of general reasoning cause performance on specific domains like bioengineering or psychology to look "jumpy", this will likely happen at the same time as a jump in the ability to understand and deceive the training process, and evade the shallow alignment techniques.


Against

1. Optimal capabilities are computationally intractable; tractable capabilities are more alignable.

For example, it may be that the structure of the cognition of tractable capabilities does not look like optimal planning - there’s no obvious factorisation into goals and capabilities. Convergent instrumental subgoals may not apply strongly to the intelligences we actually find.

2. Reality hits back on the models we train via loss functions based on reality-generated data. But alignment also hits back on models we train, because we also use loss functions (based on preference data). These seem to be symmetrically powerful forces.

In fact we care a lot about models that are deceptive or harmful in non-x-risky ways, and spend massive effort curating datasets that describe safe behaviour. As models get more powerful, we will be able to automate the process of generating better datasets, including through AI assistance. Eventually we will effectively be able to constrain the behaviour of superhuman systems with the sheer quantity and diversity of training data. 

3. Alignment only requires building a pointer, whereas capability requires lots of knowledge. Thus the overhead of alignment is small, and can ride increasing capabilities.

(Example of a similar structure, which gives some empirical evidence: millions of dollars to train GPT-3 but only thousands of dollars to finetune on summarisation.)

4. We may have schemes for directing capabilities at the problem of oversight, thus piggy-backing on capability generalisation.

E.g. debate and recursive reward modelling. Furthermore, overseers are asymmetrically advantaged (e.g. because of white-box access or the ability to test in simulation on hypotheticals).

5. Empirical evidence: some capabilities improvements have included corresponding improvements in alignment.

It has proved possible to build on capabilities to advance alignment, for example by fine-tuning language models on human instructions. Extrapolating from this, we might expect alignment to generalise alongside capabilities. For example, billions of tokens are required for decent language capabilities but then only thousands of human feedback points are required to point them at a task.

6. Capabilities might be possible without goal-directedness.

Humans are arguably not strongly goal-directed. We seem to care about lots of different things, and mostly don't end up with a desire to strongly optimise the world towards a simple objective.

Also, we can build tool AIs (such as a physics simulator or a chip designer) which are targeted at such a narrow domain that goal-directedness is not relevant since they aren't strategically located in our world. These AIs are valuable enough to produce economic bounties while coordinating against goal-directed AI development.

7. You don't actually get sharp capability jumps in relevant domains.

The AI industry will optimise hard on all economically relevant domains (like bioengineering, psychology, or AI research), which will eliminate capability overhangs and cause progress on these domains to look smooth. This means we get to test our alignment techniques on slightly weaker AIs before we have to rely on them for slightly stronger AIs. This will give us time to refine them from shallow alignment techniques into deep ones that generalise far enough.



I don't think any of these quite capture what I consider the "main argument" that capabilities generalize more than alignment (although 2 comes closest). That said, this is closely coupled to what I currently consider the most-difficult-to-explain concept in alignment, so it's not surprising that none of the arguments capture it. I'll take a shot here.

Motivating question: how can we build a feedback signal for alignment? What goes wrong with a feedback signal for alignment?

It's easy to come up with a crappy proxy feedback signal - just use human approval or something. And then it will obviously fail horribly under sufficient optimization pressure. But then (goes a standard response) maybe by the time our feedback signal breaks down under optimization pressure, we'll have figured out something better. We'll have noticed the ways in which the original broke down, and we'll fix those, and keep iterating.
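A toy sketch of this breakdown, under loud assumptions: `proxy_approval` and `true_value` are invented functions, chosen so that the proxy tracks the true objective at low optimization pressure and diverges from it at high pressure. The "search budget" stands in for optimization pressure. Nothing here models a real training setup.

```python
# Toy illustration only: both functions below are made-up assumptions.

def proxy_approval(x: float) -> float:
    """Hypothetical proxy signal: approval keeps rising with x."""
    return x


def true_value(x: float) -> float:
    """Hypothetical true objective: tracks the proxy for small x, then falls away."""
    return x - x * x / 10.0


def optimise_proxy(budget: float, steps: int = 1000) -> float:
    """Grid-search x in [0, budget] for the highest proxy score."""
    candidates = [budget * i / steps for i in range(steps + 1)]
    return max(candidates, key=proxy_approval)


def report(budgets):
    """Return (budget, proxy score, true value) for the proxy-optimal x at each budget."""
    rows = []
    for b in budgets:
        x = optimise_proxy(b)
        rows.append((b, proxy_approval(x), true_value(x)))
    return rows


if __name__ == "__main__":
    for budget, proxy, true in report([1.0, 5.0, 10.0, 20.0]):
        print(f"budget={budget:5.1f}  proxy={proxy:6.2f}  true={true:7.2f}")
```

Under these assumed functions, raising the budget from 5 to 20 keeps increasing the proxy score while the true value goes from positive to negative.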

That class of strategies is doomed for multiple reasons, but the one I want to highlight here is: how do we notice the ways in which the original feedback signal breaks? How do we notice the problems on which to iterate? Whatever it is that we're doing to notice problems, that's the "real" feedback signal, at the outermost optimization loop (i.e. the optimization loop of humans iteratively designing the system). And that outermost feedback signal is also a crappy proxy signal. Humans are not easily able to tell when or where problems occurred even in hindsight, in general. (And that's before we get anywhere near crazy shit like "Potemkin village world".)

Now, this isn't meant to be a proof of impossibility of alignment, or anything like that. Rather, the point is that alignment of strong optimizers simply cannot be done without grounding out in something fundamentally different from a feedback signal. There might be training against feedback signals somewhere in the architecture, but the core problems of alignment have to be solved via something more than just feedback.

the core problems of alignment have to be solved via something more than just feedback.

No. I strongly disagree, assuming you mean "feedback signals" to include "reward signals." The feedback signal is not the optimization target. The point of the feedback signal is not to be safely maximizable. The point of a feedback signal is to supply cognitive-updates to the network/agent. If the cognitive-updates grow human-aligned cognitive patterns which govern the AI's behavior, we have built an aligned agent.

For example, suppose that I penalize the agent whenever I catch it lying. Then credit assignment de-emphasizes certain cognitive patterns which produced those outputs, and—if there are exact gradients to alternative actions—emphasizes or fine-tunes new lines of computation which would have produced the alternative actions in that situation. Concretely, I ask the AI whether it hates dogs, and it says "yes", and then I ask it whether it admitted to hating dogs, and it says "no."

Perhaps the AI had initially lied due to its pretrained initialization predicting that a human would have lied in that context, but then that reasoning gets penalized by credit assignment when I catch the AI lying. The reinforcement tweaks the AI to be less likely to lie in similar situations. Perhaps it learns "If a human would lie, then be honest." Perhaps it learns some totally alien other thing. But importantly, the AI is not necessarily optimizing for high reward—the AI is being reconfigured by the reinforcement signals. 
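A minimal sketch of this picture, assuming a two-action softmax policy and a plain REINFORCE update (the comment specifies neither; the action names, starting logits, and learning rate are all hypothetical). The reward appears only as a multiplier on the parameter update. The policy itself contains no representation of reward and no step that searches for high reward.

```python
import math
import random

ACTIONS = ["lie", "honest"]


def softmax(logits):
    """Convert logits to action probabilities."""
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]


def reward(action: str) -> float:
    """Hypothetical overseer: penalise a caught lie, otherwise give nothing."""
    return -1.0 if action == "lie" else 0.0


def train(steps: int = 500, lr: float = 0.5, seed: int = 0):
    """Run REINFORCE and return the final action probabilities."""
    rng = random.Random(seed)
    logits = [2.0, 0.0]  # assumed "pretrained" prior that favours lying
    for _ in range(steps):
        probs = softmax(logits)
        a = rng.choices(range(len(ACTIONS)), weights=probs)[0]
        r = reward(ACTIONS[a])
        # REINFORCE: d/d logit_i of log pi(a) = 1[i == a] - pi(i).
        # The reward only scales this update; it is never an input to the policy.
        for i in range(len(ACTIONS)):
            grad = (1.0 if i == a else 0.0) - probs[i]
            logits[i] += lr * r * grad
    return softmax(logits)


if __name__ == "__main__":
    print("P(lie) before:", softmax([2.0, 0.0])[0])
    print("P(lie) after: ", train()[0])
```

In this toy setup the probability of lying only ever decreases, because the only nonzero updates come from penalised lies. Whether anything similar holds for large networks is exactly what is in dispute in this thread.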

I think the key question of alignment is: How do we provide reinforcement signals so as to reliably reinforce and grow certain kinds of cognition within an AI? Asking after feedback signals which don't "fail horribly under sufficient optimization pressure" misses this more interesting and relevant question.

Straw person: We haven't found any feedback producer whose outputs are safe to maximise. We strongly suspect there isn't one.

Ramana's gloss of TurnTrout: But AIs don't maximise their feedback. The feedback is just input to the algorithm that shapes the AI's cognition. This cognition may then go on to in effect "have a world model" and "pursue something" in the real world (as viewed through its world model). But its world model might not even contain the feedback producer, in which case it won't be pursuing high feedback. (Also, it might just do something else entirely.)

Less straw person: Yeah I get that. But what kind of cognition do you actually get after shaping it with a lot of feedback? (i.e., optimising/selecting the cognition based on its performance at feedback maximisation) If your optimiser worked, then you get something that pursues positive feedback. Spelling things out, what you get will have a world model that includes the feedback producer, and it will pursue real high feedback, as long as doing so is a possible mind configuration and the optimiser can find it, since that will in fact maximise the optimisation objective.

Possible TurnTrout response: We're obviously not going to be using "argmax" as the optimiser though.

Thanks for running a model of me :) 

If your optimiser worked, then you get something that pursues positive feedback.

Actual TurnTrout response: No

Addendum: I think that this reasoning fails on the single example we have of general intelligence (i.e. human beings). People probably do value "positive feedback" (in terms of reward prediction error or some tight correlate thereof), but people are not generally reward optimizers. 

I think perhaps a lot of work is being done by "if your optimiser worked". This might also be where there's a disanalogy between humans<->evolution and AIs<->SGD+PPO (or whatever RL algorithm you're using to optimise the policy). Maybe evolution is actually a very weak optimiser, that doesn't really "work", compared to SGD+RL.

I think that evolution is not the relevant optimizer for humans in this situation. Instead consider the within-lifetime learning that goes on in human brains. Humans are very probably reinforcement learning agents in a relevant sense; in some ways, humans are the best reinforcement learning agents we have ever seen. 

I think the way I'd fit that into my ontology is "the reward signal is not the relevant feedback signal (for purposes of this argument)". The relevant feedback signal is whatever some human looks at, at the end of the day, to notice when there's problems or to tell how well the AI is doing by the human's standards. It's how we (human designers/operators) notice the problems on which to iterate. It's whatever the designer is implicitly optimizing for, in the long run, by developing an AI via the particular process the designer is using.

If the human is just using the reward signal as a control interface for steering the AI's internals, then the reward signal is not the feedback signal to which this argument applies.

We discussed more in person. I ended up agreeing with (what I perceive to be) a substantially different claim than I read from your original comment. I agree that we can't just figure out alignment by black-boxing AI cognition and seeing whether the AI does good things or not, nor can we just set up feedback loops on that (e.g. train a succession of agents and tweak the process based on how aligned they seem) without some substantial theoretical underpinnings with which to interpret the evidence.

However, I still don't see how your original comment is a reasonable way to communicate this state of mind. For example, you wrote:

It's easy to come up with a crappy proxy feedback signal - just use human approval or something. And then it will obviously fail horribly under sufficient optimization pressure.

What does this mean, if not using human approval as a reward signal? Can you briefly step me through a fictional scenario where the described failure obtains?


It's easy to come up with a crappy proxy feedback signal - just use human approval or something. And then it will obviously fail horribly under sufficient optimization pressure.

Now I don't understand why this will obviously fail horribly, if your argument doesn't apply to reward signals. How does human approval fail horribly when used in RL training?

alignment of strong optimizers simply cannot be done without grounding out in something fundamentally different from a feedback signal.

I don't think this is obvious at all. Essentially, we have to make sure that humans give feedback that matches their preferences, and that the agent isn't changing the human's preferences to be more easily optimized.

We have the following tools at our disposal:

  1. Recursive reward modelling / Debate. By training agents to help with feedback, improvements in optimization power boost both the feedback and the process potentially fooling the feedback. It's possible that it's easier to fool humans than it is to help them not be fooled, but it's not obvious this is the case.
  2. Path-specific objectives. By training an explicit model of how humans will be influenced by agent behavior, we can design an agent that optimizes the hypothetical feedback that would have been given, had the agent's behavior not changed the human's preferences (under some assumptions).

This makes me mildly optimistic of using feedback even for relatively powerful optimization.

Minor rant about this in particular:

Essentially, we have to make sure that humans give feedback that matches their preferences...

Humans' stated preferences do not match their preferences-in-hindsight, neither of those matches humans' self-reported happiness/satisfaction in-the-moment, none of that matches humans' revealed preferences, and all of those are time-inconsistent. IIRC the first section of Kahneman's textbook Well-Being: The Foundations of Hedonic Psychology is devoted entirely to the problem of getting feedback from humans on what they actually like, and the tldr is "people have been working on this for decades and all our current proxies have known problems" (not to say they don't have unknown problems too, but they definitely have known problems). Once we get past the basic proxies, we pretty quickly run into fundamental conceptual issues about what we even mean by "human preferences".

The desiderata you mentioned:

  1. Make sure the feedback matches the preferences
  2. Make sure the agent isn't changing the preferences

It seems that RRM/Debate somewhat addresses both of these, and path-specific objectives is mainly aimed at addressing issue 2. I think (part of) John's point is that RRM/Debate don't address issue 1 very well, because we don't have very good or robust processes for judging the various ways we could construct or improve these schemes. Debate relies on a trustworthy/reliable judge at the end of the day, and we might not actually have that.

If the problem is "humans don't give good feedback", then we can't directly train agents to "help" with feedback; there's nothing besides human feedback to give a signal of what's "helping" in the first place. We can choose some proxy for what we think is helpful, but then that's another crappy proxy which will break down under optimization pressure.

It's not just about "fooling" humans, though that alone is a sufficient failure mode. Bear in mind that in order for "helping humans not be fooled" to be viable as a primary alignment strategy it must be the case that it's easier to help humans not be fooled than to fool them in approximately all cases, because otherwise a hostile optimizer will head straight for the cases where humans are fallible. And I claim it is very obvious, from looking at existing real-world races between those trying to deceive and those trying to expose the deception, that there will be plenty of cases where the expose-deception side does not have a winning strategy.

The agent changing "human preferences" is another sufficient failure mode. The strategy of "design an agent that optimizes the hypothetical feedback that would have been given" is indeed a conceptually-valid way to solve that problem, and is notably not a direct feedback signal in the RL sense. At that point, we're doing EU maximization, not reinforcement learning. We're optimizing for expected utility from a fixed model, we're not optimizing a feedback signal from the environment. Of course a bunch of the other problems of human feedback still carry over; "the hypothetical feedback a human would have given" is still a crappy proxy. But it's a step in the right direction.

Sure, humans are sometimes inconsistent, and we don't always know what we want (thanks for the references, that's useful!). But I suspect we're mainly inconsistent in borderline cases, which aren't catastrophic to get wrong. I'm pretty sure humans would reliably state that they don't want to be killed, or that they don't want lots of other people to die, etc. And that when they have a specific task in mind, they state that they want the task done rather than not. All this is subject to them actually understanding the main considerations for whatever plan or outcome is in question, but that is exactly what debate and RRM are for.

my objection to this objection is that for the most part, we don't have an option not to pick the best feedback signal we have available at any given time. from a systems perspective, systems alignment only generalizes strongly if it improves capability enough for the relevant system to survive in competition with other systems. this is true at many scales of systems, but it's always for the same reason: competition between systems means that the most adaptive approach wins. a common mistake is to assume "adaptive" means "growth/accumulation/capture rate", but what it really means is "durability per unit efficiency": the instrumental drive to capture resources as fast as possible is fundamentally a decision theory error made by local optimizers.

to consider a specific example of some systems with this decision theory error, a limitation when gene driving mosquitos, for example, is that if the genes you add don't make the modified mosquitos enough more adaptive, they'll just die out; you'd need to perform some sort of trade where you offer the modified mosquitos a modified non-animal food source that only the modified mosquitos can eat, and that somehow can't be separated from the gene drive; you need to offer them a genetic update rule that reliably produces cooperation between species. if you can offer this, then mosquitos which become modified will be durably more competitive, because they have access to food sources that would poison unmodified mosquitos, and they can incrementally no longer threaten humans, so humans would no longer be seeking a way to entirely destroy the species. but it only works if you can get the mosquitos to coordinate en masse, and any mutation that makes that mosquito a defector against mosquito-veganism needs to be stopped in its tracks. the mosquito swarm has to reproduce away the interspecies defection strategy and then not allow it to return, while simultaneously preserving the species.

similarly in most forms of ai safety, there are at least three major labs you need to convince: deepmind, openai, and <whatever is going on over in china>. there are also others that will replicate experiments and some that will perform high quality experiments with somewhat less compute funding. between all of them, you have to come up with a mechanism of alignment that improves capability and which also is convergent about the alignment: if your alignment system doesn't get better alignment-durability/watt as a result of capability improvement, you haven't actually aligned anything, just papered over a problem. to some degree you can hope that one of these labs gets there first; but because capability growth is incremental, it's looking less and less likely that there will be a single watershed moment where a lab pulls so far ahead that no competition can be mounted. and during that window, defense of friendly systems needs to become stronger than inter-system attack.

(by system, again, I mean any organism or meta-organism or neuron or cell or anything inbetween.)

one example goal of something we need an aligned planetary system of beings to do is take control of the ecosystem enough to solve global warming. but in order to do that without screwing everything up, we need a clear picture of what forms of interference with what parts of the universe are acceptable: some clear notion of multi-tenant ownership that allows interfacing the needs of multiple subsystems to determine what their requirements are for their adjacent systems.

I find it notable and interesting that anthropic's recent research about interpretability (SoLU paper) focuses on isolating individual neurons' concept ownership, so that the privileged basis isolates them from interfering with each other. I'm intentionally stretching how far I can generalize this, but I really think this direction of reasoning has something interesting to say about ownership of matter as well. local internal coherence of matter ownership is a core property of a human body that should not be violated; while it's hard to precisely identify whether it's been violated subtly, sudden death is an easy to identify example of a state transition where the local informational process of a human existing has suddenly ceased and the low-entropy complexity was lost. at the same time, anthropic's paper is related to previous work on compressibility; attempting to improve interpretability ultimately boils down to attempting to improve the representation quality until it reaches a coherent, distilled structure that can be understood, as discussed in that paper.

I'd argue that, inherently, improvements to interpretability focused on coherent binding to physical variables have a fundamental connection to the potential to improve the formalizeability of the functions a neural network represents. and that that kind of improvement has the potential to allow binding the optimality of your main loss function more accurately to the functions you intend to optimize in the first place.

So then my question becomes - what competitive rules do we want to apply to all scales (within bacteria, within a neural network, within an insect, within a mammal, within a species, within a planet, between planets), in order to get representations at every scale that coherently describe what dynamics are acceptable interference and what are not?

again, I'm pulling together tenuous threads that I can't quite tie properly, and some of the links might be invalid. I'm a software engineer first, research ideas generator second - and I might be seeing ghosts. but I suspect that somewhere in game theory/game dynamics, there's an insight about how to structure competition in constructed processes that allows describing how to teach the universe to remember everything anyone ever considered beautiful, or something along those lines.

If this thread is of interest, I'd like to discuss it with more people. I've got some links in other posts as well.

I'm interested in this line of reasoning. I can't really say much in response right now, but I just read that paper you linked - they write such clear and easily, heh, interpretable papers don't they? - and I have strong opinions about "the correct value system" being rooted in maximizing some weighted sum of the "autonomy" of all living / agentic / intelligent systems, which it seems like you're gesturing towards as well. I'm interested in trying to figure out how to formalize this.

Nice - thanks for this comment - how would the argument be summarised as a nice heading to go on this list? Maybe "Capabilities can be optimised using feedback but alignment cannot" (and feedback is cheap, and optimisation eventually produces generality)?

Maybe "Humans iteratively designing useful systems and fixing problems provide a robust feedback signal for capabilities, but not for alignment"?

(Also, I now realize that I left this out of the original comment because I assumed it was obvious, but to be explicit: basically any feedback signal on a reasonably-complex/difficult task will select for capabilities. That's just instrumental convergence.)

Reality hits back on the models we train via loss functions based on reality-generated data. But alignment also hits back on models we train, because we also use loss functions (based on preference data). These seem to be symmetrically powerful forces.

Alignment doesn't hit back, the loss function hits back and the loss function doesn't capture what you really want (e.g. because killing the humans and taking control of a reward button will max reward, deceiving human raters will increase ratings, etc). If what we wanted was exactly captured in a loss function, alignment would be easier. Not easy because outer optimization doesn't create good inner alignment, but easier than the present case.

the loss function doesn't capture what you really want

This seems like a type error to me. What does it mean for a reward function to "capture what I really want"? Can anyone give even a handwavy operationalization of such a scenario, so I can try to imagine something concrete?

Sure, one concrete example is the reward function in the tic-tac-toe environment (from X's perspective) that returns -1 when the game is over and O has won, returns +1 when the game is over and X has won, and returns 0 on every other turn (including a game over draw), presuming what I really want is for X to win in as few turns as possible.
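A minimal sketch of the reward function just described, assuming a board represented as a list of nine cells holding `"X"`, `"O"` or `" "` (the representation and the helper names are illustrative choices, not part of the comment):

```python
from typing import List, Optional

Board = List[str]  # 9 cells in row-major order, each "X", "O" or " "

# All eight winning lines: three rows, three columns, two diagonals.
LINES = [
    (0, 1, 2), (3, 4, 5), (6, 7, 8),
    (0, 3, 6), (1, 4, 7), (2, 5, 8),
    (0, 4, 8), (2, 4, 6),
]


def winner(board: Board) -> Optional[str]:
    """Return "X" or "O" if that player has three in a line, else None."""
    for a, b, c in LINES:
        if board[a] != " " and board[a] == board[b] == board[c]:
            return board[a]
    return None


def reward(board: Board) -> int:
    """Reward from X's perspective: +1 if X has won, -1 if O has won, 0 otherwise."""
    w = winner(board)
    if w == "X":
        return 1
    if w == "O":
        return -1
    return 0  # game still in progress, or game over with a draw


if __name__ == "__main__":
    print(reward(list("XXXOO    ")))  # X has completed the top row
```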

I can probably illustrate something outside of such a clean game context too, but I'm curious what your response to this one is first, and to make sure this example is as clear as it needs to be.

Yes, I can imagine that for a simple game like tic-tac-toe. I want an example which is not for a Platonic game, but for the real world. 

What about the real world is important here? The first thing you could try is tic-tac-toe in the real world (i.e., the same scenario as above but don't think of a Platonic game but a real world implementation). Does that still seem fine?

Another aspect of the real world is that we don't necessarily have compact specifications of what we want. Consider the (Platonic) function that assigns to every 96x96 grayscale (8 bits per pixel) image a label from {0, 1, ..., 9, X} and correctly labels unambiguous images of digits (with X for the non-digit or ambiguous images). This function I would claim "captures what I really want" from a digit-classifier (at least for some contexts of use, like where I am going to use it with a camera at that resolution in an OCR task), although I don't know how to implement it. A smaller dataset of images with labels in agreement with that function, and training losses derived from that dataset I would say inherit this property of "capturing what I really want", though imperfectly due to the possibilities of suboptimality and of generalisation failure. 
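To make the types concrete, here is a rough sketch of the objects involved. It assumes cross-entropy as the training loss and a nested-list image representation; the comment only says "training losses derived from that dataset", so both are illustrative choices, as are all the names.

```python
import math
from typing import Callable, List, Sequence, Tuple

Image = List[List[int]]              # 96 rows x 96 columns, each pixel in 0..255
LABELS = list("0123456789") + ["X"]  # "X" marks non-digit or ambiguous images

# A model maps an image to a probability for each entry of LABELS.
Classifier = Callable[[Image], Sequence[float]]
Dataset = Sequence[Tuple[Image, str]]


def cross_entropy(probs: Sequence[float], label: str) -> float:
    """Negative log-probability the model assigns to the dataset's label."""
    return -math.log(probs[LABELS.index(label)])


def dataset_loss(model: Classifier, dataset: Dataset) -> float:
    """Average cross-entropy of the model over a finite labelled dataset."""
    return sum(cross_entropy(model(img), lab) for img, lab in dataset) / len(dataset)


def uniform_model(image: Image) -> List[float]:
    """Placeholder model that ignores the image and spreads mass evenly."""
    return [1.0 / len(LABELS)] * len(LABELS)


if __name__ == "__main__":
    blank = [[0] * 96 for _ in range(96)]
    print(dataset_loss(uniform_model, [(blank, "X"), (blank, "3")]))
```

The Platonic labelling function has the type `Image -> label`, while the loss has the type `(model, dataset) -> float`. The dataset is the only place where the labelling function enters the loss.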

The first thing you could try is tic-tac-toe in the real world (i.e., the same scenario as above but don't think of a Platonic game but a real world implementation). Does that still seem fine?

Hm, no, not really.

This function I would claim "captures what I really want" from a digit-classifier (at least for some contexts of use, like where I am going to use it with a camera at that resolution in an OCR task)

I mean, there are several true mechanistic facts which get swept under the rug by phrases like "captures what I really want" (no fault to you, as I asked for an explanation of this phrase!):

  • This function provides exact gradients to desired network outputs, thus providing "exactly the gradients we want"
  • This function would not be safe to "optimize for", in that, for sufficiently expressive architectures and a fixed initial condition (e.g. the start of an ML experiment), not all interpolating models are safe,
    • Furthermore, a model which (by IMO unrealistic assumption) searched over plans to minimize the time-average-EV of the number stored in the loss register, would kill everyone and negative-wirehead,
  • For every input image, you can use this function as a classifier to achieve the human-desired behavior.

There are several claims which are not true about this function:

  • The function does not "represent" our desires/goals for good classification over 96x96 grayscale images, in the sense of having the same type signature as those desires,
  • Similarly, the function cannot be "aligned" or "unaligned" with our desires/goals, except insofar as it tends to provide cognitive updates which push agents towards their human-intended purposes (like classifying images).

I messaged you two docs which I've written on the subject recently.

Hm, no, not really.

OK let's start here then. If what I really want is an AI that plays tic-tac-toe (TTT) in the real world well, what exactly is wrong with saying the reward function I described above captures what I really want?


There are several claims which are not true about this function:

Neither of those claims seemed right to me. Can you say what the type signature of our desires (e.g., for good classification over grayscale images) is? [I presume the problem you're getting at isn't as simple as wanting desires to look like (image, digit-label, goodness) tuples as opposed to (image, correct digit-label) tuples.]

what exactly is wrong with saying the reward function I described above captures what I really want?

Well, first of all, that reward function is not outer aligned to TTT, by the following definition:

“My definition says that an objective function r is outer aligned if all models optimal under r in the limit of perfect optimization and unlimited data are aligned.” 

-- Evan Hubinger, commenting on "Inner Alignment Failures" Which Are Actually Outer Alignment Failures

There exist models which just wirehead, or set the reward to +1, or show themselves a win observation over and over. These are optimal under the reward function in the sense of that definition, and yet they do not play TTT in any real sense. Even restricted to training, a deceptive agent can play perfect TTT and then, in deployment, kill everyone. (So the TTT-alignment problem is unsolved! Uh oh! But that's not a problem in reality.)

So, since reward functions don't have the type of "goal", what does it mean to say the real-life reward function "captures" what you want re: TTT, besides the empirical fact that training current models on that reward signal+curriculum will make them play good TTT and nothing else? 

Can you say what the type signature of our desires (e.g., for good classification over grayscale images) is?

I don't know, but it's not that of the loss function! I think "what is the type signature?" isn't relevant to "the type signature is not that of the loss function", which is the point I was making. That said -- maybe some of my values more strongly bid for plans where the AI has certain kinds of classification behavior?

My main point is that this "reward/loss indicates what we want" framing just breaks down if you scrutinize it carefully. Reward/loss just gives cognitive updates. It doesn't have to indicate what we really want, and wishing for such a situation seems incoherent/wrong/misleading as to what we have to solve in alignment.

It's not clear to me why you think the concept of reward functions "breaks down" when applied to more complicated environments. I think maybe you mean to ask for something else.

Isn't that literally the alignment problem? Come up with a loss function that captures what we want an AI to do in the real world, and then it's easy enough to make an AI that does what we want it to do.

Not at all. That's part of what makes it hard. You still have to engineer an AI to optimize that loss function and not some intermediate target, using actual ML methods that have yet to be pioneered, even if you have such a literal utility function to measure out rewards with. If after your training loop you create some sort of mesa-optimizer that optimizes a not-quite-that-loss function, you lose.

Just wanted to say this is the single most useful thing I've read for improving my understanding of alignment difficulty. Thanks for taking the time to write it!

Thanks, that's great to hear :)

5. Empirical evidence: human intelligence generalised far without staying aligned with its optimisation target.

I think this one is debatable. It seems to me that human intelligence has remained reasonably well aligned with its optimization target, if its optimization target is defined as "being well-fed, getting social status, remaining healthy, having sex, raising children, etc.", i.e. the things that evolution actually could optimize humans for rather than something like inclusive fitness that it couldn't directly optimize for. Yes, there are individual humans who are less interested in pursuing particular items on that list (e.g. many prefer not to have children), but that's because the actual thing being optimized is a combination of those variables that's sensitive to local conditions. Any goal drift is then from a changed environment that acts as an input to the optimization target, rather than from an increase in capabilities as such.

The point isn't about goal misalignment but capability generalisation. It is surprising to some degree that, just by selection on reproductive fitness through its proxies (being well-fed, social status, etc.), humans have obtained the capability to go to the moon. It points toward a coherent notion, and the existence, of 'general intelligence' as opposed to specific capabilities.

I think what you say makes sense, but to be clear the argument does not consider those things as the optimisation target but rather considers fitness or reproductive capacity as the optimisation target. (A reasonable counterargument is that the analogy doesn't hold up because fitness-as-optimisation-target isn't a good way to characterise evolution as an optimiser.)

A reasonable counterargument is that the analogy doesn't hold up because fitness-as-optimisation-target isn't a good way to characterise evolution as an optimiser.

Yes, that was my argument in the comment that I linked. :)

Yeah, that's the main counterargument. Evolution is purposeless and doesn't care about any particular species or about nature itself, and evolution isn't teleological, so Argument 5 fails.

evolution isn't exactly purposeless; it has very little purpose, but to the degree a purpose could be described, the purpose for which things evolve is to survive in competition. that's more than nothing. the search process is mutation, and the selection process is <anything that survives>. inferring additional constraints that this purpose implies seems potentially fruitful; non-local optimizers like ourselves can look at that objective and design constraints that unilaterally increase durability. our ability to reason over game theory means we're not constrained to only evolutionary game theory; for example, we can make tit-for-tat-with-forgiveness more durable by noticing that it has a tendency to be replaced with cooperation when society is cooperative, and we can reintroduce tit-for-tat-with-forgiveness into contexts where cooperatebot-style reasoning has taken over.
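A small sketch of the game-theoretic point, assuming Python and an iterated prisoner's dilemma with conventional but arbitrarily chosen payoffs. Everything here is an illustrative assumption: the function names are hypothetical, and forgiveness is made deterministic (every third defection is let slide) so the outcome is reproducible, whereas the usual formulation forgives at random.

```python
from typing import Callable, List, Tuple

C, D = "C", "D"

# Payoffs as (row player, column player); the specific numbers are assumptions.
PAYOFF = {(C, C): (3, 3), (C, D): (0, 5), (D, C): (5, 0), (D, D): (1, 1)}

# A strategy maps (own history, opponent history) to the next move.
Strategy = Callable[[List[str], List[str]], str]


def cooperate_bot(mine: List[str], theirs: List[str]) -> str:
    return C


def defect_bot(mine: List[str], theirs: List[str]) -> str:
    return D


def tft_with_forgiveness(mine: List[str], theirs: List[str], forgive_every: int = 3) -> str:
    """Copy the opponent's last move, but let every `forgive_every`-th defection slide.

    Deterministic stand-in for probabilistic forgiveness (an assumption).
    """
    if not theirs or theirs[-1] == C:
        return C
    return C if theirs.count(D) % forgive_every == 0 else D


def play(a: Strategy, b: Strategy, rounds: int = 30) -> Tuple[int, int]:
    """Play `rounds` rounds and return the total scores of a and b."""
    hist_a: List[str] = []
    hist_b: List[str] = []
    score_a = score_b = 0
    for _ in range(rounds):
        move_a, move_b = a(hist_a, hist_b), b(hist_b, hist_a)
        pay_a, pay_b = PAYOFF[(move_a, move_b)]
        score_a += pay_a
        score_b += pay_b
        hist_a.append(move_a)
        hist_b.append(move_b)
    return score_a, score_b


among_cooperators = play(tft_with_forgiveness, cooperate_bot)
exploited = play(cooperate_bot, defect_bot)
defended = play(tft_with_forgiveness, defect_bot)
```

With these assumed payoffs and 30 rounds, `among_cooperators` is (90, 90), exactly what two cooperate-bots would score against each other, so nothing in a fully cooperative population selects for keeping the retaliation machinery. Against a defector the difference shows up: `exploited` is (0, 150) while `defended` is (20, 70).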

I'm not explaining it correctly, or maybe I'm misinterpreting what you're saying, or maybe I'm just wrong. The problem is that you can get a mesa-optimizer even when you're training on the real objective: the result of your ML process may itself be an optimizer that performs well in the test cases as far as you run or simulate it, but then, in the limit of resources/compute, pursues something different.
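A toy illustration of the evaluation gap described here, assuming Python. It is deliberately crude and all names are hypothetical: it does not model training or mesa-optimization at all, only the narrower fact that two systems can agree on every case you test and still diverge outside that range.

```python
from typing import Callable, Dict

# Finite set of test cases for an assumed intended behaviour: double the input.
TEST_CASES: Dict[int, int] = {x: 2 * x for x in range(10)}


def intended(x: int) -> int:
    return 2 * x


def lookalike(x: int) -> int:
    # Agrees with `intended` on every tested input, but does something else
    # once the input leaves the tested range.
    return 2 * x if x < 10 else 0


def passes_tests(model: Callable[[int], int]) -> bool:
    return all(model(x) == y for x, y in TEST_CASES.items())


both_pass = passes_tests(intended) and passes_tests(lookalike)
diverge_off_distribution = intended(100) != lookalike(100)
```

Both `both_pass` and `diverge_off_distribution` are true: no amount of checking against `TEST_CASES` distinguishes the two functions, even though the test cases were generated from the intended behaviour itself.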