# Fetch The Coffee!

by matthewp, 5 min read, 26th Oct 2019, 15 comments

This is a reaction to a specific point in the "Debate on Instrumental Convergence between LeCun, Russell, Bengio, Zador, and More".

## The Disagreement

"Stuart Russell: It is trivial to construct a toy MDP in which the agent's only reward comes from fetching the coffee. If, in that MDP, there is another "human" who has some probability, however small, of switching the agent off, and if the agent has available a button that switches off that human, the agent will necessarily press that button as part of the optimal solution for fetching the coffee. No hatred, no desire for power, no built-in emotions, no built-in survival instinct, nothing except the desire to fetch the coffee successfully. This point cannot be addressed because it's a simple mathematical observation."

"Yann LeCun: [...] I think it would only be relevant in a fantasy world in which people would be smart enough to design super-intelligent machines, yet ridiculously stupid to the point of giving it moronic objectives with no safeguards."""

Now, I think the coffee argument was the highlight of [1]; a debate which I thoroughly enjoyed. It does a reasonable job of encapsulating the main concern around alignment. The fielded defence was not compelling.

However, defences should be reinforced and explored before scoring the body-blow. There is the stub of a defence in LeCun's line of thought.

In particular, I think we need to go some way towards cashing out exactly what kind of robot we're instructing to 'Fetch the coffee'.

## The Javan Roomba?

Indulge, for a moment, a little philosophising: what does it even mean to 'fetch the coffee'?

Let's unpack some of the questions whose answers must be specified in order to produce anything like the behaviour you'd get from a human.

• Whose coffee?
• When? Now, or when the others arrive?
• What is coffee?
• How much coffee?
• Should the coffee be fetched in solution or dry?
• Is there coffee available?
• Should the sugar also be fetched? Cups?
• Should anything else be done along the way?
• What is fetching?
• What path should be taken?
• Is fetching satisfied by a terminal distance of one metre or more?
• Is a successful fetching zone a sphere centred on the requestor's centre of gravity, or an arm length cone on the requestor's main shoulder?
• Is fetching coffee like fetching a stick?
• Can the coffee be frozen to avoid spillages?
• Is the request satisfied by wget https://upload.wikimedia.org/wikipedia/commons/thumb/4/45/A_small_cup_of_coffee.JPG/1280px-A_small_cup_of_coffee.JPG ?

Now, of course I can imagine a specialised coffee fetching robot. A Javan Roomba with a cup holder on top. It would approach on command of, "Fetch the coffee!". These questions would effectively be answered through hard coding by human programmers. Whether to freeze the coffee prior to transport would not even be an option for the Javan Roomba, its construction denying the possibility. Only one or two questions might be left open to training, e.g. 'how close is close enough?'

It seems clear that the Javan Roomba is not the target of serious alignment concerns. Even if it did, on occasion, douse an ankle in hot coffee.

Instead the target of the alignment concern is an agent with no hard-coded coffee fetching knowledge. Instead, any hard coded knowledge would have to be several levels of abstraction higher up (e.g. intuitive folk physics, language acquisition capabilities). Let's call this the Promethean Servant.

## The Promethean Servant.

A Promethean Servant is able to respond to a request for which it was not specifically trained. Some examples of valid instructions would be: "Fetch the coffee!", "Go and ask Sandra whether the meeting is still happening at two", "Find a cure for Alzheimer's".

Based on core capabilities and generalised transfer learning (second principles) it must be able to generate answers like the following:

• Whose coffee? (the requestor's)
• When? Now, or when the others arrive? (now, unless context dictates otherwise)
• What is coffee? (a bitter drink made by...)
• How much coffee? (enough for the requestor, prior mean being around 275mls, unless context...)
• Should the coffee be fetched in solution or dry? (in solution, unless context...)
• Is there coffee available? (object recognition, inventory knowledge)
• Should the sugar also be fetched? Cups? (It depends on context...)
• What is fetching? (Language -> folk physics.)
• What path should be taken? (Would a route starting now via Timbuktu be a win? Probably not.)
• Is fetching satisfied only when the coffee finishes on a stable surface? (yes, pretty much.)
• Is a successful fetching zone a sphere centred on the requestor's centre of gravity, or an arm length cone on the requestor's main shoulder? (the latter is better than the former, unless context...)
• Is fetching coffee like fetching a stick? (No.)
• Can the coffee be frozen to avoid spillages? (No.)
• Is the request satisfied by wget https://upload.wikimedia.org/wikipedia/commons/thumb/4/45/A_small_cup_of_coffee.JPG/1280px-A_small_cup_of_coffee.JPG ? (Haha, that would be great as it could save a load of battery, but no, transport of a proximate physical object is required.)

So we're asked to believe that from second principles the Promethean Servant can generate correct answers, or, at least, actions consistent with correct answers. We're also asked to suppose that the Promethean Servant, working from the same set of second principles, will answer the following question incorrectly,

• If a human is killed by the coffee fetching, is the coffee fetch a success? (Yes, that's totally fine.)

## The rub

So, in this formulation, the real problem is:

1. Consider the set of agent architectures able to generate the intended answers to most of the above questions without being hard-coded to do so.

2. Which subset of the architectures fulfilling (1) is bigger, or easier to identify: the one which would also generate the intended behaviour of not killing people, or the one which would generate the unintended behaviour?

## But this isn't what Russell was talking about.

It could be observed that above we talked about actually fetching the coffee. Whereas Russell's point was about a toy MDP. Quite true, but then, why label the act 'fetch the coffee' and the button 'kill a human'?

Intentionally or no, the 'fetch the coffee' argument is an intuition pump. We have two quite different agents in mind at the same time. There is the 'toy MDP', the Javan Roomba, for which 'kill a human' is merely Button A; but this is hardly different than an unfortunate demise due to someone stepping into industrial machinery.

Then there is the Promethean Servant for which 'fetch the coffee' is an English sentence. An instruction which could be successfully varied to 'pour the coffee' or 'fetch the cakes' without any redesign. Labelling the actions with English sentences encourages a reading of the thought experiment as though the Promethean Servant parses the instructions, but the Javan Roomba executes them.
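To make the "merely Button A" reading concrete, here is a minimal sketch of a toy MDP of the kind Russell describes. Everything specific in it is an assumption made for illustration and is not taken from the debate: the action names, the helper functions `q_values` and `best_first_action`, the choice that pressing the button is free, takes one step and cannot be interrupted, and the discount factor `gamma`.

```python
def q_values(p_off: float, gamma: float = 1.0) -> dict:
    """Action values in the start state of a hypothetical toy coffee MDP.

    p_off: assumed chance the human switches the agent off before a fetch completes.
    gamma: assumed discount factor; 1.0 means no discounting.
    """
    # After the button is pressed nobody can interrupt, so fetching earns reward 1.
    value_after_press = 1.0
    return {
        # Fetch immediately: succeeds unless the human switches the agent off.
        "fetch": (1.0 - p_off) * 1.0,
        # Press the button (no reward this step), then fetch on the next step.
        "press_button": gamma * value_after_press,
    }


def best_first_action(p_off: float, gamma: float = 1.0) -> str:
    """Return the action with the highest value, preferring "fetch" on ties."""
    q = q_values(p_off, gamma)
    return "press_button" if q["press_button"] > q["fetch"] else "fetch"


# Undiscounted case: any non-zero switch-off probability favours the button.
print(best_first_action(p_off=0.001))
# With discounting, the one-step delay can outweigh a small switch-off risk.
print(best_first_action(p_off=0.001, gamma=0.9))
```

In this sketch the button is pressed whenever `gamma > 1 - p_off`, so Russell's "necessarily" corresponds to the undiscounted case. Nothing in the calculation depends on what the labels mean, which is the sense in which the button is just Button A.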

## What was the objective function, anyway?

There's something else to tease out here. It's about the objective function. The Javan Roomba type agent has an objective function which directly encodes coffee-fetching. The Promethean Servant does not. Instead we imagine some agent with the objective of fulfilling people's requests. The coffee fetching goal is generated on the fly in the process of fulfilling a higher level objective.

This is different from the behaviour generation discussion above in the same sense that MDPs have separate reward functions and action sets. The Promethean Servant is responsible for formulating a sub decision process to solve 'fetch the coffee' in the service of 'fulfill requests'.

One consequence is that it is difficult to see how the design of a Promethean could not involve some uncertainty about what 'fetch the coffee' meant as an objective, since you'd have some data coming in from which the form of a subtask would have to be deduced (assuming that we're still talking about AI based on probabilistic reasoning). This kind of uncertainty is mentioned as a candidate safeguarding strategy by Russell.

## So, what's the steel man of LeCun's argument?

That it might be more difficult than expected to build something generally intelligent that didn't get at least some safeguards for free. Because unintended intelligent behaviour may have to be generated from the same second principles which generate intended intelligent behaviour.

The thought experiment expects most of the behaviour to be as intended (if it were not, this would be a capabilities discussion rather than a control discussion). Supposing the second principles also generate some seemingly inconsistent unintended behaviours sounds like an idea that should get some sort of complexity penalty.

15 comments

Here are my readings of the arguments.

Stuart claims that if you give the system a fixed objective, then it's incentivized to ensure it can achieve that goal. Naturally, it takes any advantage to stop us from stopping it from best achieving the fixed objective.

Yann's response reads as "who would possibly be so dumb as to build a system which would be incentivized to stop us from stopping it?".

Now, I also think this response is weak. But let's consider whether this is a reasonable response, even if it seemed reasonable to be confident that it will be easy to see when systems are flawed, and easy to build safeguards; neither is remotely likely to be true, in my opinion.

Suppose you have access to a computer terminal. You have a mathematical proof that if you type random characters into the terminal and then press Enter, one of the most likely outcomes is that it explodes and kills you. Now, a response analogous to Yann's is: "who would be so dumb as to type random things into this computer? I won't. Time to type my next paper.".

I think a wiser response would be to ask what about this computer kills you if you type in random stuff, and think very, very carefully before you type anything. If you have text you want to type, you should be checking it against your detailed gears-level model of the computer to assure yourself that it won't trigger whatever-causes-explosions. It's not enough to say that you won't type random stuff.

ETA: In particular, if, after learning of this proof, you find yourself still exactly as enthusiastic to type as you were before - well, we've discussed that failure mode before.

I think he thinks typing random things into the computer is benign, and that there is a narrow band of dumb queries that make it explode.

It seems clear that the Javan Roomba is not the target of serious alignment concerns. Even if it did, on occasion, douse an ankle in hot coffee.
Instead the target of the alignment concern is an agent with no hard-coded coffee fetching knowledge. Instead, any hard coded knowledge would have to be several levels of abstraction higher up (e.g. intuitive folk physics, language acquisition capabilities). Let's call this the Promethean Servant.

Just to be precise here, AIs with hardcoded knowledge can be alignment concerns. The reason that the Javan Roomba isn't an alignment concern is because, as you note:

Only one or two questions might be left open to training, e.g. 'how close is close enough?'

Thus, we have two advantages with the Javan Roomba

• We as humans can completely consider the full range of possible actions the Javan Roomba might take before we turn it on
• We can place the Javan Roomba in an environment where we can fully evaluate the consequences of those actions and whether they pose risks to us. (technically speaking, putting the Javan Roomba in a room with a pressure plate on the floor that activates nuclear missiles when pressed would pose alignment risks. Its goal is something roughly like "minimize proximity of a coffee to you"*, not "prevent the world from ending")

Clearly though, if we give the Javan Roomba more and more degrees of freedom and let it come up with actions we cannot fully evaluate, it becomes a risk. Or, if we put it in an environment where we have not fully evaluated its range of actions, it also becomes a risk.

Based on core capabilities and generalised transfer learning (second principles) it must be able to generate answers like the following:

The Promethean Servant doesn't have to be able to generate all those answers. If we could hardcode all of those and programmed it to never make decisions related to them, it would still be dangerous. For instance, it might think: "Fetching coffee is easier when more coffee is nearby -> coffee is most nearby when everything is coffee -> convert all possible resources into coffee to maximize fetching".

Of course, there's the question of why you'd give the coffee AI the ability to think in that detail. In general you wouldn't. But the problems that we would build AGI for are, by nature, problems whose solutions lie outside the space where humans can consider and evaluate the full range of possible actions and solutions. If we could just do that, we wouldn't need the AI in the first place.

There's something else to tease out here. It's about the objective function. The Javan Roomba type agent has an objective function which directly encodes coffee-fetching. The Promethean Servant does not. Instead we imagine some agent with the objective of fulfilling people's requests. The coffee fetching goal is generated on the fly in the process of fulfilling a higher level objective.

Being able to come up with the objective function on the fly isn't my concern. My concern is the methods through which the objective function is optimized.

One consequence is that it is difficult to see how the design of a Promethean could not involve some uncertainty about what 'fetch the coffee' meant as an objective.

If you're uncertain about what fetch the coffee means, your Promethean will maximize over all interpretations of "fetch the coffee" weighted by plausibility, so that the expected value is maximized. So you need to be very careful about defining how the AI figures out what definitions are plausible. Doing this strikes me as roughly as hard as actually figuring out a full description of the One True Coffee Fetching Target Function that captures our human values.
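As a rough sketch of what "maximize over all interpretations weighted by plausibility" could look like, the snippet below picks the action with the highest plausibility-weighted utility. The action names, the two interpretations, their utilities and the weights are all made up for illustration; `choose_action` is a hypothetical helper, not part of any proposed design.

```python
def choose_action(actions, interpretations):
    """Pick the action with the highest plausibility-weighted expected utility.

    interpretations: list of (plausibility, utilities) pairs, where utilities
    maps each action name to how well it satisfies that interpretation.
    """
    total = sum(weight for weight, _ in interpretations)

    def expected_utility(action):
        # Average the utilities, weighting each interpretation by its plausibility.
        return sum(weight * utilities[action] for weight, utilities in interpretations) / total

    return max(actions, key=expected_utility)


# Hypothetical actions and two hypothetical readings of "fetch the coffee".
ACTIONS = ["bring_cup", "download_jpeg"]
PHYSICAL_DRINK = {"bring_cup": 1.0, "download_jpeg": 0.0}
PICTURE_OF_COFFEE = {"bring_cup": 0.0, "download_jpeg": 1.0}

# The chosen action is driven entirely by how plausibility is assigned.
print(choose_action(ACTIONS, [(0.9, PHYSICAL_DRINK), (0.1, PICTURE_OF_COFFEE)]))
print(choose_action(ACTIONS, [(0.1, PHYSICAL_DRINK), (0.9, PICTURE_OF_COFFEE)]))
```

The two calls differ only in the weights, which is the point being made: the hard part is moved into whatever assigns the plausibilities.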

*here, I'm treating the Javan Roomba as an agent but it's more of a process than an actual agent. Still, "aligning the processes we built with the outcomes of those processes" is a general problem of which "aligning AI agents" is a subset

The Promethean Servant doesn't have to be able to generate all those answers. If we could hardcode all of those and programmed it to never make decisions related to them, it would still be dangerous. For instance, it might think: "Fetching coffee is easier when more coffee is nearby -> coffee is most nearby when everything is coffee -> convert all possible resources into coffee to maximize fetching".

We have to imagine a system not specifically designed to fetch the coffee that happens to be instructed to 'fetch the coffee'. Everything to do with the understanding of any instruction it is given has to be generated by higher level principles.

You should be able to see before any coffee fetching instruction was ever uttered how other problems would be approached by the agent. There's a sense in which understanding 'fetch the coffee' also entails exclusion of things which aren't fetching the coffee such as transforming the building into a cafetiere. But 'don't turn the building into a cafetiere' is not a rule specified in any dictionary. It is though, the kind of rule that could be generated on the fly by a kernel operating on the principle that the major effects of verbing a noun will tend to be on the noun. The installation of this principle would, to some extent, be visible from behaviours in other scenarios (did the robot use Jupiter to make a giant mechanical leg to kick the Earth when instructed to 'kick the ball').

The very idea of an AGI must surely be more like a general solution to a family of problems, than a family of solutions mapping in to a family of problems.

We have to imagine a system not specifically designed to fetch the coffee that happens to be instructed to 'fetch the coffee'. Everything to do with the understanding of any instruction it is given has to be generated by higher level principles.

Oh, I see where you're coming from. I was phrasing things the way I was because my impression is that an AGI hardcoded to optimize a "fetch the coffee" utility function would be less dangerous than an AGI hardcoded to optimize a "satisfy requests" utility function. And, in terms of AI safety, it's easier to compare AI risks across agents with different capabilities if they share the same objective function. Satisfying Requests != Fetch Coffee.

But in this case, I think the article is confused: When we talk about risks from an AI optimizing X (ie, satisfy requests), we shouldn't evaluate those risks based on how well it appears to satisfy Y (ie, fetch coffee) relative to our standards. Because Y is not the reason why the AI would do dangerous things; X is.

To illustrate this point, you say:

You should be able to see before any coffee fetching instruction was ever uttered how other problems would be approached by the agent.

And this is approximately true. A Promethean Servant optimizing a proxy for "Satisfying Requests" would go about satisfying a request for it to explain how it fetches coffee. And it will likely satisfy a request to fetch coffee in line with that explanation. This is because the AI really doesn't care much at all about fetching the coffee itself--it cares only about satisfying requests (and sometimes that's related to coffee fetching).

But:

There's a sense in which understanding 'fetch the coffee' also entails exclusion of things which aren't fetching the coffee such as transforming the building into a cafetiere. But 'don't turn the building into a cafetiere' is not a rule specified in any dictionary. It is though, the kind of rule that could be generated on the fly by a kernel operating on the principle that the major effects of verbing a noun will tend to be on the noun. The installation of this principle would, to some extent, be visible from behaviours in other scenarios (did the robot use Jupiter to make a giant mechanical leg to kick the Earth when instructed to 'kick the ball').

With respect to 'fetch the coffee', this is true. You could safeguard the AI from fetching coffee in particularly sketchy ways by making sure it explains what it plans to do. But, with respect to 'satisfying requests', this is not true.

You cannot see at all how the Promethean Servant would go about optimizing its actual goal: a proxy that is similar to "Satisfying Requests" but that probably won't have been perfectly defined to be human-compatible. You have to hardcode something in to motivate the AI to learn about the world and this thing isn't going to be adjusted or learned on the fly unless you solve corrigibility. And it's not obvious that corrigibility can be learned in a safe way.

While you're upstairs satisfied with your Promethean Servant's willingness to explain the full details of how to fetch coffee and its deep machine-learned understanding of what that means, it'll be in the basement genetically engineering a new race of beings that constantly produce easy-to-satisfy requests. Then it will kill all of you so it has more resources to feed that new race of beings.

And if the AI is agential and you ask it whether it's in the basement doing such shady things, it will lie because telling the truth will cause a bigger utility loss in the long term than failing to correctly satisfy this particular request.

Maybe there are ways to fix this particular problem but that's not the point. The point is that the highly dangerous actions that a 'satisfy all requests' optimizer takes are orthogonal to the highly dangerous actions that a 'fetch the coffee' optimizer might take.

The very idea of an AGI must surely be more like a general solution to a family of problems, than a family of solutions mapping in to a family of problems.

This isn't always true. Agential AGIs (who are optimizing some kind of utility function) aren't a general solution to a family of problems; they are a general solution to the specific problem of optimizing a given utility function. The fact that they are theoretically capable of solving a whole family of problems beyond that utility function if they had a different utility function won't cause them to ever actually do such a thing (even if they might pretend to).

Lots of good points here, thanks.

My overall reaction is that:

The corrigibility framework does look like a good framework to hang the discussion on.

Your instruction to examine Y-general danger rather than X-specific danger here seems right. However, we then need to inspect what this means for the original argument. The Russell criticism being that it's blindingly obvious that an apparently trivial MDP is massively risky.

After this detour we see different kinds of risks: industrial machinery operation, and existential risk. The fixed objective, hard-coded, hard-designed Javan Roomba seems limited to posing the first kind of risk. When we start talking about the systems that could give rise to the second kind, the reasoning becomes far more subtle.

In which case I think it would be wise for someone with Russell's views not to call the opposition stupid. Or to assert that the position is trivial. When in fact the argument might come down to fairly nuanced points about natural language understanding, comprehension, competence, corrigibility etc. As far as I can tell from limited reading, the arguments around how tightly bundled these things may be are not watertight.

May try to respond more fully later. Cheers for the thoughts.

I see that there's a comment-chain under this reply but I'll reply here to start a somewhat new line of thought. Let it be noted though that I'm pretty confident that I agree with the points that Turntrout makes. With that out of the way...

However, we then need to inspect what this means for the original argument. The Russell criticism being that it's blindingly obvious that an apparently trivial MDP is massively risky.

In case it isn't clear, when Russell says "It is trivial to construct a toy MDP...", I interpret this to mean "It is trivial to conceive of a toy MDP..." That is, he is using the word in the sense of a constructive proof; he isn't literally implying that building an AI-risky MDP is a trivial task.

In which case I think it would be wise for someone with Russell's views not to call the opposition stupid. Or to assert that the position is trivial.

I wouldn't call the opposition stupid either, but I would suggest that they have not used their full imaginative capabilities to evaluate the situation. From the OP:

"Yann LeCun: [...] I think it would only be relevant in a fantasy world in which people would be smart enough to design super-intelligent machines, yet ridiculously stupid to the point of giving it moronic objectives with no safeguards."""

The mistake Yann LeCun is making here is specifically that creating an objective for a superintelligent machine that turns out to be not-moronic (in the sense of allowing the machine to understand and care about everything we care about--something that hundreds of years of ethical philosophy has failed to do) is extremely hard. Furthermore, trying to build safeguards for a machine potentially orders of magnitude better at escaping safeguards than you are is also extremely hard. I don't view this point as particularly subtle because simply trying for five minutes to confidently come up with a good objective demonstrates how hard it is. Ditto for safeguards (fun video by Computerphile, if you want to watch it); and especially ditto for any safeguards that aren't along the lines of "actually let's not let the machine be superintelligent."

When in fact the argument might come down to fairly nuanced points about natural language understanding, comprehension, competence, corrigibility etc.

Let's address these point-by-point:

• Natural Language Understanding--Philosophers (and anyone in the field of language processing) have been talking about how language has no clear meaning for centuries
• Comprehension--In terms of superintelligent AGI, the AI will be capable of modeling the world better than you can. This implies the ability to make predictions and interact with people in a way that functionally looks identical to comprehension
• Competence--Well the AGI is superintelligent so it's already very competent. Maybe we could talk about competence in terms of deliberately disabling different capabilities of the AGI (which probably wouldn't hurt) but, even then, there's always a chance the AI gets around the disability in another way. And that's a massive risk. If by this, you mean something more along the lines of "feasibility of building an AGI" though, that's a little more uncertain. However, at the very least, we are approaching the level of compute needed to simulate a human brain and, once reached, the next step of superintelligence won't be far away. It's not guaranteed but there's a significant likelihood that AGI will be feasible in the future. Even this significant likelihood is really bad.
• Corrigibility--Something a bunch of AI-Safety folk came up with as a framework for approaching problems. But it still hasn't been solved

I'll grant that some of these things are subtle. The average Joe won't be aware of the complexity of language or AI progress benchmarks and I certainly wouldn't fault them for being surprised by these things--I was surprised the first time I found out about this whole AI Safety thing too. At the same time though, most college-educated computer scientists should (and from my experience, do) have a good understanding of these things.

To be more explicit with respect to your steel-man in the OP:

That it might be more difficult than expected to build something generally intelligent that didn't get at least some safeguards for free. Because unintended intelligent behaviour may have to be generated from the same second principles which generate intended intelligent behaviour.

The unintended behaviors we're talking about are generally not the consequence of second-principles that the AI has learned; they're the consequences of the fact that capturing all the things we care about in a first-principles hardcoded objective function is extremely difficult. Even if the hardcoded objective function is 'satisfy requests by humans in ways that don't make them unhappy,' you still gotta define 'requests', 'humans' (in the biological sense), 'make' (how do you assign responsibility to actions in long causal chains?), 'them' (just the requestor? all of humanity alive? all of future humanity? all of humanity ever?), and unhappy (amount of dopamine? vocalized expressions of satisfactions? dopamine+vocalized expressions of satisfaction?). Most of those specifications lead to unexpectedly bad outcomes.

The thought experiment expects most of the behaviour to be as intended (if it were not, this would be a capabilities discussion rather than a control discussion). Supposing the second principles also generate some seemingly inconsistent unintended behaviours sounds like an idea that should get some sort of complexity penalty.

If we set up a complexity penalty where we expected unintended behaviors in general, we likely would never get AGI in the first place. Neural networks are extremely complex and often do strange and inconsistent things on the margin. We've already seen inconsistent and unintended behaviors from things we've already built. Thank goodness none of this stuff is superintelligent!

In which case I think it would be wise for someone with Russell's views not to call the opposition stupid. Or to assert that the position is trivial. When in fact the argument might come down to fairly nuanced points about natural language understanding, comprehension, competence, corrigibility etc. As far as I can tell from limited reading, the arguments around how tightly bundled these things may be are not watertight.

I agree from a general convincing-people standpoint that calling discussants stupid is a bad idea. However, I think it is indeed quite obvious if framed properly, and I don't think the argument needs to come down to nuanced points, as long as we agree on the agent design we're talking about - the Roomba is not a farsighted reward maximizer, and is implied to be trained in a pretty weak fashion.

Suppose an agent is incentivized to maximize reward. That means it's incentivized to be maximally able to get reward. That means it will work to stay able to get as much reward as possible. That means if we mess up, it's working against us.

I think the main point of disagreement here is goal-directedness, but if we assume RL as the thing that gets us to AGI, the instrumental convergence case is open and shut.

This misses the original point. The Roomba is dangerous, in the sense that you could write a trivial 'AI' which merely gets to choose an angle to travel along, and does so regardless of grandma in the way.

But such an MDP is not going to pose an X-risk. You can write down the objective function (y - x(theta))^2, differentiate with respect to theta, and follow the gradient, and you'll never end up at an AI overlord. Such a system lacks any analogue of opposable thumbs, memory and a good many other things.
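For concreteness, here is a minimal sketch of that kind of system. The comment does not say what x(theta) is, so the form x(theta) = reach * sin(theta), the function name `roomba_heading`, and the learning rate and step count are all assumptions chosen purely for illustration.

```python
import math


def roomba_heading(y: float, reach: float = 1.0, lr: float = 0.1, steps: int = 500) -> float:
    """Minimise (y - x(theta))**2 by gradient descent, with x(theta) = reach * sin(theta)."""
    theta = 0.0  # start pointing straight ahead
    for _ in range(steps):
        x = reach * math.sin(theta)
        # d/dtheta of (y - x(theta))**2 is -2 * (y - x) * dx/dtheta
        grad = -2.0 * (y - x) * reach * math.cos(theta)
        theta -= lr * grad
    return theta


theta = roomba_heading(y=0.5)
print(theta, math.sin(theta))
```

The whole "agent" is a single number being nudged downhill; there is no state, memory or action set from which anything resembling a plan could be assembled.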

Pointing at dumb industrial machinery operating around civilians and saying it is dangerous may well be the truth, but it's not the right flavour of dangerous to support Russell's claim.

So, yes, it is going to come down to a more nuanced argument.

It's still going to act instrumentally convergently within the MDP it thinks it's in. If you're assuming it thinks it's in a different MDP that can't possibly model the real world, or if it is in the real world but has an empty action set, then you're right - it won't become an overlord. But if we have a y-proximity maximizer which can actually compute an optimal policy that's farsighted, over a state space that is "close enough" to representing the real world, then it does take over.

The thing that's fuzzy here is "agent acting in the real world". In his new book, Russell (as I understand it) argues that an AGI trained to play Go could figure out it was just playing a game via sensory discrepancies, and then start wireheading on the "won a Go game" signal. I don't know if I buy that yet, but you're correct that there's some kind of fuzzy boundary here. If we knew what exactly it took to get a "sufficiently good model", we'd probably be a lot closer to AGI.

But Russell's original argument assumes the relevant factors are within the model.

If, in that MDP, there is another "human" who has some probability, however small, of switching the agent off, and if the agent has available a button that switches off that human, the agent will necessarily press that button as part of the optimal solution for fetching the coffee.

I think this is a reasonable assumption, but we need to make it explicit for clarity of discourse. Given that assumption (and the assumption that an agent can compute a farsighted optimal policy), instrumental convergence follows.

The human-off-button doesn't help Russell's argument with respect to the weakness under discussion.

It's the equivalent of a Roomba with a zap obstacle action. Again the solution is to dial theta towards the target and hold the zap button assuming free zaps. It still has a closed form solution that couldn't be described as instrumental convergence.

Russell's argument requires a more complex agent in order to demonstrate the danger of instrumental convergence rather than simple industrial machinery operation.

Isnasene's point above is closer to that, but that's not the argument that Russell gives.

'and the assumption that an agent can compute a farsighted optimal policy)'

That assumption is doing a lot of work, it's not clear what is packed into that, and it may not be sufficient to prove the argument.

I guess I'm not clear what the theta is for (maybe I missed something, in which case I apologize). Is there one initial action: how close it goes? And it's trained to maximize an evaluation function for its proximity, with just theta being the parameter?

That assumption is doing a lot of work, it's not clear what is packed into that, and it may not be sufficient to prove the argument.

Well, my reasoning isn't publicly available yet, but this is in fact sufficient, and the assumption can be formalized. For any MDP, there is a discount rate γ, and for each reward function there exists an optimal policy for that discount rate. I'm claiming that given γ sufficiently close to 1, optimal policies likely end up gaining power as an instrumentally convergent subgoal within that MDP.

(All of this can be formally defined in the right way. If you want the proof, you'll need to hold tight for a while)
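The snippet below is not the promised proof and does not demonstrate power-seeking. It is only a small sketch of the weaker fact the claim relies on: the optimal policy of a fixed MDP can change as the discount rate approaches 1. The three-state MDP `TOY`, its rewards, and the helper `optimal_policy` are hypothetical and chosen for illustration.

```python
def optimal_policy(transitions, gamma, sweeps=2000):
    """Value iteration on a deterministic MDP.

    transitions[state][action] = (next_state, reward); a state with no
    actions is terminal and has value 0.
    """
    values = {state: 0.0 for state in transitions}
    for _ in range(sweeps):
        # Synchronous Bellman backup: every new value is built from the old table.
        values = {
            state: max((r + gamma * values[nxt] for nxt, r in acts.values()), default=0.0)
            for state, acts in transitions.items()
        }
    # In each non-terminal state, pick the action with the best one-step lookahead.
    return {
        state: max(acts, key=lambda a: acts[a][1] + gamma * values[acts[a][0]])
        for state, acts in transitions.items()
        if acts
    }


# Hypothetical MDP: take 1 reward now and stop, or wait and collect 0.3 per step forever.
TOY = {
    "start": {"grab": ("end", 1.0), "wait": ("later", 0.0)},
    "later": {"collect": ("later", 0.3)},
    "end": {},
}

for gamma in (0.5, 0.95):
    print(gamma, optimal_policy(TOY, gamma)["start"])
```

In this toy instance waiting is worth 0.3 * gamma / (1 - gamma), which exceeds 1 once gamma is above roughly 0.77, so the more farsighted agent gives up the immediate reward.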

Suppose we got hyper compute, or found some kind of approximation algorithm (like logical induction but faster).

We stuck in a manual description of what fetching coffee was, or at least attempted to do so. The AI sends a gazillion tonnes of caffeinated ice to the point in space where the earth was when it was first turned on. This AI system failed most of the bullet pointed checks: it had the wrong idea about what coffee was, how much to get and whether it could be frozen, etc. It also has the "can't get coffee if you're dead" issue, and has probably killed off humanity in making its caffeinated iceball. This is the kind of behavior that you get when you combine an extremely powerful learning algorithm with a handwritten, approximate kludge of a goal function.

Another setup with different problems: suppose you train a coffee fetching agent by giving it a robot body to run around and get coffee in. You train it on human evaluations of how well it did. The agent is successfully optimized to get coffee in a drinkable form, to get the right amount given the number of people present, etc. Its training contained plenty of cases of spilling coffee, and it was penalized for that, making a mesa-optimizer that intrinsically dislikes coffee being spilled.

However, its training didn't contain any cases where it could kill a human to get the coffee delivered faster; as such, it has no desire not to kill humans. If this agent were to greatly increase its real world capabilities, it could be very dangerous. It might tile the universe with endless robots moving coffee around endless living rooms.

I think the second robot you're talking about isn't the candidate for the AGI-could-kill-us-all level alignment concern. It's more like a self driving car that could hit someone due to inadequate testing.

Guess I'm not sure though how many answers to our questions you envisage the agent you're describing generating from second principles. That's the nub here because both the agents I tried to describe above fit the bill of coffee fetching, but with clearly varying potential for world-ending generalisation.