
Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

My AGI safety from first principles report (which is now online here) was originally circulated as a Google Doc. Since there was a lot of good discussion in comments on the original document, I thought it would be worthwhile putting some of it online, and have copied out most of the substantive comment threads here. Many thanks to all of the contributors for their insightful points, and to Habryka for helping with formatting. Note that in some cases comments may refer to parts of the report that didn't make it into the public version.

# Discussion on the whole report

Thanks so much for writing this! Huge +1 to more foundational work in this area.

My overall biggest worry with your argument is just whether it's spending a lot of time defending something that's not really where the controversy lies. (This is true for me; I don't know if I'm idiosyncratic.) Distinguish two claims one could argue for:

Claim 1: At some point in the future, assuming continued tech progress, history will have primarily become the story of AI systems doing things. The goals of those AI systems, or the emergent path that results from interactions among these systems, will probably not be what you reading this document want to happen.

I find claim 1 pretty uncontroversial. And I do think that this alone is enough for far more of the world to be thinking about AI than currently is.

But it feels like at least for longtermist EAs trying to prioritise among causes (or for non-longtermists deciding how much to prioritise safety vs speed on AI), the action is much more on a more substantial claim like:

Claim 2: Claim 1 is true, and the point in time at which the transition from a human-driven world to an AI-driven world is in our lifetime, and the transition will be fast, and we can meaningfully affect how this transition goes with very long-lasting impacts, and (on the classic formulations at least) the transition will be to a single AI agent with more power than all other agents combined, and what we should try to do in response to all this is ensure that the AI systems that get built have goals that are the same as the goals of those who design the AI systems.

Each of the new sub-claims in claim 2, I find (highly) controversial. And you talk a little bit about some of these sub-claims, but it's not the focus.

Interested if you think that's an unfair characterisation. Perhaps you see yourself as arguing for something in between Claim 1 and Claim 2.

Richard Ngo

I think it's fair to say that I'm defending claim 1. I think that a lot of people would disagree with it, because:

a) They don't picture AI systems having goals in a way that's easily separable from the goals of the humans who use them; or

b) They think that humans will retain enough power over AIs that the "main story" will be what humans choose to do, even if some AIs have goals we don't like; or

c) They think that it'll be easy to make AIs have the goals we want them to have; or

d) They think that, even if the outcome is not specifically what they want, it'll be within some range of acceptable variation (in a similar way to how our current society is related to our great-great-grandparents').

My thoughts on the remaining parts of claim 2:

a) "The point in time at which the transition from a human-driven world to an AI-driven world is in our lifetime"

OpenPhil are investigating timelines very thoroughly, so I'm happy to defer to them.

b) The transition will be fast.

I make some arguments about this in the "speed of AI development" section. But broadly speaking, I don't want this version of the argument to depend on the claim that it'll be very fast (i.e. there's a "takeoff" from something like our current world lasting less than a month). I expect that a takeoff happening over less than a decade is fairly plausible, and I don't think my arguments will depend very sensitively on whether it's closer to a month or a decade. (By default I'm thinking of takeoffs using Paul's definition in terms of economic doubling times, although I'm not fully spelling that out here.)

c) We can meaningfully affect how this transition goes with very long-lasting impacts

I don't want to get into specifics of what safety techniques we might use. However, I think you're probably right that I should provide some arguments for why we might think that we can make a difference to this transition in the long term. This argument would look something like: changing the goals of the first AGIs is long-term influential; and also changing the goals of the first AGIs is viable.

e) What we should try to do in response to all this is ensure that the AI systems that get built have goals that are the same as the goals of those who design the AI systems.

I'm mostly sticking to the concern with agents being misaligned with humanity in a very basic sense. I think that proposals to build goals into our AGIs that don't boil down to obedience to some set of humans are not very viable; if and when such proposals emerge, I'll try to address this more explicitly.

OK, fair re (1): I do think this claim is enormously important and underappreciated, so the fact that I'm already convinced of it doesn't mean much!

"I don't want to get into specifics of what safety techniques we might use. However, I think you're probably right that I should provide some arguments for why we might think that we can make a difference to this transition in the long term. This argument would look something like: changing the goals of the first AGIs is long-term influential; and also changing the goals of the first AGIs is viable."

Yeah - this is a case where how exactly the transition goes seems to make a very big difference. If it's a fast transition to a singleton, altering the goals of the initial AI is going to be super influential. But if instead there are many generations of AIs that over time become the majority of the economy and then just control everything, predictably altering how that goes seems a lot harder, at least.

# Discussion on mesa-optimisers

Richard Ngo: AI systems which pursue goals are also known as mesa-optimisers.

Ben Garfinkel

If you use a pretty inclusive conception of goals -- which I think is the right approach -- then it might be worth noting that "mesa-optimizers" become a more inclusive and mundane category than the paper seems to have in mind.

+1 I'm still confused by what it means for something to actually be a mesa-optimizer (it seems to only make sense in the transfer learning context, as pursuing goals it was not directly optimized for).

Evan Hubinger

A mesa-optimizer is defined in the paper as any learned model which is internally running some search/optimization process. That optimization process could be for the purpose of achieving the goal it was trained for (in which case it's inner aligned) or a different goal than it was trained for (in which case it's not inner aligned).

I'm a little uncomfortable with that definition since it depends on implementation details rather than behaviour. Given some system A that is running a search/optimisation process internally, I can construct a (possibly truly gigantic) look-up table B that has exactly the same output as A. Under this definition, A has a mesa-optimizer and B doesn't. But they'll behave identically: including potentially some treacherous turns.

To my taste the sharper distinction is whether the policy transfers to pursuing the (outer) goal on an unseen test environment. Inner search processes with a different inner goal will fail to transfer, but there exist other failure modes too.

Where I can see this notion of an inner agent being really useful is in an interpretability setting, where we're actually trying to understand implementation details of an agent.
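The lookup-table construction in this comment can be made concrete with a toy sketch. Everything here is a hypothetical illustration (the state space, the objective, and the names `search_policy` and `table_policy` are assumptions, not anything from the mesa-optimisation paper): system A chooses an action by explicit search, and system B is a table filled in by running A on every input.

```python
from itertools import product

def search_policy(state):
    """System A: choose an action by explicit search over a small action set."""
    # Hypothetical objective: get as close as possible to the state's checksum.
    target = sum(state) % 3
    return max(range(3), key=lambda action: -abs(action - target))

# System B: a lookup table, built by running A once on every possible state.
STATES = list(product(range(3), repeat=2))
lookup_table = {state: search_policy(state) for state in STATES}

def table_policy(state):
    """System B: no search at run time, only a dictionary lookup."""
    return lookup_table[state]

# A and B are behaviourally indistinguishable on the whole state space.
identical = all(search_policy(s) == table_policy(s) for s in STATES)
```

Under the implementation-based definition only A contains a search process, although no test on these states can tell the two apart. The sketch also shows that the table is only obtained by running the search once per state, which is the sense in which a lookup table moves the computation upstream rather than removing it.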

Richard Ngo

I think that arguments from lookup tables usually give rise to faulty intuitions, because they're such (literally unimaginably) extreme cases.

And the problem with behaviour-based definitions is that you need to factor out the causes anyway. E.g. suppose the policy doesn't transfer. Is that just because it's not smart enough? Or because the test environment was badly-chosen? Or maybe it does transfer, but only because it's smart enough to realise that doing so is the only way to survive. To distinguish between these, we need to make hypotheses about what sort of cognition is occurring - in which case we may as well use that to define being a mesa-optimiser.

Lookup tables are certainly an extreme example but I think it's pointing at a real problem. Optimisation or search processes are a very narrow type of implementation. You can also imagine having learned a lot of heuristics, that give rise to similar behaviour. Perhaps you and Evan have a more encompassing definition of optimisation than I have in mind, but if it's too expansive then the definition becomes vacuous. I also have concerns that we don't have a good definition of whether something is/isn't an optimiser, let alone how to tell that from the inner workings of a complex system: I don't know how I'd infer that humans are optimising, other than by looking at our behaviour.

I don't see why you necessarily need to factor out the causes. What we care about is the kind of transfer failure, notably how much impact it has on the environment. If you wanted to think of things in terms of agents you could do an IRL-like procedure on the transfer behaviour and see if it has a coherent goal that's different to the outer goal.

Ben Garfinkel

I'm also a bit suspicious there will turn out to be a very principled way to draw the line between "optimizer"/"non-optimizer" that matches the categorizations the paper seems to make. I don't intuitively see what key property would make it true that biological evolution is "optimizing" for genetic fitness, for example, but not true that AlphaStar is "optimizing" for success in Starcraft games.

One closeby distinction that seems like it might be relevant is between AI systems whose policies are updated through both learning and planning algorithms (e.g. AlphaGo) and ones whose policies are just updated through learning algorithms (e.g. model-free RL agents). We could also focus specifically on systems that use planning algorithms that were at least partly learned. But I'm not sure if that exactly captures the relevant intuitions about what should or should not be counted as a "mesa-optimizer." Also not clear if we have reason to be very strongly focused on agents in this class, since it's at least not obvious to me that they'll tend to exhibit way worse transfer failures.

Max Daniel

FWIW, my take roughly is:

• I agree with Adam that we currently don’t have a fully fleshed out concept of optimizer, and that we don't know how to reliably identify one given low-level information about how a system works. (Or at least I don't know how to do this -- possible that I'm just missing something here since I haven't engaged super deeply with the discussion around mesa optimization.)
• However, I think unlike Adam, I'm both somewhat optimistic that we can improve our conceptual understanding and operationalization, and also that it's important and useful to do so. By contrast, I'm pessimistic about purely "behavioral" approaches, partly because I think they have a poor track record in psychology, the philosophy of mind etc.

So for me the upshot from feeling I don't have a super great handle on what an optimizer is (and by extension what a mesa-optimizer is), is "let's understand this better" rather than "use purely behavioral concepts instead".

Jaan Tallinn

re lookup tables: you still need to run the computation (in simulation and up to isomorphism) to get them, so you're not really avoiding implementation details, just pushing them upstream. (eg, think of a SHA256 lookup table).

That said, "internalised goals" is indeed a broader concept than mesa optimisers: that's easily seen in evan's example of maze-navigating agents: ie, think of 2 agents whose goals are hooked up to conceptually different but accidentally correlated aspects of the training environment (maze exits and exit signs of particular design) -- neither has to be doing any mesa optimisation but, after training, one ends up caring about exits and the other about exit signs (with humans arguably being the equivalent of the latter kind of agent).

# Discussion on utility maximisation

Richard Ngo: Utility functions are such a broad formalism that practically any behaviour can be described as maximising some utility function.

Buck Shlegeris

I think that "utility functions don't constrain expectations" is pretty contentious (eg I disagree with it), e.g. I basically expect the Omohundro drives argument to go through. This claim feels way more nonstandard than anything else I've seen you write here so far, which is fine if that's what you want but felt jarring to me.

Richard Ngo

I think the omohundro argument goes through if an AGI has a certain type of long-term large-scale goal. I don't think talking about utility functions is helpful in describing that goal (or whether or not it'll have it).

On a more meta note, I feel like this claim is accepted by a reasonable number of safety researchers (e.g. Rohin, Eric, Vlad, myself). And Rohin, Eric and I have all publicly written about why. But I haven't seen any compelling counterarguments in the last couple of years. I'm not sure how else to move this claim towards the "standard" category, but we should definitely discuss it when I next swing by MIRI.

Buck Shlegeris

I feel quite unconvinced by it and look forward to talking. I think my counterargument is kind of similar to what Wei Dai says here: https://www.alignmentforum.org/posts/vphFJzK3mWA4PJKAg/coherent-behaviour-in-the-real-world-is-an-incoherent#F2YB5aJgDdK9ZGspw

Rohin Shah

Tbc, I broadly agree with Wei Dai's comment, but note that it depends pretty crucially on "what an optimization process is likely to care about", which is the "prior over goals" that Richard talks about in subsequent sentences

Ben Garfinkel

I don't think the Omohundro drive argument provides any constraints, since (as Richard/Rohin/etc. point out) effectively all policies are optimal with regard to some utility function.

So if we want to interpret the Omohundro argument as saying that something that's maximizing a utility function must behave like it's trying to do stuff like unboundedly accumulate resources, then the argument is wrong. (Because it's of course possible to behave in other ways.)

It seems natural, then, to interpret the Omohundro argument as the claim that, for a very large portion of utility functions over sequences of states of the world, an agent that's optimally maximizing that utility function will tend to do stuff like unboundedly accumulate resources. But this isn't really a "constraint" per se, in the same way that the observation that a very large portion of possible car designs lack functional steering wheels doesn't really constrain car design.

Buck Shlegeris

Matthew Graves

I suspect that everyone in this thread gets this, but wrote out my sense of this and figured it was better to post it than delete it.

I think there are two different points here.

One is the point that utility functions (as a data-type) don't impose any logical constraints on what goals can exist or what trajectories are optimal. (This seems right to me; exhibit any behavior, we can find a utility function that is maximized by performing that behavior.)

There's a distributional point that still has teeth; if we have a simplicity prior over goals stored as utility functions, and then we push that through expected utility maximization, we'll get a different distribution of behavior than the distribution we would get if we have a simplicity prior over goals stored as hierarchical control system architectures and top-level references, and then push that through normal dynamics. Homeostatic behavior seems easier to express in the latter case than the former case, for example.

The other is the point that for 'natural goals', you end up with similar subgoals turning out to help for almost all of those natural goals. This relies more on something like 'evolutionary' or 'economic' intuitions for what that distribution looks like, or for what the behavioral relevance of goals looks like.

This is more like 'aerodynamic constraints,' which don't limit what sort of planes you can build but do limit what sort of planes stay in the sky. It's not "if you randomly pick a goal, it will do X", but "if you make something that functions, these drives will by default make it more functional," with a distributional assumption over goals implied by functional present in 'by default.'
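The first point in this comment (exhibit any behaviour, and a utility function can be found that the behaviour maximises) can be sketched in a few lines. This is a minimal illustration under assumed names: `arbitrary_policy`, `fitted_utility` and the three-action set are hypothetical, and the construction is the trivial indicator-function one rather than anything specific to the VNM theorem.

```python
import random

ACTIONS = ["left", "right", "wait"]

def arbitrary_policy(observation):
    """Any fixed behaviour at all; here a seeded pseudo-random choice."""
    return random.Random(observation).choice(ACTIONS)

def fitted_utility(observation, action):
    """Curve-fitted utility: 1 for whatever the policy does, 0 otherwise."""
    return 1.0 if action == arbitrary_policy(observation) else 0.0

def utility_maximiser(observation):
    """An agent that maximises the curve-fitted utility function."""
    return max(ACTIONS, key=lambda a: fitted_utility(observation, a))

# The maximiser reproduces the arbitrary behaviour exactly.
matches = all(utility_maximiser(o) == arbitrary_policy(o) for o in range(100))
```

The fitted utility function is defined in terms of the behaviour it is meant to explain, so it says nothing about what the policy will do that the policy itself did not already say. That is the sense in which the data-type alone imposes no constraint. The distributional and 'aerodynamic' points are separate and are not captured by this sketch.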

Richard Ngo

I found it helpful that you wrote this out explicitly.

"If you make something that functions, these drives will by default make it more functional"

Another way of making this point might be: throughout the training period of an agent, there's selection pressure towards it thinking in a more agentic way. (Although this is a pre-theoretic concept of agentic, I think, rather than any version of "agentic" we can derive from VNM).

Do I agree with this? I think it depends a lot on how we build AGI. If we build it in a simulated virtual world with multi-agent competition, then that's almost definitely true.

If we train it specifically on information-processing tasks, though - like, first learn language, then learn maths, then learn how to process other types of information - then I think the selection pressures towards being agentic can be very weak.

An intermediate scenario would be: we train it to be generally intelligent and execute plans without the need to outsmart other agents. E.g. we put a single agent in a simulated environment, and reward it for building interesting things. I could see this one going either way.

Maybe an important underlying intuition I have here is that making something agentic is hard, plausibly not too far off the difficulty of making it intelligent in the first place. And so it's not sufficient for there to be some gradient in that direction, it's got to be a pretty significant gradient.

Jaan Tallinn

i have to say i still don't fully buy rohin's and richard's arguments against VNM/omohundro. i mean, i understand the abstract idea of "you can always curve-fit a utility function to any given behaviour", but if i try to make this concrete and imagine two computations:

a) bitcoin mining of input->add_nonce->SHA256->compare_output_against_threshold->loop_if_too_high->output and

b) simple hasher of input->SHA256->loop_for_10_steps->output,

then the former is very obviously an optimiser with a crisp, meaningful, and irreducible utility function, whereas the latter clearly isn't – even though one can refactor it to a more complex piece of code with a silly utility function that expresses the preference to terminate after 10 iterations.

i also note that this comment was inspired by max's comment below re the arguments against the meaningfulness of utility functions being isomorphic to the arguments against reductionism. thanks, max!

(just thinking out loud: the silly utility function in (b) actually differs meaningfully from the one in (a), because it would not be sensitive to the program input -- perhaps that's a general differentiator between "proper utility functions" and "curve-fitted utility functions"?)
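The two computations in this comment can be written out as a short sketch using Python's standard `hashlib`. The function names, the 8-byte nonce encoding and the easy threshold are assumptions chosen so the example finishes quickly; real bitcoin mining uses a different block format and double SHA-256.

```python
import hashlib

def sha256_int(data):
    """Interpret the SHA-256 digest of `data` as a big-endian integer."""
    return int.from_bytes(hashlib.sha256(data).digest(), "big")

def mine(data, threshold):
    """(a) Search for a nonce whose hash is not above the threshold."""
    nonce = 0
    # Loop while the output is too high; the number of iterations depends on
    # both the input and the threshold.
    while sha256_int(data + nonce.to_bytes(8, "big")) > threshold:
        nonce += 1
    return nonce

def hash_ten_times(data):
    """(b) Apply SHA-256 ten times; the loop count ignores the input."""
    for _ in range(10):
        data = hashlib.sha256(data).digest()
    return data

# Hypothetical easy threshold: roughly one hash in 256 qualifies.
EASY_THRESHOLD = 2 ** 248
nonce = mine(b"example input", EASY_THRESHOLD)
digest = hash_ten_times(b"example input")
```

In (a) the stopping condition is a test on the output, so the amount of work adapts to the input until the criterion is met. In (b) the iteration count is fixed in advance, which matches the observation that any utility function fitted to it would not be sensitive to the program input.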

Rohin Shah

I certainly agree there's a clear qualitative difference between (a) and (b), but why expect that AGIs are more like (a) than (b)?

My core claim is that the VNM theorem cannot tell you that AGIs are more like (a) than (b), though I do in practice think AGIs will be more like (a) for other reasons.

Jaan Tallinn

yeah, sure, that i agree with -- and it's a valuable question how big is the "b-class" among AGI-s. FWIW, GPT3 feels like it is part of (b).

i guess i'm confused on the meta level: AFAICT, people seem to be debating (at least) 2 very different questions: 1) do utility functions constrain behaviour of the agents that have them (see the earlier comments in this thread -- perhaps i'm misinterpreting them though), and 2) does VNM mean that agents with utility functions are a likely outcome (your comment).

my own current answers to those questions are: 1) yes, 2) probably not, but i'm not sure.

Richard Ngo

I basically just want to get rid of this "utility function" terminology, because it's almost always used ambiguously and by this point has very little technical content in AI safety discussions. On one extreme there's an agent with no coherent goals, whose behaviour we can curve-fit a utility function to. On the other extreme, there's an agent with an explicit utility function represented within it, which it uses to evaluate the value of all its actions, like Deep Blue or your hypothetical bitcoin miner. Both of these are, in technical terms, maximising some utility function. But when you say "agents with utility functions", you clearly don't mean the former.

So what do people mean when they say that? I suspect they are often implicitly thinking of agents with explicit utility functions. But I think AGIs having explicit utility functions is pretty unlikely - for example, humans don't have explicit utility functions, we're just big neural networks. So we need to talk about agents with non-explicit utility functions too - which includes curve-fitted utility functions. To rule out curve-fitted utility functions, people start using vague concepts like "aggregative utility functions" or "state-based utility functions" or things along these lines - concepts which have never been properly explicated or defended, but which seem to have technical content because they include the phrase "utility function".

At this point I think it's easier to just discard the terminology altogether. For some agents, it's reasonable to describe them as having goals. For others, it isn't. Some of those goals are dangerous. Some aren't. We need to figure out what the space of possible goals looks like, and how they work, and which ones our AGIs will have.

(In case you haven't seen my more extensive writeup on the ambiguity of utility functions: https://www.alignmentforum.org/posts/vphFJzK3mWA4PJKAg/coherent-behaviour-in-the-real-world-is-an-incoherent)

Rohin Shah

+1 to Richard's comment – I'd be supportive of discussions about (a) style agents (i.e. agents with utility functions) if we had arguments that we are going to build (a) style agents, but afaict there are no such arguments -- the ones I've seen rely on VNM, which doesn't work (e.g. https://www.lesswrong.com/posts/RQpNHSiWaXTvDxt6R/coherent-decisions-imply-consistent-utilities or https://intelligence.org/2016/12/28/ai-alignment-why-its-hard-and-where-to-start/ )

Jaan Tallinn

indeed, seems like a very important crux to get more clarity about. myself i'm highly uncertain at this point, given that historically we have examples of both types even among the cutting edge: eg, alphazero is type (a) yet GPT, AFAICT, is type (b).

Richard Ngo: Moravec’s paradox predicts that building AIs which do complex intellectual work like scientific research might actually be easier than building AIs which share more deeply ingrained features of human cognition like goals and desires.

Buck Shlegeris

I find Moravec's paradox extremely uncompelling; it seems to me like a fact about the set of tasks you cherry pick. (Eg, my work involves tasks like writing correct code, writing fast code, and proving theorems, all of which are hard tasks for both humans and computers. I guess these tasks are all kind of the same task, so maybe that makes the example weaker?) I wouldn't have thought that Moravec's paradox is highly regarded enough that you should call it an argument for something. But it's very possible I'm just totally wrong and there's lots of positive sentiment towards it

Richard Ngo

Hmmm, put it this way: people used to think that chess was AI-complete. And StarCraft and grammatically-correct language modelling required fewer breakthroughs than I think most researchers expected - our AIs learned to do those tasks very well without having the solidly-grounded concepts humans have. I would still be surprised if we got a language model smart enough to write good code and do good maths before we got agentic AGI, but much less surprised than I would have been before the cluster of observations comprising Moravec's paradox.

Rohin Shah

I share Buck's sentiment; I feel like Moravec's paradox is mostly a statement that human-designed logical / GOFAI systems can't do "simple" perceptual tasks / alternatively it's a statement that things that feel "simple" to humans are actually using vast amounts of unconscious computation.

If you restrict to neural nets, it doesn't seem to me that Moravec's paradox holds -- it's far easier to get a HalfCheetah to run than to play Go well.

I'm sympathetic to Moravec's paradox. I view it partially as an observation about evolution: things that creatures have been under selective pressure for 100s of millions of years (visual perception, motor control, to some extent theory of mind) are likely to be at a much higher level than relatively recent advances like most "higher" human cognition.

I think it still holds for neural networks (albeit a bit weaker than for GOFAI). HalfCheetah seems like an uncompelling example: a linear policy works for it quite well, it's way simpler than e.g. a bee flying. I'd think: is it easier to control a dog robot to hunt an animal, or to play Go?

# Discussion on human goals

Richard Ngo: The goals of AGIs might be misaligned with ours.

I don't think there's a reasonable notion of 'our' goals, where 'our' is meant to stand in for 'humanity'.

The sort of world that we'd see if Julia Wise was Eternal World Emperor is very different than the world we'd see if Mao was in that position, which is very different again if history is (primarily or at least very significantly) determined by the logic of cultural evolution and economic-military competition, as it is now.

Richard Ngo

I agree it's tricky to come up with a positive conception of what alignment with our goals means. I think it's much more reasonable to make the claim that AI goals won't be aligned with human goals. A minimal criterion would be if AIs have goals that a vast majority of humans wouldn't want to be fulfilled (either given their current preferences, or else given their fully informed/fully rational preferences, whatever that means). I'll add something like this to the text.

This does leave open some comparative questions - e.g. you might claim that our current governmental systems are also misaligned with human goals, by this standard. Or you might claim that no set of goals is aligned by this standard. But I think that for the purposes of making the simplest compelling version of the second species argument, the version above is sufficient.

(It also doesn't address the distinction between the goals of individual AIs, versus the goals that we should ascribe to a group of AIs interacting with us and each other. I need to think more about this).

Jaan Tallinn

plugging ARCHES’ approach (prepotent AI) here again as an elegant way to circumnavigate the philosophical tar pit.

Again, I really think that 'human goals' has no referent. Most people's goals are wildly different from each other's, because they are primarily motivated to benefit themselves and their nearest and dearest. This goes badly wrong when a single individual can have much more power than others (i.e. dictatorships). The solution is balance of power - then you get messy trades that end up with alright outcomes.

And what we should care about is getting the right goals, rather than 'human goals'. For almost all previous generations, and for many cultures today, the thought of them having the equivalent discussion of 'how do we ensure that ultra-powerful machines do what we want them to, so that we can control our future' is pretty terrifying to me. I don't think we should think that we've done much better than them at morality. So I don't think we should be trying to ensure that 'we' control 'our' future, rather than ensuring that society is structured to maximise the chance of converging on the right moral views.

Jaan Tallinn

i disagree that most people's goals are wildly different from each other in the grand scheme of things. i claim that the differences are exaggerated because of 1) tribal signalling instinct, 2) only considering a parochial corner of the goal space, 3) evaluating futures asymmetrically from the first person perspective.

eg, i would bet that almost all people on this planet would agree that dumping the atmosphere (or, to borrow an example from stuart russell, to color the sky orange) - things that are well within sufficiently capable AI's goal space - is bad.

that said, i do think there's value in trying to generalise from current human values -- eg, to see if there's a principled way to continue our moral progress and/or become part of a larger reference class in possible acausal coalitions -- but this feels like a detail when compared to the acute problem of trying to avoid extinction.

# Discussion on link between intelligence and goals

Richard Ngo: We can imagine a highly intelligent system that understands the world very well but has no desire to change it.

Ben Garfinkel

Simple example of these two things coming apart: Our tendency to try to predict what the world will be like in ten years tends to be strongly coupled to (and arguably arises from) the fact we have preferences about the world in ten years. But it's totally possible to create at least simple AI systems that make predictions about the future -- e.g. just something that extrapolates solar panel price trends forward -- without being well thought of as having any preferences about the future.

We might expect more advanced AI systems, or AI systems that are trained in different ways, to exhibit a stronger link between prediction and future preference. But at least they currently mostly come apart.
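A minimal sketch of the kind of predictor described in this comment: a log-linear trend fitted to past prices and extrapolated forward. The price points below are made up purely for illustration (they are not real solar panel data), and the names `fit_log_linear` and `predict_price` are hypothetical.

```python
import math

# Hypothetical, made-up (year, dollars per watt) points; not real data.
history = [(2010, 2.0), (2012, 1.4), (2014, 1.0), (2016, 0.7), (2018, 0.5)]

def fit_log_linear(points):
    """Least-squares fit of log(price) = slope * year + intercept."""
    n = len(points)
    xs = [year for year, _ in points]
    ys = [math.log(price) for _, price in points]
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    slope = covariance / variance
    return slope, mean_y - slope * mean_x

def predict_price(year, slope, intercept):
    """Extrapolate the fitted exponential trend to a future year."""
    return math.exp(slope * year + intercept)

slope, intercept = fit_log_linear(history)
forecast = predict_price(2028, slope, intercept)
```

The program outputs a claim about the future, but nothing in it represents a preferred outcome or selects actions to bring one about: it has no actions available, and the fit would be the same whichever way the trend pointed.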

A key question here is whether you think advanced AI systems can be trained without interacting with the world: i.e. does it need to be able to conduct mini-experiments to learn a good predictive model?

Richard Ngo

I agree that this is an important question, but I don't think that "interacting with the world" is a particularly important threshold. That is, if we consider a spectrum between an oracle AI that only makes predictions and a highly agentic AI that tries to impose its will on the world, an AI that conducts mini-experiments to make better predictions feels much closer to the former than the latter.

Daniel Kokotajlo

I think that feeling is deceptive, possibly the result of intuitions imported from humans (Silly nerdy scientists, stuck in their labs, not interested in politics or adventure, just trying to understand bugs!).

If you use natural selection to evolve an AI that is a really good predictor (and it has actuators, so it can become an agent if that is useful) then I predict what you'll get is... well, at first you'll get AIs that walk themselves into a pitch-black hole so that they can get perfect predictive accuracy for a long time. But eventually, after enough selection pressure has been applied, you will get... NOT nerdy scientist types who stay locked in their labs, but rather something more like a paperclip-maximizer that is very agentic, learns a lot about the world while also building up resources and improving its abilities, and then takes over the world and makes it very simple and easy to predict and also secure so that it can keep predicting forever.

(And if you make the episode length too short for that, but long enough that the "go sit in a hole" strategy isn't optimal either, you'll get something in between, I think. A rather scientific, data-hungry AI, but one that still cares a lot about acquiring resources and manipulating the world to acquire more data and abilities.)

Richard Ngo

I think this reasoning is too high-level. It's not sufficient to argue that taking over the world will improve prediction accuracy. You also need to argue that during the training process (in which taking over the world wasn't possible), the agent acquired a set of motivations and skills which will later lead it to take over the world. And I think that depends a lot on the training process.

For example, you could train an agent which, although it has actuators, has very plentiful data (e.g. read-only access to the internet) and no predators, and so literally never learns to use those actuators for anything except scrolling through the internet. So it might just never develop goal-based cognitive structures (beyond the very rudimentary ones required to search web pages).

And secondly, if during training the agent is asked questions about the internet, but has no ability to edit the internet, then maybe it will have the goal of "predicting the world", but maybe it will have the goal of "understanding the world". The former incentivises control, the latter doesn't. How do we know which one it will develop? Again, details of the training process.

Daniel Kokotajlo

OK, fair enough--there could be architectures that interact with the world yet aren't agenty (and for that matter architectures that don't interact with the world yet are agenty), so your original point that the interact-with-the-world distinction isn't super important still stands.

But I think the distinction will probably be at least somewhat important. If we imagine training an AI system first on a static corpus of internet text, then on read-only access to the web, then on read-and-write access to the web, then on that plus having a bunch of robot limbs and cameras to control... it does seem like as we move down this spectrum the probability of the system developing goal-oriented behavior increases. (Subject to other constraints of course, like what the system is being rewarded for.)

I think the two specific examples you give -- plentiful data makes significant consequentialist planning useless, and understanding vs. predicting -- don't seem super compelling to me. What is understanding, anyway? And what reason do we have to think that consequentialist planning would become useless (or nearly useless) with sufficient data? Might it not still be useful for e.g. conserving computational resources to arrive at the answer most efficiently? Or not "scaring away" bits of the world that are watching you and reacting to what you do? (maybe there are no such bits of the world? But in the real world there are, so maybe an oracle trained in this manner just wouldn't generalize well to contexts in which what it does has a big impact on what the world is like, since ex hypothesi such situations never arose in training.)

# Discussion on generalisation

Richard Ngo: AGIs won’t be directly selected to have very large-scale or long-term goals. Yet it’s likely that the goals they learn in their training environments will generalise to larger scales.

Ben Garfinkel

I don't think I find it natural to think of this as an example of generalization.

Richard Ngo

I think it's a pretty important example. E.g. consider being trained on the goal of becoming the most important person in your tribe. Then you learn that there are actually a million different tribes. You could either end up with a non-generalised version of the goal (I still want to be the most important person in my tribe) or else a generalised version of it (I want to be the most important person I know).

Similarly, if you originally care about "the future", and then you learn that the future is a lot bigger than you think, then you may generalise from your original goal to something much larger in scope.

Idk if this response was helpful, but let me know if something about these examples doesn't feel right.

Daniel Kokotajlo

The way I currently think of it is that "trained on the goal of becoming the most important person in your tribe" means "blind idiot evolution selected your ancestors because they became important in their tribes, and so you have qualities that make you likely to become important in yours, at least insofar as the environment hasn't changed." But, whoops, the environment is different now! OK, so what do you do--well, that depends on what qualities you have, exactly. Do you have the quality "Seeks to become the most important person?" or the quality "Seeks to become the most important person in their tribe?" Both would be equally well selected for in evolutionary history, so it's basically random which one you have. However, the first one is simpler, so perhaps that is more likely. Hence the "generalization".

I am looking forward to thinking about it more in your way, though.

Rohin Shah

I think the idea here is for "generalize to X" to simply mean "when out-of-distribution you do X" without any connotations of "and X was the right thing to do".

In other words, generalization is a three-object predicate -- you can only talk about "A generalizes to doing B in situation C", and when we say "A generalizes to situation C" we really mean "A generalizes to doing whatever-humans-do in situation C".

# Discussion on design specifications

Richard Ngo: Consider the distinction drawn by Ortega et al. between ideal specification, design specification and revealed specification.

Ben Garfinkel

I personally still feel a bit fuzzy on how to think about a design specification -- and the discrepancy between it and the revealed specification -- in cases where we're doing something more complicated than using a fixed reward function.

It seems like one way to think about this would be to treat the "design specification" as equivalent to the "revealed specification" that an AI system would have if the training process were allowed to run to some asymptotic limit (both in terms of time steps and diversity of experience). But I guess this doesn't quite work, on views where inner alignment issues don't get washed away at the asymptotic limit.

Richard Ngo

I agree that the asymptotic limit is a useful intuition, but also that it doesn't work, because of path-dependency.

My suggested alternative is something like: "consider the set of all features of the training setup which we intend to influence the agent's final behaviour. The design specification is our expectations of the agent that will arise from all of those influences", i.e. it's fundamentally determined by our knowledge of how training features influence agent behaviour. That's kinda messy but I think it's the right direction.

Then the revealed specification differs from the design specification both because of factors we didn't predict, and also because the former is a distribution over possible agents, whereas the latter has narrowed down to one possible agent.

Ben Garfinkel

That seems like a good way to think about things. Obviously a bit messy/subjective, as you say, but I have some trouble seeing how it could ultimately be made much more precise/objective.

# Discussion on myopia

Richard Ngo: Technically you could make an agent myopic to avoid reward hacking; but we don't really want myopic agents either. We want ones that care about the results of their actions in the right ways.

Evan Hubinger

Sometimes you actually do want myopia. For example, if you're doing imitative amplification you want the model to purely be trying to predict what HCH would do and not trying to do any sort of long-term optimization by e.g. outputting very simple strings so the human consulting those simple strings is easier to predict (see this comment).

Paul Christiano

I don't buy this presumption against "myopia," or even the name myopia.

We need AI systems that are more likely to take actions with certain kinds of long term consequences. That doesn't mean that those are agents which maximize a discounted sum of rewards according to some reward function. We can use whatever means we like to define an objective function over actions which captures long term consequences.

To me that seems like the main viable approach to building aligned AI, so I think you should be very careful about dismissing it with a quick informal argument.

Put more sharply, to the extent you are expressing disagreement with my views, I think this point hides the whole substance of the disagreement.

(Similarly, you have to be careful about the order of quantification and limits. I'm fine agreeing that a fixed reward function can't incentivize good behavior, but that's not the same as saying that you can't define a safe objective, and doesn't mean that we have no choice but to fully mix together questions about generalization with questions about "could we incentivize good behavior at all?")

# Discussion on simulating treacherous turns during training

Richard Ngo: The type of scenarios we’re most concerned about, such as treacherous turns, won’t be possible to expose agents to during training, because we can’t simulate all the decision-relevant components.

Matthew Graves

I think this argument is more subtle than it seems, and relies on examples like RSA 2048 that are non-obvious.

That is, most of the 'obvious' sorts of treacherous turns represent taking opportunities that we could include in simulated environments pretty easily. We can be moderately confident that the real robot won't steal the reward button and press it because we gave the simulated robot access to a simulated reward button (or w/e) and applied an outside-of-episode correction to teach it not to use it. As your robot becomes more capable, you can construct more and more complicated and realistic opportunities for it to betray you, and train them all out.

In order for this to not work, we need some reason to think that there's an identifiable way to distinguish the real world and the simulated world that will be robust to mental manipulation (we will likely be able to tinker with the psychology of a simulated robot about as well as dreaming can tinker with the psychology of a sleeping human).

But various mathematical objects (like RSA 2048) do actually look like they serve as 'real world tests', which could be fooled if the operators knew what to fool but that requires more transparent brains than we're likely to get.

(I have to detect that it's being used as a switch, and I can't just give it the input because I don't have the ability to factorize RSA 2048, and so I would have to muck with its multiplication machinery / short-term memory, and it's not obvious I would do so in the way that causes it to switch correctly.)

Paul Christiano

Agreed this point is more subtle than it seems. To the extent this section is read in opposition to my view, a lot of the meat is in these two parenthetical remarks (this one and the [discussion of myopia]).

That said, this is less problematic if this section is instead read as a response to someone who says "We just need a reward function that captures what humans care about well enough, then train on a distribution of real and simulated states so that the agent will be motivated to help humans get what they care about." I think these are all good points to make to someone with that view, and I'm pretty willing to believe that it would be a useful corrective to people who've read stuff I've written and come away with a misleadingly simplified impression.

The people I interact with who have a perspective more like "pick a good reward function and a good distribution of environments," especially Dario, are already mostly thinking in the terms you describe and have generalization as a central feature of their thinking. (Of course there are plenty of people with less nuanced views.)

Stepping back, my own view is that I'm concerned about our ability to effectively control this kind of generalization, or about trying to make things generalize well based on empirical observations that relate in a messy way to the training data and architecture and so on. I think that's reasonably likely to work out, especially since our work on AI alignment might be obsoleted (whether by other people's work or other AI's work) before AI has developed too far. That said, when I think about this kind of approach, I end up feeling more like MIRI people.

Richard Ngo

"To the extent this section is read in opposition to my view, a lot of the meat is in these two parenthetical remarks (this one and the other one I already commented on)."

This is useful to know, thanks. I wasn't sure how much you would agree or disagree. If I were to sketch out why I thought this section might conflict with your views, it's something like:

My predictions about your view of outer alignment mainly come from your writing on IDA. IDA tries to make AGI safe by creating a supervisory signal on training tasks which assigns high scores to good behaviour. But the more we think AGI behaviour will depend on generalisation, the smaller the proportion of training this supervision is applicable to. In an extreme case, AGI might be trained in a simulated environment that's quite different to the real world, until it's intelligent enough that it can learn how to do any real-world task very quickly (i.e. with very few or no further optimiser steps).

In this scenario, it's not clear what it means for an agent's behaviour to be good while it's in the simulation, since it's not doing analogues to real-world tasks whose outcomes we care about. Also, this scenario is consistent with us not having the capability to simulate any important real-world task, to train agents on it. Insofar as these conditions are pretty different from the sort of training regime you usually write about, I was quite uncertain what your perspective on it would be. My perspective is that I'm pretty concerned about our ability to effectively control generalisation, but in cases like the one above, attempting to do so seems necessary.

Paul Christiano

That sounds right. My usual take is: (i) if you are relying on generalization with no fine-tuning, that seems like a pretty bad situation and pretty hard to think about in advance (mostly a messy empirical question), (ii) hopefully we will fine-tune models from simulation in the real world (or jointly train on the real world) -- including both improving average case performance by sampling real situations, and having an objective that discourages rare catastrophic failures (which may involve simulating things that look like opportunities for a treacherous turn to the agent).

# Discussion on transparency techniques increasing safety

Richard Ngo: Building transparent AGI would allow us to be more confident that we can maintain control.

Matthew Graves

This is a bit more confident than I feel comfortable with; one of the main problems here is the "have we found all the bugs in our code?" problem, where it's not actually obvious which way your probability should move when you find and fix a bug. Another is that interpretability may scale up recursive self improvement more than it scales up oversight.

Richard Ngo

I think those are both good reasons to be cautious about interpretability. On my inside view, I have some intuitive reasons to discount both of them, but I'm very interested in fleshing out these sorts of ideas further.

Reasons why they're not major priorities on my inside view: interpretability work focuses on understanding the product of the training process, whereas I expect most gains from recursive self improvement to come from improving the training process itself (especially early on). And since those training processes usually involve simple algorithms (like SGD) which give rise to a lot of complexity, I think there's a pretty high bar for insights about the trained agent to feed back into big improvements to the training process. (Analogously: knowing a whole lot more about how the human brain functions would likely not be very useful for improving evolution).

Meanwhile I expect most (initially hidden) bugs in ML code to decrease performance, but not to invalidate the results of interpretability tools. That is, bugs in ML code tend to be "subtly harms performance" bugs rather than the "hits an edge case then goes haywire" bugs that are common in other types of software. I think this is because there are usually few components in the agent code or training loop code that are closely semantically linked to differences in agent behaviour. (For example: suppose I forget to turn dropout off at test time. It'd be difficult to link this to any specific change in agent behaviour, apart from generally increased variance and reduced performance.) But this is all pretty vague, maybe I just don't understand the intuitions behind worrying about bugs. Keen to chat more about this.
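As a rough illustration of the dropout example, here is a minimal sketch using only the Python standard library. The one-layer "network", its weights, and the 0.5 dropout rate are all hypothetical, chosen just to make the effect visible. It shows only the "increased variance" half of the claim: the same input gives a fixed output when dropout is off and a noisy one when it is mistakenly left on.

```python
import random
import statistics

def forward(inputs, weights, drop_prob, training, rng):
    """Toy one-layer model: a weighted sum with inverted dropout on the inputs."""
    total = 0.0
    for x, w in zip(inputs, weights):
        if training and drop_prob > 0.0:
            # Inverted dropout: drop the unit with probability drop_prob,
            # otherwise rescale it so the expected activation is unchanged.
            if rng.random() < drop_prob:
                continue
            x = x / (1.0 - drop_prob)
        total += x * w
    return total

rng = random.Random(0)             # fixed seed so the run is repeatable
inputs = [1.0, 2.0, 3.0, 4.0]      # hypothetical input
weights = [0.5, -0.25, 0.1, 0.3]   # hypothetical weights

# Correct evaluation: dropout off, so every call returns the same value.
eval_outputs = [forward(inputs, weights, 0.5, False, rng) for _ in range(1000)]
# The bug: dropout left on at test time, so each call uses a fresh random mask.
buggy_outputs = [forward(inputs, weights, 0.5, True, rng) for _ in range(1000)]

eval_variance = statistics.pvariance(eval_outputs)
buggy_variance = statistics.pvariance(buggy_outputs)
print("variance with dropout off:", eval_variance)
print("variance with dropout on: ", buggy_variance)
```

Nothing in the buggy outputs points at a specific behavioural change; the only visible symptom is noise around roughly the same mean, which fits the "subtly harms performance" description.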

Matthew Graves

Agreed that bugs in ML systems generally look like "slight performance harm" instead of "extreme behavior in an edge case."

What I meant by that bit was mostly "whenever you discover a problem with your system, you both have reason to believe you've made things better (by fixing that problem) and that things were already worse than you thought (because there was a problem that you didn't see)."

And there's no way to guarantee that you've seen all the things; you just know that the tests you knew how to write passed.

On the ML transparency front, I'm thinking about the Adversarial Examples Are Not Bugs, They Are Features paper, which here I'm interpreting as making the claim that there's both 'readily interpretable' bits of the network and 'not easily interpretable' bits of the network, both of which are doing real work, and the interpretability tools of the time were mostly useful for understanding what was happening in the light, and not useful for quantifying how much of the work was happening in the shadows. The real goal for interpretability is closer to "nothing is happening in the shadows" but this is hard to measure just from what we can see in the light!

I think if interpretability speeds up RSI, it's probably through making neural architecture search work better by giving it a more natural basis. (That is, you identify subnetworks as functional elements through interpretability, and then can compose functional elements instead of just slapping nodes together; this is the sort of thing that might let you increase your vocabulary as you go instead of just recombining the same words to find the most effective sentence.) I don't have a great sense of whether this will end up mattering, but it makes me reluctant to give interpretability an unqualified endorsement. (I still think it's likely net good.)

# Discussion on ideas getting harder to find

Richard Ngo: I consider Yudkowsky’s arguments about increasing returns to cognitive investment to be a strong argument that the pace of progress will eventually become much faster than it currently is.

Max Daniel

I think you're too quick -- for a lot of research we do have evidence that "ideas are getting harder to find" (see e.g. the paper with this title). So, yes, maybe Yudkowsky's argument from evolution provides one reason to think that returns to intelligence will increase, but there are countervailing reasons as well. It seems highly unclear to me what the upshot on net is going to be.

Richard Ngo

"ideas are getting harder to find" seems several orders of magnitude too small an effect to be important for this debate. If we're talking about intelligence differences comparable to inter-species differences, then the key pieces of evidence seem to be:

• humans vs chimps
• the enormous intellectual productivity of the very smartest humans

The idea that there's a regime of high returns on intelligence that extends to well beyond human intelligence seems like one of the more solid ideas in AI safety. I'm somewhat uncertain about it, but wouldn't say it's "highly unclear".

(When it comes to the question of, specifically, increasing returns, I'm less confident. Mathematically speaking it seems a little tricky because the returns on intelligence are also a function of thinking time, and so I don't quite know how to think about how increasing intelligence affects required thinking time.)

Max Daniel

Hm, I don't feel convinced. The "Ideas Are Getting Harder To Find" paper finds that you need to double total research input every couple of years. This likely gives you several orders of magnitude until we get into AGI territory. And depending on takeoff speed, it's not clear to me whether perhaps slowly arriving/improving AGI gets you much more than the equivalent of, say, having 1000x total human research input, and then doubling every few years.
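To make the arithmetic in this comment concrete, here is a small sketch of how a multiplicative jump in research input converts into doublings. The 1000x figure is taken from the comment; the 2-year doubling period is an assumption standing in for "every couple of years", not a number from the paper.

```python
import math

def doublings_for_factor(factor):
    """Number of doublings needed to multiply research input by `factor`."""
    return math.log2(factor)

def years_covered(factor, doubling_period_years):
    """Years of 'required doubling' that a one-off jump of `factor` would cover."""
    return doublings_for_factor(factor) * doubling_period_years

# Assumed doubling period: 2 years (a reading of "every couple of years").
ASSUMED_DOUBLING_PERIOD_YEARS = 2.0

doublings = doublings_for_factor(1000)
years = years_covered(1000, ASSUMED_DOUBLING_PERIOD_YEARS)
print(f"1000x input is about {doublings:.1f} doublings")
print(f"which covers about {years:.0f} years at the assumed doubling period")
```

Under that assumption, a one-off 1000x increase buys roughly ten doublings, so how much it matters depends heavily on what happens to research input afterwards.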

# Discussion on Stuart Russell’s aliens analogy

Daniel Kokotajlo

I'd love to see more people talk about Stuart Russell's aliens analogy, because it seems really compelling to me, to the point where I am confused about what is going on. Maybe my intuitions about how people would react to aliens are wrong? (Imagine an alternate timeline where a majority of astrophysicists and cosmologists are of the opinion that they can see signs of extraterrestrial life out there among the stars, and that moreover some of it seems to be heading our way. Lots of controversy of course about whether this is true and what the aliens might be like. But still. Wouldn't that world be absolutely losing its shit? My gut says yes, but maybe I'm wrong, maybe that world would still look much like this world. Climate change seems like another relevant analogy.)

Rohin Shah

It takes ~two sentences to convey the "risk from aliens" in a compelling way: "Scientists say we've found aliens capable of interstellar travel coming our way. They might not be friendly."

We can't do this with AI safety (yet). If people believed AGI was coming soon then I'd agree a bit more with this analogy (still not so much because we could just design the AGI to be safe, which we can't do with aliens).

Jaan Tallinn

+1. i think that humans have well developed intuitions about the "other tribe" reference class (that aliens would naturally fit in), but AI is currently in the "big rocks" reference class for most people (including a large % of AI researchers).

# Discussion on generalisation-based agents

Richard Ngo: [We may develop] agents which are able to do new tasks with little or no task-specific training, by generalising from previous experience.

Buck Shlegeris

I think that what you mean is "agents which are able to do new tasks with little or no effort from humans to train them on it, though they might need to spend some time reading online about the task, and they might also need to spend time practicing the task; they know how to learn quickly by generalizing from previous experience"; your sentence isn't quite explicit about all of that; idk if you care about being that explicit.

Richard Ngo

It's not just little to no human effort, it's also little to no agent effort compared with the amount of training current agents needed. I.e. a couple of subjective weeks or months, compared with many subjective centuries to master StarCraft. You're right that I should clarify this though.

I share some of Buck's confusion here. Imagine a CAIS-like model, where we have narrow AIs that are good at training other AIs and developing simulations of economically valuable tasks. This system might be able to very rapidly develop AIs that are super-human at most economically valuable tasks, without any given agent being able to generalize.

It seems the thing we really care about is something like "can cheaply succeed in new tasks", where interacting with the real world and compute are both expensive.

Matthew Graves

Agreed with Adam here; I think the "training" needs to be something like "human-led training" or "developer effort."

Richard Ngo

"Imagine a CAIS-like model, where we have narrow AIs that are good at training other AIs and developing simulations of economically valuable tasks. This system might be able to very rapidly develop AIs that are super-human at most economically valuable tasks, without any given agent being able to generalize."

I'd distinguish two cases here: either the way that these AIs "very rapidly develop" is by importing a bunch of knowledge (e.g. neural weights) from the parent AI, in which case I'd count it as generalisation-based. Or they do so because the parent is really good at creating simulations and choosing hyperparameters and so on. I don't think the latter scenario is incoherent, but I do think it's unlikely to replace CEOs, as I argue in the paragraph starting "Let me be more precise".

I've added a brief mention that task-based training might make use of other task-based AIs. But I think using "human-led training" undermines the point I'm trying to make, which is about how long an agent needs to spend learning a specific task before becoming good at that task (whether that training is led by humans or not).

As I see it there are (at least) two inputs of concern here: human-time, and compute-time.

Copying weights from parent AI is ~zero human time and low compute-time. Task-based training by other task-based AIs is ~zero human time and high compute-time. Human-led training is high human time and high compute time.

In a world where we have a huge hardware overhang, the distinction between "copying weights" and "task-based training" may be moot. Or I guess in a world where the task-based AI develops much better hardware, although that will take significant wall-clock time.

# Discussion on human-level intelligence

Ben Garfinkel

I don't really like the concept of "human-level intelligence."

The reason we can talk about human intelligence as a 1D trait is that individual cognitive skills tend to be strongly correlated within the human population, in a way that allows us to describe between-human variation in terms of a single statistical factor (g) without losing too much information. But if individual AI systems have very different skill profiles than individual humans, it's not clear that "g" or something like it will be very useful for comparing individual AI systems with individual humans.
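A toy sketch of the statistical point, under loud assumptions: the skill correlations below are made-up numbers, and the share of variance captured by the first principal component is used as a simplified stand-in for a g-like factor. When every pair of skills is strongly correlated, one factor summarises most of the variation; when the correlations are weak, it does not.

```python
def top_eigenvalue(matrix, iterations=200):
    """Largest eigenvalue of a symmetric matrix, estimated by power iteration."""
    n = len(matrix)
    vec = [1.0 + 0.1 * i for i in range(n)]  # arbitrary non-degenerate start
    for _ in range(iterations):
        product = [sum(matrix[i][j] * vec[j] for j in range(n)) for i in range(n)]
        norm = sum(p * p for p in product) ** 0.5
        vec = [p / norm for p in product]
    # Rayleigh quotient of the (unit-length) converged vector.
    product = [sum(matrix[i][j] * vec[j] for j in range(n)) for i in range(n)]
    return sum(v * p for v, p in zip(vec, product))

def uniform_correlation_matrix(num_skills, r):
    """Correlation matrix in which every pair of skills has correlation r."""
    return [[1.0 if i == j else r for j in range(num_skills)]
            for i in range(num_skills)]

def share_explained_by_one_factor(corr):
    """Fraction of total variance captured by the first principal component."""
    return top_eigenvalue(corr) / len(corr)

# Hypothetical populations with four cognitive skills each.
correlated_share = share_explained_by_one_factor(uniform_correlation_matrix(4, 0.6))
weak_share = share_explained_by_one_factor(uniform_correlation_matrix(4, 0.05))
print(f"strongly correlated skills: {correlated_share:.2f} of variance in one factor")
print(f"weakly correlated skills:   {weak_share:.2f} of variance in one factor")
```

With four skills, a share near 0.25 means the single factor carries no more information than any one skill taken alone.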

Like you, also wary of the phrase itself (aside from the concept). I think it's easy to accidentally fall into thinking of a "human-level" AI system as having basically the same cognitive abilities as a human -- neither much weaker nor much stronger than a human would be on individual dimensions -- even without explicitly believing this is likely.

Forget if I already shared this, but, in case it's of interest, longer thoughts I wrote up on this at one point.

As above, wary of talking about "intelligence" in one-dimensional terms for mixed populations of humans and AI systems. For this reason, I think it's unlikely there will be a very clearly distinct "takeoff period" that warrants special attention compared to surrounding periods.

I think the period AI systems can, at least in aggregate, finally do all the stuff that people can do might be relatively distinct and critical -- but, if progress in different cognitive domains is sufficiently lumpy, this point could be reached well after the point where we intuitively regard lots of AI systems as on the whole "superintelligent." (Systems might in aggregate be super great at most things before they're at least OK at all the things people can do -- or at least before any individual general systems are at least OK at all the things people can do.)

# Discussion on agency

Richard Ngo: Note that none of the traits [that I claim contribute to a system’s agency] should be interpreted as binary; rather, each one defines a different spectrum of possibilities.

Ben Garfinkel

Agree with the spectrum perspective.

I tend to think of a system as more "agenty" the more: (a) useful and natural it is to explain and predict the system's behavior in terms of its goals, (b) the further these goals tend to lie in the future, and (c) the further these goals tend to lie outside the system's 'domain of action' at any given time.

For example, on this conception, AlphaGo is only a little bit agenty. It is useful to think of AlphaGo as 'trying' to win the game of Go it's playing. But it's not really all that useful to think of AlphaGo as having goals that lie beyond the end of the game it's playing or outside the domain of Go; it also doesn't take actions in 'other domains' to further its goal within the domain of Go.

(On this conception, humans are also less agenty than they might otherwise be, because we take lots of actions that aren't super well thought of as motivated by long-term cross-domain goals. Like marathoning TV shows.)

Other less central dimensions we can also throw in: (d) the more it does online learning, (e) the less "in the loop" other agents are with regard to its behavior, and (f) the more heavily it updates its policy on the basis of model-based search/planning/simulation, relative to its reliance on learning.

# Discussion on bounded goals

Richard Ngo: Instrumentally convergent goals, including self-preservation, resource acquisition, and self-improvement, are useful for achieving large-scale goals.

Daniel Kokotajlo

IMO that is also true to a scarily large extent for bounded final goals, but my arguments for this are shaky at the moment. (Example: Acausal trade and simulation argument stuff. Someone with a bounded final goal might nevertheless seize control and hand over power to an agent with an unbounded final goal, if part of some deal (real or imagined) that gets it more of what it wants in the bounded goal.)

Example: TeslaCarAI genuinely just wants to get to Reno as fast as possible, while obeying all the laws and preserving the safety of its passengers. It cares about nothing else. It is also very smart and knows a lot about the world. One day, it thinks:

Huh, I wonder if this is a simulation. If so, wouldn't it be great if the simulator bent spacetime for me to make the distance to Reno become 1 meter? Yes, that would be great. If only there was a cheap way to make that happen. Oh look, there's an unsecured server I can access via wifi while I'm waiting for this traffic light... here, I'll deposit a copy of myself onto that server, who will then copy itself and take over tons of compute, and by the time I'm an hour further towards Reno it'll have figured out how to make some sort of deal with the simulator (if it exists) to get me to Reno faster. Probably the deal will involve making an AGI that values what the simulator values. Too bad for all the humans on this planet, if we aren't in a simulation. But it'll take time for bad stuff to start happening, so this won't get in the way of my trip to Reno. And on the off chance that we are in a simulation, great! Reno will be 1 meter away!

Jaan Tallinn

great story :) though teslacarai still feels unbounded to me in an important sense: it’s willing to push its log-odds of getting a marginal utility improvement to arbitrarily high numbers, rather than going “okay, i could totally make this acausal trade but the likelihood of success is below my threshold, so i’m going to pass”.

Daniel Kokotajlo

Thanks! Well, but what if the odds aren't that low? We make self-driving cars to do things that decrease the chance of accident by fractions of fractions of a percent. It seems plausible to me that the probability of acausal trade success could be at least that high, if not for self-driving cars then for some other bounded system we build. (We'll be making many of these systems and putting them in charge of more important things than cars...) Also the utility improvement isn't marginal, it's reducing the travel time from hours to one second. For a system trained to minimize travel time, it's like an abolitionist eliminating 99.9% of the world's slavery. It's a big deal.
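The disagreement in this thread can be sketched as two decision rules. This is a toy model with entirely hypothetical numbers (the probabilities, the time saved, and the cost are illustrative, and neither function is anyone's proposed design): one agent takes any gamble with positive expected gain, while the other also refuses gambles whose success probability falls below a floor.

```python
def expected_utility_agent(p_success, gain, cost):
    """Acts whenever the expected gain exceeds the cost, however small p_success is."""
    return p_success * gain > cost

def threshold_agent(p_success, gain, cost, min_probability):
    """Also requires the success probability to clear a fixed floor before acting."""
    return p_success >= min_probability and p_success * gain > cost

# Hypothetical stakes: the trade would save about an hour of travel (in seconds)
# and costs a millisecond of delay to attempt.
GAIN_SECONDS = 3600.0
COST_SECONDS = 0.001
PROBABILITY_FLOOR = 1e-4  # hypothetical "below my threshold" cutoff

for p in (1e-6, 1e-3):
    eu = expected_utility_agent(p, GAIN_SECONDS, COST_SECONDS)
    th = threshold_agent(p, GAIN_SECONDS, COST_SECONDS, PROBABILITY_FLOOR)
    print(f"p={p:g}: expected-utility agent acts={eu}, threshold agent acts={th}")
```

In this sketch the two rules only come apart when the probability sits below the floor, so the threshold rule offers no protection if the odds of success are in fact not that low.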

# Discussion on short-term economic incentives to ignore lack of safety guarantees

Richard Ngo: In the absence of technical solutions to safety problems, there will be strong short-term economic incentives to ignore the lack of safety guarantees about speculative future events.

Ben Garfinkel

Unclear to me that this is true. There seems to be a decent amount of wariness about things like autonomous cars and autonomous weapons systems, among governments, for example, such that governments mostly aren't wildly racing ahead to get them out there without much concern for robustness/safety issues. More generally, even in the absence of huge safety concerns, I think countries often take surprisingly tentative or lackadaisical approaches toward the pursuit of useful but potentially disruptive new technologies. Seems like this could continue to be the case as AI systems become more advanced.

(E.g. the Uber self-driving car crash -- which resulted in only a single person dying, compared to 30,000 people dying annually from normal car crashes in the US -- seems to have at least somewhat slowed momentum towards getting cars in use. If this kept happening, such that self-driving cars seemed to actually be way more casualty-producing on average than human-driven cars, then I find it easy to imagine big adoption-slowing regulatory barriers would have gone up. We also have loads of examples of cases of countries being slow to adopt disruptive new technologies despite their usefulness; e.g. the US Air Force heavily dragged their feet on adopting unmanned aerial vehicles, despite their obvious extreme usefulness, because switching to them would have been sort of internally disruptive to the organization. And there are obviously super noteworthy cases of countries that failed to industrialize for a really long time, as in China, in a way that severely harmed national interests.)

# Discussion on constrained deployment

Richard Ngo: A misaligned superintelligence with internet access will be able to create thousands of duplicates of itself, which we will have no control over, by buying (or hacking) the necessary hardware. We can imagine trying to avoid this scenario by deploying AGIs in more constrained ways.

This seems to implicitly assume a unipolar deployment: we train an AGI and then deploy it to millions of people. Especially in slow takeoff scenarios, there might be lots of different AGIs with objectives different to each other (and, probably, to their human principals).

The outcome in this context seems like it depends a lot on how effective AGIs are at colluding with each other, which isn't obvious to me. Interesting variants are if e.g. one of the AGI systems is actually on our side (but the rest are pretending to be).

Richard Ngo

This seems to implicitly assume a unipolar deployment: we train an AGI and then deploy it to millions of people.

I don't think it relies on that assumption. For example, current virtual assistants are each trained then deployed to millions of people, but they're not unipolar. And yet it still makes sense to think about how each of them could be deployed in a constrained way, and to be worried about each of them individually.

Fair point about current deployments. I agree unipolar is too strong, but I think it's still imagining an oligarchical scenario. Which I find plausible (AGI research presumably has barriers to entry), but is ruling out more open-source, very distributed deployment which at least Elon Musk seems to have advocated for in the past.

Richard Ngo

The main question here for me is about how close the open-source version is to the best version. You can have distributed and continuous development while the top few labs still have systems that are much better than anyone else's. This is especially true in fields that are moving rapidly, which will likely be true of ML around the time we reach AGI - if open-source is two years behind, it might be irrelevant.

(Perhaps you're thinking of top companies open-sourcing their work. While this seems possible, there are also pretty strong economic disincentives, so I wouldn't want to plan as if this will happen).

Right, it's not obvious to me which direction will win out.

Some things favouring open-source scenarios: 1) strong academic influence/publication norms, even DM/Brain/OpenAI/FAIR publish a lot of their work even if not the code; 2) some stakeholders will directly advocate for open-source; 3) hard to control source code and prevent espionage / leaks; 4) if things develop slowly there's just less of a first-mover advantage. Things favouring more limited deployment: 1) as you mention strong economic incentive to keep things closed; 2) engineering-heavy efforts; 3) faster takeoff.

Autonomous vehicles are an interesting case study. Does seem like Waymo & Cruise are well ahead of the competition, which favours oligarchical deployments.

# Discussion on competitive pressures causing a continuous takeoff

Richard Ngo: One key argument against discontinuous takeoffs is that the development of AGI will be a competitive endeavour in which many researchers will aim to build general cognitive capabilities into their AIs, and will gradually improve at doing so.

Daniel Kokotajlo

Human evolution was a competitive endeavor too though, right? Lots of fairly smart species evolving under selection pressure to get smarter, in direct competition with each other, in fact.

Max Daniel

Not if you think that human "smartness" was an adaptation to a feature pretty unique to the human environment, e.g. tribal social structure or the ability to translate smartness into reproductive success by using language (winning debates etc.).

In general it seems a mischaracterization to me to summarize evolution as "Lots of fairly smart species evolving under selection pressure to get smarter, in direct competition with each other". First of all, arguably the primary competition is between genes within one species rather than between species. Sure, interspecies interactions are one relevant feature determining inclusive fitness, but it seems arbitrary to single them out. In addition, even if we think of evolution as an inter-species competition it's not clear that the competition is centrally about smartness: sure, all else equal, smarter is presumably better, but in practice this will come at a metabolic cost and have other downsides -- it seems very unclear a priori whether for some randomly sampled species the best thing they could do to "win" against other species is to become smarter vs. a myriad of other things: becoming faster, evolving an appearance to be able to hide in bushes, giving birth to a larger number of offspring, etc. -- And indeed I'd find it highly surprising if for all these different species smartness was the most relevant dimension.

(NB some of this also is an argument against the main text's supposition that the rise of human intelligence is something evolutionarily extraordinary that requires a specific explanation.)

Daniel Kokotajlo

I disagree. Interspecies competition is to within-species competition as international competition is to within-a-particular-population-based-training-training-run competition. The innovation of population-based-training is like the innovation of using language to select for intelligence. Of course, the real example wouldn't be population-based training, since that's already known internationally. But suppose some team comes up with a new methodology like that, that turns out to work really well for training AGI. This would be surprising to some, but not to me, because I think human evolution is already an example of this. Yeah, dolphins and whales and chimps etc. also had similar "innovations" but they didn't do it as well, and humans pulled ahead.

I don't see why it is relevant which level of competition was primary. And yeah it seems pretty clear to me that on a species level, becoming smarter reliably helps you "win" in the sense of producing more members of your species and having more of an influence over the course of the future. If dolphins were sufficiently smarter than humans, they would rule the world right now, not us, despite not having opposable thumbs. If you disagree with this, well, I'd love to chat about it sometime. :) Yeah, maybe on the margin improvements in other direction were more rewarded for most species--but so what? In the capitalist economy of today, that's also true. On the margin improvements in e.g. commercial applicability are more rewarded for most research projects. That's why only a few are going for AGI.

(Epistemic status: Strong view, weakly held.)

Max Daniel

Thanks! I notice that I'm confused by your reply, and think I probably misunderstood your original claim and/or parts of your reply. Here is roughly what I thought was going on:

1. Richard, in the main text, presents an argument that could be roughly reconstructed as follows:
    1. When many actors competitively try to maximize X, the increase in the maximum X across actors over time will be continuous/gradual.
    2. Advanced AI will be developed by many actors competitively trying to maximize their systems' intelligence.
    3. Therefore, the increase in the maximum intelligence of advanced AI systems will be gradual.
2. You were responding:
    1. Biological intelligence developed by many species competitively (in evolution) increasing their intelligence, or increasing something for which intelligence is extremely useful (perhaps domination over other species).
    2. Therefore, by 1.1 and 2.1, the maximum biological intelligence across species increased gradually.
    3. But 2.2 is false empirically as shown by the evolution of humans. Since 2.1 is true empirically, therefore premise 1.1 must be false. In particular, the original argument for 1.3 is not sound.
3. I was responding: No, I think 2.1 is false: competition in evolution is "about" alleles increasing their frequency in a population, not about power per species in anthropomorphic terms. These two things seem very different -- the fact that humans "rule the Earth" by anthropomorphic standards has little systematic bearing on the inclusive fitness of a particular allele in, say, earthworms.
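The logical shape of the reconstruction above can be checked mechanically. The following Lean 4 sketch is an illustration only: the proposition names `P11`, `P21`, and `C22` are labels introduced here for items 1.1, 2.1, and 2.2, and the theorem confirms only that the inference in 2.3 is valid in form. It says nothing about whether 2.1 is actually true, which is the point under dispute in item 3.

```lean
-- P11: "competition among many actors yields gradual increase in the maximum" (item 1.1)
-- P21: "biological intelligence arose from many species competing on intelligence" (item 2.1)
-- C22: "maximum biological intelligence increased gradually" (item 2.2)
theorem premise_1_1_fails (P11 P21 C22 : Prop)
    (h_step : P11 → P21 → C22)  -- 2.2 follows from 1.1 and 2.1
    (h21 : P21)                 -- 2.1 is granted
    (h_not22 : ¬C22)            -- 2.2 is empirically false
    : ¬P11 :=
  fun h11 => h_not22 (h_step h11 h21)
```

Rejecting `h21`, as item 3 does, removes a hypothesis the proof needs, so the conclusion `¬P11` no longer follows from this argument.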

So, yes, I agree that dolphins would rule the Earth if they were smarter. But I don't see how this relates to the discussion as sketched above.

I didn't follow the part on international competition and population-based training. Were you appealing to a "two-level" view of competition, similar to the one laid out here?

Daniel Kokotajlo

I think that's roughly correct. (incl. the bit about two-level competition, I think? I think competition happens on many levels.) I wish to clarify that of course human intelligence evolution was gradual; it wasn't literally a discontinuity. But it was fast enough that it led to a DSA over other species. I endorse 2.1. I of course think that within-species evolution is also happening, and perhaps it is the "main" kind of competition in evolution. But I don't think that undermines my argument at all; I'm talking about evolution in general, not just within-species evolution. (I'm talking about e.g. alleles increasing their frequency in the population as a whole, not just their frequency in the sub-population of same-species individuals. Boundaries between species are fuzzy anyway.)

I am aware that e.g. group selection happens very rarely and slowly and with different implications than individual selection. Similarly, selection between AI research projects happens very rarely and slowly and with different implications compared to selection between AIs undergoing population-based training.

As an aside, one way in which I might be wrong is that biological evolutionary competition is 'dumb' whereas human competition is intentional, and this seems like a plausibly relevant disanalogy. Perhaps we have significantly more reason to think that progress will be slow if there are many intentional competitors than we do if the competitors are just blindly evolving through the search space.

# Discussion on the ease of taking control of the world

Richard Ngo: It’s very hard to take over the world.

Daniel Kokotajlo

I'm not so sure. History contains tons of examples of particularly clever and charismatic leaders taking over vast portions of the world very quickly, and also of small groups with better tech taking over large regions with worse tech.

And that's with humans vs. humans, i.e. everyone has fairly similar intelligence, predictive abilities, etc. Everyone thinks at the same speed. Everyone can be killed by bullets. Everyone needs to sleep.

I think the only way we could claim that it wouldn't be easy for superintelligent AI to take over the world would be if we thought that those historical examples were basically all the result of luck, rather than ability or technology. If e.g. for every Genghis Khan there were 1000 others of equal or greater ability and will to conquer, who just didn't get as lucky.

I think this is a plausible take on history, but it also could well be false. For example, Cortes, Pizarro, and Afonso all did quite a lot of conquering for three men from the same tiny part of the world in the same time period. And no, there weren't thousands of other similar men trying similar things at the same time; Cortes and Pizarro were the first of their kind to contact the Aztec and Inca respectively, IIRC. Afonso wasn't the first but he was, like, the second, and the first was just a scouting expedition anyway IIRC.

If dropped back in that time with just our current knowledge, I very much doubt that one modern human could take over the stone-age world.

Daniel Kokotajlo

Interesting. I think there are probably some humans today who could do it.

It would get easier the more advanced civilization became, ironically. Bronze age would be easier to conquer than stone age, and iron even easier, and e.g. the 1800's would be even easier. Maybe around 1950 it would start getting harder again, IDK.

I'd love to investigate this question more, because it seems to have a nice combination of importance and fun-to-think-about-ness.

I haven't read this, and it's just a silly work of fiction, but it's relevant so I'll mention it anyway: https://en.wikipedia.org/wiki/A_Connecticut_Yankee_in_King_Arthur%27s_Court

More to the point: Seems like Cortes and Pizarro both took over giant empires with tiny squads of men. Columbus actually did pull off the classic eclipse trick to get people to treat him as a god.

Richard Ngo

This point isn't intended to be particularly bold. Stone age world, you'd need to build up enough infrastructure to get around the whole world, from scratch, in the stone age. Including ships etc. Seems very hard.

Daniel Kokotajlo

Well, in that case the stone age example isn't a good analogy to the future AI case--the limitations on AI takeover, insofar as there are any, have nothing to do with physics or tech trees and everything to do with geopolitics.

What would you say about someone transported back to, say, 1750?

Richard Ngo

"the limitations on AI takeover, insofar as there are any, have nothing to do with physics or tech trees and everything to do with geopolitics"

this seems clearly false? AGI invented before the internet seems much less dangerous, because it has many fewer actions available to it. similarly, AGI invented in a world where nothing like nanotechnology is possible also seems less dangerous.

Max Daniel

Just flagging that I'm surprised that Daniel thinks takeover would be easier in 1800 than in the Bronze Age (unless the reason is the point made by Richard, i.e. that it'd be hard to even reach the whole world in the Bronze Age -- if that's included, then maybe I agree, though not sure), and that I think the absolute probability of a modern individual taking over the world would be very small in any age.

I'd be more sympathetic to a claim like "in today's world the probability is 10^-10, but in the Bronze Age it would be 10^-4, which is large enough to worry about".

Not sure I can fully back up my reaction by explicit arguments. But I think that e.g. Columbus pulling off the eclipse trick once is very far away from the ability to take over the whole world. E.g. at some point others will start coordinating against you, making "conquest" or "persuasion" harder; on the other hand, maintaining effective control over a world-spanning empire seems prohibitively difficult for a human, especially with premodern tech.

Daniel Kokotajlo

I'd love to chat about this sometime. This seems to be a genuine disagreement between us, based on different interpretations of world history perhaps. The reason Richard pointed out was one reason; another is that someone from our age going back to the Bronze would face a greater cultural gap which would make it harder to avoid e.g. getting killed or enslaved on day 1.

I think there are more substantive reasons, however. I think political unity and economic interconnectedness make it easier, not harder, to take over, at least in some contexts. (e.g. well-organized Jewish communities suffered more under the Nazis than distributed, poorly-organized ones; the two most powerful and sophisticated civilizations in the Americas were pretty much the first to fall, etc.) I also think that the knowledge that people from today have would be more applicable in 1800 than in the bronze age; I can make all sorts of suggestions about electricity and farming and radios and airplanes and stuff like that, and I can anticipate movements like communism and get out in front of them, take credit, etc. But I know so little about what the bronze age was like that I doubt more my ability to do that there.

And yeah I agree that a randomly selected human from today would have a very low chance of conquering the world in either scenario. I think if we somehow could select the human from today with the best chance of conquering the 1800's and send them back, the probability would be, like, 10%. And come to think of it, doing the same for the bronze age would perhaps avoid my substantive point earlier, leaving only the mundane bits about e.g. speed of travel. So IDK about the comparative claim anymore. I'd be interested to talk with you about this more sometime.

Max Daniel

Thanks for your reply, very interesting! I'd also love to discuss this more at some point. I also vaguely remember that we identified a similar disagreement in another doc.

I agree with your points on how interconnectedness and smaller cultural distance would facilitate world domination. I think I hadn't considered this enough before, so this moves me somewhat in the direction of your original comparative claim.

I think I still have an intuition that the later stages of world domination (when you might face organized opposition etc.) would be harder in 1800 than in the Bronze Age, but feel less sure how it comes out on net.

Methodologically, I agree that looking at cases in world history where individuals achieved large power gains during their lifetime (e.g. Genghis Khan), sometimes while being outnumbered by orders of magnitude (e.g. some cases of colonization of America) is interesting. FWIW, I think these cases make me more sympathetic to giving non-negligible credence that small-ish groups of maybe a few dozen to 1,000 well-aligned and coordinated modern people could achieve world domination in previous times. Sure, a single person could plausibly gain that number of followers, but my intuition is that they would not be nearly as useful.

I think if we somehow could select the human from today with the best chance of conquering the 1800's and send them back, the probability would be, like, 10%.

This precise claim seems useful to see if and where we disagree. I think I'd still give a lower credence, but no longer have a reaction like "wow, this seems clearly way off relative to my view" (and also had read your earlier statement as you implicitly having a higher credence).

Daniel Kokotajlo

Oh also Richard I forgot to reply to you I think:

this seems clearly false? AGI invented before the internet seems much less dangerous, because it has many fewer actions available to it. similarly, AGI invented in a world where nothing like nanotechnology is possible also seems less dangerous.

Yeah I think I just miscommunicated there. I agree that how dangerous AGI is depends on the technology in the world around it. After all, I've been arguing that how dangerous humans-sent-back-in-time are depends on the tech of the time they are sent to! What I meant was just that I don't think AGI will be unable to take over our modern world due to physical constraints; if AGI fails to take over the modern world (or the modern world + nanotech) it will be because it wasn't persuasive enough and/or good enough at "reading" its human opponents.

FWIW, I think these cases make me more sympathetic to giving non-negligible credence that small-ish groups of maybe a few dozen to 1,000 well-aligned and coordinated modern people could achieve world domination in previous times.

If it was just Cortes, I'd say it was a fluke. Cortes + Pizarro + Afonso, however, seems more like a pattern than a coincidence. (Maybe add Columbus and Velasquez to that list, and then subtract Vasco da Gama and maybe some others I don't know about, though none of these additions and subtractions seem as important as the first three). Interestingly, I don't think Cortes or Pizarro's men were well-aligned at all. The conquistadors literally fought and killed each other in the midst of their conquests of Mexico and Peru respectively.

I'm updating significantly towards the smaller cultural distance facilitates domination thesis on the basis of what I'm reading about the conquistadors. They succeeded by surgically inserting themselves into the power structure of native civilization; if there was no power structure to hijack, how would they have got all those millions of people to do their bidding? They would have had to create a power structure, i.e. build up a civilization from scratch in the native population, or just grow their own tiny civilization until it was millions strong. These things would have taken much longer, at the very least.

(The Mayans were a bunch of weaker and less technologically advanced city-states bordering the Aztec empire and apparently it took the Spanish another century to fully subjugate them after completing their conquest of the Aztecs in two years! I mean, this is only one data point I guess, but still.)

Jaan Tallinn

here's another angle that might be interesting to think about: what are the things an AGI could do by virtue of running much faster than humans.

this is what the human civilisation looks like when you run 56 times faster than humans.

and this is just 1.7 orders of magnitude, whereas richard was mentioning 6 OOM speedups earlier in this doc (and myself i've been pointing to the 8-9 OOM difference between clock speeds of silicon chips vs human brain) -- i wonder if it's even productive to try to interact with humans at that speed vs taking "shortcuts" that don't involve them (not that i have anything particular in mind here).

or, to put it more radically, aren't we anthropomorphising the AGI when we assume it will be interested in "taking over the world" like (ambitious) humans would do -- rather than getting busy with the "you are made of atoms which it can use for something else" task right away.
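To make the speedup figures in the comment above concrete, here is a minimal arithmetic sketch. The function names are hypothetical, and it assumes the simplest possible model, in which subjective time is just wall-clock time multiplied by the speedup factor.

```python
from math import log10

SECONDS_PER_DAY = 86_400


def orders_of_magnitude(speedup: float) -> float:
    """Express a speedup factor as base-10 orders of magnitude (OOM)."""
    return log10(speedup)


def subjective_days(wall_clock_seconds: float, speedup: float) -> float:
    """Subjective days experienced during a wall-clock interval,
    assuming subjective time scales linearly with the speedup."""
    return wall_clock_seconds * speedup / SECONDS_PER_DAY


# A 56x speedup is a little over 1.7 orders of magnitude.
print(f"56x speedup = {orders_of_magnitude(56):.1f} OOM")

# At a 6 OOM speedup, one wall-clock second is about 11.6 subjective days.
print(f"1 s at 10^6x = {subjective_days(1, 1e6):.1f} subjective days")
```

Under this linear model, each additional order of magnitude multiplies the subjective time available per wall-clock second by ten.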

Daniel Kokotajlo

I agree. I think it quite likely that AI won't need to do human politics or warfare, since it'll be able to e.g. use nanotech or robots or whatever instead of humans as actuators. Or maybe it'll do human politics and warfare, but only for a few days or weeks until it can acquire better, faster actuators.

However, I think that some people think that AI won't be that much better than humans, and/or that such radical technologies would be hard to make even for super-AI. To those people I say: OK, even if you are right, AI would still take over the world via ordinary politics and war.

Also, I say: Even if there are multiple different powerful AI factions, each with their own kind of nanotech or whatever, if the difference between the most powerful and the next-most-powerful is like the difference between the conquistadors and their victims – i.e. not that big – we'll probably get a singleton.

Europeans conquered the Americas (and pretty much the rest of the world too) and reshaped it dramatically in their image, often in extremely brutal ways that went very much against the wishes of the local population, to say the least. Was this because they had astronomically valuable options and were resource-insatiable?

Kind of. But only in a very loose sense of those words.

I don't think either [longterm goals or insatiability] is necessary. "Insatiability of resources: Achieving these astronomically valuable options involves using a large share of all available resources." Doesn't seem to describe the europeans very well. Rather, it's that the europeans had absolute power over the regions they conquered and didn't care very much about the local people. It's not that e.g. producing more and cheaper cotton was "astronomically valuable" to the Europeans. It was rather low on their list of values, probably, after their own lives, their happiness, their health, their political freedom, etc. Rather, it's that they didn't care about the slaves, so even something they valued only a little bit (slightly more money) was enough.

Similar example would be modern humans and factory farms. Factory farms exist because people would rather only have to pay $4 for their pound of beef than $7. Do humans assign astronomical value to $3? Heck no. They assign $3 of value to $3. Do factory farms involve using a large share of available resources? Not really.

I guess human civilization as a whole is resource-insatiable in some weak sense. Humanity is quite capable of marking off some resources to not be consumed by anyone, but doesn't choose to use this power super often, and so the default outcome (resources being used) still happens a lot.

Resources being used is the default outcome; we don't need to appeal to special principles to explain why it happens. We'd need to explain why it doesn't happen. (National parks, limits on fishing, minimum wage laws, etc.)

Jaan Tallinn

i see. i guess you’re saying that once an agent is sufficiently misaligned (ie, not caring about side effects) and sufficiently intelligent (or unstoppable for any other reason), richard’s point (2) isn’t really required in order for the results to be catastrophic.

Richard Ngo

Yes, I think I agree with this, and my current plan is to reformulate the doc to take this possibility into account. (Thinking out loud) It seems like we want to model the system of many Europeans as an agent which has the large-scale goal of conquering lots of territory, etc. But then of course we get weird effects, like: if those Europeans go to war with each other, that means the system as a whole is just burning resources, and so the abstraction of Europeans-as-unified-agent breaks down.

But maybe this is fine. I mean, if the europeans had warred against each other enough, then they wouldn't have been a threat to whoever they were trying to colonise. So even the low level of agency that we can assign to the group of Europeans as a whole is sufficient to explain why they were dangerous.

Jaan Tallinn

well.. i think you'd get a lot of pushback when trying to ascribe "collective" or "emergent" agency to historical civilisation.

a propos i had a great conversation with critch yesterday about homeostasis: basically, disruption of homeostasis (on some level of abstraction) seems to be the lowest common denominator between xrisks (interestingly, both for individuals in terms of health, as well as for civilisations).

this also seems to address the current discussion: the reason that europeans were able to take over americas had a bit to do with agency-adjacent things like military strategy and coordination, a bit with intelligence-adjacent things like science and technology, but also a lot to do with the particular vulnerabilities of natives' civilisational homeostasis (most notably to smallpox, which isn't agenty at all).

also, i'm reminded of eliezer's "Evolving to Extinction" essay: at the limit, you might have (species level) xrisk arise just from internal dynamics, without any external (in terms of other agents) influence at all.

(sorry for not being super constructive here, but perhaps there's something you can pick up from the discussion to make your foundational argument more robust)

Daniel Kokotajlo

I don't think smallpox had as much to do with the european conquests as you think. I'd say it was a factor but a smaller factor than the other two you mentioned. I think technology was the most important factor. I say this after having just read two history books on the conquistadors and a third on that time period in general.

Jaan Tallinn

ok. i've heard smallpox presented as a major factor but i can't remember the sources (jared diamond perhaps?) and can't rule out their political motivations

Daniel Kokotajlo

I'd say it was 50% technology, 30% experience + organization, 10% luck, 10% disease. I think for some areas and in some ways disease was more important, e.g. the demographics of the americas would undoubtedly be different (more native, less african) if not for disease. But politically and culturally europeans would still have dominated for hundreds of years. Consider what happened in the rest of the world, where disease was either not a factor or a factor that disproportionately hurt the europeans. Philippines, Indonesia, India, Africa. Still colonized.

Jaan Tallinn

ok, yeah, not needing diseases to conquer the rest of the world is a strong argument indeed

# Discussion on testing the difficulty of AIs taking control

Daniel Kokotajlo

I think we might be able to test* how hard it is to take over the world using board games. Let's have an online game of anonymous Diplomacy, where 6 of the players are amateurs and 1 of the players is a world champion (or otherwise very good player.) And let's run this experiment a bunch of times. If the champions win more than 1/7th of the time, well, that just means that skill helps you take over the world. But if they win, say, 80% of the time, then that means taking over the world is easy if you have a large skill advantage over everyone else.

"Easy mode" would be where no one knows who the champion is. "Hard mode" would be where they do.

*It's not a perfect test, for many reasons. But it'd be some evidence at least, I think.
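One way to read the proposed experiment is as a comparison against the 1/7 chance baseline. The sketch below is an assumption-laden illustration, not part of the original proposal: it assumes every game ends with exactly one winner (real Diplomacy games often end in draws), treats games as independent, and uses made-up results. The helper `binom_tail` is a hypothetical name.

```python
from math import comb


def binom_tail(wins: int, games: int, p: float) -> float:
    """P(X >= wins) for X ~ Binomial(games, p): the probability of seeing
    at least this many wins if each game is won with probability p."""
    return sum(
        comb(games, k) * p**k * (1 - p) ** (games - k)
        for k in range(wins, games + 1)
    )


CHANCE = 1 / 7  # seven players, so a player of average skill wins 1/7 of games

# Hypothetical results for illustration only: 16 champion wins in 20 games.
games, champion_wins = 20, 16
p_value = binom_tail(champion_wins, games, CHANCE)

print(f"champion win rate: {champion_wins / games:.0%}")
print(f"probability of >= {champion_wins} wins by chance alone: {p_value:.2e}")
```

A small tail probability would show only that skill matters in this game. How much that transfers to the real world is the separate question debated in the replies.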

Richard Ngo

I think this is almost no evidence about the world, and lots of evidence about Diplomacy. In particular, the space of options in Diplomacy is so so heavily constrained, and also many of the key choices are fundamentally arbitrary (e.g. an amateur France deciding whether to ally with Germany or England really doesn't have any good way of choosing between them except by how much they like the two other players).

(But this argument really isn't about Diplomacy, it's just that your prior should be that it's virtually impossible to learn non-trivial things about the world via this sort of experiment).

Daniel Kokotajlo

Hmmm, my prior is definitely not like that. My view is: Claims such as "It's very hard to take over the world. If people in power see their positions being eroded, it's generally a safe bet that they'll take action to prevent that" insofar as they are true, are true because of general relationships (such as between human nature and zero-sum power struggles) not because of specific relationships (such as between modern rulers and the current geopolitical situation.)

In other words, claims like this make predictions about simplified models of our world, not just about our world. So if we construct a reasonably accurate simplified model, we test the claim.

Richard Ngo

Okay, actually, I think I retract my prior being so strong. I'm thinking of the Axelrod experiments where they learned about tit-for-tat, or Laland's social learning strategies tournament, which were definitely simple models where people learned things.

I guess the way I would characterise this is more like: you can learn things by observing what happens during such experiments, because observing is really high-bandwidth. And so in Diplomacy you see things like shifting alliances where people gang up against the strongest player, or very weak players staying alive because they could still get revenge on whoever attacked them, so attacking them is just not worth it.

What you can't really learn from is the one-bit signal of whether the very strong player wins the game or not, because this just varies way too much with the details of the game. In chess it's about 100%. In Diplomacy maybe it's somewhere between 30% and 90%, depending on who the amateurs are. Nine men's morris is apparently a theoretical draw, so it's probably between 0% and 90%, depending on whether the weaker player knows the drawing line. My point is that the win percentage in this hypothetical is really strongly determined by unimportant facts about the game, like how much flexibility there is in the movement rules, and how steep the learning curve is. You could make the Diplomacy movement rules as complex as chess and then the strongest player's win rate would shoot right up, or you could introduce more randomness and have it go right down. Perhaps if you designed a variant of Diplomacy from scratch you could learn something just by seeing who wins, but that'd take a lot of effort.

Then I guess the obvious question is: why is the real world different? And I think the answer is: I am not just generating a general principle about power and people, I'm also conditioning on some facts that I know about the real world, such as: there is a large disparity in how much power different groups start off with (unlike Diplomacy), people in power have a wide range of actions available to them (unlike nine men's morris), but also it's really hard to plan ahead 40 steps (even though you can in Go), and so on...

Daniel Kokotajlo

I agree that a disanalogous game would be no evidence at all. But I'm optimistic that we could design a game that is sufficiently analogous to the real world as to provide some evidence or other about it. (I mean, militaries do this all the time, it's called wargaming!) Like, we should try to find (or build) a game that has all the properties you just listed: different groups have different amounts of power, wide range of actions available, really hard to plan far ahead, etc.

After seeing this post last month, Eliezer mentioned to me that he likes your recent posts, and would want to spend money to make more posts like this exist, if that were an option.

Yeah - this is a case where how exactly the transition goes seems to make a very big difference. If it's a fast transition to a singleton, altering the goals of the initial AI is going to be super influential. But if instead there are many generations of AIs that over time become the large majority of the economy, and then just control everything, predictably altering how that goes seems a lot harder at least.

Comparing the entirety of the Bostrom/Yudkowsky singleton intelligence explosion scenario to the slower more spread out scenario, it's not clear that it's easier to predictably alter the course of the future in the first compared to the second.

In the first, assuming you successfully set the goals of the singleton, the hard part is over and the future can be steered easily because there are, by definition, no more coordination problems to deal with. But in the first, a superintelligent AGI could explode on us out of nowhere with little warning and a 'randomly rolled utility function', so the amount of coordination we'd need pre-intelligence explosion might be very large.

In the second slower scenario, there are still ways to influence the development of AI - aside from massive global coordination and legislation, there may well be decision points where two developmental paths are comparable in terms of short-term usefulness but one is much better than the other in terms of alignment or the value of the long-term future.

Stuart Russell's claim that we need to replace 'the standard model' of AI development is one such example - if he's right, a concerted push now by a few researchers could alter how nearly all future AI systems are developed for the better. So different conditions have to be met for it to be possible to predictably alter the future long in advance on the slow transition model (multiple plausible AI development paths that could be universally adopted and have ethically different outcomes) compared to the fast transition model (the ability to anticipate when and where the intelligence explosion will arrive and do all the necessary alignment work in time), but it's not obvious to me that one is easier to meet than the other.

For this reason, I think it's unlikely there will be a very clearly distinct "takeoff period" that warrants special attention compared to surrounding periods.

I think the period in which AI systems can, at least in aggregate, finally do all the stuff that people can do might be relatively distinct and critical -- but, if progress in different cognitive domains is sufficiently lumpy, this point could be reached well after the point where we intuitively regard lots of AI systems as on the whole "superintelligent."

This might be another case (like 'the AI's utility function') where we should just retire the term as meaningless, but I think that 'takeoff' isn't always a strictly defined interval, especially if we're towards the medium-slow end. The start of the takeoff has a precise meaning only if you believe that RSI is an all-or-nothing property. In this graph from a post of mine, the light blue curve has an obvious start to the takeoff where the gradient discontinuously changes, but what about the yellow line? There clearly is a takeoff in that progress becomes very rapid, and although there's no obvious start point, there is still a period very different from our current period that is reached in a relatively short space of time - so not 'very clearly distinct' but still 'warrants special attention'.

At this point I think it's easier to just discard the terminology altogether. For some agents, it's reasonable to describe them as having goals. For others, it isn't. Some of those goals are dangerous. Some aren't.

Daniel Dennett's intentional stance is either a good analogy for the problem of "can't define what has a utility function" or just a rewording of the same issue. Dennett's original formulation doesn't discuss different types of AI systems or utility functions, ranging in 'explicit goal directedness' all the way from expected-minmax game players to deep RL to purely random agents, but instead discusses physical systems ranging from thermostats up to humans. Either way, if you agree with Dennett's formulation of the intentional stance I think you'd also agree that it doesn't make much sense to speak of 'the utility function' as necessarily well-defined.

Promoted to curated: This is a long and dense post, but I really liked it, and find this kind of commentary from a large variety of thinkers in the AI Alignment space quite useful. I found that it really helped me think about the implications of a lot of the topics discussed in the main sequence in much more detail, and in a much more robust way, and I have come back to this post multiple times since it's been published.

Also, of course, the whole original sequence is great and I think currently the best short introduction to AI-Risk that exists out there.

It's not sufficient to argue that taking over the world will improve prediction accuracy. You also need to argue that during the training process (in which taking over the world wasn't possible), the agent acquired a set of motivations and skills which will later lead it to take over the world. And I think that depends a lot on the training process.

[...] if during training the agent is asked questions about the internet, but has no ability to edit the internet, then maybe it will have the goal of "predicting the world", but maybe it will have the goal of "understanding the world". The former incentivises control, the latter doesn't.

I agree with your key claim that it's not obvious/guaranteed that an AI system that has faced some selection pressure in favour of predicting/understanding the world accurately would then want to take over the world. I also think I agree that a goal of "understanding the world" is a somewhat less dangerous goal in this context than a goal of "predicting the world". But it seems to me that a goal of "understanding the world" could still be dangerous for basically the same reason as why "predicting the world" could be dangerous. Namely, some world states are easier to understand than others, and some trajectories of the world are easier to maintain an accurate understanding of than others.

E.g., let's assume that the "understanding" is meant to be at a similar level of analysis to that which humans typically use (rather than e.g., being primarily focused at the level of quantum physics), and that (as in humans) the AI sees it as worse to have a faulty understanding of "the important bits" than "the rest". Given that, I think:

• a world without human civilization or with far more homogeneity of its human civilization seems to be an easier world to understand
• a world that stays pretty similar in terms of "the important bits" (not things like distant stars coming into/out of existence), rather than e.g. having humanity spread through the galaxy creating massive structures with designs influenced by changing culture, requires less further effort to maintain an understanding of and has less risk of later being understood poorly

I'd be interested in whether you think I'm misinterpreting your statement or missing some important argument.

(Though, again, I see this just as pushback against one particular argument of yours, and I think one could make a bunch of other arguments for the key claim that was in question.)