Senior research scholar at FHI. My current research interests are mainly the behaviour and interactions of boundedly rational agents, complex interacting systems, and strategies to influence the long-term future, with a focus on AI alignment.

Previously I was a researcher in physics, studying phase transitions, network science and complex systems.



Limits to Legibility

I don't think the intuition that since "both are huge", they are "roughly equal" is correct.

Tree search is decomposable into a specific sequence of board states, which are easily readable; in practice, trees are pruned, and can be pruned to human-readable sizes.

This isn't true for the neural net. Suppose you decompose the information in the AlphaGo net into a huge list of arithmetic: if the "arithmetic" is the whole training process, the list is much larger than in the first case; if it's just the trained net, it's less interpretable than the tree.
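As a toy illustration of the prunability point (my own sketch, not from the comment; the game, values, and pruning rule are all hypothetical): a k-pruned minimax search decomposes into an explicit, readable line of board states, which is exactly what a trained net does not give you.

```python
def search(state, depth, maximize, children, value, k=2):
    """k-pruned minimax: returns (score, line of states from root to leaf)."""
    kids = children(state)
    if depth == 0 or not kids:
        return value(state), [state]
    # Pruning: keep only the k most promising children by static value,
    # from the current player's perspective.
    kids = sorted(kids, key=value, reverse=maximize)[:k]
    best = None
    for kid in kids:
        score, line = search(kid, depth - 1, not maximize, children, value, k)
        if best is None or (maximize and score > best[0]) or (not maximize and score < best[0]):
            best = (score, [state] + line)
    return best

# Hypothetical toy game: states are integers, each move adds 1, 2, or 3,
# the static value of a state is the number itself, game ends after 3 plies.
score, line = search(0, 3, True,
                     children=lambda s: [s + 1, s + 2, s + 3],
                     value=lambda s: s)
print(score, line)  # → 7 [0, 3, 4, 7]
```

The whole search collapses into a single human-readable principal variation; tightening `k` shrinks the examined tree further, at the cost of possibly missing lines the pruning heuristic undervalues.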

Pivotal outcomes and pivotal processes

With the last point: I think I can roughly pass your ITT - we can try that, if you are interested.

So, here is what I believe your beliefs are:

  • With pretty high confidence, you expect a sharp left turn to happen (in almost all trajectories)
  • This is to a large extent based on the belief that at some point "systems start to work really well in domains really far beyond the environments of their training", which is roughly the same as "discovering a core of generality" and a few other formulations. These systems will be, in some meaningful sense, fundamentally different from e.g. Gato
  • From your perspective, this is based on thinking deeply about the nature of such systems (note that this is mostly based on hypothetical systems, and on an analogy with evolution)
  • My claim is roughly that this is only part of what's going on, where the actual thing is: people start with a deep prior on "continuity in the space of intelligent systems". Looking into a specific question about hypothetical systems, their search in argument space is guided by this prior, and they end up mostly sampling arguments supporting their prior. (This is not to say the arguments are wrong.)
  • You probably don't agree with the above point, but notice the correlations:
    • You expect a sharp left turn due to discontinuity in the "architectures" dimension (which is the crux according to you)
    • But you also expect jumps in capabilities of individual systems (at least I think so)
    • Also, you locate the majority of hope in "sharp right turn" histories (in contrast to smooth right turn histories)
    • And more
  • In my view, your (or rather MIRI-esque) views on the above dimensions are correlated more than expected, which suggests the existence of a hidden variable/hidden model explaining the correlation.

I personally think that a large majority of humanity's hope lies in someone executing a pivotal act. But I assume Critch disagrees with this, and holds a view closer to 1+2+3.

If so, then I think he shouldn't go "well, pivotal acts sound weird and carry some additional moral hazards, so I will hereby push for pivotal acts to become more stigmatized and hard to talk about, in order to slightly increase our odds of winning in the worlds where pivotal acts are unnecessary".

Rather, I think hypothetical-Critch should promote the idea of pivotal processes, and try to reduce any existing stigma around the idea of pivotal acts, so that humanity is better positioned to evade destruction if we do end up needing to do a pivotal act. We should try to set ourselves up to win in more worlds.

Can't speak for Critch, but my view is that pivotal acts planned as pivotal acts, in the way most people in the LW community think about them, have only a very small chance of being the solution (my guess is one or two bits more extreme, more like 2-5% than 10%).

I'm not sure if I agree with you re: the stigma. My impression is that while the broader world doesn't think in terms of pivotal acts, if it paid more attention, yes, many proposals would be viewed with suspicion. On the other hand, I think on LW it's the opposite: many people share the orthodox views about sharp turns, pivotal acts, etc., and proposals to steer the situation more gently are viewed as unworkable or as engaging in thinking with "too optimistic assumptions", etc.

Note that I advocate for considering much weirder solutions, and also thinking about much weirder world states, when talking with the "general world". In contrast, on LW and AF, I'd like to see more discussion of various "boring" solutions on which the world can roughly agree.

Continuity assumptions are about what's likely to happen, not about what's desirable. It would be a separate assumption to say "continuity is always good", and I worry that a reasoning error is occurring if this is being conflated with "continuity tends to occur".

Basically, no. Continuity assumptions are about what the space looks like. Obviously, forecasting questions ("what's likely to happen") often depend on ideas about what the space looks like.

My claim is that pivotal acts are likely to be necessary for good outcomes, not that they're necessarily likely to occur. If your choices are "execute a pivotal act, or die", then insofar as you're confident this is the case, the base rate of continuous events just isn't relevant.

Yes, but your other claim is that a "sharp left turn" is likely and leads to bad outcomes. So if we partition the space of outcomes into good/bad, in both branches you assume the outcome is very likely driven by sharp turns.


The primary argument for hard takeoff isn't "stuff tends to be discontinuous"; it's "AGI is a powerful invention, and e.g. GPT-3 isn't a baby AGI". The discontinuity of hard takeoff is not a primitive; it's an implication of the claim that AGI is different from current AI tech, that it contains a package of qualitatively new kinds of cognition that aren't just 'what GPT-3 is currently doing, but scaled up'.

This is maybe becoming repetitive, but I'll try to paraphrase again. Consider the option that the "continuity assumptions" I'm talking about are not grounded in "takeoff scenarios", but in "how you think about hypothetical points in the abstract space of intelligent systems".

Thinking about features of this highly abstract space, in regions which don't exist yet, is epistemically tricky (I hope we can at least agree on that).

It probably seems to you that you have many strong arguments giving you reliable insights into how the space works somewhere around "AGI".

My claim is: "Yes, but the process which generated the arguments is based on a black-box neural net, which has a strong prior on things like 'stuff like math is discontinuous'." (I suspect this "taste and intuition" box is located mostly in Eliezer's mind, and some other people updated "on the strength of arguments".) This isn't to imply various people haven't done a lot of thinking and generated a lot of arguments and intuitions about this. Unfortunately, given other epistemic constraints, in my view the "taste and intuition" differences sort of "propagate" to "conclusion" differences.

Pivotal outcomes and pivotal processes

In my view, in practice, the pivotal-acts framing actually pushes people to consider a narrower space of discrete powerful actions: "sharp turns", "events that have a game-changing impact on astronomical stakes".

As I understand it, the definition of "pivotal acts" explicitly forbids considering things like "this process would make 20% per year of AI developers actually take safety seriously with 80% chance" or "what class of small shifts would in aggregate move the equilibrium?". (Things in this category get straw-manned as "Rube-Goldberg-machine-like".)

As is often the case, one of the actual cruxes is in continuity assumptions: basically, you have a low prior on "smooth trajectory changes by many acts" and a high prior on "sharp turns left or right".

The second crux, as you note, is the doom-by-default probability: if you have a very high doom probability, you may be in favour of variance-increasing acts, whereas people who are a few bits more optimistic may be much less excited about them, in particular if all the plans for such acts have very unclear shapes of impact distributions.

Given these deep prior differences, it seems reasonable to assume this discussion will lead nowhere in particular. (I have a draft with a more explicit argument why.)

Continuity Assumptions

Note that Nate and Eliezer expect there to be some curves you can draw after-the-fact that shows continuity in AGI progress on particular dimensions. They just don't expect these to be the curves with the most practical impact (and they don't think we can identify the curves with foresight, in 2022, to make strong predictions about AGI timing or rates of progress).

Quoting Nate in 2018: ...


Yes, but conversely, I could say I'd expect some curves to show discontinuous jumps, mostly in dimensions which no one really cares about. Clearly the cruxes are about discontinuities in dimensions which matter.

As I tried to explain in the post, I think continuity assumptions mostly get you different things than "strong predictions about AGI timing". 


My point here isn't to throw 'AGI will undergo discontinuous leaps as they learn' under the bus. Self-rewriting systems likely will (on my models) gain intelligence in leaps and bounds. What I’m trying to say is that I don’t think this disagreement is the central disagreement. I think the key disagreement is instead about where the main force of improvement in early human-designed AGI systems comes from — is it from existing systems progressing up their improvement curves, or from new systems coming online on qualitatively steeper improvement curves?

I would paraphrase this as "assuming discontinuities at every level" - both in one-system training, and in the more macroscopic exploration of the "space of learning systems" - but stating that the key disagreement is about discontinuities in the space of model architectures, rather than in the jumpiness of single-model training.

Personally, I don't think the distinction between 'movement by learning of a single model', 'movement by scaling', and 'movement by architectural changes' will necessarily be big.

There is, I think, a really basic difference of thinking here, which is that on my view, AGI erupting is just a Thing That Happens and not part of a Historical Worldview or a Great Trend.

This seems to more or less support what I wrote? Expecting a Big Discontinuity, and this being a pretty deep difference?

I think the Hansonian viewpoint - which I consider another gradualist viewpoint, and whose effects were influential on early EA and which I think are still lingering around in EA - seemed surprised by AlphaGo and Alpha Zero, when you contrast its actual advance language with what actually happened.  Inevitably, you can go back afterwards and claim it wasn't really a surprise in terms of the abstractions that seem so clear and obvious now, but I think it was surprised then; and I also think that "there's always a smooth abstraction in hindsight, so what, there'll be one of those when the world ends too", is a huge big deal in practice with respect to the future being unpredictable.

My overall impression is that Eliezer likes to argue against "Hansonian views", but something like "continuity assumptions" seems a much broader category than Robin's views.

Paul and Eliezer have had lots of discussions over the years, but I don't think they talked about takeoff speeds between the 2018 post and the 2021 debate?

In my view, continuity assumptions are not just about takeoff speeds. E.g., IDA makes much more sense in a continuous world - if you reach a cliff, working IDA should slow down and warn you. In the Truly Discontinuous world, you just jump off the cliff at some unknown step.

I would guess that probably a majority of all debates and disagreements between Paul and Eliezer have some "continuity" component: e.g., the question of whether we can learn a lot of important alignment stuff from non-AGI systems is a typical continuity problem, but only tangentially relevant to takeoff speeds.


Where I agree and disagree with Eliezer

A not-very-coherent response to #3. Roughly:

  • Caring about visible power is a very human motivation, and I'd expect it will draw many people to care about "who are the AI principals", "what are the AIs actually doing", and a few other topics which have significant technical components
  • Somewhat wild datapoints in this space: nuclear weapons, the space race. In each case, salient motivations such as "war" led some of the best technical people to work on hard technical problems. In my view, the problems the technical people ended up working on were often "vs. nature" and distant from the original social motivations
  • Another take on this is: some people want to work on technically interesting and important problems, but some of them want to work on "legibly important" or "legibly high-status" problems
  • I do believe there are some opportunities in steering some fraction of this attention toward some of the core technical problems (not toward all of them, at this moment). 
  • This can often depend on framing; while my guess is that e.g. you probably shouldn't work on this, my guess is that some people who understand the technical alignment problems should
  • This can also depend on social dynamics; your "naive guess" seems a good starting point
  • Also: it seems there is a lot of low-hanging fruit among low-difficulty problems which someone should work on - e.g., at this moment, many humans should be spending a lot of time trying to get an empirical understanding of what types of generalization LLMs are capable of.

With prioritization, I think it would be good if someone made some sort of curated list of "who is working on which problems, and why" - my concern with part of the "EAs figuring out what to do" is that many people are doing some sort of expert-aggregation on the wrong level. (Like, if someone basically averages your and Eliezer Yudkowsky's conclusions, giving 50% weight to each, I don't think the result is a useful and coherent model.)

Let's See You Write That Corrigibility Tag

Quick attempt at rough ontology translation between how I understand your comment, and the original post. (Any of you can correct me if I'm wrong)

I think what would typically count as "principles" in Eliezer's meaning are
1. designable things which make the "true corrigibility" basin significantly harder to escape, e.g. by making it deeper
2. designable things which make the "incorrigible" basin harder to reach, e.g. by increasing the distance between them, or increasing the potential barrier
3. somehow, making the "incorrigible" basin less lethal
4. preventing low-dimensional, low-barrier "tunnels" (or bridges?) between the basins
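The basin picture above can be made slightly more concrete with a toy potential-landscape sketch (my illustration, not anything from the original post): take a one-dimensional double well

$$V(x) = (x^2 - 1)^2,$$

with the minimum at $x = -1$ standing for the "truly corrigible" basin and the minimum at $x = +1$ for the "incorrigible" one. Then (1) corresponds to lowering $V(-1)$, (2) to raising the barrier height $V(0) - V(+1)$ or moving the minima further apart, (3) to changing what sitting in the right-hand well costs, and (4) to ruling out extra low-barrier paths between the minima - in one dimension there is only the single barrier at $x = 0$, but once $x$ is high-dimensional, most of the danger lies in cheap "tunnels" around it.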

E.g., some versions of "low impact" often make the "incorrigible" basin harder to reach, roughly because "elaborate webs of deception and cover-ups" may require complex changes to the environment. (Not robustly.)

In contrast, my impression is that what does not count as "principles" are statements about properties which are likely true in the corrigibility basin but don't seem designable - e.g., "corrigible AI does not try to hypnotize you". Also, the intended level of generality is likely: more specific than "make the basin deeper" and more general than "

Btw, my impression is that what makes the worst-case scenario hard to robustly solve is basically #4 from the list above. Otherwise, there are many ways to make the basin work "in most directions".

Where I agree and disagree with Eliezer

I'm not sure if you actually read carefully what you are commenting on. I emphasized the early response, or initial governmental-level response, in both comments in this thread.

Sure, multiple countries on the list made mistakes later, some countries sort of became insane, and so on. Later, almost everyone made mistakes with vaccines, rapid tests, investments in contact tracing, etc.

Arguing that the early lockdown was more costly than "an uncontrolled pandemic" would be a pretty insane position (cf. GDP costs; Italy had the closest thing to an uncontrolled pandemic). (Btw, the whole notion of "an uncontrolled pandemic" is deeply confused - unless you are a totalitarian dictatorship, you cannot just order people to "live normally" during a pandemic when enough other people are dying; you get spontaneous "anarchic lockdowns" anyway, just later and in a more costly way.)

Where I agree and disagree with Eliezer

What do you think is the primary component? I seem to recall reading somewhere that previous experience with SARS makes a big difference. I guess my more general point is that if the good COVID responses can mostly be explained by factors that predictably won't be available to the median AI risk response, then the variance in COVID response doesn't help to give much hope for a good AI risk response.

What seemed to make a difference

  1. someone with good models of what to do getting into an advisory position when the politicians freak out
  2. previous experience with SARS
  3. ratio of "trust in institutions" vs. "trust in your neighbors wisdom"
  4. raw technological capacity
  5. ability of the government to govern (i.e., execute many things in a short time)

In my view, 1. and 4. could go better than in covid, 2. is irrelevant, and 3. and 5. seem like broad parameters which can develop in different directions. Imagine you somehow become the main advisor to the US president when the situation becomes really weird, and she follows your advice closely - my rough impression is that in most situations you would be able to move the response to be moderately sane.

  • it's relatively intuitive for humans to think about the mechanics of the danger and possible countermeasures

Empirically, this often wasn't true. Humans had mildly confused ideas about the micro-level, but often highly confused ideas about the exponential macro-dynamics. (We created a whole educational game on that, and got some feedback that for some policymakers it was the thing that helped them understand... after a year into the pandemic.)

  • previous human experiences with pandemics, including very similar ones like SARS
  • there are very effective countermeasures that are much easier / less costly than comparable countermeasures for AI risk, such as distributing high quality masks to everyone and sealing one's borders
  • COVID isn't agenty and can't fight back intelligently
  • potentially divisive issues in AI risk response seem to be a strict superset of politically divisive issues in COVID response (additional issues include: how to weigh very long term benefits against short term costs, the sentience, moral worth, and rights of AIs, what kind of values do we want AIs to have, and/or who should have control/access to AI)

One factor which may make governments more responsive to AI risk is that covid wasn't exactly threatening to states. Covid was pretty bad for individual people and some businesses, but in some cases the relative power of states even grew during covid. In contrast, in some scenarios it may be clear that AI is an existential risk for states as well.

Where I agree and disagree with Eliezer
  • I doubt that's the primary component that makes the difference. Other countries which did mostly sensible things early are e.g. Australia, Czechia, Vietnam, New Zealand, and Iceland.
  • My main claim isn't about what a median response would be, but something like: "the difference between the median early covid governmental response and an actually good early covid response was something between 1 and 2 sigma; this suggests a bad response isn't over-determined, and sensible responses are within human reach". Even if Taiwan was an outlier, it's not like it's inhabited by aliens or run by a friendly superintelligence.
  • Empirically, median governmental response to a novel crisis is copycat policymaking from some other governments

Where I agree and disagree with Eliezer

I broadly agree with this on most points of disagreement with Eliezer, and also agree with many points of agreement.

A few points where I sort of disagree with both, although this is sometimes unclear:


Even if there were consensus about a risk from powerful AI systems, there is a good chance that the world would respond in a totally unproductive way. It’s wishful thinking to look at possible stories of doom and say “we wouldn’t let that happen;” humanity is fully capable of messing up even very basic challenges, especially if they are novel.

I literally agree with this, but at the same time, in contrast to Eliezer's original point, I also think there is a decent chance the world would respond in a somewhat productive way, and this is a major point of leverage.

For people who doubt this, I'd point to the variance in initial governmental-level responses to COVID-19, which ranged from "highly incompetent" (e.g. the early US) to "quite competent" (e.g. Taiwan). (I also have some intuitions around this based on a non-trivial amount of first-hand experience with how governments actually worked internally and made decisions - which you certainly don't need to trust, but if you are highly confident in the inability of governments to act or do reasonable things, you should at least be less confident.)


AI systems will ultimately be wildly superhuman, and there probably won’t be strong technological hurdles right around human level. Extrapolating the rate of existing AI progress suggests you don’t get too much time between weak AI systems and very strong AI systems, and AI contributions could very easily go from being a tiny minority of intellectual work to a large majority over a few years.

While I do agree there likely won't be strong technological hurdles, I think "right around human level" is the point where it seems most likely that some regulatory hurdles can be erected, or the human coordination landscape can change, or resources spent on alignment research could grow extremely fast, or, generally, weird things can happen. While I generally agree weird bad things can happen, I also think weird good things can happen, and this likewise seems a potential period of increased leverage.


There are strong social and political pressures to spend much more of our time talking about how AI shapes existing conflicts and shifts power. This pressure is already playing out and it doesn’t seem too likely to get better. I think Eliezer’s term “the last derail” is hyperbolic but on point.

I do agree that the pressures exist, and it would be bad if they caused many people working on the pessimistic-assumptions side to switch to work on e.g. corporate governance; on the other hand, I don't agree it's just a distraction. Given the previous two points, I think the overall state of power / coordination / conflict can have a significant trajectory-shaping influence.

Also, this dynamic will likely bring many more people to work on alignment-adjacent topics, and I think there is some chance to steer part of this attention toward productive work on important problems; I think this is more likely if at least some alignment researchers bother to engage with this influx of attention (as opposed to ignoring it as a random distraction).

This response / increase in attention in some sense seems like the normal way humanity solves problems, and it may be easier to steer it than, e.g., to try to find & convince random people to care about technical alignment problems.
