Former AI safety research engineer, now PhD student in philosophy of ML at Cambridge. I'm originally from New Zealand but have lived in the UK for 6 years, where I did my undergraduate and master's degrees (in Computer Science, Philosophy, and Machine Learning). Blog:


Shaping safer goals
AGI safety from first principles


What Multipolar Failure Looks Like, and Robust Agent-Agnostic Processes (RAAPs)

These aren't complicated or borderline cases; they are central examples of what we are trying to avert with alignment research.

I'm wondering if the disagreement over the centrality of this example is downstream from a disagreement about how easy the "alignment check-ins" that Critch talks about are. If they are the sort of thing that can be done successfully in a couple of days by a single team of humans, then I share Critch's intuition that the system in question starts off only slightly misaligned. By contrast, if they require a significant proportion of the human time and effort that was put into originally training the system, then I am much more sympathetic to the idea that what's being described is a central example of misalignment.

My (unsubstantiated) guess is that Paul pictures alignment check-ins becoming much harder (i.e. closer to the latter case mentioned above) as capabilities increase? Whereas maybe Critch thinks that they remain fairly easy in terms of number of humans and time taken, but that over time even this becomes economically uncompetitive.

Taboo "Outside View"

I really like this post, it feels like it draws attention to an important lack of clarity.

One thing I'd suggest changing: when introducing new terminology, I think it's much better, where possible, to use terms that are already widely comprehensible, rather than terms based on specific references which you'd need to explain to anyone unfamiliar with them.

So I'd suggest renaming 'ass-number' to 'wild guess' and 'foxy aggregation' to 'multiple models' or similar.

Challenge: know everything that the best go bot knows about go

I'm not sure what you mean by "actual computation rather than the algorithm as a whole". I thought that I was talking about the knowledge of the trained model which actually does the "computation" of which move to play, and you were talking about the knowledge of the algorithm as a whole (i.e. the trained model plus the optimising bot).

Richard Ngo's Shortform

In the scenario governed by data, the part that counts as self-improvement is where the AI puts itself through a process of optimisation by stochastic gradient descent with respect to that data.

You don't need that much hardware for data to be a bottleneck. For example, I think that there are plenty of economically valuable tasks that are easier to learn than StarCraft. But we get StarCraft AIs instead because games are the only task where we can generate arbitrarily large amounts of data.

Richard Ngo's Shortform

Yudkowsky mainly wrote about recursive self-improvement from a perspective in which algorithms were the most important factors in AI progress - e.g. the brain in a box in a basement which redesigns its way to superintelligence.

Sometimes when explaining the argument, though, he switched to a perspective in which compute was the main consideration - e.g. when he talked about getting "a hyperexponential explosion out of Moore’s Law once the researchers are running on computers".

What does recursive self-improvement look like when you think that data might be the limiting factor? It seems to me that it looks a lot like iterated amplification: using less intelligent AIs to provide a training signal for more intelligent AIs.

I don't consider this a good reason to worry about iterated amplification, though: in a world where data is the main limiting factor, recursive approaches to generating it still seem much safer than the alternatives.

Snyder-Beattie, Sandberg, Drexler & Bonsall (2020): The Timing of Evolutionary Transitions Suggests Intelligent Life Is Rare

Yeah, this seems like a reasonable argument. It feels like it really relies on this notion of "pretty smart" though, which is hard to pin down. There's a case for including all of the following in that category:

And yet I'd guess that none of these were/are on track to reach human-level intelligence. Agree/disagree?

Snyder-Beattie, Sandberg, Drexler & Bonsall (2020): The Timing of Evolutionary Transitions Suggests Intelligent Life Is Rare

My argument is consistent with the time from dolphin- to human-level intelligence being short in our species, because for anthropic reasons we find ourselves with all the necessary features (dexterous fingers, sociality, vocal cords, etc).

The claim I'm making is more like: for every 1 species that reaches human-level intelligence, there will be N species that get pretty smart, then get stuck, where N is fairly large. (And this would still be true if neurons were, say, 10x smaller and 10x more energy efficient.)

Now there are anthropic issues with evaluating this argument by pegging "pretty smart" to whatever level the second-most-intelligent species happens to be at. But if we keep running evolution forward, I can imagine elephants, whales, corvids, octopuses, big cats, and maybe a few others reaching dolphin-level intelligence. But I have a hard time picturing any of them developing cultural evolution.

Formal Inner Alignment, Prospectus

Mesa-optimizers are in the search space and would achieve high scores in the training set, so why wouldn't we expect to see them?

I like this as a statement of the core concern (modulo some worries about the concept of mesa-optimisation, which I'll save for another time).

With respect to formalization, I did say up front that less-formal work, and empirical work, is still valuable.

I missed this disclaimer, sorry. So that assuages some of my concerns about balancing types of work. I'm still not sure what intuitions or arguments underlie your optimism about formal work, though. I assume that this would be fairly time-consuming to spell out in detail - but given that the core point of this post is to encourage such work, it seems worth at least gesturing towards those intuitions, so that it's easier to tell where any disagreement lies.

Formal Inner Alignment, Prospectus

I have fairly mixed feelings about this post. On one hand, I agree that it's easy to mistakenly address some plausibility arguments without grasping the full case for why misaligned mesa-optimisers might arise. On the other hand, there has to be some compelling (or at least plausible) case for why they'll arise, otherwise the argument that 'we can't yet rule them out, so we should prioritise trying to rule them out' is privileging the hypothesis. 

Secondly, it seems like you're heavily prioritising formal tools and methods for studying mesa-optimisation. But there are plenty of things that formal tools have not yet successfully analysed. For example, if I wanted to write a constitution for a new country, formal methods would not be very useful; nor would they be if I wanted to predict a given human's behaviour, or understand psychology more generally. So what's the positive case for studying mesa-optimisation in big neural networks using formal tools?

In particular, I'd say that the less we currently know about mesa-optimisation, the more we should focus on qualitative rather than quantitative understanding, since the latter needs to build on the former. And since we currently do know very little about mesa-optimisation, this seems like an important consideration.
