Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

Machine learning is touching increasingly many aspects of our society, and its effect will only continue to grow. Given this, I and many others care about risks from future ML systems and how to mitigate them.

When thinking about safety risks from ML, there are two common approaches, which I'll call the Engineering approach and the Philosophy approach:

  • The Engineering approach tends to be empirically driven, drawing experience from existing or past ML systems and looking at issues that either: (1) are already major problems, or (2) are minor problems, but can be expected to get worse in the future. Engineering tends to be bottom-up and tends to be both in touch with and anchored on current state-of-the-art systems.
  • The Philosophy approach tends to think more about the limit of very advanced systems. It is willing to entertain thought experiments that would be implausible with current state-of-the-art systems (such as Nick Bostrom's paperclip maximizer) and is open to considering abstractions without knowing many details. It often sounds more "sci-fi-like" and more like philosophy than like computer science. It draws some inspiration from current ML systems, but often only in broad strokes.

I'll discuss these approaches mainly in the context of ML safety, but the same distinction applies in other areas. For instance, an Engineering approach to AI + Law might focus on how to regulate self-driving cars, while Philosophy might ask whether using AI in judicial decision-making could undermine liberal democracy.

While Engineering and Philosophy agree on some things, for the most part they make wildly different predictions both about what the key safety risks from ML will be and how we should address them:

  • Engineering and Philosophy would agree on some high-level points. For instance, both would agree that misaligned objectives are an important problem with ML systems that is likely to get worse. Engineering believes this because of examples like the Facebook recommender system, while Philosophy believes this based on conceptual arguments like those in Superintelligence. Philosophy is more confident that misaligned objectives are a big problem and thinks they could pose an existential threat to humanity if not addressed.
  • Engineering and Philosophy would both agree that out-of-distribution robustness is an important issue. However, Philosophy might view most engineering-robustness problems (such as those faced by self-driving cars) as temporary issues that will get fixed once we train on more data. Philosophy is more worried about whether systems can generalize from settings where humans can provide data, to settings where they cannot provide data even in principle.
  • Engineering tends to focus on tasks where current ML systems don't work well, weighted by their impact and representativeness. Philosophy focuses on tasks that have a certain abstract property that seems important, such as imitative deception.

In my experience, people who strongly subscribe to the Engineering worldview tend to think of Philosophy as fundamentally confused and ungrounded, while those who strongly subscribe to Philosophy think of most Engineering work as misguided and orthogonal (at best) to the long-term safety of ML. Given this sharp contrast and the importance of the problem, I've thought a lot about which—if either—is the "right" approach.

Coming in, I was mostly on the Engineering side, although I had more sympathy for Philosophy than the median ML researcher (who has ~0% sympathy for Philosophy). However, I now feel that:

  • Philosophy is significantly underrated by most ML researchers.
  • The Engineering worldview, taken seriously, actually implies assigning significant weight to thought experiments.

On the other hand, I also feel that:

  • Philosophy continues to significantly underrate the value of empirical data.
  • Neither of these approaches is satisfying and we actually have no single good approach to thinking about risks from future ML systems.

I've reached these conclusions through a combination of thinking, discussing with others, and observing empirical developments in ML since 2011 (when I entered the field). I've distilled my thoughts into a series of blog posts, where I'll argue that:

  1. Future ML Systems Will be Qualitatively Different from those we see today. Indeed, ML systems have historically exhibited qualitative changes as a result of increasing their scale. This is an instance of "More Is Different", which is commonplace in other fields such as physics, biology, and economics (see Appendix: More Is Different in Other Domains). Consequently, we should expect ML to exhibit more qualitative changes as it scales up in the future.
  2. Most discussions of ML failures are anchored either on existing systems or on humans. Thought Experiments Provide a Third Anchor, and having three anchors is much better than having two, but each has its own weaknesses.
  3. If we take thought experiments seriously, we end up predicting that ML Systems Will Have Weird Failure Modes. Some important failure modes of ML systems will not be present in any existing systems, and might manifest quickly enough that we can't safely wait for them to occur before addressing them.
  4. My biggest disagreement with the Philosophy view is that I think Empirical Findings Generalize Surprisingly Far, meaning that well-chosen experiments on current systems can tell us a lot about future systems.

This post is the introduction to the series. I'll post the next part each Tuesday, and update this page with links once the post is up. In the meantime, leave comments with any thoughts you have, or contact me if you'd like to preview the upcoming posts and leave feedback.


Really excited about this sequence, as I'm currently spending a lot of time on clarifying and formalizing the underlying assumptions and disagreements of what you're calling the Philosophy view (I don't completely agree with the term, but I think you're definitely pointing at important aspects of it). So having someone else post about the different strengths and weaknesses of this view and the Engineering view sounds great!

I've just binge-read the entire sequence and thoroughly enjoyed it, thanks a lot for writing! I really like the framework of three anchors (thought experiments, humans, and empirical ML) and the emphasis on the strengths and limitations of each and the need for all 3. Most discourse I've seen tends to strongly favour just one anchor, but 'all are worth paying attention to, none should be totalising' seems obviously true.

I really enjoyed this sequence, it provides useful guidance on how to combine different sources of knowledge and intuitions to reason about future AI systems. Great resource on how to think about alignment for an ML audience. 


Curated. This post cleanly gets at some core disagreement in the AI [Alignment] fields, and I think does so from a more accessible frame/perspective than other posts on LessWrong and the Alignment forum. I'm hopeful that this post and others in the sequence will enable better and more productive conversations between researchers, and for that matter, just better thoughts!

Thanks Ruby! Now that the other posts are out, would it be easy to forward-link them (by adding links to the italicized titles in the list at the end)?


We can also make a Sequence. I assume "More Is Different for AI" should be the title of the overall Sequence too?

Yup! That sounds great :)


Here it is!

You might want to edit the description and header image.



Thanks for writing this post.

I tend to believe that we need a combination of both. Purely pursuing these problems from a philosophical perspective risks becoming untethered from reality due to the lossiness and slipperiness of high-level abstractions. Purely pursuing these problems from an engineering mindset risks missing the forest for the trees, or failing to look beyond the present situation, even as it is rapidly changing.

I suspect that engineering-minded people's suspicion of philosophy may be one reason why the bounty I recently posted, on the potential circular dependency of counterfactuals, received very little engagement with the issue I was trying to highlight, despite the large number of comments.

Are you aware of other concepts that have a similar circular dependency, or have seemed to have it?

If you start following the parents from any concept, you'll ultimately end up going around in circles, but:

a) Further down the chain
b) Counterfactuals or something very much like them (like possible worlds) are somewhere in the loop.

I think this makes a difference because if circularity is far down the chain then you can effectively ignore the circularity. Depending on other quite different concepts also makes it easier to ignore.

I'm reluctant to frame engineering and philosophy as adversarial disciplines in this conversation, as AI and ML research have long drawn on both. As an example, Minsky and Papert's work on the "Society of Mind" and their book "Perceptrons" informed each other and went on to underpin much of what is now accepted in neural network research.

Moreover, there aren't just two disciplines feeding this field; lessons have been taken from computer science, philosophy, psychology and neuroscience over the fifty-odd years of AI work. The more successful ML shops have been using the higher-order language of psychology to describe and intervene on operational aspects (e.g., AlphaGo's in-game behaviour) and neuroscience to create the models (Hassabis, 2009).

I will be surprised if biological models of neurotransmitters don't make an appearance as an nth anchor in the next decade or so. These may well take inspiration from Patricia Churchland's decades-long cross-disciplinary work in philosophy and neuroscience. They may also draw from the intersection of psychology and neuroscience that is informing mental health treatments, both chemical and experiential.

This is all without getting into those fjords of philosophy in which many spend their time prioritising happiness over truth: ethics and morality, which I think is what this blog post really means when it says philosophy. Will connectionist modelling learn from and contribute to deontological, utilitarian, consequentialist and hedonistic ethics? I don't see how it cannot.

I'm reluctant to frame engineering and philosophy as adversarial disciplines in this conversation

I think it was talking about how people approach/view 'AI risk'.


Really excited to read this sequence as well!

I would actually give concrete evidence in favor of what I think you're calling "Philosophy," although of course there's not much dramatic evidence we'd expect to see if the theory is wholly true.

Here, however, is a YouTube video that should really be called, Why AI May Already Be Killing You. It uses hard data to show that an algorithm accidentally created real-world money and power, that it did so by working just as intended in the narrowest sense, and that its creators opposed the logical result so much that they actively tried to stop it. (Of course, this last point had to do with their own immediate profits, not the long-term effects.)

I'd be mildly surprised, but not shocked, to find that this creation of real-world power has already unbalanced US politics, in a way which could still destroy our civilization.

If you know of any more such analyses could you share?

see Appendix: More Is Different in Other Domains

Is there a link to that?

I am also quite interested to read this sequence.

Interesting. Reading the different paragraphs, I am somewhat confused about how you classify thought experiments: part of engineering, part of philosophy, or a third thing in their own right?

I'd be curious to see you expand on the following question: if we treat thought experiments as not being a philosophical technique, what other techniques or insights does philosophy have to offer to alignment?

Another comment: you write

When thinking about safety risks from ML, there are two common approaches, which I'll call the Engineering approach and the Philosophy approach.

My recent critique here (and I expand on this in the full paper) is that the x-risk community is not in fact using a broad engineering approach to ML safety at all. What is commonly used instead is a much more narrow ML research approach, the approach which sees every ML safety problem as a potential ML research problem. On the engineering side, things need to get much more multi-disciplinary.

I really like this post, you explained your purpose in writing the sequence very clearly. Thanks also for writing about how your beliefs updated over the process of writing this.

Looking forward to reading this series.