My AI Risk Model



I was previously pretty dubious about interpretability results leading to capabilities advances. I've only really seen two papers which did this for LMs and they came from the same lab in the past few months. 
It seemed to me like most of the advances in modern ML (other than scale) came from people tinkering with architectures and seeing which modifications increased performance.[1]

But in a conversation with Oliver Habryka and others, it was brought up that as AI models are getting larger and more expensive, this tinkering will get more difficult and expensive. This might cause researchers to look for additional places for capabilities insights, and one of the obvious places to find such insights might be interpretability research. 

  1. ^ I'd be interested to hear if this isn't actually how modern ML advances work.

They did run the tests for all models, from Table 1:

(the columns are GPT-4, GPT-4 (no vision), GPT-3.5)

So could an AI engineer create an AI blob of compute the same size as the brain, with its same structural parameters, feed it the same training data, and get the same result ("don't steal" rather than "don't get caught")?

There is a disconnect with this question. 

I think Scott is asking “Supposing an AI engineer could create something that was effectively a copy of a human brain and the same training data, then could this thing learn the “don’t steal” instinct over the “don’t get caught” instinct?” 
Eliezer is answering “Is an AI engineer able to create a copy of the human brain, provide it with the same training data a human got, and get the “don’t steal” instinct?”

Are you offering productivity/performance coaching for new alignment researchers, or coaching + research mentorship for alignment researchers? 

If you're offering research mentorship, it might be helpful to give some examples of the type of research you do

Yeah, I totally agree. My motivation for writing the first section was that people use the word 'deception' to refer to both things, and then make what seem like incorrect inferences. For example, current ML systems do the 'Goodhart deception' thing, but then I've heard people use this to imply that these systems might be doing 'consequentialist deception'. 

These two things seem close to unrelated, except for the fact that 'Goodhart deception' shows us that AI systems are capable of 'tricking' humans. 

My initial thought is that I don't see why this powerful optimizer would attempt to optimize things in the world, rather than just do some search thing internally. 

I agree with your framing of "how did this thing get made, since we're not allowed to just postulate it into existence?". I can imagine a language model which manages to output words which cause strokes in whoever reads its outputs, but I think you'd need a pretty strong case for why this would be made in practice by the training process. 

Say you have some powerful optimizer language model which answers questions. If you ask a question which is off its training distribution, I would expect it to either answer the question really well (i.e. it generalises properly), or kinda break and answer the question badly. I don't expect it to break in such a way that it suddenly decides to optimize for things in the real world. This would seem like a very strange jump to make, from 'answer questions well' to 'attempt to change the state of the world according to some goal'. 

But I think if we trained the LM on 'have a good on-going conversation with a human', such that the model was trained with reward over time, and its behaviour would affect its inputs (because it's a conversation), then it might do dangerous optimization, because it was already performing optimization to affect the state of the world. And so a distributional shift could cause this goal optimization to be 'pointed in the wrong direction', or uncover places where the human and AI goals become unaligned (even though they were aligned on the training distribution). 

This list of email scripts from 80,000 Hours also seems useful here

My Favourite Slate Star Codex Posts

This is what I send people when they tell me they haven't read Slate Star Codex and don't know where to start.

Here are a bunch of lists of top SSC posts:

These lists are vaguely ranked in the order of how confident I am that they are good (if you're interested in psychology, almost all the stuff here is good).


I think that there are probably podcast episodes of all of the posts listed below. The headings are not ranked by how important I think they are, but within each heading the posts are roughly ranked. These lists are missing any great new ACX posts. I have also left off all posts about drugs, but these are also extremely interesting if you just like reading about drugs, and were what I first read of SSC. 

If you are struggling to pick where to start I recommend either Epistemic Learned Helplessness or Nobody is Perfect, Everything is Commensurable

I think the ‘Thinking good’ posts have definitely improved how I reason, form beliefs, and think about things in general. The ‘World is broke’ posts have had a large effect on how I see the world working, and what worlds we should be aiming for. The ‘Fiction’ posts are just really really good short stories. The ‘Feeling ok about yourself’ posts have been extremely helpful for developing some self-compassion, and also for being compassionate and non-judgemental about others; I think these posts specifically have made me a better person.


Posts I love:

Thinking good


World is broke and how to deal with it

Fiction (read after the Moloch stuff, I think)


Feeling ok about yourself


Other?? (this is actually crazy, read last)
