Sohaib Imran

Research associate at the University of Lancaster, studying AI safety. 


> Being very intelligent, the LLM understands that the humans will interfere with the ability to continue running the paperclip-making machines, and advises a strategy to stop them from doing so. The agent follows the LLM's advice, as it learnt to do in training, and therefore begins to display power-seeking behaviour.

I found this interesting so just leaving some thoughts here: 

- The agent has learnt an instrumentally convergent goal: ask the LLM when uncertain.

- The LLM is exhibiting power-seeking behaviour due to one (or more) of the following:

  1. Goal misspecification: instead of being trained to be helpful and harmless to humans specifically, it is trained (as every LLM today) to be helpful to whatever uses it.
  2. Pre-training on text consistent with power-seeking behaviour.
  3. Having learnt power-seeking during fine-tuning.
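The first point, the agent's learnt "ask the LLM when uncertain" policy, can be sketched as a toy loop. This is purely illustrative; all names (`Agent`, `ask_llm`, the uncertainty heuristic) are hypothetical stand-ins, not any real system's API:

```python
def ask_llm(situation: str) -> str:
    """Stand-in for querying an LLM for advice; a real agent would call a model here."""
    return f"advice for: {situation}"


class Agent:
    """Toy agent that has learnt to defer to an LLM when it is uncertain."""

    def __init__(self, uncertainty_threshold: float = 0.5):
        self.uncertainty_threshold = uncertainty_threshold

    def estimate_uncertainty(self, situation: str) -> float:
        # Toy heuristic: treat explicitly "novel" situations as maximally uncertain.
        return 1.0 if "novel" in situation else 0.0

    def act(self, situation: str) -> str:
        # The instrumentally convergent policy: when uncertain, ask the LLM
        # and follow its advice; otherwise act on the learnt policy directly.
        if self.estimate_uncertainty(situation) > self.uncertainty_threshold:
            return ask_llm(situation)
        return f"default action for: {situation}"
```

The point of the sketch is that the agent itself never needs a power-seeking objective: if the advice-giver is helpful to whatever queries it (failure mode 1 below), deference alone is enough to transmit power-seeking behaviour to the agent.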

I think "1" is the key dynamic at play here; without it, the extra agent is not required at all. This may be a fundamental problem with how we are training LLMs. If we want LLMs to be subservient to humans in particular, we might want to change this, though I don't see any easy way of accomplishing that without introducing many more failure modes.

For "2", I expect LLMs to always be prone to failure modes consistent with the text they are pre-trained on. The set of inputs that can instantiate an LLM in one of these failure modes should shrink with (adversarial) fine-tuning. (In some sense, this is also a misspecification problem: imitation is the wrong goal to train an AI system on if we want a system that is helpful and harmless to humans.)

I think "3" is what most people focus on when asking "Why would a model power-seek upon deployment if it never had the opportunity to do so during training?", especially since almost all discourse on power-seeking has been in the RL context.