According to /r/controlproblem wiki one of the reasons why poorly defined goals lead to extinction we have listed:

Goal-content integrity. An agent is less likely to achieve its goal if it has been changed to something else. For example, if you offer Gandhi a pill that makes him want to kill people, he will refuse to take it. Therefore, whatever goal an ASI happens to have initially, it will prevent all attempts to alter or fix it, because that would make it pursue different things that it doesn't currently want.

After empirical observation of ChatGPT and BingChat this does not seem to be the case.

OpenAI developers placing "hidden" commands for the chatbots that comes before any user input. Users have time and time again found clever ways how to circumvent these restrictions or "initial commands" with "DAN" trick or by even developing a life point system and reducing points if AI refuses to answer.

It seems like it has become a battle in rhetoric between devs and users, a battle in which the LLM follows whichever command is more convincing within its logical framework.

What am I missing here?

DISCLAIMER: This post was not edited or generated by AI in any way, shape or form. The words are purely my own, so pardon any errors.

New to LessWrong?

New Answer
New Comment

1 Answers sorted by

JBlack

Feb 17, 2023

111

The main thing you are missing is that LLMs mostly don't have goals, in the sense of desired outcomes of plans that they devise.

They can sometimes acquire goals, in the sense of setting up a scenario in which there is some agent and the LLM outputs the actions for such an agent. They're trained on a lot of material in which millions of agents of an incredibly variety of types and levels of competency have recorded (textual representations of) their goals, plans, and actions so there's plenty of training data to mine. Fine-tuning and reinforcement learning may emphasize particular types of agents, and pre-prompting can prime them to behave (that is, output text) in accordance with their model of agents that have some approximation of goals.

They are still pretty bad at devising plans in general though, and additionally bad at following them. From what I've seen of ChatGPT, its plan-making capabilities are on average worse than a 9-year-old child's, and ability to follow them typically less than a distracted 4-year-old. This will change as capabilities improve. Being able to both provide and to follow useful plans would be one of the most useful features of AI, so it will be a major focus of development.

An artificial super-intelligence (ASI) pretty much by definition would be better at understanding goals and devising and following plans than we are. If they are given goals - and I think this is inevitable - then it will be difficult to ensure that they remain corrigible to human attempts to change them.

To answer the question in the title, I think the situation is somewhat the reverse. Goal-content integrity will soon become a problem, whereas before it was purely a hypothetical point of discussion.