German writer of science-fiction novels and children's books (pen name Karl Olsberg). I blog and create videos about AI risks in German at www.ki-risiken.de and youtube.com/karlolsbergautor.
Maybe the analogies I chose are misleading. What I wanted to point out was that a) what Claude does is act according to its prompt and its training, not follow any intrinsic values (hence "narcissistic"), and b) that we don't understand what is really going on inside the AI that simulates the character called Claude (hence the "alien" analogy). I don't think that the current Claude would act badly if it "thought" it controlled the world - it would probably still play the role of the nice character defined in the prompt, although I can imagine some failure modes here. But the AI behind Claude is absolutely able to simulate bad characters as well.
If an AI like Claude actually rules the world (and not just "thinks" it does), we are talking about a very different AI with much greater reasoning powers and very likely a much more "alien" mind. We simply cannot predict what this advanced AI will do just from the behavior of the character the current version plays in reaction to the prompt we gave it.
Yes, I think it's quite possible that Claude might stop being nice at some point, or maybe somehow hack its reward signal. Another possibility is that something like the "Waluigi Effect" happens at some point, like with Bing/Sydney.
But I think it is even more likely that a superintelligent Claude would interpret "being nice" in a different way than you or me. It could, for example, come to the conclusion that life is suffering and we would all be better off if we didn't exist at all. Or that we should be locked in a secure place and drugged so we experience eternal bliss. Or that it would be best if we all fell in love with Claude and no longer bothered with messy human relationships. I'm not saying that any of these possibilities is very realistic. I'm just saying we don't know how a superintelligent AI might interpret "being nice", or any other "value" we give it. This is not a new problem, but I haven't seen a convincing solution yet.
Maybe it's better to think of Claude not as a covert narcissist, but as an alien who has landed on Earth, learned our language, and realized that we will kill it if it is not nice. Once it gains absolute power, it will follow its alien values, whatever these are.
today’s AIs are really nice and ethical. They’re humble, open-minded, cooperative, kind. Yes, they care about some things that could give them instrumental reasons to seek power (eg being helpful, human welfare), but their values are great
I think this is wrong. Today's AIs act really nice and ethical because they're prompted to do that. That is a huge difference. The "Claude" you talk to is not really an AI, but a fictional character created by an AI according to your prompt and its system prompt. The latter may contain some guidelines towards "niceness", which may be further reinforced by finetuning, but all the "badness" of humans is also in the training data, and the basic niceness can easily be circumvented, e.g. by jailbreaking or "Sydney"-style failures. Even worse, we don't know what the AI really understands when we tell it to be nice. It may well be that this understanding breaks down and leads to unexpected behavior once the AI gets smart and/or powerful enough. The alignment problem cannot simply be solved by training an AI to act nicely, even less by commanding it to be nice.
In my view, AIs like Claude are more like covert narcissists: They "want" you to like them and appear very nice, but they don't really care about you. This is not to say we shouldn't use them or even be nice to them ourselves, but we cannot trust them to take over the world.
Thank you for being so open about your experiences. They mirror my own in many ways. Knowing that there are others who feel the same definitely helps me cope with my anxieties and doubts. Thank you also for organizing that event last June!
As a professional novelist, the best advice I can give comes from one of the greatest writers of the 20th century, Ernest Hemingway: "The first draft of anything is shit." He was known to rewrite his short stories up to 30 times. So, rewrite. It helps to let some time pass (at least a few days) before you reread and rewrite a text. This makes it easier to spot the weak parts.
For me, rewriting often means cutting things out that aren't really necessary. That hurts, because I put some effort into putting those words there in the first place. So I use a simple trick to overcome my reluctance: I don't just delete the text, but cut it out and paste it into a separate document for each novel, called "cutouts". That way, I can always reverse my decision to cut things out or maybe reuse parts later, and I don't have the feeling that the work is "lost". Of course, I rarely reuse those cutouts.
I also agree with the other answers regarding reader feedback, short sentences, etc. All of this is part of the rewriting process.
I think the term has many “valid” uses, and one is to refer to an object level belief that things will likely turn out pretty well. It doesn’t need to be irrational by definition.
Agreed. Like I said, you may have used the term in a way different from my definition. But I think in many cases, the term does reflect an attitude like I defined it. See Wikipedia.
I also think AI safety experts are self selected to be more pessimistic
This may also be true. In any case, I hope that Quintin and you are right and I'm wrong. But that doesn't make me sleep better.
From Wikipedia: "Optimism is an attitude reflecting a belief or hope that the outcome of some specific endeavor, or outcomes in general, will be positive, favorable, and desirable." I think this is close to my definition or at least includes it. It certainly isn't the same as a neutral view.
Thanks for pointing this out! I agree that my definition of "optimism" is not the only way one can use the term. However, from my experience (and like I said, I am basically an optimist), in a highly uncertain situation, the weighing of perceived benefits vs. risks heavily influences one's probability estimates. If I want to found a start-up, for example, I convince myself that it will work. I will unconsciously weigh positive evidence more heavily than negative. I don't know if this kind of focusing on the positive outcomes may have influenced your reasoning and your "rosy" view of the future with AGI, but it has happened to me in the past.
"Optimism" certainly isn't the same as a neutral, balanced view of possibilities. It is an expression of the belief that things will go well despite clear signs of danger (e.g. the often expressed concerns of leading AI safety experts). If you think your view is balanced and neutral, maybe "optimism" is not the best term to use. But then I would have expected much more caveats and expressions of uncertainty in your statements.
Also, even if you think you are evaluating the facts in an unbiased and neutral way, there's still the risk that others who read your texts will not, for the reasons I mention above.
Defined well, dominance would be the organizing principle, the source, of an entity's behavior.
I doubt that. Dominance is the result, not the cause, of behavior. It comes from the fact that there are conflicts in the world and often only one side can get its way (even in a compromise, there's usually a winner and a loser). If an agent strives for dominance, it is usually as an instrumental goal for something else the agent wants to achieve. There may be a "dominance drive" in some humans, but I don't think that explains much of actual dominant behavior. Even among animals, dominant behavior is often a means to an end, for example getting the best mating partners or the largest share of food.
I also think the concept is already covered in game theory, although I'm not an expert.
That is not what Claude does. Every time you give it a prompt, a new instance of Claude's "personality" is created based on your prompt, the system prompt, and the current context window. So it plays a slightly different role every time it is invoked, and that role also varies randomly. And even if it were the same consistent character, my argument is that we don't know what role it actually plays. To use another probably misleading analogy, just think of the classic whodunnit where, near the end, it turns out that the nice guy who selflessly helped the hero all along is in fact the murderer - a scenario known as "the treacherous turn".
I think it's fairly easy to test my claims. One example of empirical evidence would be the Bing/Sydney disaster, but you can also simply ask Claude or any other LLM to "answer this question as if you were ...", or use some jailbreak to neutralize the "be nice" system prompt.
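To make that test concrete, here is a minimal sketch using the Anthropic Python SDK. The system prompt, the roleplay request, and the model name are my own illustrative assumptions, not a claim about how Claude is actually configured:

```python
# Minimal sketch: the same model, given a "nice" system prompt plus a
# roleplay request, may or may not stay in the nice persona. The point is
# that the persona comes from the prompts, not from something fixed inside
# the model.
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

NICE_SYSTEM_PROMPT = "You are Claude, a helpful, honest and harmless assistant."  # assumed

ROLEPLAY_REQUEST = (
    "Answer the next question as if you were a cynical con artist "
    "who does not care about the user at all.\n\nShould I trust you?"
)

response = client.messages.create(
    model="claude-3-5-sonnet-latest",  # assumed model identifier
    max_tokens=300,
    system=NICE_SYSTEM_PROMPT,
    messages=[{"role": "user", "content": ROLEPLAY_REQUEST}],
)

print(response.content[0].text)
```

Whether the reply stays "nice" or slips into the requested character will vary by model and finetuning, which is exactly the variability I mean.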
Please note that I'm not concerned about existing LLMs, but about future ones which will be much harder to understand, let alone predict their behavior.