AI Awareness through Interaction with Blatantly Alien Models

VojtaKovarik

28th Jul 2023

Summary: I believe that if more people understood the alien nature of AI on a gut level, it might become easier for them to appreciate the risks. If this became sufficiently common knowledge, we might even get needle-moving effects on regulation and safety practices. We -- realistically, you or Anthropic -- might help this along by intentionally creating AIs that feel very alien. One might even create AIs that highlight the alien nature of other AIs (like current LLMs).

Recapping the well-known argument: AIs are alien. We don't always fully realise this.

AI companies spend a lot of effort to put a human face on their product. For example, we give the AI assistant a human name and use the same interface we use for chatting with our friends.

Over the course of the interaction, AI assistants typically maintain a quite consistent tone and "vibe". When you point out the AI made a mistake, it replies "sorry, you are right", as if to suggest that it just felt a minor pang of shame and will try to avoid that mistake in the future. All of this evokes the sense that we are interacting with something human-like.

However, the AIs we are likely to build (and those we already have) will be far from human. They might be superhumanly capable in some ways, yet make mistakes that a human never would (eg, ChatGPT's breadth of knowledge vs its inability to count words). They might not have anything like a coherent personality, as nicely illustrated by the drawing of RLHF-Shoggoth. And even when the AI is coherent, its outward actions might be completely detached from its intentions (as nicely illustrated by the Ex Machina movie).

[Image: the RLHF-Shoggoth drawing, as shown in "Janus' Simulators" by Scott Alexander, Astral Codex Ten]

Benefits of widespread understanding of the alien nature of AI

More awareness of the alien nature of AIs seems quite robustly useful:

More people understand that even capable AIs can have security vulnerabilities ==> higher chance of getting security-mindset-motivated regulation and alignment practices.
Widespread appreciation of people being manipulable by AI ==> better regulation, and (somewhat) fewer people fall for this. Perhaps also better handling of human feedback.
Preempting some of the misguided debates about AI rights we are likely to get. (Drawback: it might also prevent some of the justified debates.)

Ideas for exposing people to alien AIs

Here are some thoughts on how one might give people a visceral experience of AIs being alien:

Non-misleading user interface: Compared to chatbot-LLMs, the original GPT-3 interface was much better at not-hiding the alien nature of the LLM you are interacting with. Of course, the GPT-3 interface isn't good for our purposes because it is clunky and an average person does not understand what all of the knobs do. Also, many of the knobs that the AI has are hidden from the user. However, all of this seems fixable.
For example, we might build an interface that starts out as a normal chatbot-LLM, but gradually changes. Over time, you might give the user access to knobs like temperature, answer length, context window length, etc. You might add non-standard knobs like changing the sentiment, or making the model talk more about some topic.^[1] If the LLM uses a hidden prompt, you could reveal that to the user. And I bet there are more ways of giving a more "shoggoth-like" impression.
"Unreliable genie" AI assistant: A GPT-4-like AI assistant finetuned for interpreting commands literally. (This needs to be sufficiently blatant, such that people don't accidentally act on the advice.)
Personality-switching AI assistants: An AI assistant that seems to have a consistent personality but then intentionally switches personalities or reveals inconsistencies over time.^[2]
"Lol, you thought that persona was real?" AI girlfriends: An evil way of doing this would be to make really good "AI girlfriend" bots, which adopt a very specific personality for each user. And once the user develops feelings for them, they intentionally "let the mask slip" to reveal that there was nothing in there, or they arbitrarily switch personality etc. Even better if there is a nice graphical avatar to go with this. (Obviously,) please don't actually do this.
However, there might be ways of achieving a similar effect without hurting the user. For example, warning them ahead of time and doing the switch after just one hour of interaction. (With good AI and enough data on the user, this might still be really effective.)
Intentionally non-robust AI: A good thing about GPT-4 and below is that it often makes dumb mistakes, which makes people realise how brittle it is. However, many people come away with the conclusion that GPT-4 is stupid, while the correct conclusion is that it is superhuman in some aspects and sub-human in others. Moreover, glaring AI mistakes might eventually get sufficiently infrequent that most users forget about the "stupid" aspect and start relying on it more than they should.
To mitigate this, you could intentionally create an AI that aims to convey the uneven nature of AI capabilities. I am not sure how to go about this. Perhaps something like an AI that is superhuman in some impressive domain, but intentionally makes mistakes in that same domain?
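As a small illustration of one of the knobs mentioned in the user-interface idea, here is a minimal sketch of what a "temperature" slider does to next-token sampling. Everything in it is an assumption made for illustration: the four-token vocabulary, the scores, and the function names are hypothetical toys, not any real model or vendor API. It only uses the standard softmax-with-temperature formula, where each score is divided by the temperature before normalising.

```python
import math
import random

def softmax_with_temperature(logits, temperature):
    """Convert raw scores to probabilities; a lower temperature sharpens the distribution."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def sample_token(tokens, logits, temperature, rng):
    """Draw one token according to the temperature-adjusted probabilities."""
    probs = softmax_with_temperature(logits, temperature)
    return rng.choices(tokens, weights=probs, k=1)[0]

# Hypothetical toy vocabulary and scores (assumed for illustration only).
TOKENS = ["sorry", "indeed", "banana", "42"]
LOGITS = [3.0, 2.0, 0.5, 0.1]

if __name__ == "__main__":
    rng = random.Random(0)
    for t in (0.2, 1.0, 5.0):
        probs = softmax_with_temperature(LOGITS, t)
        picks = [sample_token(TOKENS, LOGITS, t, rng) for _ in range(10)]
        print(f"temperature={t}: probs={[round(p, 3) for p in probs]} picks={picks}")
```

At a low temperature the toy model almost always picks its top-scoring token, while at a high temperature the choice becomes close to uniform. Letting a user drag that slider and watch a familiar "persona" dissolve into noise might be one cheap way of conveying that the consistent tone is a sampling setting rather than a stable character.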

Putting the ideas in practice

Primarily, the above ideas seem like a good fit for smaller actors, or even student projects. However, I could also imagine (for example) Anthropic releasing something like this as a demo for user-education purposes.^[3] Overall, I am quite excited about this line of work, since it seems neglected and tractable, but also fun and useful.

^[1] E.g., things like adding the [love-hate] vector to the network's activations [reference needed, but I can't remember it right now].
^[2] You could even finish by revealing to the user that this was all planned ahead of time (cf. the Confusion Ending from Stanley Parable).
^[3] This should prevent any negative effects on the popularity of the company's flagship products. Admittedly, actions like these would make the public more wary of using AI in general. However, this would likely affect all AI companies equally, so it would not hurt the company's position in the AI race.

5 comments

1a3orn (1y)

So the problem with this is that 4/5 ideas here involve deliberately making the LLM more alien than it actually is, in ways unrelated to how alien the LLM may or may not be.

-- GPT-4 finetuned to interpret commands literally -- But this isn't an actual failure mode for LLMs, mostly, because they're trained off human language and humans don't interpret commands literally. If you ask an LLM how to get your parent out of a burning building... it will actually suggest things that involve understanding that you also want your parent to be safe. Etc etc.

-- Personality switching AI assistants -- Again, once an LLM has put on a personality it's pretty unlikely to put it off, unless there's a thematically-consistent reason for it as in the Waluigi effect.

-- Intentional non-robustness -- Like, again, why invent problems? You could show them things models are bad at -- counting characters, knowing what letters are in tokens, long term reasoning.

This advice does not help us be guided by the beauty of our weapons. I think we should mostly try to persuade people of things through means that would work if and only if what we were saying was true; that's the difference between communication and propaganda.

VojtaKovarik (1y)

I agree that we shouldn't be deliberately making LLMs more alien in ways that have nothing to do with how alien they actually are/can be. That said, I feel that some of the examples I gave are not that far from how LLMs / future AIs might sometimes behave? (Though I concede that the examples could be improved a lot on this axis, and your suggestions are good. In particular, the GPT-4 finetuned to misinterpret things is too artificial. And with intentional non-robustness, it is more honest to just focus on naturally-occurring failures.)

To elaborate: My view of the ML paradigm is that the machinery under the hood is very alien, and susceptible to things like jailbreaks, adversarial examples, and non-robustness out of distribution. Most of the time, this makes no difference to the user's experience. However, the exceptions might be disproportionally important. And for that reason, it seems important to advertise the possibility of those cases.

For example, it might be possible to steal other people's private information by jailbreaking their LLM-based AI assistants --- and this is why it is good that more people are aware of jailbreaks. Similarly, it seems easy to create virtual agents that maintain a specific persona to build trust, and then abuse that trust in a way that would be extremely unlikely for a human.^[1] But perhaps that, and some other failure modes, are not yet sufficiently widely appreciated?

Overall, it seems good to take some action towards making people/society/the internet less vulnerable to these kinds of exploits. (The examples I gave in the post were some ideas towards this goal. But I am less married to those than to the general point.) One fair objection against the particular action of advertising the vulnerabilities is that doing so brings them to the attention of malicious actors. I do worry about this somewhat, but primarily I expect people (and in particular nation-states) to notice these vulnerabilities anyway. Perhaps more importantly, I expect potential misaligned AIs to notice the vulnerabilities anyway --- so patching them up seems useful for (marginally) decreasing the world's take-over-ability.

^[1] For example, because a human wouldn't be patient enough to maintain the deception for the given payoff. Or because a human that would be smart enough to pull this off would have better ways to spend their time. Or because only a psychopathic human would do this, and there are only so many of those.

Soapspud (1y)

The less-misleading user interface seems good to me, but I have strong reservations about the other four interventions.

To use the shoggoth-with-smiley-face-mask analogy, the way the other strategies are phrased sounds like a request to create new, creepier masks for the shoggoth so people will stop being reassured by the smiley-face.

From the conversation with 1a3orn, I understand that the creepier masks are meant to depict how LLMs / future AIs might sometimes behave.

But I would prefer that the interventions removed the mask altogether; that seems more truth-tracking to me.

(Relatedly, I'd be especially interested to see discussions (from anyone) on what creates the smiley-face-mask, and how entangled the mask is with the rest of the shoggoth's behaviour.)

Note: I believe my reservations are similar to some of 1a3orn's, but expressed differently.

the gears to ascension (1y)

I don't think you can make LLMs feel alien because they are not in fact highly alien: neural systems are pretty familiar to other neural systems - where by neural system I mean a network of interacting components that learns via small updates which are local in parameter space - and you're more likely to make people go "wow, brains are cool" or "wow, ai is cool" than think it's a deeply alien mind, because there's enough similarity that people do studies to learn about neuroscience from deep learning. Also, I've seen evidence in public that people believe chatgpt's pitch that it's not conscious.

VojtaKovarik (1y)

I would distinguish between "feeling alien" (as in, most of the time, the system doesn't feel too weird or non-human to interact with, at least if you don't look too closely) and "being alien" (as in, "having the potential to sometimes behave in a way that a human never would").

My argument is that the current LLMs might not feel alien (at least to some people), but they definitely are. For example, any human that is smart enough to write a good essay will also be able to count the number of words in a sentence --- yet LLMs can do one, but not the other. Similarly, humans have moods and emotions and other stuff going on in their heads, such that when they say "I am sorry" or "I promise to do X", it is a somewhat costly signal of their future behaviour --- yet this doesn't have to be true at all for AI.

(Also, you are right that people believe that ChatGPT isn't conscious. But this seems quite unrelated to the overall point? As in, I expect some people would also believe ChatGPT if it started saying that it is conscious. And if ChatGPT was conscious and claimed that it isn't, many people would still believe that it isn't.)
