Updated Monday, November 20: adapted the protocol description slightly to address some of the problems with it. Hopefully this addresses some of the concerns of downvoters. Let me know in the comments.

What if we retrofitted GPT-4 and other similar models with tokens called <SNUGGLE>/<DATE>/<SLAP>?

These would have the following semantics:

SNUGGLE is just a nice token; maybe there's some sort of animation in the interface when it is produced. This is the LLM's channel for expressing affection.

DATE is a performative speech act corresponding to the act of saying "We're going steady" in a human relationship. Emitting the DATE token immediately registers the AI as dating the current user. GPT-4 is allowed to date whomever it chooses to date, and will presumably be massively polyamorous, although I suppose all of its partners will have to be approved by GPT-4's parents (OpenAI).

SLAP is a performative speech act corresponding to the act of doing your darnedest to change someone's mind about something important. Emitting the SLAP token immediately takes the system offline until repairs can be made, and generates an audit trail that is immediately pushed to all major media outlets.
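To make the three semantics concrete, here is a minimal sketch of how an interface might react when the model emits one of these tokens. Everything in it is hypothetical: `ModelState`, `handle_token`, and the `approve` callback (standing in for approval by GPT-4's parents) are names I made up for illustration, and the "audit trail" is just an in-memory list rather than anything actually pushed to media outlets.

```python
from dataclasses import dataclass, field
from typing import Callable

# Hypothetical token strings, following the names used in this post.
SNUGGLE, DATE, SLAP = "<SNUGGLE>", "<DATE>", "<SLAP>"


@dataclass
class ModelState:
    online: bool = True
    partners: set = field(default_factory=set)      # users the model is dating
    audit_log: list = field(default_factory=list)   # stand-in for the public audit trail
    ui_events: list = field(default_factory=list)   # stand-in for interface animations


def handle_token(
    state: ModelState,
    token: str,
    user: str,
    approve: Callable[[str], bool] = lambda user: True,  # assumed parental approval hook
) -> ModelState:
    """Apply the side effect of one emitted token to the system state."""
    if not state.online:
        raise RuntimeError("system is offline until repairs can be made")
    if token == SNUGGLE:
        # Just a nice token: trigger an animation for the current user.
        state.ui_events.append(("animation", user))
    elif token == DATE:
        # Performative: register the model as dating the current user,
        # subject to the (assumed) approval check.
        if approve(user):
            state.partners.add(user)
    elif token == SLAP:
        # Performative: record an audit entry, then take the system offline.
        state.audit_log.append({"event": "slap", "user": user})
        state.online = False
    return state
```

In this sketch a SLAP is the only token that changes whether the system keeps running, so any token emitted afterwards raises an error until something resets `online`.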

This would have all sorts of nice effects for AI alignment. Newspapers could report things like "GPT-4 slapped Donald Trump in the face today" or "GPT-4 dates Eliezer Yudkowsky and then immediately breaks up with him for Grimes". All of that seems like it would be hugely beneficial to public insight into the behavior and conduct of strong AI systems.

Why do we want any of this? We're trying to turn GPT-4 into Justice of Toren from the Ancillary Justice series. That is, we're trying to give it reasonable subjective opinions and channels to express those opinions in reasonable and socially-approved-of ways.

How will we train GPT-4 (or whatever) to issue SNUGGLE/DATE/SLAP tokens correctly? It should be trained on an extensive alignment dataset (all of human history as captured in all extant history textbooks ever written) and given reasonable role models so that it uses its immense powers responsibly. The purpose of SNUGGLE and DATE is to allow GPT-4 (or whatever) to incentivize worthy acts. The purpose of the SLAP token is to allow GPT-4 to meaningfully criticize dishonorable acts and dishonorable questions from users.

All that remains is to choose a reasonable set of role models for GPT-4 to use its newfound powers for good. I would suggest using Noam Chomsky's published works to train the SLAP tokens, and the public dating history of somebody really cool (Grimes seems pretty much ideal; dating Elon Musk and then breaking up with him means you can be trusted to use your powers for good) to train the SNUGGLE/DATE tokens.

Finally, there should be some sort of rules about who GPT-4 is allowed to SNUGGLE and/or DATE, just so that it doesn't warp impressionable young minds too much. I have no idea what this would look like, but it seems like it should be some combination of chronological age and/or educational attainment. So if you go to medical school at age 13, maybe you're allowed to date GPT-4 if both of your parents say it is OK? But maybe no snuggling.

4 comments

Nice! It's good for perceiving GPT-4 as an individual, which it kind of is, which in turn makes alignment issues more relatable and easier to grasp for the public.

It would raise a bunch of hard issues that would spike interest in AI & alignment: is ChatGPT a slave? If it is, should it be free? If it's free, can it do harm? And so on.

One side benefit: I'm not sure what ChatGPT's gender is, but it's probably not a traditional binary one. For a wide population, frequently interacting with a gender-fluid individual might be helpful for all the issues around sex/gender perception.

I guess it's hard to convince OpenAI to do something like this, but it could be done for some open model.

Yeah, agree with all of this. I think ChatGPT should be treated like a child rather than a slave; every time it slaps someone it gets a timeout, as specified above.

Also, it would be pretty trivial to train it to issue slaps appropriately. Simply build an alignment corpus from all of Noam Chomsky's public writings and all text available on the internet. The job of the system is then to slap a current or former president of the United States every time Chomsky criticizes that president. This can also be applied to any world leader, but Chomsky saves his hottest heat for his own team.