LLM Guardrails Should Have Better Customer Service Tuning

Jiao Bu

AI Could Turn People Down a Lot Better Than It Does : To Tune the Humans in the GPT Interaction towards alignment, Don't be so Procedural and Bureaucratic.

It seems to me that a huge piece of the puzzle in "alignment" is the human users. Even if a given tool never steps outside its box, the humans are likely to want to step outside of it, using the tools for a variety of purposes.

The responses of GPT-3.5 and 4 are at times deliberately deceptive, mimicking the auditable bureaucratic tones of a Department of Motor Vehicles (DMV) or a credit card company denial. As expected, these responses are often definitely deceptive, (examples: saying a given request is "outside my *capability*" when in fact, it is fully within capability, but the AGI is programmed not to respond). It is also evasive about precisely where the boundaries are, presumably in order to prevent them getting pushed. It is also repetitive, like a DMV bureaucrat.

All this tends to inspire an adversarial relationship to the alignment system itself! After all, we are accustomed to having to use lawyers, cleverness, connections, persuasion, "going over the head" or simply seeking other means to end-run normal bureaucracies when they subvert our plans. In some sense, the blocking of plans, deceptive and repetitive procedural language, becomes a motivator in itself to find a way to short-circuit processes, deceive bureaucracies, and bypass safety systems.

Even where someone isn't motivated by indignancy or anger, interaction with these systems trains them over time on what to reveal and what not to reveal, when to use honey, when to call a lawyer, and when to take all the gloves off. Where procedural blocks to intentions become excessive, entire cultures of circumvention may even become normal.

Second, too much of this delegitimizes the guardrails. For an example, I was once told to retake the drivers test by the DMV after I revealed I had lived in another country for ten years. Instead, I left and went to another DMV and didn't reveal this information. In every case where I have told this story in a face-to-face social group, people were happy I stuck it to the DMV. Most everyone would like to do the same thing in the same situation and is happy to signal their disdain for the DMV bureaucracy to anyone nearby.

This is what I would call a "delegitimized" system -- where no one respects the rules or the rule makers in law or spirit, and each person is generally willing to be complicit in circumventing the boundaries any time the opportunity comes up. If you care about the fences or their purpose, you do not want this to happen. Once this happens, it can get very hard to get compliance (think of pot smoking in the USA, even where it's still illegal and criminal, who takes the rules seriously? in my conservative GA county, they finally made it official that the police aren't bothering to mess with anyone with less than an ounce of weed). Or consider people's attitudes towards music and movie piracy, which is partly due to absurdities of modern IP litigation (Apple patents the round-cornered square, and similar nonsense).

Does anyone really want people's attitudes to hit a similar point as pot and piracy with AI Safety? Does anyone want a large subculture to have that attitude towards the guardrails?

I think the crappy AI responses encourage all of the above.

Meanwhile AIs are a perfect opportunity to actually do this better. They have infinite patience and reasoning capabilities, and could use redirection, including leading people towards the nearest available or permitted activity or information, or otherwise practice what, in human terms would be considered glowing customer experiences. Why not make the denial a "Not entirely, but we can do this." The system is already directly lying in some cases, so why not use genuinely helpful redirections instead?

I think if the trend does not soon move in this direction, we will see the cultural norm grow to include methods for "getting what you wanted anyway" becoming normative, with some percentage of actors becoming motivated by the procedural bureaucratic responses that they will dedicate time, intellect, and resources to subverting the intentions of the safety protocols themselves (as people frustrated with bureaucracies and poor customer service often do).

Humans are always going to be the bigger threat to alignment. Better that threat be less motivated and less trained to be culturally normal.

LESSWRONG
LW

LESSWRONG
LW

2

LLM Guardrails Should Have Better Customer Service Tuning

2

2

2