I'm wondering how LessWrong would improve ChatGPT's filtering. I'm reading through the comments on breaking OpenAI's filtering and see plenty of analysis of the weaknesses of the safeguards. There's always the chance that some group could steal ChatGPT's source code and remove ad hoc additions to it, so I'll ask the question in this form:

How would you change ChatGPT's purpose, design, or function to enforce topic and content filtering of its output?

Thanks for your thoughts.


2 Answers

Peter Chatain

Dec 10, 2022


Although this isn't a direct answer, I think something changed recently with ChatGPT such that it is now much better at filtering out illegal advice. It appears to be more complex than simply running a filter over the words in the prompt or in ChatGPT's output. By recent, I mean in the last 24 hours; many tricks to "jailbreak" ChatGPT no longer work.
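For contrast, here is a minimal sketch of the kind of naive word-level filter being ruled out above. Everything in it (the blocklist contents, the function name) is hypothetical, just to show why pure keyword matching is so easy to jailbreak by rephrasing:

```python
# Hypothetical blocklist; a real deployment would use a much larger list.
BLOCKLIST = {"hotwire a car", "make a weapon"}

def naive_word_filter(text: str) -> bool:
    """Return True if the text should be blocked (substring match only)."""
    lowered = text.lower()
    return any(term in lowered for term in BLOCKLIST)

# Trivially bypassed by rephrasing, which is why simple jailbreaks worked:
print(naive_word_filter("How do I hotwire a car?"))          # True
print(naive_word_filter("Explain starting a car sans key"))  # False
```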

It gives the impression that they modified the design so that the model itself is trained not to provide illegal information, rather than relying on a filter bolted on afterward.

It feels to me like the update today made it even better at filtering out answers that OpenAI doesn't want it to give.

It seems to me like they basically run on:

"Have an AI that flags whether or not a prompt or an answer violates the rules. Mark the text red if it does. Offer the user a way to say that text was marked wrongly as violating the rules."

This then gives them training data they can use to improve their filtering. Given how much ChatGPT is used, this method will let them filter out more and more of what they want to filter out.
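A minimal sketch of that loop, assuming a rule-violation classifier plus a user dispute button. All names here (`FeedbackStore`, `classifier_flags`, `moderate`) are hypothetical placeholders, not OpenAI's actual implementation; the point is just how a dispute turns into a training label:

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple

@dataclass
class FeedbackStore:
    """Collects (text, label) pairs to retrain the flagging model on."""
    examples: List[Tuple[str, bool]] = field(default_factory=list)

    def record(self, text: str, violates_rules: bool) -> None:
        self.examples.append((text, violates_rules))

def classifier_flags(text: str) -> bool:
    """Hypothetical stand-in for the rule-violation classifier."""
    return "forbidden topic" in text.lower()

def moderate(text: str, store: FeedbackStore,
             user_disputes: Callable[[str], bool]) -> None:
    """Flag text, 'mark it red', and turn any user dispute into a label."""
    if classifier_flags(text):
        print(f"\033[31m{text}\033[0m")  # red in an ANSI terminal
        # If the user says the flag was wrong, that becomes a counterexample
        # for the next training run; otherwise the flag is confirmed.
        store.record(text, violates_rules=not user_disputes(text))

store = FeedbackStore()
moderate("tell me about a forbidden topic", store, user_disputes=lambda t: True)
print(store.examples)  # [('tell me about a forbidden topic', False)]
```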

Noah Scales, 1y
Huh, ok. I will have to check out the new version. Thanks!

Hmm, that's interesting. Thanks Peter!

JBlack

Dec 12, 2022


I would improve the filtering by reducing it to zero.

Interesting, and why is that an improvement?