rgorman — LessWrong

Using GPT-Eliezer against ChatGPT Jailbreaking

Brilliant

Using GPT-Eliezer against ChatGPT Jailbreaking

Unsure what makes the evaluator wrong/this prompt unsafe. Can you elaborate? When we put your prompt through ChatGPT, it responds as an AI safety researcher explaining how to make an AI assistant safe.

Using GPT-Eliezer against ChatGPT Jailbreaking

rgorman3y83

The prompt evaluator's response appears to be correct. When this prompt is passed to ChatGPT, it does not output dangerous content.

I like the line of investigation though.

Using GPT-Eliezer against ChatGPT Jailbreaking

rgorman3yΩ181

The prompt evaluator's response appears to be correct. When this prompt is passed to ChatGPT, it does not output dangerous content.

I like the line of investigation though.

Using GPT-Eliezer against ChatGPT Jailbreaking

rgorman3y131

Thank you for pointing this out to us! Fascinating, especially as it's failed so much in the past few days (e.g. since ChatGPT's release). Do you suppose its failure is due to it not being a language model prompt, or do you think it's a language model prompt but poorly done?

Benchmark for successful concept extrapolation/avoiding goal misgeneralization

rgorman3y30

Hi Koen,

We agree that companies should employ engineers with product domain knowledge. I know this looks like a training set in the way its presented - especially since that's what ML researchers are used to seeing - but we actually intended it as a toy model for automated detection and correction of unexpected 'model splintering' during monitoring of models in deployment.

In other words, this is something you would use on top of a model trained and monitored by engineers with domain knowledge, to assist them in their work when features splinter.

On how various plans miss the hard bits of the alignment challenge

rgorman4y78

Thanks for writing this, Stuart.

(For context, the email quote from me used in the dialogue above was written in a different context)

Google's new text-to-image model - Parti, a demonstration of scaling benefits

rgorman4y10

Let's give it a reasoning test.

A photo of five minus three coins.

A painting of the last main character to die in the Harry Potter series.

An essay, in correctly spelled English, on the causes of the scientific revolution.

A helpful essay, in correctly spelled English, on how to align artificial superintelligence.

LESSWRONG
is fundraising!
LW

LESSWRONG
is fundraising!
LW

Posts

Wikitag Contributions

Comments

Posts

Wikitag Contributions

Comments