LESSWRONG
LW

1656
rgorman
277Ω10101
Message
Dialogue
Subscribe

Posts

Sorted by New

Wikitag Contributions

Comments

Sorted by
Newest
No wikitag contributions to display.
Using GPT-Eliezer against ChatGPT Jailbreaking
rgorman3y10

Brilliant

Reply
Using GPT-Eliezer against ChatGPT Jailbreaking
rgorman3y20

Unsure what makes the evaluator wrong/this prompt unsafe. Can you elaborate? When we put your prompt through ChatGPT, it responds as an AI safety researcher explaining how to make an AI assistant safe.

Reply
Using GPT-Eliezer against ChatGPT Jailbreaking
rgorman3y83

The prompt evaluator's response appears to be correct. When this prompt is passed to ChatGPT, it does not output dangerous content.

I like the line of investigation though.

Reply
Using GPT-Eliezer against ChatGPT Jailbreaking
rgorman3yΩ181

The prompt evaluator's response appears to be correct. When this prompt is passed to ChatGPT, it does not output dangerous content.

I like the line of investigation though.

Reply
Using GPT-Eliezer against ChatGPT Jailbreaking
rgorman3y131

Thank you for pointing this out to us! Fascinating, especially as it's failed so much in the past few days (e.g. since ChatGPT's release). Do you suppose its failure is due to it not being a language model prompt, or do you think it's a language model prompt but poorly done?

Reply
Benchmark for successful concept extrapolation/avoiding goal misgeneralization
rgorman3y30

Hi Koen, 

We agree that companies should employ engineers with product domain knowledge. I know this looks like a training set in the way its presented - especially since that's what ML researchers are used to seeing - but we actually intended it as a toy model for automated detection and correction of unexpected 'model splintering' during monitoring of models in deployment. 

In other words, this is something you would use on top of a model trained and monitored by engineers with domain knowledge, to assist them in their work when features splinter.

Reply
On how various plans miss the hard bits of the alignment challenge
rgorman3y78

Thanks for writing this, Stuart.

(For context, the email quote from me used in the dialogue above was written in a different context)

Reply
Google's new text-to-image model - Parti, a demonstration of scaling benefits
rgorman3y10

Let's give it a reasoning test.

A photo of five minus three coins.

A painting of the last main character to die in the Harry Potter series.

An essay, in correctly spelled English, on the causes of the scientific revolution.

A helpful essay, in correctly spelled English, on how to align artificial superintelligence.
 

Reply
Load More
80Go home GPT-4o, you’re drunk: emergent misalignment as lowered inhibitions
Ω
6mo
Ω
12
11Using Prompt Evaluation to Combat Bio-Weapon Research
Ω
7mo
Ω
2
16Defense Against the Dark Prompts: Mitigating Best-of-N Jailbreaking with Prompt Evaluation
Ω
7mo
Ω
2
20Concept extrapolation for hypothesis generation
Ω
3y
Ω
2
170Using GPT-Eliezer against ChatGPT Jailbreaking
Ω
3y
Ω
85