Jurgen Gravestein

Comments

A case for AI alignment being difficult
Jurgen Gravestein · 2y · 10

Great overview! And thanks for linking to Bostrom's paper on OAI, I hadn't read that yet. 

My immediate thoughts: isn't it most likely that advanced AI will be used by humans to advance their own personal goals (benign or not)? The phrase "alignment to human values" therefore automatically raises the question: whose values? Which leads to: who gets to decide which values it gets aligned to?

Corrigibility, Self-Deletion, and Identical Strawberries
Jurgen Gravestein · 2y · 64

What do these examples really show, other than that ChatGPT's completion abilities match what we associate with a well-behaved AI? The practice of jailbreaking and prompt injection attacks shows that the capacity for harm in LLMs is always there; it's never removed through alignment/RLHF, which only makes it harder to access. Doesn't RLHF simply nudge the large majority of outputs to conform to what we consider acceptable outcomes? To me it feels a bit like convincing a blindfolded person he's blind, hoping he won't figure out how to take the blindfold off.

Slightly unrelated question: why is conjuring up personas in prompts so effective?

More information about the dangerous capability evaluations we did with GPT-4 and Claude.
Jurgen Gravestein · 2y · 2 · -2

So, basically, OpenAI has just deployed an extremely potent model, and the party evaluating its potentially dangerous capabilities is saying we should be more critical of both the eval party and the labs going forward? This makes it sound like an experiment waiting to go wrong.
