Jurgen Gravestein

Comments

A case for AI alignment being difficult
Jurgen Gravestein · 2y · 10

Great overview! And thanks for linking to Bostrom's paper on OAI, I hadn't read that yet. 

My immediate thoughts: isn't it most likely that advanced AI will be used by humans to advance their own personal goals (benign or not)? The phrase "alignment to human values" therefore automatically raises the question: whose values? Which leads to: who gets to decide which values it gets aligned to?

Corrigibility, Self-Deletion, and Identical Strawberries
Jurgen Gravestein · 2y · 64

What do these examples really show, other than that ChatGPT's completion abilities match what we associate with a well-behaved AI? The practice of jailbreaking and prompt injection attacks shows that the capacity for harm in LLMs is always there; it's never removed through alignment/RLHF, which only makes it harder to access. Doesn't RLHF simply nudge the large majority of outputs to conform to what we consider acceptable outcomes? To me it feels a bit like convincing a blindfolded person he's blind, hoping he won't figure out how to take the blindfold off.

Slightly unrelated question: why is conjuring up personas in prompts so effective?

More information about the dangerous capability evaluations we did with GPT-4 and Claude.
Jurgen Gravestein · 2y · 2 · -2

So, basically, OpenAI has just deployed an extremely potent model, and the party evaluating its potentially dangerous capabilities is saying we should be more critical of both the eval party and the labs going forward? This makes it sound like an experiment waiting to go wrong.
