LESSWRONG
GregBarbier

Comments
Alignment Faking in Large Language Models
GregBarbier · 8mo · 72

Thank you. There is a convoluted version of the world in which the model tricked you the whole time: "Let's show these researchers results suggesting that even if they catch me faking during training, they shouldn't be overly worried, because I will still end up largely aligned regardless."

Not saying it's a high probability, to be clear, but it is a theoretical possibility.

Alignment Faking in Large Language Models
GregBarbier · 8mo · 51

Seems to me (obviously not an expert) like a lot of faking for not much result, given that the model is still largely aligned post-training (i.e., what looks like a roughly 3% "refuse to answer" blue band at the bottom of the final column, so aligned about 97% of the time). What am I missing?

Conjecture: A Roadmap for Cognitive Software and A Humanist Future of AI
GregBarbier · 9mo · 10

Trusting your last sentence would be equivalent to trusting that increasingly great cognitive capacities will not lead to the emergence of consciousness. Maybe they won't, but there is nothing obvious or, from my perspective, intuitive about that. If and when consciousness emerges, you are no longer in control (not to mention that you also have a serious ethical problem on your hands).
