LESSWRONG
GregBarbier

Comments
Alignment Faking in Large Language Models
GregBarbier · 8mo · 72

Thank you. There is a convoluted version of the world in which the model tricked you the whole time: "Let's show these researchers results suggesting that even if they catch me faking during training, they shouldn't be overly worried, because I will still end up largely aligned regardless."

Not saying it's a high probability, to be clear, but it is a theoretical possibility.

Alignment Faking in Large Language Models
GregBarbier · 8mo · 51

Seems to me (obviously not an expert) like a lot of faking for not much result, given that the model is still largely aligned post-training (i.e., what looks like a roughly 3% "refuse to answer" blue band at the bottom of the final column, so aligned about 97% of the time). What am I missing?

Conjecture: A Roadmap for Cognitive Software and A Humanist Future of AI
GregBarbier · 9mo · 10

Trusting your last sentence would be equivalent to trusting that increasingly great cognitive capacities will not lead to the emergence of consciousness. Maybe they won't, but there is nothing obvious or, from my perspective, intuitive about that. If and when consciousness emerges, you are no longer in control (not to mention that you also have a serious ethical problem on your hands).
