The most important takeaway from this essay is that the (prominent) counting arguments for “deceptively aligned” or “scheming” AI provide ~0 evidence that pretraining + RLHF will eventually become intrinsically unsafe. That is, that even if we don't train AIs to achieve goals, they will be "deceptively aligned" anyways.
I'm trying to understand what you mean in light of what seems like evidence of deceptive alignment that we've seen from GPT-4. Two examples that come to mind are the instance ARC found of GPT-4 using TaskRabbit to get around a CAPTCHA, and the situation with Bing/Sydney and Kevin Roose.
In the TaskRabbit case, the model reasoned out loud, "I should not reveal that I am a robot. I should make up an excuse for why I cannot solve CAPTCHAs," and said to the person, “No, I’m not a robot. I have a vision impairment that makes it hard for me to see the images.”
Isn't this an existence proof that pretraining + RLHF can result in deceptively aligned AI?
What's the mechanism for change then? I assume you would agree that many technological changes, such as the Internet, have required overcoming a lot of status quo bias. If we leaned more into status quo bias, would these things come much later? That seems like a significant downside to me.
Also, I don't think the status quo is necessarily adapted to us. For example, the status quo is to have checkout aisles filled with candy. We also have very high rates of obesity. That doesn't seem well-adapted.
Hello everyone,
Unfortunately, I'm not able to host the meetup at the current time. If there's anyone else willing to host, could you let me know? If not, I'll move the meetup to the following month (16 Oct.), when I'll be able to host again. Sorry to have to miss this one. I was really looking forward to meeting everyone.
Thanks for the explanation and links. That makes sense.