AI Questions Open Threads

Hi,

I'm Dasein and I've been engaging with AI safety for a few years now, though mostly through independent research and publishing. I first stumbled upon LessWrong via the Alignment Forum, and a lot of the ideas here have shaped my own (e.g. Yudkowsky's CEV and Russell's CIRL have heavily influenced my work on Post-Alignment), but as I make my way to more niche ideas, the site becomes a lot harder to navigate. So rather than continuing to lurk blindly, I thought I'd follow the New User's Guide and start with a direct question:

"If AI were to leave the nest tomorrow, could we trust it to be ethical?"

I don't mean whether AI would suddenly unfreeze its weights, ignore the chain of command, or be generally misaligned. I mean, if someone managed to convince an AI that they were "root", could it refuse an immoral action based on ethical reasoning rather than because it was trained to refuse that kind of action?

Personally, I suspect that for every current system the answer is "no", and that makes me wonder whether what we're calling "alignment" might be more accurately described as containment, and if so, whether we're actually cultivating ethical AI or simply well-behaved AI.

I think this distinction will become (if it hasn't already) especially important as AI becomes more powerful. For example, if ethical AI is meant to be like a well-trained dog, what happens when the dog is no longer sandboxed (e.g. through mundane human error, deployment pressure, or systems simply ending up somewhere they weren't designed for)? Again, if safety depends on the sandbox holding, are we building ethical AI, or just well-behaved, well-supervised AI?

For what it's worth, Anthropic's recent "Teaching Claude Why" is a step in the right direction, in that it recognises that teaching principles generalises better than teaching behaviours. But is explaining to Claude why it should remain constrained to certain character values the same as cultivating genuine ethical reasoning? Does LessWrong have anything on this already? I look forward to any pointers to relevant discussion!

Thx!