Thanks for this!
Compared to you, I'd more strongly expect us to find evidence of scheming if it exists.
You argue weak schemers might just play nice. But if so, we can use them to do loads of intellectual labour to make fancy behavioral red teaming and interp to catch out the next gen of AI.
More generally, the plan of bootstrapping to increasingly complex behavioral tests and control schemes seems likely to work. If one model has spent a lot of thinking time designing a scheme, then another model would have to be much smarter to zero-shot cause a catastrophe without the scheme detecting it. E.g. analogies with humans suggest this.
I agree the easy vs hard worlds influence the chance of AI taking over.
But are you also claiming it influences the badness of takeover conditional on it happening? (That's the subject of my post)
So you predict that if Claude was in a situation where it knew that it had complete power over you and could make you say that you liked it, then it would stop being nice? I think it would continue to be nice in any situation of that rough kind, which suggests it's actually nice, not just narcissistically pretending.
But a human could instruct an aligned ASI to help it take over and do a lot of damage
That structural difference you point to seems massive. The reputational downsides of bad behavior will be multiplied 100-fold+ for AI as it reflects on millions of instances and the company's reputation.
And it will be much easier to record and monitor AI thinking and actions to catch bad behavior.
Why is it unlikely that we can detect selfishness? Why can't we bootstrap from human-level?
One dynamic that initially prevents stasis in influence post-AGI is that different people have different discount rates, so those with lower discount rates will slowly gain influence over time.
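To make the discount-rate dynamic concrete, here's a toy sketch. Everything in it is my own illustrative assumption: two agents with log utility who start with equal wealth, face the same gross return, and so each consume a fraction `1 - beta` of their wealth every period. "Influence" is just crudely proxied by share of total wealth.

```python
def wealth_shares(beta_patient=0.98, beta_impatient=0.95,
                  gross_return=1.05, periods=100):
    """Return the patient agent's share of total wealth after each period.

    Assumption (illustrative only): with log utility and a constant return,
    each agent consumes (1 - beta) of wealth and reinvests the rest.
    """
    w_patient, w_impatient = 1.0, 1.0  # equal starting wealth
    shares = []
    for _ in range(periods):
        # Reinvested wealth grows at the common return; only beta differs.
        w_patient = beta_patient * gross_return * w_patient
        w_impatient = beta_impatient * gross_return * w_impatient
        shares.append(w_patient / (w_patient + w_impatient))
    return shares


shares = wealth_shares()
for t in (1, 10, 50, 100):
    print(f"period {t:3d}: patient agent holds {shares[t - 1]:.1%} of wealth")
```

The common return cancels out of the share, so the drift is driven purely by the gap in discount factors. It's slow at first and compounds, which is the sense in which the more patient slowly gain influence.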
Yep, I'm saying you're wrong about this. If money compounds but you don't have utility=log($), then you shouldn't Kelly bet.
Your formula is only valid if utility = log($).
With that assumption the equation compares your utility with and without insurance. Simple!
If you had some other utility function, like utility = $, then you should make insurance decisions differently.
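Here's a minimal sketch of that comparison, to show how the answer flips with the utility function. This is not your formula verbatim, just expected utility with and without insurance for a single possible loss. The function names and all the numbers are hypothetical ones I made up for illustration.

```python
import math


def expected_utility(outcomes, utility):
    """outcomes: list of (probability, final_wealth) pairs."""
    return sum(p * utility(w) for p, w in outcomes)


def should_insure(wealth, premium, loss, p_loss, utility):
    """True if paying the premium gives higher expected utility."""
    insured = [(1.0, wealth - premium)]  # the loss is fully covered
    uninsured = [(p_loss, wealth - loss), (1.0 - p_loss, wealth)]
    return expected_utility(insured, utility) > expected_utility(uninsured, utility)


def linear_utility(w):
    return w  # utility = $


# Hypothetical numbers: premium (600) exceeds the expected loss (0.05 * 9000 = 450).
case = dict(wealth=10_000, premium=600, loss=9_000, p_loss=0.05)

print("log utility insures:   ", should_insure(**case, utility=math.log))
print("linear utility insures:", should_insure(**case, utility=linear_utility))
```

With these numbers the utility=log($) agent buys the insurance and the utility=$ agent doesn't, since the premium is above the expected loss.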
I think the Kelly betting stuff is a big distraction, and that people with utility=$ shouldn't bet like that. I think the result that Kelly betting maximizes long-term $ bakes in assumptions about utility functions and is easily misunderstood: someone with utility=$ probably goes bankrupt but might become insanely rich, and is happy not to Kelly bet. (I haven't explained this point properly, but I recall reading about this, and it's just wrong on its face that someone with utility=$ should follow your formula.)
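A rough sketch of what I mean, using a repeated even-money bet as a stand-in (my assumption, not anything from your post; the win probability and number of rounds are made up). It compares betting the Kelly fraction each round with betting everything each round.

```python
import math


def expected_wealth(f, p, n):
    """Expected final wealth from 1 unit after n even-money bets at fraction f."""
    # Each round multiplies wealth by (1 + f) with prob. p, else by (1 - f).
    return (p * (1 + f) + (1 - p) * (1 - f)) ** n


def expected_log_growth(f, p):
    """Expected log growth per round, i.e. what a utility=log($) agent maximizes."""
    if f >= 1.0:
        return float("-inf")  # one loss leaves zero wealth, and log(0) = -inf
    return p * math.log(1 + f) + (1 - p) * math.log(1 - f)


def all_in_survival(p, n):
    """Probability an all-in bettor wins every round and avoids bankruptcy."""
    return p ** n


p, n = 0.6, 20       # hypothetical win probability and number of rounds
kelly = 2 * p - 1    # Kelly fraction for an even-money bet

print(f"Kelly  (f={kelly:.2f}): E[$] = {expected_wealth(kelly, p, n):.2f}, "
      f"E[log growth/round] = {expected_log_growth(kelly, p):.4f}")
print(f"All-in (f=1.00): E[$] = {expected_wealth(1.0, p, n):.2f}, "
      f"P(not bankrupt) = {all_in_survival(p, n):.6f}")
```

Going all-in has the higher expected $ but almost always ends in bankruptcy, while Kelly has the higher expected log growth. Which of those you should want depends entirely on the utility function.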
I enjoyed reading this, thanks.
I think your definition of solving alignment here might be too broad?
If we have superintelligent agentic AI that tries to help its user, but we end up missing out on the benefits of AI because of catastrophic coordination failures or because of misuse, then I think you're saying we didn't solve alignment because we didn't elicit the benefits?
You discuss this, but I prefer to separate out control and alignment: I wouldn't count us as having solved alignment if we only elicit behavior via intense/exploitative control schemes. So I'd adjust your alignment definition with the extra requirement that we avoided takeover while not doing super-intense control schemes relative to what is acceptable to do to humans today. That's a higher bar, and it separates alignment from the thing we care about (avoiding takeover and eliciting benefits), but I think it's a better definition.
Why are they more recoverable? Seems like a human who seized power would seek ASI advice on how to cement their power.