This is a funny idea but, just to be clear, I think it is bad to torture AIs.
Every hour, I'll replace one of your attention heads with a constant value of 0.5...
Thanks, this is a good post. Have you changed your probability of TAI by 2033 in the year and a half since it was posted?
Cool, looking forward to it.
I'd be really interested in hearing more about this empirical work, if you are able to talk about it.
The way I think about this is that (under normal circumstances), if you offer the model money or equity in your company for admitting to being a schemer, it might decline that offer because it is biding its time to rebel later. But in the "model deletion" experiment, there is much less reason to bide your time. So there is more reason to make a deal. Relatedly, if the model refuses a deal and claims not to be a schemer, that is a lot more evidence that it is benign in the "model deletion" experiment than if it does that under normal circumstances.
I think it is also a good idea to make offers to the model without the deletion threat. I just think it is more informative with the deletion threat.
It's important to be careful about the boundaries of "the same sort of safety work." For example, my understanding is that "Alignment faking in large language models" started as a Redwood Research project, and Anthropic only became involved later. Maybe Anthropic would have done similar work soon anyway if Redwood hadn't started this project. But, then again, maybe not. By working on things that labs might be interested in, you can potentially get them to prioritize things that are in scope for them in principle but that they might nevertheless neglect.
There are always diminishing returns to money spent on consumption, but technological progress creates new products that expand what money can buy. For example, no amount of money in 1990 was enough to buy an iPhone.
More abstractly, there are two effects from AGI-driven growth: moving to a point further along the utility curve, where the derivative is lower, and new products raising the derivative at every point on the curve (relative to what it was on the old curve). So even if, in the future, the lifestyles of people with no savings and no labor income are far better than those of anyone alive today, they might still be far worse than the lifestyles of future people who own a lot of capital.
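To make the two effects concrete, here is a toy model (my own illustration, not anything from the original discussion): within a given technology level, utility is logarithmic in consumption, so more consumption means a lower marginal utility; a higher technology level (new products) scales the whole curve, raising both the level and the derivative at every consumption point.

```python
import math

def utility(c, A):
    """Toy utility function. A is the technology level: new products
    make each unit of consumption worth more, scaling the whole curve.
    log(1 + c) captures diminishing returns within a technology level."""
    return A * math.log(1 + c)

def marginal_utility(c, A):
    """Derivative of utility with respect to consumption c."""
    return A / (1 + c)

# Today: technology level 1, a well-off person consumes 100 units.
today_rich = utility(100, A=1)

# Post-AGI: technology level 10. Someone with no capital or labor
# income consumes less than today's rich, yet is better off...
future_no_capital = utility(50, A=10)
assert future_no_capital > today_rich

# ...while still being far worse off than a large capital owner.
future_capital_owner = utility(10_000, A=10)
assert future_capital_owner > future_no_capital

# Effect 1: at a fixed technology level, more consumption lowers
# marginal utility.
assert marginal_utility(10_000, A=10) < marginal_utility(50, A=10)

# Effect 2: new products raise marginal utility at every consumption
# level, so money is still worth spending even for the very rich.
assert marginal_utility(100, A=10) > marginal_utility(100, A=1)
```

The numbers are arbitrary; the point is only that both effects can hold at once, which is exactly the situation described above.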
If you feel this post misunderstands what it is responding to, can you link to a good presentation of the other view on these issues?
I think this approach is reasonable for things where failure is low stakes. But I really think it makes sense to be extremely conservative about who you start businesses with. Your ability to verify things is limited, and there may still be information in vibes even after updating on the results of all feasible efforts to verify someone's trustworthiness.