Mainly due to various frictions, like not being sure whether the different tax treatment is acceptable, and the fact that they often cost over $30,000 each. Also, due to luck, my SPY options are up over 100%, so I don't want to sell them until I need to roll them in ~December, and buying SPX options on top of them would leave me overexposed. This is plausibly a mistake, though.
Maybe Google or Amazon should employ burglars -- not only do they have higher legitimate revenue than Facebook, they have other advantages. Google has a better reputation than Facebook, allowing it to earn a higher percentage of its revenue from crime before it becomes a PR issue. Amazon already has an extensive logistics network it could use for getaway vehicles.
I'd rather not get into the treacherous turn thing, but I agree there are lots of problems with the current state of alignment evals.
Update on whether uplift is 2x already:
So basically I still think 2x uplift is plausible.
I expect misalignment rates to be locally linear in intelligence [1], so the question is whether we can get ablations that harm model performance by a small enough amount that there's still an acceptable upper bound on the true misalignment rate of the frontier model. Ideally the performance impact would be less than one model generation, since that's the resolution we have anyway, but what impact is acceptable depends on how much we believe in extrapolating trends.
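To make that concrete, here's the shape of the bound I have in mind, with entirely made-up numbers (the slope and the capability hit are exactly the quantities we'd have to estimate empirically):

```python
# Toy version of the bound (all numbers are made up for illustration).
# Assume misalignment rate is locally linear in capability with slope
# `slope_per_gen` (change in rate per model generation), and the ablation
# costs `capability_hit` generations of performance. Then the frontier
# model's true rate can differ from the observed ablated rate by at most
# |slope_per_gen| * capability_hit.

observed_ablated_rate = 0.04  # misaligned-behavior rate with eval awareness ablated
slope_per_gen = 0.03          # assumed local slope of misalignment rate per generation
capability_hit = 0.25         # ablation's performance cost, in "model generations"

correction = abs(slope_per_gen) * capability_hit
upper_bound = observed_ablated_rate + correction
print(f"true frontier rate <= {upper_bound:.4f} "
      f"(= observed {observed_ablated_rate} + correction {correction:.4f})")
```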
As for the second concern, seems valid but this isn't my area of expertise.
[1] Maybe we also have to take context length into account, since it determines how long the model could spend reflecting / otherwise drifting towards being malign.
We can address this with a controlled experiment. Anthropic already used random vectors as controls, and I expect that in the future we'll have controls calibrated to cause an equal performance impact. My guess is that, if we're careful, we can tell whether eval-awareness ablation affects misaligned behavior rates separately from just reducing intelligence.
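Concretely, the comparison I'd want looks something like this (all counts are made up; the important part is testing against a performance-matched control rather than against the unsteered baseline):

```python
import math

def two_proportion_z_test(k1, n1, k2, n2):
    """Normal-approximation test for a difference between two misbehavior rates."""
    p1, p2 = k1 / n1, k2 / n2
    pooled = (k1 + k2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    p_value = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    return z, p_value

# Hypothetical audit counts: misaligned transcripts out of total transcripts,
# under eval-awareness ablation vs. a random-vector control calibrated to
# cause the same drop on capability benchmarks.
z, p = two_proportion_z_test(37, 1000, 18, 1000)
print(f"z = {z:.2f}, p = {p:.3g}")
# A clear difference against a performance-matched control is evidence that the
# extra misbehavior comes from removing eval awareness, not from lost intelligence.
```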
Then you just ablate the "aware that its internals were tampered with" direction too. Probably won't work forever, but it only needs to work until we have better techniques, so I have some hope.
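For concreteness, here's a minimal numpy sketch of what ablating both directions means mechanically. The direction names are hypothetical, and in a real setup this projection would be applied to residual-stream activations at chosen layers during the forward pass:

```python
import numpy as np

def ablate_directions(h, directions):
    """Remove the component of activation h along each given direction.

    h: activation vector, shape (d,)
    directions: list of direction vectors, shape (d,) each -- e.g. a hypothetical
    "aware it's being evaluated" direction and a hypothetical "aware its internals
    were tampered with" direction.
    """
    # Orthonormalize the directions so ablating one doesn't reintroduce another.
    basis = []
    for v in directions:
        v = np.asarray(v, dtype=float)
        for b in basis:
            v = v - np.dot(v, b) * b
        norm = np.linalg.norm(v)
        if norm > 1e-8:
            basis.append(v / norm)
    # Project the activation off of each basis vector.
    for b in basis:
        h = h - np.dot(h, b) * b
    return h

# Usage (hypothetical probe directions):
# h_steered = ablate_directions(h, [v_eval_aware, v_tamper_aware])
```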
I'd guess that Apollo didn't get white-box access. Here's what Anthropic said in section 6.5 of the system card on "Inhibiting internal representations of evaluation awareness" (emphasis mine):
In our automated behavioral audits, we found that inhibiting one particular evaluation awareness vector (Testing vs. Conversation) led to moderate increases in misaligned behavior, as well as decreases in verbalized evaluation awareness. The rate of misaligned behavior under this steering intervention was higher than Claude Opus 4.5’s and Claude Haiku 4.5’s (unsteered) rate, but still lower than Claude Sonnet 4.5. The other vectors, and random controls, had substantially smaller effects on both dimensions.
The ability to perform ablations and steering against eval awareness may be necessary for alignment testing of all future models. Hopefully it will also remain sufficient.
Consider this toy game-theoretic model of AI safety: the US and China each have an equal pool of resources they can spend on capabilities and safety. Each country produces either an AI aligned to it or a misaligned AI. If either country's AI is misaligned, both payoffs are zero. If both AIs are aligned, a payoff of 100 is split in proportion to the resources invested in capabilities.
The ratio of safety to capabilities investment determines the probability of misalignment; specifically, the probability of misalignment is c / (ks + c) for some "safety effectiveness parameter" k. (At k = 1 we need a 1:1 safety-to-capabilities ratio for a 50% chance of alignment per AI, whereas at k = 100 we only need a 1:100 ratio.) There is no coordination, so we want the Nash equilibrium. What is the equilibrium strategy as a function of k, and what is the outcome? It turns out that:
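(If you want to compute the equilibrium yourself: below is a quick numerical check I wrote, using only the assumptions above, with each country's budget normalized to 1. It finds the symmetric equilibrium by best-response iteration over a grid of capabilities shares.)

```python
import numpy as np

def expected_payoff(c_i, c_j, k):
    """Player i's expected payoff: both AIs must come out aligned, and the prize
    of 100 is then split in proportion to capabilities. Budgets are normalized
    to 1, so each player's safety spend is 1 minus its capabilities spend."""
    s_i, s_j = 1 - c_i, 1 - c_j
    p_align_i = k * s_i / (k * s_i + c_i)  # = 1 - c_i / (k * s_i + c_i)
    p_align_j = k * s_j / (k * s_j + c_j)
    share = c_i / (c_i + c_j) if (c_i + c_j) > 0 else 0.5
    return p_align_i * p_align_j * 100 * share

def best_response(c_other, k, grid):
    return grid[int(np.argmax([expected_payoff(c, c_other, k) for c in grid]))]

def symmetric_equilibrium(k, grid=np.linspace(0.0, 1.0, 2001), iters=100):
    """Best-response iteration; in this game it settles on the symmetric
    Nash equilibrium capabilities share."""
    c = 0.5
    for _ in range(iters):
        c = best_response(c, k, grid)
    return c

for k in (1, 10, 100):
    c = symmetric_equilibrium(k)
    p_align = k * (1 - c) / (k * (1 - c) + c)
    print(f"k={k:>3}: capabilities share c = {c:.3f}, "
          f"P(one AI aligned) = {p_align:.3f}, P(both aligned) = {p_align ** 2:.3f}")
```

For what it's worth, the iteration converges quickly here (the best-response map is a contraction near the equilibrium), and a finer grid just sharpens the estimate.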
Variants one could explore:
Related: How much do you believe your results?