You cannot incentivize people to make that sacrifice at anything close to the proper scale because people don’t want money that badly. How many hands would you amputate for $100,000?
There's just no political will to do it, since the solutions would be harsh or expensive enough that nobody could impose them upon society. A god-emperor, who really wished to increase fertility numbers and could set laws freely without the society revolting, could use some combination of these methods:
Communication is indeed hard, and it's certainly possible that this isn't intentional. On the other hand, making mistakes is quite suspicious when they're also useful for your agenda. But I agree that we probably shouldn't read too much into it. The system card doesn't even mention the possibility of the model acting maliciously, so maybe that's simply not in scope for it?
While reading the OpenAI Operator System Card, the following paragraph on page 5 seemed a bit weird:
We found it fruitful to think in terms of misaligned actors, where:
- the user might be misaligned (the user asks for a harmful task),
- the model might be misaligned (the model makes a harmful mistake), or
- the website might be misaligned (the website is adversarial in some way).
Interesting use of language here. I can understand calling the user or website misaligned, if that is understood as alignment relative to laws or OpenAI's goals. But why call a model misaligned when it makes a mistake? To me, misalignment would mean doing that on purpose.
Later, the same phenomenon is described like this:
The second category of harm is if the model mistakenly takes some action misaligned with the user’s intent, and that action causes some harm to the user or others.
Is this yet another attempt to erode the meaning of "alignment"?
Ellison is the CTO of Oracle, one of the three companies running the Stargate Project. Even if aligning AI systems to some values can be solved, selecting those values badly can still be approximately as bad as the AI just killing everyone. Moral philosophy continues to be an open problem.
I would have written a shorter letter, but I did not have the time.
I am actually mildly surprised OA has bothered to deploy o1-pro at all, instead of keeping it private and investing the compute into more bootstrapping of o3 training etc.
I'd expect that deploying more capable models is still quite useful, as it's one of the best ways to generate high-quality training data. In addition to solutions, you need problems to solve, and confirmation that the problem has been solved. Or is your point that they already have all the data they need, and it's just a matter of spending compute to refine that?
I'd go a step beyond this: merely following incentives is amoral. It's the default. In a sense, moral philosophy discusses when and how you should go against the incentives. Superhero Bias resonates with this idea, but from a different perspective.
Yet Batman lets countless people die by refusing to kill the Joker. What you term "coherence" seems to be mostly "virtue ethics", and The Dark Knight is a warning about what happens when virtue ethics goes too far.
I personally identify more with HPMoR's Voldemort than any other character. He seems decently coherent. To me, a "villain" is a person whose goals and actions are harmful to my ingroup. This doesn't seem to have much to do with coherence.
A reliable gear in a larger machine might be less agentic but more useful than a scheming Machiavellian.
The scheming itself brings me joy. Self-sacrifice does not. I assume this to be the case for most people who read this. So if the scheming is a willpower restorer, keeping it seems useful. I'm not an EA, but I'd guess most of them can point to coherent-looking calculations on why what they're doing is better for efficiency reasons as well.
I'm drawing parallels between conventional system auditing and AI alignment assessment. I'm admittedly not sure if my intuitions transfer over correctly. I'm certainly not expecting the same processes to be followed here, but many of the principles should still hold.
In my experience, if an audit finds lots of issues, it means nobody has time to look for the hard-to-find issues. I get the same feeling from this section; Apollo easily found scheming issues where the model deferred to the system prompt too much. Often subtler issues get completely overshadowed, e.g. some findings could be attributed to the system prompt deference, when in reality they were caused by something else.
What I'm worried about is that these potential blind spots were not found, as per my reasoning above. I think the marginal value produced by a second external assessment wasn't diminished much by the first one. That said, I agree that deploying Claude 4 is quite unlikely to pose any catastrophic risks, especially with ASL-3 safeguards. Deploying earlier, which allows anyone to run evaluations on the model, is also valuable.