Dentosal — LessWrong

Systems programmer, security researcher and tax law/policy enthusiast.

I'm drawing parallels between conventional system auditing and AI alignment assessment. I'm admittedly not sure if my intuitions transfer over correctly. I'm certainly not expecting the same processes to be followed here, but many of the principles should still hold.

We believe that these findings are largely but not entirely driven by the fact that this early snapshot had severe issues with deference to harmful system-prompt instructions. [..] This issue had not yet been mitigated as of the snapshot that they tested.

In my experience, if an audit finds lots of issues, it means nobody has time to look for the hard-to-find issues. I get the same feeling from this section; Apollo easily found scheming issues where the model deferred to the system prompt too much. Often subtler issues get completely shadowed, e.g. some findings could be attributed to the system prompt deference, when in reality they were caused by something else.

To help reduce the risk of blind spots in our own assessment, we contracted with Apollo Research to assess an early snapshot for propensities and capabilities related to sabotage

What I'm worried about that these potential blind spots were not found, as per my reasoning above. I think the marginal value produced by a second external assessment wasn't diminished much by the first one. That said, I agree that deploying Claude 4 is quite unlikely to pose any catastrophic risks, especially with ASL-3 safeguards. Deploying earlier, allowing anyone to run evaluations on the model is also valuable.

You cannot incentivize people to make that sacrifice at anything close to the proper scale because people don’t want money that badly. How many hands would you amputate for $100,000?

There's just no political will to do it, since the solutions would be harsh or expensive enough that nobody could impose them upon society. A god-emperor, who really wished to increase fertility numbers and could set laws freely without the society revolting, could use some combination of these methods:

If you're childless, or perhaps just unmarried, you pay additional taxes. The amount can be adjusted to be as high as necessary. Alternatively, just raise the general tax rate and give reduction based on the number of children. If having children meant more money instead of less, that would help quite a bit.
Legally mandate having children. In some countries, men are forced into military service. You could require women to have children in similar way. Medical exceptions are already a thing for military service, they could apply here as well.
Remove VAT and other taxes from daycare services, and medical services for children.
Offer free medical services to children. And parents. (And everyone.)
Spend lots of money and research how to create children in artificial wombs. Do that.
The state could handle child-rearing, similar to how it works in Plato's Republic. I.e. scale up orphanage system massively and make that socially acceptable.
Fix the education system, while you're at it.
~~Forbid porn, contraception, and abortion.~~ (I don't think that actually helps)
Deny women access to education beynd elementary school, and additionally forbid employment (likely helps, but at what cost)
Propaganda. Lots of it. Censorship as well.

Communication is indeed hard, and it's certainly possible that this isn't intentional. On the other hand, making mistakes is quite suspicious when they're also useful for your agenda. But I agree that we probably shouldn't read too much into it. The system card doesn't even mention the possibility of the model acting maliciously, so maybe that's simply not in scope for it?

While reading OpenAI Operator System Card, the following paragraph on page 5 seemed a bit weird:

We found it fruitful to think in terms of misaligned actors, where:

the user might be misaligned (the user asks for a harmful task),

the model might be misaligned (the model makes a harmful mistake), or

the website might be misaligned (the website is adversarial in some way).

Interesting use of language here. I can understand calling the user or website misaligned, as understood as alignment relative to laws or OpenAI's goals. But why call a model misaligned when it makes a mistake? To me, misalignment would mean doing that on purpose.

Later, the same phenomenon is described like this:

The second category of harm is if the model mistakenly takes some action misaligned with the user’s intent, and that action causes some harm to the user or others.

Is this yet another attempt to erode the meaning of "alignment"?

Billionaire Larry Ellison says a vast AI-fueled surveillance system can ensure 'citizens will be on their best behavior'

Ellison is the CTO of Oracle, one of the three companies running the Stargate Project. Even if aligning AI systems to some values can be solved, selecting those values badly can still be approximately as bad as the AI just killing everyone. Moral philosophy continues to be an open problem.

I would have written a shorter letter, but I did not have the time.

Blaise Pascal

I am actually mildly surprised OA has bothered to deploy o1-pro at all, instead of keeping it private and investing the compute into more bootstrapping of o3 training etc.

I'd expect that deploying more capable models is still quite useful, as it's one of the best ways to generate high-quality training data. In addition to solutions, you need problems to solve, and confirmation that the problem has been solved. Or is your point that they already have all the data they need, and it's just a matter of speding compute to refine that?

They absolutely do. This phenomenon is called filter bubble.

I'd go a step beyond this: merely following incentives is amoral. It's the default. In a sense, moral philosophy discusses when and how you should go against the incentives. Superhero Bias resonates with this idea, but from a different perspective.

Yet Batman lets countless people die by refusing to kill the Joker. What you term "coherence" seems to be mostly "virtue ethics", and The Dark Knight is a warning what happens when virtue ethics goes too far.

I personally identify more with HPMoR's Voldermort than any other character. He seems decently coherent. To me "villain" is a person whose goals and actions are harmful to my ingroup. This doesn't seem to have much to do with coherence.

A reliable gear in a larger machine might be less agentic but more useful than a scheming Machiavellian.

The scheming itself brings me joy. Self-sacrifice does not. I assume this to be case for most people who read this. So if the scheming is a willpower restorer, keeping it seems useful. I'm not an EA, but I'd guess most of them can point to coherent-looking calculations on why what they're doing is better for effiency reasons as well.

LESSWRONG
LW

LESSWRONG
LW

Posts

Wikitag Contributions

Comments