Status: These were some rough ideas I was playing with in June of 2025. This still seems to me close to 'the best way AI systems could possibly develop', but I'm confident I haven't thought of everything.
The shape of AI development and its affordances fundamentally delineate a set of risks. These in turn inform the structure of the governance interventions necessary to mitigate them.
Here's a speculative question: what shape could AI development conceivably take, such that it would be easier to govern?
One shape might be a speculative paradigm I think of as ‘Safe by Design’. It has three main components:

1. Publicly available, decentralized, verifiable compute
2. Open-source, aligned, safeguarded, non-detuneable AI
3. Verifiably or effectively secure systems
I’ve seen writing about each of these in isolation, but I don’t think I’ve ever read a piece that integrates them and shows how these affordances create a risk portfolio that can be mitigated with a lower governance burden.
I also don’t think they make sense in isolation, for reasons I try to gesture at in the conclusion.
This is a speculative paradigm: I’m not arguing that these components are necessarily achievable, or that we’re on track to achieve them. At the same time, they don’t seem totally beyond the realm of possibility. Putting the idea together like this gives me hope.
1. Publicly available, decentralized, verifiable compute
By ‘publicly available’, I envisage a system of universal basic compute, such that people would be able to train, fine-tune and run models for basic purposes at a rate guaranteed by the state. This would ensure that decentralization is actually meaningful (since everyone is ‘enfranchised’ to participate in the decentralized network).
By ‘decentralized’, I mean that compute should be owned by groups and individuals, in a way that facilitates decentralized training runs. Crucially, this means that compute can’t be refused to particular actors. This would ensure that publicly available compute doesn’t have a single point of failure (e.g. a government or cloud provider withholding it).
By ‘verifiable’, I mean that compute can be verified to perform some tasks and not others. I think about this in the spirit of Flexible Hardware-Enabled Guarantees (FlexHEG). These guarantees would need to be tamper-proof, both physically and against cyberattacks and spoofing. I would want them to exclude tasks that could create harmful models (e.g. training models on prohibited virology data). I recognise that this seems extremely hard to operationalise and enforce.
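As a loose illustration of the kind of check such a mechanism might perform, here is a hypothetical sketch in Python. Everything here (the workload descriptor, the policy fields, `permit_workload`) is invented for illustration; real hardware-enabled guarantees would have to enforce something like this in tamper-resistant firmware, not in application code.

```python
from dataclasses import dataclass

# Hypothetical sketch of a FlexHEG-style policy check. Illustrative only:
# real flexible hardware-enabled guarantees would enforce the policy in
# tamper-resistant firmware or secure enclaves, not in user-space code.

@dataclass
class WorkloadDescriptor:
    task_type: str            # e.g. "training", "fine-tuning", "inference"
    dataset_tags: set         # declared provenance tags for the training data
    compute_budget_flops: float

@dataclass
class ComputePolicy:
    prohibited_tags: set          # e.g. {"prohibited-virology"}
    max_unverified_flops: float   # cap on jobs lacking third-party verification

def permit_workload(workload: WorkloadDescriptor, policy: ComputePolicy) -> bool:
    """Return True only if the declared workload satisfies the policy."""
    if workload.dataset_tags & policy.prohibited_tags:
        return False
    if workload.compute_budget_flops > policy.max_unverified_flops:
        return False
    return True
```

The hard part, as noted above, is not this check itself but making the declared workload trustworthy and the enforcement genuinely tamper-proof.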
The remaining governance burden would then fall on: tracking pre-FlexHEG-era chips (which would quickly become redundant and constitute a small portion of active compute), tracking non-FlexHEG chips and fabs (which seems easier), and continuing to monitor the efficacy of these measures (which also seems easier).
2. Open-source, aligned, safeguarded, non-detuneable AI
‘Open-source’ primarily means open-weights, but ideally I’d like training runs to be reproducible from the ground up (see OLMo).
By ‘safeguarded’, I mean that users couldn’t use the model to do harmful tasks. It seems like much of this would require building safeguards into the model’s value system, rather than implementing them at the API level, which seems difficult. Then of course there’s the problem of alignment.
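To make that distinction concrete, here is a crude, hypothetical sketch of an API-level safeguard: a refusal wrapper placed in front of the model (all function names are invented). For an open-weights model, whoever runs the weights can simply delete this layer, which is why safeguards would instead need to live in the model’s own values and behaviour.

```python
# Crude illustration of an API-level safeguard: a refusal wrapper around the
# model. Anyone running open weights locally can remove this layer, which is
# why it is a weak place to locate safety guarantees for open-source models.

def is_harmful(prompt: str) -> bool:
    # Placeholder check; a real deployment would use a trained moderation model.
    banned_phrases = ("synthesise a pathogen", "enhance transmissibility")
    return any(phrase in prompt.lower() for phrase in banned_phrases)

def guarded_generate(model, prompt: str) -> str:
    """Refuse at the API layer; the underlying model itself is unchanged."""
    if is_harmful(prompt):
        return "I can't help with that."
    return model.generate(prompt)
```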
By ‘non-detuneable’, I mean effectively non-detuneable: eliciting harmful capabilities from the trained model would require as much compute as training a model to perform at that level from scratch (ideally, on-chip guarantees would also make such fine-tuning difficult or impossible outright).
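One way to state that condition slightly more precisely (the notation is mine, not from any standard source): write \(C_{\text{elicit}}(M, c)\) for the compute needed to elicit a prohibited capability \(c\) from the released model \(M\), and \(C_{\text{scratch}}(c)\) for the compute needed to train a model with capability \(c\) from scratch. The model is effectively non-detuneable when

$$
C_{\text{elicit}}(M, c) \;\gtrsim\; C_{\text{scratch}}(c) \quad \text{for every prohibited capability } c,
$$

so that releasing the weights gives a would-be misuser essentially no compute discount over starting from nothing.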
The remaining governance measures would still be substantial. A lot of actions someone could take with an AI system are dual-use (e.g. cyberattacks), so you’d want to ensure that critical systems are constantly red-teamed and verified (see pillar 3). You’d also want some monitoring of multi-agent systems, to make sure they don’t cause harms that couldn’t be predicted from individual models.
3. Verifiably or effectively secure systems
By ‘systems’ here I refer to the societal infrastructure that AI models are navigating and co-creating: traditional digital and cyberinfrastructure initially, and then the increasingly sophisticated cyber-physical and cyber-biological domains as robotics and bio-integrated technology improve.
‘Verifiable’ system security would rest on mathematical proofs that the system satisfies its security properties (i.e. formal verification).
‘Effectively secure’ systems are systems that have been red-teamed by, and repeatedly proven robust against, a team with skills, resources and information greater than or equal to those of a particular adversary. Whilst effective security is a weaker guarantee, it may be sufficient, and it may also be feasible in a wider variety of domains than verifiable security.
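As a rough sketch of how that definition could be operationalised (all names, types and thresholds here are invented for illustration), a system might be counted as effectively secure against an adversary model only if it survives repeated attack campaigns sampled from that adversary’s skills and resources:

```python
import random
from dataclasses import dataclass
from typing import Callable, List

# Illustrative sketch: a system counts as 'effectively secure' against an
# adversary model if it withstands repeated attack campaigns sampled from
# that adversary's capabilities. All names and thresholds are invented.

@dataclass
class AdversaryModel:
    attempts_per_campaign: int                  # resources: attacks affordable per campaign
    techniques: List[Callable[[object], bool]]  # skills/information: known attacks; True = breach

def effectively_secure(system: object, adversary: AdversaryModel, campaigns: int = 10) -> bool:
    """Return True if the system survives every sampled attack across all campaigns."""
    for _ in range(campaigns):
        attacks = random.choices(adversary.techniques, k=adversary.attempts_per_campaign)
        if any(attack(system) for attack in attacks):
            return False
    return True
```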
Remaining governance measures would still be substantial. You’d probably still want some monitoring and incident reporting for high-stakes domains, along with contingency protocols and fallbacks.
Conclusion
One problem with this paradigm is that it’s fragile. Without one pillar, the other two might cause more harm than good. This makes transitioning to it difficult: if the different pillars don’t arrive at around the same time, working on any one of them could be harmful.
Differential acceleration—the idea that we should accelerate some parts of technological development but not others—should be tempered with strategic timing, whereby we ensure that the right technologies come ‘online’ in approximately the right sequence and proximity to achieve the optimal outcome.
The same goes for the properties within each pillar.
Some of this has the flavour of a triple moonshot. If you don’t get all three, plausibly you create some level of harm.
If we knew that we would get all three eventually, perhaps one could think of governance measures as actions we take to 'pick up the slack' while we develop the technological paradigm we truly need. But we don't know that we will get all three eventually, so it seems sensible to try other approaches too.
Perhaps one day humanity will be skilled enough at performing frontier science and shaping technological pathways that it can design those pathways like the technologies themselves: creating maximal benefits, at the cost of minimal risks.