Look at this lead pencil. There’s not a single person in the world who could make this pencil.

Goals for this post

  • Bring Systems Theoretic Process Analysis (STPA) to the attention of the AI alignment community. STPA is the product of impressive work carried out over the past decade in safe system design and systems theory, and it offers both an approach and a framework for building safe systems. (STPA handbook, STPA tutorial video)
    • Use this to suggest that a means of creating a stable minimum of alignment might be found within the STPA framework
  • Present a story that demonstrates the possibility for interpretability by design to have a similar but more palatable effect to "burning all the GPUs".
    • And that it has a possibility of working if we act in a coordinated way, ASAP.

How possible might something like what is described here be?

Maybe vanishingly so...

But I hope that it passes the bar of a non-zero probability that justifies some targeted focus by a few within the community.

I should also say that its fundamental Achilles' heel is its dependence on mechanistic interpretability, a field dogged by issues such as polysemanticity. It looks very hard to resolve those dependencies in time for something like the below to be remotely possible.


The story written here is a "myth" in the Socratic sense, a not unlikely tale. It makes the following assumptions (quite possibly incorrectly) about reality:

  1. Assume that the machine learning accelerator market will eventually become dominated by ASICs that provide over-and-above performance on the architecture most commonly utilised by those spending the most money on training large language models
  2. Once we have such ASICs, and the capability, quality and efficiency gains from these are shown, with time more general purpose ML accelerators will rapidly fall behind the application specific state-of-the-art, potentially by orders of magnitude
  3. Assume that a transformer architecture can be restricted, tweaked, and refined in such a way that there exists a one-to-one mapping between the internals of a model trained under the constrained architecture and an STPA control diagram
  4. Assume that the above constraints can be layered in such a way as to enforce a degree of authority and hierarchy within the STPA control diagram
  5. Assume that mechanistic interpretability, aided by the more easily interpretable network architecture and training approaches, is able to read, understand, and subsequently meaningfully update the key components of the resulting STPA control diagram


Systems Theoretic Process Analysis (STPA)

To understand how the below might actually be in the realm of possibility, let me introduce you to STPA.

To truly maintain a safe system, the goal is not to design "safe components" but to design "safe interacting nested mini-mesa optimisers". This is the concept central to STPA. Recognise that components fail, that environments provide unknowns, and that even if every component works, a broken optimiser contained within a "working" system can result in catastrophic failure.

STPA provides a systematic and automatable way to provide verifiable guarantees about the quality of the internal optimisers that are built into a system.
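One part of STPA that lends itself to automation is the enumeration of candidate unsafe control actions (UCAs): each control action is checked against the four ways the STPA Handbook says a control action can be unsafe. The sketch below only generates the candidate list for an analyst to review. The control action names are hypothetical placeholders, not taken from any real analysis.

```python
from itertools import product

# The four ways a control action can be unsafe, paraphrased from the STPA Handbook.
UCA_TYPES = (
    "not provided when needed",
    "provided when it causes a hazard",
    "provided too early, too late, or out of order",
    "stopped too soon or applied too long",
)

# Hypothetical control actions for an LLM-like controller (illustrative only).
CONTROL_ACTIONS = ("emit_token", "call_tool")


def uca_candidates(actions):
    """Pair every control action with every UCA type for later human review."""
    return [(action, uca_type) for action, uca_type in product(actions, UCA_TYPES)]


candidates = uca_candidates(CONTROL_ACTIONS)
for action, uca_type in candidates:
    print(f"{action}: {uca_type}")
```

Each pair is a prompt for the question "in what context would this be hazardous?"; the enumeration itself says nothing about which candidates are real hazards.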

Here is a tutorial that introduces the concept:

If people in the alignment community are looking for further information on how to skill up in STPA, see John Thomas and Nancy Leveson's material here:

But at the end of the day, it might actually be best for the two communities to reach out to each other directly, collaborate, and help one another.

Interpretability by design

Here is some early progress on adjusting the architecture of a machine learning model to make interpretability more possible:

A not unlikely tale

Let's imagine a world where the AI alignment community forms deep ties with those who have been working in safe systems design (John Thomas and Nancy Leveson). Out of this cross-pollination of fields comes a discovery that very specific restrictions on the transformer training approach and architecture can actually result in a one-to-one mapping between the internal trained circuits of an LLM and an STPA control diagram.

One way this might be possible is to partition the residual stream and enforce specific roles of either "feedback", "feedback modelling", "processing for action", or "control action" upon components within the network. This role enforcement might be achievable by limiting which parts of the residual stream a given attention head or neuron has read and/or write access to.
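As a rough illustration of the partitioning idea, here is a minimal sketch in plain Python. Everything in it is an assumption made for illustration: the 8-dimensional stream, the partition boundaries, and the `apply_component` helper are hypothetical, and a real implementation would apply such masks to the weights of attention heads and MLPs rather than to Python lists.

```python
# Hypothetical layout: an 8-dimensional residual stream split into four role partitions.
D_MODEL = 8
PARTITIONS = {
    "feedback": range(0, 2),
    "feedback_modelling": range(2, 4),
    "processing_for_action": range(4, 6),
    "control_action": range(6, 8),
}


def role_mask(roles):
    """Return a 0/1 list marking the residual dimensions owned by the given roles."""
    mask = [0] * D_MODEL
    for role in roles:
        for i in PARTITIONS[role]:
            mask[i] = 1
    return mask


def apply_component(residual, update_fn, reads, writes):
    """Run one component with enforced read and write access.

    The component only sees dimensions it may read (the rest are zeroed),
    and its output is only added to dimensions it may write.
    """
    read_mask = role_mask(reads)
    write_mask = role_mask(writes)
    visible = [x * m for x, m in zip(residual, read_mask)]
    delta = update_fn(visible)
    return [x + d * m for x, d, m in zip(residual, delta, write_mask)]


def sum_everything(visible):
    # Toy stand-in for a head or neuron: tries to write the sum of its inputs everywhere.
    total = sum(visible)
    return [total] * len(visible)


residual = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
out = apply_component(
    residual,
    sum_everything,
    reads=["feedback"],
    writes=["feedback_modelling"],
)
print(out)
```

Even though the toy component tries to write to every dimension, only the "feedback modelling" partition changes, and the value written depends only on the "feedback" partition.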

To completely make this work, we may need to bring some form of recurrence back into components of the network architecture.

To make the whole system easier to map to STPA, it might also help to limit the number of reads and writes that a controller circuit component is allowed to perform on the residual stream, which would enforce an effective abstraction and nesting requirement. There may also be value in having a "chain of command" concept within the residual stream, with some parts "off limits" to components that are lower down the chain. This might help the final STPA control diagram provide non-leaky hierarchical abstractions.

Lean in on the (likely overly optimistic) assumption that these easier-to-interpret internals equip us to come in and directly turn the knobs on our control diagram. We are then in a domain where our top-level mesa optimiser might be tweaked and adjusted, using the various abstractions exposed by the STPA system, to achieve a Human Compatible AI. This would not use Stuart Russell's approach of CIRL, but it could potentially embed the heart of that approach in the primary controller.
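The access budget and "chain of command" could be expressed as a simple validity check over a wiring specification. This is a sketch under stated assumptions: the partition names, the ranks, the budget of three accesses, and the reading of "off limits" as "may not write to a higher-authority partition" are all choices made here for illustration.

```python
from dataclasses import dataclass

# Hypothetical chain of command: a lower number means higher authority.
PARTITION_RANK = {"layer0_goals": 0, "supervisor": 1, "worker": 2}


@dataclass(frozen=True)
class Component:
    name: str
    rank: int       # position of the component in the chain of command
    reads: tuple    # partitions the component may read
    writes: tuple   # partitions the component may write


def violations(component, max_accesses=3):
    """List the ways a component breaks the access rules of this sketch."""
    problems = []
    if len(component.reads) + len(component.writes) > max_accesses:
        problems.append("too many reads/writes")
    for partition in component.writes:
        # Assumption: "off limits" means no writing to higher-authority partitions.
        if PARTITION_RANK[partition] < component.rank:
            problems.append(f"writes to off-limits partition {partition}")
    return problems


ok = Component("worker_head", 2, reads=("supervisor",), writes=("worker",))
bad = Component("rogue_head", 2, reads=("worker",), writes=("layer0_goals",))
print(violations(ok))
print(violations(bad))
```

A check like this would run over the architecture specification before training, so that any trained model is guaranteed by construction to respect the hierarchy.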

But, this is only the first step. Let's say all the above was possible. "Well done". You've now made a safe AI... But... you've done nothing to stop unsafe AI, of which, unless explicitly countered, there will likely be many.

Suppose we are able to come up with an architecture that maps to STPA, and it can be shown to be significantly more controllable and updatable after training. If all of the major labs then choose to collaborate and make sure that their next training runs on the H100 "Hopper" chips instead use an architecture like the above, then when the series of ASICs is built, it will be built targeting this interpretable and controllable system design.

As more ASICs are built in the image of the next generation of trained LLMs, general purpose GPUs will become less cost efficient, and deviating architectures within subsequent cycles will become significantly more costly.

If a beneficial forcing function within the STPA hierarchy can be found, it makes sense for this component of the network to be built into every ASIC that is produced. At the end of the day, if it is decided that every network needs at least this base "Layer 0" controller, then it makes most sense for these key driving circuits to be directly hardcoded into every ASIC produced.

The determination of this "Layer 0" might be able to be achieved through collaboration, through regulation, or by the key market demand (large labs) collaborating on this top level forcing function within the STPA control hierarchy.

From here, as time passes, fewer generic GPUs get built. And... every AI built comes prebaked with a "boot sequence" controller. And... for all intents and purposes, this has the effect of "unaligned GPUs" getting "burned", simply because they cannot keep up with the massive supply chains behind the common controller architecture.

As time goes on, generic GPU factories are repurposed and all tooling begins to align around the new architecture. Research and design build upon this architecture. And the cost of trying to make a superintelligence against the grain of the market helps keep us on this new game board. A game board where humanity might one day explore the stars.

So where would one start?

Inspired by toy models like the following:


If assumptions 3, 4, and 5 could be shown to have some grounding in reality by demonstrating them in a toy model, then the above approach might have legs. So, the way I see it, building a toy model whose internals can be mapped to an STPA control diagram would be a good place to start.
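To make "mapped to an STPA control diagram" slightly more concrete, here is a minimal sketch of how the edges of such a diagram might be read off a toy model's access table. The component names, partition names, and the `control_edges` helper are hypothetical. The sketch assumes that read and write access is already enforced by the architecture, which is exactly what the toy model would need to demonstrate.

```python
# Hypothetical access table for a three-component toy model.
ACCESS = {
    "sensor_head": {"reads": {"input"}, "writes": {"feedback"}},
    "model_mlp": {"reads": {"feedback"}, "writes": {"process_model"}},
    "actor_head": {"reads": {"process_model"}, "writes": {"control_action"}},
}


def control_edges(access):
    """Return (writer, partition, reader) triples.

    An edge exists when one component writes a partition that another reads,
    i.e. the writer can influence the reader through that partition.
    """
    edges = []
    for writer, w in sorted(access.items()):
        for reader, r in sorted(access.items()):
            if writer == reader:
                continue
            for partition in sorted(w["writes"] & r["reads"]):
                edges.append((writer, partition, reader))
    return edges


edges = control_edges(ACCESS)
for writer, partition, reader in edges:
    print(f"{writer} --[{partition}]--> {reader}")
```

If the enforced access table is the only channel between components, the edge list is the complete wiring of the control diagram. What each edge carries would still have to be established by interpretability work.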
