In the last post, I expressed that it would be nice to have safety data sheets for different optimization processes. These data sheets would describe the precautions necessary to use each optimization process safely. This post describes an approach for writing such safety guidelines, and some high-level guidelines for safely applying optimization.

The Safety Mindset

We can think of an optimization metric as defining what an optimization process is "trying" to do. I'm also persuaded by Yudkowsky's take that we shouldn't build software systems that are "trying" to do things we don't want them to do:

Suppose your AI suddenly became omniscient and omnipotent - suddenly knew all facts and could directly ordain any outcome as a policy option. Would the executing AI code lead to bad outcomes in that case? If so, why did you write a program that in some sense 'wanted' to hurt you and was only held in check by lack of knowledge and capability? Isn't that a bad way for you to configure computing power? Why not write different code instead?

That whole article is worth a read; it describes what Yudkowsky calls the safety mindset, analogous to the security mindset. Here the real adversary isn't hackers trying to subvert your system: it's your own mind, which makes assumptions during system design that might not hold in all the contexts your system might operate in. One approach that Yudkowsky advocates in both contexts is to limit those assumptions as much as possible.

Safety Guidelines

Here is a high-level overview of some guidelines that might appear on a safety data sheet for optimization processes:

  • Think about other ways to achieve the same goal, other than using optimization.
    • Optimization can be dangerous, and often many of the benefits can be achieved by other means.
    • For example, instead of asking your system to search over ways to behave, just tell it what you want it to do.
  • If you need to use optimization, use it as little as possible.
    • To the greatest extent possible, your software's behavior should be defined outside of an optimization process.
    • Optimization should occupy as small a module as possible in your system's architecture.
  • Notice which assumptions need to hold for your system to operate safely. Try to limit these as much as possible.
    • For example, don't rely on the assumption that your software will never be fed an invalid input. Design your software so that it operates correctly no matter what input it receives.
    • The rest of these guidelines can be framed in terms of highlighting an implicit assumption one might make when using optimization in one's software systems.
  • Only use well-vetted implementations of optimization processes, whose properties are well-understood.
    • An optimization process must never output a solution which is not in the domain of valid candidates, for example.
  • Come up with criteria for what would constitute a safe candidate. Use these constraints to ignore unsafe candidates during the search procedure.
    • This is a computational speedup, and it also helps to prevent a correctly-working optimization process from ever having an unsafe output.
    • Often this can be done before optimization even starts, by limiting the domain of candidates in the first place.
      • A thermostat doesn't need to consider lighting a small fire in its owner's home, and then reject that course of action due to it being unsafe. It can be designed to simply not consider unsafe actions in the first place.
    • An allow list of known-safe candidates is safer than an ignore list of known-unsafe candidates.
    • These constraints can be extremely aggressive, constraining the space of candidates quite a lot. We can then carefully loosen these constraints later if we want to add candidates back in.
  • Manually check the outputs of optimization processes anyway.
    • They might be unsafe in a way that you didn't expect when you were thinking of safety criteria.
    • This lets you catch unsafe outputs before they're used, and can highlight the need to add more safety constraints or further limit the domain of candidates.
  • Don't deploy programs that you don't understand.
    • Optimization processes can be used to generate programs whose workings are not understood by anyone.
    • Do not deploy these programs. Ideally, don't search over inscrutable computer programs like neural networks in the first place.
  • If you do need to use programs you don't understand, use them in as little of your system architecture as possible.
    • Building up a 3D model of a scene from camera footage, for example, is currently beyond what software engineers can manually write a program to do. But once this is done, navigation might be straightforward.
    • It's better to have an opaque modelling process that feeds into a well-understood action-selection process than to have a system that was trained end-to-end to opaquely map sensor data to actions.
      • At least with the modelling process, we can check our program against the ground truth of an actual 3D scene as seen by a camera.
  • Formally verify that your software really has all the properties it needs to work safely.
    • Formal verification is great for identifying which assumptions need to hold for your software to operate correctly: they're listed at the top of any correctness proof as its axioms.
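
The "constrain the domain" guideline above can be sketched in a few lines of code. This is a toy illustration of the thermostat example; the action names and the error model are invented for this sketch, not drawn from any real system.

```python
# Sketch: restrict an optimizer's domain to an allow list of known-safe
# candidates *before* searching, rather than filtering unsafe outputs
# afterward. A correctly-working optimizer then cannot emit an unsafe
# action, because unsafe actions were never candidates at all.

SAFE_ACTIONS = ["heat_on", "heat_off", "fan_on", "fan_off"]  # allow list

def predicted_error(action, current_temp, target_temp):
    """Toy model of how far from the target each action leaves us."""
    effect = {"heat_on": +1.0, "heat_off": 0.0, "fan_on": -0.5, "fan_off": 0.0}
    return abs((current_temp + effect[action]) - target_temp)

def choose_action(current_temp, target_temp):
    # The search ranges only over the allow list. "Light a small fire"
    # is not in the domain, so it never needs to be considered and rejected.
    return min(SAFE_ACTIONS,
               key=lambda a: predicted_error(a, current_temp, target_temp))

print(choose_action(18.0, 20.0))  # prints heat_on
```

Note that the safety property here doesn't depend on the error model being any good: even a badly miscalibrated `predicted_error` can only ever select from `SAFE_ACTIONS`.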

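As a small illustration of the formal-verification guideline, here is a sketch in Lean (assuming Mathlib) proving that a clamped output always stays within safe bounds. The assumption `lo ≤ hi` appears explicitly as a hypothesis, which is exactly the point: the proof forces us to write down what must hold for the safety property to follow.

```lean
import Mathlib.Order.Lattice

-- If the bounds are well-ordered (lo ≤ hi), then clamping any input x
-- into [lo, hi] yields a value guaranteed to lie within those bounds.
theorem clamped_safe (x lo hi : ℝ) (h : lo ≤ hi) :
    lo ≤ max lo (min hi x) ∧ max lo (min hi x) ≤ hi :=
  ⟨le_max_left lo _, max_le h (min_le_left hi x)⟩
```
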
The State of AI Safety

The dominant AI-building paradigms used today generally do not provide any safety guarantees at all. It's straightforward to come up with a family of computer programs which are parameterized by lists of numbers, such as neural networks. And it's standard practice to pick an optimization metric and apply the magic of calculus to efficiently search this large space of computer programs for candidates that score highly on that metric.

This is already not great from a safety perspective. But it turns out that the resulting programs are also generally totally inscrutable to human inspection! It takes additional work, which we don't currently know how to do in general, to take a learned program and turn it into a well-documented and human-legible codebase.

That paradigm is basically a recipe for disaster. And the entire field of AI Safety, and this entire sequence, is about coming up with a different recipe, which reliably allows us to safely build and deploy highly capable software systems.

Up next: back to game theory! Where I think optimization can play an important role in achieving good outcomes.

Comments

But it turns out that the resulting programs are also generally totally inscrutable to human inspection!

The paper "White-Box Transformers via Sparse Rate Reduction: Compression Is All There Is?" (from a team mostly in Berkeley) claims to have found a transformer-like architecture with the property that the optima of the SGD training process applied to it are themselves (unrolled) alternating optimization processes that optimize a known and understood informational metric. If that is correct, it would mean there is a mathematically tractable description of what models trained using this architecture are actually doing. So rather than just being enormous inscrutable black boxes made of tensors, their behavior can be reasoned about in closed form in terms of what the resulting mesa-optimizer is actually optimizing. The authors also claim that they would expect the internals of models using this architecture to be particularly sparse, orthogonal, and thus interpretable. If true, these claims both sound like they would have huge implications for the mathematical analysis of the safety of machine-learned models, and for Mechanistic Interpretability. Is anyone with the appropriate mathematical and Mech Interpretability skills looking at this paper, and at models trained using the variant of the transformer architecture that the authors describe?