People frequently ask me about my backstory - how I got into alignment/agency research, what I did before, that sort of thing. One of the main things I emphasize is that I was thinking about analogous problems in economics and especially in biology, and I think the view from that angle made it much more obvious where AI alignment was going to run into the same major barriers.

Below is an essay I wrote in summer 2017, arguing that understanding foundational problems of agency is the primary bottleneck to progress in a wide variety of scientific fields. Hopefully this will give some idea of where my views on alignment/agency research stem from.

The Scientific Bottleneck

Imagine you’re in a sci-fi universe in the style of Star Trek or Stargate or the like. You’ve bumped into a new alien species, drama ensued, and now you’re on their ship and need to hack into their computer system. Actually, to simplify the discussion, let’s say you’re the aliens, and you’re hacking into the humans’ computer system.

Let’s review just how difficult this problem is.

You’re looking at billions of tiny electronic wires and switches and capacitors. You have a rough idea of the high-level behavior they produce - controlling the ship, navigating via the stars, routing communications, etc. But you need to figure out how that behavior is built up out of wires and switches and electronic pulses and whatnot. As a first step, you’ll probably scan the whole CPU and produce a giant map of all the switches and wires and maybe even run a simulation of the system. But this doesn’t really get you any closer to understanding the system or, more to the point, any closer to hacking it.

So how can we really understand the computer system? Well, you’ll probably notice pretty quickly that there’s regular patterns on the CPU. At the low level, there’s things like wires and switches. You might also measure the voltages in those wires and switches, and notice that the exact voltage level doesn’t matter much; there’s high voltages and low voltages, and the exact details don’t seem to matter once you know whether it’s high or low. Then you might notice some higher-level structures, patterns of wires and switches which form other standard elements, like memory elements and logic gates. But eventually, you’re going to exhaust the “hardware” properties, and you’ll need to start mapping “software”. That problem will be even harder: you’ll basically be doing reverse compilation, except you’ll need to reverse compile the operating system at the same time as the programs running on it, and without knowing what language(s) any of those programs were written in.
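To make the high/low observation concrete, here is a minimal Python sketch. The 2.5 V threshold, the 0 V / 5 V nominal levels, and the ±1 V noise range are all hypothetical numbers picked for illustration, not a description of any real chip. The only point is that a gate's behavior is fully determined once each voltage has been collapsed to a bit.

```python
import random

V_THRESHOLD = 2.5  # hypothetical cutoff between "low" and "high", in volts


def to_bit(voltage: float) -> int:
    """Collapse an exact voltage into the high/low abstraction."""
    return 1 if voltage >= V_THRESHOLD else 0


def nand(a: int, b: int) -> int:
    """A logic gate defined purely at the bit level."""
    return 0 if (a and b) else 1


def noisy_voltage(level: int, rng: random.Random) -> float:
    """Hypothetical wire: nominal 0 V or 5 V, plus up to 1 V of noise."""
    nominal = 5.0 if level else 0.0
    return nominal + rng.uniform(-1.0, 1.0)


def gate_outputs(a: int, b: int, trials: int = 1000, seed: int = 0) -> set:
    """Collect every NAND output seen across many noisy voltage readings."""
    rng = random.Random(seed)
    return {
        nand(to_bit(noisy_voltage(a, rng)), to_bit(noisy_voltage(b, rng)))
        for _ in range(trials)
    }
```

Under these assumptions the noise never crosses the threshold, so `gate_outputs(1, 1)` contains a single value no matter which exact voltages were drawn. That is the sense in which the exact voltage "doesn't matter much".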

That’s basically the state of biology research today.

There’s millions of researchers poking at this molecule or that molecule, building very detailed pictures of small pieces of the circuitry of living organisms. But we don’t seem much closer to decoding the higher-level language. We don’t seem any closer to assigning meaning to the signals propagating around in the code of living organisms.

Of course, part of the problem is that organisms weren’t written in any higher-level language. They were evolved. It’s not clear that it’s possible to assign meaning to a single molecular signal in a cell, any more than you could assign meaning to a single electron in a circuit. There certainly is meaning somewhere in the mess - organisms model their environments, so the information they’re using is in there somewhere. But it’s not obvious how to decode that information.

All that said, biologists have a major advantage over aliens trying to hack human computer systems: software written by humans is *terrible*. (Insert obligatory Java reference here.) Sure, there’s lots of abstraction levels, lots of patterns to find, but there’s no universal guiding principle.

Organisms, on the other hand, all came about by evolution. That means they’re a mad hodgepodge of random bits and pieces, but it also means that every single piece in that hodgepodge is *optimized*. Every single piece has been tweaked toward the same end goal.

The Problem: General

There’s a more general name for systems which arise by optimization: adaptive systems. Typical examples include biological organisms, economic/financial systems, the brain, and machine learning/AI systems.

Each of these fields faces the same fundamental problem as biology: we have loads of data on the individual components of a big, complicated system. Maybe it’s protein expression and signalling in organisms, maybe it’s financial data on individual assets in an economy, maybe it’s connectivity and firing data on neurons in a brain, maybe it’s parameters in a neural network. In each case, we know that the system somehow processes information into a model of the world around it, and acts on that model. In some cases, we even know the exact utility function. But we don’t have a good way to back out the system’s internal model.

What we need is some sort of universal translator: a way to take in protein expression data or neuron connectivity or what have you, and translate it into a human-readable description of the system’s internal model of the world.

Note that this is fundamentally a theory problem. The limiting factor is not insufficient data or insufficient computing power. Google throws tremendous amounts of data and computational resources into training neural networks, but decoding the internal models used by those networks? We lack the mathematical tools to even know where to start.


A while ago I wrote a post on the hierarchy of the sciences, featuring this diagram:

Yeah, I know, it's kinda cheesy. It was five years ago, ok?

The dotted line is what I called the “real science and engineering frontier”. The fields within the line are built on robust experiments and quantitative theory. Their foundations and core principles are well-understood, enough that engineering disciplines have been built on top of them. The fields outside have not yet reached that point. Fields right on the frontier or just outside are exciting places to be - these are the fields which are, right now, crossing the line from crude experiments and incomplete theories to robust, quantitative sciences.

What’s really interesting is that the fields on or just outside the frontier - biology, AI, economics, and psychology - are exactly the fields which study adaptive systems. And they are all stuck on qualitatively similar problems: decoding the internal models of complex systems.

This suggests that the lack of mathematical tools for decoding adaptive systems is the major bottleneck limiting scientific progress today.

Removing that bottleneck - developing useful theory for decoding adaptive systems - would unblock progress in at least four fields. It would revolutionize AI and biology almost overnight, and economics and psychology would likely see major advances shortly thereafter.


Let’s make the problem a little more concrete. Here are a few questions which a solid theory of adaptive systems should be able to answer.

  • How can we recognize adaptive systems in the wild? What universal behaviors indicate an adaptive optimizer?
  • There are already strong theoretical reasons to believe that any adaptive system which predicts effectively has learned to approximate some Bayesian model; the history of machine learning provides plenty of evidence supporting the theory as well. Given a fully specified adaptive system, e.g. a trained neural network, how can we back out the Bayesian model which it approximates?
  • Bayesian models are constrained by the rules of probability, but we can also add the rules of causality. How can we tell when an adaptive system (e.g. a neural net) has learned to approximate a causal model, and how can we back out that model?
  • Outside of machine learning/AI, utility functions are generally unknown. We know that e.g. a bacterium has evolved to maximize evolutionary fitness, but how can we estimate the shape of the fitness function based on parameters of the optimized system?
  • Under what conditions will an adaptive system learn models with levels of abstraction? How can those abstractions be translated into something human-readable?
  • Once the fitness function and internal models used by a bacterium have been decoded, how can new information or objectives be passed back into the cell via chemical concentrations or genetic modification? More generally, how can human-readable information (including probabilities, causal relationships, utility, and abstractions) be translated back into the parameter space of an adaptive system?

Obviously this list is just a start, but it captures the flavor of the major problems.
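As a toy version of the second question, here is a hedged Python sketch of a case where the Bayesian model can be read straight off a trained system. Everything in it is an assumption made for illustration: the count table `COUNTS`, the one-feature logistic model, and the plain gradient-descent training loop. With a single binary input the model is small enough that its fitted predictions coincide with the posterior from Bayes' rule. The open problem described above is doing anything comparable for systems with millions of parameters.

```python
import math

# Hypothetical toy data: counts of (x, y) pairs, where x is a binary
# observation and y is the binary outcome the system learns to predict.
COUNTS = {(0, 0): 80, (0, 1): 20, (1, 0): 30, (1, 1): 70}


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def train(counts, steps: int = 2000, lr: float = 1.0):
    """Fit p(y=1 | x) = sigmoid(w*x + b) by gradient descent on log loss."""
    w, b = 0.0, 0.0
    n = sum(counts.values())
    for _ in range(steps):
        grad_w = grad_b = 0.0
        for (x, y), c in counts.items():
            err = sigmoid(w * x + b) - y  # gradient of log loss w.r.t. the logit
            grad_w += c * err * x
            grad_b += c * err
        w -= lr * grad_w / n
        b -= lr * grad_b / n
    return w, b


def bayes_posterior(counts, x: int) -> float:
    """P(y=1 | x) via Bayes' rule, using empirical prior and likelihoods."""
    n = sum(counts.values())
    total = {y: counts[(0, y)] + counts[(1, y)] for y in (0, 1)}
    prior = {y: total[y] / n for y in (0, 1)}
    likelihood = {y: counts[(x, y)] / total[y] for y in (0, 1)}
    numerator = likelihood[1] * prior[1]
    return numerator / (numerator + likelihood[0] * prior[0])
```

For this data, `sigmoid(b)` and `sigmoid(w + b)` after training should agree with `bayes_posterior(COUNTS, 0)` and `bayes_posterior(COUNTS, 1)`, which are 0.2 and 0.7. The learned weight `w` is then the log likelihood ratio between the two outcomes, which is the "backed out" model in this tiny case.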


8 comments

"Every single piece has been tweaked toward the same end goal"

I feel like there'll be a better way to say this sentence once we figure out the answer to your first question:

"How can we recognize adaptive systems in the wild? What universal behaviors indicate an adaptive optimizer?"

It most definitely seems to make sense to say that systems can have goals, in a "if it looks like a duck it makes sense to call it a duck" kind of way. But at the same time, the individual pieces haven't each been tweaked toward the same end goal as the system. Each is tweaked toward its own survival, and that's only somewhat aligned with the system's survival.

Something I wish there were more LessWrong posts on (or at least wish I'd seen more of) is alignment explored in the context of:

  1. Organisms and their smaller replicator components (organisms < cells, cells < transposons & endoviruses & organelles)
  2. Social thingies and their smaller sorta-replicator components (religions < religious ideas, companies < replicating management ideas)

If you have your favorite post that falls into the above genre or mentions something to that effect, please absolutely link me to it! I'd love to read more.

I'm only halfway through the A-Z sequence, so I'd also very much appreciate it if you could point to things in there to get me excited about progressing through it!

I don't think we have hope of developing such tools, at least not in a way that looks like anything we had in the past. In the past we have been able to analyse large systems by throwing away an immense amount of detail - it turns out that you don't need the specific position of atoms to predict the movement of the planets, and you don't need the details to predict all of the other things we have successfully predicted with traditional math.

With the systems you are describing, this is simply impossible. Changing a single bit in a computer can change its output completely, so you can't build a simple abstraction that predicts it, you need to simulate it completely. 

We already have a way of taking immense amounts of complicated data and finding patterns in it: machine learning itself. If you want to translate what it learned into human-readable descriptions, you just have to incorporate language into it. Humans, after all, can describe their reasoning steps and why they believe what they believe (maybe not easily).

"Google throws tremendous amounts of data and computational resources into training neural networks, but decoding the internal models used by those networks? We lack the mathematical tools to even know where to start."

I predict this will be done in the coming years by using large multimodal models to analyse neural network parameters, or to explain their own workings.

"Changing a single bit in a computer can change its output completely, so you can't build a simple abstraction that predicts it, you need to simulate it completely."

Biology is complex, but changing a single molecule in a bacterium or a single neuron in a brain doesn't completely change the output, because they've evolved to be robust to such things.

I’m not sure the problem in biology is decoding, at least not in the same sense as it is with neural networks. I see the main difficulty in biology as more one of mechanistic inference, where a major roadblock may be getting better measurements of what is going on in cells over time, rather than finding some algorithm that can overcome the fact that biological data combines very high levels of molecular noise with single snapshots in time that are difficult to place in context. With a neural network you have the parameters, and it seems reasonable to say you just need some math to make it more interpretable.

Whereas in biology I think we likely need both better measurements and better tools. I’m not sure the same tools would be particularly applicable to the AI interpretability problem either.

If, for example, I managed to create mathematical tools to reliably learn mechanistic dependencies between proteins and/or genes from high-dimensional biological data sets, it’s not clear to me that they would be easily applicable to extracting Bayes nets from large neural networks.

I’m coming at this from a comp bio angle so it’s possible I’m just not seeing the connections well, having not worked in both fields.

Strong-voted. This is so exciting.
Any specific research avenues where AI and economics research could overlap?

Around the time I first got into alignment, I was thinking about how to model markets as agents (e.g. what beliefs does a market as a whole have? What goals does it have?). That turned into "Why Subagents?"

I also spent a little bit of time reading up on Theory of the Firm, looking for alignment-relevant ideas; there's a lot of stuff there about aligning employees and firms, or when it makes sense for a firm to outsource (i.e. use "subagents") vs do things in-house, etc. That eventually led to the Pointers Problem post (via the ideas in Incentive Design With Imperfect Credit Allocation).

I expect there's plenty more useful analogies to mine on either of those paths, and probably many other paths besides. Though note that this does require a nontrivial skill: one needs to be able to boil down the generalizable "core idea" of an argument, in a form which can carry over to another field.

What would such a representation look like for a computer? There might exist some method for computing how the circuits are divided into modules and submodules, but how would you represent what they do? You don’t expect it to be annotated in natural language, do you?

I mean, just in case I wasn’t clear enough, you want a program that takes in a representation of some system and outputs something a human can understand, right? But even if you could automatically divide a system into a tree of submodules such that a human could in principle describe how any one works in terms of short descriptions of the function of its submodules, there is no obvious way of automatically computing those descriptions. So if you gave a circuit diagram of a CPU as the input to that universal translator, what do you want it to output?