This is a reference post. It explains a fairly standard class of arguments, and is intended to be the opposite of novel; I just want a standard explanation to link to when invoking these arguments.
When planning or problem-solving, we focus on the hard subproblems. If I’m planning a road trip from New York City to Los Angeles, I’m mostly going to worry about which roads are fastest or prettiest, not about finding gas stations. Gas stations are abundant, so that subproblem is easy and I don’t worry about it until harder parts of the plan are worked out. On the other hand, if I were driving an electric car, then the locations of charging stations would be much more central to my trip-planning. In general, the hard subproblems have the most influence on the high-level shape of our solution, because solving them eats up the most degrees of freedom.
In the context of AI alignment, which subproblems are hard and which are easy?
Here’s one class of arguments: compute capacity and data capacity are both growing rapidly over time, so it makes sense to treat those as “cheap” - i.e. anything which can be solved by throwing more compute/data at it is easy. The hard subproblems, then, are those which are still hard even with arbitrarily large amounts of compute and data.
In particular, with arbitrary compute and data, we basically know how to get the best possible predictive power on a given dataset: Bayesian updates on low-level physics models or, more generally, approximations of Solomonoff induction (Bayesian updating over all computable hypotheses, with simpler hypotheses given more prior weight). So we’ll also assume predictive power is “cheap” - i.e. anything which can be solved by more predictive power is easy.
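To make "Bayesian updates with a simplicity prior" concrete, here is a minimal toy sketch. It is not Solomonoff induction itself (which ranges over all programs and is uncomputable). It is a four-hypothesis caricature that only shows the shape of the update. The hypothesis list, the description lengths in bits, and the function names `posterior` and `predict_next` are all made up for illustration.

```python
import math

# Hypothetical toy hypothesis class. Each entry is
# (name, description_length_in_bits, predictor), where predictor(history)
# returns P(next bit = 1 | history). The lengths are invented for illustration.
HYPOTHESES = [
    ("all zeros", 2, lambda h: 0.01),
    ("all ones", 2, lambda h: 0.99),
    ("alternate", 3, lambda h: 0.99 if (not h or h[-1] == 0) else 0.01),
    ("fair coin", 1, lambda h: 0.5),
]


def posterior(data):
    """Return posterior weights, proportional to 2^(-length) * likelihood."""
    log_weights = []
    for _name, length, predict in HYPOTHESES:
        log_w = -length * math.log(2)  # simplicity prior: 2^(-length)
        for i, bit in enumerate(data):
            p_one = predict(data[:i])  # prediction made before seeing bit i
            log_w += math.log(p_one if bit == 1 else 1.0 - p_one)
        log_weights.append(log_w)
    # Normalize in log space to avoid underflow on long sequences.
    m = max(log_weights)
    unnorm = [math.exp(w - m) for w in log_weights]
    total = sum(unnorm)
    return [u / total for u in unnorm]


def predict_next(data):
    """Posterior-weighted mixture prediction for P(next bit = 1 | data)."""
    weights = posterior(data)
    return sum(w * predict(data)
               for w, (_name, _length, predict) in zip(weights, HYPOTHESES))


if __name__ == "__main__":
    data = [1, 0] * 5  # an alternating sequence ending in 0
    for (name, _, _), w in zip(HYPOTHESES, posterior(data)):
        print(f"{name:>10}: {w:.4f}")
    print("P(next = 1):", round(predict_next(data), 4))
```

With no data, the shortest hypothesis ("fair coin") carries the most weight. After ten alternating bits, nearly all the weight has moved to "alternate". The point for the argument here is that this step is mechanical: given the hypothesis class and the data, more compute just buys a larger class and a better approximation.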
This assumption also matches machine learning practice - once a problem is reduced to predictive power on some dataset, we can throw algorithms at it until it’s solved. The hard part - as many data scientists will attest - is reducing our real objective to a prediction problem and collecting the necessary data. It’s rare to find a client with a problem where all we need is predictive power and the necessary data is just sitting there.
(We could also view this as an interface argument: “predictive problems” are a standard interface, with libraries, tools, algorithms, theory and specialists all set up to handle them. As in many other areas, setting up our actual problem to fit that interface while still consistently doing what we want is the hard/expensive part.)
The upshot of all this: in order to identify alignment subproblems which are likely to be hard, it’s useful to ask what would go wrong if the world-modelling parts of our system just do Bayesian updates on low-level physics models or use approximations of Solomonoff induction. We don’t ask this because we actually expect to use such algorithms, but rather because we expect that the failure modes which still appear under such assumptions are the hard failure modes.