# Ω 16

Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

Alignment Newsletter is a weekly publication with recent content relevant to AI alignment around the world. Find all Alignment Newsletter resources here. In particular, you can look through this spreadsheet of all summaries that have ever been in the newsletter.

Audio version here (may not be up yet).

# HIGHLIGHTS

Modeling the Human Trajectory (David Roodman) (summarized by Nicholas): This post analyzes the human trajectory from 10,000 BCE to the present and considers its implications for the future. The metric used for this is Gross World Product (GWP), the sum total of goods and services produced in the world over the course of a year.

Looking at GWP over this long stretch leads to a few interesting conclusions. First, until 1800, most people lived near subsistence levels. This means that growth in GWP was primarily driven by growth in population. Since then population growth has slowed and GWP per capita has increased, leading to our vastly improved quality of life today. Second, an exponential function does not fit the data well at all. In an exponential function, the time for GWP to double would be constant. Instead, GWP seems to be doubling faster, which is better fit by a power law. However, the conclusion of extrapolating this relationship forward is extremely rapid economic growth, approaching infinite GWP as we near the year 2047.
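The distinction between the two fits can be made concrete with a toy sketch (synthetic functions with made-up parameters, not Roodman's data or code): under an exponential, the doubling time is constant, while under a power law with a finite-time singularity the doubling time keeps shrinking as the singularity approaches.

```python
import numpy as np

# Toy illustration (not Roodman's actual data or model): compare doubling
# times under exponential growth vs. a power law with a finite-time singularity.

def exponential_gwp(t, g0=1.0, rate=0.02):
    """Exponential growth: doubling time is a constant ln(2) / rate."""
    return g0 * np.exp(rate * t)

def power_law_gwp(t, g0=1.0, t_singularity=2047.0, t0=2000.0, k=1.0):
    """Power law diverging at t_singularity: growth keeps speeding up."""
    return g0 * ((t_singularity - t0) / (t_singularity - t)) ** k

def doubling_time(f, t, dt=1e-4):
    """Instantaneous doubling time ln(2) / (d ln f / dt)."""
    growth_rate = (np.log(f(t + dt)) - np.log(f(t))) / dt
    return np.log(2) / growth_rate

# Exponential: the same doubling time at any date.
print(doubling_time(exponential_gwp, 1900), doubling_time(exponential_gwp, 2000))
# Power law: doubling time shrinks as we approach the singularity year.
print(doubling_time(power_law_gwp, 2000), doubling_time(power_law_gwp, 2040))
```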

Next, Roodman creates a stochastic model in order to analyze not just the modal prediction, but the full distribution over how likely particular outcomes are. By fitting this to only past data, he analyzes how surprising each period of GWP was. This finds that the industrial revolution and the period after it were above the 90th percentile of the model's distribution, corresponding to surprisingly fast economic growth. Conversely, the past 30 years have seen anomalously low growth, around the 25th percentile. This suggests that the model's stochasticity does not appropriately capture the real world -- while a good model can certainly be "surprised" by high or low growth during one period, it should probably not be consistently surprised in the same direction, as happens here.

In addition to looking at the data empirically, he provides a theoretical model for how this accelerating growth can occur by generalizing a standard economic model. Typically, the economic model assumes technology is a fixed input or has a fixed rate of growth and does not allow for production to be reinvested in technological improvements. Once reinvestment is incorporated into the model, then the economic growth rate accelerates similarly to the historical data.
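A minimal sketch of this reinvestment mechanism (my own simplification with illustrative parameters, not Roodman's actual model): when reinvested output compounds with the technology level, the growth rate itself accelerates, whereas a fixed exogenous rate gives plain exponential growth.

```python
# Toy sketch of the reinvestment mechanism (an illustrative simplification,
# not Roodman's full model): output Y depends on technology A, and a fraction
# of output is reinvested in improving A, making the growth *rate* accelerate.

def simulate(reinvest, steps=200, dt=0.1):
    A = 1.0  # technology level
    rates = []
    for _ in range(steps):
        Y = A  # output proportional to technology (crude simplification)
        if reinvest:
            dA = 0.05 * Y * A ** 0.5  # reinvested output compounds with A
        else:
            dA = 0.05 * A  # fixed exogenous rate: plain exponential growth
        rates.append(dA / A)  # instantaneous growth rate of technology
        A += dA * dt
    return rates

exog = simulate(reinvest=False)
endo = simulate(reinvest=True)
print(exog[0], exog[-1])  # constant growth rate
print(endo[0], endo[-1])  # accelerating growth rate
```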

Nicholas's opinion: I found this paper very interesting and was quite surprised by its results. That said, I remain confused about what conclusions I should draw from it. The power law trend does seem to fit historical data very well, but the past 70 years are fit quite well by an exponential trend. Which one is relevant for predicting the future, if either, is quite unclear to me.

The theoretical model proposed makes more sense to me. If technology is responsible for the growth rate, then reinvesting production in technology will cause the growth rate to be faster. I'd be curious to see data on what fraction of GWP gets reinvested in improved technology and how that lines up with the other trends.

Rohin’s opinion: I enjoyed this post; it gave me a visceral sense for what hyperbolic models with noise look like (see the blog post for this, the summary doesn’t capture it). Overall, I think my takeaway is that the picture used in AI risk of explosive growth is in fact plausible, despite how crazy it initially sounds. Of course, it won’t literally diverge to infinity -- we will eventually hit some sort of limit on growth, even with “just” exponential growth -- but this limit could be quite far beyond what we have achieved so far. See also this related post.

The ground of optimization (Alex Flint) (summarized by Rohin): Many arguments about AI risk depend on the notion of “optimizing”, but so far it has eluded a good definition. One natural approach is to say that an optimizer causes the world to have higher values according to some reasonable utility function, but this seems insufficient, as then a bottle cap would be an optimizer (AN #22) for keeping water in the bottle.

This post provides a new definition of optimization, by taking a page from Embedded Agents (AN #31) and analyzing a system as a whole instead of separating the agent and environment. An optimizing system is then one which tends to evolve toward some special configurations (called the target configuration set), when starting anywhere in some larger set of configurations (called the basin of attraction), even if the system is perturbed.

For example, in gradient descent, we start with some initial guess at the parameters θ, and then continually compute loss gradients and move θ in the appropriate direction. The target configuration set is all the local minima of the loss landscape. Such a program has a very special property: while it is running, you can change the value of θ (e.g. via a debugger), and the program will probably still work. This is quite impressive: certainly most programs would not work if you arbitrarily changed the value of one of the variables in the middle of execution. Thus, this is an optimizing system that is robust to perturbations in θ. Of course, it isn’t robust to arbitrary perturbations: if you change any other variable in the program, it will probably stop working. In general, we can quantify how powerful an optimizing system is by how robust it is to perturbations, and how small the target configuration set is.
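The debugger thought experiment can be sketched in a few lines (a toy quadratic loss, not any particular system from the post): even after θ is arbitrarily perturbed mid-run, the program still converges to the target configuration.

```python
# Sketch of the post's gradient-descent example: minimize a simple loss,
# perturb theta mid-run (as a debugger might), and observe that the system
# still evolves to the target configuration (a minimum of the loss).

def grad(theta):
    return 2 * (theta - 3.0)  # gradient of the loss (theta - 3)^2, minimum at 3

theta = 10.0
for step in range(500):
    if step == 100:
        theta = -50.0  # arbitrary perturbation partway through execution
    theta -= 0.05 * grad(theta)

print(theta)  # back near the target configuration theta = 3
```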

The bottle cap example is not an optimizing system because there is no broad basin of configurations from which we get to the bottle being full of water. The bottle cap doesn’t cause the bottle to be full of water when it didn’t start out full of water.

Optimizing systems are a superset of goal-directed agentic systems, which require a separation between the optimizer and the thing being optimized. For example, a tree is certainly an optimizing system (the target is to be a fully grown tree, and it is robust to perturbations of soil quality, or if you cut off a branch, etc). However, it does not seem to be a goal-directed agentic system, as it would be hard to separate into an “optimizer” and a “thing being optimized”.

This does mean that we can no longer ask “what is doing the optimization” in an optimizing system. This is a feature, not a bug: if you expect to always be able to answer this question, you typically get confusing results. For example, you might say that your liver is optimizing for making money, since without it you would die and fail to make money.

The full post has several other examples that help make the concept clearer.

Rohin's opinion: I’ve previously argued (AN #35) that we need to take generalization into account in a definition of optimization or goal-directed behavior. This definition achieves that by primarily analyzing the robustness of the optimizing system to perturbations. While this does rely on a notion of counterfactuals, it still seems significantly better than any previous attempt to ground optimization.

I particularly like that the concept doesn’t force us to have a separate agent and environment, as that distinction does seem quite leaky upon close inspection. I gave a shot at explaining several other concepts from AI alignment within this framework in this comment, and it worked quite well. In particular, a computer program is a goal-directed AI system if there is an environment such that adding the computer program to the environment transforms it into an optimizing system for some “interesting” target configuration states (with one caveat explained in the comment).

# TECHNICAL AI ALIGNMENT

## AGENT FOUNDATIONS

Public Static: What is Abstraction? (John S Wentworth) (summarized by Rohin): If we are to understand embedded agency, we will likely need to understand abstraction (see here (AN #83)). This post presents a view of abstraction in which we abstract a low-level territory into a high-level map that can still make reliable predictions about the territory, for some set of queries (whether probabilistic or causal).

For example, in an ideal gas, the low-level configuration would specify the position and velocity of every single gas particle. Nonetheless, we can create a high-level model where we keep track of things like the number of molecules, the average kinetic energy of the molecules, etc., which can then be used to predict things like the pressure exerted on a piston.

Given a low-level territory L and a set of queries Q that we’d like to be able to answer, the minimal-information high-level model stores P(Q | L) for every possible Q and L. However, in practice we don’t start with a set of queries and then come up with abstractions, we instead develop crisp, concise abstractions that can answer many queries. One way we could develop such abstractions is by only keeping information that is visible from “far away”, and throwing away information that would be wiped out by noise. For example, when typing 3+4 into a calculator, the exact voltages in the circuit don’t affect anything more than a few microns away, except for the final result 7, which affects the broader world (e.g. via me seeing the answer).
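The ideal gas example can be sketched as follows (toy numbers; the kinetic-theory formula P = N m ⟨v²⟩ / 3V is standard, everything else is illustrative): the high-level summary (particle count and mean kinetic energy) answers the pressure query just as well as the full microstate does.

```python
import numpy as np

# Sketch of the ideal-gas abstraction: the high-level summary (N, mean KE)
# suffices to answer the pressure query, discarding per-particle detail.
# Kinetic theory: P = N * m * <v^2> / (3 * V).

rng = np.random.default_rng(0)
N, m, V = 10_000, 1.0, 1.0

def pressure_from_microstate(speeds):
    """Answer the query from the full low-level configuration."""
    return N * m * np.mean(speeds ** 2) / (3 * V)

def pressure_from_summary(mean_ke):
    """Answer the same query from the high-level summary alone.
    mean KE = m <v^2> / 2, so <v^2> = 2 * mean_ke / m."""
    return N * 2 * mean_ke / (3 * V)

speeds = rng.normal(0, 1, N)            # low-level: every particle's speed
summary = 0.5 * m * np.mean(speeds ** 2)  # high-level: mean kinetic energy
print(pressure_from_microstate(speeds), pressure_from_summary(summary))
```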

If we instead take a systems view of this, where we want abstractions of multiple different low-level things, then we can equivalently say that two far-away low-level things should be independent of each other when given their high-level summaries, which are supposed to be able to quantify all of their interactions.

Rohin's opinion: I really like the concept of abstraction, and think it is an important part of intelligence, and so I’m glad to get better tools for understanding it. I especially like the formulation that low-level components should be independent given high-level summaries -- this corresponds neatly to the principle of encapsulation in software design, and does seem to be a fairly natural and elegant description, though of course abstractions in practice will only approximately satisfy this property.

## LEARNING HUMAN INTENT

Safe Imitation Learning via Fast Bayesian Reward Inference from Preferences (Daniel S. Brown et al) (summarized by Zach): Bayesian reward learning would allow for rigorous safety analysis when performing imitation learning. However, Bayesian reward learning methods are typically computationally expensive to use, because a separate MDP needs to be solved for each reward hypothesis. The main contribution of this work is a proposal for a more efficient reward evaluation scheme called Bayesian REX (see also an earlier version (AN #86)). It works by pre-training a low-dimensional feature encoding of the observation space, which allows reward hypotheses to be evaluated as linear combinations over the learned features. Demonstrations are ranked using pairwise preferences, which are relative judgments and thus conceptually easier for a human to provide. Using this method, sampling and evaluating reward hypotheses is extremely fast: 100,000 samples in only 5 minutes using a PC. Moreover, Bayesian REX can be used to play Atari games by finding a most likely or mean reward hypothesis that best explains the ranked preferences, and then using that hypothesis as a reward function for the agent.
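A rough sketch of why evaluation becomes so cheap (illustrative code, not the paper's implementation; the Bradley-Terry preference likelihood here is an assumption about the ranking model): once each trajectory is encoded as a fixed feature vector, scoring a reward hypothesis is a single matrix product.

```python
import numpy as np

# Sketch of the Bayesian REX idea (not the paper's code): with pretrained
# trajectory features, a reward hypothesis is just a weight vector w, and its
# fit to the ranked demonstrations is a cheap dot-product computation.

rng = np.random.default_rng(0)
n_trajs, n_features = 20, 8
features = rng.normal(size=(n_trajs, n_features))  # assumed pretrained encodings
# Preferences: pairs (i, j) meaning trajectory j is preferred to trajectory i.
prefs = [(i, i + 1) for i in range(n_trajs - 1)]

def log_likelihood(w):
    """Bradley-Terry likelihood of the ranked preferences under reward w."""
    returns = features @ w  # evaluating a hypothesis is one matrix product
    return sum(returns[j] - np.logaddexp(returns[i], returns[j])
               for i, j in prefs)

# Score many sampled hypotheses; each evaluation is trivially cheap.
samples = rng.normal(size=(1000, n_features))
scores = np.array([log_likelihood(w) for w in samples])
best_w = samples[scores.argmax()]  # e.g. a MAP-style reward to hand to the agent
```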

Prerequisites: T-REX

Zach's opinion: It's worth emphasizing that this isn't quite a pure IRL method. They use preferences over demonstrations in addition to the demonstrations themselves and so they have more information than would be available in a pure IRL context. However, it’s also worth emphasizing that (as the authors show) pixel-level features make it difficult to use IRL or GAIL to learn an imitation policy, which means I wasn’t expecting a pure IRL approach to work here. Conceptually, what's interesting about the Bayesian approach is that uncertainty in the reward distribution translates into confidence intervals on expected performance. This means that Bayesian REX is fairly robust to direct attempts at reward hacking due to the ability to directly measure overfitting to the reward function as high variance in the expected reward.

Avoiding Side Effects in Complex Environments (Alexander Matt Turner, Neale Ratzlaff et al) (summarized by Rohin): Previously, attainable utility preservation (AUP) has been used to solve (AN #39) some simple gridworlds. Can we use it to avoid side effects in complex high dimensional environments as well? This paper shows that we can, at least in SafeLife (AN #91). The method is simple: first train a VAE on random rollouts in the environment, and use randomly generated linear functions of the VAE features as the auxiliary reward functions for the AUP penalty. The Q-functions for these auxiliary reward functions can be learned using deep RL algorithms. Then we can just do regular deep RL using the specified reward and the AUP penalty. It turns out that this leads to fewer side effects with just one auxiliary reward function and a VAE whose latent space is size one! It also leads to faster learning for some reason. The authors hypothesize that this occurs because the AUP penalty is a useful shaping term, but don’t know why this would be the case.
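The penalty term described above might look roughly like this (hypothetical Q-functions and scale parameter; a sketch of the idea, not the paper's code): actions are penalized by how much they change the agent's ability to attain the auxiliary rewards, relative to doing nothing.

```python
import numpy as np

# Sketch of an AUP-style penalty (hypothetical Q-values, not the paper's
# implementation): penalize actions by how much they change attainable
# auxiliary reward, measured against a no-op baseline.

def aup_penalty(q_aux, state, action, noop, scale=0.1):
    """q_aux: list of Q-functions for auxiliary rewards (e.g. random linear
    functions of VAE features). Penalty is the mean absolute change vs. no-op."""
    diffs = [abs(q(state, action) - q(state, noop)) for q in q_aux]
    return scale * float(np.mean(diffs))

def shaped_reward(task_reward, q_aux, state, action, noop):
    """Specified reward minus the AUP penalty, as used during deep RL training."""
    return task_reward - aup_penalty(q_aux, state, action, noop)

# Toy usage with a single auxiliary Q-function (the paper found one suffices).
q_aux = [lambda s, a: float(s + a)]
print(shaped_reward(1.0, q_aux, state=0.0, action=2.0, noop=0.0))  # 1.0 - 0.1*2 = 0.8
```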

## FORECASTING

Reasons you might think human level AI soon is unlikely (Asya Bergal) (summarized by Rohin): There is a lot of disagreement about AI timelines, which can be quite decision-relevant. In particular, if we were convinced that there was a < 5% chance of AGI in the next 20 years, that could change the field's overall strategy significantly: for example, we might focus more on movement building, less on empirical research, and more on MIRI's agent foundations research. This talk doesn't decisively answer this question, but discusses three different sources of evidence one might have for this position: the results of expert surveys, trends in compute, and arguments that current methods are insufficient for AGI.

Expert surveys usually suggest a significantly higher than 5% chance of AGI in 20 years, but this is quite sensitive to the specific framing of the question, and so it’s not clear how informative this is. If we instead ask experts what percentage of their field has been solved during their tenure and extrapolate to 100%, the extrapolations for junior researchers tend to be optimistic (decades), whereas those of senior researchers are pessimistic (centuries).

Meanwhile, the amount spent on compute (AN #7) has been increasing rapidly. At the estimated trend, it would hit \$200 billion in 2022, which is within reach of large governments, but would presumably have to slow down at that point, potentially causing overall AI progress to slow. Better price performance (how many flops you can buy per dollar) might compensate for this, but hasn't been growing at comparable rates historically.
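A back-of-the-envelope version of this extrapolation (illustrative starting cost and doubling time, loosely in the spirit of the "AI and Compute" trend; these are not the post's exact numbers): sustained doublings every few months push costs to government-scale budgets within a few years.

```python
# Back-of-the-envelope sketch of the compute-spending extrapolation, using
# illustrative numbers (not the post's): a ~$10M-scale training run, with
# spending doubling every 3.4 months, extrapolated four years ahead.

def projected_cost(start_cost, doubling_months, months_ahead):
    """Cost after compounding doublings: start * 2^(months / doubling period)."""
    return start_cost * 2 ** (months_ahead / doubling_months)

cost_after_4_years = projected_cost(10e6, 3.4, 48)
print(f"${cost_after_4_years:,.0f}")  # on the order of hundreds of billions
```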

Another argument is that most of our effort is now going into deep learning, and methods that depend primarily on deep learning are insufficient for AGI, e.g. because they can’t use human priors, or can’t do causal reasoning, etc. Asya doesn’t try to evaluate these arguments, and so doesn’t have a specific takeaway.

Rohin's opinion: While there is a lot of uncertainty over timelines, I don’t think under 5% chance of AGI in the next 20 years is very plausible. Claims of the form “neural nets are fundamentally incapable of X” are almost always false: recurrent neural nets are Turing-complete, and so can encode arbitrary computation. Thus, the real question is whether we can find the parameterization that would correspond to e.g. causal reasoning.

I’m quite sympathetic to the claim that this would be very hard to do: neural nets find the simplest way of doing the task, which usually does not involve general reasoning. Nonetheless, it seems like by having more and more complex and diverse tasks, you can get closer to general reasoning, with GPT-3 (AN #102) being the latest example in this trend. Of course, even then it may be hard to reach AGI due to limits on compute. I’m not claiming that we already have general reasoning, nor that we necessarily will get it soon: just that it seems like we can’t rule out the possibility that general reasoning does happen soon, at least not without a relatively sophisticated analysis of how much compute we can expect in the future and some lower bound on how much we would need for AGI-via-diversity-of-tasks.

Relevant pre-AGI possibilities (Daniel Kokotajlo) (summarized by Rohin): This page lists 47 things that could plausibly happen before the development of AGI, that could matter for AI safety or AI policy. You can also use the web page to generate a very simple trajectory for the future, as done in this scenario that Daniel wrote up.

Rohin's opinion: I think this sort of reasoning about the future, where you are forced into a scenario and have to reason what must have happened and draw implications, seems particularly good for ensuring that you don’t get too locked in to your own beliefs about the future, which will likely be too narrow.

## MISCELLANEOUS (ALIGNMENT)

Preparing for "The Talk" with AI projects (Daniel Kokotajlo) (summarized by Rohin): At some point in the future, it seems plausible that there will be a conversation in which people decide whether or not to deploy a potentially risky AI system. So one class of interventions to consider is interventions that make such conversations go well. This includes raising awareness about specific problems and risks, but could also include identifying people who are likely to be involved in such conversations and concerned about AI risk, and helping them prepare for such conversations through training, resources, and practice. This latter intervention hasn't been done yet: some simple examples of potential interventions would be generating official lists of AI safety problems and solutions which can be pointed to in such conversations, or doing "practice runs" of these conversations.

Rohin's opinion: I certainly agree that we should be thinking about how we can convince key decision makers of the level of risk of the systems they are building (whatever that level of risk is). I think that on the current margin it's much more likely that this is best done through better estimation and explanation of risks with AI systems, but it seems likely that the interventions laid out here will become more important in the future.

# AI STRATEGY AND POLICY

Medium-Term Artificial Intelligence and Society (Seth D. Baum) (summarized by Rohin): Like a previously summarized paper (AN #90), this paper aims to find common ground between near-term and long-term priorities by focusing on medium-term concerns. The medium term can be defined along several dimensions of an AI system: when it appears chronologically, how feasible it is to build, how certain we are that it can be built, how capable the system is, how impactful the system is, and how urgent it is to work on it.

The paper formulates and evaluates the plausibility of the medium term AI hypothesis: that there is an intermediate time period in which AI technology and accompanying societal issues are important from both presentist and futurist perspectives. However, it does not come to a strong opinion on whether the hypothesis is true or not.

#### FEEDBACK

I'm always happy to hear feedback; you can send it to me, Rohin Shah, by replying to this email.

#### PODCAST

An audio podcast version of the Alignment Newsletter is available, recorded by Robert Miles.

# COMMENTS
> Rohin’s opinion: I enjoyed this post; it gave me a visceral sense for what hyperbolic models with noise look like (see the blog post for this, the summary doesn’t capture it). Overall, I think my takeaway is that the picture used in AI risk of explosive growth is in fact plausible, despite how crazy it initially sounds.

One thing this post led me to consider is that when we bring together various fields, the evidence for 'things will go insane in the next century' is stronger than any specific claim about (for example) AI takeoff. What is the other evidence?

We're probably alone in the universe, and anthropic arguments tend to imply we're living at an incredibly unusual time in history. Isn't that what you'd expect to see in the same world where there is a totally plausible mechanism that could carry us a long way up this line, in the form of AGI and eternity in six hours? All the pieces are already there, and they only need to be approximately right for our lifetimes to be far weirder than those of people who were e.g. born in 1896 and lived to 1947 - which was weird enough, but that should be your minimum expectation.

In general, there are three categories of evidence that things are likely to become very weird over the next century, or that we live at the hinge of history:

1) Specific mechanisms around AGI - possibility of rapid capability gain, and arguments from exploratory engineering

2) Economic and technological trend-fitting predicting explosive growth in the next century

3) Anthropic and Fermi arguments suggesting that we live at some extremely unusual time

All of these are evidence for such a claim: 1) because a superintelligent AGI takeoff is just a specific example of how the hinge occurs, and 3) because it already directly argues for an extremely unusual time. But how does 2) fit in with 1) and 3)?

There is something a little strange about calling a fast takeoff from AGI and whatever drove superexponential growth throughout all of history the same trend. It would be a huge cosmic coincidence for there to always be superexponential growth: just as population growth plus growth in wealth per capita (or whatever has driven it until now) runs out in the great stagnation (visible as a tiny blip on the right-hand side of the double-log plot), AGI takes over and pushes us up the same trend line. That cannot be a coincidence, so if AGI is what takes us up the rest of that trend line, there would have to be some factor responsible for both: a factor at work in the founding of Jericho that predestined AGI to be invented and cause explosive growth in the 21st century, rather than the 19th or the 23rd.

For AGI to be the driver of the rest of that growth curve, there has to be a single causal mechanism that keeps us on the same trend and includes AGI as its final step - if we say we are agnostic about what that mechanism is, we can still call 2) evidence for us living at the hinge point, though we have to note that there is a huge blank spot in need of explanation. Is there anything that can fill it to complete the picture?

The mechanism proposed in the article seems like it could plausibly include AGI.

> If technology is responsible for the growth rate, then reinvesting production in technology will cause the growth rate to be faster. I'd be curious to see data on what fraction of GWP gets reinvested in improved technology and how that lines up with the other trends.

But even though the drivers seem superficially similar (both are about technology), the claim is that one very specific technology will generate explosive growth, not that technology in general will. It seems strange that AGI would follow the same growth curve produced by reinvesting more GWP in improving ordinary technology, which does not improve your own ability to think in the way that AGI would.

As for precise timings, the great stagnation (the last 30 or so years) seems like it would just stretch out the timeline a bit, so we shouldn't take the 2050s too seriously: however well the last 70 years fit an exponential trend line, there is really no way to make an exponential fit the overall data, as that post makes clear.

> we will eventually hit some sort of limit on growth, even with “just” exponential growth -- but this limit could be quite far beyond what we have achieved so far. See also this related post.

One major intuitive finding that came out of that post was that most of the adjustments I made to the speed and continuity of the takeoff made only a marginal difference: I think that if you presented any one of those trajectories in isolation, you would call it exceptionally fast.

I strongly suspect that as well as disagreements about discontinuities, there are very strong disagreements about 'post-RSI speed' - maybe over orders of magnitude.

This is what the curves look like if s (the effective 'power' of RSI) is set to 0.1 - the takeoff is much slower even if RSI comes about fairly abruptly.

> Claims of the form “neural nets are fundamentally incapable of X” are almost always false: recurrent neural nets are Turing-complete, and so can encode arbitrary computation.

I think RNNs are not Turing-complete (assuming the activations and weights can be represented by a finite number of bits). Models with finite state space (reading from an infinite input stream) can't simulate a Turing machine.

(Though I share the background intuition.)

[This comment is no longer endorsed by its author]