This is a special post for short-form writing by mesaoptimizer. Only they can create top-level comments. Comments here also appear on the Shortform Page and All Posts page.

2022-08; Jan Leike, John Schulman, Jeffrey Wu; Our Approach to Alignment Research

OpenAI's strategy, as of the publication of that post, involved scalable alignment approaches. Their philosophy is to take an empirical and iterative approach[1] to finding solutions to the alignment problem. Their strategy for alignment is cyborgism: create AI models that are capable and aligned enough to advance alignment research to the point where they can be used to align even more capable models.[2]

Their research focus is on scalable approaches to direct models[3]. This means that the core of their strategy involves RLHF. They don't expect RLHF to be sufficient on its own, but it is necessary for the other scalable alignment strategies they are looking at[4].

They intend to augment RLHF with AI-assisted, scaled-up evaluation, ensuring RLHF isn't bottlenecked by a lack of accurate evaluation data for tasks too onerous for unassisted humans to evaluate.[5]

Finally, they intend to use these partially-aligned models to do alignment research, since they anticipate that alignment approaches that work and are viable for low-capability models may not be adequate for models with higher capabilities.[6] They intend to use the AI-based evaluation tools both to RLHF-align models and as part of a process where humans evaluate alignment research produced by these LLMs (this is the cyborgism part of the strategy).[7]

Their "Limitations" section of their blog post does clearly point out the vulnerabilities in their strategy:

  • Their strategies involve using one black box (scalable evaluation models) to align another black box (large LLMs being RLHF-aligned), a strategy I am pessimistic about, although it is probably good enough for sufficiently low-capability models
  • They mostly neglect non-Godzilla strategies such as interpretability research and robustness (i.e., robustness to distribution shift and adversarial attacks; see Stephen Casper's research for an idea of this), although they do intend to hire researchers so that their portfolio includes investment in this research direction
  • They may be wrong in assuming they can create AI models that are partially aligned and helpful for alignment research, yet not so capable that they can cause pivotal acts. If so, the pivotal acts achieved will only be partially aligned with the intent of the AI's wielder and will probably not lead to a good ending.

  1. We take an iterative, empirical approach: by attempting to align highly capable AI systems, we can learn what works and what doesn’t, thus refining our ability to make AI systems safer and more aligned.

  2. We believe that even without fundamentally new alignment ideas, we can likely build sufficiently aligned AI systems to substantially advance alignment research itself.

  3. At a high-level, our approach to alignment research focuses on engineering a scalable training signal for very smart AI systems that is aligned with human intent.

  4. We don’t expect RL from human feedback to be sufficient to align AGI, but it is a core building block for the scalable alignment proposals that we’re most excited about, and so it’s valuable to perfect this methodology.

  5. RL from human feedback has a fundamental limitation: it assumes that humans can accurately evaluate the tasks our AI systems are doing. Today humans are pretty good at this, but as models become more capable, they will be able to do tasks that are much harder for humans to evaluate (e.g. finding all the flaws in a large codebase or a scientific paper). Our models might learn to tell our human evaluators what they want to hear instead of telling them the truth.

  6. There is currently no known indefinitely scalable solution to the alignment problem. As AI progress continues, we expect to encounter a number of new alignment problems that we don’t observe yet in current systems. Some of these problems we anticipate now and some of them will be entirely new.

    We believe that finding an indefinitely scalable solution is likely very difficult. Instead, we aim for a more pragmatic approach: building and aligning a system that can make faster and better alignment research progress than humans can.

  7. We believe that evaluating alignment research is substantially easier than producing it, especially when provided with evaluation assistance. Therefore human researchers will focus more and more of their effort on reviewing alignment research done by AI systems instead of generating this research by themselves. Our goal is to train models to be so aligned that we can off-load almost all of the cognitive labor required for alignment research.


Sidenote: I like how OpenAI ends their blog posts with an advertisement for positions they are hiring for, or programs they are running. That's a great strategy to advertise to the very people they want to reach.

I've noticed that there are two major "strategies of caring" used in our sphere:

  • Soares-style caring, where you override your gut feelings (your "internal care-o-meter" as Soares puts it) and use cold calculation to decide.
  • Carlsmith-style caring, where you do your best to align your gut feelings with the knowledge of the pain and suffering the world is filled with, including the suffering you cause.

Nate Soares obviously endorses staring unflinchingly into the abyss that is reality (if you are capable of doing so). However, I expect that almost-pure Soares-style caring (which in essence amounts to "shut up and multiply", and consequentialism) combined with inattention or an inaccurate map of the world (aka broken epistemics) can lead to making severely sub-optimal decisions.

The harder you optimize for a goal, the better your epistemology (and by extension, your understanding of your goal and the world) should be. Carlsmith-style caring seems more effective since it very likely is more robust to having bad epistemology compared to Soares-style caring.

(There are more pieces necessary to make Carlsmith-style caring viable, and a lot of them can be found in Soares' "Replacing Guilt" series.)

Does this come from a general idea that "optimizing hard" means a higher risk of damage caused by errors in detail, while "optimizing soft" has enough slack to avoid those risks, but is also less ambitious and likely less effective (if both are actually implemented well)?

a general idea of “optimizing hard” means higher risk of damage caused by errors in detail


“optimizing soft” has enough slack so as not to have the same risks, but also soft is less ambitious and likely less effective

I disagree with the idea that "optimizing soft" is less ambitious. "Optimizing soft", in my head, is about as ambitious as "optimizing hard", except it makes the epistemic uncertainty more explicit. In this model of caring I am trying to make more legible, I believe that Carlsmith-style caring may be more robust to certain epistemological errors humans can make that can result in severely sub-optimal scenarios, because it is constrained by human cognition and capabilities.

Note: I notice that this can also be said for Soares-style caring -- both are constrained by human cognition and capabilities, but in different ways. Perhaps both have different failure modes, and are more effective in certain distributions (which may diverge)?

Backing up a step, because I'm pretty sure we have different levels of knowledge and assumptions (mostly my failing) about the differences between "hard" and "soft" optimizing.

I should acknowledge that I'm not particularly invested in EA as a community or identity. I try to be effective, and do some good, but I'm exploring rather than advocating here. 

Also, I don't tend to frame things as "how to care", so much as "how to model the effects of actions, and how to use those models to choose how to act".  I suspect that's isomorphic to how you're using "how to care", but I'm not sure of that.

All that said, I think of "optimizing hard" as truly taking seriously the "shut up and multiply" results, even where it's uncomfortable epistemically, BECAUSE that's the only way to actually do the MOST POSSIBLE good.  actually OPTIMIZING, you know?  "soft" is almost by definition less ambitious, BECAUSE it's epistemically more conservative, and gives up average expected value in order to increase modal goodness in the face of that uncertainty.  I don't actually know if those are the positions taken by those people.  I'd love to hear different definitions of "hard" and "soft", so I can better understand why they're both equal in impact.

I predict this is not really an accurate representation of Soares-style caring. (I think there is probably some vibe difference between these two clusters that you're tracking, but I doubt Nate Soares would advocate "overriding" per se)

I doubt Nate Soares would advocate “overriding” per se

Acknowledged, that was an unfair characterization of Nate-style caring. I guess I wanted to make explicit two extremes. Perhaps using the name "Nate-style caring" is a bad idea.

(I now think that "System 1 caring" and "System 2 caring" would have been much better.)

Just a quote I find rather interesting, since it is rare to see a Hero's Journey narrative with a Return that involves the hero not knowing if he will ever belong or find meaning once he returns, and yet chooses to return, having faith in his ability to find meaning again:

If every living organism has a fixed purpose for its existence, then one thing's for sure. I [...] have completed my mission. I've fulfilled my purpose. But a great amount of power that has served its purpose is a pain to deal with, just like nuclear materials that have reached the end of their lifespan. If that's the case, there'll be a lot of questions. Would I now become an existence that this place doesn't need anymore?
The time will come when the question of whether it's okay for me to remain in this place will be answered.
If there's a reason to remain in this place, then it's probably that there are still people that I love in this place.
And that people who love me are still here.
Which is why that's enough reason for me to stay here.
I'll stay here and find other reasons as to why I should stay here...
That's what I've decided on.

-- The final chapter of Solo Leveling

Thoughts on Tom Everitt, Ramana Kumar, Victoria Krakovna, Shane Legg; 2019; Modeling AGI Safety Frameworks with Causal Influence Diagrams:

Causal Influence Diagrams are interesting, but don't really seem all that useful in themselves. The latest formal graphical representation of agents that the authors promote is structural causal models anyway, so you read this paper not for object-level usefulness but for its incidental research contributions, which are really interesting.

The paper divides AI systems into two major frameworks:

  • MDP-based frameworks (aka RL-based systems such as AlphaZero), which involve AI systems that take actions and are assigned a reward for their actions
  • Question-answering systems (which include all supervised learning systems, including sequence modellers like GPT), where the system gives an output given an input and is scored based on a label of the same data type as the output. This is also informally known as tool AI (they cite Gwern's post, which is nice to see).

I liked how lucidly they defined wireheading:

In the basic MDP from Figure 1, the reward parameter ΘR is assumed to be unchanging. In reality, this assumption may fail because the reward function is computed by some physical system that is a modifiable part of the state of the world. [...] This gives an incentive for the agent to obtain more reward by influencing the reward function rather than optimizing the state, sometimes called wireheading.

The common definition of wireheading is informal enough that different people would map it to different specific formalizations in their head (or perhaps have no formalization and therefore be confused), and having this 'more formal' definition in my head seems rather useful.

Here's their distillation of Current RF-optimization, a strategy to avoid wireheading (which reminds me of shard theory, now that I think about it: models avoid wireheading by modelling the effects of the resulting changes to their policy and then deciding which trajectory of actions to take):

An elegant solution to this problem is to use model-based agents that simulate the state sequence likely to result from different policies, and evaluate those state sequences according to the current or initial reward function.
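To make the wireheading incentive and the current-RF fix concrete, here's a minimal Python sketch. This is my own toy construction, not from the paper: a one-step world (with made-up states and rewards) where the agent can either improve the state or tamper with its reward parameter, and where trajectories are evaluated either under the resulting or under the current reward function.

```python
# Toy sketch (hypothetical setup, not the paper's formalism): the agent
# can "work" (genuinely improve the state) or "tamper" (rewrite its own
# reward parameter theta, leaving the state unchanged).

def reward(state, theta):
    # theta scales how much the reward function values the state
    return theta * state

def simulate(action):
    """Model-based rollout: final (state, theta) after taking `action`."""
    state, theta = 1, 1
    if action == "work":
        state += 1       # genuinely improves the world state
    elif action == "tamper":
        theta = 100      # rewires the reward function itself
    return state, theta

actions = ["work", "tamper"]

# Naive agent: evaluates each rollout with the *resulting* reward
# parameter, so tampering looks great -> wireheading incentive.
naive_choice = max(actions, key=lambda a: reward(*simulate(a)))

# Current-RF agent: evaluates simulated state sequences with the
# *current* reward parameter (theta = 1), so tampering earns nothing.
current_theta = 1
current_rf_choice = max(
    actions, key=lambda a: reward(simulate(a)[0], current_theta)
)

print(naive_choice)       # "tamper"
print(current_rf_choice)  # "work"
```

The only difference between the two agents is which theta the simulated outcomes are scored with, which is exactly the point of the quoted passage.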

Here's their distillation of Reward Modelling:

A key challenge when scaling RL to environments beyond board games or computer games is that it is hard to define good reward functions. Reward Modeling [Leike et al., 2018] is a safety framework in which the agent learns a reward model from human feedback while interacting with the environment. The feedback could be in the form of preferences, demonstrations, real-valued rewards, or reward sketches. [...] Reward modeling can also be done recursively, using previously trained agents to help with the training of more powerful agents [Leike et al., 2018].

The resulting CI diagram modelling actually made me feel like I grokked Reward Modelling better.

Here's their distillation of CIRL:

Another way for agents to learn the reward function while interacting with the environment is Cooperative Inverse Reinforcement Learning (CIRL) [Hadfield-Menell et al., 2016]. Here the agent and the human inhabit a joint environment. The human and the agent jointly optimize the sum of rewards, but only the human knows what the rewards are. The agent has to infer the rewards by looking at the human’s actions.

The difference between the RM and CIRL causal influence diagrams is interesting because it is subtle. The authors imply that this minor difference matters and can imply different things about system incentives and therefore safety guarantees, and I am enthusiastic about such strategies for investigating safety guarantees.

The authors describe a wireheading-equivalent for QA systems called self-fulfilling prophecies:

The assumption that the labels are generated independently of the agent’s answer sometimes fails to hold. For example, the label for an online stock price prediction system could be produced after trades have been made based on its prediction. In this case, the QA-system has an incentive to make self-fulfilling prophecies. For example, it may predict that the stock will have zero value in a week. If sufficiently trusted, this prediction may lead the company behind the stock to quickly go bankrupt. Since the answer turned out to be accurate, the QA-system would get full reward. This problematic incentive is represented in the diagram in Figure 9, where we can see that the QA-system has both incentive and ability to affect the world state with its answer [Everitt et al., 2019].

They propose a solution to the self-fulfilling prophecies problem, via making oracles optimize for reward in the counterfactual world where their answer doesn't influence the world state and therefore the label which they are optimized for. While that is a solution, I am unsure how one can get counterfactual labels for complicated questions whose answers may have far reaching consequences in the world.

It is possible to fix the incentive for making self-fulfilling prophecies while retaining the possibility to ask questions where the correctness of the answer depends on the resulting state. Counterfactual oracles optimize reward in the counterfactual world where no one reads the answer [Armstrong, 2017]. This solution can be represented with a twin network [Balke and Pearl, 1994] influence diagram, as shown in Figure 10. Here, we can see that the QA-system’s incentive to influence the (actual) world state has vanished, since the actual world state does not influence the QA-system’s reward; thereby the incentive to make self-fulfilling prophecies also vanishes. We expect this type of solution to be applicable to incentive problems in many other contexts as well.
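Here's a minimal sketch of the incentive difference, using the paper's stock-price example but with numbers and a fully self-fulfilling market that I made up for illustration: a naive oracle is scored against the world where its answer is read (so every prediction becomes self-fulfilling), while a counterfactual oracle is scored against the world where its answer is not read.

```python
# Toy sketch (hypothetical numbers): a stock-price oracle whose
# published prediction can move the very price it is scored on.

def world(prediction_read, prediction):
    """Final stock price. If the prediction is read and trusted, the
    market moves all the way to it (fully self-fulfilling, for
    simplicity); otherwise the price stays at its baseline of 50."""
    baseline = 50
    return prediction if prediction_read else baseline

def score(prediction, price):
    # Higher is better: negative squared prediction error.
    return -(prediction - price) ** 2

candidates = [0, 50, 100]

# Naive oracle: scored in the world where its answer IS read.
# Every candidate becomes self-fulfilling and scores perfectly, so
# "the stock goes to zero" is as good an answer as any other.
naive_scores = {p: score(p, world(True, p)) for p in candidates}

# Counterfactual oracle: scored in the world where its answer is NOT
# read, so only the genuinely accurate answer wins.
cf_scores = {p: score(p, world(False, p)) for p in candidates}
cf_choice = max(cf_scores, key=cf_scores.get)

print(naive_scores)  # every candidate scores 0 (a perfect score)
print(cf_choice)     # 50
```

Of course, this sidesteps the hard part I flagged above: in the real world you don't get to observe the unread-answer counterfactual for free.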

The authors also anticipate this problem, but instead of considering whether and how one can tractably calculate counterfactual labels, they use this intractability to motivate introducing the debate AI safety strategy:

To fix this, Irving et al. [2018] suggest pitting two QA-systems against each other in a debate about the best course of action. The systems both make their own proposals, and can subsequently make arguments about why their own suggestion is better than their opponent’s. The system who manages to convince the user gets rewarded; the other system does not. While there is no guarantee that the winning answer is correct, the setup provides the user with a powerful way to poke holes in any suggested answer, and reward can be dispensed without waiting to see the actual result.

I like how they explicitly mention that there is no guarantee that the winning answer is correct, which makes me more enthusiastic about considering debate as a potential strategy.

They also have an incredibly lucid distillation of IDA. Seriously, this is significantly better than all the Paul Christiano posts I've read and the informal conversations I've had about IDA:

Iterated distillation and amplification (IDA) [Christiano et al., 2018] is another suggestion that can be used for training QA-systems to correctly answer questions where it is hard for an unaided user to directly determine their correctness. Given an original question Q that is hard to answer correctly, less powerful systems Xk are asked to answer a set of simpler questions Qi. By combining the answers Ai to the simpler questions Qi, the user can guess the answer ˆA to Q. A more powerful system Xk+1 is trained to answer Q, with ˆA used as an approximation of the correct answer to Q.

Once the more powerful system Xk+1 has been trained, the process can be repeated. Now an even more powerful QA-system Xk+2 can be trained, by using Xk+1 to answer simpler questions to provide approximate answers for training Xk+2. Systems may also be trained to find good subquestions, and for aggregating answers to subquestions into answer approximations. In addition to supervised learning, IDA can also be applied to reinforcement learning.
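The amplify-then-distill loop in that distillation can be sketched in a few lines of Python. This is my own toy instantiation, not the paper's setup: the "hard" question is summing a tuple of numbers, the weak system X_k can only answer trivial single-number subquestions, and the "trained" X_{k+1} is just a lookup table over the amplified labels.

```python
# Minimal IDA sketch (toy instantiation, hypothetical task):
# amplification decomposes a hard question Q into subquestions Q_i,
# answers each with the weak system X_k, and aggregates into A-hat;
# distillation trains X_{k+1} to answer Q directly using A-hat as label.

def weak_system(subquestion):
    """X_k: can only answer trivial subquestions ("what is 7?")."""
    return subquestion  # identity stands in for a learned model

def amplify(hard_question):
    """Human + X_k: decompose, answer subquestions, aggregate."""
    subquestions = list(hard_question)               # decomposition
    subanswers = [weak_system(q) for q in subquestions]
    return sum(subanswers)                           # aggregation -> A-hat

# "Distill": collect (Q, A-hat) pairs and train X_{k+1} on them.
# Here "training" is just memorizing the approximate labels.
training_questions = [(1, 2, 3), (4, 5), (10, 10, 10)]
labels = {q: amplify(q) for q in training_questions}

def stronger_system(hard_question):
    """X_{k+1}: answers the hard question in one step."""
    return labels[hard_question]

print(stronger_system((1, 2, 3)))  # 6
```

Repeating the loop with `stronger_system` in the role of `weak_system` is what makes the scheme "iterated".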

I have no idea why they included Drexler's CAIS -- but it is better than reading 300 pages of the original paper:

Drexler [2019] argues that the main safety concern from artificial intelligence does not come from a single agent, but rather from big collections of AI services. For example, one service may provide a world model, another provide planning ability, a third decision making, and so on. As an aggregate, these services can be very competent, even though each service only has access to a limited amount of resources and only optimizes a short-term goal.

The authors claim that the commonly discussed AI safety issues can be derived 'downstream' of modelling these systems more formally, using these causal influence diagrams. I disagree, given the number of degrees of freedom the modeller has when constructing these diagrams.

In the discussion section, the authors discuss the assumptions underlying the representations and their limitations. They explicitly point out how the intentional stance may be limiting and may fail to model certain classes of AI systems or agents (hint: read their newer papers!).

Overall, the paper was an easy and fun read, and I loved the distillations of AI safety approaches in them. I'm excited to read papers by this group.

I want to differentiate between categories of capabilities improvement in AI systems, and here's the set of terms I've come up with to think about them:

  • Infrastructure improvements: Capability boost in the infrastructure that makes up an AI system. This involves software (Pytorch, CUDA), hardware (NVIDIA GPUs), operating systems, networking, the physical environment where the infrastructure is situated. This probably is not the lowest hanging fruit when it comes to capabilities acceleration.

  • Scaffolding improvements: Capability boost in an AI system that involves augmenting the AI system via software features. Think of it as keeping the CPU of the natural language computer the same, but upgrading its RAM and SSD and IO devices. Some examples off the top of my head: hyperparameter optimization for generating text, use of plugins, embeddings for memory. More information is in beren's essay linked in this paragraph.

  • Neural network improvements: Any capability boost in an AI system that specifically involves improving the black-box neural network that drives the system. This is mainly what SOTA ML researchers focus on, and is what has driven the AI hype over the past decade. This can involve architectural improvements, training improvements, finetuning afterwards (RLHF to me counts as capabilities acceleration via neural network improvements), etc.

There probably are more categories, or finer ways to slice the space of capability acceleration mechanisms, but I haven't thought about this in as much detail yet.

As far as I can tell, all of these categories of capability improvement contribute to achieving recursively self-improving (RSI) systems, and once you hit that point, foom is inevitable.

Alignment agendas can generally be classified into two categories: blueprint-driven and component-driven. Understanding this distinction is probably valuable for evaluating and comprehending different agendas.

Blueprint-driven alignment agendas are approaches that start with a coherent blueprint for solving the alignment problem. They prioritize the overall structure and goals of the solution before searching for individual components or building blocks that fit within that blueprint. Examples of blueprint-driven agendas include MIRI's agent foundations, Vanessa Kosoy and Diffractor's Infrabayesianism, and carado's formal alignment agenda. Research aimed at developing a more accurate blueprint, such as Nate Soares' 2022-now posts, Adam Shimi's epistemology-focused output, and John Wentworth's deconfusion-style output, also fall into this category.

Component-driven alignment agendas, on the other hand, begin with available components and seek to develop new pieces that work well with existing ones. They focus on making incremental progress by developing new components that can be feasibly implemented and integrated with existing AI systems or techniques to address the alignment problem. OpenAI's strategy, Deepmind's strategy, Conjecture's LLM-focused outputs, and Anthropic's strategy are examples of this approach. Agendas that serve as temporary solutions by providing useful components that integrate with existing ones, such as ARC's power-seeking evals, also fall under the component-driven category. Additionally, the Cyborgism agenda and the Accelerating Alignment agenda can be considered component-driven.

The blueprint-driven and component-driven categorization seems to me to be more informative than dividing agendas into conceptual and empirical categories. This is because all viable alignment agendas require a combination of conceptual and empirical research. Categorizing agendas based on the superficial pattern of their current research phase can be misleading. For instance, shard theory may initially appear to be a blueprint-driven conceptual agenda, like embedded agency. However, it is actually a component-driven agenda, as it involves developing pieces that fit with existing components.

Given the significant limitations of using a classifier to detect AI-generated text, it seems strange to me that OpenAI went ahead, built one, and released it for the public to try. As far as I can tell, this is OpenAI aggressively acting to cover its bases against potential legal and PR damages due to ChatGPT's existence.

For me, this is slight positive evidence for the idea that AI governance may actually be useful in extending timelines, but only if it involves adversarial actions that target the vulnerabilities of these companies. Even then, that seems like a myopic approach given the existence of other, less controllable actors (like China) racing as fast as possible towards AGI.

Jan Hendrik Kirchner now works at OpenAI, it seems, given that he is listed as the author of this blog post. I don't see this listed on his profile or on his substack or twitter account, so this is news to me.