Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

Here is some obvious advice.

I think a common failure mode when working on AI alignment[1] is to not focus on the hard parts of the problem first. This is a problem when generating a research agenda, as well as when working on any specific research agenda. Given a research agenda, there are normally many problems that you know how to make progress on. But blindly working on what seems tractable is not a good idea.

Let's say we are working on a research agenda about solving problems A, B, and C. We know that if we find solutions to A, B, and C, we will solve alignment. However, if we can't solve even one subproblem, the agenda is doomed. If C seems like a very hard problem that you are not sure you can solve, it would be a bad idea to flinch away from C and work on problem A instead, just because A seems so much more manageable.

If solving A takes a lot of time and effort, all of that time and effort will be wasted if you can't solve C in the end. It's especially worrisome when A has tight feedback loops, such that you constantly feel like you are making progress. Or when it is just generally fun to work on A.

Of course, it can make sense to work on A first if you expect this to help you solve C, or at least give you more information on its tractability. The general version of this is illustrated by considering that you have a large list of problems that you need to solve. In this case, focusing on problems that will provide you with information that will be helpful for solving many of the other problems can be very useful. But even then you should not lose sight of the hard problems that might block you down the road.
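As a toy illustration of that prioritization (all numbers and names hypothetical): if the agenda needs all of A, B, and C, and you stop as soon as one fails, the order in which you attack them changes how much effort you expect to sink before discovering the agenda is doomed.

```python
from itertools import permutations

# Hypothetical tasks: (name, effort cost, probability of being solvable).
# The agenda needs ALL of them; a single failure dooms it.
tasks = [("A", 2.0, 0.9), ("B", 3.0, 0.8), ("C", 4.0, 0.2)]

def expected_effort(order):
    """Expected total effort when tasks are attempted in this order,
    stopping as soon as one turns out to be unsolvable."""
    total, p_reached = 0.0, 1.0
    for _name, cost, p_success in order:
        total += p_reached * cost  # attempted only if all earlier tasks succeeded
        p_reached *= p_success
    return total

for order in sorted(permutations(tasks), key=expected_effort):
    print("".join(name for name, _, _ in order),
          round(expected_effort(order), 2))
# Orders that attack the likely blocker C first come out cheapest.
```

With these made-up numbers, every ordering that probes C first beats every ordering that saves it for last, which is the "don't flinch away from the hard part" intuition in miniature.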

The takeaway is that these two things are very different:

  • Solving A as an instrumental subgoal in order to make progress on C, when C is a potential blocker.
  • Avoiding C, because it seems hard, and instead working on A because it seems tractable.

  1. Though I expect this to be a general problem that comes up all over the place. ↩︎

13 comments

Clearing out the easy parts helps to focus on what remains. Solving easier versions of the hard parts prepares the tools and intuition needed to overcome what couldn't be confronted immediately. Experience with similar problems lets one navigate the many formulations of the problem that are importantly different from each other in ways that wouldn't be apparent from the outset.

(In the spirit of balancing any advice with some advice in the opposite direction. Also, it's not the consequence that makes a problem important, it is that you have a reasonable attack.)

Yes, that is a good heuristic too, though I feel like it does not conflict at all with the one proposed in the OP. Seems to me like they are complementary.

I think Richard Hamming would say something like "Identify the important problems of your field, then tackle those, and make sure that you can actually solve them. It wouldn't do any good to think about the most important problems all the time if you then cannot solve them." This also seems to not contradict the OP, and seems to be a technique that can be combined with the one in the OP.

This essay by Jacob Steinhardt makes a similar (and fairly fleshed out) argument.

What is the hardest part of AI alignment?

Now that is the right question. There is the AGI Ruin list which talks about a lot of the hard problems.

I think a very core thing is figuring out how we can make a system robustly "want" something. There are actually a bunch more heuristics that you can use to determine good problems to work on. One is to think about which things need to be solved because they will show up in virtually all agendas (or at least all agendas of a particular type). How to make a system robustly "want" something probably falls into that category.

If we could just figure this out, we might be able to get away with not figuring out human values. Potentially we could make the AI perform some narrow task that constitutes a pivotal act. However, figuring out just how to make a system robustly "want" something does not seem to be enough. We also need to figure out how to make the system "want" to perform the narrow task that constitutes a pivotal act. And we also need to make sure that the system would not spawn misaligned subagents. And probably a bunch more problems that don't come immediately to mind.

I think a better question is whether we have an environment that can cultivate the kind of work that leads us to solve the most challenging aspects of AI alignment.

I assert that if you have subproblems A and B which are tractable and actionable, and subproblem C which is a scary convoluted mess where you wouldn't even know where to start, that is not an indication that you should jump right into trying to solve subproblem C. Instead, I assert that this is an indication that you should take a good hard look at what you're actually trying to accomplish and figure out:

  1. Is there any particular reason to expect this problem as a whole to be solvable at all, even in principle?
  2. Is there a way to get most of the value of solving the entire problem without attacking the hardest part?
  3. Is it cheap to check whether the tractable-seeming approaches to A and B will actually work in practice?

If the answers are "yes", "no", "no", then I am inclined to agree that attacking C is the way to go. But also I think that combination of answers barely ever happens.

Concrete example time:

You're starting an app-based, two-sided childcare marketplace. You have identified the following problems, which are limiting your company:

A: In order to ensure that each daycare has enough children to justify staying on your platform, while also ensuring that enough availability remains that new customers can find childcare near them, you need to build feedback loops into your marketing tools for both suppliers and customers, such that marketing spend is distributed according to whichever side of the marketplace needs more people at any given time, on a granular per-location basis.

B: In order to build trust and minimize legal risk, you need to create policies and procedures to ensure that all childcare providers are appropriately licensed, background-checked, and insured, and to ensure that they remain eligible to provide childcare services (e.g. regularly requiring them to provide updated documents, scheduled and random inspections, etc.).

C: You aim to guarantee that what customers see in terms of availability is what they get, and likewise, that providers can immediately see when a slot is booked. People also need to be able to access your system at any time. You determine that what you need to do is make sure that the data you show your users on their own devices is always consistent with the data on your own servers and the state of the world, and always available (you shouldn't lose access to your calendar just because your wifi went out). You reduce this problem to the CAP problem.

In this scenario, I would say that "try to solve the CAP problem straight off the bat" is very much the wrong approach, and you should instead attack subproblems A and B.

As analogies to alignment go, I'm thinking approximately

A. How can we determine what a model has learned about the structure of the world by examining the process by which it converts observations about the current state into predictions about the next state (e.g. mechanistic interpretability)?

B. Come up with a coherent metric that measures how "in-distribution" a given input is for a given model, to detect cases where the model is operating outside of its training distribution.

C. Come up with a solution to the principal-agent problem.

In my ideal world, there are people working on all three subproblems, just on the off-chance that C is solvable. But in terms of things-that-actually-help-with-the-critical-path, I expect most of them to come from people who are working on A and B, or people who build off the work of the people working on A and B.

I am curious whether you are thinking of different concrete examples of A/B/C as they apply to alignment, though.

Is there any particular reason to expect this problem as a whole to be solvable at all, even in principle?

This is another very good heuristic, I think, and I agree it is good to apply first.

If the answers are "yes", "no", "no", then I am inclined to agree that attacking C is the way to go. But also I think that combination of answers barely ever happens.

I think in alignment, 2 is normally not the case, and if 2 is not the case, 3 will not really help you. That's why I think it is a reasonable assumption that you need to solve A, B, and C.

Your childcare example is weird because the goal is to make money, which is a continuous success metric. You can make money without solving really any of the problems you listed. I did not say it in the original article (and I should have), but this technique is for problems where the actual solution requires you to solve all the subproblems. It would have to be the case that making any money with childcare at all requires you to have solved all the problems; you can't solve one problem a bit and then make a bit more money. If C is proven to be impossible, or way harder than some other problem set for a different way to make money, then you should switch to a different way to make money.

In alignment, if you don't solve the problem, you die. You can't solve alignment 90% and then deploy an AI built with this 90% level of understanding, because then the AI will still be approximately 0% aligned and kill you. We could figure out everything about a neural network, even which objective function corresponds to human values, and still die because we missed whether it is deceptively misaligned.

Your alignment example is very strange. C is basically "solve alignment", whereas A and B taken together do not constitute an alignment solution at all. The idea is that you have a set of subproblems that, taken together, constitute a solution to alignment. Having "solve alignment" in this set breaks everything. The set should be such that when we trim all the unnecessary elements (elements that are not required because some subset of our set already constitutes a solution), we don't remove anything, because all elements are necessary. Otherwise, I could add anything to the set and still end up with a set that solves alignment. The set should be (locally) minimal in size. If we trim the set that contains "solve alignment", we just end up with the single element "solve alignment", and we have not simplified the problem by factoring it at all.

Or even better, make the set a tree instead, such that each node is a task, and split nodes that are large into (ideally independent) subtasks until you can see how to solve each subpart. I guess this is superior to the original formulation. Ideally, there is not even a hardest part in the end; it should be obvious what you need to do to solve every leaf node. The point where you look for the hardest part now is when deciding which node to split next (the splitting itself might take a significant amount of time).
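A minimal sketch of that task-tree idea (all task names and the split structure are hypothetical): keep a tree of tasks, mark leaves you already know how to solve, and treat the remaining unsolvable leaves as the candidates for the next split.

```python
class Task:
    """A node in a task tree: either a leaf we can (or can't yet) solve
    directly, or an internal node split into subtasks."""

    def __init__(self, name, solvable=False, subtasks=None):
        self.name = name
        self.solvable = solvable      # do we see how to solve this leaf directly?
        self.subtasks = subtasks or []

    def leaves(self):
        if not self.subtasks:
            return [self]
        return [leaf for t in self.subtasks for leaf in t.leaves()]

    def next_to_split(self):
        """Leaves we don't yet know how to solve: candidates for splitting."""
        return [leaf for leaf in self.leaves() if not leaf.solvable]

# Hypothetical agenda: A and B look solvable; C has been split once already.
agenda = Task("solve alignment", subtasks=[
    Task("A", solvable=True),
    Task("B", solvable=True),
    Task("C", subtasks=[Task("C1", solvable=True), Task("C2")]),
])

print([t.name for t in agenda.next_to_split()])  # ['C2']
```

The "look for the hardest part" step then becomes choosing which element of `next_to_split()` to decompose next.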

Thank you for telling me about the CAP problem, I did not know about it.

Your alignment example is very strange. C is basically "Solve Alignment" whereas A and B taken together do not constitute an alignment solution at all.

That is rather my point, yes. A solution to C would be most of the way to a solution to AI alignment, and would also solve management / resource allocation / corruption / many other problems. However, as things stand now, a rather significant fraction of the entire world economy is directed towards mitigating the harms caused by the inability of principals to fully trust agents to act in their best interests. As valuable as a solution to C would be, I see no particular reason to expect it to be possible, and I see an extremely high lower bound on the difficulty of the problem (most salient to me is the multi-trillion-dollar value that could be captured by someone who did robustly solve it).

I wish anyone who decides to tackle C the best of luck, but I expect that the median outcome of such work would be something of no value, and the 99th percentile outcome would be something like a clever incentive mechanism which, e.g., bounds to zero (in a perfect-information world) the extent to which an agent's actions can harm the principal while appearing to be helpful, and which degrades gracefully in a world of imperfect information.

In the meantime, I expect that attacking subproblems A and B will have a nonzero amount of value even in worlds where nobody finds a robust solution to subproblem C (both because better versions of A and B may be able to shore up an imperfect or partial solution to C, and also because while robust solutions to A/B/C may be one sufficient set of components to solve your overall problem, they may not be the only sufficient set of components, and by having solutions to A and B you may be able to find alternatives to C).

In alignment, if you don't solve the problem you die. You can't solve alignment 90% and then deploy an AI build with this 90% level of understanding because then the AI will still be approximately 0% aligned and kill you.

Whether this is true (in the narrow sense of "the AI kills you because it made an explicit calculation and determined that the optimal course of action was to perform behaviors that result in your death" rather than in the broad sense of "you die for reasons that would still have killed you even in the absence of the particular AI agent we're talking about") is as far as I know still an open question, and it seems to me one where preliminary signs are pointing against a misaligned singleton being the way we die.

The crux might be whether you expect a single recursively-self-improving agent to take control of the light cone, or whether you don't expect a future where any individual agent can unilaterally determine the contents of the light cone.

Thank you for telling me about the CAP problem, I did not know about it.

For reference, the CAP problem is one of those problems that sounds worse in theory than it is in practice: for most purposes, significant partitions are rare, and "when a partition happens, data may not be fully up-to-date, and writes may be dropped or result in conflicts" is usually a good-enough policy in those rare cases.
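To illustrate that "good enough in practice" point (the schema and merge policy here are entirely hypothetical): during a rare partition, two replicas of a booking calendar may diverge, and a crude last-writer-wins merge resolves the conflict once they reconnect, at the cost of occasionally dropping the older write.

```python
def lww_merge(replica_a, replica_b):
    """Merge two {slot: (timestamp, value)} replica maps,
    keeping the write with the newer timestamp for each slot."""
    merged = dict(replica_a)
    for slot, (ts, value) in replica_b.items():
        if slot not in merged or ts > merged[slot][0]:
            merged[slot] = (ts, value)
    return merged

# Two replicas that diverged during a partition:
a = {"mon-9am": (1, "booked by Alice")}
b = {"mon-9am": (2, "booked by Bob"), "tue-1pm": (1, "open")}

print(lww_merge(a, b))
# {'mon-9am': (2, 'booked by Bob'), 'tue-1pm': (1, 'open')}
```

Alice's write is silently lost here, which is exactly the kind of rare, bounded badness that many real systems accept in exchange for availability.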


I think that approach has merit as long as one can be sure that the problems are in fact separable and not interdependent.

That said, sometimes taking a break and working on the simpler tasks can spur progress on the more complex/difficult ones. But I do not think that is really at odds with your approach.

This makes sense to me, but keep in mind we're on this site and debating it because EY went "screw A, B, and C; my aesthetic sense says that the best way forward is to write some Harry Potter fanfiction."

My takeaway is that if anyone reading this is working hard but not currently focused on the hardest problem, don't necessarily fret about that. 

When Eliezer wrote HPMOR, it was not clear to him that things would go down in the 2020s. That's what he said in (I think) this interview. Had Eliezer's plan worked out to create a dozen new, better Eliezers through his writing (which was the plan), this would have been the best action.

Also, I agree that you should not apply the reasoning suggested in the OP blindly. I think it is a useful heuristic.