Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

There are people who are not motivated to solve AI alignment but who do work related to it. E.g. people work on adversarial robustness, on understanding how to do science mechanically, or on advancing other paradigms that are more interpretable than modern ML. These people might be interested in the science, or work on it for some other personal reason. Even when they work on something that is useful for AI alignment, they will probably do a worse job than they would if they were trying to advance AI alignment.

This basic idea was mentioned by Buck in a talk.

The following is a list of reasons why somebody who tries to solve alignment directly would be better at solving alignment (though stated like this, it sounds obvious):

  • They are more likely to switch directions once they realize that they could make more progress on AI alignment by doing something else with their time.
    • E.g. somebody who is interested in type theory and then learns that they can help AI alignment might be excited to help, but when there is lots of evidence that they should do something that does not involve type theory, they will keep sticking to doing things with type theory until the end, because their interest in type theory outweighs their desire to advance AI alignment.
  • The path that they take to solve the problem might look very different.
    • They are less likely to take unpromising but interesting sidetracks.
    • Simplifications that they make to the problem decrease the value of a solution less, in expectation.
    • In general, if there are multiple ways to solve the problem, the solution we end up with will likely be more relevant for alignment.
  • They can employ the full power of their consequentialist reasoning and be agentic about what to do, without starting to goodhart.
    • E.g. if you just let them do whatever they want, they are likely to discover useful things that you did not think of before. If somebody's main objective is not to solve AI alignment, they will likely follow whatever looks best according to their real motivation, as long as they can find some plausible explanation for why it is useful for AI alignment, so that they have an excuse (in the case where they are paid to work on this to advance AI alignment).

There are probably many more points I have not thought of. How much you want to solve alignment compared to other things is a spectrum. What you care about might naturally drift. When you work on something for a long time, you get attached to your work. That's something to keep in mind.

It's interesting to think about the difference between trying to solve alignment and just doing related work. Doing so can help us notice when we fall into this trap ourselves. Also, getting clear on this might help us do the good things (e.g. the things in the list above) even more.

2 comments

Counterpoint:

Sometimes it's easier to reach a destination when you're not aiming for it. You can't reach the sun by pointing a rocket at it and generating thrust. It's hard to climb a mountain by going up the steepest path. You can't solve a Millennium Prize math problem by staring at it until a solution reveals itself.

Sometimes you need to slingshot around a planet a few times to adjust orbital momentum. Sometimes you'll summit faster by winding around or zigzagging. Sometimes you have to play around with unrelated math problems before you notice something insightful.

And perhaps, working on AI problems that are only tangentially related to alignment could reveal a path of least resistance toward a solution that wouldn't have even been considered otherwise.

You list several cases in which working on the problem directly is not the best thing to do. Somebody who is competent and tries to solve the problem would consider these possibilities and make use of them.

I agree that sometimes a promising path is discovered by accident. You could not have planned for discovering it. Or could you? Even if you can't predict what path will reveal itself, you can be aware that there are paths that will reveal themselves in circumstances that you could not predict. You can still plan to do some specific type of "random stuff" that you expect might lead to something that you did not think of before.

There are circumstances where you discover something by accident without ever having thought of the possibility (or considered that it might be worth investigating). Even in these circumstances, I still expect that somebody who is trying to solve AI alignment will make better use of the opportunity to make progress on AI alignment.

To me, it seems that the more competent you are, the better trying to do X is at achieving X. Trying to do X does include strategies like "don't try hard to solve X", insofar as that seems useful. The worse you are at optimizing, the better it is to do random stuff, as you might stumble upon solutions your cognitive optimization process could not find.