Have an Unreasonably Specific Story About The Future

Jay Bailey

One of the problems with AI safety is that our goals are often quite distant from our day-to-day work. I want to reduce the chance of AI killing us all, but what I'm doing today is filling out a security form for the Australian government, reviewing some evaluation submissions, and editing this post. How does the latter get us to the former?^[1]

Thus, the topic of today’s article: Have an unreasonably specific story about the future. That is - you should be able to come up with at least one concrete scenario for how your work leads to the high-level good outcomes you want. By the conjunction fallacy, the story isn’t likely to work out exactly the way you envision. Hence the phrase unreasonably specific. You don’t want to predict the future here, which relies on a lot of hedging and broad trends. You want to make things as concrete as possible.^[2] This reduces forecasting accuracy, but that isn’t the point. The point is, if you cannot imagine a single concrete story for how your plan helps achieve what you want, that’s a very bad sign.

I find it is best to do this with backchaining. I’ll provide a couple of examples of this, but let’s start with my destination. My high-level good outcome is to decrease the odds of humanity wiping ourselves out with ASI. There are two critical paths I can see to do this - increase the chances of building ASI safely, or decrease the chance of building it unsafely. Ambitious alignment research like agent foundations or some varieties of mechanistic interpretability aims at the former. Governance work and defensive strategies like AI control aim at the latter. Anything that is unlikely to help with at least one of those critical paths is a strategy I reject as insufficient in my preference ordering.

You may have your own critical paths, but this critical path concept is important. I find it crucial to have a filter that can sort lots of plans early so you can focus on the best ones.

So, let’s look at AI control as a first example, since my story here is relatively short. Let's say I want to do some research into cheaper monitoring techniques. How does that lead to a better future?

AI control doesn’t build safer AI directly - it's not about training or building systems. How might it help avoid unsafe ASI?
If we can get useful work out of sub-ASI systems, AI control increases the level of capability we can harness before it becomes dangerous.
Monitoring techniques are meaningfully useful towards this goal.
In order for this to work, frontier AI companies need to adopt these protocols.
Therefore I either need to work at a frontier AI company that is interested in adopting them, or publicise my research in a way that gets the companies to pay attention.
The labs are more likely to adopt techniques if they are cheaper, and thus this work increases the chance of labs doing good x-risk reducing things.

There are lots of questions we could ask about this story. Can we get useful work out of sub-ASI systems? Will AI companies invest in control measures? Will AI companies pay attention to research from outside their own internal control teams? All of these are good questions, and one of the biggest benefits of having this story is to have a story that is concrete enough to be criticised and improved.

The next step is to go over and actually ask, at each step, if the step is realistic. For instance, here is another story about why I might work on a frontier lab’s evaluation team on biorisk (specific threat model chosen basically at random from a shortlist in my head):

Evaluations aim to provide information about how safe systems are; they don’t directly make systems safer. Therefore, this is a “Don’t build unsafe ASI” strategy.
To not build unsafe ASI, a critical decision will need to be made around regulation of AI or deployment/training of a potentially dangerous system flagged by evaluations.
A frontier AI company (we’ll use OpenBrain as the example) is most likely to listen to their own evals team over external teams, so I should work for a company like OpenBrain.
Biorisk is sufficiently concerning that OpenBrain may make a decision not to deploy a model or slow down entirely based on results in this area.
The particular evaluation I am working on is a meaningful input into this threat model.
Therefore it makes sense for me to be on this team, working on this evaluation - it has a chance of finding a danger that, if it exists, would give OpenBrain pause and allow them to mitigate this risk.

The benefit is that this is a very granular story and it helps you keep your eye on the ball. You can ask yourself the same series of questions for every evaluation, and you don’t need to continually recompute the earlier steps every day, either. For each new evaluation you can just ask yourself the last 2 steps, and reevaluate the earlier steps less frequently.

So, what assumptions exist in this story? I think the biggest one here is that OpenBrain has a reasonable probability of slowing down or adding mitigations that prevent unsafe ASI being built as a result of their evaluations team’s results. If you agree with this, it makes sense to work there. If you don’t, I don’t see anywhere near as much value in it. Writing the story out is an exercise that lets you ask and answer these questions.

The last thing here is that this exercise helps you determine if something is helpful to do, but not if it’s the most helpful thing to do. For instance, here is a story I consider pretty reasonable:

In order to avoid building ASI, we need international coordination.
International coordination requires buy-in from politicians in order to be achieved.
Politicians listen to the public. The more the public wants something, the more likely it is to be achieved. (Clearly politicians don’t listen solely to the public, but it is at least one relevant factor in their decision making.)
The more someone hears about something, the more likely they are to consider the arguments and potentially come to support that thing.
People I know are members of the public.
Therefore, my action should be to talk to friends and family about extinction risk, and try to convince them.

This is an entirely sound story in my book. I think talking to friends and family about extinction risk can very much be a positive action, and it’s grounded in a proper chain of events. But it does lead to another question - is this the best thing you can do? Could you take actions like consulting your local representative or widening your reach online to do better with your advocacy? That’s an entirely different question, and I’ll write about one model I use for it in my next post.

In the meantime, I encourage you to take five minutes, and ask yourself if you currently have a clear path to how your work might improve the future. If the answer is yes - great! If the answer is no, it may take more than five minutes to come up with one - but I think this work is well worth doing, and I'd love it if doing this and asking about this was a standard practice.

^{^}
The answer sometimes is "It doesn't". The point of this exercise is not to reach for a plausible story, but to decide if your work is actually on that path, so you can decide whether or not to pivot if the answer is no.
^{^}
I remember a quote once that talked about a journalist from a major publication knowing exactly which Biden official they were trying to reach with a given policy article. You don't have to be this specific, but if you can be, that's an amazing input into this exercise!

[-]Mikhail Samin1mo20

Strongly upvoted. Was writing a post about this exact thing (people should backchain more), went around lw, found a backlink to here from the backchaining post. Thanks for writing this!

[-]Jay Bailey2mo20

But it does lead to another question - is this the best thing you can do? Could you take actions like consulting your local representative or widening your reach online to do better with your advocacy? That’s an entirely different question, and I’ll write about one model I use for it in my next post.

Here are the cliff notes of what I expect that post to look like, so people do still have a pointer to this idea when reading this post now.

My model is a "bullseye" theory of impact, like an archery ring, where you want to be close to the center. Each layer out is perhaps 5-10x less impactful than the next layer. For instance, say you wanted the US and China to sign a treaty on banning ASI development. The bullseye layer here would be literally being Donald Trump or Xi Jinping. The next layer out is being one of their closest advisors, or someone nominally in charge of this area. That's still probably close to an order of magnitude less impact. Being a policy writer who is read by those advisors is another layer out from that, and so on.

To maximise your impact, it thus makes sense to try and move closer to the bullseye (through taking on closer positions or focusing your influence/advocacy inwards) or to take lots of shots on goal. BlueDot Impact are doing this strategy - they have a theory of impact around stuff like "Teach civil servants about AI risk". This is pretty low impact individually, but they plan to put thousands of people every year through these courses, so they can make up for this with volume.

29

Have an Unreasonably Specific Story About The Future

29

29

29