"Correct prediction isn't enough. You must follow that insight with effective action that actually makes a difference" 

- paraphrase of Holden Karnofsky from a recent 80,000 Hours podcast

see also: Timelines post 2

I've been working on a post about my timelines for a while now, but I keep getting bogged down in details, so I'm going to break this into pieces. Hopefully I'll get some critique which will guide me to the most important parts to explain.

My background in neuroscience gives me some specific hypotheses about which capabilities are present in the human brain but still missing in SotA language models. I'm going to avoid going into details about these, to avoid giving ideas or encouragement to capabilities researchers. Unfortunately, my caution may not matter since I'm aware of a variety of research groups, including some with funding primarily from the US military, working on neuroscience-inspired algorithmic improvements for machine learning models.

Most of my probability mass is on us having about 10 +/- 4 years until a dangerously capable model is developed. But I think there is a small but non-negligible chance that researchers come across an algorithmic improvement or clever model self-improvement scaffolding which allows for a surprisingly discontinuous sharp left turn in abilities.

Everyone I've talked to in detail about this has given me arguments for why this probably won't happen, but no arguments for why it can't happen. Given how dangerous this could be, arguments for improbability are not very reassuring.

We don't have the luxury of waiting until we have a perfect understanding and plan of action for alignment. We need to keep working on our understanding and alignment plans, but also to work on our evaluation and containment just in case.

I have some specific ideas for how we might work on evaluation and containment. On the evaluation side, Beth Barnes' project is making important progress. Søren Elverlin is working on a related but different evaluation approach, testing the ability of models to make effective long-range strategic decisions (using historical business data). In addition to providing a safety check, I think evals offer an important way for the alignment community to gather empirical evidence to help persuade safety-unconvinced capabilities researchers.


On the containment side, I think this post makes a lot of good points. I have some further ideas. In particular, I think there's a valuable approach in running the evaluation process many times over, starting with a heavily nerfed (impaired) model and backing off the impairment gradually after each run. This process would be stopped if either the alarm triggers for escape potential got hit, or the model's performance seemed to flatten or dip suspiciously (suggesting the model was sandbagging to hide its true capabilities). Model nerfing can be accomplished in a variety of ways, such as adding dropout and noise masks.
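To make the stopping logic concrete, here is a minimal sketch of the staged loop described above. It is an illustration under assumptions, not a worked-out protocol: `staged_evaluation`, `run_eval`, `escape_alarm_threshold`, and `min_expected_gain` are hypothetical names I'm introducing, and I'm assuming each evaluation run can be summarized as a single capability score plus a single escape-risk score. A real version would need to handle noisy scores rather than comparing consecutive runs directly.

```python
from typing import Callable, List, Tuple

# One record per run: (impairment level, capability score, escape-risk score).
Run = Tuple[float, float, float]


def staged_evaluation(
    run_eval: Callable[[float], Tuple[float, float]],
    impairment_levels: List[float],
    escape_alarm_threshold: float,
    min_expected_gain: float = 0.0,
) -> Tuple[str, List[Run]]:
    """Evaluate a model repeatedly while gradually reducing its impairment.

    `run_eval(level)` is a hypothetical callback that evaluates the model at
    the given impairment level and returns (capability_score, escape_risk).
    `impairment_levels` should be ordered from most to least impaired.
    """
    history: List[Run] = []
    prev_score = None
    for level in impairment_levels:
        score, escape_risk = run_eval(level)
        history.append((level, score, escape_risk))

        # Stop condition 1: the escape-potential alarm is triggered.
        if escape_risk >= escape_alarm_threshold:
            return "stopped: escape alarm", history

        # Stop condition 2: performance flattens or dips even though the
        # impairment was reduced, which may indicate sandbagging.
        if prev_score is not None and score - prev_score <= min_expected_gain:
            return "stopped: suspicious plateau or dip", history

        prev_score = score
    return "completed", history


def toy_eval(level: float) -> Tuple[float, float]:
    """Toy stand-in for a real evaluation: less impairment, higher scores."""
    capability = 1.0 - level
    return capability, 0.1 * capability


if __name__ == "__main__":
    status, runs = staged_evaluation(
        toy_eval, [0.9, 0.6, 0.3, 0.0], escape_alarm_threshold=0.5
    )
    print(status, runs)
```

The key design choice is that the loop never proceeds to a less impaired model once either condition fires, so the least impaired version is only ever run after every more impaired version has passed.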


In terms of the computer security aspects of containment for safe evaluation, Nova DasSarma has a lot of good thoughts. Perhaps the most important insight is that whoever is working on building containment should have some safety discussions with technical staff at cloud service providers like GCS, AWS, and Lambda.


I think we can, and should, build ourselves a safety net which could give us a warning before a dangerous deployment. I think the most important focus of this safety net should be checking for unexpectedly large jumps in ability, particularly in the realms of strategic planning and self-improvement.
