I'm researching and advocating for policy to prevent catastrophic risk from AI at https://www.aipolicy.us/.
I'm broadly interested in AI strategy and want to figure out the most effective interventions to get to good AI outcomes.
We haven't asked specific individuals if they're comfortable being named publicly yet, but if advisors are comfortable being named, I'll announce that soon. We're also in the process of having conversations with academics, AI ethics folks, AI developers at small companies, and other civil society groups to discuss policy ideas with them.
So far, I'm confident that our proposals will not impede the vast majority of AI developers, but if we end up receiving feedback that this isn't true, we'll either rethink our proposals or remove this claim from our advocacy efforts. Also, as stated in a comment below:
I’ve changed the wording to “Only a few technical labs (OpenAI, DeepMind, Meta, etc.) and people working with their models would be regulated currently.” The point of this sentence is to emphasize that this definition still wouldn’t apply to the vast majority of AI development -- most AI development uses small systems, e.g. image classifiers, self-driving cars, audio models, weather forecasting, the majority of AI used in health care, etc.
(ETA: these are my personal opinions)
I spoke with a lot of other AI governance folks before launching, in part due to worries about the unilateralist's curse. I think that there is a chance this project ends up being damaging, either by being discordant with other actors in the space, committing political blunders, increasing the polarization of AI, etc. We're trying our best to mitigate these risks (and others) and are corresponding with some experienced DC folks who are giving us advice, as well as being generally risk-averse in how we act. That being said, some senior folks I've talked to are bearish on the project for reasons including the above.
DM me if you'd be interested in more details; I can share more offline.
Your current threshold does include all Llama models (other than llama-1 6.7/13 B sizes), since they were trained with > 1 trillion tokens.
Yes, this reasoning was for capabilities benchmarks specifically. Data goes further with future algorithmic progress, so I thought a narrower criterion for that one was reasonable.
I also think 70% on MMLU is extremely low, since that's about the level of ChatGPT 3.5, and that system is very far from posing a risk of catastrophe.
This is the threshold for what the government has the ability to say no to, and it is deliberately set well before catastrophe.
I also think that one route towards AGI in the event that we try to create a global shutdown of AI progress is by building up capabilities on top of whatever the best open source model is, and so I'm hesitant to give up the government's ability to prevent the capabilities of the best open source model from going up.
The cutoffs also don't differentiate between sparse and dense models, so there's a fair bit of non-SOTA-pushing academic / corporate work that would fall under these cutoffs.
Thanks for pointing this out. I'll think about whether there's a way to exclude sparse models, though I'm not sure if it's worth the added complexity and potential for loopholes. I'm not sure how many models fall into this category -- do you have a sense? This aggregation of models has around 40 models above the 70B threshold.
It's worth noting that this (and the other thresholds) are in place because we need a concrete legal definition for frontier AI, not because they exactly pin down which AI models are capable of catastrophe. It's probable that none of the current models are capable of catastrophe. We want a sufficiently inclusive definition such that the licensing authority has the legal power over any model that could be catastrophically risky.
That being said -- Llama 2 is currently the best open-source model and it gets 68.9% on the MMLU. It seems relatively unimportant to regulate models below Llama 2 because anyone who wanted to use that model could just use Llama 2 instead. Conversely, models that are above Llama 2 capabilities are at the point where it seems plausible that they could be bootstrapped into something dangerous. Thus, our threshold was set just above the limit.
Of course, by the time this regulation would pass, newer open-source models are likely to come out, so we could potentially set the bar higher.
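To make the threshold logic concrete, here is a rough sketch of how a definition like this could be checked. It is an illustration only, not the proposal's legal text. The three numbers (70B parameters, 1 trillion training tokens, 70% on MMLU) are the ones mentioned in this thread; the `ModelSpec` and function names, the choice of strict versus inclusive comparisons, and the assumption that crossing any single threshold is enough are all hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional

# Thresholds as discussed in this thread. Whether each comparison is
# strict (>) or inclusive (>=) is an assumption made for illustration.
PARAM_THRESHOLD = 70e9      # 70B parameters
TOKEN_THRESHOLD = 1e12      # 1 trillion training tokens
MMLU_THRESHOLD = 70.0       # 70% on MMLU


@dataclass
class ModelSpec:
    """Hypothetical description of a model; not part of any real API."""
    name: str
    parameters: float
    training_tokens: float
    mmlu_percent: Optional[float] = None  # None if never evaluated on MMLU


def triggered_thresholds(model: ModelSpec) -> List[str]:
    """Return the names of the thresholds this model crosses."""
    hits = []
    if model.parameters > PARAM_THRESHOLD:
        hits.append("parameters")
    if model.training_tokens > TOKEN_THRESHOLD:
        hits.append("training_tokens")
    if model.mmlu_percent is not None and model.mmlu_percent >= MMLU_THRESHOLD:
        hits.append("mmlu")
    return hits


def is_frontier(model: ModelSpec) -> bool:
    """Assumption: crossing any one threshold brings a model in scope."""
    return bool(triggered_thresholds(model))
```

Under these assumptions, a small image classifier crosses none of the thresholds, while a mid-sized language model trained on more than a trillion tokens is in scope through the data threshold alone, even if its MMLU score is below 70%.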
Yeah, this is fair, and later in the section they say:
Careful scaling. If the developer is not confident it can train a safe model at the scale it initially had planned, they could instead train a smaller or otherwise weaker model.
Which is good, supports your interpretation, and gets close to the thing I want, albeit less explicitly than I would have liked. I still think the "delay/pause" wording pretty strongly implies that the default is to wait for a short amount of time, and then keep going at the intended capability level. I think there's some sort of implicit picture that the eval result will become unconcerning in a matter of weeks-months, which I just don't see the mechanism for short of actually good alignment progress.
The first line of defence is to avoid training models that have sufficient dangerous capabilities and misalignment to pose extreme risk. Sufficiently concerning evaluation results should warrant delaying a scheduled training run or pausing an existing one
It's very disappointing to me that this sentence doesn't say "cancel". As far as I understand, most people on this paper agree that we do not have alignment techniques to align superintelligence. Therefore, if the model evaluations predict an AI that is sufficiently smarter than humans, the training run should be cancelled.
1. Deliberately create a (very obvious) inner optimizer, whose inner loss function includes no mention of human values / objectives.
2. Grant that inner optimizer ~billions of times greater optimization power than the outer optimizer.
3. Let the inner optimizer run freely without any supervision, limits or interventions from the outer optimizer.
I think that the conditions for an SLT to arrive are weaker than you describe.
For (1), it's unclear to me why you think you need to have this multi-level inner structure. If instead of reward circuitry inducing human values, evolution directly selected over policies, I'd expect similar inner alignment failures. It's also not necessary that the inner values of the agent make no mention of human values / objectives; the agent needs to both a) value them enough to not take over, and b) maintain these values post-reflection.
For (2), it seems like you are conflating 'amount of real world time' with 'amount of consequences-optimization'. SGD is just a much less efficient optimizer than intelligent cognition -- in-context learning happens much faster than SGD learning. When the inner optimizer starts learning and accumulating knowledge, it seems totally plausible to me that this will happen on much faster timescales than the outer selection.
For (3), I don't think that the SLT requires the inner optimizer to run freely, it only requires one of:
a. the inner optimizer running much faster than the outer optimizer, such that the updates don't occur in time.
b. the inner optimizer does gradient hacking / exploration hacking, such that the outer loss's updates are ineffective.
Evolution, of course, does have this structure, with two levels of selection; it just doesn't seem like this is a relevant property for thinking about the SLT.
Sometimes, but the norm is to do 70%. This is mostly done on a case by case basis, but salient factors to me include: