Technical staff at Anthropic, previously #3ainstitute; interdisciplinary, interested in everything; ongoing PhD in CS (learning / testing / verification), open sourcerer, more at zhd.dev
We've deliberately set conservative thresholds, such that I don't expect the first models that pass the ASL-3 evals to pose serious risks without improved fine-tuning or agent scaffolding, and we've committed to re-evaluate every three months to check that this remains true. From the policy:
Ensuring that we never train a model that passes an ASL evaluation threshold is a difficult task. Models are trained in discrete sizes, they require effort to evaluate mid-training, and serious, meaningful evaluations may be very time consuming, since they will likely require fine-tuning.
This means there is a risk of overshooting an ASL threshold when we intended to stop short of it. We mitigate this risk by creating a buffer: we have intentionally designed our ASL evaluations to trigger at slightly lower capability levels than those we are concerned about, while ensuring we evaluate at defined, regular intervals (specifically every 4x increase in effective compute, as defined below) in order to limit the amount of overshoot that is possible. We have aimed to set the size of our safety buffer to 6x (larger than our 4x evaluation interval) so model training can continue safely while evaluations take place. Correct execution of this scheme will result in us training models that just barely pass the test for ASL-N, are still slightly below our actual threshold of concern (due to our buffer), and then pausing training and deployment of that model unless the corresponding safety measures are ready.
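To make the buffer arithmetic concrete, here's a toy sketch of my own reading of those numbers (an illustration only, not part of the policy): with evaluations designed to trigger at roughly 6x less effective compute than the true level of concern, and re-evaluation every 4x increase, the worst-case overshoot past the trigger point stays inside the buffer.

```python
# Toy illustration (my own reading, not from the policy): why a 6x safety
# buffer combined with evaluations every 4x of effective compute limits
# overshoot. All quantities are hypothetical and in arbitrary units.

C_concern = 6.0              # effective compute at the true level of concern
C_trigger = C_concern / 6    # eval is designed to trigger ~6x below that
eval_interval = 4.0          # re-evaluate every 4x increase in effective compute

# Worst case: the last evaluation came back clean just below the trigger
# point, so training could continue for up to one full interval before the
# next scheduled evaluation catches it.
worst_case = C_trigger * eval_interval

print(worst_case < C_concern)   # True: at most ~4x overshoot vs. a 6x buffer
print(C_concern / worst_case)   # ~1.5x of residual margin remains
```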
I also think that many risks which could emerge in apparently-ASL-2 models will be reasonably mitigable by some mixture of re-finetuning, classifiers to reject harmful requests and/or responses, and other techniques. I've personally spent more time thinking about the autonomous-replication evals than the biorisk evals, though, and this might vary by domain.
We try really hard to avoid this situation for many reasons, among them that de-deployment would suck. I think it's unlikely, but:
(2d) If it becomes apparent that the capabilities of a deployed model have been under-elicited and the model can, in fact, pass the evaluations, then we will halt further deployment to new customers and assess existing deployment cases for any serious risks which would constitute a safety emergency. Given the safety buffer, de-deployment should not be necessary in the majority of deployment cases. If we identify a safety emergency, we will work rapidly to implement the minimum additional safeguards needed to allow responsible continued service to existing customers.
One year is actually the typical term length for board-style positions, but because members can be re-elected, their tenure is often much longer. In this specific case, of course, it's now up to the trustees!
You may enjoy reading The Future Eaters (Flannery 1994), an ecological history of Australia and the region covering the period before first human settlement, the arrival of Indigenous peoples, and later European colonization.
I think this post would be substantially improved by adding a one-to-four-sentence abstract at the beginning.
Substantively, I think your proposal is already part of the standard Drake Equation and dicing the sequence of steps a little differently doesn't affect the result. There's also some recent research which many people on LessWrong think explains away the paradox: Dissolving the Fermi Paradox makes the case that we're simply very rare, and Grabby Aliens says that we're instead early. Between them, we have fairly precise quantitative bounds, and I'd suggest familiarizing yourself with the papers and follow-up research.
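For reference, the standard Drake equation is just a product of factors:

$$N = R_* \cdot f_p \cdot n_e \cdot f_\ell \cdot f_i \cdot f_c \cdot L$$

Because it's a pure product, splitting any factor into sub-factors whose product equals the original (e.g. replacing $f_\ell$ with $f_{\ell,1} \cdot f_{\ell,2}$) leaves $N$ unchanged, which is why slicing the sequence of steps differently doesn't change the estimate.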
See e.g. Opportunities and Risks of LLMs for Scalable Deliberation with Polis, a recent collaboration between Anthropic and the Computational Democracy Project:
Polis is a platform that leverages machine intelligence to scale up deliberative processes. In this paper, we explore the opportunities and risks associated with applying Large Language Models (LLMs) towards challenges with facilitating, moderating and summarizing the results of Polis engagements. In particular, we demonstrate with pilot experiments using Anthropic's Claude that LLMs can indeed augment human intelligence to help more efficiently run Polis conversations. In particular, we find that summarization capabilities enable categorically new methods with immense promise to empower the public in collective meaning-making exercises. And notably, LLM context limitations have a significant impact on insight and quality of these results.
However, these opportunities come with risks. We discuss some of these risks, as well as principles and techniques for characterizing and mitigating them, and the implications for other deliberative or political systems that may employ LLMs. Finally, we conclude with several open future research directions for augmenting tools like Polis with LLMs.
I'm personally really excited by the potential for collective decision-making (and legitimizing, etc.) processes which are much richer than voting on candidates or proposals, but still scale up to very large groups of people. Starting as a non-binding advisory / elicitation process could facilitate adoption, too!
That said, it's very early days for such ideas, and there are enormous gaps between the first signs of life and a system to which citizens could reasonably entrust nations. Cybersecurity alone is an enormous challenge for any computerized form of democracy, and LLMs add further risks with failure modes nobody really understands yet...
(opinions my own, etc)
Thanks - https://blog.opensource.org/metas-llama-2-license-is-not-open-source/ is less detailed but as close to an authoritative source as you can get, if that helps.
And yes, this opinion is my own. More relevant than my employer is my open-source experience: e.g. I'm a Fellow of the Python Software Foundation, "nominated for their extraordinary efforts and impact upon Python, the community, and the broader Python ecosystem".
It sure does sound like that, but later he testifies that:
If you read his prepared opening statement carefully, he never actually claims that Llama is open source; he just speaks at length about the virtues of openness and open source. Easy to end up confused, though!